C2W2.Assignment.Parts-of-Speech Tagging (POS).Part2

Theory lecture: C2W2.Part-of-Speech (POS) Tagging and Hidden Markov Models

2 Hidden Markov Models

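This section implements a Hidden Markov Model (HMM), to be used with the Viterbi algorithm:

The HMM is one of the most commonly used algorithms in natural language processing and is a foundation for many deep learning techniques in the field.
Besides POS tagging, HMMs are also used in speech recognition, speech synthesis, and similar tasks.
Completing this part of the assignment gives 95% accuracy on the same dataset used in Part 1.

A Markov model contains a number of states and the probabilities of transitioning between those states.

In this case, the states are the POS tags.
The state transitions of a Markov model can be described by a transition matrix A.
A Hidden Markov Model adds an observation (emission) matrix B, which describes the probability of a visible observation given that we are in a particular state. Here the emitted observations are the words in the corpus.
The hidden state is the POS tag of that word.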
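To make the roles of A and B concrete, here is a minimal sketch of a two-tag HMM. All numbers, the names `A_demo`, `B_demo`, `initial`, and the helper `joint_probability` are made up for illustration and are not part of the assignment. The assignment handles the start of a sentence with a start tag rather than the separate initial distribution assumed here.

```python
import numpy as np

# Hypothetical toy model: two hidden states (tags) and two observations (words)
demo_tags = ["NN", "VB"]
demo_words = ["dogs", "run"]

# A_demo[i, j]: probability of moving from tag i to tag j (each row sums to 1)
A_demo = np.array([[0.3, 0.7],
                   [0.6, 0.4]])

# B_demo[i, k]: probability that tag i emits word k (each row sums to 1)
B_demo = np.array([[0.8, 0.2],
                   [0.1, 0.9]])

# Assumed probability of each tag at the start of a sentence
initial = np.array([0.5, 0.5])

def joint_probability(tag_ids, word_ids):
    """Probability of a tag sequence together with the words it emits."""
    prob = initial[tag_ids[0]] * B_demo[tag_ids[0], word_ids[0]]
    for prev, cur, word in zip(tag_ids, tag_ids[1:], word_ids[1:]):
        prob *= A_demo[prev, cur] * B_demo[cur, word]
    return prob

# "dogs run" tagged as NN VB versus VB NN
p_nn_vb = joint_probability([0, 1], [0, 1])
p_vb_nn = joint_probability([1, 0], [0, 1])
print(p_nn_vb, p_vb_nn)
```

In this toy model the tagging NN VB is far more probable than VB NN for the same two words, which is the kind of comparison a decoder such as Viterbi performs over all possible tag sequences.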

2.1 Generating Matrices

Creating the `A` transition probabilities matrix

Part 1 already computed emission_counts, transition_counts, and tag_counts; A and B can be built directly from these three dictionaries.

A	…	RBS	RP	SYM	TO	UH	…
…	…	…	…	…	…	…	…
RBS	…	2.217069e-06	2.217069e-06	2.217069e-06	0.008870	2.217069e-06	…
RP	…	3.756509e-07	7.516775e-04	3.756509e-07	0.051089	3.756509e-07	…
SYM	…	1.722772e-05	1.722772e-05	1.722772e-05	0.000017	1.722772e-05	…
TO	…	4.477336e-05	4.472863e-08	4.472863e-08	0.000090	4.477336e-05	…
UH	…	1.030439e-05	1.030439e-05	1.030439e-05	0.061837	3.092348e-02	…
…	…	…	…	…	…	…	…

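Note: the sample values above are after smoothing. Each cell gives the probability of moving from one POS tag to another.

For example, the probability of going from tag TO to tag RP is 4.47e-8.
Each row must sum to 1, because the model assumes the next POS tag must be one of the columns in the table.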

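Smoothing is applied as follows:
$$P(t_i | t_{i-1}) = \frac{C(t_{i-1}, t_{i}) + \alpha }{C(t_{i-1}) +\alpha * N}\tag{3}$$

$N$ is the total number of tags.
$C(t_{i-1}, t_{i})$ is the count of the tuple (previous POS, current POS) in the transition_counts dictionary.
$C(t_{i-1})$ is the count of the previous POS in the tag_counts dictionary.
$\alpha$ is a smoothing parameter.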
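As a quick numeric check of Equation (3), the sketch below evaluates it on made-up counts. The helper `smoothed_transition_prob` and all numbers are hypothetical and not taken from the assignment data.

```python
def smoothed_transition_prob(count_pair, count_prev, alpha, num_tags):
    """Equation (3): add-alpha smoothed P(tag | prev_tag)."""
    return (count_pair + alpha) / (count_prev + alpha * num_tags)

# Hypothetical toy counts for one previous tag and three possible next tags
alpha = 0.001
num_tags = 3
count_prev = 10            # C(t_{i-1})
pair_counts = [7, 3, 0]    # C(t_{i-1}, t_i) for each next tag; they add up to count_prev

row = [smoothed_transition_prob(c, count_prev, alpha, num_tags) for c in pair_counts]
print(row)
print(sum(row))
```

The transition that never occurred (count 0) still receives a small non-zero probability. The row sums to 1 here because the pair counts add up to the count of the previous tag; if they did not, the row sum would be off by a correspondingly small amount.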

Exercise 03

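Implement the function create_transition_matrix according to Equation (3).
The function builds the transition matrix A, which in a Hidden Markov Model (HMM) holds the probabilities of moving between states (here, POS tags). The code works as follows:

Input parameters:

alpha: smoothing parameter; it prevents zero probabilities for transitions that were never seen in training.
tag_counts: a dictionary mapping each POS tag to its number of occurrences.
transition_counts: a dictionary whose keys are (previous POS tag, current POS tag) tuples and whose values are the number of times each transition occurs.

Output:

A: a transition matrix of dimension (num_tags, num_tags).

Function logic:

Get all unique POS tags, sort them, and store them in the list all_tags.
Count the number of distinct POS tags and store it in num_tags.
Initialize the transition matrix A with shape (num_tags, num_tags) and all entries set to zero.
Get all unique transition tuples, i.e. the keys of the transition_counts dictionary, and store them in the set trans_keys.
For each row of A (the POS tag of the previous state):
- For each column of A (the POS tag of the current state):
  - Initialize the count of the current transition, count, to zero.
  - Define the transition tuple key, made of the previous POS tag and the current POS tag.
  - Check whether key exists in the transition_counts dictionary:
    - If it does, read the corresponding count from the dictionary.
  - Get the count of the previous POS tag, count_prev_tag, from the tag_counts dictionary.
  - Apply smoothing with Equation (3) to compute the transition probability and assign it to the corresponding entry of A.
Return the filled transition matrix A.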

import numpy as np

# UNQ_C3 GRADED FUNCTION: create_transition_matrix
def create_transition_matrix(alpha, tag_counts, transition_counts):
    ''' 
    Input: 
        alpha: number used for smoothing
        tag_counts: a dictionary mapping each tag to its respective count
        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
    Output:
        A: matrix of dimension (num_tags,num_tags)
    '''
    # Get a sorted list of unique POS tags
    all_tags = sorted(tag_counts.keys())
    
    # Count the number of unique POS tags
    num_tags = len(all_tags)
    
    # Initialize the transition matrix 'A'
    A = np.zeros((num_tags,num_tags))
    
    # Get the unique transition tuples (previous POS, current POS)
    trans_keys = set(transition_counts.keys())
    
    ### START CODE HERE ### 
    
    # Go through each row of the transition matrix A
    for i in range(num_tags):
        
        # Go through each column of the transition matrix A
        for j in range(num_tags):

            # Initialize the count of the (prev POS, current POS) to zero
            count = 0
        
            # Define the tuple (prev POS, current POS)
            # Get the tag at position i and tag at position j (from the all_tags list)
            key = (all_tags[i],all_tags[j]) # tuple of form (tag,tag)

            # Check if the (prev POS, current POS) tuple 
            # exists in the transition counts dictionary
            if key in trans_keys:
                
                # Get count from the transition_counts dictionary 
                # for the (prev POS, current POS) tuple
                count = transition_counts[key]               

            # Get the count of the previous tag (index position i) from tag_counts
            count_prev_tag = tag_counts[all_tags[i]]
            
            # Apply smoothing using count of the tuple, alpha, 
            # count of previous tag, alpha, and total number of tags
            A[i,j] = (count + alpha) / (count_prev_tag + alpha*num_tags)

    ### END CODE HERE ###
    return A

Test:

alpha = 0.001
A = create_transition_matrix(alpha, tag_counts, transition_counts)
# Testing your function
print(f"A at row 0, col 0: {A[0,0]:.9f}")
print(f"A at row 3, col 1: {A[3,1]:.4f}")

print("View a subset of transition matrix A")
A_sub = pd.DataFrame(A[30:35,30:35], index=states[30:35], columns = states[30:35] )
print(A_sub)

Output:

A at row 0, col 0: 0.000007040
A at row 3, col 1: 0.1691
View a subset of transition matrix A
              RBS            RP           SYM        TO            UH
RBS  2.217069e-06  2.217069e-06  2.217069e-06  0.008870  2.217069e-06
RP   3.756509e-07  7.516775e-04  3.756509e-07  0.051089  3.756509e-07
SYM  1.722772e-05  1.722772e-05  1.722772e-05  0.000017  1.722772e-05
TO   4.477336e-05  4.472863e-08  4.472863e-08  0.000090  4.477336e-05
UH   1.030439e-05  1.030439e-05  1.030439e-05  0.061837  3.092348e-02
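The nested loop can also be written with NumPy array operations. The following is only a sketch under stated assumptions: the function name `create_transition_matrix_vectorized` and the toy counts are hypothetical, and pairs whose tags are missing from tag_counts are skipped, which matches what the loop version effectively does.

```python
import numpy as np

def create_transition_matrix_vectorized(alpha, tag_counts, transition_counts):
    """Hypothetical vectorized variant of create_transition_matrix."""
    all_tags = sorted(tag_counts.keys())
    tag_index = {tag: i for i, tag in enumerate(all_tags)}
    num_tags = len(all_tags)

    # Raw counts C(prev_tag, tag); pairs that never occur stay at zero
    counts = np.zeros((num_tags, num_tags))
    for (prev_tag, tag), count in transition_counts.items():
        if prev_tag in tag_index and tag in tag_index:
            counts[tag_index[prev_tag], tag_index[tag]] = count

    # C(prev_tag) as a column vector so it broadcasts across each row
    prev_counts = np.array([tag_counts[t] for t in all_tags], dtype=float).reshape(-1, 1)

    # Equation (3) applied to the whole matrix at once
    return (counts + alpha) / (prev_counts + alpha * num_tags)

# Toy counts, made up for illustration
small_tag_counts = {"DT": 4, "NN": 5, "VB": 3}
small_transition_counts = {("DT", "NN"): 4, ("NN", "NN"): 2,
                           ("NN", "VB"): 3, ("VB", "DT"): 3}
A_small = create_transition_matrix_vectorized(0.001, small_tag_counts, small_transition_counts)
print(A_small.sum(axis=1))
```

Printing the row sums is a cheap sanity check: with these toy counts each row adds up to 1, in line with the requirement stated for A.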

Create the `B` emission probabilities matrix

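The B matrix is given by:
$$P(w_i | t_i) = \frac{C(t_i, word_i)+ \alpha}{C(t_{i}) +\alpha * N}\tag{4}$$

$C(t_i, word_i)$ is the number of times $word_i$ was observed with $tag_i$ in the training data (stored in the emission_counts dictionary).
$C(t_i)$ is the total number of times $tag_i$ appears in the training data (stored in the tag_counts dictionary).
$N$ is the number of words in the vocabulary.
$\alpha$ is a smoothing parameter.

B has dimension (num_tags, N). A sample is shown below: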

B	…	725	adroitly	engineers	promoted	synergy	…
CD	…	8.201296e-05	2.732854e-08	2.732854e-08	2.732854e-08	2.732854e-08	…
NN	…	7.521128e-09	7.521128e-09	7.521128e-09	7.521128e-09	2.257091e-05	…
NNS	…	1.670013e-08	1.670013e-08	4.676203e-04	1.670013e-08	1.670013e-08	…
VB	…	3.779036e-08	3.779036e-08	3.779036e-08	3.779036e-08	3.779036e-08	…
RB	…	3.226454e-08	6.456135e-05	3.226454e-08	3.226454e-08	3.226454e-08	…
RP	…	3.723317e-07	3.723317e-07	3.723317e-07	3.723317e-07	3.723317e-07	…
…	…	…	…	…	…	…	…

Exercise 04

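Complete the function create_emission_matrix. It builds the emission matrix B, which in a Hidden Markov Model (HMM) holds the probability of seeing an observation (here, a word) given a state (here, a POS tag). The code works as follows:

Input parameters:

alpha: smoothing parameter; it prevents zero probabilities for (tag, word) pairs that were never seen in training.
tag_counts: a dictionary mapping each POS tag to its number of occurrences.
emission_counts: a dictionary whose keys are (POS tag, word) tuples and whose values are the number of times each emission occurs.
vocab: a dictionary whose keys are the vocabulary words and whose values are indices. Inside this function, vocab is treated as a list.

Output:

B: an emission matrix of dimension (num_tags, len(vocab)).

Function logic:

Get the number of POS tags and store it in num_tags.
Get all unique POS tags, sort them, and store them in the list all_tags.
Get the total number of words in the vocabulary and store it in num_words.
Initialize the emission matrix B with shape (num_tags, num_words) and all entries set to zero.
Get all unique (POS tag, word) tuples, i.e. the keys of the emission_counts dictionary, and store them in the set emis_keys.
For each row of B (a POS tag):
- For each column of B (a word in the vocabulary):
  - Initialize the count of the current emission, count, to zero.
  - Define the (POS tag, word) tuple key.
  - Check whether key exists in the emission_counts dictionary:
    - If it does, read the corresponding count from the dictionary.
  - Get the count of the current POS tag, count_tag, from the tag_counts dictionary.
  - Apply smoothing with Equation (4) to compute the emission probability and assign it to the corresponding entry of B.
Return the filled emission matrix B.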

# UNQ_C4 GRADED FUNCTION: create_emission_matrix

def create_emission_matrix(alpha, tag_counts, emission_counts, vocab):
    '''
    Input: 
        alpha: tuning parameter used in smoothing 
        tag_counts: a dictionary mapping each tag to its respective count
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        vocab: a dictionary where keys are words in vocabulary and value is an index.
               within the function it'll be treated as a list
    Output:
        B: a matrix of dimension (num_tags, len(vocab))
    '''
    
    # get the number of POS tag
    num_tags = len(tag_counts)
    
    # Get a list of all POS tags
    all_tags = sorted(tag_counts.keys())
    
    # Get the total number of unique words in the vocabulary
    num_words = len(vocab)
    
    # Initialize the emission matrix B with places for
    # tags in the rows and words in the columns
    B = np.zeros((num_tags, num_words))
    
    # Get a set of all (POS, word) tuples 
    # from the keys of the emission_counts dictionary
    emis_keys = set(list(emission_counts.keys()))
    
    ### START CODE HERE ###

    # Go through each row (POS tags)
    for i in range(num_tags):

        # Go through each column (words)
        for j in range(num_words):

            # Initialize the emission count for the (POS tag, word) to zero
            count = 0

            # Define the (POS tag, word) tuple for this row and column
            key = (all_tags[i], vocab[j])

            # Check if the (POS tag, word) tuple exists as a key in emission counts
            if key in emis_keys:

                # Get the count of (POS tag, word) from the emission_counts dictionary
                count = emission_counts[key]
                
            # Get the count of the POS tag
            count_tag = tag_counts[all_tags[i]]
                
            # Apply smoothing and store the smoothed value 
            # into the emission matrix B for this row and column
            B[i,j] = (count + alpha) / (count_tag+ alpha*num_words)

    ### END CODE HERE ###
    return B

Test:

# creating your emission probability matrix. this takes a few minutes to run. 
alpha = 0.001
B = create_emission_matrix(alpha, tag_counts, emission_counts, list(vocab))

print(f"View Matrix position at row 0, column 0: {B[0,0]:.9f}")
print(f"View Matrix position at row 3, column 1: {B[3,1]:.9f}")

# Try viewing emissions for a few words in a sample dataframe
cidx  = ['725','adroitly','engineers', 'promoted', 'synergy']

# Get the integer ID for each word
cols = [vocab[a] for a in cidx]

# Choose POS tags to show in a sample dataframe
rvals =['CD','NN','NNS', 'VB','RB','RP']

# For each POS tag, get the row number from the 'states' list
rows = [states.index(a) for a in rvals]

# Get the emissions for the sample of words, and the sample of POS tags
B_sub = pd.DataFrame(B[np.ix_(rows,cols)], index=rvals, columns = cidx )
print(B_sub)

Output:

View Matrix position at row 0, column 0: 0.000006032
View Matrix position at row 3, column 1: 0.000000720
              725      adroitly     engineers      promoted       synergy
CD   8.201296e-05  2.732854e-08  2.732854e-08  2.732854e-08  2.732854e-08
NN   7.521128e-09  7.521128e-09  7.521128e-09  7.521128e-09  2.257091e-05
NNS  1.670013e-08  1.670013e-08  4.676203e-04  1.670013e-08  1.670013e-08
VB   3.779036e-08  3.779036e-08  3.779036e-08  3.779036e-08  3.779036e-08
RB   3.226454e-08  6.456135e-05  3.226454e-08  3.226454e-08  3.226454e-08
RP   3.723317e-07  3.723317e-07  3.723317e-07  3.723317e-07  3.723317e-07
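Because the loop version visits every (tag, word) cell, it takes a while on a full vocabulary. One possible alternative, sketched here under assumptions, fills in only the observed counts and then applies Equation (4) to the whole array. The function name `create_emission_matrix_vectorized` and the toy data are hypothetical; vocab is assumed to be a list of unique words in column order, as in the `list(vocab)` call in the test cell.

```python
import numpy as np

def create_emission_matrix_vectorized(alpha, tag_counts, emission_counts, vocab):
    """Hypothetical vectorized variant of create_emission_matrix."""
    all_tags = sorted(tag_counts.keys())
    tag_index = {tag: i for i, tag in enumerate(all_tags)}
    word_index = {word: j for j, word in enumerate(vocab)}
    num_words = len(vocab)

    # Raw counts C(tag, word); unseen pairs stay at zero
    counts = np.zeros((len(all_tags), num_words))
    for (tag, word), count in emission_counts.items():
        if tag in tag_index and word in word_index:
            counts[tag_index[tag], word_index[word]] = count

    # C(tag) as a column vector so it broadcasts across each row
    tag_totals = np.array([tag_counts[t] for t in all_tags], dtype=float).reshape(-1, 1)

    # Equation (4) applied to the whole matrix at once
    return (counts + alpha) / (tag_totals + alpha * num_words)

# Toy data, made up for illustration
small_vocab = ["dog", "run"]
small_tag_counts = {"NN": 3, "VB": 2}
small_emission_counts = {("NN", "dog"): 2, ("NN", "run"): 1, ("VB", "run"): 2}
B_small = create_emission_matrix_vectorized(0.001, small_tag_counts,
                                            small_emission_counts, small_vocab)
print(B_small)
```

The result has the same layout as B above: rows follow the sorted tags and columns follow the order of the words in vocab.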