Natural Language Processing Basics - 优快云 Blog

Article link: https://blog.youkuaiyun.com/qq_35687547/article/details/101903470

Original link: http://chenhao.space/post/c939a57a.html

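Definition

Natural language processing (NLP) is a discipline that combines linguistics, computer science, and artificial intelligence. The problem it addresses is "enabling machines to understand natural language".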

Stages of development:

1950s: rule-based approaches;

1970s: statistical linguistics;

2003: neural networks.

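Main research areas:

Lexical and phrase level: word segmentation, part-of-speech tagging, named entity recognition, chunking, term weighting, term tightness (proximity)

Syntax and semantics: language modeling, dependency parsing, word sense disambiguation, semantic role labeling, deep semantic analysis

Discourse understanding: text classification and clustering, text summarization, text generation, discourse relation recognition, discourse cohesion, coreference resolution, semantic representation, semantic matching, topic models, sentiment analysis, public opinion monitoring

Systems and applications: information extraction, knowledge graphs (representation, construction, completion, reasoning, etc.), information retrieval (indexing, recall, ranking, etc.), query analysis, question answering, conversational agents, reading comprehension, machine translation, speech recognition and synthesis, OCR, image-to-text generation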

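Tools for the lexical stage:

NLTK

Official site: http://www.nltk.org
A well-known natural language processing library for Python, with the following advantages:
- Bundled corpora and part-of-speech tag sets
- Built-in tokenization, POS (part-of-speech) tagging, NER (named entity recognition), and more
- Strong community support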

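Lexical analysis (processing pipeline)

Raw_Text is a sentence or a document; Tokenize splits it into tokens; POS Tag assigns part-of-speech tags; Lemma/Stemming normalizes word forms, for example am, is, are become be, and worked, working become work; stopwords removes stop words; the final result is a Word_List.

Tokenize

Split a long sentence into "meaningful" small pieces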

import nltk
sentence = "hello, world"
tokens = nltk.word_tokenize(sentence)
tokens

# output
['hello', ',', 'world']
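To make the idea concrete without any downloaded NLTK data, here is a minimal regex-based sketch. It is not how nltk.word_tokenize works internally; the helper name simple_tokenize and the pattern are assumptions chosen purely for illustration.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: each run of word characters becomes one token,
    # and every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("hello, world"))
```

A rule this simple breaks down on contractions, abbreviations, and URLs, which is one reason to prefer a library tokenizer in practice.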

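jieba: a Chinese word segmentation tool

Word form normalization:

Stemming: generally, chopping off inflectional endings that do not change the part of speech

walking, drop ing = walk

walked, drop ed = walk

Lemmatization: mapping the various inflected forms of a word to a single base form

went, normalized = go

are, normalized = be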
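The difference between the two can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the algorithm used by any NLTK stemmer or lemmatizer: naive_stem, naive_lemma, and the tiny IRREGULAR_LEMMAS table are hypothetical names made up for this illustration.

```python
def naive_stem(word):
    # Toy stemmer: strip a known suffix if enough of the word remains.
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lookup table for irregular forms (illustrative, far from complete)
IRREGULAR_LEMMAS = {"went": "go", "am": "be", "is": "be", "are": "be"}

def naive_lemma(word):
    # Toy lemmatizer: consult the lookup table first, then fall back to stemming.
    return IRREGULAR_LEMMAS.get(word, naive_stem(word))

for w in ["walking", "walked", "went", "are"]:
    print(w, naive_stem(w), naive_lemma(w))
```

Suffix stripping handles walking and walked but leaves went and are unchanged; only the dictionary lookup maps them to go and be, which is why lemmatization needs lexical knowledge and not just string rules.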

from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
snowball_stemmer.stem('maximum')

# output
'maximum'

snowball_stemmer.stem('presumably')

# output
'presum'

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize('dogs')

# output
'dog'

wordnet_lemmatizer.lemmatize('churches')

# output
'church'

wordnet_lemmatizer.lemmatize('aardwolves')

# output
'aardwolf'


# Without a POS tag, the word is treated as a noun (NN) by default
wordnet_lemmatizer.lemmatize('are')

# output
'are'

wordnet_lemmatizer.lemmatize('is')

# output
'is'


# With a POS tag
wordnet_lemmatizer.lemmatize('is', pos='v')

# output
'be'

wordnet_lemmatizer.lemmatize('are', pos='v')

# output
'be'

POS tagging

import nltk
text = nltk.word_tokenize("what does the fox say")
text

# output
['what', 'does', 'the', 'fox', 'say']

# POS tagging
nltk.pos_tag(text)

# output
[('what', 'WDT'),
 ('does', 'VBZ'),
 ('the', 'DT'),
 ('fox', 'NNS'),
 ('say', 'VBP')]
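The tags returned by nltk.pos_tag are Penn Treebank tags, while the pos argument of WordNetLemmatizer.lemmatize expects WordNet codes such as 'n' and 'v'. One common way to bridge the two is a small mapping function. The sketch below is an assumption about how one might write it; penn_to_wordnet is a hypothetical helper, not part of NLTK.

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet POS code.
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default

# The tagged output shown above, reused as input
tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

Each pair could then be lemmatized with wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)).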

Named entity recognition

# NER
from nltk import ne_chunk, pos_tag, word_tokenize
sentence = "John studies at Stanford University."
ner = ne_chunk(pos_tag(word_tokenize(sentence)))

print(ner), type(ner)

# output
(S
  (PERSON John/NNP)
  studies/NNS
  at/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  ./.)
(None, nltk.tree.Tree)


[" ".join(w for w, t in elt) for elt in ner if isinstance(elt, nltk.Tree)]

# output
['John', 'Stanford University']

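Stop words

# Download the stopwords corpus first: nltk.download('stopwords')

# stopwords
from nltk.corpus import stopwords
# First tokenize the text to get a word_list
# ...
# Then filter it, building the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filter_words = [word for word in word_list if word not in stop_words]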

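Discourse understanding: sentiment analysis

The simplest approach is a sentiment dictionary

like 1

good 2

bad -2

terrible -3

This works like a keyword scoring scheme

For example: AFINN-111

Download: https://gist.github.com/damianesteban/06e8be3225f641100126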

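# Discourse understanding: sentiment analysis

sentiment_dictionary = {}
with open('./AFINN-111.txt') as f:
    for line in f:
        word, score = line.split('\t')
        sentiment_dictionary[word] = int(score)

# With the score table stored in a dict,
# walk through the sentence and add up the matching values
words = ['like', 'love', 'beautiful']
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)

'''
Description
The Python dict get() method returns the value for the given key; if the key is not in the dict, it returns the default value.

Syntax
dict.get(key, default=None)

Parameters
key -- the key to look up in the dict.
default -- the value to return when the key is not present.

Return value
The value for the given key, or the default value (None unless specified) if the key is not in the dict.
'''

# A word found in the dict contributes its score; any other word contributes 0

# This gives a sentiment score

total_score

# output
8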

Clearly, this approach is too naive

What about new words?

What about special vocabulary?

What about deeper meaning?
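These weaknesses are easy to reproduce. The sketch below reuses the four toy scores listed earlier (not the real AFINN-111 values); toy_scores and score are hypothetical names for this illustration.

```python
# Toy dictionary taken from the example scores above
toy_scores = {"like": 1, "good": 2, "bad": -2, "terrible": -3}

def score(sentence):
    # Sum the scores of known words; unknown words contribute 0.
    return sum(toy_scores.get(w, 0) for w in sentence.lower().split())

print(score("this book is good"))      # a plain positive sentence
print(score("this book is not good"))  # negation is ignored, so the score is unchanged
print(score("this book is meh"))       # a word missing from the dictionary scores 0
```

Both of the first two sentences score 2, because a word-by-word lookup has no notion of negation, and any word missing from the table is silently treated as neutral.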

Improvement:

# Improvement

from nltk.classify import NaiveBayesClassifier

# Make up a small training set
s1 = 'this is a good book'
s2 = 'this is a awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'

def preprocess(s):
    # Func: sentence preprocessing
    # Here we simply use split() to separate the words in the sentence;
    # clearly, many other preprocessing methods could be used
    return {word: True for word in s.lower().split()}
    # The return value looks like this:
    # {'this': True, 'is': True, 'a': True, 'good': True, 'book': True}
    # The key is the fname, one per word that occurs in the text;
    # the value is the fval, the value attached to that word.
    # Here we use the simplest possible value, True, to mean that the word occurs in the current sentence.
    # Of course, this function could be upgraded to produce richer fvals, such as word2vec


# Put the training set into the standard form
training_data = [[preprocess(s1), 'pos'],  # pos, neg are the labels
                 [preprocess(s2), 'pos'],
                 [preprocess(s3), 'neg'],
                 [preprocess(s4), 'neg']]

# Feed it to the model
model = NaiveBayesClassifier.train(training_data)

# Print the result
print(model.classify(preprocess('this is a good book')))


# output
pos

# Read in all the data first
pos_data = []
with open('PATH_TO_rt-polarity-pos.txt', encoding='latin-1') as f:
    for line in f:
        pos_data.append([preprocess(line), 'pos'])

neg_data = []
with open('PATH_TO_rt-polarity-neg.txt', encoding='latin-1') as f:
    for line in f:
        neg_data.append([preprocess(line), 'neg'])

# Split into training and test sets
training_data = pos_data[:4000] + neg_data[:4000]
testing_data = pos_data[4000:] + neg_data[4000:]

# Train the model
model = NaiveBayesClassifier.train(training_data)

# Test
print(model.classify(preprocess('this is a bad movie')))
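The snippet above builds testing_data but only classifies a single hand-written sentence. A natural next step is to measure accuracy on the held-out set. Below is a minimal sketch of that computation; the accuracy helper, toy_classify, and toy_test are stand-ins invented for this illustration so the block runs without the dataset or a trained model.

```python
def accuracy(classify, labeled_data):
    # labeled_data: list of [features, label] pairs, the same shape as testing_data
    correct = sum(1 for features, label in labeled_data if classify(features) == label)
    return correct / len(labeled_data)

# Stand-in classifier and data, purely for illustration
def toy_classify(features):
    return 'neg' if 'bad' in features else 'pos'

toy_test = [[{'good': True, 'movie': True}, 'pos'],
            [{'bad': True, 'movie': True}, 'neg'],
            [{'dull': True, 'movie': True}, 'neg']]

print(accuracy(toy_classify, toy_test))
```

With the trained model from above, the call would be accuracy(model.classify, testing_data). NLTK also provides nltk.classify.accuracy(model, testing_data), which should report the same quantity.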

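Bag-of-words model (BOW)

The bag-of-words model turns a sentence into a vector. It is a simple, direct method: it ignores the order of the words in the sentence and only counts how many times each word in the vocabulary occurs in it.

Drawbacks:

Vocabulary: the vocabulary has to be designed carefully, above all to keep its size manageable, because its size determines how sparse the document representations are.
Sparsity: sparse representations are harder to model, both for computational reasons (space and time complexity) and for informational reasons: the model has to make use of very little information in a very large representation space.
Meaning: discarding word order discards context, and with it much of the meaning (semantics) of the words in the document. Context and meaning have a lot to offer a model: with them it could tell apart the same words in a different order ("this is interesting" vs "is this interesting"), recognize synonyms ("old bike" vs "used bike"), and much more.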

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    "John likes to watch movies, Mary likes movies too",
    "John also likes to watch football games",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
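# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vectorizer.get_feature_names_out().tolist())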
print(X.toarray())

# output
['also', 'football', 'games', 'john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
[[0 0 0 1 2 1 2 1 1 1]
 [1 1 1 1 1 0 0 1 0 1]]
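The same counts can be reproduced by hand, which makes it clearer what the vectorizer is doing. This is a sketch, not scikit-learn's implementation: it assumes CountVectorizer's default behavior of lowercasing and keeping only tokens of two or more word characters, and the names tokenize, vocabulary, and vectors are chosen for this illustration.

```python
import re
from collections import Counter

corpus = [
    "John likes to watch movies, Mary likes movies too",
    "John also likes to watch football games",
]

def tokenize(text):
    # Lowercase, then keep tokens made of at least two word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of all tokens seen in the corpus
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# Each document becomes a vector of per-term counts, in vocabulary order
vectors = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
print(vectors)
```

Note that word order plays no role anywhere in this computation, which is exactly the limitation listed under "Meaning" above.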

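TF-IDF

TF: Term Frequency, which measures how frequently a term occurs in a document.

TF(t) = (number of times t occurs in the document) / (total number of terms in the document)

IDF: Inverse Document Frequency, which measures how important a term is.

Some words occur very often but are clearly not very useful, such as 'is', 'the', and 'and'.

To compensate, we raise the weight of rare words and lower the weight of common ones.

IDF(t) = log(total number of documents / number of documents containing t)

The base of the logarithm is a matter of convention; the example below uses base 10.

TF-IDF = TF * IDF

Example:

A document contains 100 words, and the word baby occurs 3 times.

Then TF(baby) = 3 / 100 = 0.03

Suppose we have 10M documents, and baby occurs in 1000 of them.

Then IDF(baby) = log10(10000000 / 1000) = log10(10000) = 4

So TF-IDF(baby) = TF(baby) * IDF(baby) = 0.03 * 4 = 0.12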
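The arithmetic of this example can be checked in a few lines. The sketch below assumes a base-10 logarithm, which is what yields an IDF of 4 here; the function names tf and idf are illustrative only.

```python
import math

def tf(term_count, total_terms):
    # Term frequency: share of the document's terms that are this term
    return term_count / total_terms

def idf(n_docs, n_docs_with_term, log=math.log10):
    # Inverse document frequency; the log base is a convention (base 10 by default here)
    return log(n_docs / n_docs_with_term)

tf_baby = tf(3, 100)
idf_baby = idf(10_000_000, 1_000)
print(tf_baby, idf_baby, tf_baby * idf_baby)

# The same example with the natural logarithm gives a different scale
print(tf_baby * idf(10_000_000, 1_000, log=math.log))
```

Switching to the natural logarithm changes the IDF to about 9.21 and the product to about 0.276. The ranking of terms is unaffected, since changing the base only rescales every IDF by the same constant.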

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

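# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vectorizer.get_feature_names_out().tolist())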
print(X)
print(X.toarray())


# output
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
  (0, 1)	0.46979138557992045
  (0, 2)	0.5802858236844359
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 8)	0.38408524091481483
  (1, 5)	0.5386476208856763
  (1, 1)	0.6876235979836938
  (1, 6)	0.281088674033753
  (1, 3)	0.281088674033753
  (1, 8)	0.281088674033753
  (2, 4)	0.511848512707169
  (2, 7)	0.511848512707169
  (2, 0)	0.511848512707169
  (2, 6)	0.267103787642168
  (2, 3)	0.267103787642168
  (2, 8)	0.267103787642168
  (3, 1)	0.46979138557992045
  (3, 2)	0.5802858236844359
  (3, 6)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 8)	0.38408524091481483
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
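These numbers do not follow directly from the plain TF * IDF formula given earlier. With its default settings, TfidfVectorizer uses raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reproduces the matrix by hand under those assumptions; it is an illustration of the defaults, not scikit-learn's actual code, and all names in it are made up for this example.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Lowercase, then keep tokens made of at least two word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(d)) for d in corpus]
n_docs = len(docs)
vocabulary = sorted({t for d in docs for t in d})

# Document frequency and smoothed IDF for every term
df = {t: sum(1 for d in docs if t in d) for t in vocabulary}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_row(counts):
    # Raw count times IDF, then L2-normalize the row
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print([round(x, 8) for x in row])
```

Terms such as is, the, and this occur in every document, so their smoothed IDF is exactly 1 and they receive the smallest nonzero weights in each row.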