Tokenization
Tokenization splits text, according to the needs of the task, into a sequence of strings (each element is usually called a token). Generally we want each element of the sequence to carry some meaning of its own.
Sentence Tokenization
The example below uses nltk's sentence tokenizer to split a passage into sentences.

import nltk
# nltk.download('punkt')   # sentence tokenizer model, needed once

ulysses = ("Mrkgnao! the cat said loudly. She blinked up out of her avid shameclosing eyes, mewing "
           "plaintively and long, showing him her milkwhite teeth. He watched the dark eyeslits narrowing "
           "with greed till her eyes were green stones. Then he went to the dresser, took the jug Hanlon's "
           "milkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor. "
           "— Gurrhr! she cried, running to lap.")

doc = nltk.sent_tokenize(ulysses)   # split the passage into sentences
for s in doc:
    print(">", s)
Word Tokenization
There are different methods for tokenizing text into words, such as:
1. TreebankWordTokenizer
2. WordPunctTokenizer
3. WhitespaceTokenizer
Different tokenizers split the same sentence into different groups of tokens.

from nltk import word_tokenize

sentence = "Mary had a little lamb it's fleece was white as snow."

# Default tokenization: word_tokenize uses the Treebank word tokenizer
tree_tokens = word_tokenize(sentence)   # needs nltk.download('punkt')

# An alternative tokenizer that splits punctuation off separately
punct_tokenizer = nltk.tokenize.WordPunctTokenizer()
punct_tokens = punct_tokenizer.tokenize(sentence)

print("DEFAULT: ", tree_tokens)
print("PUNCT : ", punct_tokens)
Part-of-Speech Tagging
Part-of-speech tagging labels each token in a sentence with its grammatical category (noun, verb, adjective, ...).

sentence = "Mary had a little lamb it's fleece was white as snow."
# Default Tokenization
tree_tokens = word_tokenize(sentence) # nltk.download('punkt') for this
# Other Tokenizers
punct_tokenizer = nltk.tokenize.WordPunctTokenizer()
punct_tokens = punct_tokenizer.tokenize(sentence)
print("DEFAULT: ", tree_tokens)
print("PUNCT : ", punct_tokens)
pos = nltk.pos_tag(tree_tokens)         # needs nltk.download('averaged_perceptron_tagger')
print(pos)

pos_punct = nltk.pos_tag(punct_tokens)
print(pos_punct)
import re

# POS tags for nouns all start with "N" (NN, NNS, NNP, NNPS)
regex = re.compile("^N.*")

nouns = []
for word, tag in pos:
    if regex.match(tag):
        nouns.append(word)
print("Nouns:", nouns)
Stemming
Stemming removes prefixes and suffixes from a word to obtain its stem (root form).
Common affixes include noun plurals, the progressive "-ing", past participles, and so on.
Chinese normally does not need this step, since its words are not inflected with such affixes. There are also several different stemming algorithms.

import nltk
from nltk import word_tokenize

# Three different stemming algorithms
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.stem.snowball.SnowballStemmer("english")

sentence2 = "When I was going into the woods I saw a bear lying asleep on the forest floor"
tokens2 = word_tokenize(sentence2)
print("\n", sentence2)

for stemmer in [porter, lancaster, snowball]:
    print([stemmer.stem(t) for t in tokens2])
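The three stemmers implement different heuristics, so they can disagree even on isolated words, and the resulting stems are not guaranteed to be dictionary words. A quick comparison sketch:

import nltk

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.stem.snowball.SnowballStemmer("english")

# Stem a few isolated words with each algorithm and print them side by side
for word in ["running", "flies", "generously", "maximum"]:
    print(word, "->", porter.stem(word), "/", lancaster.stem(word), "/", snowball.stem(word))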
Lemmatisation
Lemmatisation uses a dictionary to reduce a word's inflected forms to its base form (the lemma).
Rather than simply stripping affixes, it looks the word up and converts it; for example, "drove" becomes "drive".

import nltk
nltk.download('wordnet')    # WordNet dictionary used by the lemmatizer
nltk.download('omw-1.4')    # Open Multilingual WordNet data
# Reuse tokens2 from the stemming example above
wnl = nltk.WordNetLemmatizer()
tokens2_pos = nltk.pos_tag(tokens2)   # needs nltk.download('averaged_perceptron_tagger')

# With no pos= argument, lemmatize() treats every token as a noun
print([wnl.lemmatize(t) for t in tokens2])
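Note that the call above lemmatizes every token as a noun, so tokens2_pos is computed but not yet used. Below is a minimal sketch of POS-aware lemmatization, continuing from the cell above; treebank_to_wordnet is a small helper introduced here to map Treebank tags onto WordNet POS constants.

from nltk.corpus import wordnet

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS constant (noun by default)."""
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

# Lemmatize each token with its tagged part of speech, so verbs like
# "was" and "saw" are reduced to their base forms
print([wnl.lemmatize(tok, pos=treebank_to_wordnet(tag)) for tok, tag in tokens2_pos])

# The dictionary lookup also handles irregular forms, e.g. "drove" -> "drive"
print(wnl.lemmatize("drove", pos=wordnet.VERB))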