- Download:
    import nltk
    nltk.download()
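Called with no arguments, `nltk.download()` opens the interactive downloader. For a scripted setup, a minimal sketch like the one below fetches only named resources. The helper name `download_resources` and the resource list are assumptions made for illustration; resource ids differ between NLTK releases (newer ones may ask for `punkt_tab` and `averaged_perceptron_tagger_eng`), so follow whatever id the `LookupError` message requests.

```python
# Resource ids assumed to cover the examples in these notes; adjust per NLTK version.
RESOURCES = ("brown", "wordnet", "stopwords", "punkt", "averaged_perceptron_tagger")


def download_resources(names=RESOURCES, quiet=True):
    """Download each named NLTK resource and return {name: succeeded}."""
    import nltk  # imported lazily so this module loads even before NLTK is set up

    # nltk.download returns a truthy value when the resource is available afterwards
    return {name: bool(nltk.download(name, quiet=quiet)) for name in names}
```

Running `download_resources()` once per environment should be enough; resources that are already up to date are not fetched again.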
0. Grammar basics
- N: noun, V: verb, ADJ: adjective, ADV: adverb
- NP: proper noun
- PRO: pronoun, he/her/I/their
- CNJ: conjunction, and/or/but/if/while/although
- DET: determiner, the/a/some/most/every/no
- EX: existential, there/there's
- MOD: modal verb; UH: interjection
- VD: past tense; VG: present participle; VN: past participle
1. Browsing corpora
- brown: the Brown corpus
- categories: text categories
- sents: sentences
- words: words
    >>> from nltk.corpus import brown
    >>> # text categories
    >>> brown.categories()
    ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
    >>> len(brown.sents())
    57340
    >>> len(brown.words())
    1161192
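The same accessors also accept a `categories` argument, so the corpus can be inspected one genre at a time. The following is a small sketch rather than anything from the original notes; `category_sizes` and `demo` are hypothetical helper names, and `demo` assumes the Brown corpus has already been downloaded.

```python
def category_sizes(corpus):
    """Return {category: number of word tokens} for a categorized corpus reader."""
    return {cat: len(corpus.words(categories=cat)) for cat in corpus.categories()}


def demo():
    """Print Brown corpus categories from largest to smallest."""
    from nltk.corpus import brown  # assumes nltk.download('brown') was run

    sizes = category_sizes(brown)
    for cat, count in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{cat:16s} {count}")
```

Calling `demo()` prints one line per category; the helper works with any reader exposing `categories()` and `words(categories=...)`.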
2. Stemming and lemmatization
- Stemming: walking ⇒ walk; walked ⇒ walk
    from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

    porter_stem = PorterStemmer()
    porter_stem.stem('walking')
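The import above pulls in three stemmers but only uses Porter. As a rough sketch, the three can be run side by side to see where they disagree; the `stemmers` dictionary, the `stem_all` helper, and the sample words are illustrative choices, not part of the original notes.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# One instance of each stemmer; Snowball needs the language name.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}


def stem_all(word):
    """Return {stemmer name: stem} for a single word."""
    return {name: stemmer.stem(word) for name, stemmer in stemmers.items()}


for word in ["walking", "walked", "generously"]:
    print(word, stem_all(word))
```

The stemmers are rule-based and need no downloaded data. Their output is a truncated stem, which is not always a dictionary word.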
- Lemmatization: went ⇒ go; are ⇒ be
    from nltk.stem import WordNetLemmatizer

    lemma = WordNetLemmatizer()
    lemma.lemmatize('dogs')
Mind the part of speech: if no POS is given, the word is treated as a noun by default.
    >>> lemma.lemmatize('went')
    'went'
    >>> lemma.lemmatize('went', pos='v')
    'go'
    >>> lemma.lemmatize('are', pos='v')  # forms of "be" are verbs too
    'be'
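Because the lemmatizer defaults to nouns, it is usually paired with a POS tagger. The sketch below maps Penn Treebank style tags (the kind produced by `nltk.pos_tag`) to the single-letter POS codes that `WordNetLemmatizer.lemmatize` accepts. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and mapping by the tag's first letter is a common simplification rather than a complete mapping.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, matching the lemmatizer's default


def lemmatize_tagged(tagged_words, lemmatizer):
    """Lemmatize (word, penn_tag) pairs with any object exposing lemmatize(word, pos=...)."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_words]
```

With `lemmatizer = WordNetLemmatizer()`, a pair such as `('went', 'VBD')` is lemmatized with `pos='v'`, which is the call that returns `'go'` in the session above.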
3. POS tags and stopwords
- POS tags
    words = nltk.word_tokenize('what does the fox say')
    nltk.pos_tag(words)
    # [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
- stopwords
    from nltk.corpus import stopwords

    stopwords.words('english')
Languages supported by stopwords (the README entry is a file in the corpus directory, not a language):
['dutch', 'german', 'hungarian', 'romanian', 'kazakh', 'turkish', 'russian', 'README', 'italian', 'english', 'greek', 'norwegian', 'portuguese', 'finnish', 'danish', 'french', 'swedish', 'azerbaijani', 'spanish', 'indonesian', 'arabic', 'nepali']
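A typical use of the stopword list is filtering tokens before further processing. Here is a minimal sketch, assuming case-insensitive matching is wanted; `remove_stopwords` is a hypothetical helper, and the stop list is passed in as an argument so the function does not depend on which language list is loaded.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words, comparing case-insensitively."""
    stop_set = {word.lower() for word in stop_words}  # set gives O(1) membership tests
    return [token for token in tokens if token.lower() not in stop_set]


# Small hand-written stop list, for illustration only
print(remove_stopwords(["What", "does", "the", "fox", "say"], ["what", "does", "the"]))
```

In practice the second argument would be `stopwords.words('english')` and the first the output of `nltk.word_tokenize`.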
4. Text processing pipeline
