NLTK (Natural Language Toolkit) is a Python-based suite of tools for natural language processing.
Installation
pip install nltk
In the Python interpreter:
import nltk
nltk.download()
This opens a downloader from which you can choose which corpora to install.
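If you would rather skip the interactive downloader, the data used by the examples below can be fetched in one call (a minimal sketch; 'book' is the collection identifier for the nltk.book materials):
import nltk
nltk.download('book')  # downloads the corpora used by the nltk.book examples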
1 Text
Which words in a text can summarize roughly what the text is about?
1.1 Searching
concordance('word') displays every occurrence of word together with its surrounding context. The first call on a text may be slow because an index is built first; subsequent searches use that index and are very fast.
from nltk.book import *
text1.concordance('monstrous')
Output:
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . … This came towards us ,
ON OF THE PSALMS . ” Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .’” CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
similar('word') finds other words that appear in contexts similar to word.
common_contexts(['word1', 'word2']) finds the contexts shared by two or more words; the words are separated by commas and enclosed in a list.
dispersion_plot(['word1', ...]) draws a dispersion plot showing where each word occurs in the text, i.e. its offset from the start.
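A minimal sketch putting these three calls together on text1 (the word choices are illustrative; dispersion_plot additionally requires matplotlib):
from nltk.book import *
text1.similar('monstrous')                             # words used in similar contexts
text1.common_contexts(['monstrous', 'very'])           # contexts the two words share
text1.dispersion_plot(['monstrous', 'whale', 'Ahab'])  # positional offsets of each word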
1.2 Counting
len(text1) computes the length of the text.
print(len(text1))       # total number of words and punctuation marks in text1, i.e. the token count
print(len(set(text1)))  # set() removes duplicates, so this counts the distinct words in text1
print(len(text1[0]))    # length of the first token in text1
count('word') returns how many times the word occurs in the text.
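A small sketch combining the counting calls above; dividing the number of distinct words by the token count gives the text's lexical diversity (the word choice is illustrative):
from nltk.book import *
print(text1.count('whale'))          # how often 'whale' occurs in text1
print(len(set(text1)) / len(text1))  # lexical diversity: distinct words per token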
1.3 Frequency
FreqDist() counts how many times each word occurs.
fdist1 = FreqDist(text1)
print(fdist1)
print(fdist1.most_common(20))  # the 20 most frequent tokens with their counts
key = fdist1.keys()
print(list(key)[:5])           # the first five distinct tokens
ite = fdist1.items()
print(list(ite)[:5])           # the first five (token, count) pairs
Output:
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632)]
['[', 'Moby', 'Dick', 'by', 'Herman']
[('[', 3), ('Moby', 84), ('Dick', 84), ('by', 1137), ('Herman', 1)]
To display a cumulative frequency plot of these words (requires matplotlib):
fdist1.plot(20, cumulative=True)
1.4 Fine-grained word selection
v = set(text1)
long_words = [w for w in v if len(w) > 7]  # distinct words longer than seven characters
sorted(long_words)
Very short words are usually not representative of a text, which is why longer words are selected; but very long words tend to be rare, so the best choice is words that are both long enough and frequent enough.
v = set(text1)
fdq = FreqDist(text1)
print(sorted(w for w in v if len(w) > 7 and fdq[w] > 7))
1.5 Collocations and bigrams
'red wine' is a collocation, whereas 'the wine' is not.
To find collocations, first use bigrams() to split the text into pairs of adjacent words.
print(list(bigrams(['more','is','said','than','done'])))
Output:
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
Bigrams that occur with high frequency can be treated as collocations: in essence, a collocation is a pair of words that occurs together more often than we would expect from the frequencies of the individual words. The collocations() method finds such pairs; it prints its results directly, so no print() wrapper is needed.
text4.collocations()
Output:
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
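The same idea can be sketched by hand: count the bigrams with FreqDist and inspect the most frequent pairs (a rough approximation only; unlike collocations(), this does not discount pairs whose words are individually very frequent):
from nltk.book import *
bigram_fd = FreqDist(bigrams(text1))  # frequency distribution over adjacent word pairs
print(bigram_fd.most_common(10))      # the ten most frequent bigrams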
1.6 Functions defined for NLTK frequency distributions
Some common Python string methods can also serve as filtering conditions.
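A short sketch of string methods used as filters over the vocabulary (endswith, istitle, and isdigit are standard Python str methods; the search terms are illustrative):
sorted(w for w in set(text1) if w.endswith('ableness'))  # words ending in 'ableness'
sorted(w for w in set(text1) if w.istitle())[:10]        # first ten title-cased words
sorted(w for w in set(text1) if w.isdigit())[:10]        # first ten purely numeric tokens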