NLTK (Natural Language Toolkit) is a Python-based suite of tools for natural language processing.
Installation
pip install nltk
In the Python interpreter:
import nltk
nltk.download()
This opens a downloader from which you can choose which corpora to install.
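If you would rather skip the interactive downloader, the data used by the examples below can be fetched in one call (a minimal sketch; 'book' is the collection identifier for the nltk.book materials):
import nltk
nltk.download('book')  # downloads the corpora used by the nltk.book examples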
1 Text
Which words in a text can summarize roughly what the text is about?
1.1 Searching
concordance('word') displays every occurrence of word together with its surrounding context. The first call on a text may be slow because an index is built first; subsequent searches use that index and are very fast.
from nltk.book import *
text1.concordance('monstrous')
Output:
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . … This came towards us ,
ON OF THE PSALMS . ” Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .’” CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
similar('word') finds other words that appear in contexts similar to word.
common_contexts(['word1', 'word2']) finds the contexts shared by two or more words; the words are separated by commas and enclosed in a list.
dispersion_plot(['word1', ...]) draws a dispersion plot showing where each word occurs in the text, i.e. its offset from the start.
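A minimal sketch putting these three calls together on text1 (the word choices are illustrative; dispersion_plot additionally requires matplotlib):
from nltk.book import *
text1.similar('monstrous')                             # words used in similar contexts
text1.common_contexts(['monstrous', 'very'])           # contexts the two words share
text1.dispersion_plot(['monstrous', 'whale', 'Ahab'])  # positional offsets of each word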
1.2 Counting
len(text1) computes the length of the text.
print(len(text1))       # total number of words and punctuation marks in text1, i.e. the token count
print(len(set(text1)))  # set() removes duplicates, so this counts the distinct words in text1
print(len(text1[0]))    # length of the first token in text1
count('word') returns how many times the word occurs in the text.
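A small sketch combining the counting calls above; dividing the number of distinct words by the token count gives the text's lexical diversity (the word choice is illustrative):
from nltk.book import *
print(text1.count('whale'))          # how often 'whale' occurs in text1
print(len(set(text1)) / len(text1))  # lexical diversity: distinct words per token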
1.3 Frequency
FreqDist() counts how many times each word occurs.
fdist1 = FreqDist(text1)
print(fdist1)
print(fdist1.most_common(20))  # the 20 most frequent tokens with their counts
key = fdist1.keys()
print(list(key)[:5])           # the first five distinct tokens
ite = fdist1.items()
print(list(ite)[:5])           # the first five (token, count) pairs
Output:
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632)]
['[', 'Moby', 'Dick', 'by', 'Herman']
[('[', 3), ('Moby', 84), ('Dick', 84), ('by', 1137), ('Herman', 1)]
To display a cumulative frequency plot of these words (requires matplotlib):
fdist1.plot(20, cumulative=True)
1.4 Fine-grained word selection
v = set(text1)
long_words = [w for w in v if len(w) > 7]  # distinct words longer than seven characters
sorted(long_words)
Very short words are usually not representative of a text, which is why longer words are selected; but very long words tend to be rare, so the best choice is words that are both long enough and frequent enough.
v = set(text1)
fdq = FreqDist(text1)
print(sorted(w for w in v if len(w) > 7 and fdq[w] > 7))
1.5 Collocations and bigrams
'red wine' is a collocation, whereas 'the wine' is not.
To find collocations, first use bigrams() to split the text into pairs of adjacent words.
print(list(bigrams(['more','is','said','than','done'])))
Output:
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
Bigrams that occur with high frequency can be treated as collocations: in essence, a collocation is a pair of words that occurs together more often than we would expect from the frequencies of the individual words. The collocations() method finds such pairs; it prints its results directly, so no print() wrapper is needed.
text4.collocations()
Output:
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
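The same idea can be sketched by hand: count the bigrams with FreqDist and inspect the most frequent pairs (a rough approximation only; unlike collocations(), this does not discount pairs whose words are individually very frequent):
from nltk.book import *
bigram_fd = FreqDist(bigrams(text1))  # frequency distribution over adjacent word pairs
print(bigram_fd.most_common(10))      # the ten most frequent bigrams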
1.6 Functions defined for NLTK frequency distributions
Some common Python string methods can also serve as filtering conditions.
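A short sketch of string methods used as filters over the vocabulary (endswith, istitle, and isdigit are standard Python str methods; the search terms are illustrative):
sorted(w for w in set(text1) if w.endswith('ableness'))  # words ending in 'ableness'
sorted(w for w in set(text1) if w.istitle())[:10]        # first ten title-cased words
sorted(w for w in set(text1) if w.isdigit())[:10]        # first ten purely numeric tokens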