NLP NLTK Notes (1)

Latest recommended article published on 2023-03-31 14:15:19

Pinaceae



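This article shows how to use Python's NLTK library for text processing, including downloading the corpora, basic text operations, and vocabulary statistics, and demonstrates how these features can be used to explore a text.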


1. Downloading the corpora
import nltk

nltk.download()

2. Simple text processing
from nltk.book import *

text1

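text1.concordance("")  # put the word to look up between the quotes

text1.similar("")  # put the word to compare between the quotes

text5.common_contexts(["boy", "girl"])

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

text3.generate()  # in releases after NLTK 2.0 the author commented this feature out of the code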

len(text1)

sorted(set(text3))

len(sorted(set(text3)))

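We need to make sure Python uses floating-point division (this import is only needed on Python 2; on Python 3, / already returns a float):
from __future__ import division
len(text3) / len(set(text3))  # set() gives the vocabulary; sorted(set()) gives the vocabulary in sorted order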
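The difference between true division and floor division can be checked without loading a corpus. The following is a minimal sketch on Python 3; the `tokens` list is made-up data standing in for an NLTK text such as text3.

```python
# Hypothetical toy token list standing in for an NLTK text such as text3.
tokens = ["in", "the", "beginning", "god", "created",
          "the", "heaven", "and", "the", "earth"]

vocab = set(tokens)                       # distinct word types
ratio = len(tokens) / len(vocab)          # true division (default on Python 3)
floor_ratio = len(tokens) // len(vocab)   # floor division, what Python 2's / did for ints

print(sorted(vocab))        # the vocabulary in sorted order
print(ratio, floor_ratio)   # the float result next to the truncated one
```

Without true division the ratio would be truncated to an integer, which hides most of the variation between texts.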

100 * text3.count("love") / len(text3)

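def lexical_diversity(text):
    return len(text) / len(set(text))

def percentage(text):
    return 100 * text.count("word") / len(text)

Note: when I tried using a second variable to stand for the word, the test reported that the second variable was not defined, and I do not know why. (Most likely the name was used inside the function body without being listed as a parameter in the def line.)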
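A name used inside a function body has to be either a declared parameter or something defined before the call, which is the usual cause of a "not defined" error here. Below is a sketch of a two-parameter version; the function name `percentage_of` and the `toy` list are illustrative assumptions, not part of NLTK.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))


def percentage_of(tokens, word):
    """Percentage of tokens equal to `word`; `word` is a declared parameter."""
    return 100 * tokens.count(word) / len(tokens)


# Hypothetical toy data, not taken from the NLTK corpora.
toy = ["love", "is", "love", "and", "hate", "is", "hate", "love"]

print(lexical_diversity(toy))       # 8 tokens over 4 distinct types
print(percentage_of(toy, "love"))   # 3 of the 8 tokens are "love"
```

With the nltk.book texts loaded, the same function could be called as percentage_of(text3, "love").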

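sent1 = ['Call', 'me', 'Ishmael', '.']

sent4 + sent1

sent1.append("Some")

text[153]  # here `text` stands for one of the loaded texts, text1 through text9

text4.index('awaken')  # finds the position where 'awaken' first occurs

text5[16715:16735]
Indexes start from zero. When Python fetches content from a list in memory, we tell it how many elements forward to go, so going forward 0 elements leaves it on the first element.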

sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']
sent[0]
'word1'
sent[9]
'word10'

Indexing past the end is similar to an array index going out of bounds:

sent[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
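When an index may fall outside the list, the error can be caught rather than left to stop the program. This is a small sketch; the helper name `safe_get` is hypothetical and introduced only for illustration.

```python
sent = ["word%d" % i for i in range(1, 11)]   # 'word1' ... 'word10'


def safe_get(items, i, default=None):
    """Return items[i], or `default` when i is out of range."""
    try:
        return items[i]
    except IndexError:
        return default


print(sent[len(sent) - 1])   # the last valid index is len(sent) - 1
print(safe_get(sent, 10))    # out of range, so the default is returned
```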

By convention, m:n means elements m through n-1.

my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
           'forth', 'from', 'Camelot', '.']
noun_phrase = my_sent[1:4]
noun_phrase
['bold', 'Sir', 'Robin']
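A consequence of the m:n convention is that a slice always has n - m elements, and that splitting a list at any index loses nothing. A short sketch using the same sentence (the variable names `m`, `n`, and `piece` are chosen here for illustration):

```python
my_sent = ["Bravely", "bold", "Sir", "Robin", ",", "rode",
           "forth", "from", "Camelot", "."]

m, n = 1, 4
piece = my_sent[m:n]                 # elements m .. n-1
print(piece, len(piece) == n - m)

print(my_sent[:3])                   # omitted start means "from the beginning"
print(my_sent[-2:])                  # negative indexes count from the end
print(my_sent[:m] + my_sent[m:] == my_sent)   # the two halves rebuild the list
```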

We can join the words of a list into a single string, or split a string into a list, as shown below:

' '.join(['Monty', 'Python'])
'Monty Python'
'Monty Python'.split()
['Monty', 'Python']
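join and split are inverses only as long as the words themselves contain no separator. The sketch below shows the round trip and how split() with and without an explicit separator treats repeated spaces; the variable names are illustrative.

```python
words = ["Monty", "Python"]

joined = " ".join(words)
print(joined.split() == words)        # the round trip restores the list

semi = ";".join(words)                # any string can act as the separator
print(semi, semi.split(";"))

spaced = "Monty   Python"             # three spaces between the words
print(spaced.split())                 # no argument: runs of whitespace collapse
print(spaced.split(" "))              # explicit separator: empty strings are kept
```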

fdist1 = FreqDist(text1)
fdist1

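vocabulary1 = fdist1.keys()  # under NLTK 2 / Python 2 this is a list ordered by decreasing frequency
vocabulary1[:50]  # under NLTK 3 / Python 3 keys() cannot be sliced; fdist1.most_common(50) gives the top 50 (word, count) pairs instead
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was',
'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one',
'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
fdist1['whale']
906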

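When we first call FreqDist, we pass the name of the text as an argument. We can see the total number of words ("outcomes") counted in Moby Dick: as many as 260,819. The expression keys() gives us a list of all the distinct types in the text, and we can look at the first 50 items of this list by slicing it.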
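In NLTK 3, FreqDist is built on Python's collections.Counter, so the counting ideas above can be tried without downloading a corpus. A minimal sketch with a made-up sentence in place of text1:

```python
from collections import Counter

# Hypothetical toy tokens in place of text1.
tokens = "the whale and the sea and the ship".split()

fdist = Counter(tokens)

print(sum(fdist.values()))    # total number of tokens ("outcomes")
print(len(fdist))             # number of distinct types
print(fdist.most_common(2))   # the two most frequent (word, count) pairs
print(fdist["whale"])         # count for a single word
```

most_common(n) returns an ordered list, so it can be sliced or truncated where keys() on Python 3 cannot.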

fdist1.plot(50, cumulative=True)  # cumulative frequency plot of the 50 most common words
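The numbers behind a cumulative frequency plot are running totals of the counts, taken in order of decreasing frequency. This sketch computes them by hand with the standard library on made-up tokens; it illustrates the idea and is not how NLTK draws its plot.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical toy tokens in place of text1.
tokens = "the whale and the sea and the ship".split()

counts = Counter(tokens)
ranked = counts.most_common()                      # (word, count), most frequent first
running = list(accumulate(c for _, c in ranked))   # running totals of the counts
total = sum(counts.values())

for (word, _), cum in zip(ranked, running):
    # word, cumulative count, cumulative share of all tokens in percent
    print(word, cum, round(100 * cum / total, 1))
```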

V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

Python expressions such as this list comprehension allow fine-grained selection of words.
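The same comprehension pattern extends to several conditions at once, for example combining word length with frequency. A sketch on made-up tokens, where the thresholds 7 and 1 are arbitrary values chosen for the toy data:

```python
from collections import Counter

# Hypothetical toy tokens; a real run would use a text such as text1.
tokens = ["constantly", "whale", "constantly", "sea", "necessarily",
          "whale", "whale", "necessarily", "circumnavigation"]

counts = Counter(tokens)

# Keep words that are both long (more than 7 letters) and repeated (seen more than once).
selected = sorted(w for w in set(tokens) if len(w) > 7 and counts[w] > 1)
print(selected)
```

Long words that occur only once, like 'circumnavigation' here, drop out, as do frequent but short words.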

list(bigrams(['more', 'is', 'said', 'than', 'done']))  # in NLTK 3 bigrams() returns a generator, so wrap it in list() to see the pairs
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
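A bigram list is simply every token paired with its successor, which can be reproduced in plain Python with zip. The helper name `pairwise_bigrams` is hypothetical; it is a stand-in for illustration, not NLTK's implementation.

```python
def pairwise_bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))


words = ["more", "is", "said", "than", "done"]
pairs = pairwise_bigrams(words)

print(pairs)
print(len(pairs) == len(words) - 1)   # n tokens give n - 1 bigrams
```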

text4.collocations()

[w.upper() for w in text1]
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]