1. Download the corpora
import nltk
nltk.download()
2. Basic text processing
from nltk.book import *
text1
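text1.concordance("")  # put the word to look up between the quotes
text1.similar("")  # put the word to compare between the quotes
text5.common_contexts(["boy", "girl"])
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
text3.generate()  # in versions after NLTK 2.0 the author commented this feature out in the code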
len(text1)
sorted(set(text3))
len(sorted(set(text3)))
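We need to make sure Python uses floating-point division:
from __future__ import division
len(text3) / len(set(text3))  # set(text3) gives the vocabulary; sorted(set(text3)) gives the vocabulary in sorted order
100 * text3.count("love") / len(text3)
def lexical_diversity(text):
    return len(text) / len(set(text))
def percentage(text): return 100 * text.count("word") / len(text)
I don't know why, but when I used a second variable to stand for the word, the test reported that the second variable was not defined.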
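On the "not defined" error noted above: the failing code is not shown, so this is only a guess, but one likely cause is that the second name was used inside the function body without being listed in the def signature. Below is a minimal sketch in Python 3 (where / is already floating-point division, so the __future__ import is unnecessary). The token list is made up to stand in for an NLTK text, and the two-parameter percentage(text, word) signature is illustrative, not something NLTK provides.

```python
# Hypothetical sample tokens standing in for an NLTK text such as text3.
tokens = ["love", "is", "love", "and", "hate", "is", "hate", "love"]

def lexical_diversity(text):
    """Average number of times each distinct token is used."""
    return len(text) / len(set(text))

def percentage(text, word):
    """Share of tokens equal to `word`, as a percentage.

    `word` must appear in the parameter list; if it is only used in the
    body, Python raises NameError when the function is called.
    """
    return 100 * text.count(word) / len(text)

diversity = lexical_diversity(tokens)
love_share = percentage(tokens, "love")
print(diversity, love_share)
```

Since plain lists and NLTK texts both offer count() and len(), the same two functions should work unchanged on text3.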
sent1 = ['Call', 'me', 'Ishmael', '.']
sent4 + sent1
sent1.append("Some")
text[153]
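text4.index('awaken')  # finds the position where 'awaken' first occurs
text5[16715:16735]
Indexes start from zero. When Python fetches an item from a list in memory, we tell it how many elements forward to go, so going forward 0 elements leaves it at the first element.
sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']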
sent[0]
'word1'
sent[9]
'word10'
Indexing past the end is similar to an array index out of bounds:
sent[10]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
IndexError: list index out of range
By convention, m:n denotes elements m through n-1.
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
           'forth', 'from', 'Camelot', '.']
noun_phrase = my_sent[1:4]
noun_phrase
['bold', 'Sir', 'Robin']
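The indexing and slicing rules above can be checked with a plain Python list, no NLTK needed. This is a small sketch with a made-up list: it shows that index 0 is the first element, that indexing past the end raises IndexError, and that a slice m:n covers indexes m through n-1.

```python
# Illustrative list; any Python list behaves the same way.
sent = ["word%d" % i for i in range(1, 11)]  # 'word1' ... 'word10'

first = sent[0]   # index 0 is the first element
last = sent[9]    # index len(sent) - 1 is the last element

# Indexing past the end raises IndexError ...
try:
    sent[10]
    out_of_range = False
except IndexError:
    out_of_range = True

# ... but slicing does not: m:n covers indexes m through n-1,
# so the slice has n - m elements when both bounds are inside the list.
middle = sent[1:4]
clipped = sent[8:20]  # an end bound past the list is clipped to its length
print(first, last, out_of_range, middle, clipped)
```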
We can join the words of a list into a single string, or split a string into a list, as shown below:
' '.join(['Monty', 'Python'])
'Monty Python'
'Monty Python'.split()
['Monty', 'Python']
fdist1 = FreqDist(text1)
fdist1
vocabulary1 = fdist1.keys()
vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was',
'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one',
'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
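fdist1['whale']
906
When FreqDist is first called, the name of the text is passed as its argument. The total number of words ("outcomes") counted in Moby Dick comes to 260,819. The expression keys() gives us a list of all the distinct types in the text, and slicing that list shows its first 50 items.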
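One caveat on the slice vocabulary1[:50]: it relies on keys() returning a frequency-sorted list, which is the NLTK 2 behaviour. In NLTK 3, FreqDist is a subclass of collections.Counter, keys() is no longer frequency-sorted or sliceable, and the usual way to get the top items is most_common(n). The sketch below uses a plain Counter and a made-up token list as a stand-in, so it runs without NLTK or its corpora.

```python
from collections import Counter

# Made-up tokens standing in for text1; with NLTK installed,
# FreqDist(text1) could be used in place of Counter(tokens).
tokens = ["the", "whale", ",", "the", "sea", ",", "the", "ship", "whale", "."]

fdist = Counter(tokens)

total_outcomes = sum(fdist.values())  # total number of tokens counted
whale_count = fdist["whale"]          # frequency of a single word
top_two = fdist.most_common(2)        # (word, count) pairs, most frequent first
vocabulary = [word for word, _ in fdist.most_common()]  # types by frequency
print(total_outcomes, whale_count, top_two, vocabulary[:3])
```

With the real data, fdist1.most_common(50) would be the NLTK 3 counterpart of the 50-item slice.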
fdist1.plot(50, cumulative=True)  # cumulative frequency plot of the 50 most common words
V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
Python functions let us select words at a fine level of detail.
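To make the idea of fine-grained selection concrete, here is a small sketch that filters a vocabulary first by length alone and then by length and frequency together. The token list is made up, and Counter stands in for FreqDist so the example runs without NLTK.

```python
from collections import Counter

# Made-up tokens; with NLTK these would come from a text such as text1.
tokens = ["circumnavigation", "whale", "whale", "sea", "indiscriminately",
          "sea", "sea", "whale", "responsibilities", "whale"]

vocab = set(tokens)
counts = Counter(tokens)

# Condition on length only: types longer than 15 characters.
long_words = sorted(w for w in vocab if len(w) > 15)

# Conditions on length and frequency together.
frequent_short = sorted(w for w in vocab if len(w) <= 5 and counts[w] >= 3)
print(long_words, frequent_short)
```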
bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
text4.collocations()
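The bigram list shown above is just each word paired with its successor, which can be sketched in plain Python with zip. The helper name bigram_pairs is hypothetical, not an NLTK function. Note also that in NLTK 3 nltk.bigrams returns a generator, so it has to be wrapped in list() to display output like the above.

```python
def bigram_pairs(words):
    """Return adjacent word pairs as a list of tuples."""
    # zip stops at the shorter input, so the last word has no pair of its own.
    return list(zip(words, words[1:]))

pairs = bigram_pairs(["more", "is", "said", "than", "done"])
print(pairs)
```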
[w.upper() for w in text1]
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]