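Chapter 02 Accessing Text Corpora and Lexical Resources
- What are some useful text corpora and lexical resources, and how can we access them with Python?
- Which Python constructs are most helpful for this work?
- How do we avoid repeating ourselves when writing Python code?
2.1 Accessing Text Corpora
Gutenberg Corpus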
import nltk
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
nltk.corpus.gutenberg.fileids()
['austen-emma.txt',
'austen-persuasion.txt',
'austen-sense.txt',
'bible-kjv.txt',
'blake-poems.txt',
'bryant-stories.txt',
'burgess-busterbrown.txt',
'carroll-alice.txt',
'chesterton-ball.txt',
'chesterton-brown.txt',
'chesterton-thursday.txt',
'edgeworth-parents.txt',
'melville-moby_dick.txt',
'milton-paradise.txt',
'shakespeare-caesar.txt',
'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt',
'whitman-leaves.txt']
emma = nltk.corpus.gutenberg.words('austen-emma.txt')  # "Emma" by Jane Austen
len(emma)
192427
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
expected by the best judges , for surprize -- but there was great joy . Mr .
sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
. It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai
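The code above calls the words() function of the gutenberg object in NLTK's corpus package. Because typing such a long name every time is tedious, Python provides another form of the import statement: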
from nltk.corpus import gutenberg
gutenberg.fileids()
['austen-emma.txt',
'austen-persuasion.txt',
'austen-sense.txt',
'bible-kjv.txt',
'blake-poems.txt',
'bryant-stories.txt',
'burgess-busterbrown.txt',
'carroll-alice.txt',
'chesterton-ball.txt',
'chesterton-brown.txt',
'chesterton-thursday.txt',
'edgeworth-parents.txt',
'melville-moby_dick.txt',
'milton-paradise.txt',
'shakespeare-caesar.txt',
'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt',
'whitman-leaves.txt']
emma = gutenberg.words("austen-emma.txt")
# This program displays three statistics for each text: average word length, average sentence length, and the average number of times each vocabulary item appears in the text (our lexical diversity score).
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt
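Average word length appears to be a general property of English, since its rounded value is always 4 or 5 in the output above. (The true average word length is somewhat lower, because the num_chars variable also counts whitespace characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of individual authors.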
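The whitespace effect is easy to check on a small sample. The sketch below is an illustration only: `sample` is a made-up string, and `str.split()` stands in for the corpus tokenizer, so the figures are not comparable to the table above.

```python
# Toy illustration: counting whitespace inflates the "average word length".
# `sample` is a hypothetical text; str.split() stands in for NLTK's tokenizer.
sample = "the cat sat on the mat and the dog sat on the log"

words = sample.split()
num_chars_raw = len(sample)                    # includes spaces, like len(gutenberg.raw(...))
num_chars_words = sum(len(w) for w in words)   # characters inside words only

avg_with_spaces = num_chars_raw / len(words)
avg_without_spaces = num_chars_words / len(words)
lexical_diversity = len(words) / len(set(w.lower() for w in words))

print(round(avg_with_spaces, 2), round(avg_without_spaces, 2), lexical_diversity)
```

The gap between the two averages is exactly the number of whitespace characters divided by the number of words.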
len(gutenberg.raw('blake-poems.txt'))  # raw() returns the file contents without any linguistic processing, so this is the number of characters, including the spaces between words.
38153
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')  # sents() divides the text into sentences, where each sentence is a list of words.
macbeth_sentences
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]
macbeth_sentences[1603]
['The', 'hart', 'is', 'sorely', 'charg', "'", 'd']
longest_len = max(len(s) for s in macbeth_sentences)
longest_len_sent = [s for s in macbeth_sentences if len(s) == longest_len]
print(longest_len_sent)
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]
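The two-step approach above returns every sentence that shares the maximum length. When a single longest sentence is enough, `max` with `key=len` is a shorter alternative, with the caveat that it returns only the first sentence among ties. The sentences below are made-up stand-ins for `macbeth_sentences`.

```python
# Hypothetical tokenized sentences standing in for gutenberg.sents(...).
sentences = [
    ["Actus", "Primus", "."],
    ["When", "shall", "we", "three", "meet", "againe", "?"],
    ["Faire", "is", "foule", ",", "and", "foule", "is"],
]

# Every sentence of maximal length (the approach used above).
max_len = max(len(s) for s in sentences)
all_longest = [s for s in sentences if len(s) == max_len]

# A single longest sentence; ties resolve to the first one encountered.
first_longest = max(sentences, key=len)

print(len(all_longest), first_longest)
```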
Web and Chat Text
from nltk.corpus import webtext
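for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')  # a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews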
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there! [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')  # For example, 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on October 19, 2006.
print(chatroom[123])
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
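The file name itself carries the metadata described in the comment above. As a minimal sketch, the helper below splits a file ID into its parts; the naming pattern is an assumption inferred from the single example shown here, and `parse_chat_fileid` is a hypothetical helper, not part of NLTK. The year is not encoded in the name.

```python
import re

# Assumed pattern, inferred from the example file name:
# <month>-<day>-<age group>_<number of posts>posts.xml
FILEID_PATTERN = re.compile(r"^(\d{2})-(\d{2})-([^_]+)_(\d+)posts\.xml$")

def parse_chat_fileid(fileid):
    """Split a chat-corpus file ID into its parts (hypothetical helper)."""
    match = FILEID_PATTERN.match(fileid)
    if match is None:
        raise ValueError(f"unexpected file ID: {fileid!r}")
    month, day, age_group, num_posts = match.groups()
    return {
        "month": int(month),
        "day": int(day),
        "age_group": age_group,
        "num_posts": int(num_posts),
    }

info = parse_chat_fileid("10-19-20s_706posts.xml")
print(info)
```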
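Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources, and the sources are categorized by genre, such as news, editorial, and so on.
Table 2-1. Example document for each section of the Brown Corpus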
ID | File | Genre | Description |
---|---|---|---|
A16 | ca16 | news | Chicago Tribune: Society Reportage |
B02 | cb02 | editorial | Christian Science Monitor: Editorials |
C17 | cc17 | reviews | Time Magazine: Reviews |
D12 | cd12 | religion | Underwood: Probing the Ethics of Realtors |
E36 | ce36 | hobbies | Norling: Renting a Car in Europe |
F25 | cf25 | lore | Boroff: Jewish Teenage Culture |
G22 | cg22 | belles_lettres | Reiner: Coping with Runaway Technology |
H15 | ch15 | government | US Office of Civil and Defence Mobilization: The Family Fallout Shelter |
J17 | cj19 | learned | Mosteller: Probability with Statistical Applications |
K04 | ck04 | fiction | W.E.B. Du Bois: Worlds of Color |
L13 | cl13 | mystery | Hitchens: Footsteps in the Night |
M01 | cm01 | science_fiction | Heinlein: Stranger in a Strange Land |
N14 | cn15 | adventure | Field: Rattlesnake Ridge |
P12 | cp12 | romance | Callaghan: A Passion in Rome |
R06 | cr06 | humor | Thurber: The Future, If Any, of Comedy |
from nltk.corpus import brown
brown.categories()
['adventure',
'belles_lettres',
'editorial',
'fiction',
'government',
'hobbies',
'humor',
'learned',
'lore',
'mystery',
'news',
'religion',
'reviews',
'romance',
'science_fiction']
brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
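The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.
Let's compare how modal verbs are used in different genres.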
from nltk.corpus import brown
news_text = brown.words(categories='news')
fdist = nltk.FreqDist([w.lower() for w in news_text])
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
wh_words = ['what', 'when', 'where', 'who', 'why']  # wh-words, not modals, so they get their own name
for w in wh_words:
    print(w + ':', fdist[w], end=' ')
what: 95 when: 169 where: 59 who: 268 why: 14
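The lookup pattern used above does not depend on the Brown Corpus. As a minimal sketch, `collections.Counter` is used here as a stand-in for `nltk.FreqDist` (an assumption made so the example runs without corpus data), and `toy_tokens` is a made-up token list. A word that never occurs simply yields 0.

```python
from collections import Counter

# Hypothetical token list standing in for brown.words(categories='news').
toy_tokens = "You can go if you will but you may not stay and he Will not wait".split()

# Lowercase before counting so that 'Will' and 'will' are merged.
toy_fdist = Counter(w.lower() for w in toy_tokens)

targets = ['can', 'could', 'may', 'might', 'must', 'will']
toy_counts = {m: toy_fdist[m] for m in targets}
print(toy_counts)
```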
Next, we obtain counts for each genre of interest, using NLTK's support for conditional frequency distributions.
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
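Conceptually, a conditional frequency distribution keeps one frequency distribution per condition and is built from (condition, sample) pairs. The sketch below imitates that idea with `defaultdict(Counter)` on made-up data. It is not meant to mirror how NLTK implements `ConditionalFreqDist`, and `toy_corpus` and `toy_cfd` are hypothetical names.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus: genre -> list of tokens.
toy_corpus = {
    "news": ["the", "jury", "will", "meet", "and", "will", "report"],
    "romance": ["she", "could", "not", "see", "what", "he", "could"],
}

# The same kind of (condition, sample) pairs as in the generator expression above.
pairs = ((genre, word) for genre in toy_corpus for word in toy_corpus[genre])

# One Counter per condition, created on first use.
toy_cfd = defaultdict(Counter)
for genre, word in pairs:
    toy_cfd[genre][word] += 1

print(toy_cfd["news"]["will"], toy_cfd["romance"]["could"], toy_cfd["news"]["could"])
```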
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance',