《Python自然语言处理（第二版）-Steven Bird等》学习笔记：第02章获得文本语料和词汇资源

最新推荐文章于 2025-05-11 22:38:50 发布

原创

最新推荐文章于 2025-05-11 22:38:50 发布 · 3.4k 阅读

11 ·

CC 4.0 BY-SA版权

文章标签：

#Python 自然语言处理语料库中文资源

第02章获得文本语料和词汇资源

2.1 获取文本语料库
2.2 条件频率分布
2.3 更多关于Python代码重用
2.4 词典资源
2.5 WordNet
2.6 小结

什么是有用的文本语料和词汇资源，我们如何使用Python 获取它们？
哪些Python 结构最适合这项工作？
编写Python 代码时我们如何避免重复的工作？

2.1 获取文本语料库

古腾堡语料库

import nltk
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

emma = nltk.corpus.gutenberg.words('austen-emma.txt') #简·奥斯丁的《爱玛》

len(emma)

emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")

Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity ` 
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on 
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; 
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr . 
 sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
 . It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

调用了NLTK 中的corpus 包中的gutenberg 对象的words()函数。但因为总是要输入这么长的名字很繁琐，Python 提供了另一个版本的import 语句

from nltk.corpus import gutenberg

gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

emma = gutenberg.words("austen-emma.txt")

#这个程序显示每个文本的三个统计量：平均词长、平均句子长度和本文中每个词出现的平均次数（我们的词汇多样性得分）
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid)) 
    num_words = len(gutenberg.words(fileid)) #    
    num_sents = len(gutenberg.sents(fileid)) 
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt

平均词长似乎是英语的一个一般属性，因为它的值总是4。（事实上，平均词长是3 而不是4，因为num_chars 变量计数了空白字符。）相比之下，平均句子长度和词汇多样性看上去是作者个人的特点。

len(gutenberg.raw('blake-poems.txt')) #raw()函数给我们没有进行过任何语言学处理的文件的内容,词汇个数，包括词之间的空格。

macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')#sents()函数把文本划分成句子，其中每一个句子是一个词链表。
macbeth_sentences

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]

macbeth_sentences[1603]

['The', 'hart', 'is', 'sorely', 'charg', "'", 'd']

longest_len = max(len(s) for s in macbeth_sentences)

longest_len_sent = [s for s in macbeth_sentences if len(s) == longest_len]
print(longest_len_sent)

[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]

网络和聊天文本

from nltk.corpus import webtext

for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...') #Firefox 交流论坛，在纽约无意听到的对话，《加勒比海盗》的电影剧本，个人广告和葡萄酒的评论

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...

from nltk.corpus import nps_chat

chatroom = nps_chat.posts('10-19-20s_706posts.xml') #例如：10-19-20s_706posts.xml 包含2006 年10 月19 日从20 多岁聊天室收集的706 个帖子。
print(chatroom[123])

['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

布朗语料库

布朗语料库是第一个百万词级的英语电子语料库的，由布朗大学于1961 年创建。这个语料库包含500 个不同来源的文本，按照文体分类，如：新闻、社论等。

表2-1. 布朗语料库每一部分的示例文档

ID	文件	文体	描述
A16	ca16	新闻	news Chicago Tribune: Society Reportage
B02	cb02	社论	editorial Christian Science Monitor: Editorials
C17	cc17	评论	reviews Time Magazine: Reviews
D12	cd12	宗教	religion Underwood: Probing the Ethics of Realtors
E36	ce36	爱好	hobbies Norling: Renting a Car in Europe
F25	cf25	传说	lore Boroff: Jewish Teenage Culture
G22	cg22	纯文学	belles_lettres Reiner: Coping with Runaway Technology
H15	ch15	政府	government US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17	cj19	博览	learned Mosteller: Probability with Statistical Applications
K04	ck04	小说	fiction W.E.B. Du Bois: Worlds of Color
L13	cl13	推理小说	mystery Hitchens: Footsteps in the Night
M01	cm01	科幻	science_fiction Heinlein: Stranger in a Strange Land
N14	cn15	探险	adventure Field: Rattlesnake Ridge
P12	cp12	言情	romance Callaghan: A Passion in Rome
R06	cr06	幽默	humor Thurber: The Future, If Any, of Comedy

from nltk.corpus import brown

brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

brown.words(categories='news')

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

brown.words(fileids=['cg22'])

['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]

brown.sents(categories=['news', 'editorial', 'reviews'])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

布朗语料库是一个研究文体之间的系统性差异——一种叫做文体学的语言学研究——很方便的资源。

让我们来比较不同文体中的情态动词的用法

from nltk.corpus import brown

news_text = brown.words(categories='news')

fdist = nltk.FreqDist([w.lower() for w in news_text])

modals = ['can', 'could', 'may', 'might', 'must', 'will']

for m in modals:
    print(m + ':', fdist[m], end=' ')

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

modals = ['what', 'when', 'where', 'who', 'why']

for m in modals:
    print(m + ':', fdist[m], end=' ')

what: 95 when: 169 where: 59 who: 268 why: 14

统计每一个感兴趣的文体。我们使用NLTK 提供的带条件的频率分布函数。

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance',

最低0.47元/天解锁文章

200万优质内容无限畅学