Chapter 3: Processing Raw Text
import nltk
from nltk import word_tokenize
3.1 Accessing Text from the Web and from Disk
Electronic Books
Browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog and obtain the URL of a book's plain-text file. The collection includes Chinese texts.
from urllib import request
url ="http://www.gutenberg.org/files/25196/25196-0.txt" #编号2554 的文本是《百家姓》
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)
str
len(raw)
20497
raw[600:800]
'百家姓\r\n\r\n趙錢孫李 周吳鄭王 馮陳褚衛 蔣沈韓楊\r\n朱秦尤許 何呂施張 孔曹嚴華 金魏陶薑\r\n戚謝鄒喻 柏水竇章 雲蘇潘葛 奚範彭郎\r\n魯韋昌馬 苗鳳花方 俞任袁柳 酆鮑史唐\r\n費廉岑薛 雷賀倪湯 滕殷羅畢 郝鄔安常\r\n\r\n樂於時傅 皮卞齊康 伍餘元蔔 顧孟平黃\r\n和穆蕭尹 姚邵堪汪 祁毛禹狄 米貝明臧\r\n計伏成戴 談宋茅龐 熊紀舒屈 項祝董梁\r\n杜阮藍閔 席季麻強 賈路婁危 江童顏郭\r\n梅盛'
For language processing, we need to break the string up into words and punctuation, a step known as tokenization. It produces the familiar structure: a list of words and punctuation.
tokens = nltk.word_tokenize(raw)
type(tokens)
list
len(tokens)
3542
tokens[:5]
['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of']
text = nltk.Text(tokens) # create an NLTK Text
type(text)
nltk.text.Text
text[:5]
['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of']
text.collocations()
Project Gutenberg-tm; Project Gutenberg; Literary Archive; Archive
Foundation; United States; Gutenberg Literary; electronic works;
Gutenberg-tm electronic; set forth; public domain; electronic work;
Gutenberg-tm License; Bai Jia; Jia Xing; copyright holder; PROJECT
GUTENBERG; BAI JIA; EBOOK BAI; JIA XING; Plain Vanilla
The methods find() and rfind() (reverse find) help us locate the correct index values to use when slicing a string.
raw.find("朱")
628
raw.rfind("周")
612
raw[612:629]
'周吳鄭王 馮陳褚衛 蔣沈韓楊\r\n朱'
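find() and rfind() are also handy for trimming the Project Gutenberg header and footer before analysis. A minimal sketch, assuming this file carries the standard "*** START OF" / "*** END OF" marker lines (the exact wording varies between files):
start = raw.find("*** START OF")  # position of the header's end marker
end = raw.rfind("*** END OF")  # position of the footer's start marker
body = raw[raw.find("\n", start) + 1:end]  # keep the text between the two marker lines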
Dealing with HTML
url = "https://www.baidu.com"
html = request.urlopen(url).read().decode('utf8')
html[:60]
'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.'
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, "lxml").get_text() # specify the parser explicitly to avoid BeautifulSoup's parser warning
tokens = word_tokenize(raw)
tokens[:5]
['location.replace', '(', 'location.href.replace', '(', '``']
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('replace')
no matches
For more sophisticated HTML processing, use the Beautiful Soup package, available from http://www.crummy.com/software/BeautifulSoup/.
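Rather than calling get_text() on the whole page, Beautiful Soup can also pick out individual elements. A minimal sketch, reusing the html string fetched above:
soup = BeautifulSoup(html, "lxml")
paragraphs = [p.get_text() for p in soup.find_all('p')]  # text of each <p> element
links = [a.get('href') for a in soup.find_all('a')]  # the target of each hyperlink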
Processing Search Engine Results
- Advantages
1. Scale.
2. Very easy to use.
- Disadvantages
1. First, the range of allowable search patterns is severely restricted. Unlike a local corpus, where you can write programs to search for arbitrarily complex patterns, search engines generally only allow you to search for individual words or strings of words, sometimes with wildcards.
2. Second, search engines give inconsistent results, and can give very different results at different times or in different geographical regions.
3. Finally, the markup in results returned by a search engine may change unpredictably, breaking any pattern-based method of locating particular content.
Processing RSS Feeds
The blogosphere is an important source of text, in both formal and informal registers. The content of a blog can be accessed with the help of the Universal Feed Parser, a third-party Python library available from http://feedparser.org/.
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']
'Language Log'
len(llog.entries)
13
post = llog.entries[2]
post.title
'Miscellaneous bacteria'
content = post.content[0].value
content[:70]
'<p>Jeff DeMarco spotted this menu item at the Splendid China attractio'
raw = BeautifulSoup(content, "lxml").get_text() # again, pass the parser explicitly
word_tokenize(raw)[10:15]
['attraction', 'in', 'Shenzhen', ':', 'zá']
Reading Local Files
import os
os.listdir('.')
['.ipynb_checkpoints',
'document.txt',
'NLP',
'output.txt',
'readme.txt.txt',
'Steven Bird-2009-Natural Language Processing with Python.pdf',
'Steven Bird-2015-Natural Language Processing with Python.pdf',
'textproc.py',
'__pycache__',
'目录.txt',
'第01章 语言处理与Python.ipynb',
'第02章 获得文本语料和词汇资源.ipynb',
'第03章 加工原料文本.ipynb']
f = open('document.txt')
# note: the 'rU' universal-newline mode is deprecated in Python 3; plain open() already handles all newline conventions
for line in f:
    print(line.strip())
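Alternatively, the entire file can be read into a single string; wrapping the call in a with block is the idiomatic approach, since it closes the file automatically:
with open('document.txt') as f:  # the file is closed automatically when the block exits
    raw = f.read()  # one string holding the whole file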
Extracting Text from PDF, MS Word, and Other Binary Formats
ASCII text and HTML are human-readable formats. Text often comes in binary formats, such as PDF and MS Word, which can only be opened with specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats.
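As an illustration, a minimal sketch of pulling text out of a PDF with the pypdf package (pypdf must be installed separately; the file name is hypothetical):
from pypdf import PdfReader

reader = PdfReader('sample.pdf')  # hypothetical PDF file
text = '\n'.join(page.extract_text() for page in reader.pages)  # join the text of every page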
Capturing User Input
s = input("Enter some text: ")
Enter some text: sdasdassadsa
print("You typed", len(word_tokenize(s)), "words.")
You typed 1 words.
The NLP Pipeline
The NLP pipeline: open a URL and read its HTML content, strip out the markup, select a slice of the characters, then tokenize; converting to an nltk.Text object is optional. We can also lowercase all the words and extract the vocabulary.
from bs4 import BeautifulSoup
url = "https://www.baidu.com/"
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html, "lxml").get_text()
raw = raw[:500]
tokens = word_tokenize(raw)
tokens = tokens[:390]
text = nltk.Text(tokens)
words = [w.lower() for w in text]
vocab = sorted(set(words))
vocab[:5]
["''", '(', ')', ',', '//']
3.2 Strings: Text Processing at the Lowest Level
Basic Operations with Strings
monty = 'Monty Python' # single quotes
monty
'Monty Python'
circus = "Monty Python's Flying Circus" #双引号
circus
"Monty Python's Flying Circus"
circus = 'Monty Python\'s Flying Circus' # if a string contains a single quote, we must escape it with a backslash, or else put the string in double quotes
circus
"Monty Python's Flying Circus"
circus = 'Monty Python's Flying Circus'
File "<ipython-input-6-d481af75953d>", line 1
circus = 'Monty Python's Flying Circus'
^
SyntaxError: invalid syntax
couplet = "Shall I compare thee to a Summer's day?"\
"Thou are more lovely and more temperate:" #使用反斜杠
print(couplet)
Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:
couplet = ("Rough winds do shake the darling buds of May,"
"And Summer's lease hath all too short a date:") #或者括号
print(couplet)
Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:
couplet = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:""" # triple-quoted string
print(couplet)
Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:
'very' + 'very' + 'very' # addition, i.e. concatenation
'veryveryvery'
'very' * 3 # multiplication
'veryveryvery'
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [' ' * 2 * (7 - i) + 'very' * i for i in a]
for line in b:
    print(line)  # print each line, not the whole list b
            very
          veryvery
        veryveryvery
      veryveryveryvery
    veryveryveryveryvery
  veryveryveryveryveryvery
veryveryveryveryveryveryvery
  veryveryveryveryveryvery
    veryveryveryveryvery
      veryveryveryvery
        veryveryvery
          veryvery
            very