Chapter 3:
This chapter covers the skills needed to process raw text.
Some important points:
1. Accessing text from the web and from disk: APIs such as urlopen(), open(), read(), and write(), plus some string operations. There are also tools for processing the text of HTML pages.
2. Text processing with Unicode: file/terminal (specific encoding) -> decode -> in-memory program, including Python processing (Unicode) -> encode -> file/terminal (specific encoding)
3. Regular expressions: re.search(), re.findall(), re.sub() for replacing, re.split(), and so on (remember to prefix the pattern with r so that it is a raw string).
Another API in NLTK is nltk.regexp_tokenize(), which is similar to re.findall().
Regular expressions are useful for finding word stems and for searching tokenized text.
4. Normalizing text and segmentation: stemmers, lemmatization, sentence segmentation, word segmentation.
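The Unicode pipeline in point 2 can be sketched with the standard library alone. This is a minimal illustration, not code from the chapter: the sample string, the temporary file name, and the choice of Latin-1 as the "specific encoding" are all assumptions made for the example.

```python
import os
import tempfile

# In-memory text is a Python str, i.e. Unicode (sample string is an assumption).
text = "na\u00efve caf\u00e9"

# Hypothetical file location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Encode on the way out: Unicode -> bytes in a specific encoding (Latin-1 here).
with open(path, "w", encoding="latin-1") as f:
    f.write(text)

# The bytes on disk depend on the encoding that was chosen.
with open(path, "rb") as f:
    raw_bytes = f.read()

# Decode on the way in: bytes in a specific encoding -> Unicode.
with open(path, encoding="latin-1") as f:
    decoded = f.read()

# The same text re-encoded as UTF-8 takes more bytes for the accented letters.
utf8_bytes = decoded.encode("utf-8")

print(raw_bytes)
print(decoded == text)
print(len(raw_bytes), len(utf8_bytes))
```

Passing the encoding explicitly to open() keeps the decode and encode steps visible instead of relying on the platform default.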
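This blog post describes in detail how to obtain raw text from the web and from disk and how to process it in memory as Unicode. It focuses on the use of regular expressions, including searching, finding, and replacing. It also covers text normalization and segmentation, with key techniques such as stemming, lemmatization, and sentence and word segmentation.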
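The regular expression operations mentioned in these notes can be tried with Python's re module on its own. The following is a rough sketch under stated assumptions: the sample sentence and the helper name stem() are made up for illustration, and the suffix list is a deliberately naive one, not a real stemmer such as those NLTK provides.

```python
import re

# Sample sentence (an assumption for this sketch).
raw = "The 3 runners were running quickly, and 12 others walked."

# re.search: is there any word starting with "run"?
has_run_word = re.search(r"\brun\w*", raw) is not None

# re.findall: collect every run of digits.
numbers = re.findall(r"\d+", raw)

# re.sub: replace every run of digits with a placeholder.
masked = re.sub(r"\d+", "N", raw)

# re.split: a crude tokenizer that splits on non-word characters.
tokens = re.split(r"\W+", raw.strip("."))


def stem(word):
    """Hypothetical helper: strip one common suffix with a regular expression.

    The non-greedy prefix (.*?) keeps the stem as short as possible, so the
    optional suffix group matches whenever one of the listed suffixes fits.
    """
    match = re.match(r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$", word)
    return match.group(1)


stems = [stem(t) for t in tokens]

print(numbers)
print(masked)
print(tokens)
print(stems)
```

Every pattern is written as a raw string (the r prefix) so that backslashes such as \b and \d reach the regular expression engine unchanged. The naive stem() over-strips some words (for example, "running" becomes "runn"), which is one reason dedicated stemmers and lemmatizers exist.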



