3.11 Further Reading 深入阅读
Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. Remember to consult the Python reference materials at http://docs.python.org/. (For example, this documentation covers “universal newline support,” explaining how to work with the different newline conventions used by various operating systems.)
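For instance, Python 3 opens files in text mode with universal newlines enabled by default, so Windows-style and old Mac-style line endings are both seen as '\n' when reading; a minimal sketch (the file name example.txt is just a placeholder):

    # write raw Windows-style line endings by disabling newline translation
    with open('example.txt', 'w', newline='') as f:
        f.write('first line\r\nsecond line\r\n')

    # default newline=None: universal newline handling when reading
    with open('example.txt') as f:
        print(f.read().splitlines())    # ['first line', 'second line']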
For more examples of processing words with NLTK, see the tokenization, stemming, and corpus HOWTOs at http://www.nltk.org/howto . Chapters 2 and 3 of (Jurafsky & Martin, 2008) contain more advanced material on regular expressions and morphology.
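As a small taste of what those HOWTOs cover, here is a minimal tokenize-and-stem sketch with NLTK (it assumes the Punkt tokenizer models have already been fetched via nltk.download()):

    import nltk

    raw = "Dogs are running and the children were happily playing."
    tokens = nltk.word_tokenize(raw)        # split raw text into word tokens
    porter = nltk.PorterStemmer()           # strip common morphological endings
    print([porter.stem(t) for t in tokens])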
For more extensive discussion of text processing with Python, see (Mertz, 2003). For information about normalizing non-standard words, see (Sproat et al., 2001).
There are many references for regular expressions, both practical and theoretical. For an introductory tutorial on using regular expressions in Python, see Kuchling’s Regular Expression HOWTO, http://www.amk.ca/python/howto/regex/. For a comprehensive and detailed manual on using regular expressions, covering their syntax in most major programming languages, including Python, see (Friedl, 2002). Other presentations include Section 2.1 of (Jurafsky & Martin, 2008) and Chapter 3 of (Mertz, 2003).
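The following lines show a few everyday operations with Python's re module, of the kind covered in Kuchling's HOWTO (the strings are made up for illustration):

    import re

    wordlist = ['process', 'processing', 'processed', 'preprocess']
    print([w for w in wordlist if re.search(r'ing$', w)])    # words ending in "ing"
    print(re.findall(r'[aeiou]', 'supercalifragilistic'))    # every vowel, in order
    print(re.sub(r'\s+', ' ', 'too   much    whitespace'))   # collapse runs of whitespace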
There are many online resources for Unicode. Useful discussions of Python’s facilities for handling Unicode are listed below (a short encoding example follows the list):
• PEP-100, http://www.python.org/dev/peps/pep-0100/
• Jason Orendorff, Unicode for Programmers, http://www.jorendorff.com/articles/unicode/
• A. M. Kuchling, Unicode HOWTO, http://www.amk.ca/python/howto/unicode
• Frederik Lundh, Python Unicode Objects, http://effbot.org/zone/unicode-objects.htm
• Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), http://www.joelonsoftware.com/articles/Unicode.html
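As a minimal illustration of the facilities these articles discuss: a Python 3 str is a sequence of Unicode code points, which must be encoded to bytes for storage or transmission and decoded again on the way back in:

    import unicodedata

    s = 'café'
    data = s.encode('utf-8')        # encode code points to bytes
    print(data)                     # b'caf\xc3\xa9'
    print(data.decode('utf-8'))     # café
    print(unicodedata.name('é'))    # LATIN SMALL LETTER E WITH ACUTE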
The problem of tokenizing Chinese text is a major focus of SIGHAN, the ACL Special Interest Group on Chinese Language Processing (http://sighan.org/). Our method for segmenting English text follows (Brent & Cartwright, 1995); this work falls in the area of language acquisition (Niyogi, 2006).
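A common way to set this problem up, in the spirit of the chapter's segmentation example, is to represent a candidate segmentation as a string of boundary bits, one bit per gap between adjacent characters; the sketch below is made up for illustration:

    def segment(text, segs):
        """Cut text into words at every gap marked '1' in the boundary string segs."""
        words, last = [], 0
        for i, flag in enumerate(segs, start=1):
            if flag == '1':
                words.append(text[last:i])
                last = i
        words.append(text[last:])
        return words

    print(segment('doyousee', '0100100'))    # ['do', 'you', 'see']
    print(segment('doyousee', '0000000'))    # ['doyousee']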
Collocations are a special case of multiword expressions. A multiword expression is a small phrase whose meaning and other properties (for example, its part of speech) cannot be predicted from its component words alone (Baldwin & Kim, 2010).
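NLTK also ships collocation finders that surface candidate collocations from a tokenized corpus; the following sketch follows the NLTK collocations HOWTO (the genesis corpus is used purely for illustration and must be downloaded first):

    import nltk

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_words(
        nltk.corpus.genesis.words('english-web.txt'))
    finder.apply_freq_filter(3)                    # ignore very rare word pairs
    print(finder.nbest(bigram_measures.pmi, 10))   # ten bigrams with the highest PMI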
Simulated annealing is a heuristic for finding a good approximation to the optimum value of a function in a large, discrete search space, based on an analogy with annealing in metallurgy. The technique is described in many Artificial Intelligence texts.
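A minimal, generic sketch of the idea over bit-vectors follows; the objective function and parameter values are made up for illustration, and the chapter's own segmentation search uses a related annealing strategy with a different scoring function:

    import math
    import random

    def anneal(score, state, iterations=5000, t0=2.0, cooling=0.999):
        """Minimize score(state) over bit-vectors by flipping one random bit
        at a time, sometimes accepting worse states while the temperature is high."""
        current, current_score = list(state), score(state)
        best, best_score = list(state), current_score
        temperature = t0
        for _ in range(iterations):
            candidate = list(current)
            i = random.randrange(len(candidate))
            candidate[i] = 1 - candidate[i]                 # flip one random bit
            delta = score(candidate) - current_score
            # always accept improvements; accept worse moves with prob. exp(-delta/T)
            if delta < 0 or random.random() < math.exp(-delta / temperature):
                current, current_score = candidate, current_score + delta
                if current_score < best_score:
                    best, best_score = list(current), current_score
            temperature *= cooling                          # cool down gradually
        return best, best_score

    # toy objective: Hamming distance from a hidden target bit-vector
    target = [1, 0, 1, 1, 0, 0, 1, 0]
    hamming = lambda bits: sum(b != t for b, t in zip(bits, target))
    print(anneal(hamming, [0] * len(target)))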
The approach to discovering hyponyms in text using search patterns like x and other ys is described in (Hearst, 1992).
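A deliberately crude illustration of such a pattern as a regular expression (a real system would operate over part-of-speech-tagged or chunked text rather than raw strings, and the sentence below is made up):

    import re

    text = "He plays the guitar and other stringed instruments in a folk band."
    # match "X and other Ys", capturing a single word X and a one- or two-word plural Y
    pattern = re.compile(r'(\w+) and other (\w+(?: \w+)?s)\b')
    for hyponym, hypernym in pattern.findall(text):
        print(hyponym, '->', hypernym)          # guitar -> stringed instruments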