Python Natural Language Processing Study Notes (27): 3.11 Further Reading

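This post covers further-reading material for natural language processing, including Python programming resources, regular expressions, Unicode handling, and text preprocessing methods. It links to a range of online resources, such as chapters from Jurafsky & Martin's book, Mertz's book, and references on handling standard and non-standard words. It introduces SIGHAN's focus on processing Chinese text, along with key concepts such as lexical analysis, syntactic analysis, semantic analysis, word segmentation, and part-of-speech tagging, and discusses advanced topics such as simulated annealing, phrase identification, and word sense disambiguation.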

3.11 Further Reading

 

Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. Remember to consult the Python reference materials at http://docs.python.org/. (For example, this documentation covers “universal newline support,” explaining how to work with the different newline conventions used by various operating systems.)
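As a small illustration of universal newline support, here is a minimal sketch that assumes Python 3, where a text stream opened with newline=None (the default) translates "\n", "\r", and "\r\n" to "\n" on reading. The in-memory byte string is made up for this example and stands in for a file on disk.

```python
import io

# Hypothetical file contents mixing three newline conventions:
# Unix (\n), classic Mac OS (\r), and Windows (\r\n).
raw = b"unix line\nold mac line\rwindows line\r\n"

# newline=None enables universal newlines mode for reading.
reader = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=None)
text = reader.read()

# Every line ending has been normalized to "\n".
print(repr(text))
```

The built-in open() accepts the same newline parameter, so the behavior should carry over to real files.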

For more examples of processing words with NLTK, see the tokenization, stemming, and corpus HOWTOs at http://www.nltk.org/howto . Chapters 2 and 3 of (Jurafsky & Martin, 2008) contain more advanced material on regular expressions and morphology.

For more extensive discussion of text processing with Python, see (Mertz, 2003). For information about normalizing non-standard words, see (Sproat et al., 2001).

There are many references for regular expressions, both practical and theoretical. For an introductory tutorial on using regular expressions in Python, see Kuchling’s Regular Expression HOWTO, http://www.amk.ca/python/howto/regex/. For a comprehensive and detailed manual on using regular expressions, covering their syntax in most major programming languages, including Python, see (Friedl, 2002). Other presentations include Section 2.1 of (Jurafsky & Martin, 2008), and Chapter 3 of (Mertz, 2003).

There are many online resources for Unicode. Useful discussions of Python’s facilities for handling Unicode are:

PEP-100, http://www.python.org/dev/peps/pep-0100/

Jason Orendorff, Unicode for Programmers, http://www.jorendorff.com/articles/unicode/

A. M. Kuchling, Unicode HOWTO, http://www.amk.ca/python/howto/unicode

Fredrik Lundh, Python Unicode Objects, http://effbot.org/zone/unicode-objects.htm

Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), http://www.joelonsoftware.com/articles/Unicode.html
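The central idea in these Unicode references, the difference between a string of code points and its encoded bytes, can be sketched briefly. This example assumes Python 3 (where str is Unicode and bytes holds encoded data); several of the resources above describe the older Python 2 model, so details there may differ. The sample string is made up for illustration.

```python
# Assumes Python 3: str holds Unicode code points, bytes holds encoded data.
# The sample text is written with escapes so the source stays pure ASCII.
text = "na\u00efve caf\u00e9 \u4e2d\u6587"

encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

# Code-point count and byte count differ for non-ASCII text.
print(len(text), len(encoded))
print(decoded == text)

# Decoding with the wrong codec does not raise here, but yields garbled text.
garbled = encoded.decode("latin-1")
print(garbled == text)
```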

 

The problem of tokenizing Chinese text is a major focus of SIGHAN, the ACL Special Interest Group on Chinese Language Processing (http://sighan.org/). Our method for segmenting English text follows (Brent & Cartwright, 1995); this work falls in the area of language acquisition (Niyogi, 2006).
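To make the segmentation task concrete, here is a minimal sketch that assumes a segmentation is represented as a string of boundary bits, one per gap between adjacent characters, with '1' meaning that a word ends there. The function name segment and the sample data are illustrative assumptions, not taken from Brent & Cartwright's paper.

```python
def segment(text, segs):
    """Split text at each position where the boundary string has a '1'.

    segs[i] == '1' means a word boundary falls after text[i];
    segs is expected to have len(text) - 1 characters.
    """
    words = []
    last = 0
    for i, bit in enumerate(segs):
        if bit == "1":
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])  # whatever remains is the final word
    return words


text = "doyouseethekitty"
segs = "010010010010000"  # boundaries after "do", "you", "see", "the"
print(segment(text, segs))
```

With this representation, searching for a good segmentation becomes a search over bit strings, which is the kind of large, discrete space where heuristic search methods apply.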

Collocations are a special case of multiword expressions. A multiword expression is a small phrase whose meaning and other properties (e.g., its part of speech) cannot be predicted from its words alone (Baldwin & Kim, 2010).

Simulated annealing is a heuristic for finding a good approximation to the optimum value of a function in a large, discrete search space, based on an analogy with annealing in metallurgy. The technique is described in many Artificial Intelligence texts.
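The general shape of simulated annealing can be sketched as follows. This is a generic illustration, not the implementation from any particular text: the toy problem (recovering a hidden bit string by minimizing Hamming distance), the names TARGET, hamming_cost, and flip_one_bit, and the temperature schedule values are all assumptions chosen for the example.

```python
import math
import random


def simulated_annealing(initial, cost, neighbor, rng,
                        temperature=10.0, cooling=0.95,
                        steps_per_temp=50, min_temperature=0.01):
    """Minimize cost() over a discrete space; return (best_state, best_cost)."""
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    while temperature > min_temperature:
        for _ in range(steps_per_temp):
            candidate = neighbor(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept a worse candidate with
            # probability exp(-delta / T), which shrinks as T cools.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temperature *= cooling  # geometric cooling schedule
    return best, best_cost


# Toy search space: bit strings; cost is the distance to a hidden target.
TARGET = "0100100100100001"


def hamming_cost(bits):
    return sum(a != b for a, b in zip(bits, TARGET))


def flip_one_bit(bits, rng):
    i = rng.randrange(len(bits))
    flipped = "1" if bits[i] == "0" else "0"
    return bits[:i] + flipped + bits[i + 1:]


rng = random.Random(0)  # seeded so runs are repeatable
start = "0" * len(TARGET)
best, best_cost = simulated_annealing(start, hamming_cost, flip_one_bit, rng)
print(best, best_cost)
```

Accepting some worse candidates at high temperature is what lets the search escape local optima; as the temperature falls, the procedure behaves more and more like plain hill climbing.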

The approach to discovering hyponyms in text using search patterns like "x and other ys" is described by (Hearst, 1992).
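A pattern of the form "x and other ys" can be approximated with an ordinary regular expression. The sketch below is a deliberately simplified illustration using Python's re module on a made-up sentence; it is not Hearst's full method, and it will produce false matches on real text.

```python
import re

# Simplified pattern: a single word, then "and other", then a single word.
# Group 1 is the candidate hyponym, group 2 the candidate hypernym.
PATTERN = re.compile(r"\b(\w+) and other (\w+)\b")

text = "She studied sparrows and other birds, as well as oaks and other trees."

pairs = PATTERN.findall(text)
for hyponym, hypernym in pairs:
    print(hyponym, "->", hypernym)
```

On a large corpus, the matched pairs would still need filtering, since phrases such as "this and other reasons" fit the pattern without expressing a hyponym relation.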
