1. Basic Text Processing
1. Regular Expressions
2. Word Tokenization
3. Word Normalization and Stemming
4. Sentence Segmentation and Decision Trees
2. Minimum Edit Distance
1. Definition of Minimum Edit Distance
2. Computing Minimum Edit Distance
3. Backtrace for Computing Alignments
4. Weighted Minimum Edit Distance
5. Minimum Edit Distance in Computational Biology
3. Language Modeling
1. Introduction to N-grams
2. Estimating N-gram Probabilities
3. Evaluation and Perplexity
4. Generalization and Zeros
Smoothing: Add-one (Laplace) Smoothing
Add-1, Add-k, and Unigram Prior Smoothing (insert figure 3-68)
Interpolation, Backoff, and Web-Scale LMs
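Summary: Add-1 smoothing works for text classification but is not suitable for language modeling; interpolation is the more commonly used method; web-scale N-gram models use simple backoff, trigram -> bigram -> unigram.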
Advanced: Good-Turing Smoothing
Advanced: Kneser-Ney Smoothing