1. Tokenize on whitespace/symbols
    import re

    # 'text' is assumed to hold the input string to be tokenized.
    pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*            # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
      | \.\.\.                  # ellipsis
      | [][.,;"'?():_`-]        # these are separate tokens ('-' moved last so it is a literal, not a range)
    '''
    # non-capturing groups keep re.findall returning whole tokens instead of group tuples
    re.findall(pattern, text)
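A quick illustration of the pattern on a short sentence (the sample text below is a made-up example, not from the original notes):

    text = "That U.S.A. poster-print costs $12.40..."
    print(re.findall(pattern, text))
    # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']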
2. Remove stop words
Stop words are high-frequency words like a/an/and/are/then. Such words heavily distort any scoring formula based on term frequency, so they need to be filtered out.
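A minimal sketch of this step, assuming a small hand-written stop list (real analyzers ship much larger built-in lists):

    STOP_WORDS = {"a", "an", "and", "are", "then", "the", "is", "of", "to"}

    def remove_stop_words(tokens):
        # Drop high-frequency function words so they do not skew tf-based scoring.
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words(["That", "poster-print", "is", "a", "bargain"]))
    # ['That', 'poster-print', 'bargain']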
3. Extract word stems (stemming)
Porter Stemmer
Code (Python): https://tartarus.org/martin/PorterStemmer/python.txt
Before/after comparison on a sample vocabulary: http://snowball.tartarus.org/algorithms/porter/diffs.txt
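The reference implementation is linked above; for a quick feel for its behaviour, here is a sketch using the PorterStemmer class that ships with NLTK (assuming the nltk package is installed; the stems in the comments are what the classic Porter rules produce):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    # Different surface forms collapse onto a common stem, so they match at query time.
    for word in ["caresses", "ponies", "running", "relational"]:
        print(word, "->", stemmer.stem(word))
    # caresses -> caress
    # ponies -> poni
    # running -> run
    # relational -> relat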