Before performing text classification, the text needs to be preprocessed. The preprocessing steps for Chinese text and English text differ somewhat.
(1) English text preprocessing
The text preprocessing process can roughly be divided into the following steps:
1. Replacing English contractions
During preprocessing, English contractions need to be expanded. For example, it's is equivalent to it is, and won't is equivalent to will not, and so on.
text = "The story loses its bite in a last-minute happy ending that's even less plausible than the rest of the picture ."
text.replace("that's", "that is")
‘The story loses its bite in a last-minute happy ending that is even less plausible than the rest of the picture .’
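When a text contains several different contractions, replacing them one by one with replace() becomes tedious. Below is a minimal sketch, assuming a small hand-written contraction dictionary (the entries and the expand_contractions name are illustrative, not part of any library), that expands all of them in one pass with re.sub:
import re

# A small contraction dictionary; extend it as needed for your corpus
contractions = {
    "it's": "it is",
    "that's": "that is",
    "won't": "will not",
    "can't": "can not",
}

def expand_contractions(text):
    # Build one regex that matches any key in the dictionary
    pattern = re.compile("|".join(re.escape(k) for k in contractions))
    return pattern.sub(lambda m: contractions[m.group(0)], text)

expand_contractions("it's a film that's well worth seeing")
'it is a film that is well worth seeing'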
2. Converting to lowercase
English text contains both uppercase and lowercase letters; for example, "Like" and "like" are different strings, but when counting words we want them to be treated as the same word.
text = "This is a film well worth seeing , talking and singing heads and all ."
text.lower()
'this is a film well worth seeing , talking and singing heads and all .'
3. Removing punctuation, numbers, and other special characters
Punctuation, numbers, and other special characters do not help with text classification; removing them reduces the dimensionality of the feature space. These characters are usually removed with regular expressions.
import re
text = "disney has always been hit-or-miss when bringing beloved kids' books to the screen . . . tuck everlasting is a little of both ."
text = re.sub("[^a-zA-Z]", " ", text)
# Remove the extra spaces left behind by the substitution
' '.join(text.split())
'disney has always been hit or miss when bringing beloved kids books to the screen tuck everlasting is a little of both'
4. Tokenization
Tokenizing English text is different from tokenizing Chinese text, and the method can be chosen based on the text at hand. If words and punctuation (or other characters) are separated by spaces, for example "a little of both .", the split() method can be used directly; if words and punctuation are not separated by spaces, for example "a little of both.", the word_tokenize() method from the nltk library can be used. Installing nltk is straightforward: on Windows, simply run pip install nltk.
# Words and punctuation are separated by spaces
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness ."
text.split()
['part', 'of', 'the', 'charm', 'of', 'satin', 'rouge', 'is', 'that', 'it', 'avoids', 'the', 'obvious', 'with', 'humour', 'and', 'lightness', '.']
# Words and punctuation are not separated by spaces
from nltk.tokenize import word_tokenize
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness."
# Note: word_tokenize requires the punkt tokenizer data (nltk.download('punkt'))
word_tokenize(text)
['part', 'of', 'the', 'charm', 'of', 'satin', 'rouge', 'is', 'that', 'it', 'avoids', 'the', 'obvious', 'with', 'humour', 'and', 'lightness', '.']
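Putting the steps above together, here is a minimal sketch of a single preprocessing function for English text; the preprocess_english name and the fixed contraction list are illustrative assumptions, not part of any particular library:
import re

def preprocess_english(text):
    # 1. Expand a few common contractions (illustrative, not exhaustive)
    for short, full in [("it's", "it is"), ("that's", "that is"), ("won't", "will not")]:
        text = text.replace(short, full)
    # 2. Convert to lowercase
    text = text.lower()
    # 3. Remove punctuation, numbers and other special characters
    text = re.sub("[^a-zA-Z]", " ", text)
    # 4. Tokenize by whitespace (the special characters are already gone)
    return text.split()

preprocess_english("Disney has always been hit-or-miss . . . that's even less plausible .")
['disney', 'has', 'always', 'been', 'hit', 'or', 'miss', 'that', 'is', 'even', 'less', 'plausible']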