Before performing text classification, the text needs to be preprocessed. The preprocessing steps for Chinese text and English text differ somewhat.
(1) English text preprocessing
The text preprocessing process can roughly be divided into the following steps:
1. Replacing English contractions
During preprocessing, English contractions need to be expanded. For example, it's is equivalent to it is, and won't is equivalent to will not, and so on.
text = "The story loses its bite in a last-minute happy ending that's even less plausible than the rest of the picture ."
text.replace("that's", "that is")
‘The story loses its bite in a last-minute happy ending that is even less plausible than the rest of the picture .’
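When a text contains several different contractions, replacing them one by one with replace() becomes tedious. Below is a minimal sketch, assuming a small hand-written contraction dictionary (the entries and the expand_contractions name are illustrative, not part of any library), that expands all of them in one pass with re.sub:
import re

# A small contraction dictionary; extend it as needed for your corpus
contractions = {
    "it's": "it is",
    "that's": "that is",
    "won't": "will not",
    "can't": "can not",
}

def expand_contractions(text):
    # Build one regex that matches any key in the dictionary
    pattern = re.compile("|".join(re.escape(k) for k in contractions))
    return pattern.sub(lambda m: contractions[m.group(0)], text)

expand_contractions("it's a film that's well worth seeing")
'it is a film that is well worth seeing'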
2. Converting to lowercase
English text contains both uppercase and lowercase letters; for example, "Like" and "like" are different strings, but when counting words we want them to be treated as the same word.
text = "This is a film well worth seeing , talking and singing heads and all ."
text.lower()
'this is a film well worth seeing , talking and singing heads and all .'
3. Removing punctuation, numbers, and other special characters
Punctuation, numbers, and other special characters do not help with text classification; removing them reduces the dimensionality of the feature space. These characters are usually removed with regular expressions.
import re
text = "disney has always been hit-or-miss when bringing beloved kids' books to the screen . . . tuck everlasting is a little of both ."
text = re.sub("[^a-zA-Z]", " ", text)
# Remove the extra spaces left behind by the substitution
' '.join(text.split())
'disney has always been hit or miss when bringing beloved kids books to the screen tuck everlasting is a little of both'
4. Tokenization
Tokenizing English text is different from tokenizing Chinese text, and the method can be chosen based on the text at hand. If words and punctuation (or other characters) are separated by spaces, for example "a little of both .", the split() method can be used directly; if words and punctuation are not separated by spaces, for example "a little of both.", the word_tokenize() method from the nltk library can be used. Installing nltk is straightforward: on Windows, simply run pip install nltk.
# Words and punctuation are separated by spaces
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness ."
text.split()
['part', 'of', 'the', 'charm', 'of', 'satin', 'rouge', 'is', 'that', 'it', 'avoids', 'the', 'obvious', 'with', 'humour', 'and', 'lightness', '.']
# Words and punctuation are not separated by spaces
from nltk.tokenize import word_tokenize
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness."
# Note: word_tokenize requires the punkt tokenizer data (nltk.download('punkt'))
word_tokenize(text)
['part', 'of', 'the', 'charm', 'of', 'satin', 'rouge', 'is', 'that', 'it', 'avoids', 'the', 'obvious', 'with', 'humour', 'and', 'lightness', '.']
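Putting the steps above together, here is a minimal sketch of a single preprocessing function for English text; the preprocess_english name and the fixed contraction list are illustrative assumptions, not part of any particular library:
import re

def preprocess_english(text):
    # 1. Expand a few common contractions (illustrative, not exhaustive)
    for short, full in [("it's", "it is"), ("that's", "that is"), ("won't", "will not")]:
        text = text.replace(short, full)
    # 2. Convert to lowercase
    text = text.lower()
    # 3. Remove punctuation, numbers and other special characters
    text = re.sub("[^a-zA-Z]", " ", text)
    # 4. Tokenize by whitespace (the special characters are already gone)
    return text.split()

preprocess_english("Disney has always been hit-or-miss . . . that's even less plausible .")
['disney', 'has', 'always', 'been', 'hit', 'or', 'miss', 'that', 'is', 'even', 'less', 'plausible']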