Text Classification (1): Text Preprocessing

Before classifying text, the text must be preprocessed. Chinese and English text call for somewhat different preprocessing steps.
(1) English text preprocessing
English preprocessing roughly breaks down into the following steps:
1. Expanding contractions
Contractions need to be expanded during preprocessing: "it's" is equivalent to "it is", "won't" is equivalent to "will not", and so on.

text = "The story loses its bite in a last-minute happy ending that's even less plausible than the rest of the picture ."

text.replace("that's", "that is")

'The story loses its bite in a last-minute happy ending that is even less plausible than the rest of the picture .'
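The single replace() call above handles only one contraction. A minimal dictionary-driven sketch can expand several at once; the CONTRACTIONS mapping below is a small illustrative subset I chose for the example, not a complete list:

```python
# A small illustrative subset of English contractions (not exhaustive).
CONTRACTIONS = {
    "it's": "it is",
    "that's": "that is",
    "won't": "will not",
    "can't": "can not",
    "n't": " not",
}

def expand_contractions(text):
    # Replace longer keys first so "won't" is expanded before the generic "n't" rule fires.
    for short, full in sorted(CONTRACTIONS.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(short, full)
    return text

print(expand_contractions("it's a film that's hard to forget"))
# it is a film that is hard to forget
```

A real system would use word-boundary-aware regex matching instead of plain replace(), but the idea is the same.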
2. Converting to lowercase
English text distinguishes upper and lower case: "Like" and "like" are different strings, but when counting words we usually want to treat them as the same word.

text = "This is a film well worth seeing , talking and singing heads and all ."

text.lower()

'this is a film well worth seeing , talking and singing heads and all .'
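A quick check with collections.Counter shows why lowercasing matters for word counts (the sentence here is made up for illustration):

```python
from collections import Counter

text = "Like it or not , critics like this film ."

# Without lowercasing, "Like" and "like" are counted as two different words.
print(Counter(text.split())["like"])          # only the lowercase occurrence

# After lowercasing, both occurrences are merged into one count.
print(Counter(text.lower().split())["like"])
```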
3. Removing punctuation, digits, and other special characters
Punctuation, digits, and other special characters contribute little to text classification, and removing them reduces the dimensionality of the feature space. This is usually done with regular expressions.

import re

text = "disney has always been hit-or-miss when bringing beloved kids' books to the screen . . . tuck everlasting is a little of both ."
text = re.sub("[^a-zA-Z]", " ", text)

# collapse the extra spaces left behind
' '.join(text.split())

'disney has always been hit or miss when bringing beloved kids books to the screen tuck everlasting is a little of both'
4. Tokenization
English tokenization works differently from Chinese tokenization, and the method can be chosen based on the text at hand. If words and punctuation are already separated by spaces, e.g. "a little of both .", the built-in split() method is enough. If they are not, e.g. "a little of both.", the word_tokenize() function from the nltk library can be used. Installing nltk is straightforward: on Windows, pip install nltk is all that is needed (note that on first use word_tokenize may also prompt you to download the punkt tokenizer models via nltk.download('punkt')).

# words and punctuation separated by spaces
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness ."
text.split()

['part', 'of', 'the', 'charm', 'of', 'satin', 'rouge', 'is', 'that', 'it', 'avoids', 'the', 'obvious', 'with', 'humour', 'and', 'lightness', '.']

# words and punctuation not separated by spaces
from nltk.tokenize import word_tokenize

text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness."
word_tokenize(text)

['part', 'of', 'the', 'charm', 'of', 'satin', 'rouge', 'is', 'that', 'it', 'avoids', 'the', 'obvious', 'with', 'humour', 'and', 'lightness', '.']
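Putting the four steps together, here is a minimal sketch of a preprocessing pipeline. The function name preprocess is my own, and the contraction table is again only a small illustrative subset; since step 3 leaves words space-separated, plain split() suffices for step 4:

```python
import re

# Illustrative subset of contractions (not a complete list).
CONTRACTIONS = {"that's": "that is", "it's": "it is", "won't": "will not"}

def preprocess(text):
    # 1. Expand contractions.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # 2. Convert to lowercase.
    text = text.lower()
    # 3. Keep letters only; every other character becomes a space.
    text = re.sub("[^a-zA-Z]", " ", text)
    # 4. Tokenize on whitespace (words are space-separated after step 3).
    return text.split()

print(preprocess("The story loses its bite in a happy ending that's implausible ."))
# ['the', 'story', 'loses', 'its', 'bite', 'in', 'a', 'happy', 'ending', 'that', 'is', 'implausible']
```

One design caveat: expanding contractions before lowercasing means a capitalized "That's" would slip through; a production pipeline would lowercase first or match case-insensitively.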