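- Remove all punctuation. The punctuation marks to remove are: , . ! ? : ;
- Replace every number, including decimals, integers, and negative numbers, with the placeholder string ==NUMBER==
- Convert all uppercase letters to lowercase
- Remove all leading spaces from each line
- Collapse consecutive spaces into a single space (except the leading spaces of a line, see rule 4 above)
- If a line ends up empty after the processing above, put the marker string [REMOVED] in its place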
In python (二) I processed this text with regular expressions. How would you process it without regular expressions?
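#!/usr/bin/env python
# coding: utf-8

def noreto(stol):
    # Drop the line ending first so an all-space line really becomes empty
    s = stol.rstrip('\r\n').lower()
    # Rule 1: remove the listed punctuation marks
    s = s.replace(',', '')
    s = s.replace('.', '')
    s = s.replace('!', '')
    s = s.replace('?', '')
    s = s.replace(':', '')
    s = s.replace(';', '')
    # Rule 4: strip leading spaces (after punctuation removal, which can expose new ones)
    s = s.lstrip()
    # Rule 5: collapse runs of spaces into a single space
    s = replaceall('  ', ' ', s)
    # Rule 6: mark lines that ended up empty
    if s == '':
        s = '[REMOVED]'
    finaly = s + '\n'
    # print(finaly)
    with open('test_output.txt', 'a') as fwuck:
        fwuck.write(finaly)

def replaceall(mul, sing, text):
    # Replace mul with sing repeatedly until no occurrence of mul is left
    while text.find(mul) > -1:
        text = text.replace(mul, sing)
    return text

with open('test_input.txt', 'r') as fruck:
    for line in fruck:
        noreto(line)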
What is still missing is replacing numbers with ==NUMBER==.
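The missing number rule could be handled without regular expressions by scanning the line one character at a time. The sketch below is only one possible approach, not a tested part of the script: the names `replace_numbers` and `NUMBER_TOKEN` are made up for illustration, and it assumes that a number is an optional `-` directly followed by digits, optionally followed by a `.` and more digits.

```python
NUMBER_TOKEN = '==NUMBER=='  # placeholder string from the rule list


def replace_numbers(text):
    """Replace each integer, decimal, or negative number with NUMBER_TOKEN."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        j = i
        # A minus sign counts as part of a number only if a digit follows directly.
        if text[j] == '-' and j + 1 < n and text[j + 1].isdigit():
            j += 1
        if text[j].isdigit():
            # Integer part.
            while j < n and text[j].isdigit():
                j += 1
            # Fractional part: the dot counts only if a digit follows it.
            if j + 1 < n and text[j] == '.' and text[j + 1].isdigit():
                j += 1
                while j < n and text[j].isdigit():
                    j += 1
            out.append(NUMBER_TOKEN)
            i = j
        else:
            # Ordinary character: copy it through unchanged.
            out.append(text[i])
            i += 1
    return ''.join(out)


print(replace_numbers('it costs 3.50 or -2 at 10.'))
# it costs ==NUMBER== or ==NUMBER== at ==NUMBER==.
```

If this is used inside `noreto`, it probably has to run before the `replace('.', '')` call. Otherwise the decimal point is already gone by the time the scanner sees it. Numbers would still be replaced, but a trailing period after a number would no longer be distinguishable from a decimal point.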