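Text segmentation is an important preprocessing step in natural language understanding. The program below splits an article at the punctuation marks ",。?!…" and writes each resulting clause on its own line.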
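import re

# Regex pattern: "+" matches a run of one or more delimiters.
# The capture group makes re.split keep the delimiters in its result.
pattern = r"([,。?!…]+)"
sentence_txt = []
with open("./test.txt", "r", encoding="utf-8") as reader_file:
    for line in reader_file:  # each line is one article
        split_list = re.split(pattern, line)
        segment = ""
        for segment_i in split_list:
            segment += segment_i
            # Any delimiter run (including mixed ones such as "?!") closes the clause.
            if re.fullmatch(pattern, segment_i):
                # Remove spaces, \n, \t and other whitespace inside the clause,
                # then end the line with "\n".
                sentence_txt.append("".join(segment.split()) + "\n")
                segment = ""
        # Keep trailing text that is not followed by a delimiter.
        if segment.strip():
            sentence_txt.append("".join(segment.split()) + "\n")
        sentence_txt.append("\n")  # blank line between articles
with open("./spilt.txt", "w", encoding="utf-8") as writer_file:
    writer_file.writelines(sentence_txt)
print(len(sentence_txt))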
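The same logic can be wrapped in a small function that works on a string instead of files, which makes it easier to check by hand. The sketch below is an illustration rather than part of the original program: the name `split_clauses` and the sample string are assumptions. It relies on the fact that `re.split` also returns the delimiters when the pattern contains a capture group.

```python
import re

# Same delimiter set as the program above; the parentheses form a capture group.
PATTERN = re.compile(r"([,。?!…]+)")

def split_clauses(text):
    """Split text into clauses, each ending with its delimiter run (hypothetical helper)."""
    parts = PATTERN.split(text)  # text and delimiter runs alternate in this list
    clauses = []
    # Even indices hold text, odd indices hold the delimiter run that follows it.
    for i in range(0, len(parts), 2):
        body = "".join(parts[i].split())  # drop spaces, \n, \t inside the clause
        delimiter = parts[i + 1] if i + 1 < len(parts) else ""
        if body or delimiter:
            clauses.append(body + delimiter)
    return clauses

print(split_clauses("你好, 世界!再见"))
# ['你好,', '世界!', '再见']
```

Returning a list instead of writing to a file keeps the splitting rule separate from the I/O, so the caller can decide which line ending to use when saving the clauses.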
Test text
我叫琼八蛋,我已经毕业很多年了,至于你要问我毕业的大学的话,非常抱歉,我很不乐意说,因为我的大学是非常神圣的,向别人说的时候我都会忍住。 以前在小学的时候,很多人都会问我:呃,蛋啊,你怎么会取穷八蛋这个名字。以前还小,我都很老实的告诉他们:是我爸爸取的。 但是