Messy codes in files with coding=gb18030

License: CC 4.0 BY-SA

Original link: http://www.cnblogs.com/cyno/p/4160627.html

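This post describes the problems we ran into when running a term-extraction experiment on a corpus containing GB18030-encoded text, including gedit being unable to read the file, the experiment terminating unexpectedly, and Python errors, and it discusses ways to solve them.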

The corpus we used for our TermExtraction experiment is nominally encoded as 'gb18030', but not all of it is valid gb18030, and this caused us a lot of trouble. GB18030 is the Chinese national standard for character encoding. The corpus is a bilingual parallel corpus in Chinese and English, so if we simply ignore the badly encoded segments and delete them, the alignment between the two languages is lost. The troubles were as follows: