Garbled text in files with coding=gb18030

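This post describes the problems encountered when running a term-extraction experiment on a corpus containing GB18030-encoded text, including gedit being unable to read the file, the experiment aborting unexpectedly, and Python errors, and discusses ways to deal with them.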

The corpus we used for our TermExtraction experiment is declared as 'gb18030', but it is not entirely valid gb18030, and this has caused us a lot of trouble. GB18030 is the Chinese national standard for character encoding. The corpus is a bilingual parallel corpus in Chinese and English. If we ignore the wrongly encoded parts and delete them, the parallel alignment is lost. These are the problems we ran into:

  • gedit can't read the file, even though only a few sentences are wrongly encoded.
  • TermExtraction aborts unexpectedly.
  • Python errors: invalid types....
    :(
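
One possible way to locate the wrongly encoded sentences without deleting them, and so without losing the parallel alignment, is to decode the file line by line and substitute a placeholder for any bytes that are not valid gb18030. The following is only a minimal sketch, not the method used in the experiment: the function name `decode_lines` is hypothetical, and it assumes the corpus stores one sentence per line.

```python
def decode_lines(raw: bytes, encoding: str = "gb18030"):
    """Decode raw bytes line by line.

    Returns (lines, bad_line_numbers). Every input line yields exactly one
    output line, so the line count (and the parallel alignment) is kept.
    Assumption: one sentence per line, lines separated by b"\n".
    """
    lines = []
    bad_line_numbers = []
    # Splitting on b"\n" is safe: 0x0A never occurs inside a multi-byte
    # GB18030 sequence.
    for number, chunk in enumerate(raw.split(b"\n"), start=1):
        try:
            lines.append(chunk.decode(encoding))
        except UnicodeDecodeError:
            # Remember where the problem is, then keep the line anyway,
            # with undecodable bytes replaced by U+FFFD.
            bad_line_numbers.append(number)
            lines.append(chunk.decode(encoding, errors="replace"))
    return lines, bad_line_numbers
```

To use it on a real file, read the corpus in binary mode (`open(path, "rb").read()`), pass the bytes to `decode_lines`, and inspect the reported line numbers. The cleaned lines can then be written back out as UTF-8, which editors such as gedit generally open without trouble.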
posted on 2014-12-12 22:49 cynorr

Reposted from: https://www.cnblogs.com/cyno/p/4160627.html
