[python] Reading a Japanese-language file

Latest recommended article published 2022-08-12 18:26:15

Licensed under CC 4.0 BY-SA

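This post describes a decoding error I hit while reading a UTF-8 encoded Japanese file with Python (NumPy), and how it was resolved. `np.loadtxt` first failed outright, `np.genfromtxt` turned the Japanese into `nan`, and a structured dtype with `'S10'` loaded the data but raised a `UnicodeDecodeError` when decoding. The file really was UTF-8; the `'S10'` field had truncated each string to 10 bytes, cutting a multi-byte character in half. The fix is to specify the encoding explicitly and use a string dtype, i.e. `encoding='utf8', dtype=str`.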

https://stackoverflow.com/questions/147741/character-reading-from-file-in-python
https://stackoverflow.com/questions/19699367/unicodedecodeerror-utf-8-codec-cant-decode-byte

$ file -I t_city_mesh000
t_city_mesh000: text/plain; charset=utf-8

When I tried to read a UTF-8 file with Python, I got an error:

import numpy as np

data = np.loadtxt('../t_city_mesh000',delimiter=",")
# the first column is Japanese text, the second column is an int

UnicodeEncodeError: 'decimal' codec can't encode characters in position 0-4: invalid decimal Unicode string
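Then I tried switching to np.genfromtxt, and all the Japanese turned into nan.
When I asked a senior colleague, he asked whether I had tried using dtype.
So I used the version below, and it seemed to work without any problems.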

data = np.loadtxt('../t_city_mesh000',delimiter=",",dtype=[('col1', 'S10'), ('col2', 'int32')],unpack=True)

But when I tried to decode the loaded Japanese text, I got an error.

>>> import numpy as np
>>> data = np.loadtxt('../t_city_mesh000',delimiter=",",dtype=[('col1', 'S10'), ('col2', 'int32')],unpack=True)
>>> data
[array(['\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9',
       '\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9',
       '\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9', ...,
       '\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82',
       '\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82',
       '\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82'], dtype='|S10'), array([533941234, 533941243, 533941244, ..., 533962031, 533962032,
       533962041], dtype=int32)]
>>> data[0][0]
'\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9'
>>> data[0][0].decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/na.rong/.pyenv/versions/2.7.15/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: unexpected end of data

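Is the trailing \xe9 in data[0][0] not UTF-8???
I looked it up and found this (see the links at the top),
which says that \xe9 is not UTF-8 but ISO-8859-1.
(In hindsight that was a red herring: here \xe9 is the first byte of the three-byte UTF-8 sequence \xe9\x87\x8e for 野, and the 'S10' dtype cut the string off after 10 bytes.)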
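The failure above can be reproduced without NumPy. The following is a minimal sketch in Python 3 (the session above is Python 2.7, so the reprs differ slightly); slicing the bytes to 10 stands in for what the 'S10' field does, and the variable names are only for illustration.

```python
# Minimal sketch: a 10-byte limit splits a 3-byte UTF-8 character.
text = "あきる野市"
raw = text.encode("utf-8")   # 5 characters x 3 bytes each
print(len(raw))              # 15

# Emulate the 'S10' field: keep only the first 10 bytes.
truncated = raw[:10]
print(truncated[-1:])        # b'\xe9', the first byte of 野 (e9 87 8e)

try:
    truncated.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc.reason, "at byte", exc.start)

# Dropping the incomplete tail decodes, but the text is still cut short.
clean = truncated.decode("utf-8", errors="ignore")
print(clean)
```

So the byte itself is valid UTF-8; it is just missing its two continuation bytes.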

>>> import numpy as np
>>> data=np.loadtxt('../t_city_mesh000',delimiter=",",encoding='ISO-8859-1',dtype=str,unpack=True)
>>> data
array([['\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9\x87\x8e\xe5\xb8\x82',
        '\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9\x87\x8e\xe5\xb8\x82',
        '\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9\x87\x8e\xe5\xb8\x82',
        ..., '\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82',
        '\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82',
        '\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82'],
       ['533941234', '533941243', '533941244', ..., '533962031',
        '533962032', '533962041']], dtype='|S15')
>>> data[0][0]
'\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9\x87\x8e\xe5\xb8\x82'
>>> data[0][0].decode('utf8')
u'\u3042\u304d\u308b\u91ce\u5e02'
>>> print(data[0][0].decode('utf8'))
あきる野市
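Why did encoding='ISO-8859-1' appear to work? A likely explanation is that ISO-8859-1 maps every byte value to exactly one character, so decoding with it never fails and never loses bytes; the original UTF-8 bytes pass through untouched and can be decoded properly afterwards. A rough Python 3 sketch of that round trip (not NumPy's actual internals):

```python
# Sketch: ISO-8859-1 is a lossless byte <-> character mapping.
raw = "あきる野市".encode("utf-8")

# Decoding as ISO-8859-1 always succeeds: one character per byte (mojibake).
as_latin1 = raw.decode("iso-8859-1")
print(len(as_latin1))        # 15, one character per original byte

# Re-encoding gives back the identical bytes, which are still valid UTF-8.
round_tripped = as_latin1.encode("iso-8859-1")
recovered = round_tripped.decode("utf-8")
print(recovered)             # あきる野市
```

In other words, the ISO-8859-1 version worked because dtype=str no longer truncated the field at 10 bytes, not because the file was ISO-8859-1.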

Then I realized I had been really dumb.
All I actually needed was

data=np.loadtxt('../t_city_mesh000',delimiter=",",encoding='utf8',dtype=str)

and that was it...
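For reference, here is a self-contained sketch of the same call under Python 3 with a recent NumPy. The two-row temporary file is a made-up stand-in for t_city_mesh000 (the values are taken from the output above), and converting the second column to integers afterwards is an extra step not in the original code.

```python
import os
import tempfile

import numpy as np

# Hypothetical two-row stand-in for the real file.
rows = "あきる野市,533941234\n青梅市,533962041\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".csv", delete=False
) as f:
    f.write(rows)
    path = f.name

try:
    # Read every field as text, decoding the file as UTF-8.
    names, codes = np.loadtxt(
        path, delimiter=",", encoding="utf-8", dtype=str, unpack=True
    )
finally:
    os.remove(path)

# The numeric column arrives as strings; convert it separately.
codes = codes.astype(np.int64)

print(names[0], codes[0])
```

Reading everything as str avoids a fixed byte width such as 'S10', so multi-byte characters are never split.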