https://stackoverflow.com/questions/147741/character-reading-from-file-in-python
https://stackoverflow.com/questions/19699367/unicodedecodeerror-utf-8-codec-cant-decode-byte
>>> file -I t_city_mesh000
t_city_mesh000: text/plain; charset=utf-8
I got an error when reading a UTF-8 file with Python:
data = np.loadtxt('../t_city_mesh000',delimiter=",")
# The first column is Japanese text, the second column is an integer
UnicodeEncodeError: 'decimal' codec can't encode characters in position 0-4: invalid decimal Unicode string
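Then I tried switching to np.genfromtxt, but all the Japanese strings came back as nan. I asked a senior colleague, who asked whether I had set dtype. So I used the call below, and it seemed to load without any problems.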
data = np.loadtxt('../t_city_mesh000',delimiter=",",dtype=[('col1', 'S10'), ('col2', 'int32')],unpack=True)
But when I tried to decode the loaded Japanese strings, I got an error.
>>> import numpy as np
>>> data = np.loadtxt('../t_city_mesh000',delimiter=",",dtype=[('col1', 'S10'), ('col2', 'int32')],unpack=True)
>>> data
[array(['\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9',
'\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9',
'\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9', ...,
'\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82',
'\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82',
'\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82'], dtype='|S10'), array([533941234, 533941243, 533941244, ..., 533962031, 533962032,
533962041], dtype=int32)]
>>> data[0][0]
'\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9'
>>> data[0][0].decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/na.rong/.pyenv/versions/2.7.15/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: unexpected end of data
Is the trailing \xe9 in data[0][0] not UTF-8???
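One way to read this error, as a minimal sketch in Python 3 (the session above is Python 2.7): あきる野市 takes 15 bytes in UTF-8, while an 'S10' field holds only 10 bytes, so the fourth character is cut off after its first byte, 0xe9. The variable names below are illustrative, and NumPy is not needed to reproduce the effect.

```python
# Python 3 sketch; the original session is Python 2.7.
text = "あきる野市"
raw = text.encode("utf-8")   # 5 characters, 3 bytes each
truncated = raw[:10]         # what a 10-byte field such as 'S10' can hold

error_position = None
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    # The last byte, 0xe9, starts a 3-byte sequence that was cut off.
    error_position = exc.start

print(len(raw), truncated, error_position)
```

Under this reading the bytes are valid UTF-8 as far as they go; the field is simply too short for them.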
I looked it up and found a claim that \xe9 is not UTF-8 but ISO-8859-1, so I tried that encoding:
>>> import numpy as np
>>> data=np.loadtxt('../t_city_mesh000',delimiter=",",encoding='ISO-8859-1',dtype=str,unpack=True)
>>> data
array([['\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9\x87\x8e\xe5\xb8\x82',
'\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9\x87\x8e\xe5\xb8\x82',
'\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9\x87\x8e\xe5\xb8\x82',
..., '\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82',
'\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82',
'\xe9\x9d\x92\xe6\xa2\x85\xe5\xb8\x82'],
['533941234', '533941243', '533941244', ..., '533962031',
'533962032', '533962041']], dtype='|S15')
>>> data[0][0]
'\xe3\x81\x82\xe3\x81\x8d\xe3\x82\x8b\xe9\x87\x8e\xe5\xb8\x82'
>>> data[0][0].decode('utf8')
u'\u3042\u304d\u308b\u91ce\u5e02'
>>> print(data[0][0].decode('utf8'))
あきる野市
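A plausible explanation for why this worked, sketched in Python 3 with illustrative variable names: ISO-8859-1 maps every byte value to exactly one character, so decoding and re-encoding with it returns the original bytes unchanged. The change that likely mattered in the session above is dtype=str, which no longer cuts the field at 10 bytes (the result shows dtype='|S15').

```python
# Python 3 sketch; the original session is Python 2.7.
raw = "あきる野市".encode("utf-8")

# ISO-8859-1 never fails to decode: each byte becomes one character.
as_latin1 = raw.decode("iso-8859-1")

# Encoding back with the same codec restores the exact original bytes.
round_tripped = as_latin1.encode("iso-8859-1")

print(len(as_latin1))                 # one character per byte
print(round_tripped == raw)
print(round_tripped.decode("utf-8"))  # the intact bytes decode as UTF-8
```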
I realized I had been really silly. All I actually needed was
data=np.loadtxt('../t_city_mesh000',delimiter=",",encoding='utf8',dtype=str)
and that was it...
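For comparison, here is a sketch of how the same kind of file could be loaded in Python 3 while keeping the second column as integers. It assumes a NumPy version whose loadtxt accepts the encoding parameter. The sample file, its two rows, and the field names "name" and "code" are made up for illustration. A 'U10' field counts characters rather than bytes, so a five-character name fits.

```python
import os
import tempfile

import numpy as np

# Hypothetical two-row sample in the same "name,code" layout as the post.
sample = "あきる野市,533941234\n青梅市,533962041\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # 'U10' stores up to 10 Unicode characters, not 10 bytes.
    data = np.loadtxt(
        path,
        delimiter=",",
        encoding="utf-8",
        dtype=[("name", "U10"), ("code", "int32")],
    )

print(data["name"][0], data["code"][0])
```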