lotrus28..
5
尝试读取带有空格,逗号和引号的制表符分隔表时,我遇到了类似的问题:
1115794 4218 "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102 3180 "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444 2328 "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""
import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine = 'c')
pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
这说明它与C解析引擎(默认引擎)有关。也许更改为python会改变一切
counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')
Segmentation fault (core dumped)
现在,这是一个不同的错误。
如果我们继续尝试从表中删除空格,则python-engine的错误再次更改:
1115794 4218 "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102 3180 "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444 2328 "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""
_csv.Error: ' ' expected after '"'
很明显,熊猫在解析我们的行时遇到问题。要使用python引擎解析表,我需要事先删除表中的所有空格和引号。同时,C引擎即使连续出现逗号也不断崩溃。
为了避免创建带有替换的新文件,我这样做是因为表很小:
from io import StringIO
with open(path_counts) as f:
input = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0',''))
counts = pd.read_table(input, sep='\t', index_col=2, header=None, engine='python')
tl; dr
更改解析引擎,请尝试避免数据中出现任何非限定性的引号/逗号/空格。