Transposing a Large Text Matrix

Latest recommended article published 2021-04-28 18:03:49

Original


License: CC 4.0 BY-SA

Tags:

#awk #python #text-processing

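Summary: when processing a text matrix of nearly 1 GB, loading it with numpy failed because of memory limits. I switched to a Python script that read the file line by line, which took more than an hour. I finally implemented it in Awk, where a single operation takes 10 seconds but the full conversion still needs 1.5 hours. The problem is that I could not find a better algorithm. Later, too many columns ran into Awk's field limit, which I solved with the sed and paste commands.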

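I got a text matrix of nearly 1 GB and wanted to swap its rows and columns.

At first I naively assumed some small package would take care of it.

I happened to be using numpy, so I tried loadtxt.

And then... it hung... My laptop really is a weakling with a power level of only 5.

Next I dashed off a Python script.

I figured: if the problem is just running out of memory or the like, can't I simply read the file one line at a time?!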

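import os

def revefile(filename):
  # I had planned to use numpy, but the file turned out to be too big.
  os.mkdir('revetmp')
  print('begin read file...')
  f = open(filename, 'r')
  count = 0
  for s in f:
    count += 1
    # Put one field per line, so each input row becomes a one-column file.
    s = s.replace('\t', '\n').replace('\r', '')
    # Zero-pad the index so the shell glob below keeps the rows in order.
    w = open('revetmp/t_%08d.txt' % count, 'w')
    w.write(s)
    w.close()
    if count % 10000 == 0: print(count)
  print(count)
  f.close()

  print('try to paste files....')
  # The glob expands to one argument per row file, so with millions of rows
  # this command line becomes extremely long.
  os.system("paste -d'\\t' revetmp/* > reversefile.txt")
  os.system('rm -rf revetmp')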

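The answer: no, it doesn't work.

The 2.8 million-plus rows crawled along for more than an hour (I should have given up hope as soon as I saw how slow it started).

The fatal flaw is that the paste bash command exceeds the maximum command length limit (pfft).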
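For illustration only, here is a minimal sketch of one way to get around the command length problem. It is not what I actually ran. Instead of handing every row file to a single paste call, it joins the files a batch at a time in pure Python and then joins the intermediate results. The names `paste_files` and `paste_in_batches` and the `batch` parameter are hypothetical, and the sketch assumes every file has the same number of lines (a rectangular matrix).

```python
import os
import tempfile


def paste_files(paths, out_path, sep="\t"):
    """Join a small list of files line by line, similar to what `paste` does."""
    handles = [open(p, "r") for p in paths]
    try:
        with open(out_path, "w") as out:
            # zip stops at the shortest file; equal line counts are assumed.
            for lines in zip(*handles):
                out.write(sep.join(line.rstrip("\n") for line in lines) + "\n")
    finally:
        for h in handles:
            h.close()


def paste_in_batches(paths, out_path, batch=256, sep="\t"):
    """Join many files by merging at most `batch` of them at a time."""
    if batch < 2:
        raise ValueError("batch must be at least 2")
    paths = list(paths)
    with tempfile.TemporaryDirectory(prefix="paste_") as workdir:
        round_no = 0
        # Keep merging neighbouring groups until one final join is enough.
        while len(paths) > batch:
            merged = []
            for i in range(0, len(paths), batch):
                part = os.path.join(workdir, "r%d_%08d.txt" % (round_no, i))
                paste_files(paths[i:i + batch], part, sep)
                merged.append(part)
            paths = merged
            round_no += 1
        paste_files(paths, out_path, sep)
```

Something like `paste_in_batches(sorted(os.listdir_paths), 'reversefile.txt')` would replace the paste call, where the sorted list of row files is whatever your script produced. The default batch size of 256 is chosen to stay under the usual per-process limit on open files. This only avoids the length limit; it does nothing about the cost of creating millions of tiny files in the first place.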

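Even as a Python diehard, I have to admit... (surely it's just that my skills aren't up to it).

So, since this is a matrix file, why not just use awk!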