Test data: a 1 GB file, 5193911 lines in total
Approaches compared: codecs.open(); open(); enumerate()
Background: codecs.open() defaults to mode 'rb'; the built-in open() defaults to mode 'r'
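Conclusions first:
1. With the same opener, mode='rb' is faster than mode='r'
2. With the same mode, codecs.open() and open() perform about the same; codecs.open() is usually slightly faster, but the gap is only tens of milliseconds
3. Passing encoding to codecs.open() slows it down considerably (readlines() takes about 2.5 times as long)
4. Iterating with enumerate() is faster than calling readlines() on either codecs.open() or open(); here too, 'rb' beats 'r'
5. Combining enumerate() with the encoding argument of codecs.open() is more than 20 times slower
Each measurement below lists the code, the line count it produced, and the elapsed time in seconds.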
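The post does not show how the timings were taken. Below is a minimal sketch of one way to collect such numbers, assuming a wall-clock timer wrapped around each counting function. The names `time_line_count`, `make_sample_file` and the `count_*` helpers are hypothetical, and the code is written to run on Python 3 (the snippets in this post use Python 2 print syntax).

```python
import codecs
import os
import tempfile
from timeit import default_timer


def make_sample_file(num_lines):
    """Write a small ASCII sample file and return its path (illustrative only)."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'wb') as f:
        for i in range(num_lines):
            f.write(('line %d\n' % i).encode('ascii'))
    return path


def time_line_count(label, count_lines, filename):
    """Call count_lines(filename) once; return (line count, elapsed seconds)."""
    start = default_timer()
    n = count_lines(filename)
    elapsed = default_timer() - start
    print("%-45s %10d lines  %.3f s" % (label, n, elapsed))
    return n, elapsed


def count_readlines_rb(filename):
    # Loads every line into a list, then takes its length.
    with open(filename, 'rb') as f:
        return len(f.readlines())


def count_enumerate_rb(filename):
    # Streams the file; count stays -1 for an empty file, so the result is 0.
    count = -1
    with open(filename, 'rb') as f:
        for count, _line in enumerate(f):
            pass
    return count + 1


def count_readlines_utf8(filename):
    # Decodes the whole file as UTF-8 before splitting it into lines.
    with codecs.open(filename, encoding='utf-8') as f:
        return len(f.readlines())


if __name__ == '__main__':
    sample = make_sample_file(100000)
    try:
        time_line_count("open(f, 'rb').readlines()", count_readlines_rb, sample)
        time_line_count("enumerate(open(f, 'rb'))", count_enumerate_rb, sample)
        time_line_count("codecs.open(f, encoding='utf-8').readlines()",
                        count_readlines_utf8, sample)
    finally:
        os.remove(sample)
```

A single run per variant, as here, is sensitive to disk caching, so repeating each measurement and comparing the fastest runs would give steadier numbers.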
codecs.open(filename, encoding='utf-8').readlines()
5193911
5.54100012779
codecs.open(filename, 'rb').readlines()
5193374
2.2539999485
open(filename, 'rb').readlines()
5193374
2.34699988365
open(filename, 'r').readlines()
5193374
3.16700005531
codecs.open(filename, 'r').readlines()
5193374
3.19799995422
for count, line in enumerate(open(filename)):
    pass
print count + 1
5193374
2.67100000381
for count, line in enumerate(open(filename, 'rb')):
    pass
print count + 1
5193374
1.78999996185
for count, line in enumerate(codecs.open(filename)):
    pass
print count + 1
5193374
1.69200015068
for count, line in enumerate(codecs.open(filename, 'rb')):  # codecs.open(filename, mode) defaults to mode='rb'
    pass
print count + 1
5193374
1.67999982834
for count, line in enumerate(codecs.open(filename, 'r')):
    pass
print count + 1
5193374
2.53399991989
for count, line in enumerate(codecs.open(filename, 'r', encoding='utf-8')):
    pass
print count + 1
5193911
41.236000061
for count, line in enumerate(codecs.open(filename, 'rb', encoding='utf-8')):
    pass
print count + 1
5193911
40.9140000343
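The snippets above report two different counts: 5193374 without an encoding and 5193911 with encoding='utf-8'. One possible contributor, not confirmed for this particular file, is that a decoding codecs reader splits lines with the rules of splitlines(), which treats characters such as U+2028 (LINE SEPARATOR) and form feed as line boundaries, while byte-oriented reading splits on '\n' only. A small sketch on made-up in-memory data (the sample text is an assumption, not taken from the test file):

```python
import codecs
import io

# Hypothetical sample: three '\n'-terminated lines, with a U+2028 LINE SEPARATOR
# and a form feed (U+000C) embedded inside two of them.
text = u"first\u2028half\nsecond\x0cpart\nthird\n"
data = text.encode('utf-8')

# Byte-oriented reading splits on b'\n' only.
byte_lines = io.BytesIO(data).readlines()

# The codecs stream reader decodes first, then splits with splitlines(),
# so the two extra boundary characters each start a new line.
reader = codecs.getreader('utf-8')(io.BytesIO(data))
decoded_lines = reader.readlines()

print(len(byte_lines))     # 3
print(len(decoded_lines))  # 5
```

Checking whether the test file actually contains such characters would confirm or rule out this explanation.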
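Final conclusion: without decoding to unicode, every method agrees on the line count, but the agreed value (5193374) does not match the file's 5193911 lines (the exact cause still needs to be verified). Only when reading as unicode does len() of readlines() give the correct result.
So, to count the lines of a file, if memory allows, it is best to use codecs.open(filename, encoding='utf-8').readlines()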
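When memory does not allow holding every line at once, the same decoded reading can be done one line at a time. In the measurements above, enumerate() over codecs.open(filename, 'r', encoding='utf-8') returned the same 5193911 but took about 41 seconds instead of about 5.5, so this trades time for memory. A minimal sketch, where `count_decoded_lines` is a hypothetical helper name and the code is written to run on Python 3:

```python
import codecs
import os
import tempfile


def count_decoded_lines(filename, encoding='utf-8'):
    """Count lines as the decoding codecs reader splits them, without building a list."""
    with codecs.open(filename, encoding=encoding) as f:
        return sum(1 for _line in f)


if __name__ == '__main__':
    # Tiny illustrative file; a real run would pass the path of the large file.
    fd, sample = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'wb') as out:
        out.write(u"alpha\nbeta\ngamma\n".encode('utf-8'))
    try:
        print(count_decoded_lines(sample))  # 3
    finally:
        os.remove(sample)
```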