Given the input:
x = ['foo bar', 'bar blah', 'black sheep']
I could do this to get the count of each word in the list of space-delimited strings:
from itertools import chain
from collections import Counter
c = Counter(chain(*map(str.split, x)))
Or I could simply iterate through and get:
c = Counter()
for sent in x:
    for word in sent.split():
        c[word] += 1
[out]:
Counter({'bar': 2, 'sheep': 1, 'blah': 1, 'foo': 1, 'black': 1})
The question is: which is more efficient if the input list of strings is extremely huge? Are there other ways to achieve the same Counter object?
Imagine it's a text file object that has billions of lines with 10-20 words each.
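For the file case I would presumably stream it rather than load everything, e.g. a sketch like this (corpus.txt is a made-up file name):

from collections import Counter

c = Counter()
with open('corpus.txt') as fh:  # hypothetical file
    for line in fh:             # file objects iterate lazily, one line at a time
        c.update(line.split())  # only one line's words are in memory at once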
Solution
The answer to your question is profiling.
Here are some profiling tools:
Print time.time() at strategic places, or use the Unix time command (see the timing sketch after this list).
heapy tracks all objects inside Python's memory (good for hunting memory leaks).
For long-running systems, use Dowser: it allows live-object introspection through a web-browser interface.
Examine Python bytecode with dis (a quick example follows the timing sketch below).
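As a concrete starting point for the timing option, here is a minimal timeit sketch comparing the two approaches from the question (the synthetic x and the repeat count below are assumptions, not measurements):

import timeit
from itertools import chain
from collections import Counter

x = ['foo bar', 'bar blah', 'black sheep'] * 100_000  # synthetic data, assumed

def chained():
    # itertools version from the question
    return Counter(chain(*map(str.split, x)))

def looped():
    # explicit double loop from the question
    c = Counter()
    for sent in x:
        for word in sent.split():
            c[word] += 1
    return c

# number=10 keeps the run short; increase it for more stable numbers
print('chain:', timeit.timeit(chained, number=10))
print('loop: ', timeit.timeit(looped, number=10))

Run it on data shaped like yours; the relative timings shift with input size and word length.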
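For the bytecode angle, dis can disassemble a source string directly, so you can eyeball what each variant compiles to:

import dis

# Compiles and disassembles the itertools one-liner; the names
# (Counter, chain, x) resolve at run time, so they need not be defined here.
dis.dis("Counter(chain(*map(str.split, x)))")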