Since English is supposed to be the mother tongue of programmers, I will try writing this post in English.
Iterators
Iteration is the process of walking through an iterable object; common iterable objects include dict, str, file, and so on.
Iterating consumes the contents of the iterator built from the iterable.
Functions like sum(), min(), list(), tuple() and the in operator all iterate internally; if you hand them an iterator (rather than a container), it is exhausted afterwards.
To iterate over a list manually, call iter(item_list) to get an iterator, then call next() on it repeatedly; the elements are returned one per call.
Any object that implements __iter__() and next() (__next__() in Python 3) is considered an iterator.
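As a quick sketch of that protocol (the list here is just an illustrative example; use next(it) instead of it.next() in Python 3):

>>> item_list = [1, 2, 3]
>>> it = iter(item_list)    # ask the list for an iterator
>>> it.next()
1
>>> it.next()
2
>>> it.next()
3
>>> it.next()               # exhausted: the next call raises StopIteration
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration
>>> sum(iter(item_list))    # consuming functions drive the same protocol
6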
# the in operator
for x in obj:
    # statements

# what happens under the hood
_iter = iter(obj)
while 1:
    try:
        x = _iter.next()
    except StopIteration:
        break
    # statements
Generator
A generator can be thought of as an easier-to-use iterator.
def countdown(n):
    print "Counting down from", n
    while n > 0:
        yield n
        n -= 1
# Note that calling countdown(10) below does not run the function body;
# nothing happens until next() is called. Each yield hands back the current n
# and suspends the function until the next call to next().
>>> x = countdown(10)
>>> x
<generator object at 0x58490>
>>> x.next()
Counting down from 10
10
>>> x.next()
9
...
>>> x.next()
1
# When the generator function returns, the next call to next() raises StopIteration.
>>> x.next()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
StopIteration
>>>
The Python 3.4 version is below:
def countdown(n):
    print("Counting down from", n)
    while n > 0:
        yield n
        n -= 1
    return 'exits'
>>> x = countdown(3)
>>> x
<generator object countdown at 0x101bd7288>
>>> next(x)
Counting down from 3
3
>>> next(x)
2
>>> next(x)
1
>>> next(x)
# In Python 3, a generator function can also return a value; when the generator is
# exhausted, that value is attached to the StopIteration exception raised by next().
# A return with a value inside a generator is a SyntaxError in Python 2.7.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration: exits
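If you actually want that return value in Python 3, one way (a sketch, not something the session above shows) is to catch StopIteration yourself; inside another generator, yield from countdown(2) would evaluate to the same value:

def countdown(n):
    while n > 0:
        yield n
        n -= 1
    return 'exits'

gen = countdown(2)
try:
    while True:
        print(next(gen))      # prints 2, then 1
except StopIteration as exc:
    print(exc.value)          # prints 'exits' -- the generator's return value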
Generators vs. Iterators
- A generator function is not itself an iterable object; calling it creates a fresh generator, which is an iterator.
- Iterating a generator is a one-time operation: once it is exhausted, you have to call the generator function again to get a new one (see the sketch below).
- Unlike generators, iterables such as list and dict can be iterated over any number of times.
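A small sketch of that one-shot behaviour (the generator and values are purely illustrative):

def squares(n):
    for i in range(n):
        yield i * i

g = squares(3)
print(list(g))     # [0, 1, 4]
print(list(g))     # [] -- the generator is already exhausted

nums = [0, 1, 4]
print(list(nums))  # [0, 1, 4]
print(list(nums))  # [0, 1, 4] -- a list can be walked over again and again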
Generator Expressions
Variable b below is a generator:
>>> a = [1,2,3,4]
>>> b = (2*x for x in a)
>>> b
<generator object at 0x58760>
>>> for i in b: print i,
...
2 4 6 8
When list a is very large, a generator expression saves a lot of memory, simply because it does not build another big list in memory the way the list comprehension below does; a rough way to measure the difference follows the session.
>>> a = [1,2,3,4]
>>> b = [2*x for x in a]
>>> b
[2, 4, 6, 8]
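A sketch of that measurement (exact numbers vary by Python version and platform) is to compare the sizes of the container objects themselves:

import sys

a = list(range(1000000))
big_list = [2 * x for x in a]    # materialises a million results up front
lazy_gen = (2 * x for x in a)    # stores only the iteration state

print(sys.getsizeof(big_list))   # on the order of megabytes
print(sys.getsizeof(lazy_gen))   # a small constant, roughly a hundred bytes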
A generator example
Suppose we have a 1 GB access.log from nginx, and the task is to sum up the sizes of all the responses.
Every line of access.log looks like this:
xx.xx.xx.xx - - [01/Jul/2014:10:06:06 +0800] "GET /share/ajax/?image_id=xxx&user_id=xxx HTTP/1.1" 200 72 "http://www.baidu.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
There are two solutions below: one implemented with generator expressions, the other with a plain for loop.
import cProfile, pstats, StringIO

def gene():
    with open('access.log', 'r') as f:
        lines = (line.split(' ', 11)[9] for line in f)
        sizes = (int(size) for size in lines if not size == '-')
        print "Generators Result: ", sum(sizes)

pr = cProfile.Profile()
pr.enable()
gene()
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()

def loop():
    size_sum = 0
    with open('access.log', 'r') as f:
        for line in f.readlines():
            size = line.split(' ', 11)[9]
            if not size == '-':
                size_sum += int(size)
    print "Forloop Result: ", size_sum

pr = cProfile.Profile()
pr.enable()
loop()
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()
Sh4n3@Macintosh:~% python ger.py
Generators Result: 13678125506
12481726 function calls in 41.487 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 41.487 41.487 ger.py:3(gene)
1 1.864 1.864 41.487 41.487 {sum}
4160297 17.209 0.000 39.623 0.000 ger.py:6(<genexpr>)
4160713 11.972 0.000 22.414 0.000 ger.py:5(<genexpr>)
4160712 10.442 0.000 10.442 0.000 {method 'split' of 'str' objects}
1 0.000 0.000 0.000 0.000 {open}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Forloop Result: 13678125506
4160716 function calls in 142.672 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 84.979 84.979 142.672 142.672 ger.py:9(loop)
1 47.609 47.609 47.609 47.609 {method 'readlines' of 'file' objects}
4160712 10.084 0.000 10.084 0.000 {method 'split' of 'str' objects}
1 0.000 0.000 0.000 0.000 {open}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
So the generator version is about 3.4x faster than the for-loop version (41.5 s vs. 142.7 s). A big part of the gap is that the for loop calls readlines(), which reads the entire file into a list before the loop even starts (47.6 s in the profile above), while the generator version streams the file line by line.
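A streaming variant of the for loop, iterating the file object directly instead of calling readlines(), would look like the sketch below (not profiled in the measurements above):

def loop_streaming():
    size_sum = 0
    with open('access.log', 'r') as f:
        for line in f:                      # the file object yields lines lazily
            size = line.split(' ', 11)[9]
            if not size == '-':
                size_sum += int(size)
    print "Streaming Forloop Result: ", size_sum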
Reference
- Generator Tricks for Systems Programmers
- For loop faster than generator expression?
- Transforming Code into Beautiful, Idiomatic Python