Notes on Generator 1

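This post introduces iterators and generators in Python: what they are, how to use them, and how they differ from built-in types such as lists and dicts. Examples show how generators allow memory-efficient data processing, including aggregating large amounts of data and improving performance during iteration.

Since English is supposedly a programmer's native language, I'll try writing this post in English.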


Iterators

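Iteration is the process of stepping through the elements of an iterable object. Common iterable objects include dicts, strings, and files.

Iterating consumes the iterator it runs on: once an element has been returned, the iterator will not return it again.

Functions such as sum(), min(), list(), and tuple(), as well as the in operator, consume an iterator in the same way. Once they have run through it, the iterator is exhausted and has nothing more to return.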

A list is iterable, but it is not itself an iterator. Calling iter(item_list) returns an iterator over the list, and each call to next() on that iterator returns the next element until none are left.
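Here is a small sketch of that behaviour, written for Python 3 (where next() is a built-in function). The name item_list and its values are just made up for illustration.

```python
item_list = [10, 20, 30]

# iter() returns an iterator over the list; the list itself is untouched.
it = iter(item_list)

first = next(it)    # first element
second = next(it)   # second element

# sum() consumes whatever is left in the iterator (only 30 here).
remaining = sum(it)

# The iterator is now exhausted, so list() finds nothing.
leftover = list(it)

# A further next() raises StopIteration.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True
```

Note that only the iterator is used up. The list can hand out a fresh iterator at any time with another iter() call.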

Any object that implements __iter__() and next() (spelled __next__() in Python 3) is an iterator.
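To make the protocol concrete, here is a minimal sketch of a hand-written iterator in Python 3. The class name Countdown is hypothetical and only serves as an illustration.

```python
class Countdown:
    """Hypothetical iterator that counts down from n to 1."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # An iterator returns itself from __iter__().
        return self

    def __next__(self):
        # Signal the end of iteration once the counter reaches zero.
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value


# Any code that iterates (for loops, list(), sum(), ...) can now use it.
values = list(Countdown(3))
```

In Python 2 the second method would be named next() instead of __next__().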

# The for statement

for x in obj:
    # statements

# What happens under the hood

_iter = iter(obj)
while True:
    try:
        x = _iter.next()  # Python 2 spelling; in Python 3 use next(_iter)
    except StopIteration:
        break
    # statements
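For reference, the same expansion can be written as runnable Python 3. This is only a sketch: the helper name manual_for is made up, and collecting items into a list stands in for the loop body.

```python
def manual_for(obj):
    """Drive the iterator protocol by hand, the way a for loop does."""
    collected = []
    _iter = iter(obj)
    while True:
        try:
            x = next(_iter)
        except StopIteration:
            # The iterator is exhausted, so leave the loop.
            break
        collected.append(x)  # stands in for the loop body
    return collected


letters = manual_for("abc")
```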

Generator

A generator is a more convenient way to write an iterator.

def countdown(n):
    print "Counting down from", n
    while n > 0:
        yield n
        n -= 1
# Note that calling countdown(10) below does not run the function body; execution starts only at the first next() call.
# yield produces n and then suspends the function until next() is called again.
>>> x = countdown(10)
>>> x
<generator object at 0x58490>
>>> x.next()
Counting down from 10
10
>>> x.next()
9
...
>>> x.next()
1
# Once the generator function returns, a further next() raises StopIteration.
>>> x.next()
Traceback (most recent call last):
 File "<stdin>", line 1, in ?
StopIteration
>>>

The same example in Python 3.4:

def countdown(n):
    print("Counting down from", n)
    while n>0:
        yield n
        n -= 1
    return 'exits'
>>> x = countdown(3)
>>> x
<generator object countdown at 0x101bd7288>
>>> next(x)
Counting down from 3
3
>>> next(x)
2
>>> next(x)
1
# In Python 3.3 and later, a generator function can also return a value. The value is attached to the StopIteration exception raised when the generator finishes, where it shows up like an error message.
# In Python 2.7, returning a value from a generator function is a SyntaxError.
>>> next(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration: exits
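If you actually want the returned value rather than the traceback, it is available as the value attribute of the StopIteration exception. A minimal Python 3 sketch, reusing the countdown example without the print call:

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1
    return 'exits'


gen = countdown(2)
yielded = [next(gen), next(gen)]

# The third next() finds no more yields; the return value travels
# inside the StopIteration exception.
try:
    next(gen)
    returned = None
except StopIteration as stop:
    returned = stop.value

# A for loop or list() swallows StopIteration, so the value is discarded.
as_list = list(countdown(2))
```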

Generators vs. Iterators

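  • A generator function is not itself an iterable object. Calling it returns a generator, which is an iterator.
  • A generator can be iterated only once. After a full iteration it is exhausted, and you have to call the generator function again to get a fresh one.
  • Unlike generators, containers such as list and dict can be iterated any number of times, because each iteration obtains a new iterator from them.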
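The difference is easy to see in a short Python 3 sketch. The function name squares is made up for this illustration.

```python
def squares(limit):
    """Hypothetical generator function yielding 0, 1, 4, ..."""
    for i in range(limit):
        yield i * i


# A list hands out a fresh iterator each time, so both passes see the data.
numbers = [0, 1, 4]
first_pass_list = list(numbers)
second_pass_list = list(numbers)

# A generator is a one-shot iterator: the second pass finds nothing.
gen = squares(3)
first_pass_gen = list(gen)
second_pass_gen = list(gen)

# Calling the generator function again gives a new generator.
fresh = list(squares(3))
```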

Generator Expressions

The variable b below is a generator.

>>> a = [1,2,3,4]
>>> b = (2*x for x in a)
>>> b
<generator object at 0x58760>
>>> for i in b: print i,
...
2 4 6 8

When the list a is very large, using a generator expression can save a lot of memory, because it does not build a second large list in memory. For comparison, the list comprehension below creates the whole result list at once.

>>> a = [1,2,3,4]
>>> b = [2*x for x in a]
>>> b
[2, 4, 6, 8]
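A rough way to see the difference is sys.getsizeof(), sketched here for Python 3. Keep in mind that getsizeof() reports only the size of the container object itself, not of the integers it refers to, so the real gap is even larger. The input size of one million is an arbitrary choice for illustration.

```python
import sys

a = range(1000000)

# The list comprehension materialises every result up front.
as_list = [2 * x for x in a]

# The generator expression only stores its current state.
as_gen = (2 * x for x in a)

list_bytes = sys.getsizeof(as_list)  # grows with the number of elements
gen_bytes = sys.getsizeof(as_gen)    # small and independent of len(a)

# The generator still produces the same values when consumed.
total = sum(as_gen)
```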

A generator example

Suppose we have a 1 GB access.log from nginx, and the task is to sum the sizes of all the responses (the byte count recorded on each line).

Every line of access.log looks like this:

xx.xx.xx.xx - - [01/Jul/2014:10:06:06 +0800] "GET /share/ajax/?image_id=xxx&user_id=xxx HTTP/1.1" 200 72 "http://www.baidu.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"

Here are two solutions: one implemented with generators, the other with a plain for loop.

import cProfile, pstats, StringIO

def gene():
	with open('access.log', 'r') as f:
		lines = (line.split(' ', 11)[9] for line in f)
		sizes = (int(size) for size in lines if not size == '-')
		print "Generators Result: ", sum(sizes)

pr = cProfile.Profile()
pr.enable()
gene()
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()


def loop():
	size_sum = 0
	with open('access.log', 'r') as f:
		for line in f.readlines():
			size = line.split(' ', 11)[9]
			if not size == '-':
				size_sum += int(size)
		print "Forloop Result: ", size_sum

pr = cProfile.Profile()
pr.enable()
loop()
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()


Sh4n3@Macintosh:~% python ger.py
Generators Result: 13678125506
         12481726 function calls in 41.487 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   41.487   41.487 ger.py:3(gene)
        1    1.864    1.864   41.487   41.487 {sum}
  4160297   17.209    0.000   39.623    0.000 ger.py:6(<genexpr>)
  4160713   11.972    0.000   22.414    0.000 ger.py:5(<genexpr>)
  4160712   10.442    0.000   10.442    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Forloop Result: 13678125506
         4160716 function calls in 142.672 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   84.979   84.979  142.672  142.672 ger.py:9(loop)
        1   47.609   47.609   47.609   47.609 {method 'readlines' of 'file' objects}
  4160712   10.084    0.000   10.084    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

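The result shows that the generator version is roughly 3.4 times faster than the for-loop version (41.5 seconds versus 142.7 seconds). Part of the gap comes from readlines(), which loads the entire file into memory before the loop starts and alone accounts for 47.6 seconds in the profile.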
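For readers on Python 3, the generator pipeline might look like the sketch below. It keeps the same field index as the code above (the tenth space-separated field), but the function name total_bytes and the three sample log lines are invented for illustration, and an in-memory io.StringIO stands in for the real file so the example runs on its own.

```python
import io

# Made-up sample lines in the same layout as the access.log line above.
SAMPLE_LOG = (
    '1.2.3.4 - - [01/Jul/2014:10:06:06 +0800] "GET /a HTTP/1.1" 200 72 "-" "UA"\n'
    '1.2.3.4 - - [01/Jul/2014:10:06:07 +0800] "GET /b HTTP/1.1" 304 - "-" "UA"\n'
    '1.2.3.4 - - [01/Jul/2014:10:06:08 +0800] "GET /c HTTP/1.1" 200 128 "-" "UA"\n'
)


def total_bytes(lines):
    """Sum the size field over any iterable of log lines, one line at a time."""
    # Stage 1: pull out the size column lazily.
    fields = (line.split(' ', 11)[9] for line in lines)
    # Stage 2: skip '-' placeholders and convert the rest to int.
    sizes = (int(size) for size in fields if size != '-')
    # Stage 3: sum() drives the whole pipeline.
    return sum(sizes)


total = total_bytes(io.StringIO(SAMPLE_LOG))
```

Because an open file is itself an iterator over its lines, the same function should also accept a real file object, for example inside a with open('access.log') block, without ever holding the whole file in memory.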

Reference

Reposted from: https://my.oschina.net/shinedev/blog/521321
