for line in sys.stdin


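This article looks at the different ways of reading from sys.stdin in Python and how they affect a program. Comparing for line in sys.stdin with sys.stdin.readline() shows how buffering affects when data is read, and a workaround is given.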


March 30th, 2012
绚丽也尘埃

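I struggled with this for a whole afternoon and evening. I wanted to add a program destination to syslog-ng, with the program written in Python. The file destination always received messages immediately, while the program destination showed no activity at all, and repeated tests gave the same result. Reading man python, I found that Python has a '-u' option, which the man page describes as follows.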

-u Force stdin, stdout and stderr to be totally unbuffered. On systems where it matters, also put stdin, stdout and stderr in binary mode. Note that there is internal buffering in xreadlines(), readlines() and file-object iterators ("for line in sys.stdin") which is not influenced by this option. To work around this, you will want to use "sys.stdin.readline()" inside a "while 1:" loop.

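With '-u' added, whatever the script printed to standard output reached the log file quickly, but standard input still showed no activity. I had not read the description carefully at the time: it already states that for line in sys.stdin is not affected by the option, and that is exactly how my code read from standard input. Later I came across a Stack Overflow answer saying that this style of iteration does not start executing until EOF appears, and that sys.stdin.readline() does not have the problem. I tested it, and it did work.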

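The two programs below illustrate the problem. Run each one in a terminal: with the first style of iteration, nothing is printed until you press Ctrl+D; with the second, each line is printed as soon as you type it and press Enter.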

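Program 1, iterating over sys.stdin (Python 2):

#!/usr/bin/env python
import sys

# The file-object iterator buffers internally, so nothing prints until Ctrl+D.
for line in sys.stdin:
    print line,

Program 2, calling sys.stdin.readline() (Python 2):

#!/usr/bin/env python
import sys

# readline() returns as soon as one line is available; '' signals EOF.
line = sys.stdin.readline()
while line:
    print line,
    line = sys.stdin.readline()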
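The readline() loop can also be kept in for-loop form by using the two-argument form of the built-in iter(). The following is only a minimal sketch: the names process_lines, handle, seen and handled are hypothetical, and an in-memory stream stands in for stdin so that the example terminates on its own. It avoids the Python 2 print statement so that it also runs under Python 3.

```python
import io


def process_lines(stream, handle):
    """Pass each line to handle() as soon as readline() returns it."""
    count = 0
    # iter(callable, sentinel) calls stream.readline() repeatedly and stops
    # when it returns '', which is what readline() gives at EOF.
    for line in iter(stream.readline, ''):
        handle(line)
        count += 1
    return count


# Demonstration on an in-memory stream standing in for stdin (an assumption
# made for illustration; a real program destination would read sys.stdin).
seen = []
handled = process_lines(io.StringIO(u"first\nsecond\n"), seen.append)
```

In a real script the call would presumably be process_lines(sys.stdin, some_handler), with the handler flushing its output if it writes to a pipe or a log.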

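Oddly, I have always iterated with for line in sys.stdin when writing Hadoop Streaming jobs and never had a problem, even though the examples in the official Hadoop Streaming documentation use the readline approach. My guess is that Hadoop's data is already stored locally, so the script is effectively being fed by cat, and that is why it works.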
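The guess above can be illustrated with a finite input: when the whole input is available and ends with EOF, both reading styles return the same lines, and only the timing differs. This is a sketch under that assumption. It uses an in-memory stream rather than Hadoop itself, and the function names and sample data are hypothetical.

```python
import io


def read_by_iteration(stream):
    """Collect lines with the file-object iterator."""
    return [line for line in stream]


def read_by_readline(stream):
    """Collect lines with explicit readline() calls until EOF ('')."""
    lines = []
    line = stream.readline()
    while line:
        lines.append(line)
        line = stream.readline()
    return lines


# Hypothetical tab-separated records, similar in shape to streaming input.
data = u"k1\tv1\nk2\tv2\n"
via_iteration = read_by_iteration(io.StringIO(data))
via_readline = read_by_readline(io.StringIO(data))
```

Both lists come out identical here, which fits the observation that batch-style input hides the buffering difference.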

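While searching online I also found a reported Python bug concerning for line in sys.stdin, Issue1633941: Ctrl+D has to be entered twice before the program exits. Ralph Corderoy pointed out that this is caused by the way Python 2.6 uses fread. Roughly, it only treats the input as having reached EOF when fread returns zero bytes; if fread returns fewer bytes than requested, the EOF is ignored even when it directly follows the data. The fix is to check stdin with feof inside the loop and exit immediately once the end has been reached.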
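The distinction between a short read and a zero-length read can be seen at the level of os.read. The sketch below is an illustration on a pipe, not the CPython source or the actual fix. The names read_until_eof and demo are hypothetical, and the pipe stands in for a terminal.

```python
import os


def read_until_eof(fd, chunk_size=8192):
    """Collect chunks from fd, stopping only on a zero-length read."""
    chunks = []
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:           # zero bytes: this is the EOF signal
            break
        chunks.append(chunk)    # fewer bytes than requested is still data
    return chunks


def demo():
    read_fd, write_fd = os.pipe()
    os.write(write_fd, b"one line\n")
    os.close(write_fd)          # closing the writer puts EOF right after the data
    try:
        return read_until_eof(read_fd)
    finally:
        os.close(read_fd)
```

Here the first read returns fewer bytes than requested and a second read is needed to see EOF. On a pipe that second read returns at once, whereas on a terminal it waits for more input, which matches the reported need for a second Ctrl+D.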