I'm running a simple app in PySpark:
from operator import add
f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
I want to view the RDD contents using the foreach action:
wc.foreach(print)
This throws a syntax error:
SyntaxError: invalid syntax
What am I missing?
This error occurs because print is a statement, not a function, in Python 2 (including 2.6), so it cannot be passed as an argument.
You can either define a helper function that performs the print, or use the __future__ module to treat print as a function:
>>> from operator import add
>>> f = sc.textFile("README.md")
>>> wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
>>> def g(x):
...     print x
...
>>> wc.foreach(g)
or
>>> from __future__ import print_function
>>> wc.foreach(print)
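If the same session might run on either Python 2 or Python 3, a helper that passes a single parenthesized string to print behaves the same on both. The following is only a sketch: format_pair and show are hypothetical names, and it assumes wc holds (word, count) pairs as in the question.

```python
def format_pair(pair):
    """Render one (word, count) pair as a line of text."""
    word, count = pair
    return "%s: %d" % (word, count)


def show(pair):
    # A single parenthesized string prints identically under the Python 2
    # print statement and the Python 3 print function.
    print(format_pair(pair))


# Assumed usage in the PySpark shell, where wc is the RDD from the question:
# wc.foreach(show)
```

As with g, the output is written wherever the function runs.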
However, I think it would be better to use collect() to bring the RDD contents back to the driver, because foreach executes on the worker nodes, so its output may not appear in your driver or shell (it probably will in local mode, but not when running on a cluster).
>>> for x in wc.collect():
...     print x
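Keep in mind that collect() materializes the entire RDD in driver memory, which may be a problem for large inputs. A minimal sketch of a bounded alternative follows; preview is a hypothetical helper name, and the only assumption is that the object passed in exposes the standard RDD take(n) method.

```python
def preview(rdd, n=10):
    """Print at most n elements of an RDD on the driver and return them."""
    rows = rdd.take(n)  # take(n) returns a list of up to n elements to the driver
    for row in rows:
        print(row)
    return rows


# Assumed usage in the PySpark shell, where wc is the RDD from the question:
# preview(wc, 5)
```

Because the loop runs on the driver, the output shows up in the shell in both local and cluster mode.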
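This article describes how to run a simple application in PySpark and view an RDD's contents with the foreach action. The syntax error can be resolved by defining a helper function or by importing print_function from __future__. Using collect() to bring the RDD contents back to the driver is recommended so that the results can be viewed in the shell.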