Today, while running a Spark Python script on Windows 10, I hit the following error when executing collect():
births.select(s[0]) \
      .distinct() \
      .rdd \
      .map(lambda row: row[0]) \
      .collect()
Changing the CSV file's encoding to UTF-8 didn't help, and forcing a default encoding in the code had no effect either:
import sys
reload(sys)  # Python 2 only: reload(sys) restores setdefaultencoding, which site.py removes
sys.setdefaultencoding('ISO-8859-1')  # changing this to utf8 didn't help either
It turned out I had made a silly mistake: the Python script's filename contained Chinese characters. Renaming the script to an ASCII-only name fixed the problem. (⊙ ︿ ⊙)
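In hindsight, a small guard at the top of the script would have surfaced the real problem immediately. This is a hypothetical sketch of my own, not part of PySpark; it simply refuses to run when the script's own path contains non-ASCII bytes (Python 2):

import os
import sys

# Hypothetical guard: PySpark embeds this path in the py4j call site
# (see traceback_utils.py in the traceback below), so a non-ASCII path
# blows up deep inside py4j instead of failing with a clear message.
script_path = os.path.abspath(sys.argv[0])
try:
    script_path.decode('ascii')
except UnicodeDecodeError:
    sys.exit('Rename this script to an ASCII-only path before running PySpark.')

For reference, here is the full console output: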
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "36-38(数据预处理_逻辑回归).py", line 83, in <module>
    print births.rdd.collect()
  File "e:\anaconda2\lib\site-packages\pyspark\rdd.py", line 815, in collect
    with SCCallSiteSync(self.context) as css:
  File "e:\anaconda2\lib\site-packages\pyspark\traceback_utils.py", line 72, in __enter__
    self._context._jsc.setCallSite(self._call_site)
  File "e:\anaconda2\lib\site-packages\py4j\java_gateway.py", line 1277, in __call__
    args_command, temp_args = self._build_args(*args)
  File "e:\anaconda2\lib\site-packages\py4j\java_gateway.py", line 1247, in _build_args
    [get_command_part(arg, self.pool) for arg in new_args])
  File "e:\anaconda2\lib\site-packages\py4j\protocol.py", line 292, in get_command_part
    command_part = STRING_TYPE + escape_new_line(parameter)
  File "e:\anaconda2\lib\site-packages\py4j\protocol.py", line 187, in escape_new_line
    return smart_decode(original).replace("\\", "\\\\").\
  File "e:\anaconda2\lib\site-packages\py4j\protocol.py", line 219, in smart_decode
    return unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 17: invalid continuation byte
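The last two frames explain why no amount of encoding configuration in my own code could help: traceback_utils.py passes the call site, which contains the script's filename, into py4j, and smart_decode in protocol.py decodes that string with a hard-coded "utf-8". On a Chinese-locale Windows machine, Python 2 holds the path as bytes in the ANSI codepage (typically GBK/cp936), and GBK multi-byte sequences are not valid UTF-8. This is consistent with the reported byte 0xca, which is the first GBK byte of 数, the first Chinese character in my filename. A minimal repro of the failure, assuming a GBK-encoded name (Python 2):

# -*- coding: utf-8 -*-
# How Python 2 on a Chinese-locale Windows represents the script's path:
# bytes in the ANSI codepage (assumed here to be cp936/GBK).
filename = u'数据预处理_逻辑回归.py'.encode('gbk')

try:
    # Effectively what py4j's smart_decode does (protocol.py, line 219):
    unicode(filename, 'utf-8')
except UnicodeDecodeError as e:
    print e  # 'utf8' codec can't decode byte 0xca ... invalid continuation byte

Once the script is renamed to an ASCII-only name, the call-site string decodes cleanly and collect() runs as expected.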