I've been doing some data mining recently. The upfront ETL stage is written in Pig, roughly 4283+ lines.
Word is it runs very slowly, so I decided to move it onto Tez. First, the Tez homepage:
https://tez.apache.org/
Tez-ifying it
# Note the position of -x tez
# Temporary setup, so the Thrift metastore has no high availability
cmd="${pig_hcatalog_cmd} -Dhive.metastore.uris=thrift://192.168.1.190:9083 -p input=${input_origin_dot_data_path} -p output=${output_temp_feature_path} -x tez ./dot_sample_feature_forFD_temp.pig"
Note: -x tez apparently has to come before the .pig script argument.
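As a hedged sketch (the helper function and strings are mine, not from the original script), here is a quick check that `-x tez` precedes the script path in an assembled command, since options that land after the script don't take effect as Pig options:

```shell
#!/usr/bin/env bash
# Hypothetical sanity check (names are illustrative): verify that
# "-x tez" appears before the .pig script in the assembled command string.
script="./dot_sample_feature_forFD_temp.pig"

check_order() {
  # $1: full command string
  case "$1" in
    *"-x tez"*"$script"*) echo "ok: tez execution mode will apply" ;;
    *)                    echo "warning: -x tez missing or after the script" ;;
  esac
}

check_order "pig -useHCatalog -x tez $script"    # ok
check_order "pig -useHCatalog $script -x tez"    # warning
```

The `case` pattern simply requires `-x tez` to occur somewhere before the script path; it's a cheap guard to drop into a wrapper script before launching the job.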
Pitfall: class not found
NoClassDefFoundError: org/apache/hadoop/mapred/MRVersion
A bit of digging showed this class belongs to MRv1, so, following the approach in this post:
https://stackoverflow.com/questions/48443337/noclassdeffounderror-org-apache-hadoop-mapred-mrversion-when-using-spark-testin
I downloaded the jar from:
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-core/2.6.0-mr1-cdh5.15.0/
and dropped it into Pig's lib directory (where exactly doesn't really matter, since it gets referenced by full path in pig.additional.jars).
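Before wiring the downloaded jar in, it's worth confirming it actually contains the class the `NoClassDefFoundError` complained about. A hedged sketch (the helper name is mine; `unzip` is assumed to be installed):

```shell
#!/usr/bin/env bash
# Hypothetical check: grep the jar's file listing for the missing class.
contains_class() {
  # $1: output of `unzip -l <jar>`, $2: class file path inside the jar
  printf '%s\n' "$1" | grep -q "$2"
}

jar="/opt/pig-0.17.0/lib/hadoop-core-2.6.0-mr1-cdh5.15.0.jar"
listing=$(unzip -l "$jar" 2>/dev/null || true)
if contains_class "$listing" "org/apache/hadoop/mapred/MRVersion.class"; then
  echo "MRVersion present in $jar"
else
  echo "MRVersion not found - check the downloaded jar"
fi
```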
Then:
# note the last entry in the list
pig_hcatalog_cmd="${pig_cmd_env} -useHCatalog -Dpig.additional.jars=$HCAT_HOME/share/hcatalog/hive-hcatalog-core*.jar:$HCAT_HOME/share/hcatalog/hive-hcatalog-pig-adapter*.jar:$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:$HIVE_HOME/lib/jdo-api-*.jar:$HIVE_HOME/lib/log4j-*.jar:/opt/pig-0.17.0/lib/hadoop-core-2.6.0-mr1-cdh5.15.0.jar "
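Since the colon-separated pig.additional.jars list above mixes several `$HIVE_HOME`/`$HCAT_HOME` globs, a typo in any of those variables only surfaces at job runtime. A hedged fail-fast sketch (the function name is mine; it assumes bash):

```shell
#!/usr/bin/env bash
# Hypothetical check: confirm every colon-separated entry (each may be a
# glob) expands to at least one existing file before launching the job.
check_jars() {
  local list="$1" entry ok=0
  IFS=':' read -r -a entries <<< "$list"
  for entry in "${entries[@]}"; do
    # Intentional unquoted expansion so the glob resolves.
    set -- $entry
    [ -e "$1" ] || { echo "missing: $entry"; ok=1; }
  done
  return $ok
}

# Usage with a subset of the list above:
check_jars "$HIVE_HOME/lib/hive-exec-*.jar:/opt/pig-0.17.0/lib/hadoop-core-2.6.0-mr1-cdh5.15.0.jar" \
  || echo "fix the paths before launching pig"
```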
And with that: it works perfectly!