Problem description:
Running a SparkSQL query fails with:
Caused by: org.apache.spark.SparkException: Cannot broadcast the table over 357913941 rows: 3942700700 rows.
The SQL was checked and contains no explicit join operation.
Solution:
The failing SQL:
select * from table_a where table_a.id in (select id from table_b);
Replace the filter condition in (select id from table_b) with a semi join:
select * from table_a semi join table_b on table_a.id = table_b.id;
Problem solved.
When filtering with in, the set returned by the subquery should not be too large; it has to be small enough to be broadcast.
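
The rewrite only works because a semi join returns the same rows as the in filter: each row of table_a appears at most once, however many times its id occurs in table_b. Below is a minimal sketch of that equivalence. It is not Spark code. It assumes Python's built-in sqlite3 module as a stand-in engine, a made-up name column and made-up sample rows. SQLite has no semi join keyword, so the semi join is written as exists, which has the same semantics for in (this does not carry over to not in when nulls are involved).

```python
import sqlite3


def run_demo():
    """Compare IN-subquery, semi-join (EXISTS) and inner-join results."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data; only the table names come from the note.
    conn.executescript("""
        CREATE TABLE table_a (id INTEGER, name TEXT);
        CREATE TABLE table_b (id INTEGER);
        INSERT INTO table_a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
        -- id 1 appears twice in table_b on purpose
        INSERT INTO table_b VALUES (1), (1), (3);
    """)

    # Original form: filter with an IN subquery.
    in_rows = conn.execute(
        "SELECT * FROM table_a WHERE id IN (SELECT id FROM table_b) ORDER BY id"
    ).fetchall()

    # Semi-join semantics, expressed with EXISTS for SQLite.
    semi_rows = conn.execute(
        "SELECT * FROM table_a a WHERE EXISTS "
        "(SELECT 1 FROM table_b b WHERE b.id = a.id) ORDER BY a.id"
    ).fetchall()

    # A plain inner join is NOT equivalent: duplicates in table_b
    # duplicate the matching table_a rows.
    inner_rows = conn.execute(
        "SELECT a.* FROM table_a a JOIN table_b b ON a.id = b.id ORDER BY a.id"
    ).fetchall()

    conn.close()
    return in_rows, semi_rows, inner_rows


if __name__ == "__main__":
    in_rows, semi_rows, inner_rows = run_demo()
    print("in:        ", in_rows)
    print("semi join: ", semi_rows)
    print("inner join:", inner_rows)
```

The in and semi join results match, while the inner join repeats the row with id 1. So when replacing in, use a semi join and not an ordinary join.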