背景
参考饿了么经验:https://zhuanlan.zhihu.com/p/28574213
饿了么经验中谈到:“hive.exec.orc.split.strategy为ETL”,但是这样可能导致spark thriftserver的内存压力很大,面对大作业会导致full gc从而进程卡死或退出。
原因
先看看split的strategy类别,它有BI,ETL和HYBRID三种,默认是HYBRID
long avgFileSize = totalFileSize / numFiles;
switch(context.splitStrategyKind) {
case BI:
// BI strategy requested through config
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
case ETL:
// ETL strategy requested through config
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
default:
// HYBRID strategy
if (avgFileSize > context.maxSize) {
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
} else {