Impala 4.0 removed direct support for Impala-lzo. Below is the note on the removal from the "Impala 4 Breaking Changes" email, together with the related JIRA: IMPALA-9709 tracks the removal of Impala-lzo.
Remove support for Impala-lzo:
Impala-lzo provides code to allow Impala to read the LZO compressed tables.
LZO is GPL licensed, which is why this support is not included directly.
The Impala-lzo code interacts with internal Impala code at a level that is
error prone and intricate. Given the low adoption of LZO and the other
compression options available, Impala plans to remove Impala-lzo support
along with the low level interface it used.
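Because a large share of the raw data in our data warehouse is stored as LZO and converting it would be costly, this article tries installing LZO support on Impala 4.0.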
Analysis
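According to the JIRA, Impala-lzo has so far only been removed from the development environment. The plugin infrastructure has not been removed, so some LZO support code remains. Impala-lzo can still be loaded as a plugin and provides the same functionality as before.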
The enabled_hdfs_text_scanner_plugins parameter now defaults to an empty string; it was previously LZO.
// LZO is no longer supported, so there are no plugins enabled by default. This is
// likely to be removed.
DEFINE_string(enabled_hdfs_text_scanner_plugins, "", "(Advanced) whitelist of HDFS "
"text scanner plugins that Impala will try to dynamically load. Must be a "
"comma-separated list of upper-case compression codec names. Each plugin implements "
"support for decompression and hands off the decompressed bytes to Impala's builtin "
"text parser for further processing (e.g. parsing delimited text).");
// Plugin library name template; the LZO plugin should be libimpalalzo.so
static const string LIB_IMPALA_TEMPLATE = "libimpala$0.so";
From this we can see that supporting LZO queries requires two things: setting the enabled_hdfs_text_scanner_plugins option and providing the libimpalalzo.so library.
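The relationship between the flag value and the library name can be sketched in shell. This is only an illustration, not Impala code: plugin_lib_name and plugin_libs_for_flag are hypothetical helpers, and the lower-casing step is an assumption inferred from the fact that the flag takes upper-case codec names (LZO) while the library is named libimpalalzo.so.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Impala): fills the "libimpala$0.so"
# template with a codec name.
plugin_lib_name() {
  local codec="$1"
  local lower
  # Assumption: the upper-case codec name is lower-cased before substitution.
  lower="$(printf '%s' "$codec" | tr '[:upper:]' '[:lower:]')"
  printf 'libimpala%s.so\n' "$lower"
}

# Hypothetical helper: lists the library expected for each codec in a
# comma-separated whitelist such as the value of
# enabled_hdfs_text_scanner_plugins.
plugin_libs_for_flag() {
  local flag_value="$1"
  local codec
  local IFS=','
  for codec in $flag_value; do
    if [ -n "$codec" ]; then
      plugin_lib_name "$codec"
    fi
  done
  return 0
}

# With the whitelist set to LZO, the library to provide is libimpalalzo.so.
plugin_libs_for_flag "LZO"
```

With the current default (an empty string), the loop produces nothing, which matches the fact that no plugin is enabled by default.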
Environment
OS: CentOS 7
Impala: https://github.com/apache/impala/tree/branch-4.0.0
Impala-lzo: https://github.com/chufucun/impala-lzo (note: this fork fixes the errors that occur when compiling against Impala 4.0)
CM & CDH: 6.3
Building Impala-lzo
To build the impala-lzo library, first set up the Impala build environment, because impala-lzo depends on it.
- Building Impala without Test Data (for testing Impala)
# The --depth 1 option can be used to speed up the clone.
git clone -b branch-4.0.0 https://gitbox.apache.org/repos/asf/impala.git ~/Impala
cd ~/Impala
export IMPALA_HOME="$(pwd)"
./bin/bootstrap_system.sh
source ./bin/impala-config.sh
# Format the test cluster and start Impala and dependent servic