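Currently, neither the substr nor the substring function in Impala handles Chinese (multi-byte UTF-8) text correctly, so character-aware extraction has to be implemented as a UDF. The UDF should behave exactly as substr does on English text, for example: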
SUBSTR("abcde",3)=cde
SUBSTR("abcde",-2)=de
SUBSTR("abcde",3,2)=cd
SUBSTR("abcde",-4,2)=bc
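The position rule behind these four examples can be sketched in plain C++. This is only an illustration on ASCII text: `resolve_start` and `check_examples` are hypothetical names, not part of Impala, and treating position 0 or an out-of-range position as "no match" is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Impala): converts a SUBSTR-style position
// (1-based, or negative to count from the end) into a 0-based offset.
// Returns -1 when the position is 0 or falls outside the string (assumed behavior).
long resolve_start(long length, long pos) {
    if (pos > 0 && pos <= length) return pos - 1;        // e.g. 3 -> offset 2
    if (pos < 0 && -pos <= length) return length + pos;  // e.g. -2 on length 5 -> offset 3
    return -1;
}

// Re-checks the four examples above using std::string::substr on ASCII input.
void check_examples() {
    const std::string s = "abcde";
    const long n = static_cast<long>(s.size());
    assert(s.substr(resolve_start(n, 3)) == "cde");
    assert(s.substr(resolve_start(n, -2)) == "de");
    assert(s.substr(resolve_start(n, 3), 2) == "cd");
    assert(s.substr(resolve_start(n, -4), 2) == "bc");
}
```

For Chinese text the same rule applies, except that positions and lengths must be counted in characters rather than bytes.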
Impala UDFs can be written in either C++ or Java, but C++ is generally preferred for efficiency (see
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/search_prepare_install_search.html and
http://blog.youkuaiyun.com/yu616568/article/details/52746332).
Both the two-argument and the three-argument forms must be supported, so the UDF needs two overloaded functions.
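As a rough sketch of how the two overloads can relate to each other, the following plain C++ version works on std::string rather than on the Impala UDF types. The name `utf8_substr` and the helper `code_point_offsets` are hypothetical, the input is assumed to be valid UTF-8, and returning an empty string for position 0, an out-of-range position, or a non-positive length is an assumption rather than confirmed Impala behavior.

```cpp
#include <string>
#include <vector>

// Byte offsets at which each UTF-8 character starts, followed by s.size()
// as an end sentinel. Assumes valid UTF-8, where continuation bytes look
// like 10xxxxxx and every other byte starts a new character.
static std::vector<size_t> code_point_offsets(const std::string& s) {
    std::vector<size_t> offsets;
    for (size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) offsets.push_back(i);
    }
    offsets.push_back(s.size());
    return offsets;
}

// Three-argument form: start position and length, both counted in characters.
std::string utf8_substr(const std::string& s, long pos, long len) {
    const std::vector<size_t> off = code_point_offsets(s);
    const long n = static_cast<long>(off.size()) - 1;  // number of characters
    if (pos == 0 || pos > n || pos < -n || len <= 0) return "";
    const long start = pos > 0 ? pos - 1 : n + pos;      // 0-based character index
    const long end = (len > n - start) ? n : start + len;  // clamp to the end
    return s.substr(off[start], off[end] - off[start]);
}

// Two-argument form: everything from the start position to the end.
// It delegates to the three-argument form with a length that always suffices.
std::string utf8_substr(const std::string& s, long pos) {
    const long n = static_cast<long>(code_point_offsets(s).size()) - 1;
    return utf8_substr(s, pos, n);
}
```

Having the two-argument overload delegate to the three-argument one keeps the position handling in a single place, so both forms stay consistent.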
Steps:
1. Download the impala-udf-devel package:
> git clone https://github.com/laserson/impala-udf-devel.git
> cd impala-udf-devel/
> cmake .
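2. In the impala-udf-devel directory, create and edit two files, udf-substr.cc and udf-substr.h. As a starting point, you can copy the udf.cc and udf.h files from the udf subdirectory into the parent directory. The contents are as follows: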
udf-substr.cc
#include "udf-substr.h"
#include <string>
#include <cmath>
using namespace std;
const unsigned char kFirstBitMask = 128; // 10000000
const unsigned char kSecondBitMask = 64; // 01000000
const unsigned char kThirdBitMask = 32; // 00100000
const unsigned char kFourthBitMask = 16; // 00010000
const unsigned char kFifthBitMask = 8; // 00001000
int utf8_char_len(char firstByte)
{
std::string::difference_type offset = 1;
if(firstByte & kFirstBitMask) // This means the first byte has a value greater than 127, and so is beyond the ASCII range.
{
if(firstByte & kThirdBitMask) // For a valid lead byte this means a value of at least 224 (0xE0), so it must be at least a three-octet code point.