Implementing Chinese Substring Extraction with an Impala UDF


License: CC 4.0 BY-SA

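This article describes how to implement an Impala UDF (user-defined function) in C++ that extracts substrings from Chinese text, making up for the lack of Chinese support in Impala's built-in substr function. The steps are: download the impala-udf-devel package, write udf-substr.cc and udf-substr.h with the overloaded functions, compile them into the libsubstr_udf.so library, upload it to HDFS, and register the function in Impala for testing.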

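  Impala's substr and substring functions currently do not handle Chinese text correctly: they index bytes rather than characters, so a multi-byte UTF-8 character can be cut in the middle. Chinese substring extraction therefore has to be implemented as a UDF. The UDF should behave exactly as substr does for English text, for example: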

 
 SUBSTR("abcde",3)=cde  

 
 SUBSTR("abcde",-2)=de  

 
 SUBSTR("abcde",3,2)=cd  

 
 SUBSTR("abcde",-4,2)=bc 
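To make the byte-versus-character problem concrete, here is a minimal sketch in plain C++ (not Impala code). The helper names `count_code_points` and `is_char_boundary` are hypothetical and introduced only for illustration, and the input is assumed to be valid UTF-8. Each Chinese character in this example occupies three bytes, so a byte offset of 1 or 2 lands inside a character.

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumption: the input is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // an ASCII byte or a lead byte starts a new character
    }
    return n;
}

// Returns true if the byte offset falls on the start of a character
// (or exactly at the end of the string).
bool is_char_boundary(const std::string& s, std::size_t offset) {
    if (offset >= s.size()) return offset == s.size();
    return (static_cast<unsigned char>(s[offset]) & 0xC0) != 0x80;
}
```

For the string "中文abc", `size()` reports 9 bytes while `count_code_points` reports 5 characters, and a byte-based cut at offset 2 fails the `is_char_boundary` check. That is the situation a byte-oriented substr runs into with Chinese text.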

  Impala UDFs can be written in C++ or Java, but C++ is generally preferred for performance (see
 https://www.cloudera.com/documentation/enterprise/5-5-x/topics/search_prepare_install_search.html and
 http://blog.youkuaiyun.com/yu616568/article/details/52746332).

  Both the two-argument and the three-argument forms must be supported, so the UDF needs two overloaded functions.
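The two overloads can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not the article's implementation: it uses `std::string` and `long` as stand-ins for Impala's UDF argument types, the names `substr_utf8` and `char_offsets` are hypothetical, the input is assumed to be valid UTF-8, and a position of 0 is assumed to yield an empty string.

```cpp
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Byte offset at which each UTF-8 character starts, plus one entry for the end.
// Assumption: the input is valid UTF-8.
static std::vector<std::size_t> char_offsets(const std::string& s) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a character.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) offsets.push_back(i);
    }
    offsets.push_back(s.size());
    return offsets;
}

// Three-argument form: pos is 1-based, a negative pos counts from the end,
// and len is measured in characters rather than bytes.
std::string substr_utf8(const std::string& s, long pos, long len) {
    const std::vector<std::size_t> offsets = char_offsets(s);
    const long n = static_cast<long>(offsets.size()) - 1;  // number of characters
    if (pos == 0 || len <= 0) return "";                   // assumed edge-case behavior
    const long start = pos > 0 ? pos - 1 : n + pos;        // convert to a 0-based index
    if (start < 0 || start >= n) return "";
    const long end = (len > n - start) ? n : start + len;  // clamp to the end of the string
    return s.substr(offsets[start], offsets[end] - offsets[start]);
}

// Two-argument form: everything from pos to the end; delegates to the form above.
std::string substr_utf8(const std::string& s, long pos) {
    return substr_utf8(s, pos, std::numeric_limits<long>::max());
}
```

With this sketch, `substr_utf8("abcde", -4, 2)` gives `bc`, matching the SUBSTR examples, and `substr_utf8("中文截取", 2, 2)` gives `文截`. Having the two-argument overload delegate to the three-argument one keeps the position handling in a single place.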

  Steps:

  1. Download the impala-udf-devel package:

 
 > git clone https://github.com/laserson/impala-udf-devel.git
 > cd impala-udf-devel/
 > cmake .

 
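 2. In the impala-udf-devel directory, edit the two files udf-substr.cc and udf-substr.h. As a starting point, you can copy udf.cc and udf.h from the udf directory into the parent directory. The contents are as follows: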

 
 udf-substr.cc 

 
 #include "udf-substr.h"
 #include <string>
 #include <cmath>

 using namespace std;

 
 const unsigned char kFirstBitMask = 128; // 10000000
 const unsigned char kSecondBitMask = 64; // 01000000
 const unsigned char kThirdBitMask = 32; // 00100000
 const unsigned char kFourthBitMask = 16; // 00010000
 const unsigned char kFifthBitMask = 8; // 00001000

 
 int utf8_char_len(char firstByte)
 {
   std::string::difference_type offset = 1;
   if (firstByte & kFirstBitMask) // The high bit is set, so the byte's value is greater than 127 and lies beyond the ASCII range.
   {
     if (firstByte & kThirdBitMask) // For a lead byte, this means the value is at least 224 (0xE0), so it starts a code point of at least three octets.