Pitfalls when importing Hive data into Elasticsearch

License: CC 4.0 BY-SA

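This post describes how to resolve the illegal_argument_exception errors raised by the pinyin and IK analyzers in Elasticsearch: patching the pinyin analyzer source and rebuilding the plugin, and temporarily emptying the IK custom dictionary.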

Environment:

CDH 5.16.2 (Hive 1.1.0), Elasticsearch 6.7.2

1. The pinyin analyzer

org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=0,endOffset=10,lastStartOffset=9 for field 'enterprise_name.pinyin' (the field reported here is the one that uses the pinyin analyzer)
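The message spells out three conditions on token offsets. As a minimal sketch of that rule (the function name and the sample token list are hypothetical and are not part of Elasticsearch or the plugin), the check can be written in Python and applied to the values from the error:

```python
def check_offsets(tokens):
    """Return the index of the first token that breaks the offset rules, or None.

    tokens: iterable of (start_offset, end_offset) pairs in emission order.
    """
    last_start = 0
    for i, (start, end) in enumerate(tokens):
        # The three conditions named in the error message.
        if start < 0 or end < start or start < last_start:
            return i
        last_start = start
    return None


# Hypothetical token stream: a token starting at offset 9 is followed by one
# that jumps back to offset 0 (startOffset=0, endOffset=10, lastStartOffset=9).
print(check_offsets([(0, 2), (9, 10), (0, 10)]))  # prints 2
```

In other words, the analyzer emitted a token whose start offset is smaller than that of the token before it, and the document is rejected for that reason.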

After a lot of searching online, I did find a reliable fix:

https://github.com/medcl/elasticsearch-analysis-pinyin/pull/206/commits/7cbc3d8926c8549b1049b90e90fce415097990be

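Following that commit, I patched the pinyin analyzer source and rebuilt it with Maven. I renamed the resulting elasticsearch-analysis-pinyin-6.3.0.jar to elasticsearch-analysis-pinyin-6.7.2.jar, opened the pinyin plugin zip, and replaced the jar inside it with the newly built one. On the production cluster I then removed the old pinyin plugin, installed the new zip, and restarted. After that the error was gone.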
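The jar swap inside the plugin zip can also be scripted. The following is only a sketch using Python's standard zipfile module; the function name, the file paths in the usage comment, and the assumption that the zip contains exactly one jar with the old name are all hypothetical:

```python
import zipfile


def replace_jar_in_zip(src_zip, dst_zip, old_jar_name, new_jar_path, new_jar_name):
    """Copy src_zip to dst_zip, swapping one jar entry for a rebuilt jar."""
    with zipfile.ZipFile(src_zip) as src, \
         zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            base = info.filename.rsplit("/", 1)[-1]
            if base == old_jar_name:
                # Keep the entry's directory, replace its name and content.
                prefix = info.filename[: len(info.filename) - len(base)]
                dst.write(new_jar_path, prefix + new_jar_name)
            else:
                # Every other entry is copied unchanged.
                dst.writestr(info, src.read(info.filename))


# Hypothetical usage (paths are placeholders):
# replace_jar_in_zip(
#     "elasticsearch-analysis-pinyin-6.7.2.zip",
#     "elasticsearch-analysis-pinyin-6.7.2-patched.zip",
#     "elasticsearch-analysis-pinyin-6.7.2.jar",
#     "target/elasticsearch-analysis-pinyin-6.3.0.jar",
#     "elasticsearch-analysis-pinyin-6.7.2.jar",
# )
```

The patched zip would then be installed on each node with the elasticsearch-plugin tool, after removing the old plugin, as described above.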

2. The IK analyzer

org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=6,endOffset=7,lastStartOffset=7 for field 'enterprise_name' (this time the error names the field itself, which I analyze with ik_max_word)

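First empty the IK custom dictionary extra_new_word.dic (move its entries somewhere else), then run the import as usual. Once the data has been imported, move the dictionary entries back and everything works.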
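To find out which text triggers the problem before touching the dictionary, one option is to run a sample value through the _analyze API and look for tokens whose start offset goes backwards. This is a sketch only: the helper names are hypothetical, the URL and sample text in the usage comment are placeholders, and it assumes _analyze returns the same tokens the indexer would see.

```python
import json
import urllib.request


def analyze(es_url, index, field, text):
    """Call the _analyze API using the analyzer configured for `field`."""
    body = json.dumps({"field": field, "text": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{es_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def backward_tokens(response):
    """Return the tokens that start before an earlier token's start offset."""
    bad, last_start = [], 0
    for tok in response["tokens"]:
        if tok["start_offset"] < last_start:
            bad.append(tok["token"])
        # Track the highest start offset seen so far.
        last_start = max(last_start, tok["start_offset"])
    return bad


# Hypothetical usage against the index from this post:
# resp = analyze("http://10.10.10.10:9200", "enterprise_credit_index",
#                "enterprise_name", "<a value that failed to import>")
# print(backward_tokens(resp))
```

Any token reported this way is a candidate for the dictionary entry that produces the out-of-order offsets.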

To import data from Hive into ES, Hive needs the elasticsearch-hadoop jar: add jar /root/work/elasticsearch-hadoop-6.7.2.jar;

Mapping excerpt:

PUT /enterprise_credit_index
{
"settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "pinyin_analyzer": {
            "tokenizer": "my_pinyin"
          }
        },
        "tokenizer": {
          "my_pinyin": {
            "type" : "pinyin",
            "keep_first_letter":true,
            "keep_separate_first_letter" : true,
            "keep_full_pinyin" : true,
            "keep_original" : true,
            "limit_first_letter_length" : 20,
            "lowercase" : true
          }
        }
      }
    }
},
"mappings": {
    "enterprise_credit_type": {
      "properties": {
        "enterprise_name": {
          "type": "text",
          "index": true,
          "analyzer": "ik_max_word",
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "term_vector": "with_offsets",
              "analyzer": "pinyin_analyzer",
              "boost": 10
            }
          }
        },
        "operators": {
          "type": "keyword",
          "index": true,
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "analyzer": "pinyin_analyzer",
              "boost": 10
            }
          }
        },
        "registered_money":{
          "type": "keyword",
          "ignore_above": 256
        },

...

Create the table in Hive:

create external table APP_json_result_external(
enterprise_name string ,
operators string ,

...

)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = '/enterprise_credit_index/enterprise_credit_type',
'es.mapping.id' = 'unified_social_credit_code',
'es.nodes'='10.10.10.10:9200,10.10.10.10:9201,10.10.10.10:9202',
'es.nodes.wan.only'='true',
'es.index.auto.create' = 'false',
'es.write.operation' = 'upsert',
'es.batch.write.refresh'='true',
'es.index.read.missing.as.empty'='false');

Import the data:

INSERT OVERWRITE TABLE APP_json_result_external SELECT XXX, XXX, XXX FROM tableName;

That's all, have fun!