Create a Collection
1. Generate the instance directory configuration files:
solrctl instancedir --generate $HOME/collection2
After generation, a number of configuration files appear under collection2/conf. Edit schema.xml as needed; the rules for modifying schema.xml are documented at http://wiki.apache.org/solr/SchemaXml . An example field definition is sketched below.
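For illustration only, adding a couple of indexed fields to schema.xml might look like the lines below; the field names and types are placeholders, not fields from the original setup:
<field name="log_time" type="tdate" indexed="true" stored="true" />
<field name="logcontent" type="text_general" indexed="true" stored="true" />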
2. Create the collection2 instance directory and upload its configuration to ZooKeeper:
solrctl instancedir --create collection2 $HOME/collection2
If it already exists, use --update to update it:
solrctl instancedir --update collection2 $HOME/collection2
The uploaded instance directories can be listed with:
solrctl instancedir --list
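To double-check what is actually stored in ZooKeeper, the instance directory can also be downloaded back to a local path; the target path below is just an example:
solrctl instancedir --get collection2 $HOME/collection2_from_zk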
3. Once the configuration is in ZooKeeper, the other nodes can download it from there. Next, create the collection:
solrctl collection --create collection2 -s 2 -r 1
Here -s sets the number of shards to 2 and -r sets the number of replicas to 1.
After following the steps above, the Solr collection has been created. It can be inspected at http://sit-hadoop1:8983/solr/#/~cloud , or queried through the Collections API as sketched below.
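The state of the new collection can also be checked from the command line with the standard Solr Collections API, assuming the Solr version in use supports the CLUSTERSTATUS action (Solr 4.8+); the host name follows the example above:
curl 'http://sit-hadoop1:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection2&wt=json'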
4. Modify a Collection
Once a collection has been created, if schema.xml needs to be changed to reconfigure the indexed fields, proceed as follows:
1. If you are changing existing fields in schema.xml and index data has already been written to Solr, the index must be emptied first; this can be done through the Solr API (see section 5 below).
2. If you are only adding new index fields to the existing schema.xml, step 1 can be skipped; simply run:
solrctl instancedir --update collection2 $HOME/collection2
solrctl collection --reload collection2
5. Empty the Index
Reference: https://www.cnblogs.com/HD/p/3981716.html
In the admin UI, select a shard (core) and open the Documents page.
The Request-Handler defaults to
/update
Set Document Type to xml.
In Document(s), enter:
<delete><id>1</id></delete>
<commit/>
Click Submit Document.
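The same can be done from the command line with the standard update handler; the payload below is a delete-by-query that removes every document in collection2 (use with care), with the host and collection names following the examples above:
curl 'http://sit-hadoop1:8983/solr/collection2/update?commit=true' -H 'Content-Type: text/xml' -d '<delete><query>*:*</query></delete>'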
Unload / reload a shard (core)
solrctl --solr http://xhadoop3:8983/solr core --unload solrtest_shard2_replica1
solrctl --solr http://sit-hadoop3:8983/solr core --reload hotel_shard1_replica1
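If the exact core name on a node is unknown, it can be listed with the standard CoreAdmin STATUS action, for example:
curl 'http://sit-hadoop3:8983/solr/admin/cores?action=STATUS&wt=json'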
======================== flume config file: begin =====================
# Please paste flume.conf here. Example:
# Sources, channels, and sinks are defined per agent name, in this case 'kafka2solr'.
# Names of the source, channel, and sink
kafka2solr.sources = source_from_kafka
kafka2solr.channels = mem_channel
kafka2solr.sinks = solrSink
#kafka2solr.sinks = sink1
# Configure the source type as Kafka
kafka2solr.sources.source_from_kafka.type = org.apache.flume.source.kafka.KafkaSource
kafka2solr.sources.source_from_kafka.channels = mem_channel
kafka2solr.sources.source_from_kafka.batchSize = 100
kafka2solr.sources.source_from_kafka.kafka.bootstrap.servers= 172.16.6.11:9092,172.16.6.12:9092,172.16.6.13:9092
kafka2solr.sources.source_from_kafka.kafka.topics = logaiStatus
kafka2solr.sources.source_from_kafka.kafka.consumer.group.id = flume_solr_caller
kafka2solr.sources.source_from_kafka.kafka.consumer.auto.offset.reset=latest
# Channel type is memory; in production this is usually a file channel (see the sketch after this config) or Kafka itself used as the channel
kafka2solr.channels.mem_channel.type = memory
kafka2solr.channels.mem_channel.keep-alive = 60
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
kafka2solr.channels.mem_channel.capacity = 1000
kafka2solr.channels.mem_channel.transactionCapacity = 1000
#kafka2solr.sinks.sink1.type = logger
#kafka2solr.sinks.sink1.channel = mem_channel
# Sink into Solr, using a morphline to transform the data
kafka2solr.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
kafka2solr.sinks.solrSink.channel = mem_channel
#kafka2solr.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
kafka2solr.sinks.solrSink.morphlineFile = morphlines.conf
kafka2solr.sinks.solrSink.morphlineId=morphline1
kafka2solr.sinks.solrSink.isIgnoringRecoverableExceptions=true
=============================== flume config file: end ==================
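As noted in the channel comment above, a memory channel loses buffered events when the agent restarts. A minimal sketch of the file-channel alternative is shown below; the checkpoint and data directories are placeholders, not paths from the original setup, and the channel name mem_channel is kept only so the rest of the configuration stays unchanged:
kafka2solr.channels.mem_channel.type = file
kafka2solr.channels.mem_channel.checkpointDir = /var/lib/flume-ng/checkpoint
kafka2solr.channels.mem_channel.dataDirs = /var/lib/flume-ng/data
kafka2solr.channels.mem_channel.capacity = 1000
kafka2solr.channels.mem_channel.transactionCapacity = 1000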
############ morphline conf: begin #################
The morphline below uses grok regular expressions to extract the individual fields of each log line.
# Some commonly used patterns: https://github.com/logstash-plugins/logstash-patterns-core/tree/master/patterns
SOLR_COLLECTION : "collection2"
SOLR_COLLECTION : ${?ENV_SOLR_COLLECTION}
SOLR_LOCATOR : {
# Name of solr collection
collection : ${SOLR_COLLECTION}
# ZooKeeper ensemble
# CDH-specific notation; the open-source version does not support it.
zkHost : "$ZK_HOST"
}
morphlines : [
{
# Name used to identify a morphline. E.g. used if there are multiple
# morphlines in a morphline config file
id : morphline1
# Import all morphline commands in these java packages and their
# subpackages. Other commands that may be present on the classpath are
# not visible to this morphline.
importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
commands : [
{
# Parse input attachment and emit a record for each input line
readLine {
charset : UTF-8
}
}
{
# Generate a UUID to use as the id field
generateUUID {
type : nonSecure
field : id
preserveExisting : false
}
}
# For debugging, lower the log level or replace logInfo with logWarn
{ logInfo { format : "output record with id: {}", args : ["@{}"] } }
{
grok {
# Consume the output record of the previous command and pipe another record downstream.
#
# A grok-dictionary is a config file that contains prefabricated
# regular expressions that can be referred to by name. grok patterns
# specify such a regex name, plus an optional output field name.
# The syntax is %{REGEX_NAME:OUTPUT_FIELD_NAME}
# The input line is expected in the "message" input field.
#2018-01-26 21:24:16,171 [INFO ] [sparkDriverActorSystem-akka.actor.default-dispatcher-3] - [akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3.apply$mcV$sp(Slf4jLogger.scala:74)] Remote daemon shut down; proceeding with flushing remote transports.
#dictionaryFiles : [src/test/resources/grok-dictionaries]
expressions : {
# message : """%{SYSLOGTIMESTAMP:timestamp} [%{LOGLEVEL:level}] [%{THREAD:thread}] - (?:\[%{POSINT:pid}\])?: %{GREEDYDATA:msg}"""
message : """(?<time>\d{4}\-\d{2}\-\d{2} \d{2}\:\d{2}:\d{2}\,\d{3}) \[(?<level>.+)\] \[(?<thread>.+)\] - \[(?<app_info>.+)\] (?<logcontent>.*)"""
}
#'(?<time>\d{4}\-\d{2}\-\d{2} \d{2}\:\d{2}:\d{2}\,\d{3}) (?<level>\[[INFOWARE ]{4,5}\]) (?<app_info>\[[A-Za-z\.\-0-9]+\]) (?<logcontent>.*)'
}
}
{ logWarn { format : "output record with grok: {}", args : ["@{}"] } }
# Consume the output record of the previous command, transform it and pipe the record downstream.
#
# This command deletes record fields that are unknown to Solr
# schema.xml. Recall that Solr throws an exception on any attempt to
# load a document that contains a field that isn't specified in schema.xml.
{
sanitizeUnknownSolrFields {
# Location from which to fetch Solr schema
solrLocator : ${SOLR_LOCATOR}
}
}
# log the record at INFO level to SLF4J
{ logInfo { format : "output record before solr: {}", args : ["@{}"] } }
# load the record into a Solr server or MapReduce Reducer
{
loadSolr {solrLocator : ${SOLR_LOCATOR}}
}
]
}
]
############ morphline conf: end #################
For something a bit more complex, adding an if condition, for example to split the stream and choose the target index by Kafka topic, is also a good option (a topic-based sketch follows the example below):
if {
conditions : [
{
not {
grok {
… some grok expressions go here
}
}
}
]
then : [
{ logDebug { format : "found no grok match: {}", args : ["@{}"] } }
{ dropRecord {} }
]
else : [
{ logDebug { format : "found grok match: {}", args : ["@{}"] } }
]
}
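A hedged sketch of the topic-based routing mentioned above: it assumes the Kafka topic name is available on the record as a field named topic (the Flume Kafka source sets it as an event header), and that collection_other is a hypothetical second collection reachable over the same ZooKeeper ensemble:
if {
  conditions : [
    # succeeds when the record's topic field matches the logaiStatus topic
    { contains { topic : [logaiStatus] } }
  ]
  then : [
    { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
  ]
  else : [
    # hypothetical second collection; reuses the same ZooKeeper ensemble
    { loadSolr { solrLocator : { collection : "collection_other", zkHost : "$ZK_HOST" } } }
  ]
}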
CM (Cloudera Manager) UI
Reference: http://blog.cloudera.com/blog/2014/04/how-to-process-data-using-morphlines-in-kite-sdk/