Create a Collection
1. Generate the instance directory configuration files:
solrctl instancedir --generate $HOME/collection2
After generation, a number of configuration files appear under collection2/conf. Edit schema.xml as needed; the rules for modifying schema.xml are documented at http://wiki.apache.org/solr/SchemaXml . An example field definition is sketched below.
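For illustration only, adding a couple of indexed fields to schema.xml might look like the lines below; the field names and types are placeholders, not fields from the original setup:
<field name="log_time" type="tdate" indexed="true" stored="true" />
<field name="logcontent" type="text_general" indexed="true" stored="true" />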
2. Create the collection2 instance directory and upload its configuration to ZooKeeper:
solrctl instancedir --create collection2 $HOME/collection2
If it already exists, use --update to update it:
solrctl instancedir --update collection2 $HOME/collection2
The uploaded instance directories can be listed with:
solrctl instancedir --list
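To double-check what is actually stored in ZooKeeper, the instance directory can also be downloaded back to a local path; the target path below is just an example:
solrctl instancedir --get collection2 $HOME/collection2_from_zk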
3. Once the configuration is in ZooKeeper, the other nodes can download it from there. Next, create the collection:
solrctl collection --create collection2 -s 2 -r 1
Here -s sets the number of shards to 2 and -r sets the number of replicas to 1.
After following the steps above, the Solr collection has been created. It can be inspected at http://sit-hadoop1:8983/solr/#/~cloud , or queried through the Collections API as sketched below.
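The state of the new collection can also be checked from the command line with the standard Solr Collections API, assuming the Solr version in use supports the CLUSTERSTATUS action (Solr 4.8+); the host name follows the example above:
curl 'http://sit-hadoop1:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection2&wt=json'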
4. Modify a Collection
Once a collection has been created, if schema.xml needs to be changed to reconfigure the indexed fields, proceed as follows:
1. If you are changing existing fields in schema.xml and index data has already been written to Solr, the index must be emptied first; this can be done through the Solr API (see section 5 below).
2. If you are only adding new index fields to the existing schema.xml, step 1 can be skipped; simply run:
solrctl instancedir --update collection2 $HOME/collection2
solrctl collection --reload collection2
5. Empty the Index
Reference: https://www.cnblogs.com/HD/p/3981716.html
In the admin UI, select a shard (core) and open the Documents page.
The Request-Handler defaults to
/update
Set Document Type to xml.
In Document(s), enter:
<delete><id>1</id></delete>
<commit/>
Click Submit Document.
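The same can be done from the command line with the standard update handler; the payload below is a delete-by-query that removes every document in collection2 (use with care), with the host and collection names following the examples above:
curl 'http://sit-hadoop1:8983/solr/collection2/update?commit=true' -H 'Content-Type: text/xml' -d '<delete><query>*:*</query></delete>'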
Unload / reload a shard (core)
solrctl --solr http://xhadoop3:8983/solr core --unload solrtest_shard2_replica1
solrctl --solr http://sit-hadoop3:8983/solr core --reload hotel_shard1_replica1
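If the exact core name on a node is unknown, it can be listed with the standard CoreAdmin STATUS action, for example:
curl 'http://sit-hadoop3:8983/solr/admin/cores?action=STATUS&wt=json'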
======================== flume config file: begin =====================
# Please paste flume.conf here. Example:
# Sources, channels, and sinks are defined per agent name, in this case 'kafka2solr'.
# Names of the source, channel, and sink
kafka2solr.sources = source_from_kafka
kafka2solr.channels = mem_channel
kafka2solr.sinks = solrSink
#kafka2solr.sinks = sink1
# Configure the source type as Kafka
kafka2solr.sources.source_from_kafka.type = org.apache.flume.source.kafka.KafkaSource
kafka2solr.sources.source_from_kafka.channels = mem_channel
kafka2solr.sources.source_from_kafka.batchSize = 100
kafka2solr.sources.source_from_kafka.kafka.bootstrap.servers= 172.16.6.11:9092,172.16.6.12:9092,172.16.6.13:9092
kafka2solr.sources.source_from_kafka.kafka.topics = logaiStatus
kafka2solr.sources.source_from_kafka.kafka.consumer.group.id = flume_solr_caller
kafka2solr.sources.source_from_kafka.kafka.consumer.auto.offset.reset=latest
# Channel type is memory; in production this is usually a file channel (see the sketch after this config) or Kafka itself used as the channel
kafka2solr.channels.mem_channel.type = memory
kafka2solr.channels.mem_channel.keep-alive = 60
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
kafka2solr.channels.mem_channel.capacity = 1000
kafka2solr.channels.mem_channel.transactionCapacity = 1000
#kafka2solr.sinks.sink1.type = logger
#kafka2solr.sinks.sink1.channel = mem_channel
# Sink into Solr, using a morphline to transform the data
kafka2solr.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
kafka2solr.sinks.solrSink.channel = mem_channel
#kafka2solr.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
kafka2solr.sinks.solrSink.morphlineFile = morphlines.conf
kafka2solr.sinks.solrSink.morphlineId=morphline1
kafka2solr.sinks.solrSink.isIgnoringRecoverableExceptions=true
=============================== flume config file: end ==================
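As noted in the channel comment above, a memory channel loses buffered events when the agent restarts. A minimal sketch of the file-channel alternative is shown below; the checkpoint and data directories are placeholders, not paths from the original setup, and the channel name mem_channel is kept only so the rest of the configuration stays unchanged:
kafka2solr.channels.mem_channel.type = file
kafka2solr.channels.mem_channel.checkpointDir = /var/lib/flume-ng/checkpoint
kafka2solr.channels.mem_channel.dataDirs = /var/lib/flume-ng/data
kafka2solr.channels.mem_channel.capacity = 1000
kafka2solr.channels.mem_channel.transactionCapacity = 1000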
############ morphline conf: begin #################
The morphline below uses grok regular expressions to extract the individual fields of each log line.
# Some commonly used patterns: https://github.com/logstash-plugins/logstash-patterns-core/tree/master/patterns
SOLR_COLLECTION : "collection2"
SOLR_COLLECTION : ${?ENV_SOLR_COLLECTION}
SOLR_LOCATOR : {
# Name of solr collection
collection : ${SOLR_COLLECTION}
# ZooKeeper ensemble
# CDH-specific notation; the open-source version does not support it.
zkHost : "$ZK_HOST"
}
morphlines : [
{
# Name used to identify a morphline. E.g. used if there are multiple
# morphlines in a morphline config file
id : morphline1
# Import all morphline commands in these java packages and their
# subpackages. Other commands that may be present on the classpath are
# not visible to this morphline.
importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
commands : [
{
# Parse input attachment and emit a record for each input line
readLine {
charset : UTF-8
}
}
{
# Generate a UUID to use as the id field
generateUUID {
type : nonSecure
field : id
preserveExisting : false
}
}
# For debugging, lower the log level or replace logInfo with logWarn
{ logInfo { format : "output record with id: {}", args : ["@{}"] } }
{
grok {
# Consume the output record of the previous command and pipe another record downstream.
#
# A grok-dictionary is a config file that contains prefabricated
# regular expressions that can be referred to by name. grok patterns
# specify such a regex name, plus an optional output field name.
# The syntax is %{REGEX_NAME:OUTPUT_FIELD_NAME}
# The input line is expected in the "message" input field.
#2018-01-26 21:24:16,171 [INFO ] [sparkDriverActorSystem-akka.actor.default-dispatcher-3] - [akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3.apply$mcV$sp(Slf4jLogger.scala:74)] Remote daemon shut down; proceeding with flushing remote transports.
#dictionaryFiles : [src/test/resources/grok-dictionaries]
expressions : {
# message : """%{SYSLOGTIMESTAMP:timestamp} [%{LOGLEVEL:level}] [%{THREAD:thread}] - (?:\[%{POSINT:pid}\])?: %{GREEDYDATA:msg}"""
message : """(?<time>\d{4}\-\d{2}\-\d{2} \d{2}\:\d{2}:\d{2}\,\d{3}) \[(?<level>.+)\] \[(?<thread>.+)\] - \[(?<app_info>.+)\] (?<logcontent>.*)"""
}
#'(?<time>\d{4}\-\d{2}\-\d{2} \d{2}\:\d{2}:\d{2}\,\d{3}) (?<level>\[[INFOWARE ]{4,5}\]) (?<app_info>\[[A-Za-z\.\-0-9]+\]) (?<logcontent>.*)'
}
}
{ logWarn { format : "output record with grok: {}", args : ["@{}"] } }
# Consume the output record of the previous command, transform it and pipe the record downstream.
#
# This command deletes record fields that are unknown to Solr
# schema.xml. Recall that Solr throws an exception on any attempt to
# load a document that contains a field that isn't specified in schema.xml.
{
sanitizeUnknownSolrFields {
# Location from which to fetch Solr schema
solrLocator : ${SOLR_LOCATOR}
}
}
# log the record at INFO level to SLF4J
{ logInfo { format : "output record before solr: {}", args : ["@{}"] } }
# load the record into a Solr server or MapReduce Reducer
{
loadSolr {solrLocator : ${SOLR_LOCATOR}}
}
]
}
]
############ morphline conf: end #################
For something a bit more complex, adding an if condition, for example to split the stream and choose the target index by Kafka topic, is also a good option (a topic-based sketch follows the example below):
if {
conditions : [
{
not {
grok {
… some grok expressions go here
}
}
}
]
then : [
{ logDebug { format : "found no grok match: {}", args : ["@{}"] } }
{ dropRecord {} }
]
else : [
{ logDebug { format : "found grok match: {}", args : ["@{}"] } }
]
}
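A hedged sketch of the topic-based routing mentioned above: it assumes the Kafka topic name is available on the record as a field named topic (the Flume Kafka source sets it as an event header), and that collection_other is a hypothetical second collection reachable over the same ZooKeeper ensemble:
if {
  conditions : [
    # succeeds when the record's topic field matches the logaiStatus topic
    { contains { topic : [logaiStatus] } }
  ]
  then : [
    { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
  ]
  else : [
    # hypothetical second collection; reuses the same ZooKeeper ensemble
    { loadSolr { solrLocator : { collection : "collection_other", zkHost : "$ZK_HOST" } } }
  ]
}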
CM (Cloudera Manager) UI
Reference: http://blog.cloudera.com/blog/2014/04/how-to-process-data-using-morphlines-in-kite-sdk/