Flume Notes

Flume uses the Event object as its data-transfer format; the Event is the most basic unit of internal data transport.
A Flume deployment consists of one or more agents, each of which is an independent daemon process (a JVM). An agent contains three kinds of components: source, channel, and sink. A source can fan out to multiple channels, and each sink is attached to exactly one channel (several sinks may drain the same channel); the channel is a transient staging store.
The maximum number of events a channel holds can be set via its capacity parameters.
Flume deployments usually choose the File Channel rather than the Memory Channel:
– Memory Channel: transactions are kept in memory; throughput is extremely high, but there is a risk of losing data (e.g. if the agent process dies).
– File Channel: a transactional implementation backed by local disk; guarantees that data is not lost (implemented with a WAL).
WAL: write-ahead logging.
The operations to be performed are first recorded in a log file, and only then is the data written; if the data write fails, the logged operations are replayed, which guarantees that no data is lost.
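For comparison with the memory channels used in the examples below, a minimal File Channel configuration looks like this (the checkpoint and data directory paths are illustrative, not from the original notes):

# Durable channel backed by local disk (WAL)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
a1.channels.c1.dataDirs = /opt/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000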
• A sink removes events from the channel and places them onto an external data store.
– For example, the Flume HDFS Sink writes the data to HDFS, or passes it to the source of the next Flume agent for further processing. Only after the sink has successfully taken an Event does it remove that Event from the channel. A sink must be attached to exactly one channel.
Different types of sinks:
1. Sinks that store events at their final destination: HDFS, HBase
2. Auto-consuming sinks: Null Sink
3. Sinks for communication between agents: Avro
Developing with Flume is mostly a matter of writing conf files.
The simplest possible conf file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source: an HTTP source listening on a local port
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.bind = localhost
a1.sources.r1.port = 9000
#a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# Stage events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
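With the agent running, the HTTP source can be tested by POSTing events to it. Assuming the default JSONHandler, which expects a JSON array of events, something like the following should make the logger sink print the event on the console:

curl -X POST -d '[{"headers":{},"body":"hello flume"}]' http://localhost:9000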
#A regex filter interceptor; interceptors are always defined on the source side
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Attach an interceptor to the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
# Filter with the interceptor: drop events whose body is all digits
a1.sources.r1.interceptors.i1.regex = ^[0-9]*$
a1.sources.r1.interceptors.i1.excludeEvents = true

# Describe the sink
#a1.sinks.k1.type = logger
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Adjust the NameNode host/port to match your cluster
a1.sinks.k1.hdfs.path = hdfs://master:9000/flume/events
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Write events to HDFS as plain text (instead of a SequenceFile)
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
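To exercise the interceptor, connect with netcat and send both kinds of lines; with excludeEvents = true, any line consisting only of digits is dropped before it reaches the channel:

nc localhost 44444
# typing "12345"   -> matched by ^[0-9]*$ and dropped
# typing "hello12" -> not an all-digit line, passes through to the HDFS sink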
#Chaining two Flume agents together
#Conf file for the first (upstream) agent:
#Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

#Describe/configure the source
a2.sources.r1.type = netcat
# Bind to the local machine
a2.sources.r1.bind = localhost
a2.sources.r1.port = 44444
a2.sources.r1.channels = c1

#Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.keep-alive = 10
a2.channels.c1.capacity = 100000
a2.channels.c1.transactionCapacity = 100000

#Describe the sink
# Forward events with an avro sink
a2.sinks.k1.type = avro
a2.sinks.k1.channel = c1
# Destination machine for the data: slave3
a2.sinks.k1.hostname = slave3
# Port on that machine
a2.sinks.k1.port = 44444

#Conf file for the second (downstream) agent
#Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
# Listen on slave3; must match the hostname targeted by the upstream avro sink
a1.sources.r1.bind = slave3
a1.sources.r1.port = 44444

#Describe the sink
# Display events in the terminal as logger output
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.keep-alive = 10
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 100000
#Start-up order: start the second (downstream) agent first, so that its avro source is already listening when the upstream avro sink tries to connect.

Command to start a Flume agent:

bin/flume-ng agent -c conf -f conf/push.conf -n a1 -Dflume.root.logger=INFO,console

Once both agents are up, check the log output of the one you started first for a message indicating that the connection succeeded.
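Note that -n must match the agent name defined inside the conf file. For the chained example above, assuming the upstream conf is saved as push.conf and the downstream conf as pull.conf (the latter filename is illustrative), the start commands would be:

# start the downstream agent (named a1) first
bin/flume-ng agent -c conf -f conf/pull.conf -n a1 -Dflume.root.logger=INFO,console
# then start the upstream agent (named a2)
bin/flume-ng agent -c conf -f conf/push.conf -n a2 -Dflume.root.logger=INFO,console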

#Receiving Flume's sink output with Kafka
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/Documents/code/python/flume_exec_test.txt

# Configure the Kafka sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# Address and port of the Kafka broker
a1.sinks.k1.brokerList = master:9092
# Kafka topic to write to
a1.sinks.k1.topic = topicName
# Serialization class
a1.sinks.k1.serializer.class = kafka.serializer.StringEncoder

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
#Giving ZooKeeper a single host:port behaves the same as listing several hosts, but if that one node's ZooKeeper process dies, Kafka's access to ZooKeeper is affected; listing every node avoids this single point of failure.
Steps for consuming Flume's output from Kafka
#Kafka requires ZooKeeper to be running first; cd into the ZooKeeper directory and run:
./bin/zkServer.sh start     # start this on each of the three nodes
#Then start a ZooKeeper client with:
./bin/zkCli.sh -server master:2181,slave1:2181,slave2:2181
#Start Kafka: cd into the Kafka directory and run:
./bin/kafka-server-start.sh config/server.properties
#Check whether the topic exists (see the topic commands below); create it if it does not, then continue
#Consume the data with a console consumer
./bin/kafka-console-consumer.sh --zookeeper master:2181,slave1:2181,slave2:2181 --topic test --from-beginning
#Start Flume with:
bin/flume-ng agent -c conf -f conf/push.conf -n a1 -Dflume.root.logger=INFO,console

#The data that arrives afterwards is displayed in the terminal running the consumer; the terminal running the Kafka broker does not show the data, only some execution logs.
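To verify the end-to-end flow, append a line to the file the exec source is tailing; it should appear in the consumer terminal:

echo "hello kafka via flume" >> /home/Documents/code/python/flume_exec_test.txt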
#Kafka topic management commands (these talk to ZooKeeper):
#Create a topic:
bin/kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 2 --topic test
#List topics
bin/kafka-topics.sh --list --zookeeper master:2181
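To inspect a topic's partition and replica assignment:

bin/kafka-topics.sh --describe --zookeeper master:2181 --topic test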
#Connecting Flume to Hive
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/Documents/code/python/flume_exec_test.txt
#a1.sources.r1.type=netcat
#a1.sources.r1.bind=master
#a1.sources.r1.port=44444

# hive sink
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://master:9083
a1.sinks.k1.hive.database = test
a1.sinks.k1.hive.table = flume_test
#a1.sinks.k1.hive.partition = eval_set
#a1.sinks.k1.hive.txnsPerBatchAsk = 2
#a1.sinks.k1.useLocalTimeStamp = false
#a1.sinks.k1.round = true
#a1.sinks.k1.roundValue = 10
#a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = ","
a1.sinks.k1.serializer.serdeSeparator = ','
a1.sinks.k1.serializer.fieldnames = order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
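The Hive sink can only write into a table that is bucketed, stored as ORC, and has transactions enabled. A sketch of a matching DDL for the table above follows; the column types are guesses from the field names, not taken from the original notes:

CREATE TABLE test.flume_test (
  order_id INT,
  user_id INT,
  eval_set STRING,
  order_number INT,
  order_dow INT,
  order_hour_of_day INT,
  days_since_prior_order FLOAT
)
CLUSTERED BY (order_id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');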