Completely original content.
1. Flume
Flume is mainly used for log collection.
Core configuration file:
agent002.sources = sources002
agent002.channels = channels002
agent002.sinks = sinks002
## define sources
agent002.sources.sources002.type = exec
agent002.sources.sources002.command = tail -F /log.input
## define channels
agent002.channels.channels002.type = memory
agent002.channels.channels002.capacity = 1000
agent002.channels.channels002.transactionCapacity = 1000
agent002.channels.channels002.byteCapacityBufferPercentage = 20
agent002.channels.channels002.byteCapacity = 8000
## define sinks
agent002.sinks.sinks002.type = org.apache.flume.sink.kafka.KafkaSink
agent002.sinks.sinks002.brokerList = 8.8.8.2:9093
agent002.sinks.sinks002.topic = topicTest
##relationship
agent002.sources.sources002.channels = channels002
agent002.sinks.sinks002.channel = channels002
Start command: /home/flume/bin/flume-ng agent -n agent002 -c /home/flume/conf -f /home/flume/conf/flume-kafka001.properties -Dflume.root.logger=DEBUG,console
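Besides echoing into the file by hand (see the test in the Kafka section below), you can drive the exec source from code. A minimal Scala sketch that appends a test line to /log.input, the file tailed by the tail -F command above (the object name and message text are just illustration):

import java.io.{FileWriter, PrintWriter}

object AppendTestLog {
  def main(args: Array[String]): Unit = {
    // open /log.input in append mode so that tail -F sees the new line
    val out = new PrintWriter(new FileWriter("/log.input", true))
    try {
      out.println("test line from scala " + System.currentTimeMillis())
    } finally {
      out.close()
    }
  }
}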
2. Kafka
1. Start Kafka: bin/kafka-server-start.sh config/server.properties
2. Create a topic: bin/kafka-topics.sh --create --zookeeper 8.8.8.2:2181 --replication-factor 1 --partitions 1 --topic topicTest
3. Consume messages (consumer): bin/kafka-console-consumer.sh --zookeeper 8.8.8.2:2181 --topic topicTest --from-beginning
Test Flume and Kafka:
echo "ddddddddddddddddddddd" >> /log.input
The Kafka console consumer should then print the message "ddddddddddddddddddddd".
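To rule out Flume when debugging, you can also write to the topic directly from code. A minimal Scala producer sketch, assuming the kafka-clients jar (0.8.2 or newer) is on the classpath; the broker address 8.8.8.2:9093 and topic topicTest come from the configuration above, the message text is made up:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProduceTestMessage {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // broker list and topic taken from the Flume Kafka sink configuration above
    props.put("bootstrap.servers", "8.8.8.2:9093")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("topicTest", "hello kafka"))
    producer.close()
  }
}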
3. Spark Scala WordCount: code and build
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._

object MyStreaming {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // 30-second batch interval
    val ssc = new StreamingContext(conf, Seconds(30))
    // broker list, matching the Flume Kafka sink configuration above
    val kafkaParams = Map("metadata.broker.list" -> "8.8.8.2:9093")
    // topics: Set[String]
    val topics = Set("topicTest")
    // direct (receiver-less) stream; map(_._2) keeps only the message value
    val kafkaStream = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
      .map(_._2)
    val wordCountDStream = kafkaStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCountDStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
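The code above uses the direct (receiver-less) Kafka API, which reads from the broker list without going through ZooKeeper. For comparison, spark-streaming-kafka 1.3 also provides a receiver-based stream; a minimal sketch, using the ZooKeeper address 8.8.8.2:2181 from the Kafka section and a made-up consumer group name:

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._

object MyStreamingReceiver {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("NetworkWordCountReceiver").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(30))
    // receiver-based stream: consumes through ZooKeeper instead of the broker list
    val zkQuorum = "8.8.8.2:2181"        // ZooKeeper address used when creating the topic
    val groupId = "wordcount-group"      // hypothetical consumer group name
    val topicMap = Map("topicTest" -> 1) // topic -> number of receiver threads
    val lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap).map(_._2)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}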
The code is written in Scala; because it uses the Kafka jars, it has to be packaged with sbt (Scala's counterpart to Maven in the Java world).
Configuring and using the sbt-assembly plugin
Below is my Scala project directory structure:
.
├── assembly.sbt
├── build.sbt
├── project
├── README.md
├── run-assembly.sh
├── run.sh
├── src
└── target
I am on sbt 0.13.8, so add the following to project/assembly.sbt (you have to create this file yourself):
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.0")
Next, configure the assembly build settings. Create an assembly.sbt in the project root (sbt loads every *.sbt file in the base directory, so it works alongside build.sbt there). Its contents are as follows:
name := "LogStash_2"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
"org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided",
"org.apache.spark" %% "spark-streaming-kafka" % "1.3.0" % "provided"
)
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case PathList("org", "apache", xs @ _*) => MergeStrategy.first
case PathList(ps @ _*) if ps.last endsWith "axiom.xml" => MergeStrategy.filterDistinctLines
case PathList(ps @ _*) if ps.last endsWith "Log$Logger.class" => MergeStrategy.first
case PathList(ps @ _*) if ps.last endsWith "ILoggerFactory.class" => MergeStrategy.first
case x => old(x)
}
}
resolvers += "OSChina Maven Repository" at "http://maven.oschina.net/content/groups/public/"
externalResolvers := Resolver.withDefaultResolvers(resolvers.value, mavenCentral = false)
Then run the following command from the project root:
sbt assembly
to build the jar. Because sbt downloaded a large number of dependency jars, I ran into conflicts between packaged files during assembly; that is when you need to configure the MergeStrategy shown above.
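For reference only: newer sbt-assembly releases rename the key to assemblyMergeStrategy, so with a newer plugin version the same merge rules would be written roughly like this (a sketch, not tested against this project):

assemblyMergeStrategy in assembly := {
  case PathList("org", "apache", xs @ _*) => MergeStrategy.first
  case PathList(ps @ _*) if ps.last endsWith "axiom.xml" => MergeStrategy.filterDistinctLines
  case PathList(ps @ _*) if ps.last endsWith "Log$Logger.class" => MergeStrategy.first
  case PathList(ps @ _*) if ps.last endsWith "ILoggerFactory.class" => MergeStrategy.first
  case x =>
    // fall back to the plugin's default strategy for everything else
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}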
Once the build succeeds, the assembled jar sits under the project's target/scala-2.10/ directory. Upload it to a machine in the Spark cluster.
4. Running the program
./bin/spark-submit --name NetworkWordCount --class "MyStreaming" /LogStash_2-assembly-1.0.jar
(MyStreaming reads directly from Kafka and takes no command-line arguments, so nothing needs to follow the jar path.)
Then produce some data: echo "we lvhongp ldds ldds" >> /log.input, and watch the output on the Spark side. When the word counts are printed there, the whole Flume + Kafka + Spark Streaming + Scala example has worked end to end. OK!
If you have any questions, just ask. I'm only a beginner at this myself, haha.