Spark Streaming: consuming a Flume data source in push mode

This article describes how to use Flume as the data source, feed it with netcat, and push the data to Spark Streaming for real-time processing. The steps cover setting up the Maven dependencies, writing the Scala code, packaging the jar, configuring Flume, and starting the Spark job.


The overall idea: use netcat as the Flume source, and push the Flume data to Spark in push mode.
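The data flow, end to end: nc client → Flume netcat source (port 44444) → memory channel → avro sink (port 55555) → the Avro receiver that FlumeUtils.createStream starts inside the Spark Streaming job.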

Write the Spark code that consumes the Flume-pushed data in IDEA, then package it into a jar.

step1: pom.xml dependencies

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example</groupId>
  <artifactId>myfirstspark</artifactId>
  <version>1.0-SNAPSHOT</version>

  <name>myfirstspark</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <spark.version>2.3.4</spark.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-flume_2.11</artifactId>
      <!-- keep this aligned with spark.version; the spark-submit command below pulls 2.3.4 -->
      <version>${spark.version}</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.6.0</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.19</version>
        <configuration>
          <skip>true</skip>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
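A note on versions: the _2.11 suffix on the Spark artifacts is the Scala version, and it must match the Scala build of the Spark installation you will submit to; likewise, spark-streaming-flume should use the same version as spark-core and spark-streaming (hence ${spark.version} above), otherwise you risk binary incompatibilities at runtime.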

step2: Write the code

package com.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object SparkFlumePushDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("flumeDemo")
    // 5-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(5))

    // Push mode: Spark starts an Avro receiver on this host/port;
    // Flume's avro sink must point at exactly the same address (55555 below).
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.**.***", 55555)

    // The payload sits in the Flume event body; decode it, then do a word count.
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
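FlumeUtils.createStream also accepts an explicit StorageLevel for caching the received events; the two-argument call above uses the default, MEMORY_AND_DISK_SER_2. A minimal sketch with the level spelled out (same placeholder host and port as above):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// Same Avro receiver as above, but with the storage level made explicit.
val flumeStream = FlumeUtils.createStream(ssc, "192.168.**.***", 55555,
  StorageLevel.MEMORY_AND_DISK_SER_2)

Also note that the receiver permanently occupies one core, so the master needs at least two threads: local[*] works, but local[1] would leave no core free to process the batches.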

step3: Package the code into a jar and upload it to the VM (put the jar wherever you like)
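Note that this pom does not build a fat jar (there is no shade or assembly plugin), so the spark-streaming-flume classes are not packaged inside myfirstspark-1.0-SNAPSHOT.jar; they are supplied at launch time via the --packages option of spark-submit in the run steps below.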

Write the Flume configuration file: conf_flumePushSpark.properties

agent.sources=s1
agent.channels=c1
agent.sinks=k1

agent.sources.s1.type=netcat
agent.sources.s1.bind=192.168.***.***
# 44444 is the port of the Flume source, i.e. the netcat port
agent.sources.s1.port=44444
agent.sources.s1.channels=c1

agent.channels.c1.type=memory
agent.channels.c1.capacity=1000

agent.sinks.k1.type=avro
agent.sinks.k1.hostname=192.168.***.***
# 55555 must match the port used in the jar (FlumeUtils.createStream)
agent.sinks.k1.port=55555
agent.sinks.k1.channel=c1
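Two addresses are in play here: the netcat source's bind/port (44444) is where text clients connect and send lines, while the avro sink's hostname/port (55555) must be exactly the host and port passed to FlumeUtils.createStream, i.e. the machine where the Spark Streaming receiver runs.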

How to run (the order below must be followed)

step1: Start Flume

flume-ng agent -n agent -c conf -f /opt/flumeconf/conf_flumePushSpark.properties
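Here -n agent must match the prefix used in the properties file (agent.sources, agent.channels, agent.sinks), and -c conf points to Flume's configuration directory (the one containing flume-env.sh and log4j.properties).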

step2: Start the Spark Streaming job (i.e., run the jar built above)

spark-submit  \
--class com.sparkstreaming.SparkFlumePushDemo  \
--packages org.apache.spark:spark-streaming-flume_2.11:2.3.4 \
/opt/spark/myfirstspark-1.0-SNAPSHOT.jar
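--packages makes spark-submit resolve org.apache.spark:spark-streaming-flume_2.11:2.3.4 and its transitive dependencies from the local Maven/Ivy cache or Maven Central at launch time, so the machine needs network access on first run; the alternative is to bundle these dependencies into a fat jar.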

step3: Use netcat to connect to port 44444 (the Flume netcat source is already listening there, so nc connects as a client rather than listening itself) and send data

nc 192.168.***.*** 44444

Hit Enter and type some data; the Spark Streaming job's console window will then print the word count results.
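For example, after typing hello spark hello, each 5-second batch prints something roughly like the following (the timestamp is illustrative):

-------------------------------------------
Time: 1620000000000 ms
-------------------------------------------
(hello,2)
(spark,1)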
