Overall idea: use netcat as the Flume data source, and push the Flume data to Spark (push mode).
In IDEA, write the code that receives the data Flume pushes, then package it as a jar.
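End to end, the data flows like this:

nc client (port 44444) -> Flume netcat source -> memory channel -> avro sink -> (push, port 55555) -> Spark Streaming receiver created by FlumeUtils.createStream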
step1: pom.xml dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>myfirstspark</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>myfirstspark</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <spark.version>2.3.4</spark.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- keep the flume connector on the same version as Spark itself -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- compiles the Scala sources -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <!-- skip tests when packaging the jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.19</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
step2: write the code
package com.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object SparkFlumePushDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("flumeDemo")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: this starts an Avro server on the given host:port.
    // The host must be an address of the machine running this receiver,
    // and the port must match the Flume avro sink's port (55555 below).
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.**.***", 55555)
    // The event body is a ByteBuffer; decode it to a string, then word count.
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
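createStream also has an overload that takes an explicit storage level for the received events; the default is MEMORY_AND_DISK_SER_2. A sketch of the explicit form (same ssc as above, shown only for illustration):

import org.apache.spark.storage.StorageLevel

val flumeStream = FlumeUtils.createStream(ssc, "192.168.**.***", 55555,
  StorageLevel.MEMORY_AND_DISK_SER_2)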
step3: build the jar and upload it to the VM (where you put the jar is up to you)
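A minimal build command, assuming the standard Maven project layout (run from the project root):

mvn clean package
# the jar is produced at target/myfirstspark-1.0-SNAPSHOT.jar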
Write the Flume configuration file: conf_flumePushSpark.properties
agent.sources=s1
agent.channels=c1
agent.sinks=k1
agent.sources.s1.type=netcat
agent.sources.s1.bind=192.168.***.***
# 44444 is the port of the Flume source, i.e. the port the netcat source listens on
agent.sources.s1.port=44444
agent.sources.s1.channels=c1
agent.channels.c1.type=memory
agent.channels.c1.capacity=1000
agent.sinks.k1.type=avro
agent.sinks.k1.hostname=192.168.***.***
# 55555 must match the port used in the jar (the Spark Streaming receiver)
agent.sinks.k1.port=55555
agent.sinks.k1.channel=c1
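The memory channel above buffers at most capacity=1000 events. The per-transaction batch size is controlled by transactionCapacity, which Flume defaults to 100; if you want it explicit, the following line (shown with the default value) can be added to the same file:

agent.channels.c1.transactionCapacity=100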
How to run (the steps must be executed in this exact order)
step1: start Flume
flume-ng agent -n agent -c conf -f /opt/flumeconf/conf_flumePushSpark.properties
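While testing, it helps to keep the agent's log in the terminal; Flume's standard log4j override can be appended for that (optional):

flume-ng agent -n agent -c conf -f /opt/flumeconf/conf_flumePushSpark.properties -Dflume.root.logger=INFO,console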
step2: start the Spark Streaming job (i.e. submit the jar built above)
spark-submit \
--class com.sparkstreaming.SparkFlumePushDemo \
--packages org.apache.spark:spark-streaming-flume_2.11:2.3.4 \
/opt/spark/myfirstspark-1.0-SNAPSHOT.jar
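--packages resolves the flume connector from Maven Central at submit time. On a machine without internet access, the same thing can be done with --jars and pre-downloaded artifacts (a sketch; the paths below are placeholders, and the connector's transitive dependencies such as flume-ng-sdk must be listed too):

spark-submit \
--class com.sparkstreaming.SparkFlumePushDemo \
--jars /opt/jars/spark-streaming-flume_2.11-2.3.4.jar,/opt/jars/flume-ng-sdk-1.6.0.jar \
/opt/spark/myfirstspark-1.0-SNAPSHOT.jar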
step3: connect netcat to port 44444 and send data
nc 192.168.***.*** 44444
(The Flume netcat source is already listening on 44444, so nc connects as a client here; do not start a second listener with -lk.) Type a line of text and press Enter; the Spark Streaming job's window will then print the word-count results.
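For example, typing hello world hello into the nc session should produce output along these lines in the Spark window (the timestamp will differ):

-------------------------------------------
Time: 1621234565000 ms
-------------------------------------------
(hello,2)
(world,1)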