
Spark Streaming Consuming Kafka Data

兴趣e族 2018-10-31 17:05:04

Overview: This example shows Spark Streaming consuming Kafka messages. It extracts, filters, and transforms the data in real time, then stores the result in HDFS.
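The job expects every message on the Kafka topic to be a single JSON object whose keys match the LogBean case class defined at the end of the example. A hypothetical record (field names taken from LogBean, values invented purely for illustration) could look like this:

{"time":"2018-10-31 17:05:04","longitude":116.39,"latitude":39.91,"openid":"oX9aBc123","page":"index","evnet_type":1}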

Example code

package com.fwmagic.test

import com.alibaba.fastjson.{JSON, JSONException}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.slf4j.LoggerFactory

/**
  * created by fwmagic
  */
object RealtimeEtl {

  private val logger = LoggerFactory.getLogger(RealtimeEtl.getClass)

  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "hadoop")

    val conf = new SparkConf().setAppName("RealtimeEtl").setMaster("local[*]")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    val streamContext = new StreamingContext(spark.sparkContext, Seconds(5))

    // The direct approach connects the consumer straight to the partitions of the Kafka topic
    // auto.offset.reset: earliest (start from the beginning when no committed offset exists), latest (start from the newest offset when no committed offset exists)
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "hd1:9092,hd2:9092,hd3:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "fwmagic",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("access")

    val kafkaDStream = KafkaUtils.createDirectStream[String, String](
      streamContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    // With the direct Kafka integration, process the kafkaDStream via foreachRDD so each batch's offset ranges can be obtained and committed
    kafkaDStream.foreachRDD(kafkaRDD => {
      if (!kafkaRDD.isEmpty()) {
        // Get the offset ranges of the current batch's RDD
        val offsetRanges = kafkaRDD.asInstanceOf[HasOffsetRanges].offsetRanges

        // Extract the message values from the Kafka records
        val lines = kafkaRDD.map(_.value())
        // Parse each JSON line into a LogBean
        val logBeanRDD = lines.map(line => {
          var logBean: LogBean = null
          try {
            logBean = JSON.parseObject(line, classOf[LogBean])
          } catch {
            case e: JSONException => {
              // Log the parse failure and keep going; the bad record stays null and is filtered out below
              logger.error("JSON parse error! line: " + line, e)
            }
          }
          logBean
        })

        // Filter out records that failed to parse
        val filteredRDD = logBeanRDD.filter(_ != null)

        // Convert the RDD to a DataFrame; this works because the RDD holds case class instances
        import spark.implicits._

        val df = filteredRDD.toDF()

        df.show()
        // Write the data to HDFS, e.g. hdfs://hd1:9000/360 (the output path is passed in as args(0))
        df.repartition(1).write.mode(SaveMode.Append).parquet(args(0))

        // Commit the current batch's offsets; they are written back to Kafka asynchronously
        kafkaDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    })

    // Start the streaming job and block until it terminates
    streamContext.start()
    streamContext.awaitTermination()
    streamContext.stop()

  }

}

case class LogBean(time:String,
                   longitude:Double,
                   latitude:Double,
                   openid:String,
                   page:String,
                   evnet_type:Int)
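
After the job has run for a few batches, the Parquet output can be spot-checked by reading the output directory back with Spark SQL. Below is a minimal verification sketch, assuming the output path is hdfs://hd1:9000/360 (the example location from the comment in the job; in practice it is whatever path was passed as args(0)):

import org.apache.spark.sql.SparkSession

object CheckEtlOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CheckEtlOutput")
      .master("local[*]")
      .getOrCreate()

    // Read back the Parquet files appended by the streaming job
    val df = spark.read.parquet("hdfs://hd1:9000/360")

    // Confirm the schema matches LogBean and inspect a few rows
    df.printSchema()
    df.show(20, truncate = false)

    spark.stop()
  }
}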

Dependencies (pom.xml)

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.fwmagic.360</groupId>
    <artifactId>fwmagic-360</artifactId>
    <version>1.0</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11.7</scala.version>
        <spark.version>2.2.2</spark.version>
        <hadoop.version>2.7.7</hadoop.version>
        <encoding>UTF-8</encoding>
    </properties>

    <dependencies>
        <!-- Scala dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <!-- Spark core dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Spark SQL dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Spark Streaming dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Hadoop client, matching the cluster's Hadoop version -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.39</version>
        </dependency>

    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <!-- Scala compiler plugin -->
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.2</version>
                </plugin>
                <!-- Java compiler plugin -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <!-- Shade plugin for building a fat jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>