Writing Kafka Data into an Iceberg Data Lake with Spark Structured Streaming

Kafka-Iceberg usage

1. Introduction

Iceberg Version : 0.12.1

Spark Version : 3.1.2

Scala Version : 2.12.10

This post shows how to use Spark Structured Streaming to write Kafka data into an Iceberg data lake.

2. Kafka source data format

Name           Type
user_id        Long
station_time   String
score          Int
local_time     Timestamp
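
Each Kafka message value is expected to be a JSON string matching this schema. A sample record (all field values below are made up for illustration) might look like:

{"user_id": 10001, "station_time": "00:05:30", "score": 88, "local_time": "2023-03-22T10:15:02.000Z"}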

3. Step-by-step walkthrough

3.1 Building the SparkSession

A HadoopCatalog is used here, so all table data and metadata are stored on HDFS.

The catalog name is czl_iceberg.

val sparkSession: SparkSession = SparkSession.builder().master("local[*]")
    //define catalog name
    .config("spark.sql.catalog.czl_iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.czl_iceberg.type", "hadoop")
    .config("spark.sql.catalog.czl_iceberg.warehouse", wareHousePath)
    .getOrCreate()
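
With a Hadoop catalog, table locations are derived directly from the warehouse path rather than from a metastore. Assuming, for illustration, that wareHousePath is hdfs://ns1/warehouse, the table created in 3.3 below ends up under hdfs://ns1/warehouse/demo/kafka_spark_iceberg/, with data/ and metadata/ subdirectories managed by Iceberg.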

3.2 Reading from Kafka

Load the Kafka data with the spark-sql-kafka-0-10_2.12 connector.

kafka.bootstrap.servers and subscribe are required: they specify the Kafka broker addresses and the topic to subscribe to. The remaining options are optional.

import sparkSession.implicits._
val df: DataFrame = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaConsumer)
  .option("subscribe", kafkaTopic)
  .option("startingOffsets", startingOffsets)
  .option("failOnDataLoss", failOnDataLoss)
  .option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
  .load()
val frame: Dataset[Row] = df
  .select(from_json('value.cast("string"), schema) as "value")
  .select($"value.*")

3.3 Creating the Iceberg table

When writing with Spark Structured Streaming, the Iceberg table must exist before the streaming query starts; the official documentation notes:

The table should be created in prior to start the streaming query. Refer SQL create table on Spark page to see how to create the Iceberg table.

sparkSession.sql("create table czl_iceberg.demo.kafka_spark_iceberg (" +
  "user_id bigint, station_time string, score integer, local_time timestamp" +
  ") using iceberg")

3.4 Writing to Iceberg

Specify the outputMode, the trigger, the target path, and the checkpointLocation.

val query: StreamingQuery = frame
  .writeStream
  .format("iceberg")
  //Iceberg supports append and complete output modes:
  //append: appends the rows of every micro-batch to the table
  //complete: replaces the table contents every micro-batch
  .outputMode("append")
  .trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
  .option("path", icebergPath)
  .option("checkpointLocation", checkpointPath)
  .start()
query.awaitTermination()
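
Since awaitTermination() blocks the driver, verification is easiest from a separate spark-shell or after stopping the query. A minimal check that micro-batches are landing, using plain batch SQL against the same catalog:

//read the Iceberg table back with an ordinary batch query
sparkSession.sql("select count(*) from czl_iceberg.demo.kafka_spark_iceberg").show()
sparkSession.sql("select * from czl_iceberg.demo.kafka_spark_iceberg limit 10").show()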

4. Complete code

package com.czl.datalake.template.iceberg.kafka

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

import java.util.concurrent.TimeUnit

/**
 * Author: CHEN ZHI LING
 * Date: 2023/3/22
 * Description:
 */
object KafkaToIceberg {

  def main(args: Array[String]): Unit = {

    val kafkaConsumer: String = "kafka ip"
    val kafkaTopic: String = "topic"
    val startingOffsets: String = "latest"
    val failOnDataLoss: String = "true"
    val maxOffsetsPerTrigger: Int = 3000
    val wareHousePath : String = "warehouse path"
    //A table name if the table is tracked by a catalog, like catalog.database.table_name
    val icebergPath : String = "czl_iceberg.demo.kafka_spark_iceberg"
    val checkpointPath : String = "checkpoint path"

    val schema: StructType = StructType(List(
      StructField("user_id", LongType),
      StructField("station_time", StringType),
      StructField("score", IntegerType),
      StructField("local_time", TimestampType)
    ))

    val sparkSession: SparkSession = SparkSession.builder().master("local[*]")
      //define catalog name
      .config("spark.sql.catalog.czl_iceberg", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.czl_iceberg.type", "hadoop")
      .config("spark.sql.catalog.czl_iceberg.warehouse", wareHousePath)
      .getOrCreate()

    //The table should be created in prior to start the streaming query.
    //Refer SQL create table on Spark page to see how to create the Iceberg table.
    sparkSession.sql("create table czl_iceberg.demo.kafka_spark_iceberg (" +
      "user_id bigint, station_time string, score integer, local_time timestamp" +
      ") using iceberg")

    import sparkSession.implicits._
    val df: DataFrame = sparkSession
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaConsumer)
      .option("subscribe", kafkaTopic)
      .option("startingOffsets", startingOffsets)
      .option("failOnDataLoss", failOnDataLoss)
      .option("maxOffsetsPerTrigger",maxOffsetsPerTrigger)
      .load()

    val frame: Dataset[Row] = df
      .select(from_json('value.cast("string"), schema) as "value")
      .select($"value.*")

    val query: StreamingQuery = frame
      .writeStream
      .format("iceberg")
      //Iceberg supports append and complete output modes:
      //append: appends the rows of every micro-batch to the table
      //complete: replaces the table contents every micro-batch
      .outputMode("append")
      .trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
      .option("path", icebergPath)
      .option("checkpointLocation", checkpointPath)
      .start()
    query.awaitTermination()
  }
}

5. pom.xml reference

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-spark3-runtime</artifactId>
    <version>0.12.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
    <version>3.1.2</version>
</dependency>
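
All three artifacts must line up: spark-sql_2.12 and spark-sql-kafka-0-10_2.12 share the 3.1.2 build, and iceberg-spark3-runtime 0.12.1 targets Spark 3.x with Scala 2.12. When the job is submitted to a cluster that already ships Spark, the two Spark dependencies are typically marked provided.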

6. Git repository

Chenzhiling/dataLake-template (some demos of delta, iceberg, hudi): https://github.com/Chenzhiling/dataLake-template
