Hands-On with doris-spark-connector on Spark 3.3.0

This article walks through building the doris-spark-connector against Spark 3.3.0 and Doris, covering source compilation, Spark environment setup, and hands-on steps for reading and writing Doris data with RDDs, DataFrames, and Spark SQL, demonstrating the seamless integration between Spark and Doris.



For a fuller treatment of Spark internals and hands-on applications, see my book:

《图解Spark 大数据快速分析实战》 (Wang Lei)


1. Compiling doris-spark-connector

  1. Download the source from GitHub:
     git clone https://github.com/apache/doris-spark-connector.git
    
  2. Edit the custom_env.sh file (vim custom_env.sh) and add the following:
export THRIFT_BIN=/usr/local/bin/thrift
export MVN_BIN=/opt/software/apache-maven-3.8.5/bin/mvn
export JAVA_HOME=/opt/software/jdk8
  3. Build against Spark 3.3.0:
    cd spark-doris-connector/
    sh build.sh --spark 3.3.0 --scala 2.12
    
  4. Build output:
    Thrift version 0.13.0
    Apache Maven 3.8.5 (3599d3414f046de2324203b78ddcf9b5e4388aa0)
    Maven home: /opt/software/apache-maven-3.8.5
    Java version: 1.8.0_131, vendor: Oracle Corporation, runtime: /opt/software/jdk8/jre
    Default locale: en_US, platform encoding: ANSI_X3.4-1968
    OS name: "linux", version: "4.18.0-358.el8.x86_64", arch: "amd64", family: "unix"
    SPARK_MINOR_VERSION: 3.3
    [INFO] Scanning for projects...
    
    [INFO] Replacing /opt/software/doris-spark-connector/spark-doris-connector/target/spark-doris-connector-3.3_2.12-1.0.0-SNAPSHOT.jar with /opt/software/doris-spark-connector/spark-doris-connector/target/spark-doris-connector-3.3_2.12-1.0.0-SNAPSHOT-shaded.jar
    
    [INFO] 
    [INFO] --- maven-javadoc-plugin:3.3.0:jar (attach-javadocs) @ spark-doris-connector-3.3_2.12 ---
    [INFO] Skipping javadoc generation
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time:  01:02 min
    [INFO] Finished at: 2022-07-14T06:00:47Z
    [INFO] ------------------------------------------------------------------------
    
    
2. Spark 3.3.0 Environment Setup

  1. Download the Spark binary tarball and extract it:
    wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
    tar -xzvf spark-3.3.0-bin-hadoop3.tgz
    
  2. Configure the Spark environment (vim /etc/profile, then run source /etc/profile to apply):
    export SPARK_HOME=/opt/software/spark-3.3.0-bin-hadoop3
    export PATH=$PATH:$SPARK_HOME/bin
    
  3. Copy the compiled spark-doris-connector-3.3_2.12-1.0.0-SNAPSHOT.jar into Spark's jars directory:
    cd /opt/software/doris-spark-connector/spark-doris-connector/target/
    cp spark-doris-connector-3.3_2.12-1.0.0-SNAPSHOT.jar /opt/software/spark-3.3.0-bin-hadoop3/jars/
    
3. doris-spark-connector in Action

  1. Create the following table in Doris:
     CREATE TABLE `expamle_tbl_1` (
      `user_id` largeint(40) NOT NULL COMMENT "id",
      `date` date NOT NULL COMMENT "",
      `city` varchar(800) NULL COMMENT "",
      `age` smallint(6) NULL COMMENT "",
      `sex` tinyint(4) NULL COMMENT "",
      `last_visit_date` datetime REPLACE NULL DEFAULT "1970-01-01 00:00:00" COMMENT "",
      `cost` bigint(20) SUM NULL DEFAULT "0" COMMENT "",
      `max_dwell_time` int(11) MAX NULL DEFAULT "0" COMMENT "",
      `min_dwell_time` int(11) MIN NULL DEFAULT "99999" COMMENT ""
    ) ENGINE=OLAP
    AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
    COMMENT "OLAP"
    DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
    PROPERTIES (
    "replication_allocation" = "tag.location.default: 1",
    "in_memory" = "false",
    "storage_format" = "V2"
    );
    
  2. Query the data already in the table:
    mysql> select * from expamle_tbl_1;
    +---------+------------+------+------+------+---------------------+------+----------------+----------------+
    | user_id | date       | city | age  | sex  | last_visit_date     | cost | max_dwell_time | min_dwell_time |
    +---------+------------+------+------+------+---------------------+------+----------------+----------------+
    | 1       | 2022-06-30 | 1    |   10 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
    | 2       | 2022-06-30 | 2    |    2 |    2 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
    | 2       | 2022-06-30 | 4    |    4 |    4 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
    | 2       | 2022-06-30 | 5    |    5 |    5 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
    | 3       | 2022-06-30 | 3    |    3 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
    +---------+------------+------+------+------+---------------------+------+----------------+----------------+
    5 rows in set (0.00 sec)
    
    mysql> 
    
  3. Read Doris data via an RDD with doris-spark-connector:
import org.apache.doris.spark._
val dorisSparkRDD = sc.dorisRDD(
  tableIdentifier = Some("example_db.expamle_tbl_1"),
  cfg = Some(Map(
    "doris.fenodes" -> "127.0.0.1:8030",
    "doris.request.auth.user" -> "root",
    "doris.request.auth.password" -> ""
  ))
)
dorisSparkRDD.collect()

The following output appears:

res0: Array[AnyRef] = Array([1, 2022-06-30, 1, 10, 3, 1970-01-01 00:00:00, 0, 0, 99999], [2, 2022-06-30, 2, 2, 2, 1970-01-01 00:00:00, 0, 0, 99999], [2, 2022-06-30, 4, 4, 4, 1970-01-01 00:00:00, 0, 0, 99999], [2, 2022-06-30, 5, 5, 5, 1970-01-01 00:00:00, 0, 0, 99999], [3, 2022-06-30, 3, 3, 3, 1970-01-01 00:00:00, 0, 0, 99999])
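
The result is a plain Spark RDD, so the usual transformations and actions apply. A minimal follow-on sketch (the toString rendering is only for a quick look at the otherwise opaque row objects):

// Count the rows fetched from Doris.
dorisSparkRDD.count()

// Each element's toString yields the bracketed form shown above;
// sample two rows for inspection.
dorisSparkRDD.map(_.toString).take(2).foreach(println)
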
  4. Read Doris data via a DataFrame with doris-spark-connector:
import org.apache.doris.spark._
val dorisSparkDF = spark.read.format("doris")
  .option("doris.table.identifier", "example_db.expamle_tbl_1")
  .option("doris.fenodes", "127.0.0.1:8030")
  .option("user", "root")
  .option("password", "")
  .load()

dorisSparkDF.show()

The result:

scala> dorisSparkDF.show()
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
|user_id|      date|   city|age|sex|    last_visit_date|cost|max_dwell_time|min_dwell_time|
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
|      1|2022-06-30|      1| 10|  3|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      2|  2|  2|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      4|  4|  4|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      5|  5|  5|1970-01-01 00:00:00|   0|             0|         99999|
|      3|2022-06-30|      3|  3|  3|1970-01-01 00:00:00|   0|             0|         99999|
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
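
Because dorisSparkDF is an ordinary DataFrame, projections and filters can be chained before show(). A small sketch (in spark-shell the implicits behind the $"..." column syntax are already imported):

// Project three columns and keep only rows with age > 3.
dorisSparkDF
  .select("user_id", "city", "age")
  .filter($"age" > 3)
  .show()
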
  5. Write data to Doris via a DataFrame with doris-spark-connector

(1) Construct the data:

val mockDataDF = List(
  (5, "2022-07-16", "xian", 31, 2, "2022-07-15", 200, 10, 20),
  (6, "2022-07-16", "beijing", 32, 3, "2022-07-15", 200, 10, 20)
).toDF("user_id", "date", "city", "age", "sex",
  "last_visit_date", "cost", "max_dwell_time", "min_dwell_time")

mockDataDF.show(5)

(2) Write the data:

mockDataDF.write.format("doris")
  .option("doris.table.identifier", "example_db.expamle_tbl_1")
  .option("doris.fenodes", "127.0.0.1:8030")
  .option("user", "root")
  .option("password", "")
  .option("doris.write.fields", "user_id,date,city,age,sex,last_visit_date,cost,max_dwell_time,min_dwell_time")
  .save()
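
For larger writes, the connector also exposes sink tuning options. The sketch below is not executed here, and the doris.sink.batch.size and doris.sink.max-retries keys are taken from the connector's 1.x documentation, so verify them against your build:

// Hypothetical tuned write; option keys assumed from the 1.x docs.
mockDataDF.write.format("doris")
  .option("doris.table.identifier", "example_db.expamle_tbl_1")
  .option("doris.fenodes", "127.0.0.1:8030")
  .option("user", "root")
  .option("password", "")
  .option("doris.sink.batch.size", "10000") // rows per flush to the backend
  .option("doris.sink.max-retries", "3")    // retries for a failed batch
  .save()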

(3) Verify the write from Doris. Note that cost reads 400 rather than the 200 we wrote: cost is a SUM aggregate column on this AGGREGATE KEY table, so a value of 400 suggests the write ran twice and Doris aggregated the rows by key:

mysql> select * from expamle_tbl_1;
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
| user_id | date       | city    | age  | sex  | last_visit_date     | cost | max_dwell_time | min_dwell_time |
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
| 1       | 2022-06-30 | 1       |   10 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
| 2       | 2022-06-30 | 2       |    2 |    2 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
| 2       | 2022-06-30 | 4       |    4 |    4 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
| 2       | 2022-06-30 | 5       |    5 |    5 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
| 3       | 2022-06-30 | 3       |    3 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
| 5       | 2022-07-16 | xian    |   31 |    2 | 2022-07-15 00:00:00 |  400 |             10 |             20 |
| 6       | 2022-07-16 | beijing |   32 |    3 | 2022-07-15 00:00:00 |  400 |             10 |             20 |
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
7 rows in set (0.00 sec)

mysql>
  6. Read Doris data via Spark SQL with doris-spark-connector:
import org.apache.doris.spark._

spark.sql(
  s"""
    |CREATE TEMPORARY VIEW spark_doris
    |USING doris
    |OPTIONS(
    |  "table.identifier" = "example_db.expamle_tbl_1",
    |  "fenodes" = "127.0.0.1:8030",
    |  "user" = "root",
    |  "password" = ""
    |);
    |""".stripMargin)

spark.sql("SELECT * FROM spark_doris;").show

The query result:

scala> spark.sql("SELECT * FROM spark_doris;").show
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
|user_id|      date|   city|age|sex|    last_visit_date|cost|max_dwell_time|min_dwell_time|
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
|      1|2022-06-30|      1| 10|  3|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      2|  2|  2|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      4|  4|  4|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      5|  5|  5|1970-01-01 00:00:00|   0|             0|         99999|
|      3|2022-06-30|      3|  3|  3|1970-01-01 00:00:00|   0|             0|         99999|
|      5|2022-07-16|   xian| 31|  2|2022-07-15 00:00:00| 400|            10|            20|
|      6|2022-07-16|beijing| 32|  3|2022-07-15 00:00:00| 400|            10|            20|
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
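
The temporary view behaves like any other Spark SQL relation, so aggregations run directly against it; for example, total cost per user (a quick sketch):

// Aggregate over the Doris-backed view like any Spark table.
spark.sql(
  """
    |SELECT user_id, SUM(cost) AS total_cost
    |FROM spark_doris
    |GROUP BY user_id
    |ORDER BY user_id
    |""".stripMargin).show
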
  7. Write data to Doris via Spark SQL with doris-spark-connector

(1) Write the data:

spark.sql("INSERT INTO spark_doris VALUES (4,'2022-07-15','1',30,1,'2022-07-15',200,10,20);")
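
Besides literal VALUES, INSERT INTO ... SELECT works too, which is handy for loading the output of another Spark table or view into Doris (staging_view below is a hypothetical source with a matching column layout):

// Hypothetical bulk load from an existing Spark view.
spark.sql(
  """
    |INSERT INTO spark_doris
    |SELECT user_id, date, city, age, sex,
    |       last_visit_date, cost, max_dwell_time, min_dwell_time
    |FROM staging_view
    |""".stripMargin)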

(2) Verify the write from Doris:


mysql> select * from expamle_tbl_1;
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
| user_id | date       | city    | age  | sex  | last_visit_date     | cost | max_dwell_time | min_dwell_time |
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
| 1       | 2022-06-30 | 1       |   10 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
| 2       | 2022-06-30 | 2       |    2 |    2 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
| 2       | 2022-06-30 | 4       |    4 |    4 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
| 2       | 2022-06-30 | 5       |    5 |    5 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
| 3       | 2022-06-30 | 3       |    3 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
| 5       | 2022-07-16 | xian    |   31 |    2 | 2022-07-15 00:00:00 |  400 |             10 |             20 |
| 6       | 2022-07-16 | beijing |   32 |    3 | 2022-07-15 00:00:00 |  400 |             10 |             20 |
| 4       | 2022-07-15 | 1       |   30 |    1 | 2022-07-15 00:00:00 |  200 |             10 |             20 |
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
8 rows in set (0.00 sec)