Hands-On with doris-spark-connector on Spark 3.3.0
For a fuller picture of Spark internals and real-world practice, you can pick up my new book.
1. Building doris-spark-connector
- Download the source from GitHub:
git clone https://github.com/apache/doris-spark-connector.git
- Edit custom_env.sh (vim custom_env.sh) and add the following:
export THRIFT_BIN=/usr/local/bin/thrift
export MVN_BIN=/opt/software/apache-maven-3.8.5/bin/mvn
export JAVA_HOME=/opt/software/jdk8
- Build against Spark 3.3.0 with Scala 2.12:
cd spark-doris-connector/
sh build.sh --spark 3.3.0 --scala 2.12
- Build output (trimmed):
Thrift version 0.13.0
Apache Maven 3.8.5 (3599d3414f046de2324203b78ddcf9b5e4388aa0)
Maven home: /opt/software/apache-maven-3.8.5
Java version: 1.8.0_131, vendor: Oracle Corporation, runtime: /opt/software/jdk8/jre
Default locale: en_US, platform encoding: ANSI_X3.4-1968
OS name: "linux", version: "4.18.0-358.el8.x86_64", arch: "amd64", family: "unix"
SPARK_MINOR_VERSION: 3.3
[INFO] Scanning for projects...
[INFO] Replacing /opt/software/doris-spark-connector/spark-doris-connector/target/spark-doris-connector-3.3_2.12-1.0.0-SNAPSHOT.jar with /opt/software/doris-spark-connector/spark-doris-connector/target/spark-doris-connector-3.3_2.12-1.0.0-SNAPSHOT-shaded.jar
[INFO]
[INFO] --- maven-javadoc-plugin:3.3.0:jar (attach-javadocs) @ spark-doris-connector-3.3_2.12 ---
[INFO] Skipping javadoc generation
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:02 min
[INFO] Finished at: 2022-07-14T06:00:47Z
[INFO] ------------------------------------------------------------------------
2. Spark 3.3.0 environment setup
- Download and extract the Spark binary tarball:
wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar -xzvf spark-3.3.0-bin-hadoop3.tgz
- Configure the Spark environment (vim /etc/profile, then source it):
export SPARK_HOME=/opt/software/spark-3.3.0-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
- Copy the freshly built spark-doris-connector-3.3_2.12-1.0.0-SNAPSHOT.jar into Spark's jars directory:
cd /opt/software/doris-spark-connector/spark-doris-connector/target/
cp spark-doris-connector-3.3_2.12-1.0.0-SNAPSHOT.jar /opt/software/spark-3.3.0-bin-hadoop3/jars/
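All the Scala snippets below are run from spark-shell; since the jar now sits in $SPARK_HOME/jars, no extra --jars flag is needed. As a quick sanity check that the connector is on the classpath (nothing here beyond the steps above), its package should import cleanly:

import org.apache.doris.spark._   // resolves only if the connector jar was picked up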
3. Using doris-spark-connector
- Create the following table in Doris. Note that it uses the Aggregate data model: rows sharing the same key columns (user_id, date, city, age, sex) are merged, cost accumulates via SUM, max_dwell_time/min_dwell_time keep MAX/MIN, and last_visit_date keeps the most recently loaded value via REPLACE. This matters when we verify writes later.

CREATE TABLE `expamle_tbl_1` (
    `user_id` largeint(40) NOT NULL COMMENT "id",
    `date` date NOT NULL COMMENT "",
    `city` varchar(800) NULL COMMENT "",
    `age` smallint(6) NULL COMMENT "",
    `sex` tinyint(4) NULL COMMENT "",
    `last_visit_date` datetime REPLACE NULL DEFAULT "1970-01-01 00:00:00" COMMENT "",
    `cost` bigint(20) SUM NULL DEFAULT "0" COMMENT "",
    `max_dwell_time` int(11) MAX NULL DEFAULT "0" COMMENT "",
    `min_dwell_time` int(11) MIN NULL DEFAULT "99999" COMMENT ""
) ENGINE=OLAP
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1",
    "in_memory" = "false",
    "storage_format" = "V2"
);
- Query the existing rows:

mysql> select * from expamle_tbl_1;
+---------+------------+------+------+------+---------------------+------+----------------+----------------+
| user_id | date       | city | age  | sex  | last_visit_date     | cost | max_dwell_time | min_dwell_time |
+---------+------------+------+------+------+---------------------+------+----------------+----------------+
|       1 | 2022-06-30 | 1    |   10 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       2 | 2022-06-30 | 2    |    2 |    2 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       2 | 2022-06-30 | 4    |    4 |    4 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       2 | 2022-06-30 | 5    |    5 |    5 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       3 | 2022-06-30 | 3    |    3 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
+---------+------------+------+------+------+---------------------+------+----------------+----------------+
5 rows in set (0.00 sec)
- Read Doris data as an RDD through doris-spark-connector:
import org.apache.doris.spark._
val dorisSparkRDD = sc.dorisRDD(
tableIdentifier = Some("example_db.expamle_tbl_1"),
cfg = Some(Map(
"doris.fenodes" -> "127.0.0.1:8030",
"doris.request.auth.user" -> "root",
"doris.request.auth.password" -> ""
))
)
dorisSparkRDD.collect()
Collecting the RDD prints the rows:
res0: Array[AnyRef] = Array([1, 2022-06-30, 1, 10, 3, 1970-01-01 00:00:00, 0, 0, 99999], [2, 2022-06-30, 2, 2, 2, 1970-01-01 00:00:00, 0, 0, 99999], [2, 2022-06-30, 4, 4, 4, 1970-01-01 00:00:00, 0, 0, 99999], [2, 2022-06-30, 5, 5, 5, 1970-01-01 00:00:00, 0, 0, 99999], [3, 2022-06-30, 3, 3, 3, 1970-01-01 00:00:00, 0, 0, 99999])
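Each element of the RDD is an untyped row whose values follow the column order of the DDL above. Below is a minimal sketch of pulling typed fields out of it; it assumes each element is a java.util.List of column values, so verify the element type against your connector version before relying on it:

// Sketch: extract (user_id, cost) pairs from the untyped Doris rows.
// Assumption: each element is a java.util.List[Any] in DDL column order.
val userCosts = dorisSparkRDD.map { row =>
  val cols = row.asInstanceOf[java.util.List[Any]]
  (cols.get(0).toString, cols.get(6).toString.toLong) // user_id, cost
}
userCosts.collect().foreach(println)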
- Read Doris data as a DataFrame:
import org.apache.doris.spark._
val dorisSparkDF = spark.read.format("doris")
  .option("doris.table.identifier", "example_db.expamle_tbl_1")
  .option("doris.fenodes", "127.0.0.1:8030")
  .option("user", "root")
  .option("password", "")
  .load()
dorisSparkDF.show()
Execution result:
scala> dorisSparkDF.show()
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
|user_id|      date|   city|age|sex|    last_visit_date|cost|max_dwell_time|min_dwell_time|
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
|      1|2022-06-30|      1| 10|  3|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      2|  2|  2|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      4|  4|  4|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      5|  5|  5|1970-01-01 00:00:00|   0|             0|         99999|
|      3|2022-06-30|      3|  3|  3|1970-01-01 00:00:00|   0|             0|         99999|
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
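Ordinary DataFrame operations compose on top of the Doris-backed frame, and the connector can push simple comparison filters down to Doris instead of evaluating them in Spark. A small sketch using columns from the DDL above:

// Project a few columns and filter on key columns; simple predicates
// like these are candidates for pushdown to Doris.
import org.apache.spark.sql.functions.col

dorisSparkDF
  .select("user_id", "city", "cost")
  .filter(col("date") === "2022-06-30" && col("age") > 3)
  .show()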
- Write a DataFrame into Doris.
(1) Build some mock data:
val mockDataDF = List(
  (5, "2022-07-16", "xian", 31, 2, "2022-07-15", 200, 10, 20),
  (6, "2022-07-16", "beijing", 32, 3, "2022-07-15", 200, 10, 20)
).toDF("user_id", "date", "city", "age", "sex", "last_visit_date", "cost", "max_dwell_time", "min_dwell_time")
mockDataDF.show(5)
(2) Write the data:
mockDataDF.write.format("doris")
  .option("doris.table.identifier", "example_db.expamle_tbl_1")
  .option("doris.fenodes", "127.0.0.1:8030")
  .option("user", "root")
  .option("password", "")
  .option("doris.write.fields", "user_id,date,city,age,sex,last_visit_date,cost,max_dwell_time,min_dwell_time")
  .save()
(3) Verify the write from Doris. Note the cost values: we wrote 200, but the table shows 400. cost is a SUM aggregate column, so rows loaded more than once for the same key accumulate; the doubling here suggests the load ran twice.
mysql> select * from expamle_tbl_1;
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
| user_id | date       | city    | age  | sex  | last_visit_date     | cost | max_dwell_time | min_dwell_time |
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
|       1 | 2022-06-30 | 1       |   10 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       2 | 2022-06-30 | 2       |    2 |    2 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       2 | 2022-06-30 | 4       |    4 |    4 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       2 | 2022-06-30 | 5       |    5 |    5 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       3 | 2022-06-30 | 3       |    3 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       5 | 2022-07-16 | xian    |   31 |    2 | 2022-07-15 00:00:00 |  400 |             10 |             20 |
|       6 | 2022-07-16 | beijing |   32 |    3 | 2022-07-15 00:00:00 |  400 |             10 |             20 |
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
7 rows in set (0.00 sec)
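The read and write calls keep repeating the same connection options. Here is a small convenience wrapper (the function is hypothetical, written for this article; the option keys are exactly the ones used above) that keeps later experiments shorter:

import org.apache.spark.sql.DataFrame

// Hypothetical helper: centralizes the Doris connection options used in
// this article. Adjust fenodes and credentials for your own cluster.
def writeToDoris(df: DataFrame, table: String,
                 fenodes: String = "127.0.0.1:8030",
                 user: String = "root",
                 password: String = ""): Unit = {
  df.write.format("doris")
    .option("doris.table.identifier", table)
    .option("doris.fenodes", fenodes)
    .option("user", user)
    .option("password", password)
    .save()
}

// Usage: writeToDoris(mockDataDF, "example_db.expamle_tbl_1")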
- Read and write through Spark SQL, using a temporary view:
import org.apache.doris.spark._
val doris = spark.sql(
s"""
|CREATE TEMPORARY VIEW spark_doris
|USING doris
|OPTIONS(
| "table.identifier"="example_db.expamle_tbl_1",
| "fenodes"="127.0.0.1:8030",
| "user"="root",
| "password"=""
|);
|""".stripMargin)
spark.sql("SELECT * FROM spark_doris;").show
The query result:
scala> spark.sql("SELECT * FROM spark_doris;").show
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
|user_id|      date|   city|age|sex|    last_visit_date|cost|max_dwell_time|min_dwell_time|
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
|      1|2022-06-30|      1| 10|  3|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      2|  2|  2|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      4|  4|  4|1970-01-01 00:00:00|   0|             0|         99999|
|      2|2022-06-30|      5|  5|  5|1970-01-01 00:00:00|   0|             0|         99999|
|      3|2022-06-30|      3|  3|  3|1970-01-01 00:00:00|   0|             0|         99999|
|      5|2022-07-16|   xian| 31|  2|2022-07-15 00:00:00| 400|            10|            20|
|      6|2022-07-16|beijing| 32|  3|2022-07-15 00:00:00| 400|            10|            20|
+-------+----------+-------+---+---+-------------------+----+--------------+--------------+
(1) Insert data through the view:
spark.sql("INSERT INTO spark_doris VALUES (4,'2022-07-15','1',30,1,'2022-07-15',200,10,20);")
(2) Verify the write from Doris:
mysql> select * from expamle_tbl_1;
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
| user_id | date       | city    | age  | sex  | last_visit_date     | cost | max_dwell_time | min_dwell_time |
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
|       1 | 2022-06-30 | 1       |   10 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       2 | 2022-06-30 | 2       |    2 |    2 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       2 | 2022-06-30 | 4       |    4 |    4 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       2 | 2022-06-30 | 5       |    5 |    5 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       3 | 2022-06-30 | 3       |    3 |    3 | 1970-01-01 00:00:00 |    0 |              0 |          99999 |
|       5 | 2022-07-16 | xian    |   31 |    2 | 2022-07-15 00:00:00 |  400 |             10 |             20 |
|       6 | 2022-07-16 | beijing |   32 |    3 | 2022-07-15 00:00:00 |  400 |             10 |             20 |
|       4 | 2022-07-15 | 1       |   30 |    1 | 2022-07-15 00:00:00 |  200 |             10 |             20 |
+---------+------------+---------+------+------+---------------------+------+----------------+----------------+
8 rows in set (0.00 sec)
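Since spark_doris is an ordinary temporary view, any Spark SQL runs on top of it, with Spark planning the query against the connector's scan. For example, totaling cost per user over the data above (a sketch):

spark.sql("""
  SELECT user_id, SUM(cost) AS total_cost
  FROM spark_doris
  GROUP BY user_id
  ORDER BY user_id
""").show()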

This article walked through doris-spark-connector on Spark 3.3.0 end to end: compiling the connector from source, configuring the Spark environment, and reading and writing Doris data via the RDD, DataFrame, and Spark SQL APIs, demonstrating how cleanly Spark and Doris integrate.