Reading S3 Parquet with Spark and Writing to a Hudi Table
References
Hadoop-aws
Apache Hadoop Amazon Web Services support – Hadoop-AWS module: Integration with Amazon Web Services
EMR component versions
Amazon EMR release 6.5.0 - Amazon EMR
EMR Spark
EMR Hudi
Differences and relationships between S3, S3N, and S3A
First, the three schemes differ in the maximum file size they can handle.
Second, S3 is block-based, while S3N/S3A are object-based.
Finally, S3A is the access scheme recommended by Apache; the original S3 scheme is gradually being replaced and AWS discourages its use, while S3A is more stable, secure, and efficient.
S3 Block FileSystem (URI scheme: s3) A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.
S3A (URI scheme: s3a) A successor to S3 Native (s3n), the s3a filesystem uses Amazon's libraries to interact with S3. This allows S3A to support larger files (no more 5 GB limit), higher-performance operations, and more. The filesystem is intended as a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a:// simply by replacing the URL scheme.
S3 Native FileSystem (URI scheme: s3n) A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3.
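In practice, moving from s3n:// to s3a:// only requires swapping the URL scheme, provided hadoop-aws and its bundled AWS SDK are on the classpath. A minimal sketch of that idea is shown below; the bucket, keys, and prefix are placeholders, and the complete, runnable example follows in the next section.

import org.apache.spark.sql.SparkSession

object S3ASchemeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("S3ASchemeSketch").getOrCreate()
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.access.key", "<access-key>")   // placeholder credentials
    hc.set("fs.s3a.secret.key", "<secret-key>")
    hc.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn")
    // Data previously addressed as s3n://<bucket>/<prefix> is readable by switching the scheme to s3a://
    spark.read.parquet("s3a://<bucket>/<prefix>").printSchema()
    spark.stop()
  }
}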
Reading and writing S3 Parquet files with Spark
Test code
package org.zero

import com.amazonaws.auth.{ClasspathPropertiesFileCredentialsProvider, DefaultAWSCredentialsProviderChain}
import org.apache.log4j.{Level, Logger}
import org.slf4j.LoggerFactory
import org.utils.SparkUtils
import software.amazon.awssdk.auth.credentials.{EnvironmentVariableCredentialsProvider, ProfileCredentialsProvider}

object SparkS2Test {
  private var logger: org.slf4j.Logger = _

  def main(args: Array[String]): Unit = {
    logger = LoggerFactory.getLogger(this.getClass.getSimpleName)
    Logger.getLogger("org.apache.hadoop").setLevel(Level.INFO)
    Logger.getLogger("org.apache.spark").setLevel(Level.INFO)
    Logger.getLogger("org.spark_project.jetty").setLevel(Level.WARN)

    val start = System.currentTimeMillis()
    logger.warn(s"=================== Spark reading S3 ===================")

    val spark = SparkUtils.getSparkSession(this.getClass.getSimpleName, "local[*]")
    val sc = spark.sparkContext
    // S3A credentials and endpoint for the China (Ningxia) region bucket
    sc.hadoopConfiguration.set("fs.s3a.access.key", "AKIA4ZNT6QH3L45V45VY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "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")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn")

    // Read the partitioned Parquet data from S3
    val dataframe = spark
      .read
      .parquet("s3a://s3-datafacts-poc-001/dct/s3-datafacts-poc-001/dt=2022-05-09")

    // Cache the data and query it through a temporary view
    val tmpCache = dataframe.cache()
    tmpCache.createOrReplaceTempView("parquet_tmp_view")
    val dataFrame2 = spark.sql("select * from parquet_tmp_view limit 10")
    dataFrame2.show
    // dataFrame2.write.parquet("F:\\tmp\\output")

    spark.stop()
    val end = System.currentTimeMillis()
    logger.warn(s"=================== Elapsed: ${(end - start) / 1000} s ===================")
  }
}
package org.utils

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}

object SparkUtils {
  private val sparkConf: SparkConf = new SparkConf()

  def getSparkConf(appName: String, master: String): SparkConf = {
    sparkConf.setMaster(master).setAppName(appName)
  }

  def getSparkSession(appName: String, master: String): SparkSession = {
    sparkConf.setMaster(master).setAppName(appName)
    sparkSessionInit
  }

  // Shared SparkSession with Kryo serialization and Parquet tuning
  lazy val sparkSessionInit: SparkSession = SparkSession.builder()
    .config(sparkConf)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.io.compression.codec", "snappy")
    .config("spark.rdd.compress", "true")
    .config("spark.hadoop.parquet.writer.version", "v2")
    .config("spark.sql.parquet.enableVectorizedReader", "false")
    .config("spark.sql.parquet.compression.codec", "snappy")
    .config("spark.sql.parquet.filterPushdown", "true")
    .config("spark.sql.parquet.mergeSchema", "true")
    .config("spark.sql.parquet.binaryAsString", "true")
    .getOrCreate()
}
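The write that is commented out above targets the local filesystem; writing back to S3 goes through the same s3a:// scheme via the DataFrame writer. A minimal sketch, reusing SparkUtils from above (the keys and output prefix are placeholders, not values from this article):

import org.apache.spark.sql.SaveMode
import org.utils.SparkUtils

object SparkS3WriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkUtils.getSparkSession("SparkS3WriteSketch", "local[*]")
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.access.key", "<access-key>")   // placeholder credentials
    hc.set("fs.s3a.secret.key", "<secret-key>")
    hc.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn")

    val df = spark.read.parquet("s3a://s3-datafacts-poc-001/dct/s3-datafacts-poc-001/dt=2022-05-09")
    // Write a sample of the data back to S3 as Parquet; the target prefix is made up for illustration
    df.limit(10).write.mode(SaveMode.Overwrite).parquet("s3a://<bucket>/<output-prefix>/")
    spark.stop()
  }
}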
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>spark-s3-hudi-test</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>spark-s3-hudi-test</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.maven.plugin.version>4.3.0</scala.maven.plugin.version>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.assembly.plugin.version>3.1.1</maven.assembly.plugin.version>
        <scala.version>2.12.13</scala.version>
        <scala.binary.version>2.12</scala.binary.version>
        <spark.version>3.1.2</spark.version>
        <hadoop.version>3.2.1</hadoop.version>
        <fasterxml.jackson.version>2.10.0</fasterxml.jackson.version>
        <project.build.scope>compile</project.build.scope>
    </properties>

    <repositories>
        <repository>
            <id>emr-6.5.0-artifacts</id>
            <name>EMR 6.5.0 Releases Repository</name>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
            <url>https://s3.us-west-1.amazonaws.com/us-west-1-emr-artifacts/emr-6.5.0/repos/maven/</url>
        </repository>
    </repositories>

    <dependencyManagement>
        <dependencies>
            <!-- AWS SDK v2 BOM keeps the s3/kms/s3control artifact versions in sync -->
            <dependency>
                <groupId>software.amazon.awssdk</groupId>
                <artifactId>bom</artifactId>
                <version>2.17.186</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>s3</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>kms</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>s3control</artifactId>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>${fasterxml.jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>${fasterxml.jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-annotations</artifactId>
            <version>${fasterxml.jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.12.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.15</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.13</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>${scala.maven.plugin.version}</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>${maven.assembly.plugin.version}</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Configuration files
Create a resources directory and add the following configuration files.
core-site.xml
<configuration>
    <property>
        <name>fs.s3a.aws.credentials.provider</name>
        <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
    </property>
    <!--
    <property>
        <name>fs.s3a.access.key</name>
        <description>AWS access key ID.
            Omit for IAM role-based or provider-based authentication.</description>
        <value>AKIA4ZNT6QH3L45V45VY</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <description>AWS secret key.
            Omit for IAM role-based or provider-based authentication.</description>
        <value>og8I6vB52vDhhb/So/r9ioHMvtbJ4EI2xdGPQIce</value>
    </property>
    -->
</configuration>
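With fs.s3a.aws.credentials.provider set as above, the access key and secret key still have to be supplied somewhere, either in the commented-out properties or programmatically as the test code does. Another option is to switch the provider to one that reads the standard AWS environment variables, so no secrets live in code or config. A sketch of that approach, assuming the SparkUtils helper from earlier (the provider class name comes from the AWS SDK v1 that hadoop-aws depends on; the bucket path is hypothetical):

import org.utils.SparkUtils

object EnvCredentialsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkUtils.getSparkSession("EnvCredentialsSketch", "local[*]")
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn")
    // Delegate credential lookup to AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in the environment
    hc.set("fs.s3a.aws.credentials.provider",
      "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
    spark.read.parquet("s3a://<bucket>/<prefix>").show(10)   // hypothetical path
    spark.stop()
  }
}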
log4j.properties
################################################################################
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################
log4j.rootLogger=info, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout