Spark3 Reading and Writing S3 Parquet, Hive, and Hudi

Spark Reading S3 Parquet and Writing to a Hudi Table

Contents

Spark Reading S3 Parquet and Writing to a Hudi Table

References

On the Differences and Relationships Among S3, S3N, and S3A

Spark Reading and Writing S3 Parquet Files

Test Code

pom.xml

Configuration Files

Submitting EMR Spark Jobs

spark-shell

spark-submit

Spark Reading and Writing Hudi

Local Testing

Code

Testing on the Cluster

spark-shell

spark-sql

spark-submit

Testing in Hive

Troubleshooting


References

Hadoop-aws

Apache Hadoop Amazon Web Services support – Hadoop-AWS module: Integration with Amazon Web Services

EMR release versions

Amazon EMR release 6.5.0 - Amazon EMR

EMR Spark

Apache Spark - Amazon EMR

EMR Hudi

Hudi - Amazon EMR

On the Differences and Relationships Among S3, S3N, and S3A

First, the three schemes differ in the file sizes they can handle;

Second, s3 is block-based, while s3n/s3a are object-based.

Finally, S3A is the access scheme recommended by Apache; the legacy s3 scheme is gradually being replaced and its use is discouraged by AWS, while S3A is more stable, secure, and efficient.

S3 Block FileSystem (URI scheme: s3) A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

S3A (URI scheme: s3a) A successor to the S3 Native, s3n fs, the S3a: system uses Amazon's libraries to interact with S3. This allows S3a to support larger files (no more 5GB limit), higher performance operations and more. The filesystem is intended to be a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL schema.

S3 Native FileSystem (URI scheme: s3n) A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3.
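
To make the scheme difference concrete, here is a minimal sketch (assuming hadoop-aws and its bundled AWS SDK are on the classpath; the bucket name and endpoint below are placeholders). Objects that were reachable under an s3n:// URL are normally reachable under s3a:// just by swapping the URI scheme.

import org.apache.spark.sql.SparkSession

object S3ASchemeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("S3ASchemeDemo")
      .master("local[*]")
      .getOrCreate()

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    // hadoop-aws already maps s3a:// to S3AFileSystem; set explicitly here only to show the wiring.
    hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoopConf.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn") // placeholder endpoint

    // The same objects addressed by s3n://my-bucket/... can be read as s3a://my-bucket/...
    val df = spark.read.parquet("s3a://my-bucket/path/to/parquet")
    df.printSchema()

    spark.stop()
  }
}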

Spark Reading and Writing S3 Parquet Files

Test Code

package org.zero

import org.apache.log4j.{Level, Logger}
import org.slf4j.LoggerFactory
import org.utils.SparkUtils

object SparkS2Test {
  private var logger: org.slf4j.Logger = _

  def main(args: Array[String]): Unit = {
    logger = LoggerFactory.getLogger(this.getClass.getSimpleName)
    // Keep Hadoop/Spark logs at INFO and quiet down Jetty.
    Logger.getLogger("org.apache.hadoop").setLevel(Level.INFO)
    Logger.getLogger("org.apache.spark").setLevel(Level.INFO)
    Logger.getLogger("org.spark_project.jetty").setLevel(Level.WARN)

    val start = System.currentTimeMillis()
    logger.warn(s"=================== Spark reading S3 ===================")

    val spark = SparkUtils.getSparkSession(this.getClass.getSimpleName, "local[*]")
    val sc = spark.sparkContext
    // S3A credentials and endpoint; replace with your own values (or use an IAM role / credentials provider).
    sc.hadoopConfiguration.set("fs.s3a.access.key", "AKIA4ZNT6QH3L45V45VY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "og8I6vB52vDhhb/So/r9ioHMvtbJ4EI2xdGPQIce")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn")

    // Read a partition of Parquet files from S3 via the s3a:// scheme.
    val dataframe = spark
      .read
      .parquet("s3a://s3-datafacts-poc-001/dct/s3-datafacts-poc-001/dt=2022-05-09")

    val tmpCache = dataframe.cache()
    tmpCache.createOrReplaceTempView("parquet_tmp_view")

    val dataFrame2 = spark.sql("select * from parquet_tmp_view limit 10")

    dataFrame2.show

//    dataFrame2.write.parquet("F:\\tmp\\output")

    spark.stop()

    val end = System.currentTimeMillis()
    logger.warn(s"=================== Elapsed: ${(end - start) / 1000} s ===================")
  }
}

package org.utils

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SparkUtils {
  private val sparkConf: SparkConf = new SparkConf()

  def getSparkConf(appName: String, master: String): SparkConf = {
    sparkConf.setMaster(master).setAppName(appName)
  }

  def getSparkSession(appName: String, master: String): SparkSession = {
    sparkConf.setMaster(master).setAppName(appName)
    sparkSessionInit
  }

  lazy val sparkSessionInit: SparkSession = SparkSession.builder()
    .config(sparkConf)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.io.compression.codec", "snappy")
    .config("spark.rdd.compress", "true")
    .config("spark.hadoop.parquet.writer.version", "v2")
    .config("spark.sql.parquet.enableVectorizedReader", "false")
    .config("spark.sql.parquet.compression.codec", "snappy")
    .config("spark.sql.parquet.filterPushdown", "true")
    .config("spark.sql.parquet.mergeSchema", "true")
    .config("spark.sql.parquet.binaryAsString", "true")
    .getOrCreate()
}
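
The test above only covers the read path (the write at the end is commented out and points to a local directory). For the write direction of this section, a sketch in the same style, e.g. pasted into spark-shell, might look like the following; the output bucket/prefix is a placeholder and the credentials/endpoint setup is the same as in SparkS2Test.

import org.utils.SparkUtils

val spark = SparkUtils.getSparkSession("SparkS3WriteTest", "local[*]")
val sc = spark.sparkContext
// Same S3A settings as in the read test; fill in real credentials or rely on an IAM role.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<your-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your-secret-key>")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn")

val df = spark.read.parquet("s3a://s3-datafacts-poc-001/dct/s3-datafacts-poc-001/dt=2022-05-09")

// Write back to S3 as Parquet (snappy, per the SparkUtils settings); the target path is a placeholder.
df.write
  .mode("overwrite")
  .parquet("s3a://my-output-bucket/tmp/parquet_out")

spark.stop()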

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
​
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
​
    <groupId>org.example</groupId>
    <artifactId>spark-s3-hudi-test</artifactId>
    <version>1.0-SNAPSHOT</version>
​
    <name>spark-s3-hudi-test</name>
​
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.maven.plugin.version>4.3.0</scala.maven.plugin.version>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.assembly.plugin.version>3.1.1</maven.assembly.plugin.version>
        <scala.version>2.12.13</scala.version>
        <scala.binary.version>2.12</scala.binary.version>
        <spark.version>3.1.2</spark.version>
        <hadoop.version>3.2.1</hadoop.version>
        <fasterxml.jackson.version>2.10.0</fasterxml.jackson.version>
        <project.build.scope>compile</project.build.scope>
    </properties>
​
    <repositories>
        <repository>
            <id>emr-6.5.0-artifacts</id>
            <name>EMR 6.5.0 Releases Repository</name>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
            <url>https://s3.us-west-1.amazonaws.com/us-west-1-emr-artifacts/emr-6.5.0/repos/maven/</url>
        </repository>
    </repositories>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>software.amazon.awssdk</groupId>
                <artifactId>bom</artifactId>
                <version>2.17.186</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
​
    <dependencies>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>s3</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>kms</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>s3control</artifactId>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>${fasterxml.jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>${fasterxml.jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-annotations</artifactId>
            <version>${fasterxml.jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.12.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.15</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.13</version>
        </dependency>
    </dependencies>
​
    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>${scala.maven.plugin.version}</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>${maven.assembly.plugin.version}</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
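
With the scala-maven-plugin and maven-assembly-plugin configured above, running mvn clean package should produce a *-jar-with-dependencies.jar under target/; that fat jar is the artifact used for the spark-submit runs described later.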

Configuration Files

Create a resources directory and add the following configuration files.

core-site.xml

<configuration>
    <property>
        <name>fs.s3a.aws.credentials.provider</name>
        <value>
            org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
        </value>
    </property>
<!--    <property>-->
<!--        <name>fs.s3a.access.key</name>-->
<!--        <description>AWS access key ID.-->
<!--            Omit for IAM role-based or provider-based authentication.</description>-->
<!--        <value>AKIA4ZNT6QH3L45V45VY</value>-->
<!--    </property>-->
<!--    <property>-->
<!--        <name>fs.s3a.secret.key</name>-->
<!--        <description>AWS secret key.-->
<!--            Omit for IAM role-based or provider-based authentication.</description>-->
<!--        <value>og8I6vB52vDhhb/So/r9ioHMvtbJ4EI2xdGPQIce</value>-->
<!--    </property>-->
</configuration>
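
If you would rather not ship a core-site.xml inside the jar, the same Hadoop properties can be passed through Spark using the spark.hadoop. prefix, which Spark copies into the Hadoop Configuration. A minimal sketch with placeholder values (e.g. inside a main method or spark-shell):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("S3AConfDemo")
  .master("local[*]")
  // Equivalent to the core-site.xml entries above; spark.hadoop.* is forwarded to the Hadoop Configuration.
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
  .config("spark.hadoop.fs.s3a.access.key", "<your-access-key>")
  .config("spark.hadoop.fs.s3a.secret.key", "<your-secret-key>")
  .getOrCreate()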

log4j.properties

################################################################################
#  Licensed to the Apache Software Foundation (ASF) under one
#  or more contributor license agreements.  See the NOTICE file
#  distributed with this work for additional information
#  regarding copyright ownership.  The ASF licenses this file
#  to you under the Apache License, Version 2.0 (the
#  "License"); you may not use this file except in compliance
#  with the License.  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
# limitations under the License.
################################################################################
log4j.rootLogger=info, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n