Reading S3 Parquet with Spark and Writing to a Hudi Table
References
Hadoop-aws
Apache Hadoop Amazon Web Services support – Hadoop-AWS module: Integration with Amazon Web Services
EMR component versions
Amazon EMR release 6.5.0 - Amazon EMR
EMR Spark
EMR Hudi
Differences and relationships between S3, S3N, and S3A
First, the three schemes differ in the maximum file size they can handle.
Second, S3 is block-based, while S3N/S3A are object-based.
Finally, S3A is the access scheme recommended by Apache; the original S3 scheme is gradually being replaced and AWS discourages its use, while S3A is more stable, secure, and efficient.
S3 Block FileSystem (URI scheme: s3) A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.
S3A (URI scheme: s3a) A successor to S3 Native (s3n), the s3a filesystem uses Amazon's libraries to interact with S3. This allows S3A to support larger files (no more 5 GB limit), higher-performance operations, and more. The filesystem is intended as a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a:// simply by replacing the URL scheme.
S3 Native FileSystem (URI scheme: s3n) A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3.
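In practice, moving from s3n:// to s3a:// only requires swapping the URL scheme, provided hadoop-aws and its bundled AWS SDK are on the classpath. A minimal sketch of that idea is shown below; the bucket, keys, and prefix are placeholders, and the complete, runnable example follows in the next section.

import org.apache.spark.sql.SparkSession

object S3ASchemeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("S3ASchemeSketch").getOrCreate()
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.access.key", "<access-key>")   // placeholder credentials
    hc.set("fs.s3a.secret.key", "<secret-key>")
    hc.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn")
    // Data previously addressed as s3n://<bucket>/<prefix> is readable by switching the scheme to s3a://
    spark.read.parquet("s3a://<bucket>/<prefix>").printSchema()
    spark.stop()
  }
}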
Reading and writing S3 Parquet files with Spark
Test code
package org.zero

import com.amazonaws.auth.{ClasspathPropertiesFileCredentialsProvider, DefaultAWSCredentialsProviderChain}
import org.apache.log4j.{Level, Logger}
import org.slf4j.LoggerFactory
import org.utils.SparkUtils
import software.amazon.awssdk.auth.credentials.{EnvironmentVariableCredentialsProvider, ProfileCredentialsProvider}

object SparkS2Test {
  private var logger: org.slf4j.Logger = _

  def main(args: Array[String]): Unit = {
    logger = LoggerFactory.getLogger(this.getClass.getSimpleName)
    Logger.getLogger("org.apache.hadoop").setLevel(Level.INFO)
    Logger.getLogger("org.apache.spark").setLevel(Level.INFO)
    Logger.getLogger("org.spark_project.jetty").setLevel(Level.WARN)

    val start = System.currentTimeMillis()
    logger.warn(s"=================== Spark reading S3 ===================")

    val spark = SparkUtils.getSparkSession(this.getClass.getSimpleName, "local[*]")
    val sc = spark.sparkContext
    // S3A credentials and endpoint for the China (Ningxia) region bucket
    sc.hadoopConfiguration.set("fs.s3a.access.key", "AKIA4ZNT6QH3L45V45VY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "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")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn")

    // Read the partitioned Parquet data from S3
    val dataframe = spark
      .read
      .parquet("s3a://s3-datafacts-poc-001/dct/s3-datafacts-poc-001/dt=2022-05-09")

    // Cache the data and query it through a temporary view
    val tmpCache = dataframe.cache()
    tmpCache.createOrReplaceTempView("parquet_tmp_view")
    val dataFrame2 = spark.sql("select * from parquet_tmp_view limit 10")
    dataFrame2.show
    // dataFrame2.write.parquet("F:\\tmp\\output")

    spark.stop()
    val end = System.currentTimeMillis()
    logger.warn(s"=================== Elapsed: ${(end - start) / 1000} s ===================")
  }
}
package org.utils

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}

object SparkUtils {
  private val sparkConf: SparkConf = new SparkConf()

  def getSparkConf(appName: String, master: String): SparkConf = {
    sparkConf.setMaster(master).setAppName(appName)
  }

  def getSparkSession(appName: String, master: String): SparkSession = {
    sparkConf.setMaster(master).setAppName(appName)
    sparkSessionInit
  }

  // Shared SparkSession with Kryo serialization and Parquet tuning
  lazy val sparkSessionInit: SparkSession = SparkSession.builder()
    .config(sparkConf)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.io.compression.codec", "snappy")
    .config("spark.rdd.compress", "true")
    .config("spark.hadoop.parquet.writer.version", "v2")
    .config("spark.sql.parquet.enableVectorizedReader", "false")
    .config("spark.sql.parquet.compression.codec", "snappy")
    .config("spark.sql.parquet.filterPushdown", "true")
    .config("spark.sql.parquet.mergeSchema", "true")
    .config("spark.sql.parquet.binaryAsString", "true")
    .getOrCreate()
}
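The write that is commented out above targets the local filesystem; writing back to S3 goes through the same s3a:// scheme via the DataFrame writer. A minimal sketch, reusing SparkUtils from above (the keys and output prefix are placeholders, not values from this article):

import org.apache.spark.sql.SaveMode
import org.utils.SparkUtils

object SparkS3WriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkUtils.getSparkSession("SparkS3WriteSketch", "local[*]")
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.access.key", "<access-key>")   // placeholder credentials
    hc.set("fs.s3a.secret.key", "<secret-key>")
    hc.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn")

    val df = spark.read.parquet("s3a://s3-datafacts-poc-001/dct/s3-datafacts-poc-001/dt=2022-05-09")
    // Write a sample of the data back to S3 as Parquet; the target prefix is made up for illustration
    df.limit(10).write.mode(SaveMode.Overwrite).parquet("s3a://<bucket>/<output-prefix>/")
    spark.stop()
  }
}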
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>spark-s3-hudi-test</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>spark-s3-hudi-test</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.maven.plugin.version>4.3.0</scala.maven.plugin.version>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.assembly.plugin.version>3.1.1</maven.assembly.plugin.version>
        <scala.version>2.12.13</scala.version>
        <scala.binary.version>2.12</scala.binary.version>
        <spark.version>3.1.2</spark.version>
        <hadoop.version>3.2.1</hadoop.version>
        <fasterxml.jackson.version>2.10.0</fasterxml.jackson.version>
        <project.build.scope>compile</project.build.scope>
    </properties>

    <repositories>
        <repository>
            <id>emr-6.5.0-artifacts</id>
            <name>EMR 6.5.0 Releases Repository</name>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
            <url>https://s3.us-west-1.amazonaws.com/us-west-1-emr-artifacts/emr-6.5.0/repos/maven/</url>
        </repository>
    </repositories>

    <dependencyManagement>
        <dependencies>
            <!-- AWS SDK v2 BOM keeps the s3/kms/s3control artifact versions in sync -->
            <dependency>
                <groupId>software.amazon.awssdk</groupId>
                <artifactId>bom</artifactId>
                <version>2.17.186</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>s3</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>kms</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>s3control</artifactId>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${project.build.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>${fasterxml.jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>${fasterxml.jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-annotations</artifactId>
            <version>${fasterxml.jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.12.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.15</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.13</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>${scala.maven.plugin.version}</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>${maven.assembly.plugin.version}</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Configuration files
Create a resources directory and add the following configuration files.
core-site.xml
<configuration>
    <property>
        <name>fs.s3a.aws.credentials.provider</name>
        <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
    </property>
    <!--
    <property>
        <name>fs.s3a.access.key</name>
        <description>AWS access key ID.
            Omit for IAM role-based or provider-based authentication.</description>
        <value>AKIA4ZNT6QH3L45V45VY</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <description>AWS secret key.
            Omit for IAM role-based or provider-based authentication.</description>
        <value>og8I6vB52vDhhb/So/r9ioHMvtbJ4EI2xdGPQIce</value>
    </property>
    -->
</configuration>
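With fs.s3a.aws.credentials.provider set as above, the access key and secret key still have to be supplied somewhere, either in the commented-out properties or programmatically as the test code does. Another option is to switch the provider to one that reads the standard AWS environment variables, so no secrets live in code or config. A sketch of that approach, assuming the SparkUtils helper from earlier (the provider class name comes from the AWS SDK v1 that hadoop-aws depends on; the bucket path is hypothetical):

import org.utils.SparkUtils

object EnvCredentialsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkUtils.getSparkSession("EnvCredentialsSketch", "local[*]")
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.endpoint", "s3.cn-northwest-1.amazonaws.com.cn")
    // Delegate credential lookup to AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in the environment
    hc.set("fs.s3a.aws.credentials.provider",
      "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
    spark.read.parquet("s3a://<bucket>/<prefix>").show(10)   // hypothetical path
    spark.stop()
  }
}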
log4j.properties
################################################################################
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################
log4j.rootLogger=info, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout