Reading Data from S3 with Spark

This post walks through a Spark error hit when reading data from S3, which is usually caused by a missing dependency. The fix is either the `--packages` flag, which lets Spark resolve the missing dependency via Maven, or the `--jars` flag to add dependencies that were already downloaded. A reference link covering S3 usage from other languages is included at the end.


Starting from the Spark official Quick Start, with the input file source changed slightly:
ref: http://spark.apache.org/docs/latest/quick-start.html

package myspark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class LogAnalyser {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Credentials for the s3n filesystem (replace with your own)
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET");

        // Read every .log object in the bucket
        String logFile = "s3n://bucket/*.log";

        // Cache the RDD, since it is scanned twice below
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("b");
            }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        sc.stop();
    }
}
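Stripped of the Spark machinery, the two filter-and-count passes above just count lines containing a given substring. A plain-Java sketch of that logic, using made-up sample lines in place of the RDD (a local illustration only, not part of the Spark job):

```java
import java.util.List;

public class CountDemo {
    public static void main(String[] args) {
        // Stand-in for the lines the RDD would hold
        List<String> lines = List.of("alpha", "beta", "gamma", "bravo");

        // Same predicates as the two Spark filters above
        long numAs = lines.stream().filter(s -> s.contains("a")).count();
        long numBs = lines.stream().filter(s -> s.contains("b")).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
```

The difference is that Spark evaluates the same predicates in parallel across partitions of the input, which is why the RDD is cached: without `.cache()`, each `count()` would re-read the S3 objects.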

Package the project as test-0.1.0.jar and submit it to Spark:

SPARK_HOME/bin/spark-submit --class myspark.LogAnalyser \
--master local[4] build/libs/test-0.1.0.jar

The job fails with:

No FileSystem for scheme: s3n

Cause and fix:

This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you can use the `--packages` parameter and Spark will use Maven to locate the missing dependencies and distribute them to the cluster. Alternately, you can use `--jars` if you have already downloaded the dependencies manually. These parameters also work with the spark-submit script.

SPARK_HOME/bin/spark-submit --class myspark.LogAnalyser \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 \
--master local[4] build/libs/test-0.1.0.jar
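If the jars are already available locally, the `--jars` alternative mentioned above looks like the following. The local paths are placeholders (an assumption; point them at wherever the jars actually live), and the hadoop-aws version should match the Hadoop build your Spark distribution ships with:

```shell
# --jars takes local jar paths instead of Maven coordinates
SPARK_HOME/bin/spark-submit --class myspark.LogAnalyser \
  --jars /path/to/aws-java-sdk-1.7.4.jar,/path/to/hadoop-aws-2.7.2.jar \
  --master local[4] build/libs/test-0.1.0.jar
```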

For S3 usage from other languages, see: https://sparkour.urizone.net/recipes/using-s3/
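As an aside, on Hadoop 2.7+ the s3a connector is the maintained successor to s3n. A hedged configuration sketch of the same setup using s3a (key names per the Hadoop S3A documentation; swap these lines into the program above):

```java
// s3a credential keys differ from the s3n ones used above
sc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_KEY_ID");
sc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET");

// Note the s3a:// scheme in place of s3n://
String logFile = "s3a://bucket/*.log";
```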
