Sorting the List returned by rdd.collect(): Exception in thread "main" java.lang.UnsupportedOperationException

This post shows how to use Apache Spark to process an Apache access log and count requests per hour. Using the Java API, the data is read, split, converted to key-value pairs, summed, and sorted to produce per-hour hit counts. The run hit an UnsupportedOperationException, which was fixed with a small code change; part of the output is shown at the end.


Following a video tutorial, I wrote a small data-aggregation demo that counts page hits per hour. The code is below:

package com.billy.test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.*;

public class SparkTest {

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setMaster("local[*]").setAppName("RDD");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = sc.textFile("files/apache.log");

        // Split out the hour field: splitting a log line on ":" makes index 1 the hour
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {

            @Override
            public Iterator<String> call(String s) throws Exception {
                return Arrays.asList(s.split(":")[1]).iterator();
            }
        });
        // Convert each hour into a (hour, 1) key-value pair
        JavaPairRDD<String,Integer> wordsMap = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<>(s, 1);
            }
        });
        // Sum the counts for each hour
        JavaPairRDD<String,Integer> wordsReduce = wordsMap.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer arg0, Integer arg1) throws Exception {
                return arg0 + arg1;
            }
        });
        //JavaPairRDD<String, Integer> sortByKey = wordsReduce.sortByKey(true);
        List<Tuple2<String, Integer>> collect = wordsReduce.collect();
        System.out.println("collect0: " + collect);

        //collect = new ArrayList(collect);
        System.out.println("collect1: " + collect);

        //System.out.println(sorted);
        Comparator<? super Tuple2<String, Integer>> myComparator = new Comparator<>() {
            @Override
            public int compare(Tuple2<String, Integer> t1, Tuple2<String, Integer> t2) {
                return t1._2.compareTo(t2._2);
            }
        };
        collect.sort(myComparator);
        System.out.println("collect2: " + collect);
    }
}

The program kept failing with Exception in thread "main" java.lang.UnsupportedOperationException. After reading this article:
https://blog.youkuaiyun.com/Tracycater/article/details/77592472?locationNum=2&fps=1
I learned that converting an array to a List has some pitfalls, and checking the source of rdd.collect() confirmed that collect does indeed return an array
(screenshot of the collect() source omitted here)
so the List obtained from collect has to be converted before it can be sorted in place. In the code above, uncommenting the line
collect = new ArrayList(collect); makes the error go away.
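For reference, here is a minimal sketch of that driver-side fix (it reuses wordsReduce from the demo above and assumes java.util.ArrayList, java.util.Comparator, and java.util.List are imported); copying the collected list into a mutable ArrayList is exactly what the uncommented line does:

        // The list returned by collect() may be a read-only wrapper, so copy it
        // into a mutable ArrayList, then sort in place by the count field.
        List<Tuple2<String, Integer>> sortable = new ArrayList<>(wordsReduce.collect());
        sortable.sort(Comparator.comparing((Tuple2<String, Integer> t) -> t._2));
        System.out.println(sortable);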
Printed output:

collect0: [(06,366), (20,486), (19,493), (15,496), (00,361), (02,365), (04,355), (22,346), (17,484), (13,475), (11,459), (08,345), (14,498), (09,364), (21,453), (18,478), (16,473), (03,354), (07,357), (12,462), (05,371), (01,360), (10,443), (23,356)]
collect1: [(06,366), (20,486), (19,493), (15,496), (00,361), (02,365), (04,355), (22,346), (17,484), (13,475), (11,459), (08,345), (14,498), (09,364), (21,453), (18,478), (16,473), (03,354), (07,357), (12,462), (05,371), (01,360), (10,443), (23,356)]
collect2: [(08,345), (22,346), (03,354), (04,355), (23,356), (07,357), (01,360), (00,361), (09,364), (02,365), (06,366), (05,371), (10,443), (21,453), (11,459), (12,462), (16,473), (13,475), (18,478), (17,484), (20,486), (19,493), (15,496), (14,498)]
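The commented-out sortByKey(true) line in the demo hints at another option: sort inside Spark before collecting, so the driver never needs a mutable copy at all. A sketch of that approach, not part of the original post, using Java 8 lambdas instead of the anonymous classes above:

        // RDD-side sort by count: swap (hour, count) to (count, hour), sort by the
        // Integer key, swap back, then collect the already-sorted result.
        List<Tuple2<String, Integer>> sortedByCount = wordsReduce
                .mapToPair((Tuple2<String, Integer> t) -> new Tuple2<Integer, String>(t._2, t._1))
                .sortByKey(true)
                .mapToPair((Tuple2<Integer, String> t) -> new Tuple2<String, Integer>(t._2, t._1))
                .collect();
        System.out.println(sortedByCount);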

An excerpt of apache.log:

83.149.9.216 - - 17/05/2015:10:05:03 +0000 GET /presentations/logstash-monitorama-2013/images/kibana-search.png
83.149.9.216 - - 17/05/2015:10:05:43 +0000 GET /presentations/logstash-monitorama-2013/images/kibana-dashboard3.png
83.149.9.216 - - 17/05/2015:10:05:47 +0000 GET /presentations/logstash-monitorama-2013/plugin/highlight/highlight.js
83.149.9.216 - - 17/05/2015:10:05:12 +0000 GET /presentations/logstash-monitorama-2013/plugin/zoom-js/zoom.js
83.149.9.216 - - 17/05/2015:10:05:07 +0000 GET /presentations/logstash-monitorama-2013/plugin/notes/notes.js
83.149.9.216 - - 17/05/2015:10:05:34 +0000 GET /presentations/logstash-monitorama-2013/images/sad-medic.png
83.149.9.216 - - 17/05/2015:10:05:57 +0000 GET /presentations/logstash-monitorama-2013/css/fonts/Roboto-Bold.ttf
83.149.9.216 - - 17/05/2015:10:05:50 +0000 GET /presentations/logstash-monitorama-2013/css/fonts/Roboto-Regular.ttf
83.149.9.216 - - 17/05/2015:10:05:24 +0000 GET /presentations/logstash-monitorama-2013/images/frontend-response-codes.png
83.149.9.216 - - 17/05/2015:10:05:50 +0000 GET /presentations/logstash-monitorama-2013/images/kibana-dashboard.png
83.149.9.216 - - 17/05/2015:10:05:46 +0000 GET /presentations/logstash-monitorama-2013/images/Dreamhost_logo.svg
83.149.9.216 - - 17/05/2015:10:05:11 +0000 GET /presentations/logstash-monitorama-2013/images/kibana-dashboard2.png
83.149.9.216 - - 17/05/2015:10:05:19 +0000 GET /presentations/logstash-monitorama-2013/images/apache-icon.gif
83.149.9.216 - - 17/05/2015:10:05:33 +0000 GET /presentations/logstash-monitorama-2013/images/nagios-sms5.png
83.149.9.216 - - 17/05/2015:10:05:00 +0000 GET /presentations/logstash-monitorama-2013/images/redis.png
83.149.9.216 - - 17/05/2015:10:05:25 +0000 GET /presentations/logstash-monitorama-2013/images/elasticsearch.png
83.149.9.216 - - 17/05/2015:10:05:59 +0000 GET /presentations/logstash-monitorama-2013/images/logstashbook.png
83.149.9.216 - - 17/05/2015:10:05:30 +0000 GET /presentations/logstash-monitorama-2013/images/github-contributions.png
83.149.9.216 - - 17/05/2015:10:05:53 +0000 GET /presentations/logstash-monitorama-2013/css/print/paper.css
83.149.9.216 - - 17/05/2015:10:05:24 +0000 GET /presentations/logstash-monitorama-2013/images/1983_delorean_dmc-12-pic-38289.jpeg
83.149.9.216 - - 17/05/2015:10:05:54 +0000 GET /presentations/logstash-monitorama-2013/images/simple-inputs-filters-outputs.jpg
83.149.9.216 - - 17/05/2015:10:05:33 +0000 GET /presentations/logstash-monitorama-2013/images/tiered-outputs-to-inputs.jpg
83.149.9.216 - - 17/05/2015:10:05:56 +0000 GET /favicon.ico
24.236.252.67 - - 17/05/2015:10:05:40 +0000 GET /favicon.ico
93.114.45.13 - - 17/05/2015:10:05:14 +0000 GET /articles/dynamic-dns-with-dhcp/
93.114.45.13 - - 17/05/2015:10:05:04 +0000 GET /reset.css
93.114.45.13 - - 17/05/2015:10:05:45 +0000 GET /style2.css
93.114.45.13 - - 17/05/2015:10:05:14 +0000 GET /favicon.ico
93.114.45.13 - - 17/05/2015:10:05:17 +0000 GET /images/jordan-80.png
93.114.45.13 - - 17/05/2015:10:05:21 +0000 GET /images/web/2009/banner.png
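
As a quick sanity check (a standalone snippet, not from the original post), splitting one of the sample lines on ":" shows why index 1 in the flatMap above is the hour:

        // Splitting on ":" gives ["24.236.252.67 - - 17/05/2015", "10", "05", "40 +0000 GET /favicon.ico"],
        // so parts[1] is the two-digit hour that becomes the key in the pair RDD.
        String line = "24.236.252.67 - - 17/05/2015:10:05:40 +0000 GET /favicon.ico";
        String hour = line.split(":")[1];   // "10"
        System.out.println(hour);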