Spark Knowledge Notes

Latest recommended article published on 2025-07-11 16:58:13

ILovePythonhao

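License: CC 4.0 BY-SA

Tags: spark

Article link: https://blog.youkuaiyun.com/ILovePythonhao/article/details/108111050

This article covers some more advanced Spark usage: using accumulators correctly, understanding the difference between collect and take, avoiding out-of-memory errors, and renaming DataFrame columns in several ways. It also touches on resource management and tuning on YARN, and on problems with reading data and displaying output.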


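A variable defined in a method on the driver is only reliably usable on the driver: each task on an executor works on its own copy, so updates made during the computation are never sent back. To update a value from within the computation and read the result on the driver, define an accumulator.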
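The difference between a plain driver variable and an accumulator can be sketched as follows. This is a minimal illustration, not code from the original notes: the object name `AccumulatorSketch`, the method `countEvens`, and the toy input are all hypothetical, and it assumes a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the names are illustrative only.
object AccumulatorSketch {
  // Counts the even numbers in `numbers` with a LongAccumulator.
  def countEvens(spark: SparkSession, numbers: Seq[Int]): Long = {
    val sc = spark.sparkContext
    val rdd = sc.parallelize(numbers, numSlices = 4)

    // Anti-pattern: each task mutates its own deserialized copy of
    // `plainCounter`, so the driver's copy is not reliably updated.
    var plainCounter = 0L
    rdd.foreach(n => if (n % 2 == 0) plainCounter += 1)

    // Accumulator: tasks add to it, and the driver reads the merged value.
    val evenCount = sc.longAccumulator("evenCount")
    rdd.foreach(n => if (n % 2 == 0) evenCount.add(1))
    evenCount.value.longValue
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for demonstration only
      .appName("accumulator-sketch")
      .getOrCreate()
    println(countEvens(spark, 1 to 10))
    spark.stop()
  }
}
```

Accumulator updates made inside an action such as foreach are applied once per successful task, which is why the sketch updates the accumulator in foreach rather than in a lazy transformation.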
Printing with rdd.foreach(println) or rdd.map(println) causes no problems on a single machine, but it does in cluster mode. There, the closure runs on the executors, so println writes to each executor's stdout rather than the driver's, and stdout on the driver shows nothing. (rdd.map(println) is also lazy, so on its own it prints nothing until an action runs.) To print all elements on the driver, first bring the RDD to the driver node with collect(): rdd.collect().foreach(println). Because collect() gathers the results from all executors onto a single machine, this can cause the driver to run out of memory. If you only need to print a few elements of the RDD, a safer approach is take(): rdd.take(100).foreach(println). Unlike collect(), take(n) brings only the first n elements to the driver: on an RDD it scans partitions incrementally until it has n elements, and on a Dataset it collects the result of a limit, as far as I can tell from the source.
A closer look at collect. The Dataset class documents the method as follows: "Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError." In other words, collect gathers the results from all executors into the driver, so with a very large dataset the driver's memory will blow up.
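To make the contrast between collect and take concrete, here is a small sketch. The object name `TakeVsCollectSketch`, the method `firstFew`, and the generated data are assumptions for illustration; only `parallelize`, `take`, and `collect` are Spark APIs.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the names are illustrative only.
object TakeVsCollectSketch {
  // Returns only the first n elements to the driver instead of the whole RDD.
  def firstFew(spark: SparkSession, n: Int): Array[Int] = {
    val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)

    // rdd.collect() would move all one million elements into driver memory.
    // rdd.take(n) moves only n elements, which is safer for inspection.
    rdd.take(n)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for demonstration only
      .appName("take-vs-collect-sketch")
      .getOrCreate()
    // Printing happens on the driver because take() already returned a local array.
    firstFew(spark, 5).foreach(println)
    spark.stop()
  }
}
```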
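Spark on YARN requests resources as containers, each with memory and CPU. If containers are released as soon as they are no longer needed, frequent requesting and releasing itself consumes a lot of resources; holding on to the resources speeds things up considerably for small files, where acquiring resources can take much longer than the computation itself. I suspect this description is not quite right and will correct it later, because I have read elsewhere that Spark on YARN requests its resources once, up front. As far as I can tell, both behaviors exist: with the default static allocation, executors are requested when the application starts and held until it ends, whereas with dynamic allocation (spark.dynamicAllocation.enabled=true) idle executors are released and requested again as the workload changes.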
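The two ways of handling resources correspond, as far as I understand, to Spark's static and dynamic allocation settings. The sketch below only builds the configurations; the object name `AllocationConfSketch` and all numeric values are hypothetical and should be tuned for a real cluster. The configuration keys themselves are standard Spark properties.

```scala
import org.apache.spark.SparkConf

// Hypothetical example object; the values are placeholders, not recommendations.
object AllocationConfSketch {
  // Static allocation: a fixed set of executors is requested at startup
  // and held for the lifetime of the application.
  def staticConf(): SparkConf = new SparkConf()
    .set("spark.executor.instances", "4")
    .set("spark.executor.memory", "2g")
    .set("spark.executor.cores", "2")

  // Dynamic allocation: executors are requested and released with the workload.
  // The external shuffle service is commonly enabled alongside it on YARN.
  def dynamicConf(): SparkConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "1")
    .set("spark.dynamicAllocation.maxExecutors", "10")
    .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
}
```

Either configuration can be passed to `SparkSession.builder().config(conf)` or expressed as `--conf` flags on spark-submit.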
Two ways to rename a column:

//Read data from the priors table in Hive's default database
val priors = spark.sql("select * from priors")  
//First way to rename: an alias inside selectExpr
val prodCnt = priors.groupBy("product_id").count().selectExpr("product_id","count as prod_cnt")
//Second way: withColumnRenamed
val prodCnt = priors.groupBy("product_id").count().withColumnRenamed("count","prod_cnt")

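Doing the groupBy directly in SQL

val prodCnt = spark.sql("select product_id,count(1) as prod_cnt from priors group by product_id")
//The result is the same as above, but the drawback is that you never hold the original raw data, which can make later processing less convenient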

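import org.apache.spark.sql.functions.{avg, sum}

val productRodCnt1 = priors.selectExpr("product_id","cast(reordered as int) as reordered").groupBy("product_id").agg(sum("reordered").as("prod_sum_rod"),avg("reordered").as("prod_rod_rate"))
//Yet another way to rename a column (.as inside agg), plus a way to convert a column's type (cast inside selectExpr)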
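The renaming approaches in these notes can be tried side by side on a toy dataset. This is a sketch under stated assumptions: the object name `RenameSketch` and the in-memory rows are made up, standing in for the Hive table `priors`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, sum}

// Hypothetical example object; the toy rows replace the real priors table.
object RenameSketch {
  def run(spark: SparkSession): Seq[DataFrame] = {
    import spark.implicits._
    val priors = Seq((1, 0), (1, 1), (2, 1)).toDF("product_id", "reordered")

    // 1. Alias inside selectExpr.
    val bySelectExpr = priors.groupBy("product_id").count()
      .selectExpr("product_id", "count as prod_cnt")

    // 2. withColumnRenamed on the aggregated result.
    val byRename = priors.groupBy("product_id").count()
      .withColumnRenamed("count", "prod_cnt")

    // 3. Name the aggregate directly with .as inside agg.
    val byAgg = priors.groupBy("product_id")
      .agg(sum("reordered").as("prod_sum_rod"), avg("reordered").as("prod_rod_rate"))

    Seq(bySelectExpr, byRename, byAgg)
  }
}
```

The first two results have the same schema (product_id, prod_cnt); the third shows that naming the column at aggregation time avoids a separate renaming step.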

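val ordersNew1 = orders.selectExpr("*","if(days_since_prior_order='',0,days_since_prior_order) as dspo").drop("days_since_prior_order")//handle missing values: replace empty strings with 0

orders.na.fill(0.0)//handle null and NaN values in numeric columns (returns a new DataFrame; orders itself is unchanged)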
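Empty strings, nulls, and NaN are different kinds of missing value, and the two snippets above each handle only some of them. One possible way to combine the steps is sketched below; the object name `MissingValuesSketch` and the toy rows are hypothetical, and the approach (nullif plus cast, then na.fill) is my own suggestion rather than the method used in the original notes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; the toy rows stand in for the real orders table.
object MissingValuesSketch {
  def clean(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val orders = Seq[(String, String)](("1", ""), ("2", "7.0"), ("3", null))
      .toDF("order_id", "days_since_prior_order")

    orders
      // Turn empty strings into null, then cast the text to a numeric column.
      .selectExpr("order_id",
        "cast(nullif(days_since_prior_order, '') as double) as dspo")
      // Replace the remaining null/NaN values in that column with 0.0.
      .na.fill(0.0, Seq("dspo"))
  }
}
```

Restricting na.fill to named columns keeps it from silently touching other numeric columns.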

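//Collect the de-duplicated products across all orders of the same user
import org.apache.spark.sql.functions.collect_set
import spark.implicits._ //needed for toDF on an RDD
//First way: directly with the DataFrame API
up.groupBy("user_id").agg(collect_set("product_id").as("prod_uni_cnt")).show()
//Second way: with the RDD API
up.rdd.map(x=>(x(0).toString,x(1).toString)).groupByKey().mapValues(_.toSet.mkString(",")).toDF("user_id","prod_uni_cnt")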

//Per user: the total number of products and the number of distinct products
val userProRcdSize = op
  .rdd
  .map(x => (x(0).toString, x(1).toString))
  .groupByKey()
  .mapValues { record =>
    val rs = record.toSet
    //(distinct count, distinct products as a comma-separated string, total count)
    (rs.size, rs.mkString(","), record.size)
  }.toDF("user_id", "tuple")
  .selectExpr("user_id", "tuple._1 as prod_dist_cnt", "tuple._2 as prod_records", "tuple._3 as prod_size")
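The same per-user statistics can probably also be expressed with DataFrame aggregations, which avoids groupByKey on the RDD. This is a sketch, not the original author's code: the object name `UserProductStatsSketch` is hypothetical, and it assumes `op` has the columns user_id and product_id.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_set, concat_ws, count, lit, size}

// Hypothetical example object; assumes columns named user_id and product_id.
object UserProductStatsSketch {
  def stats(op: DataFrame): DataFrame = {
    // Cast to string so concat_ws can join the set regardless of the column type.
    val products = collect_set(col("product_id").cast("string"))
    op.groupBy("user_id").agg(
      size(products).as("prod_dist_cnt"),            // number of distinct products
      concat_ws(",", products).as("prod_records"),   // distinct products, comma-separated
      count(lit(1)).as("prod_size")                  // total number of records
    )
  }
}
```

Note that collect_set skips null product_id values, whereas the RDD version above would fail on a null when calling toString.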

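Setting the file path when Spark reads a file:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
When a global variable defined on the driver has to be updated by computations on the executors, and the result then returned to the driver for display, use an accumulator.
Serialization: when you define a class and an operator running on the executors needs to call one of its methods or read one of its fields, the class must be serializable, because the instance is shipped to the executors as part of the task closure.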
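A short sketch of how the path scheme decides where Spark looks for a file. The object name `ReadPathSketch` and the example paths in the comments are hypothetical; `spark.read.textFile` is the standard Dataset reader.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical example object; the paths in the comments are placeholders.
object ReadPathSketch {
  // Reads a text file as a Dataset of lines.
  // "file:///data/input.txt" is a local path: in cluster mode the file must
  //   exist at that same path on every worker node.
  // "hdfs:///data/input.txt" is a shared filesystem path visible to all nodes.
  def readLines(spark: SparkSession, path: String): Dataset[String] =
    spark.read.textFile(path)
}
```

A path without a scheme is resolved against the cluster's default filesystem, so writing the scheme explicitly avoids surprises when the same job runs both locally and on YARN.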
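A minimal sketch of the serialization requirement. The class `Multiplier` and the object `SerializationSketch` are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper class. Extending Serializable lets Spark ship
// instances of it to the executors inside a task closure.
class Multiplier(val factor: Int) extends Serializable {
  def apply(x: Int): Int = x * factor
}

// Hypothetical example object.
object SerializationSketch {
  def scale(spark: SparkSession, factor: Int): Array[Int] = {
    val m = new Multiplier(factor)
    // The lambda captures `m`, so `m` is serialized and sent to the executors.
    // Without `extends Serializable`, Spark would reject the job with a
    // "Task not serializable" error.
    spark.sparkContext.parallelize(1 to 5).map(x => m(x)).collect()
  }
}
```

When only one field is needed, copying it into a local val before the operator (for example `val f = m.factor`) means only that value is captured, and the enclosing class does not have to be serializable.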