Spark Core Study Notes

Original · Latest recommended article published 2023-04-26 19:40:36 · 435 views

License: CC 4.0 BY-SA

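This article looks in detail at caching in Spark Core, including the use of `persist` and `cache`, and stresses their importance when data is reused across operations. It covers RDD serialization options, storage levels, fault tolerance, and `unpersist`. It discusses the effect of serialization on memory and CPU and recommends the default deserialized storage when memory allows. It also covers how broadcast variables reduce data copying and improve efficiency, along with trade-offs and best practices from real work.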

persisting / caching
Cached data can be reused across operations.
persist is optional (English vocabulary note: "optional").
persist: persist an RDD
so that it is computed once and then reused.

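Spark programming flow
Read in a text file

sc.textFile("")
then apply transformations
to the text that was read in.

Note: the cached data is larger than the file that was read because it is stored without serialization (as deserialized objects), which inflates its size.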

scala> rdd2.count
res2: Long = 3

scala> rdd2.count
res3: Long = 3

scala> rdd2.ca
cache canEqual cartesian

scala> rdd2.cache
res4: org.apache.spark.rdd.RDD[String] = file:///root/data/hello.txt MapPartitionsRDD[1] at textFile at :24

scala> rdd2.count
res5: Long = 3

scala> rdd2.count
res6: Long = 3

Reusing a cached RDD in later actions usually makes them much faster, often by more than 10x.

cache
A key tool for iterative algorithms
and for fast interactive use.

cache is lazy.

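Memory Deserialized 1x Replicated: in memory, not serialized
Memory Serialized 1x Replicated: in memory, serialized

Replication guards against server failure: if a node goes down, another copy can take over immediately.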

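If a dataset will be used again, cache it.

Raw data before ETL does not need caching. After ETL you typically have a wide table (say 100 fields), and it is usually this transformed data that gets cached.
Caching pays off when several computations precede the cache point.

cache takes no arguments: it calls the no-argument persist(), which in turn calls persist with the default storage level (MEMORY_ONLY).

The first computation happens at the first action.

This can be described in two cases.

fault-tolerant: if a cached partition is lost, it is recomputed automatically.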

scala> rdd2.unpersist
:26: error: missing argument list for method unpersist in class RDD
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing unpersist _ or unpersist(_) instead of unpersist.
rdd2.unpersist
^

scala> rdd2.unpersist()
res8: org.apache.spark.rdd.RDD[String] @scala.reflect.internal.annotations.uncheckedBounds = file:///root/data/hello.txt MapPartitionsRDD[1] at textFile at :24

cache() is lazy.

unpersist() is eager: it takes effect immediately.
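
The lazy/eager difference can be sketched as follows. This is a minimal sketch, not the transcript's exact session: it assumes a local master and a small in-memory collection standing in for hello.txt, and the object and method names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheLazinessSketch {
  // Returns the storage level observed after cache() and after unpersist().
  def run(sc: SparkContext): (StorageLevel, StorageLevel) = {
    // Stand-in for sc.textFile(...); the three lines are invented sample data.
    val rdd = sc.parallelize(Seq("hello spark", "hello scala", "hello world"))

    rdd.cache()                 // lazy: only marks the RDD as MEMORY_ONLY
    val afterCache = rdd.getStorageLevel
    rdd.count()                 // first action: computes the RDD and fills the cache
    rdd.count()                 // later actions can read the cached partitions

    rdd.unpersist()             // eager: no action is needed to drop the cache
    val afterUnpersist = rdd.getStorageLevel
    (afterCache, afterUnpersist)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-laziness-sketch").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    val (afterCache, afterUnpersist) = run(sc)
    println(s"after cache(): $afterCache")
    println(s"after unpersist(): $afterUnpersist")
    sc.stop()
  }
}
```

In spark-shell the existing sc can be passed straight to CacheLazinessSketch.run(sc).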

serialized
deserialized: not serialized
replication: the replica mechanism

scala> rdd2.persist(StorageLevel.MEMORY_ONLY_SER)
:26: error: not found: value StorageLevel
rdd2.persist(StorageLevel.MEMORY_ONLY_SER)
^

scala> import org.apache.spark.rdd.StorageLevel
:23: error: object StorageLevel is not a member of package org.apache.spark.rdd
import org.apache.spark.rdd.StorageLevel
^

scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> rdd2.persist(StorageLevel.MEMORY_ONLY_SER)
res10: org.apache.spark.rdd.RDD[String] = file:///root/data/hello.txt MapPartitionsRDD[1] at textFile at :24

scala> rdd2.count
res11: Long = 3

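Serialization saves memory, at the cost of a much higher CPU load.
It is a trade-off.

The serialized in-memory mode is not recommended; use the default.

If your RDD does not fit in memory
(RDD does not fit in memory),
some partitions will not be cached.

In practice the default is still the usual choice.

"Store the partitions that don't fit on disk": partitions that do not fit in memory are written to disk.

In practice we do not cache to disk.

Caching to disk is often no better than recomputing from scratch.
Serialized storage is more space-efficient, especially with a fast serializer,

but it is more CPU-intensive.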

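I recognize every one of these words and have read the page several times, so why did I not see this information? Teacher Ruo split the material into many layers and presented it one layer at a time.
My understanding of some of the English is also still shaky; this only comes with hands-on experience.

Replication: Python is not used in production.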

Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
Shuffle output is persisted automatically.
Which Storag

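MapReduce has a similar buffer that spills to disk.

trade-offs (English vocabulary note)

The trade-off is between memory usage and CPU efficiency:
the smaller the memory footprint, the larger the CPU overhead.

If you do not even have the memory, why use Spark at all? Spark is memory-hungry.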

Don't spill to disk: do not write data to disk unless the computation is expensive.
Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

Replication is generally not used, so the points that matter most are:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
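
A minimal sketch of that second option, assuming Kryo as the "fast serialization library" and a local master. The object name, app name, and generated records are illustrative assumptions, not part of the original notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SerializedCacheSketch {
  // The serializer has to be set on the conf before the context is created.
  def kryoConf(): SparkConf =
    new SparkConf()
      .setAppName("serialized-cache-sketch")
      .setMaster("local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  // Caches generated records in serialized form and returns their count.
  def run(sc: SparkContext): Long = {
    val rdd = sc.parallelize(1 to 100000).map(i => s"record-$i")
    rdd.persist(StorageLevel.MEMORY_ONLY_SER) // stored as serialized bytes in memory
    val n = rdd.count()                        // first action materializes the cache
    rdd.unpersist()
    n
  }

  def main(args: Array[String]): Unit = {
    // If a context already exists (as in spark-shell), getOrCreate reuses it
    // and the Kryo setting above is not applied.
    val sc = SparkContext.getOrCreate(kryoConf())
    println(run(sc))
    sc.stop()
  }
}
```

Comparing the Storage tab of the web UI for MEMORY_ONLY and MEMORY_ONLY_SER on the same data is a simple way to see the space saving that the CPU cost buys.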

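Remember the trade-offs and the commonly used storage levels.

checkpoint is basically never used.
In Spark Streaming development it did not meet our needs, so I have never used checkpoint.

OFF_HEAP is also rarely used.

Shared variables are very useful.
map and reduce both take a function as an argument.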

val mapVar = new HashMap()  // holds many entries

val rdd = sc.textFile("…")

rdd.map(x => {
  …mapVar…
})

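map > task < executor: the function passed to map runs inside tasks, and tasks run inside executors.
Tasks run in parallel,
and every task has to hold its own copy of
mapVar.

By default one copy is shipped with each task.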

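1000 tasks × 10 MB each ≈ 10 GB.
The fix is a broadcast variable.
Shipping a copy per task is very inefficient.
Shared variables come in two kinds: broadcast variables and accumulators. Execution is distributed across processes, so a plain pointer to a driver-side object cannot be shared.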
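
The arithmetic behind those numbers, as a small sketch. The 1000 tasks and 10 MB come from the note above; the 20 executors is an assumed cluster size used only to show the contrast, and the object name is made up.

```scala
object BroadcastSavings {
  // Total megabytes shipped when each of `copies` holders receives its own copy.
  def shippedMb(copies: Int, sizeMb: Int): Long = copies.toLong * sizeMb

  def main(args: Array[String]): Unit = {
    val perTask = shippedMb(1000, 10)     // one copy per task: 10000 MB, roughly 10 GB
    val perExecutor = shippedMb(20, 10)   // one copy per executor (20 is an assumption): 200 MB
    println(s"per task: $perTask MB, per executor: $perExecutor MB")
  }
}
```

The saving grows with the number of tasks per executor, which is why the per-task copy is the inefficient case.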

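Accumulators answer questions such as how many records this run dropped, or what share of the data is dirty.

In production, the first key property of a broadcast variable is that it is a read-only value cached on each executor,
rather than a copy shipped inside every task.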
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

500M

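Distributed broadcast algorithms reduce the cost of broadcasting.

A join normally involves a shuffle.
To avoid the shuffle, ship one dataset as a broadcast variable (cached on the executors)
and match against it; compared with a shuffle, far less data is transferred.

A broadcast variable must not be too large: broadcast the small table.

The lookup relies on key-value operations.

Note that the broadcast data must be converted to key-value form.

How keys and values are handled: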

// inspect the structure
scala> val g5 = sc.parallelize(Array(("1", "豆豆"), ("1", "牛哥"), ("34", "进取")))
g5: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[2] at parallelize at :25

scala> val f11 = sc.parallelize( Array(("34", "深圳", 18), ("10", "北京", 2)) ).map(x => (x._1, x))
f11: org.apache.spark.rdd.RDD[(String, (String, String, Int))] = MapPartitionsRDD[4] at map at :25

scala> g5.join(f11)
res12: org.apache.spark.rdd.RDD[(String, (String, (String, String, Int)))] = MapPartitionsRDD[7] at join at :29

Functional style: this is quick to write, but it does not really solve the problem.

g5.join(f11).map(x=>{ x._1 +","+x._2._1+","+x._2._2._2 }).collect()

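Production experience:
In real work this style of coding is not allowed because it is hard to read.
I strongly advise against programming directly on the Core API. Most work is done through the SQL API, although a few things cannot be expressed in SQL.
Spark SQL code is far more readable than Core code.
Every application offered to users is something they can do directly in SQL, and the platform that makes this possible is what we develop.
Building that platform is not itself SQL work; writing SQL is for application-side users. The jobs to look for are the ones that build on the data platform.
Anything others can do in SQL runs on the platform underneath, and offline batch processing is handled entirely in SQL.

An ordinary join is one part of the data platform.

An array returned to the driver is awkward for lookups; a Map is much more convenient.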

For each iteration of your for loop, yield generates a value which will be remembered. It’s like the for loop has a buffer you can’t see, and for each iteration of your for loop, another item is added to that buffer. When your for loop finishes running, it will return this collection of all the yielded values. The type of the collection that is returned is the same type that you were iterating over, so a Map yields a Map, a List yields a List, and so on.

Also, note that the initial collection is not changed; the for/yield construct creates a new collection according to the algorithm you specify.
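
A plain-Scala sketch of the for/yield behavior described above, with no Spark involved. The sample rows and names are made up for illustration.

```scala
object ForYieldSketch {
  // Invented sample data: (id, city, count) rows and an id -> name lookup.
  val cities: List[(String, String, Int)] = List(("34", "Shenzhen", 18), ("10", "Beijing", 2))
  val names: Map[String, String] = Map("1" -> "Doudou", "34" -> "Jinqu")

  // Iterating over a List yields a new List; the guard keeps only ids found in `names`.
  val joined: List[(String, String, String)] =
    for ((id, city, _) <- cities if names.contains(id)) yield (id, names(id), city)

  // Iterating over a Map and yielding pairs gives back a Map.
  val upper: Map[String, String] =
    for ((id, name) <- names) yield (id, name.toUpperCase)

  def main(args: Array[String]): Unit = {
    println(joined)
    println(upper)
    println(cities) // the input collection is unchanged
  }
}
```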

val g5 = sc.parallelize(Array(("1", "豆豆"), ("1", "牛哥"), ("34", "进取"))).collectAsMap()
val f11 = sc.parallelize(Array(("34", "深圳", 18), ("10", "北京", 2))).map(x => (x._1, x))
val broadcastVar = sc.broadcast(g5)
// for comprehension, as run in spark-shell
f11.mapPartitions(x => {
  val g5Stus = broadcastVar.value
  for ((key, value) <- x if (g5Stus.contains(key))) yield (key, g5Stus.get(key), value._1, value._2, value._3)
}).collect()
yield collects the results into a new collection, which saves a lot of code.
A broadcast variable is used, so no shuffle occurs.

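Process: broadcast the variable, then use it inside map.

Accumulators. The defining property of a broadcast variable is that it is read-only.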

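Tasks can only add to an accumulator, for example incrementing it whenever a bad record is encountered. They are created from sc (for instance sc.longAccumulator); the Long and Double variants cover most cases, such as counting filtered records.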
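
A minimal sketch of counting dirty records with accumulators, assuming a local master. The input lines, accumulator names, and the rule that "dirty" means "does not parse as an integer" are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

object DirtyRecordCounter {
  // Returns (total records seen, records that failed to parse).
  def run(sc: SparkContext): (Long, Long) = {
    // Invented input: "x" and the empty string are the dirty records.
    val lines = sc.parallelize(Seq("1", "2", "x", "4", ""))
    val total = sc.longAccumulator("total records")
    val dirty = sc.longAccumulator("dirty records")

    // Updates are made inside an action (foreach); tasks only call add.
    lines.foreach { line =>
      total.add(1)
      if (Try(line.trim.toInt).isFailure) dirty.add(1)
    }

    // The values are read back on the driver.
    (total.value.longValue(), dirty.value.longValue())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dirty-record-counter").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    val (total, dirty) = run(sc)
    println(s"dirty ratio: ${dirty.toDouble / total}")
    sc.stop()
  }
}
```

Updating inside an action rather than a transformation is a deliberate choice here: the Spark documentation only guarantees that each task's update is applied once for updates made in actions.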

Working with accumulators:
scala> f11.mapPartitions(x => {
| val g5Stus = broadcastVar.value
| for ((key, value) <- x if (g5Stus.contains(key))) yield (key,g5Stus.get(key),value._1,value._2,value._3) }).collect()
res20: Array[(String, Option[String], String, String, Int)] = Array((34,Some(取),34,深圳,18))

scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 350, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))

scala> accum.value
res22: Long = 10

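Giving the accumulator a name makes its value show up in the web UI.

The accumulator's value is read on the driver. In the task table below the executor column also shows driver, because this run is in local mode.
Per-task accumulated values from the run: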
0 28 0 SUCCESS PROCESS_LOCAL driver localhost 2018/12/22 11:02:32 7 ms My Accumulator: 3
1 29 0 SUCCESS PROCESS_LOCAL driver localhost 2018/12/22 11:02:32 5 ms My Accumulator: 7

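You can also write a custom accumulator, for example one that concatenates strings, but this is rarely needed.

Run mode: always YARN (Spark on YARN).

With YARN, a node is marked unhealthy once its disk usage reaches 90%.
In that state jobs cannot be submitted.

Check disk space with
df -lh

Submission scripts and their contents
How to submit

Keep these separate: permissions, environment variables, Spark configuration, etc list

Core data processing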

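1:47:23 Spark startup log analysis
To add

// An important Spark optimization: nothing is configured here, so all the jars are uploaded on every submit
// Package the jars and conf into an archive and upload it once to the distributed cache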
18/12/22 18:59:21 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

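Zip everything into a single archive for the distributed cache.

Upload it and run from the distributed cache; the distributed cache lives on HDFS.

Spark on YARN execution flow

Distributed cluster theory; production uses client mode.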