Commonly Used Spark RDD Operations and Demos

This post walks through the most commonly used Spark RDD transformations and actions, illustrating each with a short Scala snippet and the console output you would see when running it in spark-shell.
Transformations:

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
val list=sc.parallelize(List(('a',1),('a',2),('b',3),('b',4)))
val result = list.map(x => (x._1,x._2+1))
for(each <- result){
  print(each)
}
console:(a,2)(a,3)(b,4)(b,5)

val list=sc.parallelize(List(1,2,3,4,5))
val result = list.map(_+1)
for(each <- result){
  print(each)
}
console:23456


filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
val list=sc.parallelize(List(1,2,3,4,5,6))
val result = list.filter(_%2==0)
for(each <- result){
  print(each)
}
console:246 

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
eg:
val list = sc.parallelize(List("abc","def"))
val result = list.flatMap(_.toList)
for(each <- result){
  print(each)
}
console:abcdef

union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
val list=sc.parallelize(List(1,2,3))
val list1=sc.parallelize(List(4,5,6))
val result = list.union(list1)
for(each <- result){
  print(each)
}
console:123456
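
Note that union does not remove duplicates (unlike SQL's UNION); chain distinct() if deduplication is needed. A quick sketch of this behavior with overlapping inputs:
val a=sc.parallelize(List(1,2,3))
val b=sc.parallelize(List(3,4))
print(a.union(b).collect().mkString(" "))
console:1 2 3 3 4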

join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin (a leftOuterJoin sketch follows the example below).
val list1=sc.parallelize(List(('a',1),('a',2),('b',3),('b',4),('c',4)))
val list2=sc.parallelize(List(('a',5),('a',6),('b',7),('b',8)))
for(each <- list1.join(list2)){
  print(each+" ")
}
console:(a,(1,5)) (a,(1,6)) (a,(2,5)) (a,(2,6)) (b,(3,7)) (b,(3,8)) (b,(4,7)) (b,(4,8)) 
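
The outer-join variants wrap the possibly-missing side in an Option. A minimal leftOuterJoin sketch on similar sample data (output order may vary):
val list1=sc.parallelize(List(('a',1),('b',3),('c',4)))
val list2=sc.parallelize(List(('a',5),('b',7)))
for(each <- list1.leftOuterJoin(list2)){
  print(each+" ")
}
console:(a,(1,Some(5))) (b,(3,Some(7))) (c,(4,None))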

intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.
val list1=sc.parallelize(List(('a',1),('a',5),('b',3),('b',4),('c',4)))
val list2=sc.parallelize(List(('a',5),('a',6),('b',4),('b',8)))
for(each <- list1.intersection(list2)){
  print(each+" ")
}
console:(b,4) (a,5)

distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
val list1=sc.parallelize(List(('a',1),('a',1),('b',3),('b',4),('c',4)))
for(each <- list1.distinct()){
  print(each + " ")
}
console:(a,1) (b,4) (b,3) (c,4) 

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance (see the combineByKey sketch below).
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
val list1=sc.parallelize(List(('a',1),('a',2),('b',3),('b',4),('c',4)))
for(each <- list1.groupByKey()){
  print(each + " ")
}
console:(a,CompactBuffer(1, 2)) (b,CompactBuffer(3, 4)) (c,CompactBuffer(4)) 
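
Since the note above recommends combineByKey for per-key aggregation, here is a minimal sketch computing the average value per key; its three arguments create, update, and merge a (sum, count) accumulator (output order may vary):
val list1=sc.parallelize(List(('a',1),('a',2),('b',3),('b',4),('c',4)))
val avg = list1.combineByKey(
  (v: Int) => (v, 1),                                            // createCombiner: start a (sum, count) pair
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue: fold a value into the pair
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)   // mergeCombiners: combine across partitions
).mapValues{ case (sum, cnt) => sum.toDouble / cnt }
for(each <- avg){
  print(each + " ")
}
console:(a,1.5) (b,3.5) (c,4.0)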

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
val list1=sc.parallelize(List(('a',1),('a',2),('b',3),('b',4),('c',4)))
for(each <- list1.reduceByKey(_+_)){
  print(each + " ")
}
console:(a,3) (b,7) (c,4) 

sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
val list1=sc.parallelize(List(('a',1),('e',2),('b',3),('d',4),('c',4)))
for(each <- list1.sortByKey(false)){
  print(each + " ")
}
console:(e,2) (d,4) (c,4) (b,3) (a,1) 

Actions:

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
eg:
val rdd=sc.parallelize(List(1,2,3,4))
print(rdd.reduce(_+_))
console:10

collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
val rdd=sc.parallelize(List(1,2,3,4))
val result = rdd.filter(_%2==0).collect()
for(each <- result){
  print(each + " ")
}
console:2 4

count(): Return the number of elements in the dataset.
val rdd = sc.parallelize(List(1,2,3,4))
print(rdd.count())
console:4

first(): Return the first element of the dataset (similar to take(1)).
val rdd = sc.parallelize(List(1,2,3,4))
print(rdd.first())
console:1

take(n): Return an array with the first n elements of the dataset. Note that this is currently not executed in parallel; instead, the driver program computes all the elements.
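
A quick sketch in the same style as the examples above; take returns an Array at the driver:
val rdd = sc.parallelize(List(1,2,3,4))
for(each <- rdd.take(2)){
  print(each + " ")
}
console:1 2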

