Spark Learning 13: Getting the Number of Partitions of an RDD
Original post: http://blog.youkuaiyun.com/xubo245/article/details/51475506
Spark 1.5.2
1. Explanation
Get the number of partitions of an RDD and the index of each partition.
Question: why does a plain-text file get as many partitions as it has HDFS blocks, while a .gz compressed file gets only 1 partition?
2. Code:
sc.textFile("/xubo/GRCH38Sub/GRCH38L12566578.fna").partitions.length
sc.textFile("/xubo/GRCH38Sub/GRCH38L12566578.fna.bwt").partitions.foreach(each=>println(each.index))
In Spark 1.6 the count can be obtained directly:
@Since("1.6.0")
final def getNumPartitions: Int = partitions.length
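As a minimal spark-shell sketch combining the two (the path is the example file used below; getNumPartitions only exists from Spark 1.6.0 on, so on 1.5.2 use partitions.length):
val rdd = sc.textFile("/xubo/GRCH38Sub/GRCH38L12566578.fna")
println(rdd.partitions.length)                  // works on Spark 1.5.2
// println(rdd.getNumPartitions)                // equivalent, but only available from Spark 1.6.0
rdd.partitions.foreach(p => println(p.index))   // print each partition's index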
3. Results:
(1) First file
Number of partitions:
scala> sc.textFile("/xubo/GRCH38Sub/GRCH38L12566578.fna").partitions.length
res2: Int = 7
Detailed information (partition indices; note that this command reads the .fna.bwt file, which also yields 7 partitions):
scala> sc.textFile("/xubo/GRCH38Sub/GRCH38L12566578.fna.bwt").partitions.foreach(each=>println(each.index))
0
1
2
3
4
5
6
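To check the claim that the partition count matches the HDFS block count, the block locations can be queried through the Hadoop FileSystem API. A minimal sketch for the spark-shell (this comparison is my own addition, reusing the example path from above):
import org.apache.hadoop.fs.{FileSystem, Path}
val p = new Path("/xubo/GRCH38Sub/GRCH38L12566578.fna")
val fs = FileSystem.get(sc.hadoopConfiguration)
val status = fs.getFileStatus(p)
val blockCount = fs.getFileBlockLocations(status, 0, status.getLen).length   // HDFS blocks backing the file
val partitionCount = sc.textFile(p.toString).partitions.length              // partitions Spark creates for it
println(s"blocks = $blockCount, partitions = $partitionCount")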
(2) Second file:
scala> sc.textFile(file).partitions.foreach(each=>println(each.index))
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
(3) Third file:
For the .gz file, however, even though its size is similar, the number of partitions is 1:
scala> sc.textFile("/xubo/data/GRCH38/GCA_000001405.15_GRCh38_full_analysis_set.fna.bwa_index.tar.gz").partitions.length
res5: Int = 1
index:
scala> sc.textFile("/xubo/data/GRCH38/GCA_000001405.15_GRCh38_full_analysis_set.fna.bwa_index.tar.gz").partitions.foreach(each=>println(each.index))
0


(4) Large file (3 GB), same result:
scala> val file="/xubo/data/GRCH38/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.bowtie_index.tar.gz"
file: String = /xubo/data/GRCH38/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.bowtie_index.tar.gz
scala> sc.textFile(file).partitions.foreach(each=>println(each.index))
0
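Gzip is not a splittable format, so textFile hands the whole file to a single task; if parallelism is needed after reading, the RDD can be repartitioned. A minimal sketch reusing the .gz path from example (3) (the target of 8 partitions is an arbitrary example value):
// Reading a .gz file yields a single partition
val gz = sc.textFile("/xubo/data/GRCH38/GCA_000001405.15_GRCh38_full_analysis_set.fna.bwa_index.tar.gz")
println(gz.partitions.length)     // 1
// Shuffle the records into more partitions for downstream parallelism
val spread = gz.repartition(8)    // 8 is an arbitrary example value
println(spread.partitions.length) // 8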


4. I originally intended to add a function or property to RDD for getting the number of partitions, but after reading the source I found that it was already added in 1.6:
/**
 * Returns the number of partitions of this RDD.
 */
@Since("1.6.0")
final def getNumPartitions: Int = partitions.length
I am still not sure why files occupying a similar number of HDFS blocks end up with different numbers of partitions; a likely explanation is that gzip is not a splittable codec, so Hadoop's TextInputFormat assigns the whole compressed file to a single input split (and therefore a single partition) regardless of how many blocks it spans. This remains to be verified.
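For splittable (uncompressed) text files the partition count is not hard-wired to the block count either: textFile accepts a minPartitions hint as its second argument. A minimal sketch (20 is an arbitrary example value; the hint is only a lower bound and has no effect on non-splittable .gz input):
// Ask for at least 20 input splits when reading a splittable text file
val rdd = sc.textFile("/xubo/GRCH38Sub/GRCH38L12566578.fna", 20)
println(rdd.partitions.length)   // typically >= 20 for a sufficiently large splittable file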