Spark Partitioning on Disk with partitionBy

Spark data partitioning strategies and best practices
This article covers how to partition data on disk with Spark's partitionBy, examines how repartition and coalesce affect in-memory partitions, and shows how to tune partitioning by capping the number of files and the number of rows per file. It stresses the importance of correct partitioning for query performance and offers strategies for avoiding the small file problem.


Spark writers let you partition data on disk with partitionBy. Some queries can run 50 to 100 times faster on a partitioned data lake, so partitioning is vital for certain queries.

Creating and maintaining partitioned data lakes is hard.

This blog post discusses how to use partitionBy and explains the challenges of partitioning production-sized datasets on disk. It also covers different memory partitioning tactics that make partitionBy operate more efficiently.

You'll need to master the concepts covered here to create partitioned data lakes on large datasets, especially if you're dealing with a high-cardinality or heavily skewed partition key.

Make sure to read Writing Beautiful Spark Code for a detailed overview of how to create production-grade partitioned lakes.

Memory partitioning vs. disk partitioning

coalesce() and repartition() change the memory partitions of a DataFrame.

partitionBy() is a DataFrameWriter method that specifies whether the data should be written to disk in nested folders. By default, Spark does not write data to disk in nested folders.

Memory partitioning is often independent of disk partitioning. In order to write data on disk properly, you'll almost always need to repartition the data in memory first.
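Here's a minimal sketch of the distinction, assuming df is a DataFrame with a country column (like the one loaded in the example below) and an illustrative output path:

// repartition() and coalesce() only change how the DataFrame is split up in memory.
val repartitioned = df.repartition(5)
println(repartitioned.rdd.getNumPartitions) // 5 memory partitions, nothing written yet

// partitionBy() controls the folder layout once the data is written out to disk.
repartitioned
  .write
  .partitionBy("country")
  .parquet("./tmp/example_lake/") // creates country=... subfolders on disk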

Simple example

Suppose we have the following CSV file with first_name, last_name, and country columns:

first_name,last_name,country
Ernesto,Guevara,Argentina
Vladimir,Putin,Russia
Maria,Sharapova,Russia
Bruce,Lee,China
Jack,Ma,China

Let's partition this data on disk using country as the partition key, and let's create one file per disk partition.

val path = new java.io.File("./src/main/resources/ss_europe/").getCanonicalPath
val df = spark
  .read
  .option("header", "true")
  .option("charset", "UTF8")
  .csv(path)
val outputPath = new java.io.File("./tmp/partitioned_lake1/").getCanonicalPath
df
  .repartition(col("country"))
  .write
  .partitionBy("country")
  .parquet(outputPath)

Here's what the data looks like on disk:

partitioned_lake1/
  country=Argentina/
    part-00044-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
  country=China/
    part-00059-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
  country=Russia/
    part-00002-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet

Creating one file per disk partition is not going to work for production-sized datasets. Suppose the China partition contains 100 GB of data; we can't write all of that data to a single file.

partitionBy with repartition(5)

Let's run repartition(5) to spread the rows across separate memory partitions before running partitionBy, and see how that impacts the files that get written out to disk.

val outputPath = new java.io.File("./tmp/partitioned_lake2/").getCanonicalPath
df
  .repartition(5)
  .write
  .partitionBy("country")
  .parquet(outputPath)

Here's what the files look like on disk:

partitioned_lake2/
  country=Argentina/
    part-00003-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
  country=China/
    part-00000-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
    part-00004-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
  country=Russia/
    part-00001-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
    part-00002-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet

The partitionBy writer will write out files on disk for each memory partition. The maximum number of files written out is the number of unique countries multiplied by the number of memory partitions.

In this example, we have 3 unique countries * 5 memory partitions, so up to 15 files could get written out (if each memory partition had one Argentinian, one Chinese, and one Russian person). We only have 5 rows of data, so only 5 files are written in this example.
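If you want to double-check the file counts, here's a small sketch (a plain JVM directory listing, reusing the outputPath defined above) that tallies the part files written into each country folder:

val lakeDir = new java.io.File(outputPath)
lakeDir.listFiles().filter(_.isDirectory).foreach { countryDir =>
  // Count the part files that partitionBy wrote into this country=... folder.
  val numFiles = countryDir.listFiles().count(_.getName.startsWith("part-"))
  println(s"${countryDir.getName}: $numFiles file(s)")
}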

partitionBy with repartition(1)

If we repartition the data to one memory partition before partitioning on disk with partitionBy, then we’ll write out a maximum of three files. numMemoryPartitions * numUniqueCountries = maxNumFiles. 1 * 3 = 3.

Let’s take a look at the code.

val outputPath = new java.io.File("./tmp/partitioned_lake2/").getCanonicalPath
df
  .repartition(1)
  .write
  .partitionBy("country")
  .parquet(outputPath)

Here’s what the files look like on disk:

partitioned_lake3/
  country=Argentina/
    part-00000-bc6ce757-d39f-489e-9677-0a7105b29e66.c000.snappy.parquet
  country=China/
    part-00000-bc6ce757-d39f-489e-9677-0a7105b29e66.c000.snappy.parquet
  country=Russia/
    part-00000-bc6ce757-d39f-489e-9677-0a7105b29e66.c000.snappy.parquet
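
As a side note on the design choice: repartition(1) forces a full shuffle down to a single memory partition, while coalesce(1), mentioned earlier, merges the existing memory partitions without a full shuffle and would yield the same one-file-per-country layout here. A minimal sketch:

df
  .coalesce(1) // collapse the existing memory partitions instead of shuffling
  .write
  .partitionBy("country")
  .parquet(outputPath)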

Partitioning datasets with a max number of files per partition

Let’s use a dataset with 80 people from China, 15 people from France, and 5 people from Cuba. Here’s a link to the data.

Here’s what the data looks like:

person_name,person_country
a,China
b,China
c,China
...77 more China rows
a,France
b,France
c,France
...12 more France rows
a,Cuba
b,Cuba
c,Cuba
...2 more Cuba rows

Let’s create 8 memory partitions and scatter the data randomly across the memory partitions (we’ll write out the data to disk, so we can inspect the contents of a memory partition).

 val outputPath = new java.io.File("./tmp/repartition_for_lake4/").getCanonicalPath
df
  .repartition(8, col("person_country"), rand)
  .write
  .csv(outputPath)

Let’s look at one of the CSV files that is outputted:

p,China
f1,China
n1,China
a2,China
b2,China
d2,China
e2,China
f,France
c,Cuba

This technique helps us set a maximum number of files per partition when creating a partitioned lake. Let’s write out the data to disk and observe the output.

val outputPath = new java.io.File("./tmp/partitioned_lake4/").getCanonicalPath
df
  .repartition(8, col("person_country"), rand)
  .write
  .partitionBy("person_country")
  .csv(outputPath)   

Here’s what the files look like on disk:

partitioned_lake4/
  person_country=China/
    part-00000-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
    part-00001-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
    ... 6 more files
  person_country=Cuba/
    part-00002-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
    part-00003-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
    ... 2 more files
  person_country=France/
    part-00000-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
    part-00001-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
    ... 5 more files

Each disk partition will have up to 8 files. The data is split randomly in the 8 memory partitions. There won’t be any output files for a given disk partition if the memory partition doesn’t have any data for the country.

This is better, but still not ideal. We have four files for Cuba and seven files for France, so too many small files are being created.

Let’s review the contents of our memory partition from earlier:

p,China
f1,China
n1,China
a2,China
b2,China
d2,China
e2,China
f,France
c,Cuba

partitionBy will split up this particular memory partition into three files: one China file with 7 rows of data, one France file with one row of data, and one Cuba file with one row of data.

Partitioning dataset with max rows per file

Let’s write some code that’ll create partitions with ten rows of data per file. We’d like our data to be stored in 8 files for China, one file for Cuba, and two files for France.

We can use the maxRecordsPerFile option to output files with 10 rows.

val outputPath = new java.io.File("./tmp/partitioned_lake5/").getCanonicalPath
df
  .repartition(col("person_country"))
  .write
  .option("maxRecordsPerFile", 10)
  .partitionBy("person_country")
  .csv(outputPath)

This technique is particularly important for partition keys that are highly skewed. The number of inhabitants by country is a good example of a partition key with high skew. For example, Jamaica has 3 million people and China has 1.4 billion people, so we'll want roughly 467 times more files in the China partition than in the Jamaica partition.
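To make the ~467x figure concrete, here's the back-of-the-envelope arithmetic (the row counts assume one row per inhabitant, and the maxRecordsPerFile value is a hypothetical cap picked purely for illustration):

val jamaicaRows = 3000000L          // ~3 million people
val chinaRows   = 1400000000L       // ~1.4 billion people
val maxRecordsPerFile = 3000000L    // hypothetical cap for this example

val jamaicaFiles = math.ceil(jamaicaRows.toDouble / maxRecordsPerFile).toLong // 1 file
val chinaFiles   = math.ceil(chinaRows.toDouble / maxRecordsPerFile).toLong   // 467 files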

Partitioning dataset with max rows per file pre Spark 2.2

The maxRecordsPerFile option was added in Spark 2.2, so you’ll need to write your own custom solution if you’re using an earlier version of Spark.

import org.apache.spark.sql.functions.{col, rand}
import org.apache.spark.sql.types.IntegerType

val countDF = df.groupBy("person_country").count()
val desiredRowsPerPartition = 10
val joinedDF = df
  .join(countDF, Seq("person_country"))
  .withColumn(
    "my_secret_partition_key",
    (rand(10) * col("count") / desiredRowsPerPartition).cast(IntegerType)
  )
val outputPath = new java.io.File("./tmp/partitioned_lake6/").getCanonicalPath
joinedDF
  .repartition(col("person_country"), col("my_secret_partition_key"))
  .drop("count", "my_secret_partition_key")
  .write
  .partitionBy("person_country")
  .csv(outputPath)

We calculate the total number of records per partition key and then create a my_secret_partition_key column rather than relying on a fixed number of partitions.
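To see why that expression yields roughly the right number of buckets, here's the arithmetic for the China rows in our example (80 rows, 10 desired rows per file):

// rand(10) returns a value in [0, 1), so for China:
//   rand(10) * count / desiredRowsPerPartition falls in [0, 80 / 10) = [0, 8).
// Casting to an integer gives keys 0 through 7, i.e. ~8 buckets of ~10 rows each,
// and repartition() turns each (person_country, bucket) pair into its own output file.
val chinaCount = 80
val desiredRowsPerPartition = 10
val expectedBuckets = math.ceil(chinaCount.toDouble / desiredRowsPerPartition).toInt // 8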

You should choose the desiredRowsPerPartition based on what will give you ~1 GB files. If you have a 500 GB dataset with 750 million rows, set desiredRowsPerPartition to 1,500,000.
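Here's a rough sketch of that sizing arithmetic, using the numbers from the example above:

val totalBytes      = 500L * 1024 * 1024 * 1024      // ~500 GB dataset
val totalRows       = 750000000L                     // 750 million rows
val bytesPerRow     = totalBytes / totalRows         // ~715 bytes per row
val targetFileBytes = 1L * 1024 * 1024 * 1024        // aim for ~1 GB files
val desiredRowsPerPartition = targetFileBytes / bytesPerRow // ~1.5 million rows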

Small file problem

Partitioned data lakes can quickly develop a small file problem when they’re updated incrementally. It’s hard to compact partitioned data lakes. As we’ve seen, it’s even hard to make a partitioned data lake!

Use the tactics outlined in this blog post to build your partitioned data lakes and start them off without the small file problem!

Conclusion

Partitioned data lakes can be much faster to query (when filtering on the partition keys) because they allow for massive data skipping.
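For example, a query like the sketch below (reusing the country-partitioned lake from the first example) only has to read the country=China folder instead of scanning the entire dataset:

import org.apache.spark.sql.functions.col

val people = spark.read.parquet("./tmp/partitioned_lake1/")
people
  .filter(col("country") === "China") // partition pruning: only country=China is scanned
  .show()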

Creating and maintaining partitioned data lakes is challenging, but the performance gains make them a worthwhile effort.
