GPU Accelerated Polars – Intuitively and Exhaustively Explained

Published 2025-12-16 00:12:50

License: CC BY-NC-SA 4.0

Original: towardsdatascience.com/gpu-accelerated-polars-intuitively-and-exhaustively-explained-e823a82f92a8

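I recently attended a secret demo run by the CUDA and Polars teams. They walked me through a metal detector, put a bag over my head, and drove me to a cabin in the woods of rural France. They took my phone, wallet, and passport to make sure I wouldn't leak anything before they finally showed off what they had been working on.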

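Or at least, that's how it felt. In reality it was a Zoom meeting, and they politely asked me not to say anything until a specified time, but as a technical writer, the mystery made me feel a bit like James Bond.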

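In this article we'll discuss what that meeting was about: a new execution engine in Polars that enables GPU-accelerated computation, allowing interactive operations on 100GB+ of data. We'll cover what a dataframe is in Polars, how GPU acceleration works with Polars dataframes, and how much of a performance boost you can expect from the new CUDA-powered execution engine.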

Who is this useful for? Anyone who works with data and wants to work faster.

How advanced is this post? This post contains simple but cutting-edge data engineering concepts. It’s relevant to readers of all levels.

Pre-requisites: None

Note: At the time of writing I am not affiliated with or endorsed by Polars or Nvidia in any way.

Polars In a Nutshell

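In Polars you can create and manipulate dataframes (which are like super-powered spreadsheets). Here I'm creating a simple dataframe containing some people's ages and the cities they live in.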

""" Creating a simple dataframe in polars
"""
import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Jill", "William"],
    "age": [25, 30, 35, 22, 40],
    "city": ["New York", "Los Angeles", "Chicago", "New York", "Chicago"]
})

print(df)

shape: (5, 3)
┌─────────┬─────┬─────────────┐
│ name    ┆ age ┆ city        │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ i64 ┆ str         │
╞═════════╪═════╪═════════════╡
│ Alice   ┆ 25  ┆ New York    │
│ Bob     ┆ 30  ┆ Los Angeles │
│ Charlie ┆ 35  ┆ Chicago     │
│ Jill    ┆ 22  ┆ New York    │
│ William ┆ 40  ┆ Chicago     │
└─────────┴─────┴─────────────┘

With this dataframe you can do things like filter by age.

""" Filtering the previously defined dataframe to only show rows that have
an age of over 28
"""
df_filtered = df.filter(pl.col("age") > 28)
print(df_filtered)

shape: (3, 3)
┌─────────┬─────┬─────────────┐
│ name    ┆ age ┆ city        │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ i64 ┆ str         │
╞═════════╪═════╪═════════════╡
│ Bob     ┆ 30  ┆ Los Angeles │
│ Charlie ┆ 35  ┆ Chicago     │
│ William ┆ 40  ┆ Chicago     │
└─────────┴─────┴─────────────┘

You can do math:

""" Creating a new column called "age_doubled" which is double the age
column.
"""
df = df.with_columns([
    (pl.col("age") * 2).alias("age_doubled")
])

print(df)

shape: (5, 4)
┌─────────┬─────┬─────────────┬─────────────┐
│ name    ┆ age ┆ city        ┆ age_doubled │
│ ---     ┆ --- ┆ ---         ┆ ---         │
│ str     ┆ i64 ┆ str         ┆ i64         │
╞═════════╪═════╪═════════════╪═════════════╡
│ Alice   ┆ 25  ┆ New York    ┆ 50          │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 60          │
│ Charlie ┆ 35  ┆ Chicago     ┆ 70          │
│ Jill    ┆ 22  ┆ New York    ┆ 44          │
│ William ┆ 40  ┆ Chicago     ┆ 80          │
└─────────┴─────┴─────────────┴─────────────┘

You can run aggregation functions, like calculating the average age in each city.

""" Calculating the average age by city
"""
df_aggregated = df.group_by("city").agg(pl.col("age").mean())
print(df_aggregated)

shape: (3, 2)
┌─────────────┬──────┐
│ city        ┆ age  │
│ ---         ┆ ---  │
│ str         ┆ f64  │
╞═════════════╪══════╡
│ Chicago     ┆ 37.5 │
│ New York    ┆ 23.5 │
│ Los Angeles ┆ 30.0 │
└─────────────┴──────┘

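Most people reading this are probably familiar with Pandas, the more popular dataframe library in Python. Before we explore GPU-accelerated Polars, I think it's useful to look at one important feature that sets Polars apart from Pandas.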

Polars LazyFrames

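Polars has two fundamental execution modes, "eager" and "lazy". An eager dataframe performs each computation immediately, exactly as it is called. If you add 2 to every value in a column and then add 3 to every value in that column, each operation runs just as you'd expect: 2 is added to every value, and then 3 is added to each of those values, at the exact moment you say those operations should happen.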

import polars as pl

# Create a DataFrame with a list of numbers
df = pl.DataFrame({
    "numbers": [1, 2, 3, 4, 5]
})

# Add 2 to every number and overwrite the original 'numbers' column
df = df.with_columns(
    pl.col("numbers") + 2
)

# Add 3 to the updated 'numbers' column
df = df.with_columns(
    pl.col("numbers") + 3
)

print(df)

shape: (5, 1)
┌─────────┐
│ numbers │
│ ---     │
│ i64     │
╞═════════╡
│ 6       │
│ 7       │
│ 8       │
│ 9       │
│ 10      │
└─────────┘

If we initialize our dataframe with the .lazy() function, we get a very different output.

import polars as pl

# Create a lazy DataFrame with a list of numbers
df = pl.DataFrame({
    "numbers": [1, 2, 3, 4, 5]
}).lazy() # <-------------------------- Lazy Initialization

# Add 2 to every number and overwrite the original 'numbers' column
df = df.with_columns(
    pl.col("numbers") + 2
)

# Add 3 to the updated 'numbers' column
df = df.with_columns(
    pl.col("numbers") + 3
)

print(df)

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

 WITH_COLUMNS:
 [[(col("numbers")) + (3)]]
   WITH_COLUMNS:
   [[(col("numbers")) + (2)]]
    DF ["numbers"]; PROJECT */1 COLUMNS; SELECTION: "None"

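Instead of a dataframe, we get a SQL-like expression that outlines the operations that need to run to produce the dataframe we want. We can call .collect() to actually run those computations and get our dataframe.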

print(df.collect())

shape: (5, 1)
┌─────────┐
│ numbers │
│ ---     │
│ i64     │
╞═════════╡
│ 6       │
│ 7       │
│ 8       │
│ 9       │
│ 10      │
└─────────┘

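At first glance this might not seem very useful: who cares in which part of the code all the computation happens? Nobody, really. The advantage of this system isn't when the computation happens, but what computation happens.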

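Before executing a lazy dataframe, Polars looks at the accumulated operations and finds any shortcuts that might speed up execution. This process is generally called "query optimization". For instance, if we create a lazy dataframe and then run a few operations on the data, we get a SQL-like expression: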

# Create a DataFrame with a list of numbers
df = pl.DataFrame({
    "col_0": [1, 2, 3, 4, 5],
    "col_1": [8, 7, 6, 5, 4],
    "col_2": [-1, -2, -3, -4, -5]
}).lazy()

#doing some random operations
df = df.filter(pl.col("col_0") > 0)
df = df.with_columns((pl.col("col_1") * 2).alias("col_1_double"))
df = df.group_by("col_2").agg(pl.sum("col_1_double"))

print(df)

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

AGGREGATE
 [col("col_1_double").sum()] BY [col("col_2")] FROM
   WITH_COLUMNS:
   [[(col("col_1")) * (2)].alias("col_1_double")]
    FILTER [(col("col_0")) > (0)] FROM

    DF ["col_0", "col_1", "col_2"]; PROJECT */3 COLUMNS; SELECTION: "None"

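But if we run .explain(optimized=True) on that dataframe, we get a different expression, one that Polars considers a better way to perform the same operations. Here the separate FILTER step is gone: the predicate has been folded into the SELECTION of the initial dataframe read.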

print(df.explain(optimized=True))

AGGREGATE
 [col("col_1_double").sum()] BY [col("col_2")] FROM
   WITH_COLUMNS:
   [[(col("col_1")) * (2)].alias("col_1_double")]
    DF ["col_0", "col_1", "col_2"]; PROJECT */3 COLUMNS; SELECTION: "[(col("col_0")) > (0)]"

This optimized expression is what actually runs when you call .collect() on a lazy dataframe.

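This isn't just fancy and fun; it can result in some pretty serious performance gains. Here I'm running the same operations on two identical dataframes, one eager and one lazy. I average the execution time over 10 runs and calculate the average difference in speed.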

"""Performing the same operations on the same data between two dataframes,
one with eager execution and one with lazy execution, and calculating the
difference in execution speed.
"""

import polars as pl
import numpy as np
import time

# Specifying constants
num_rows = 20_000_000  # 20 million rows
num_cols = 10          # 10 columns
n = 10  # Number of times to repeat the test

# Generate random data
np.random.seed(0)  # Set seed for reproducibility
data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_cols)}

# Define a function that works for both lazy and eager DataFrames
def apply_transformations(df):
    df = df.filter(pl.col("col_0") > 0)  # Filter rows where col_0 is greater than 0
    df = df.with_columns((pl.col("col_1") * 2).alias("col_1_double"))  # Double col_1
    df = df.group_by("col_2").agg(pl.sum("col_1_double"))  # Group by col_2 and aggregate
    return df

# Variables to store total durations for eager and lazy execution
total_eager_duration = 0
total_lazy_duration = 0

# Perform the test n times
for i in range(n):
    print(f"Run {i+1}/{n}")

    # Create fresh DataFrames for each run (polars operations can be in-place, so ensure clean DF)
    df1 = pl.DataFrame(data)
    df2 = pl.DataFrame(data).lazy()

    # Measure eager execution time
    start_time_eager = time.time()
    eager_result = apply_transformations(df1)  # Eager execution
    eager_duration = time.time() - start_time_eager
    total_eager_duration += eager_duration
    print(f"Eager execution time: {eager_duration:.2f} seconds")

    # Measure lazy execution time
    start_time_lazy = time.time()
    lazy_result = apply_transformations(df2).collect()  # Lazy execution
    lazy_duration = time.time() - start_time_lazy
    total_lazy_duration += lazy_duration
    print(f"Lazy execution time: {lazy_duration:.2f} seconds")

# Calculating the average execution time
average_eager_duration = total_eager_duration / n
average_lazy_duration = total_lazy_duration / n

#calculating how much faster lazy execution was
faster = (average_eager_duration-average_lazy_duration)/average_eager_duration*100

print(f"\nAverage eager execution time over {n} runs: {average_eager_duration:.2f} seconds")
print(f"Average lazy execution time over {n} runs: {average_lazy_duration:.2f} seconds")
print(f"Lazy took {faster:.2f}% less time")

Run 1/10
Eager execution time: 3.07 seconds
Lazy execution time: 2.70 seconds
Run 2/10
Eager execution time: 4.17 seconds
Lazy execution time: 2.69 seconds
Run 3/10
Eager execution time: 2.97 seconds
Lazy execution time: 2.76 seconds
Run 4/10
Eager execution time: 4.21 seconds
Lazy execution time: 2.74 seconds
Run 5/10
Eager execution time: 2.97 seconds
Lazy execution time: 2.77 seconds
Run 6/10
Eager execution time: 4.12 seconds
Lazy execution time: 2.80 seconds
Run 7/10
Eager execution time: 3.00 seconds
Lazy execution time: 2.72 seconds
Run 8/10
Eager execution time: 4.53 seconds
Lazy execution time: 2.76 seconds
Run 9/10
Eager execution time: 3.14 seconds
Lazy execution time: 3.08 seconds
Run 10/10
Eager execution time: 4.26 seconds
Lazy execution time: 2.77 seconds

Average eager execution time over 10 runs: 3.64 seconds
Average lazy execution time over 10 runs: 2.78 seconds
Lazy took 23.75% less time

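A 23.75% performance improvement is nothing to sneeze at, and it's enabled by lazy execution (which doesn't exist in Pandas). Under the hood, when you use a Polars lazy dataframe you're essentially defining a high-level computational graph that Polars performs all sorts of fancy magic on. After optimizing the query it executes the query, meaning you get the same result as you would with an eager dataframe, but usually faster.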

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/5686465ddadc0eac22cf9fc9bf675e0b.png

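A high-level breakdown of the operations kicked off after calling a query in Polars. It's worth noting that eager execution has many optimizations of its own, such as native multi-core support, which are present in and improved upon by lazy execution.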
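A side note on measurement: the benchmark above uses time.time() and averages every run, including the first. If you want to reproduce this kind of comparison yourself, a slightly more robust approach is to use time.perf_counter (a clock intended for measuring intervals), discard a warm-up run, and look at the median. The helper below is a sketch of that idea. The name time_call and the stand-in workload are my own and are not part of the original benchmark.

```python
import statistics
import time

def time_call(fn, repeats=5, warmup=1):
    """Return a list of wall-clock durations in seconds, one per timed call of fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# Hypothetical stand-in workload; swap in something like
# lambda: apply_transformations(lazy_df).collect() to time a Polars query.
durations = time_call(lambda: sum(i * i for i in range(100_000)))

print(f"median: {statistics.median(durations):.4f}s, min: {min(durations):.4f}s")
```

The median is less sensitive than the mean to an occasional slow run, like the roughly 3 versus 4 second swings in the eager timings above.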

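Personally, I'm a Pandas guy, and haven't really seen a compelling reason to switch. I thought "it's probably better, but probably not better enough for me to abandon my most fundamental tool". If you feel the same way, and a 23.75% bump doesn't raise an eyebrow, I certainly have some results to show you.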

Introducing GPU Execution in Polars

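This feature is hot off the press, so I'm not 100% sure how you'll be able to leverage GPU acceleration in your own environment. At the time of writing, I was given a wheel file (which is like a local library that you can install). I'm under the impression that, after this article is published, you'll be able to install Polars with GPU acceleration on your machine using the following command.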

pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com

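I also expect that, if that doesn't work, you'll be able to find some documentation on the Polars PyPI page. Regardless, once you're up and running you can start using the power of Polars on your GPU. The only thing you need to do is specify the GPU as the engine when you collect a lazy dataframe.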

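Building on the earlier test comparing eager and lazy execution, let's also compare lazy execution with the GPU engine. We can do that with the line results = df.collect(engine=gpu_engine), where gpu_engine is specified as follows: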

gpu_engine = pl.GPUEngine(
        device=0, # This is the default
        raise_on_fail=True, # Fail loudly if we can't run on the GPU.
    )

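The GPU execution engine doesn't support every Polars feature, and by default it falls back to the CPU. By setting raise_on_fail=True we specify that the code should throw an exception if GPU execution isn't supported. OK, here's the actual code.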

"""Performing the same operations on the same data between three dataframes,
one with eager execution, one with lazy execution, and one with lazy execution
and GPU acceleration. Calculating the difference in execution speed between the
three.
"""

import polars as pl
import numpy as np
import time

# Creating a large random DataFrame
num_rows = 20_000_000  # 20 million rows
num_cols = 10          # 10 columns
n = 10  # Number of times to repeat the test

# Generate random data
np.random.seed(0)  # Set seed for reproducibility
data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_cols)}

# Defining a function that works for both lazy and eager DataFrames
def apply_transformations(df):
    df = df.filter(pl.col("col_0") > 0)  # Filter rows where col_0 is greater than 0
    df = df.with_columns((pl.col("col_1") * 2).alias("col_1_double"))  # Double col_1
    df = df.group_by("col_2").agg(pl.sum("col_1_double"))  # Group by col_2 and aggregate
    return df

# Variables to store total durations for eager and lazy execution
total_eager_duration = 0
total_lazy_duration = 0
total_lazy_GPU_duration = 0

# Performing the test n times
for i in range(n):
    print(f"Run {i+1}/{n}")

    # Create fresh DataFrames for each run (polars operations can be in-place, so ensure clean DF)
    df1 = pl.DataFrame(data)
    df2 = pl.DataFrame(data).lazy()

    # Measure eager execution time
    start_time_eager = time.time()
    eager_result = apply_transformations(df1)  # Eager execution
    eager_duration = time.time() - start_time_eager
    total_eager_duration += eager_duration
    print(f"Eager execution time: {eager_duration:.2f} seconds")

    # Measure lazy execution time
    start_time_lazy = time.time()
    lazy_result = apply_transformations(df2).collect()  # Lazy execution
    lazy_duration = time.time() - start_time_lazy
    total_lazy_duration += lazy_duration
    print(f"Lazy execution time: {lazy_duration:.2f} seconds")

    # Defining GPU Engine
    gpu_engine = pl.GPUEngine(
        device=0, # This is the default
        raise_on_fail=True, # Fail loudly if we can't run on the GPU.
    )

    # Measure lazy execution time on the GPU engine
    start_time_lazy_GPU = time.time()
    lazy_result = apply_transformations(df2).collect(engine=gpu_engine)  # Lazy execution with GPU
    lazy_GPU_duration = time.time() - start_time_lazy_GPU
    total_lazy_GPU_duration += lazy_GPU_duration
    print(f"Lazy execution time: {lazy_GPU_duration:.2f} seconds")

# Calculating the average execution time
average_eager_duration = total_eager_duration / n
average_lazy_duration = total_lazy_duration / n
average_lazy_GPU_duration = total_lazy_GPU_duration / n

#calculating how much faster lazy execution was
faster_1 = (average_eager_duration-average_lazy_duration)/average_eager_duration*100
faster_2 = (average_lazy_duration-average_lazy_GPU_duration)/average_lazy_duration*100
faster_3 = (average_eager_duration-average_lazy_GPU_duration)/average_eager_duration*100

print(f"\nAverage eager execution time over {n} runs: {average_eager_duration:.2f} seconds")
print(f"Average lazy execution time over {n} runs: {average_lazy_duration:.2f} seconds")
print(f"Average lazy execution time over {n} runs: {average_lazy_GPU_duration:.2f} seconds")
print(f"Lazy was {faster_1:.2f}% faster than eager")
print(f"GPU was {faster_2:.2f}% faster than CPU Lazy and {faster_3:.2f}% faster than CPU eager")

Run 1/10
Eager execution time: 0.74 seconds
Lazy execution time: 0.66 seconds
Lazy execution time: 0.17 seconds
Run 2/10
Eager execution time: 0.72 seconds
Lazy execution time: 0.65 seconds
Lazy execution time: 0.17 seconds
Run 3/10
Eager execution time: 0.82 seconds
Lazy execution time: 0.76 seconds
Lazy execution time: 0.17 seconds
Run 4/10
Eager execution time: 0.81 seconds
Lazy execution time: 0.69 seconds
Lazy execution time: 0.18 seconds
Run 5/10
Eager execution time: 0.79 seconds
Lazy execution time: 0.66 seconds
Lazy execution time: 0.18 seconds
Run 6/10
Eager execution time: 0.75 seconds
Lazy execution time: 0.63 seconds
Lazy execution time: 0.18 seconds
Run 7/10
Eager execution time: 0.77 seconds
Lazy execution time: 0.72 seconds
Lazy execution time: 0.18 seconds
Run 8/10
Eager execution time: 0.77 seconds
Lazy execution time: 0.72 seconds
Lazy execution time: 0.17 seconds
Run 9/10
Eager execution time: 0.77 seconds
Lazy execution time: 0.72 seconds
Lazy execution time: 0.17 seconds
Run 10/10
Eager execution time: 0.77 seconds
Lazy execution time: 0.70 seconds
Lazy execution time: 0.17 seconds

Average eager execution time over 10 runs: 0.77 seconds
Average lazy execution time over 10 runs: 0.69 seconds
Average lazy execution time over 10 runs: 0.17 seconds
Lazy was 10.30% faster than eager
GPU was 74.78% faster than CPU Lazy and 77.38% faster than CPU eager

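(Note: this test is similar to the previous one, but it was run on a different, larger machine, so the execution times differ from the previous test.)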

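Yep. 74.78% faster, in the sense that the GPU run took 74.78% less time than lazy execution on the CPU. And this isn't even a particularly large dataset. One might expect even bigger performance gains on larger datasets.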
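Percentages like these describe how much less time a run took, which is easy to confuse with "how many times faster". Converting between the two is simple arithmetic: if a run takes p percent less time, the speedup factor is 1 / (1 - p/100). Here is a small sketch using the percentages printed above. The function name speedup_factor is my own.

```python
def speedup_factor(percent_less_time):
    """Convert 'takes X% less time' into an 'N times faster' factor."""
    remaining_fraction = 1 - percent_less_time / 100
    return 1 / remaining_fraction

# Percentages taken from the benchmark output above.
comparisons = [
    ("lazy vs eager", 10.30),
    ("GPU vs CPU lazy", 74.78),
    ("GPU vs CPU eager", 77.38),
]

for label, pct in comparisons:
    print(f"{label}: {pct:.2f}% less time = {speedup_factor(pct):.2f}x")
```

So a 74.78% reduction in execution time corresponds to the GPU engine being roughly four times as fast as lazy execution on the CPU for this query.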

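Unfortunately I can't share the presentation provided by the Nvidia and Polars teams, but I can describe what I understand to be happening under the hood. Basically, Polars has several execution engines for various tasks, and they essentially just added another one that supports the GPU.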

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/1b30fc04306f74ea72bc818f29c40bcd.png

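After a batch of queries is fed in, the query optimizer optimizes the query and sends the operations to one of a variety of execution engines. Now there's a new, GPU-powered execution engine.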

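As I understand it, these engines are invoked on the fly, based on both the available hardware and the query being executed. Some queries are highly parallelizable and do extremely well on a GPU, while less parallelizable operations can be handled by the in-memory engine on the CPU. In theory this makes CUDA-accelerated Polars almost always faster, and I've found this to be especially true with larger datasets.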

Abstracted Memory Management

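A key point made by the Nvidia team was that the new query optimizer is clever enough to handle memory management between the CPU and GPU. For those without much GPU programming experience: the CPU and GPU have separate memory. The CPU uses RAM to store information, while the GPU uses vRAM, which lives on the GPU itself.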

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/724cdeb618757a744eee39257bac30db.png

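The CPU and GPU are a bit like separate computers, each with its own resources, talking to each other. The CPU does computation and its RAM stores data, while the GPU also does computation and the vRAM on the graphics card also stores data. These separate and somewhat autonomous systems need to work together on complex tasks. This figure is from my article on CUDA programming for AI.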

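One can imagine a scenario where a Polars dataframe is created and executed on the GPU execution engine. Then one can imagine an operation that requires that dataframe to interact with another dataframe that is still on the CPU. The Polars query optimizer is able to understand and handle this discrepancy by passing data between the CPU and GPU as needed.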

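To me this is a tremendous asset, but it also comes with a thorny inconvenience. When you use a GPU for large, heavy workloads (such as building AI models), it's often necessary to manage memory consumption strictly. I imagine that workflows with large models that take up a large portion of the GPU might run into problems with Polars moving your data around as it sees fit. I noticed that the GPU execution engine appears to be pointed at a single GPU, so machines with multiple GPUs may be able to isolate memory more strictly.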

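That said, I think this matters little in practice for most people. Generally, when doing data science work, you start with raw data, do a lot of data processing, and then save artifacts that are ready for modeling. I struggle to think of a use case where big data processing and modeling work must happen on the same machine at the same time. If you do think this is a big problem, the Nvidia and Polars teams are currently investigating explicit memory control, which may arrive in a future release. For pure data processing workloads, I imagine having RAM and vRAM handled automatically will save many data scientists and engineers a great deal of time.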
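To get a feel for whether a dataset is likely to compete with a model for vRAM, a back-of-the-envelope size estimate helps. The sketch below only computes the raw size of the benchmark data used earlier (20 million rows, 10 columns of float64 values from np.random.randn). It is an assumption-laden lower bound: it says nothing about the intermediate buffers a query allocates or how the GPU engine actually lays data out in memory.

```python
# Raw size of the benchmark dataset used earlier in this article.
num_rows = 20_000_000
num_cols = 10
bytes_per_value = 8  # np.random.randn produces float64 values, 8 bytes each

total_bytes = num_rows * num_cols * bytes_per_value

print(f"{total_bytes / 1e9:.2f} GB ({total_bytes / 2**30:.2f} GiB)")
```

For a dataframe you already have in memory, Polars also offers DataFrame.estimated_size(), which reports an estimate of the frame's footprint.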

Conclusion

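Very exciting stuff, and hot off the press. Usually I spend weeks reviewing topics that have been established for months or even years, so a "pre-release" is a bit new to me.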

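Frankly, I don't know how much this will affect my general workflow. I'll probably still use pandas a lot for cowboy coding in Google Colab because I'm comfortable with it, but when facing large dataframes and computationally intensive queries I think I'll reach for Polars more often, and I expect to eventually integrate it into the core of my workflow. A big reason for that is the astronomical speed boost GPU-accelerated Polars offers over practically any other dataframe tool.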