Comparing Efficient Pandas Iteration Methods: From Basic Loops to Vectorized Operations

Article link: https://blog.youkuaiyun.com/gitblog_01154/article/details/148601845


Machine-Learning-with-Python: practice and tutorial-style notebooks covering a wide variety of machine learning techniques. Project URL: https://gitcode.com/gh_mirrors/mac/Machine-Learning-with-Python

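Introduction: The Challenge of Iterating in Pandas

Data analysis work often requires row-by-row operations on a Pandas DataFrame. Whether the operation is a simple logical test or a more involved mathematical transformation, the choice of iteration method has a large effect on performance. This article walks through several common ways to iterate over a DataFrame and compares their measured running times.

Preparing the Test Data

We start by creating a DataFrame of random integers with 100,000 rows and 4 columns as test data: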

import numpy as np
import pandas as pd
from time import time

np.random.seed(101)
df = pd.DataFrame(np.random.randint(0,100,size=(100000, 4)), 
                 columns=list('ABCD'),dtype=np.int16)

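This DataFrame takes up about 781 KB of memory (100,000 rows × 4 columns × 2 bytes per value), and every column has dtype int16, which keeps the test data uniform across columns.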
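The memory figure can be checked directly from the DataFrame. The following is a minimal sketch that rebuilds the same data and sums the per-column byte counts, leaving the index out of the total:

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)),
                  columns=list('ABCD'), dtype=np.int16)

# Bytes used by the four data columns only (index excluded).
data_bytes = int(df.memory_usage(index=False).sum())
print(data_bytes, round(data_bytes / 1024, 2))  # 800000 781.25
```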

Method 1: The Basic for Loop

The most intuitive approach is a traditional for loop that reads each row with iloc:

count = 0
t1 = time()
for i in range(len(df)):
    if df.iloc[i]['A'] + df.iloc[i]['B'] > df.iloc[i]['C'] + df.iloc[i]['D']:
        count += 1
t2 = time()
print(f"Elapsed: {round(t2-t1,2)} s")

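Result: 32.22 seconds to traverse 100,000 rows.

This is the slowest method. Every df.iloc[i] call has to locate the row by position and build a new Series for it, and the loop body does this four times per iteration, so the overhead adds up quickly.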
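To illustrate where that overhead comes from, the sketch below uses a smaller stand-in DataFrame (the name `small` and the 1,000-row size are assumptions made to keep the example quick). It shows that `df.iloc[i]` returns a Series, and that fetching the row once per iteration gives the same count with a quarter of the lookups. No timing is claimed here.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
# Smaller stand-in for the article's df (1,000 rows instead of 100,000).
small = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)),
                     columns=list('ABCD'), dtype=np.int16)

# Each iloc call materializes the row as a new Series object.
row = small.iloc[0]
print(type(row).__name__)  # Series

# Variant: look the row up once per iteration instead of four times.
count_once = 0
for i in range(len(small)):
    r = small.iloc[i]
    if r['A'] + r['B'] > r['C'] + r['D']:
        count_once += 1

# Reference: the four-lookups-per-row form used in Method 1.
count_four = 0
for i in range(len(small)):
    if small.iloc[i]['A'] + small.iloc[i]['B'] > small.iloc[i]['C'] + small.iloc[i]['D']:
        count_four += 1
```

Both loops produce the same count; the first simply avoids three of the four Series constructions per row.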

Method 2: Using iterrows()

Pandas provides the iterrows() method, which offers a cleaner way to walk through a DataFrame:

count = 0
t1 = time()
for idx, row in df.iterrows():
    if row['A'] + row['B'] > (row['C'] + row['D']):
        count += 1
t2 = time()
print(f"Elapsed: {round(t2-t1,2)} s")

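Result: 6.91 seconds, about 4.7 times faster than the basic for loop.

Although iterrows() beats the basic loop, it is still not the best choice: it returns a Series object on every iteration, which carries extra overhead.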
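One way to avoid building a Series per row while keeping named access is `DataFrame.itertuples()`, which yields namedtuples. It is not part of the benchmark in this article, so no timing is claimed for it; the sketch below, on an assumed 1,000-row stand-in called `small`, only shows that it produces the same count as iterrows().

```python
import numpy as np
import pandas as pd

np.random.seed(101)
# Smaller stand-in for the article's df (1,000 rows instead of 100,000).
small = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)),
                     columns=list('ABCD'), dtype=np.int16)

# itertuples() yields one namedtuple per row; fields are read as attributes.
count_tuples = 0
for row in small.itertuples(index=False):
    if row.A + row.B > row.C + row.D:
        count_tuples += 1

# Reference: the iterrows() form from Method 2.
count_rows = 0
for idx, row in small.iterrows():
    if row['A'] + row['B'] > (row['C'] + row['D']):
        count_rows += 1
```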

Method 3: Using df.values

A more efficient approach is to iterate over the DataFrame's underlying NumPy array directly:

count = 0
t1 = time()
for row in df.values:
    if row[0] + row[1] > (row[2] + row[3]):
        count += 1
t2 = time()
print(f"Elapsed: {round(t2-t1,3)} s")

Result: only 0.112 seconds, 61.7 times faster than iterrows()!

This method is fast because it works on the NumPy array directly and avoids the indexing overhead of Pandas.
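Because the array rows are indexed by position, the code silently depends on column order. A minimal sketch of one way to keep the column names visible is shown below; the `pos` lookup dictionary and the `small` stand-in DataFrame are assumptions introduced for illustration, and `to_numpy()` is used as the equivalent of `.values` for this all-int16 frame.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
# Smaller stand-in for the article's df (1,000 rows instead of 100,000).
small = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)),
                     columns=list('ABCD'), dtype=np.int16)

arr = small.to_numpy()  # same data as small.values for this frame
# Map each column name to its position in the array.
pos = {name: i for i, name in enumerate(small.columns)}

count_named = 0
for row in arr:
    if row[pos['A']] + row[pos['B']] > row[pos['C']] + row[pos['D']]:
        count_named += 1

# Reference: the hard-coded positional form from Method 3.
count_positional = 0
for row in small.values:
    if row[0] + row[1] > (row[2] + row[3]):
        count_positional += 1
```

The dictionary lookups add a little work per row, so the hard-coded positions remain the leaner choice when the column order is fixed.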

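Testing More Complex Operations

Next we test more complex operations and compare how the methods perform:

Comparison with a coefficient:
```
if row['A'] + row['B'] > 1.25*(row['C'] + row['D']):
```
- iterrows(): 8.05 seconds
- df.values: 0.546 seconds (14.7 times faster)

Math functions:
```
if np.log(1+row['A']+row['B']) > np.sqrt(0.5*(row['C']+row['D'])):
```
- iterrows(): 8.76 seconds
- df.values: 0.962 seconds (9.1 times faster)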
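The two conditions above are written in the iterrows() form, with columns accessed by name. A sketch of what the df.values counterparts might look like follows; the tuple unpacking into `a, b, c, d` and the 1,000-row stand-in `small` are assumptions for illustration, not the code that produced the timings.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
# Smaller stand-in for the article's df (1,000 rows instead of 100,000).
small = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)),
                     columns=list('ABCD'), dtype=np.int16)

count_coef = 0  # rows passing the coefficient comparison
count_math = 0  # rows passing the log/sqrt comparison
for row in small.values:
    a, b, c, d = row  # positional unpacking follows column order A, B, C, D
    if a + b > 1.25 * (c + d):
        count_coef += 1
    if np.log(1 + a + b) > np.sqrt(0.5 * (c + d)):
        count_math += 1
```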

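The Ultimate Solution: Vectorized Operations

For operations that can be vectorized, using the vectorized functionality of Pandas/NumPy directly is the best choice: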

t1 = time()
df['result'] = np.log(1+df['A']+df['B']) > np.sqrt(0.5*(df['C']+df['D']))
t2 = time()
print(f"Elapsed: {round(t2-t1,3)} s")

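Result: only 0.01 seconds, roughly 100 times faster than the df.values version of the same comparison (0.962 seconds)!

Vectorized operations run in optimized, compiled code inside NumPy and avoid the per-row overhead of looping in the Python interpreter.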
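The simple comparison from Methods 1 to 3 can be vectorized the same way. The sketch below, on an assumed 1,000-row stand-in called `small`, builds a boolean mask for all rows at once and checks that its sum matches the explicit loop. The int16 sums here stay below 200, so there is no risk of integer overflow; with larger values the columns would need a wider dtype first.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
# Smaller stand-in for the article's df (1,000 rows instead of 100,000).
small = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)),
                     columns=list('ABCD'), dtype=np.int16)

# Boolean Series: True where A + B exceeds C + D, computed for all rows at once.
mask = (small['A'] + small['B']) > (small['C'] + small['D'])
count_vec = int(mask.sum())

# Cross-check against the explicit df.values loop from Method 3.
count_loop = 0
for row in small.values:
    if row[0] + row[1] > (row[2] + row[3]):
        count_loop += 1

print(count_vec == count_loop)  # True
```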

String Processing Example

The same principles apply to string work such as generating identifiers:

def identifier():
    letters = list('CFJQZ')
    numbers = list('123456789')
    return (np.random.choice(letters) + np.random.choice(letters) + 
            np.random.choice(numbers) + np.random.choice(numbers) + 
            np.random.choice(letters))

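# Build the whole column with a list comprehension: one identifier() call per row
# (still a Python-level loop, not a vectorized operation)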
df['ID'] = [identifier() for _ in range(len(df))]
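The list comprehension still calls `identifier()` once per row. A sketch of a fully array-based alternative is shown below: it draws each of the five character positions for all rows in a single `np.random.choice` call and concatenates them element-wise with `np.char.add`. The names `n`, `parts`, and `ids` are assumptions for illustration, and the random draws occur in a different order than in `identifier()`, so the generated IDs will not match the loop version value for value.

```python
import numpy as np

np.random.seed(101)
letters = list('CFJQZ')
numbers = list('123456789')
n = 100000  # number of identifiers to generate (the article uses len(df))

# One array of n single characters per position: letter, letter, digit, digit, letter.
parts = [
    np.random.choice(letters, size=n),
    np.random.choice(letters, size=n),
    np.random.choice(numbers, size=n),
    np.random.choice(numbers, size=n),
    np.random.choice(letters, size=n),
]

# Concatenate the five positions element-wise into 5-character strings.
ids = parts[0]
for p in parts[1:]:
    ids = np.char.add(ids, p)
```

The resulting array can be assigned to a column in one step, for example `df['ID'] = ids`.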

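Performance Optimization Summary

| Method | Relative speed | When to use |
|------|---------|----------|
| Basic for loop | 1x (baseline) | Not recommended |
| iterrows() | ~5x | When the row index is needed |
| df.values | ~290x | Simple numeric operations |
| Vectorized operations | ~3000x | Operations that can be vectorized |

Relative speeds are computed from the timings above against the 32.22-second baseline; the vectorized figure uses the 0.01-second run of the log/sqrt comparison.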