[Pandas] Data Preprocessing: set_index and reset_index

Original article, last modified 2025-01-20 20:59:59

Tags: #pandas #python

First published 2025-01-20 20:56:47

Contents

Preface
I. set_index
- 1. Basic syntax
- 2. Examples
II. reset_index
- 1. Basic syntax
- 2. Examples

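Preface

In Pandas, the index (Index) is one of the most basic yet most important concepts of a DataFrame. It works like a "house number" for the data: it lets us locate and access rows, and it also underpins data alignment and grouping operations. The set_index and reset_index methods move data between columns and the index, so the same data can be organized and analyzed from different angles. Whether the task is data preprocessing, building a multi-level index (MultiIndex), or pivoting data, fluency with these two methods makes the analysis noticeably easier.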

I. set_index

1. Basic syntax

Reference: the official set_index documentation

DataFrame.set_index(keys, *, drop=True, append=False, inplace=False, verify_integrity=False)

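Description:

Set the DataFrame index using existing columns.
- Set the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The new index can either replace the existing index or extend it.

Parameters: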

keys: label or array-like or list of labels/arrays
This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index, np.ndarray, and instances of Iterator.
drop: bool, default True
Delete columns to be used as the new index.
append: bool, default False
Whether to append columns to existing index.
inplace: bool, default False
Whether to modify the DataFrame rather than creating a new one.
verify_integrity: bool, default False
Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.

In most cases you only need to specify keys; the other parameters can stay at their defaults.
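
For the less common parameters, the following is a minimal sketch of how drop, append, and verify_integrity change the result. It reuses the month/year/sale data from the examples in this article; the variable names kept and appended are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# drop=False keeps 'month' as a regular column in addition to using it as the index.
kept = df.set_index('month', drop=False)
print(kept.columns.tolist())          # ['month', 'year', 'sale']

# append=True adds 'year' as an extra level below the existing (unnamed) index.
appended = df.set_index('year', append=True)
print(list(appended.index.names))     # [None, 'year']

# verify_integrity=True raises ValueError if the new index contains duplicates;
# 'year' holds 2014 twice, so this call is rejected.
try:
    df.set_index('year', verify_integrity=True)
except ValueError as exc:
    print('duplicate keys rejected:', exc)
```

Leaving verify_integrity at its default of False skips this check, which is why duplicates in an index often go unnoticed until a later lookup.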

2. Examples

Create the dataset

>>> import pandas as pd
>>> df = pd.DataFrame({'month': [1, 4, 7, 10],
...                    'year': [2012, 2014, 2013, 2014],
...                    'sale': [55, 40, 84, 31]})
>>> df
   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31

(1) Set the 'month' column as the index

>>> df.set_index('month')
       year  sale
month
1      2012    55
4      2014    40
7      2013    84
10     2014    31

(2) Set the 'year' and 'month' columns as a multi-level index

>>> df.set_index(['year', 'month'])
            sale
year  month
2012  1     55
2014  4     40
2013  7     84
2014  10    31

(3) Create a multi-level index from an Index object and a column

>>> df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
         month  sale
   year
1  2012  1      55
2  2014  4      40
3  2013  7      84
4  2014  10     31

(4) Create a multi-level index from two Series

>>> s = pd.Series([1, 2, 3, 4])
>>> df.set_index([s, s**2])
      month  year  sale
1 1       1  2012    55
2 4       4  2014    40
3 9       7  2013    84
4 16     10  2014    31
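
Once a column has become the index, its values can be used as labels with .loc. The sketch below illustrates this on the same dataset; by_month, by_year_month, and rows_2014 are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# Single-level index: look up the row whose 'month' label is 7.
by_month = df.set_index('month')
print(by_month.loc[7, 'sale'])                 # 84

# Two-level index: a (year, month) tuple selects exactly one row.
by_year_month = df.set_index(['year', 'month'])
print(by_year_month.loc[(2014, 10), 'sale'])   # 31

# Passing only the first level returns every row for that year,
# indexed by the remaining 'month' level.
rows_2014 = by_year_month.loc[2014]
print(rows_2014)
```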

II. reset_index

1. Basic syntax

Reference: the official reset_index documentation

DataFrame.reset_index(level=None, *, drop=False, inplace=False, col_level=0, col_fill='', allow_duplicates=<no_default>, names=None)

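Description:

Reset the index, or a level of it.
- Reset the index of the DataFrame and use the default index instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.

Parameters: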

level: int, str, tuple, or list, default None
Only remove the given levels from the index. Removes all levels by default.
drop: bool, default False
Do not try to insert index into dataframe columns. This resets the index to the default integer index.
inplace: bool, default False
Whether to modify the DataFrame rather than creating a new one.
col_level: int or str, default 0
If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
col_fill: object, default ''
If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
allow_duplicates: bool, optional, default lib.no_default
Allow duplicate column labels to be created.
names: int, str or 1-dimensional list, default None
Using the given string, rename the DataFrame column which contains the index data. If the DataFrame has a MultiIndex, this has to be a list or tuple with length equal to the number of levels.

Usually all parameters can be left at their defaults; occasionally drop=True is set.
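
As a small illustration of the names parameter, the sketch below labels the column created from an unnamed index. It assumes pandas 1.5 or later, where names is available; the label 'key' and the variables renamed and legacy are hypothetical choices made for this example. The rename_axis route shown for comparison is one common alternative, not the only one.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=['w', 'x', 'y', 'z'])

# By default an unnamed index becomes a column called 'index'.
print(df.reset_index().columns.tolist())   # ['index', 'A']

# names= chooses the column label directly (assumes pandas >= 1.5).
renamed = df.reset_index(names='key')
print(renamed.columns.tolist())            # ['key', 'A']

# Alternative: name the index first, then reset it.
legacy = df.rename_axis('key').reset_index()
print(renamed.equals(legacy))              # True
```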

2. Examples

import pandas as pd

# Create a DataFrame with a custom index
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ['a', 'b', 'c', 'd']
}, index=['w', 'x', 'y', 'z'])

print("Original data:")
print(df)

# Output:
Original data:
   A  B
w  1  a
x  2  b
y  3  c
z  4  d

(1) Basic index reset

reset_df = df.reset_index()
print("\nAfter resetting the index:")
print(reset_df)

# Output:
After resetting the index:
  index  A  B
0     w  1  a
1     x  2  b
2     y  3  c
3     z  4  d

(2) Discard the original index with drop=True

reset_drop_df = df.reset_index(drop=True)
print("\nOriginal index dropped:")
print(reset_drop_df)

# Output:
Original index dropped:
   A  B
0  1  a
1  2  b
2  3  c
3  4  d
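
A typical situation where drop=True helps is after filtering rows, since the surviving rows keep their old labels. The following is a minimal sketch of that pattern; the filter condition and the names filtered and renumbered are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['a', 'b', 'c', 'd']})

# Boolean filtering keeps the original row labels, so the index has gaps.
filtered = df[df['A'] % 2 == 0]
print(filtered.index.tolist())     # [1, 3]

# drop=True discards the old labels and renumbers the rows from 0.
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1]
```

Without drop=True, the old labels would be kept as an extra 'index' column, which is rarely wanted in this case.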

(3) Working with a multi-level index

# Create a DataFrame with a multi-level index
multi_df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ['a', 'b', 'c', 'd']
})
multi_df.index = pd.MultiIndex.from_tuples([
    ('2024', 'Q1'),
    ('2024', 'Q2'),
    ('2025', 'Q1'),
    ('2025', 'Q2')
], names=['year', 'quarter'])

print("\nMulti-level index data:")
print(multi_df)

# Reset every level of the index
reset_all = multi_df.reset_index()
print("\nAll index levels reset:")
print(reset_all)

# Reset only the first level of the index
reset_level_0 = multi_df.reset_index(level=0)
print("\nOnly the first index level reset:")
print(reset_level_0)

The output is as follows:

Multi-level index data:
              A  B
year quarter      
2024 Q1       1  a
     Q2       2  b
2025 Q1       3  c
     Q2       4  d

All index levels reset:
   year quarter  A  B
0  2024      Q1  1  a
1  2024      Q2  2  b
2  2025      Q1  3  c
3  2025      Q2  4  d

Only the first index level reset:
         year  A  B
quarter            
Q1       2024  1  a
Q2       2024  2  b
Q1       2025  3  c
Q2       2025  4  d
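
To tie the two methods together, here is a small sketch showing that set_index followed by reset_index returns the same data, and that reset_index is a common way to turn a groupby key back into a column. It reuses the month/year/sale data from the set_index examples; round_trip and totals are names introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# Moving columns into the index and back preserves the data;
# only the column order changes (former index levels come first).
round_trip = df.set_index(['year', 'month']).reset_index()
print(round_trip.columns.tolist())         # ['year', 'month', 'sale']
print(round_trip[df.columns].equals(df))   # True

# groupby places the grouping key in the index of the result;
# reset_index turns it back into an ordinary column.
totals = df.groupby('year')['sale'].sum().reset_index()
print(totals)
```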