pandas组队学习-Task2 pandas的各种索引操作_索引设定中的所有方法分别适用于哪些场合?怎么直接把某个dataframe的索引换成任意-优快云博客

本文链接：https://blog.youkuaiyun.com/Zee_Chao/article/details/105718169

3.4.1 逻辑运算符'&'、'|'和'~'

4.2.2 第一类特殊情况：元组构成的列表

4.2.3 第二类特殊情况：列表构成的元组

5.4.2 利用append参数将当前索引维持不变

5.6 rename_axis()和rename()方法

5.6.1 rename_axis()用于修改多级索引中的索引名

5.6.2 rename()用于修改列或者行索引标签而非索引名

6. 常用索引型函数

6.1 where()函数

6.1.1 对条件为False的单元进行填充

6.1.2 对条件为False的单元进行指定值填充

1. 学习内容

1. 了解pandas支持的各种索引操作

2. 了解pandas对索引的设置

3. 了解pandas的索引函数

4. 利用pandas的索引操作对数据进行处理和抽样

本项目参见https://github.com/datawhalechina/team-learning/tree/master/Pandas%E6%95%99%E7%A8%8B%EF%BC%88%E4%B8%8A%EF%BC%89

2. 准备工作

import numpy as np
import pandas as pd
df = pd.read_csv(r'./data/table.csv', index_col = 'ID')
print(df.head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

3. pandas的单级索引操作

pandas最基本的索引操作有三种：loc[]，iloc[]和[]操作符。这三种方法都适用于序列和表格，不过由于序列和表格存在不同之处，这三种方法在二者上的使用也有细节上的不同。

在此基础上，还可以引申出布尔索引和区间索引等索引方式。

3.1 loc[]

在[1]中，我曾说过：“如果将表格视作一个二维数组的话，按照习惯我们很有可能会将行索引作为第一个下标并将列名作为第二个下标。但是，需要注意的是：表格中某个元素的提取要将列名作为第一个下标，而索引要作为第二个下标。”

loc[]的使用也要遵循这个规律。另外，loc[]能传入的只有布尔列表和索引子集构成的列表，而且loc[]中的切片始终都包含右端点。

3.1.1 单行索引

print(df.loc[1103])

School          S_1
Class           C_1
Gender            M
Address    street_2
Height          186
Weight           82
Math           87.2
Physics          B+
Name: 1103, dtype: object

3.1.2 多行索引

print(df.loc[[1102, 2304]])

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1102    S_1   C_1      F  street_2     192      73  32.5      B+
2304    S_2   C_3      F  street_6     164      81  95.5      A-

print(df.loc[1304:].head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1304    S_1   C_3      M  street_2     195      70  85.2       A
1305    S_1   C_3      F  street_5     187      69  61.7      B-
2101    S_2   C_1      M  street_7     174      84  83.3       C
2102    S_2   C_1      F  street_6     161      61  50.6      B+
2103    S_2   C_1      M  street_4     157      61  52.5      B-

print(df.loc[2402::-1].head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2402    S_2   C_4      M  street_7     166      82  48.7       B
2401    S_2   C_4      F  street_2     192      62  45.3       A
2305    S_2   C_3      M  street_4     187      73  48.9       B
2304    S_2   C_3      F  street_6     164      81  95.5      A-
2303    S_2   C_3      F  street_7     190      99  65.9       C

3.1.3 单列索引

print(df.loc[:, 'Height'].head())

ID
1101    173
1102    192
1103    186
1104    167
1105    159
Name: Height, dtype: int64

3.1.4 多列索引

print(df.loc[:, ['Height', 'Math']].head())

      Height  Math
ID                
1101     173  34.0
1102     192  32.5
1103     186  87.2
1104     167  80.4
1105     159  84.8

print(df.loc[:, 'Height':'Math'].head())

      Height  Weight  Math
ID                        
1101     173      63  34.0
1102     192      73  32.5
1103     186      82  87.2
1104     167      81  80.4
1105     159      64  84.8

3.1.5 联合索引

print(df.loc[1102:2401:3, 'Height':'Math'].head())

      Height  Weight  Math
ID                        
1102     192      73  32.5
1105     159      64  84.8
1203     160      53  58.8
1301     161      68  31.5
1304     195      70  85.2

3.1.6 函数式索引

loc[]中可以使用匿名函数或者自定义的函数。其中，匿名函数传入的参数就是整个表格。

print(df.loc[lambda x: x['Gender'] == 'M'].head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1201    S_1   C_2      M  street_5     188      68  97.0      A-
1203    S_1   C_2      M  street_6     160      53  58.8      A+
1301    S_1   C_3      M  street_4     161      68  31.5      B+

def f(x):
    return [1101, 1103]
print(df.loc[f])

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1103    S_1   C_1      M  street_2     186      82  87.2      B+

3.1.7 布尔索引

print(df.loc[df['Address'].isin(['street_7', 'street_4'])].head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1301    S_1   C_3      M  street_4     161      68  31.5      B+
1303    S_1   C_3      M  street_7     188      82  49.7       B
2101    S_2   C_1      M  street_7     174      84  83.3       C

print(df.loc[[True if i[-1] == '4' or i[-1] == '7' else False \
              for i in df['Address'].values]].head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1301    S_1   C_3      M  street_4     161      68  31.5      B+
1303    S_1   C_3      M  street_7     188      82  49.7       B
2101    S_2   C_1      M  street_7     174      84  83.3       C

3.2 iloc[]

实际上，pandas的表格也是可以按照数组的方式来访问的。只不过必须要借助iloc[]才能实现。iloc[]中接收的参数只能为整数或整数列表，不能使用布尔索引。与loc[]不同，iloc[]切片右端点是不包含的。

3.2.1 单行索引

print(df.iloc[3])

School          S_1
Class           C_1
Gender            F
Address    street_2
Height          167
Weight           81
Math           80.4
Physics          B-
Name: 1104, dtype: object

3.2.2 多行索引

print(df.iloc[3:5])

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

3.2.3 单列索引

print(df.iloc[:, 3].head())

ID
1101    street_1
1102    street_2
1103    street_2
1104    street_2
1105    street_4
Name: Address, dtype: object

3.2.4 多列索引

print(df.iloc[:, 7::-2].head())

     Physics  Weight   Address Class
ID                                  
1101      A+      63  street_1   C_1
1102      B+      73  street_2   C_1
1103      B+      82  street_2   C_1
1104      B-      81  street_2   C_1
1105      B+      64  street_4   C_1

3.2.5 联合索引

print(df.iloc[3::4, 7::-2].head())

     Physics  Weight   Address Class
ID                                  
1104      B-      81  street_2   C_1
1203      A+      53  street_6   C_2
1302      A-      57  street_1   C_3
2101       C      84  street_7   C_1
2105       A      81  street_4   C_1

3.2.6 函数式索引

print(df.iloc[lambda x: [3, 4]].head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

3.3 []操作符

[]操作符有的时候使用的是索引标签，有时候使用的又是绝对位置。因此很容易产生混淆。所以，还是用loc[]和iloc[]更合适一些。

另外，如果索引是浮点数，那么对于序列就更不要使用[]来进行访问了。

3.3.1 表格单行索引

print(df[1:2])

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1102    S_1   C_1      F  street_2     192      73  32.5      B+

这里很容易产生误解，以为“1:2”指的是索引标签。但实际上它指的是绝对位置。如果想要用索引标签来访问整个行的话，要用如下的方式：

row = df.index.get_loc(1102)
print(df[row:row+1])

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1102    S_1   C_1      F  street_2     192      73  32.5      B+

3.3.2 表格多行索引

print(df[3:5])

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

3.3.3 表格单列索引

print(df['School'].head())

ID
1101    S_1
1102    S_1
1103    S_1
1104    S_1
1105    S_1
Name: School, dtype: object

3.3.4 表格多列索引

print(df[['School', 'Math']].head())

     School  Math
ID               
1101    S_1  34.0
1102    S_1  32.5
1103    S_1  87.2
1104    S_1  80.4
1105    S_1  84.8

3.3.5 表格函数式索引

print(df[lambda x: ['Math', 'Physics']].head())

      Math Physics
ID                
1101  34.0      A+
1102  32.5      B+
1103  87.2      B+
1104  80.4      B-
1105  84.8      B+

3.3.6 表格布尔索引

print(df[df['Gender'] == 'F'].head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1204    S_1   C_2      F  street_5     162      63  33.8       B

3.3.7 序列单行索引

s = pd.Series(df['Math'],index=df.index)
print(s[1101])

34.0

需要注意的是这里使用的是索引标签而非绝对位置。

3.3.8 序列多行索引

print(s[0:4])

ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
Name: Math, dtype: float64

这里使用的是绝对位置进行的切片。

3.3.9 序列函数式索引

print(s[lambda x: x.index[16::-6]])

ID
2102    50.6
1301    31.5
1105    84.8
Name: Math, dtype: float64

3.3.10 序列布尔索引

print(s[s > 80])

ID
1103    87.2
1104    80.4
1105    84.8
1201    97.0
1302    87.7
1304    85.2
2101    83.3
2205    85.4
2304    95.5
Name: Math, dtype: float64

3.4 布尔索引

3.4.1 逻辑运算符'&'、'|'和'~'

print(df[(df['Gender'] == 'F') & (df['Address'] == 'street_2')].head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
2401    S_2   C_4      F  street_2     192      62  45.3       A
2404    S_2   C_4      F  street_2     160      84  67.7       B

print(df[(df['Math'] > 85) | (df['Address'] == 'street_7')].head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1201    S_1   C_2      M  street_5     188      68  97.0      A-
1302    S_1   C_3      F  street_1     175      57  87.7      A-
1303    S_1   C_3      M  street_7     188      82  49.7       B
1304    S_1   C_3      M  street_2     195      70  85.2       A

print(df[~((df['Math'] > 75) | (df['Address'] == 'street_1'))].head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1203    S_1   C_2      M  street_6     160      53  58.8      A+
1204    S_1   C_2      F  street_5     162      63  33.8       B
1205    S_1   C_2      F  street_6     167      63  68.4      B-

3.4.2 在loc[]中使用布尔索引

print(df.loc[df['Math'] > 60, (df[:8]['Address'] == 'street_6').values].head())

     Physics
ID          
1103      B+
1104      B-
1105      B+
1201      A-
1202      B-

3.4.3 isin()方法

print(df[df['Address'].isin(['street_1', 'street_4']) & \
         df['Physics'].isin(['A','A+'])])

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
2105    S_2   C_1      M  street_4     170      81  34.2       A
2203    S_2   C_2      M  street_4     155      91  73.8      A+

该逻辑同样可以用字典来实现，代码如下：

print(df[df[['Address', 'Physics']].isin({'Address':['street_1', 'street_4'], \
                                          'Physics':['A','A+']}).all(1)])

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
2105    S_2   C_1      M  street_4     170      81  34.2       A
2203    S_2   C_2      M  street_4     155      91  73.8      A+

3.5 快速标量索引

如果只是取单个元素，那么at[]和iat[]在时间上要远好于loc[]和iloc[]。

display(df.at[1101,'School'])
display(df.loc[1101,'School'])
display(df.iat[0,0])
display(df.iloc[0,0])

3.6 区间索引

3.6.1 构造区间

print(pd.interval_range(start = 0, end = 5))

IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              closed='right',
              dtype='interval[int64]')

print(pd.interval_range(start = 0, periods = 8, freq = 5))

IntervalIndex([(0, 5], (5, 10], (10, 15], (15, 20], (20, 25], (25, 30], (30, 35], (35, 40]],
              closed='right',
              dtype='interval[int64]')

3.6.2 利用cut将数值列转为区间

math_interval = pd.cut(df['Math'],bins = [0, 40, 60, 80, 100])
#注意，如果没有类型转换，此时并不是区间类型，而是category类型
math_interval.head()

ID
1101      (0, 40]
1102      (0, 40]
1103    (80, 100]
1104    (80, 100]
1105    (80, 100]
Name: Math, dtype: category
Categories (4, interval[int64]): [(0, 40] < (40, 60] < (60, 80] < (80, 100]]

3.6.3 区间索引的选取

df_i = df.join(math_interval,rsuffix = \
               '_interval')[['Math', \
                             'Math_interval']].reset_index().set_index('Math_interval')
print(df_i.head())

                 ID  Math
Math_interval            
(0, 40]        1101  34.0
(0, 40]        1102  32.5
(80, 100]      1103  87.2
(80, 100]      1104  80.4
(80, 100]      1105  84.8

print(df_i.loc[65].head())

                 ID  Math
Math_interval            
(60, 80]       1202  63.5
(60, 80]       1205  68.4
(60, 80]       1305  61.7
(60, 80]       2104  72.2
(60, 80]       2202  68.5

print(df_i.loc[[45, 65]])

                 ID  Math
Math_interval            
(40, 60]       1203  58.8
(40, 60]       1303  49.7
(40, 60]       2102  50.6
(40, 60]       2103  52.5
(40, 60]       2204  47.2
(40, 60]       2305  48.9
(40, 60]       2401  45.3
(40, 60]       2402  48.7
(40, 60]       2403  59.7
(40, 60]       2405  47.6
(60, 80]       1202  63.5
(60, 80]       1205  68.4
(60, 80]       1305  61.7
(60, 80]       2104  72.2
(60, 80]       2202  68.5
(60, 80]       2203  73.8
(60, 80]       2301  72.3
(60, 80]       2303  65.9
(60, 80]       2404  67.7

# 如果想要选取某个区间，先要把分类变量转为区间变量，再使用overlap方法
print(df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head())

                 ID  Math
Math_interval            
(80, 100]      1103  87.2
(80, 100]      1104  80.4
(80, 100]      1105  84.8
(80, 100]      1201  97.0
(60, 80]       1202  63.5

4. pandas的多级索引操作

4.1 创建多级索引

4.1.1 从元组和数组中创建

以下的三段代码是等效的。

tuples = [('A','a'),('A','b'),('B','a'),('B','b')]
mul_index = pd.MultiIndex.from_tuples(tuples, names = ('Upper', 'Lower'))
print(pd.DataFrame({'Score':['perfect','good','fair','bad']},index = mul_index))

L1 = list('AABB')
L2 = list('abab')
tuples = list(zip(L1, L2))
mul_index = pd.MultiIndex.from_tuples(tuples, names = ('Upper', 'Lower'))
print(pd.DataFrame({'Score':['perfect', 'good', 'fair', 'bad']},index = mul_index))

arrays = [['A', 'a'], ['A', 'b'], ['B', 'a'], ['B', 'b']]
mul_index = pd.MultiIndex.from_tuples(arrays, names = ('Upper', 'Lower'))
print(pd.DataFrame({'Score':['perfect', 'good', 'fair', 'bad']},index = mul_index))

               Score
Upper Lower         
A     a      perfect
      b         good
B     a         fair
      b          bad

如果将mul_index打印出来的话，结果是这样的：

MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'Lower'])

4.1.2 从笛卡尔积中创建

L1 = ['A', 'B']
L2 = ['a', 'b']
mul_index = pd.MultiIndex.from_product([L1, L2],names = ('Upper', 'Lower'))
print(pd.DataFrame({'Score':['perfect', 'good', 'fair', 'bad']},index = mul_index))

实质上这里的mul_index跟上面是一模一样的。

4.1.3 指定表格中的列创建

df_using_mul = df.set_index(['Class', 'Address'])
print(df_using_mul.head())

               School Gender  Height  Weight  Math Physics
Class Address                                             
C_1   street_1    S_1      M     173      63  34.0      A+
      street_2    S_1      F     192      73  32.5      B+
      street_2    S_1      M     186      82  87.2      B+
      street_2    S_1      F     167      81  80.4      B-
      street_4    S_1      F     159      64  84.8      B+

4.2 多级索引切片

4.2.1 一般情况

在使用多级索引之前，必须要保证数据的索引是已经排好序的。否则的话会出错。

# 在未保证索引有序的前提下进行多级索引切片是会出错的
df_using_mul.loc['C_2','street_5']

因此要先对索引进行排序然后再进行多级索引的相关操作。

print(df_using_mul.sort_index().loc['C_2','street_5'])

               School Gender  Height  Weight  Math Physics
Class Address                                             
C_2   street_5    S_1      M     188      68  97.0      A-
      street_5    S_1      F     162      63  33.8       B
      street_5    S_2      M     193     100  39.1       B

至于索引是否有序，可以用以下语句来检验：

print(df_using_mul.index.is_lexsorted())

多级索引切片的相关语句如下（同理要先保证索引有序）：

print(df_using_mul.sort_index().loc[('C_2','street_6'):('C_3','street_4')])

               School Gender  Height  Weight  Math Physics
Class Address                                             
C_2   street_6    S_1      M     160      53  58.8      A+
      street_6    S_1      F     167      63  68.4      B-
      street_7    S_2      F     194      77  68.5      B+
      street_7    S_2      F     183      76  85.4       B
C_3   street_1    S_1      F     175      57  87.7      A-
      street_2    S_1      M     195      70  85.2       A
      street_4    S_1      M     161      68  31.5      B+
      street_4    S_2      F     157      78  72.3      B+
      street_4    S_2      M     187      73  48.9       B

print(df_using_mul.sort_index().loc[('C_2','street_7'):('C_3')].head())

               School Gender  Height  Weight  Math Physics
Class Address                                             
C_2   street_7    S_2      F     194      77  68.5      B+
      street_7    S_2      F     183      76  85.4       B
C_3   street_1    S_1      F     175      57  87.7      A-
      street_2    S_1      M     195      70  85.2       A
      street_4    S_1      M     161      68  31.5      B+

4.2.2 第一类特殊情况：元组构成的列表

print(df_using_mul.sort_index().loc[[('C_2','street_7'), ('C_3','street_2')]])

               School Gender  Height  Weight  Math Physics
Class Address                                             
C_2   street_7    S_2      F     194      77  68.5      B+
      street_7    S_2      F     183      76  85.4       B
C_3   street_2    S_1      M     195      70  85.2       A

4.2.3 第二类特殊情况：列表构成的元组

print(df_using_mul.sort_index().loc[(['C_2','C_3'],['street_4','street_7']), :])

               School Gender  Height  Weight  Math Physics
Class Address                                             
C_2   street_4    S_1      F     176      94  63.5      B-
      street_4    S_2      M     155      91  73.8      A+
      street_7    S_2      F     194      77  68.5      B+
      street_7    S_2      F     183      76  85.4       B
C_3   street_4    S_1      M     161      68  31.5      B+
      street_4    S_2      F     157      78  72.3      B+
      street_4    S_2      M     187      73  48.9       B
      street_7    S_1      M     188      82  49.7       B
      street_7    S_2      F     190      99  65.9       C

4.3 多级索引中的slice对象

L1,L2 = ['A','B','C'], ['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names = ('Upper', 'Lower'))
L3,L4 = ['D','E','F'], ['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names = ('Big', 'Small'))
df_s = pd.DataFrame(np.random.rand(9,9), index = mul_index1, columns = mul_index2)
print(df_s)

Big                 D                             E                      \
Small               d         e         f         d         e         f   
Upper Lower                                                               
A     a      0.419165  0.610234  0.584066  0.365608  0.869079  0.966130   
      b      0.423426  0.883564  0.456577  0.711615  0.423257  0.170660   
      c      0.849075  0.194531  0.062969  0.218767  0.774477  0.036436   
B     a      0.981672  0.531616  0.928525  0.397371  0.273752  0.701575   
      b      0.842900  0.607972  0.317989  0.705531  0.938026  0.488817   
      c      0.743939  0.607704  0.966594  0.552271  0.702859  0.459315   
C     a      0.831294  0.688231  0.069070  0.097113  0.771693  0.392254   
      b      0.268481  0.044965  0.620735  0.745501  0.062748  0.537725   
      c      0.050948  0.562608  0.358885  0.283099  0.169198  0.940639   

Big                 F                      
Small               d         e         f  
Upper Lower                                
A     a      0.331276  0.033756  0.698122  
      b      0.691064  0.212999  0.151984  
      c      0.044488  0.088146  0.365206  
B     a      0.484372  0.734307  0.262505  
      b      0.485355  0.005099  0.839988  
      c      0.867462  0.343085  0.461860  
C     a      0.003566  0.797579  0.249657  
      b      0.923007  0.627113  0.050880  
      c      0.907182  0.534797  0.822204

idx = pd.IndexSlice
print(df_s.loc[idx['B':, df_s['D']['d'] > 0.3], idx[df_s.sum() > 4]])

Big                 D                             E                      \
Small               d         e         f         d         e         f   
Upper Lower                                                               
B     a      0.981672  0.531616  0.928525  0.397371  0.273752  0.701575   
      b      0.842900  0.607972  0.317989  0.705531  0.938026  0.488817   
      c      0.743939  0.607704  0.966594  0.552271  0.702859  0.459315   
C     a      0.831294  0.688231  0.069070  0.097113  0.771693  0.392254   

Big                 F  
Small               d  
Upper Lower            
B     a      0.484372  
      b      0.485355  
      c      0.867462  
C     a      0.003566

4.4 多级索引中的索引交换

print(df_using_mul.head())

               School Gender  Height  Weight  Math Physics
Class Address                                             
C_1   street_1    S_1      M     173      63  34.0      A+
      street_2    S_1      F     192      73  32.5      B+
      street_2    S_1      M     186      82  87.2      B+
      street_2    S_1      F     167      81  80.4      B-
      street_4    S_1      F     159      64  84.8      B+

4.4.1 两层交换

print(df_using_mul.swaplevel(i = 1, j = 0, axis = 0).sort_index().head())

               School Gender  Height  Weight  Math Physics
Address  Class                                            
street_1 C_1      S_1      M     173      63  34.0      A+
         C_2      S_2      M     175      74  47.2      B-
         C_3      S_1      F     175      57  87.7      A-
street_2 C_1      S_1      F     192      73  32.5      B+
         C_1      S_1      M     186      82  87.2      B+

4.4.2 多层交换

df_muls = df.set_index(['School','Class','Address'])
print(df_muls.head())

                      Gender  Height  Weight  Math Physics
School Class Address                                      
S_1    C_1   street_1      M     173      63  34.0      A+
             street_2      F     192      73  32.5      B+
             street_2      M     186      82  87.2      B+
             street_2      F     167      81  80.4      B-
             street_4      F     159      64  84.8      B+

print(df_muls.reorder_levels([2,0,1],axis=0).sort_index().head())

                      Gender  Height  Weight  Math Physics
Address  School Class                                     
street_1 S_1    C_1        M     173      63  34.0      A+
                C_3        F     175      57  87.7      A-
         S_2    C_2        M     175      74  47.2      B-
street_2 S_1    C_1        F     192      73  32.5      B+
                C_1        M     186      82  87.2      B+

print(df_muls.reorder_levels(['Address', 'School', 'Class'], \
                             axis = 0).sort_index().head())

                      Gender  Height  Weight  Math Physics
Address  School Class                                     
street_1 S_1    C_1        M     173      63  34.0      A+
                C_3        F     175      57  87.7      A-
         S_2    C_2        M     175      74  47.2      B-
street_2 S_1    C_1        F     192      73  32.5      B+
                C_1        M     186      82  87.2      B+

5. 索引设定

5.1 index_col参数

index_col参数是read_csv()中的一个参数，用来指定将某些原数据的列作为索引。

print(pd.read_csv(r'./data/table.csv',index_col = ['Address','School']).head())

                Class    ID Gender  Height  Weight  Math Physics
Address  School                                                 
street_1 S_1      C_1  1101      M     173      63  34.0      A+
street_2 S_1      C_1  1102      F     192      73  32.5      B+
         S_1      C_1  1103      M     186      82  87.2      B+
         S_1      C_1  1104      F     167      81  80.4      B-
street_4 S_1      C_1  1105      F     159      64  84.8      B+

5.2 reindex()方法

5.2.1 一般用法

print(df.head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

print(df.reindex(index = [1101,1203,1206,2402]))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1   173.0    63.0  34.0      A+
1203    S_1   C_2      M  street_6   160.0    53.0  58.8      A+
1206    NaN   NaN    NaN       NaN     NaN     NaN   NaN     NaN
2402    S_2   C_4      M  street_7   166.0    82.0  48.7       B

print(df.reindex(columns = ['Height', 'Gender', 'Average']).head())

      Height Gender  Average
ID                          
1101     173      M      NaN
1102     192      F      NaN
1103     186      M      NaN
1104     167      F      NaN
1105     159      F      NaN

5.2.2 自动填充缺失值

可以选择缺失值的填充方法：fill_value和method（bfill/ffill/nearest），其中method参数必须索引单调。

print(df.reindex(index = [1101, 1203, 1206, 2402], method = 'bfill'))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1203    S_1   C_2      M  street_6     160      53  58.8      A+
1206    S_1   C_3      M  street_4     161      68  31.5      B+
2402    S_2   C_4      M  street_7     166      82  48.7       B

5.3 reindex_like()方法

reindex_like()方法会生成一个横纵索引完全与参数列表一致的表格。

df_temp = pd.DataFrame({'Weight':np.zeros(5),
                        'Height':np.zeros(5),
                        'ID':[1101,1104,1103,1106,1102]}).set_index('ID')
print(df_temp)
print(df_temp.reindex_like(df[0:5][['Weight','Height']]))

      Weight  Height
ID                  
1101     0.0     0.0
1104     0.0     0.0
1103     0.0     0.0
1106     0.0     0.0
1102     0.0     0.0
      Weight  Height
ID                  
1101     0.0     0.0
1102     0.0     0.0
1103     0.0     0.0
1104     0.0     0.0
1105     NaN     NaN

如果df_temp单调还可以使用method参数。

df_temp = pd.DataFrame({'Weight':range(5),
                        'Height':range(5),
                        'ID':[1101, 1104, 1103, \
                              1106, 1102]}).set_index('ID').sort_index()
print(df_temp.reindex_like(df[0:5][['Weight','Height']], method = 'bfill'))

      Weight  Height
ID                  
1101       0       0
1102       4       4
1103       2       2
1104       1       1
1105       3       3

5.4 set_index()方法

5.4.1 设置表内的列作为索引

print(df.head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

print(df.set_index('Class').head())

      School Gender   Address  Height  Weight  Math Physics
Class                                                      
C_1      S_1      M  street_1     173      63  34.0      A+
C_1      S_1      F  street_2     192      73  32.5      B+
C_1      S_1      M  street_2     186      82  87.2      B+
C_1      S_1      F  street_2     167      81  80.4      B-
C_1      S_1      F  street_4     159      64  84.8      B+

5.4.2 利用append参数将当前索引维持不变

print(df.set_index('Class', append = True).head())

           School Gender   Address  Height  Weight  Math Physics
ID   Class                                                      
1101 C_1      S_1      M  street_1     173      63  34.0      A+
1102 C_1      S_1      F  street_2     192      73  32.5      B+
1103 C_1      S_1      M  street_2     186      82  87.2      B+
1104 C_1      S_1      F  street_2     167      81  80.4      B-
1105 C_1      S_1      F  street_4     159      64  84.8      B+

5.4.3 使用与表长相同的列作为索引

# 必须要先将参数转化为序列
print(df.set_index(pd.Series(range(df.shape[0]))).head())

  School Class Gender   Address  Height  Weight  Math Physics
0    S_1   C_1      M  street_1     173      63  34.0      A+
1    S_1   C_1      F  street_2     192      73  32.5      B+
2    S_1   C_1      M  street_2     186      82  87.2      B+
3    S_1   C_1      F  street_2     167      81  80.4      B-
4    S_1   C_1      F  street_4     159      64  84.8      B+

5.4.4 添加多级索引

print(df.set_index([pd.Series(range(df.shape[0])), \
                    pd.Series(np.ones(df.shape[0]))]).head())

      School Class Gender   Address  Height  Weight  Math Physics
0 1.0    S_1   C_1      M  street_1     173      63  34.0      A+
1 1.0    S_1   C_1      F  street_2     192      73  32.5      B+
2 1.0    S_1   C_1      M  street_2     186      82  87.2      B+
3 1.0    S_1   C_1      F  street_2     167      81  80.4      B-
4 1.0    S_1   C_1      F  street_4     159      64  84.8      B+

5.5 reset_index()方法

5.5.1 默认状态直接恢复到自然数索引

print(df.reset_index().head())

     ID School Class Gender   Address  Height  Weight  Math Physics
0  1101    S_1   C_1      M  street_1     173      63  34.0      A+
1  1102    S_1   C_1      F  street_2     192      73  32.5      B+
2  1103    S_1   C_1      M  street_2     186      82  87.2      B+
3  1104    S_1   C_1      F  street_2     167      81  80.4      B-
4  1105    S_1   C_1      F  street_4     159      64  84.8      B+

5.5.2 指定某级进行重置

L1,L2 = ['A','B','C'], ['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names = ('Upper', 'Lower'))
L3,L4 = ['D','E','F'], ['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names = ('Big', 'Small'))
df_temp = pd.DataFrame(np.random.rand(9, 9),index = mul_index1,columns = mul_index2)
print(df_temp.head())

Big                 D                             E                      \
Small               d         e         f         d         e         f   
Upper Lower                                                               
A     a      0.996992  0.239206  0.466831  0.317890  0.534678  0.692468   
      b      0.768305  0.288890  0.937617  0.727311  0.634722  0.213160   
      c      0.424430  0.188946  0.576724  0.934759  0.466979  0.319530   
B     a      0.778316  0.394294  0.373620  0.792681  0.614062  0.480871   
      b      0.285218  0.202051  0.780276  0.531896  0.939366  0.167180   

Big                 F                      
Small               d         e         f  
Upper Lower                                
A     a      0.386664  0.441064  0.539764  
      b      0.744900  0.249924  0.741903  
      c      0.415877  0.426759  0.080056  
B     a      0.624051  0.955153  0.464689  
      b      0.355089  0.306564  0.428365

df_temp1 = df_temp.reset_index(level=1,col_level=1)
print(df_temp1.head())

Big                 D                             E                      \
Small Lower         d         e         f         d         e         f   
Upper                                                                     
A         a  0.996992  0.239206  0.466831  0.317890  0.534678  0.692468   
A         b  0.768305  0.288890  0.937617  0.727311  0.634722  0.213160   
A         c  0.424430  0.188946  0.576724  0.934759  0.466979  0.319530   
B         a  0.778316  0.394294  0.373620  0.792681  0.614062  0.480871   
B         b  0.285218  0.202051  0.780276  0.531896  0.939366  0.167180   

Big           F                      
Small         d         e         f  
Upper                                
A      0.386664  0.441064  0.539764  
A      0.744900  0.249924  0.741903  
A      0.415877  0.426759  0.080056  
B      0.624051  0.955153  0.464689  
B      0.355089  0.306564  0.428365

df_temp1.columns

MultiIndex([( '', 'Lower'),
            ('D',     'd'),
            ('D',     'e'),
            ('D',     'f'),
            ('E',     'd'),
            ('E',     'e'),
            ('E',     'f'),
            ('F',     'd'),
            ('F',     'e'),
            ('F',     'f')],
           names=['Big', 'Small'])

df_temp1.index

Index(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'], dtype='object', name='Upper')

5.6 rename_axis()和rename()方法

5.6.1 rename_axis()用于修改多级索引中的索引名

print(df_temp.rename_axis(index = {'Lower':'LowerLower'}, columns = {'Big':'BigBig'}))

BigBig                   D                             E                      \
Small                    d         e         f         d         e         f   
Upper LowerLower                                                               
A     a           0.996992  0.239206  0.466831  0.317890  0.534678  0.692468   
      b           0.768305  0.288890  0.937617  0.727311  0.634722  0.213160   
      c           0.424430  0.188946  0.576724  0.934759  0.466979  0.319530   
B     a           0.778316  0.394294  0.373620  0.792681  0.614062  0.480871   
      b           0.285218  0.202051  0.780276  0.531896  0.939366  0.167180   
      c           0.627673  0.650161  0.279214  0.256610  0.718082  0.581464   
C     a           0.557991  0.005820  0.713160  0.196349  0.982871  0.157749   
      b           0.213228  0.102764  0.284405  0.486068  0.798130  0.634954   
      c           0.761508  0.048896  0.907974  0.973812  0.954789  0.693932   

BigBig                   F                      
Small                    d         e         f  
Upper LowerLower                                
A     a           0.386664  0.441064  0.539764  
      b           0.744900  0.249924  0.741903  
      c           0.415877  0.426759  0.080056  
B     a           0.624051  0.955153  0.464689  
      b           0.355089  0.306564  0.428365  
      c           0.272150  0.907038  0.112153  
C     a           0.491743  0.853777  0.330861  
      b           0.438192  0.880983  0.126540  
      c           0.862145  0.444185  0.237926

5.6.2 rename()用于修改列或者行索引标签而非索引名

print(df_temp.rename(index = {'A':'T'}, columns = {'e':'changed_e'}).head())

Big                 D                             E                      \
Small               d changed_e         f         d changed_e         f   
Upper Lower                                                               
T     a      0.996992  0.239206  0.466831  0.317890  0.534678  0.692468   
      b      0.768305  0.288890  0.937617  0.727311  0.634722  0.213160   
      c      0.424430  0.188946  0.576724  0.934759  0.466979  0.319530   
B     a      0.778316  0.394294  0.373620  0.792681  0.614062  0.480871   
      b      0.285218  0.202051  0.780276  0.531896  0.939366  0.167180   

Big                 F                      
Small               d changed_e         f  
Upper Lower                                
T     a      0.386664  0.441064  0.539764  
      b      0.744900  0.249924  0.741903  
      c      0.415877  0.426759  0.080056  
B     a      0.624051  0.955153  0.464689  
      b      0.355089  0.306564  0.428365

6. 常用索引型函数

6.1 where()函数

print(df.head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

6.1.1 对条件为False的单元进行填充

print(df.where(df['Gender'] == 'M').head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1   173.0    63.0  34.0      A+
1102    NaN   NaN    NaN       NaN     NaN     NaN   NaN     NaN
1103    S_1   C_1      M  street_2   186.0    82.0  87.2      B+
1104    NaN   NaN    NaN       NaN     NaN     NaN   NaN     NaN
1105    NaN   NaN    NaN       NaN     NaN     NaN   NaN     NaN

6.1.2 对条件为False的单元进行指定值填充

print(df.where(df['Gender'] == 'M', \
               np.random.rand(df.shape[0], df.shape[1])).head())

        School      Class    Gender   Address      Height     Weight  \
ID                                                                     
1101       S_1        C_1         M  street_1  173.000000  63.000000   
1102  0.531856   0.588896   0.40665  0.936004    0.473496   0.674745   
1103       S_1        C_1         M  street_2  186.000000  82.000000   
1104  0.203294  0.0809859  0.423323  0.126764    0.965819   0.664274   
1105   0.82605   0.541455  0.714742  0.455908    0.019623   0.906721   

           Math   Physics  
ID                         
1101  34.000000        A+  
1102   0.203038  0.831944  
1103  87.200000        B+  
1104   0.352529  0.836464  
1105   0.352698  0.156627

6.2 mask()函数

mask()函数与where()功能上相反，其余完全一致，即对条件为True的单元进行填充。

print(df.mask(df['Gender'] == 'M').dropna().head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1102    S_1   C_1      F  street_2   192.0    73.0  32.5      B+
1104    S_1   C_1      F  street_2   167.0    81.0  80.4      B-
1105    S_1   C_1      F  street_4   159.0    64.0  84.8      B+
1202    S_1   C_2      F  street_4   176.0    94.0  63.5      B-
1204    S_1   C_2      F  street_5   162.0    63.0  33.8       B

6.3 query()函数

query()函数中的布尔表达式中，下面的符号都是合法的：行列索引名、字符串、and/not/or/&/|/~/not in/in/==/!=、四则运算符

print(df.query('''(Address in ["street_6","street_7"])& \
                    (Weight>(70+10))&(ID in [1303,2304,2402])'''))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1303    S_1   C_3      M  street_7     188      82  49.7       B
2304    S_2   C_3      F  street_6     164      81  95.5      A-
2402    S_2   C_4      M  street_7     166      82  48.7       B

7. 重复元素处理

7.1 确认重复元素存在

print(df.duplicated('Class').head())

ID
1101    False
1102     True
1103     True
1104     True
1105     True
dtype: bool

可选参数keep默认为first，即首次出现设为不重复，若为last，则最后一次设为不重复，若为False，则所有重复项为False。

print(df.duplicated('Class', keep = 'last').tail())

ID
2401     True
2402     True
2403     True
2404     True
2405    False
dtype: bool

print(df.duplicated('Class', keep = False).head())

ID
1101    True
1102    True
1103    True
1104    True
1105    True
dtype: bool

7.2 去除重复元素

print(df.drop_duplicates('Class'))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1201    S_1   C_2      M  street_5     188      68  97.0      A-
1301    S_1   C_3      M  street_4     161      68  31.5      B+
2401    S_2   C_4      F  street_2     192      62  45.3       A

print(df.drop_duplicates('Class', keep = 'last'))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2105    S_2   C_1      M  street_4     170      81  34.2       A
2205    S_2   C_2      F  street_7     183      76  85.4       B
2305    S_2   C_3      M  street_4     187      73  48.9       B
2405    S_2   C_4      F  street_6     193      54  47.6       B

在传入多列时等价于将多列共同视作一个多级索引，比较重复项。

print(df.drop_duplicates(['School', 'Class']))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1201    S_1   C_2      M  street_5     188      68  97.0      A-
1301    S_1   C_3      M  street_4     161      68  31.5      B+
2101    S_2   C_1      M  street_7     174      84  83.3       C
2201    S_2   C_2      M  street_5     193     100  39.1       B
2301    S_2   C_3      F  street_4     157      78  72.3      B+
2401    S_2   C_4      F  street_2     192      62  45.3       A

8. 抽样函数sample()

8.1 按个数抽样

print(df.sample(n = 5))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2402    S_2   C_4      M  street_7     166      82  48.7       B
2401    S_2   C_4      F  street_2     192      62  45.3       A
2205    S_2   C_2      F  street_7     183      76  85.4       B
2305    S_2   C_3      M  street_4     187      73  48.9       B
1104    S_1   C_1      F  street_2     167      81  80.4      B-

8.2 按比例抽样

df.sample(frac = 0.05)

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1304    S_1   C_3      M  street_2     195      70  85.2       A
1203    S_1   C_2      M  street_6     160      53  58.8      A+

8.3 有放回抽样

print(df.sample(n = df.shape[0], replace = True).head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1105    S_1   C_1      F  street_4     159      64  84.8      B+
2202    S_2   C_2      F  street_7     194      77  68.5      B+
2103    S_2   C_1      M  street_4     157      61  52.5      B-
2203    S_2   C_2      M  street_4     155      91  73.8      A+
1304    S_1   C_3      M  street_2     195      70  85.2       A

print(df.sample(n = 35, replace = True).index.is_unique)

False

8.4 对列随机抽样

print(df.sample(n = 3, axis = 1).head())

     Class Gender School
ID                      
1101   C_1      M    S_1
1102   C_1      F    S_1
1103   C_1      M    S_1
1104   C_1      F    S_1
1105   C_1      F    S_1

8.5 带权重抽样

8.5.1 对行赋予权值

print(df.sample(n = 3, weights = np.random.rand(df.shape[0])).head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1105    S_1   C_1      F  street_4     159      64  84.8      B+
2405    S_2   C_4      F  street_6     193      54  47.6       B
1205    S_1   C_2      F  street_6     167      63  68.4      B-

8.5.2 以某列值为权值（自动归一化）

print(df.sample(n = 3, weights = df['Math']).head())

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2105    S_2   C_1      M  street_4     170      81  34.2       A
2403    S_2   C_4      F  street_6     158      60  59.7      B+
1201    S_1   C_2      M  street_5     188      68  97.0      A-