目录
5.6.1 rename_axis()用于修改多级索引中的索引名
5.6.2 rename()用于修改列或者行索引标签而非索引名
1. 学习内容
1. 了解pandas支持的各种索引操作
2. 了解pandas对索引的设置
3. 了解pandas的索引函数
4. 利用pandas的索引操作对数据进行处理和抽样
2. 准备工作
import numpy as np
import pandas as pd
df = pd.read_csv(r'./data/table.csv', index_col = 'ID')
print(df.head())
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
3. pandas的单级索引操作
pandas最基本的索引操作有三种:loc[],iloc[]和[]操作符。这三种方法都适用于序列和表格,不过由于序列和表格存在不同之处,这三种方法在二者上的使用也有细节上的不同。
在此基础上,还可以引申出布尔索引和区间索引等索引方式。
3.1 loc[]
在[1]中,我曾说过:“如果将表格视作一个二维数组的话,按照习惯我们很有可能会将行索引作为第一个下标并将列名作为第二个下标。但是,需要注意的是:表格中某个元素的提取要将列名作为第一个下标,而索引要作为第二个下标。”
loc[]的使用也要遵循这个规律。另外,loc[]能传入的只有布尔列表和索引子集构成的列表,而且loc[]中的切片始终都包含右端点。
3.1.1 单行索引
print(df.loc[1103])
School S_1
Class C_1
Gender M
Address street_2
Height 186
Weight 82
Math 87.2
Physics B+
Name: 1103, dtype: object
3.1.2 多行索引
print(df.loc[[1102, 2304]])
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
2304 S_2 C_3 F street_6 164 81 95.5 A-
print(df.loc[1304:].head())
School Class Gender Address Height Weight Math Physics
ID
1304 S_1 C_3 M street_2 195 70 85.2 A
1305 S_1 C_3 F street_5 187 69 61.7 B-
2101 S_2 C_1 M street_7 174 84 83.3 C
2102 S_2 C_1 F street_6 161 61 50.6 B+
2103 S_2 C_1 M street_4 157 61 52.5 B-
print(df.loc[2402::-1].head())
School Class Gender Address Height Weight Math Physics
ID
2402 S_2 C_4 M street_7 166 82 48.7 B
2401 S_2 C_4 F street_2 192 62 45.3 A
2305 S_2 C_3 M street_4 187 73 48.9 B
2304 S_2 C_3 F street_6 164 81 95.5 A-
2303 S_2 C_3 F street_7 190 99 65.9 C
3.1.3 单列索引
print(df.loc[:, 'Height'].head())
ID
1101 173
1102 192
1103 186
1104 167
1105 159
Name: Height, dtype: int64
3.1.4 多列索引
print(df.loc[:, ['Height', 'Math']].head())
Height Math
ID
1101 173 34.0
1102 192 32.5
1103 186 87.2
1104 167 80.4
1105 159 84.8
print(df.loc[:, 'Height':'Math'].head())
Height Weight Math
ID
1101 173 63 34.0
1102 192 73 32.5
1103 186 82 87.2
1104 167 81 80.4
1105 159 64 84.8
3.1.5 联合索引
print(df.loc[1102:2401:3, 'Height':'Math'].head())
Height Weight Math
ID
1102 192 73 32.5
1105 159 64 84.8
1203 160 53 58.8
1301 161 68 31.5
1304 195 70 85.2
3.1.6 函数式索引
loc[]中可以使用匿名函数或者自定义的函数。其中,匿名函数传入的参数就是整个表格。
print(df.loc[lambda x: x['Gender'] == 'M'].head())
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1203 S_1 C_2 M street_6 160 53 58.8 A+
1301 S_1 C_3 M street_4 161 68 31.5 B+
def f(x):
return [1101, 1103]
print(df.loc[f])
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1103 S_1 C_1 M street_2 186 82 87.2 B+
3.1.7 布尔索引
print(df.loc[df['Address'].isin(['street_7', 'street_4'])].head())
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1301 S_1 C_3 M street_4 161 68 31.5 B+
1303 S_1 C_3 M street_7 188 82 49.7 B
2101 S_2 C_1 M street_7 174 84 83.3 C
print(df.loc[[True if i[-1] == '4' or i[-1] == '7' else False \
for i in df['Address'].values]].head())
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1301 S_1 C_3 M street_4 161 68 31.5 B+
1303 S_1 C_3 M street_7 188 82 49.7 B
2101 S_2 C_1 M street_7 174 84 83.3 C
3.2 iloc[]
实际上,pandas的表格也是可以按照数组的方式来访问的。只不过必须要借助iloc[]才能实现。iloc[]中接收的参数只能为整数或整数列表,不能使用布尔索引。与loc[]不同,iloc[]切片右端点是不包含的。
3.2.1 单行索引
print(df.iloc[3])
School S_1
Class C_1
Gender F
Address street_2
Height 167
Weight 81
Math 80.4
Physics B-
Name: 1104, dtype: object
3.2.2 多行索引
print(df.iloc[3:5])
School Class Gender Address Height Weight Math Physics
ID
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
3.2.3 单列索引
print(df.iloc[:, 3].head())
ID
1101 street_1
1102 street_2
1103 street_2
1104 street_2
1105 street_4
Name: Address, dtype: object
3.2.4 多列索引
print(df.iloc[:, 7::-2].head())
Physics Weight Address Class
ID
1101 A+ 63 street_1 C_1
1102 B+ 73 street_2 C_1
1103 B+ 82 street_2 C_1
1104 B- 81 street_2 C_1
1105 B+ 64 street_4 C_1
3.2.5 联合索引
print(df.iloc[3::4, 7::-2].head())
Physics Weight Address Class
ID
1104 B- 81 street_2 C_1
1203 A+ 53 street_6 C_2
1302 A- 57 street_1 C_3
2101 C 84 street_7 C_1
2105 A 81 street_4 C_1
3.2.6 函数式索引
print(df.iloc[lambda x: [3, 4]].head())
School Class Gender Address Height Weight Math Physics
ID
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
3.3 []操作符
[]操作符有的时候使用的是索引标签,有时候使用的又是绝对位置。因此很容易产生混淆。所以,还是用loc[]和iloc[]更合适一些。
另外,如果索引是浮点数,那么对于序列就更不要使用[]来进行访问了。
3.3.1 表格单行索引
print(df[1:2])
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
这里很容易产生误解,以为“1:2”指的是索引标签。但实际上它指的是绝对位置。如果想要用索引标签来访问整个行的话,要用如下的方式:
row = df.index.get_loc(1102)
print(df[row:row+1])
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
3.3.2 表格多行索引
print(df[3:5])
School Class Gender Address Height Weight Math Physics
ID
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
3.3.3 表格单列索引
print(df['School'].head())
ID
1101 S_1
1102 S_1
1103 S_1
1104 S_1
1105 S_1
Name: School, dtype: object
3.3.4 表格多列索引
print(df[['School', 'Math']].head())
School Math
ID
1101 S_1 34.0
1102 S_1 32.5
1103 S_1 87.2
1104 S_1 80.4
1105 S_1 84.8
3.3.5 表格函数式索引
print(df[lambda x: ['Math', 'Physics']].head())
Math Physics
ID
1101 34.0 A+
1102 32.5 B+
1103 87.2 B+
1104 80.4 B-
1105 84.8 B+
3.3.6 表格布尔索引
print(df[df['Gender'] == 'F'].head())
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1204 S_1 C_2 F street_5 162 63 33.8 B
3.3.7 序列单行索引
s = pd.Series(df['Math'],index=df.index)
print(s[1101])
34.0
需要注意的是这里使用的是索引标签而非绝对位置。
3.3.8 序列多行索引
print(s[0:4])
ID
1101 34.0
1102 32.5
1103 87.2
1104 80.4
Name: Math, dtype: float64
这里使用的是绝对位置进行的切片。
3.3.9 序列函数式索引
print(s[lambda x: x.index[16::-6]])
ID
2102 50.6
1301 31.5
1105 84.8
Name: Math, dtype: float64
3.3.10 序列布尔索引
print(s[s > 80])
ID
1103 87.2
1104 80.4
1105 84.8
1201 97.0
1302 87.7
1304 85.2
2101 83.3
2205 85.4
2304 95.5
Name: Math, dtype: float64
3.4 布尔索引
3.4.1 逻辑运算符'&'、'|'和'~'
print(df[(df['Gender'] == 'F') & (df['Address'] == 'street_2')].head())
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
2401 S_2 C_4 F street_2 192 62 45.3 A
2404 S_2 C_4 F street_2 160 84 67.7 B
print(df[(df['Math'] > 85) | (df['Address'] == 'street_7')].head())
School Class Gender Address Height Weight Math Physics
ID
1103 S_1 C_1 M street_2 186 82 87.2 B+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1302 S_1 C_3 F street_1 175 57 87.7 A-
1303 S_1 C_3 M street_7 188 82 49.7 B
1304 S_1 C_3 M street_2 195 70 85.2 A
print(df[~((df['Math'] > 75) | (df['Address'] == 'street_1'))].head())
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1203 S_1 C_2 M street_6 160 53 58.8 A+
1204 S_1 C_2 F street_5 162 63 33.8 B
1205 S_1 C_2 F street_6 167 63 68.4 B-
3.4.2 在loc[]中使用布尔索引
print(df.loc[df['Math'] > 60, (df[:8]['Address'] == 'street_6').values].head())
Physics
ID
1103 B+
1104 B-
1105 B+
1201 A-
1202 B-
3.4.3 isin()方法
print(df[df['Address'].isin(['street_1', 'street_4']) & \
df['Physics'].isin(['A','A+'])])
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
2105 S_2 C_1 M street_4 170 81 34.2 A
2203 S_2 C_2 M street_4 155 91 73.8 A+
该逻辑同样可以用字典来实现,代码如下:
print(df[df[['Address', 'Physics']].isin({'Address':['street_1', 'street_4'], \
'Physics':['A','A+']}).all(1)])
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
2105 S_2 C_1 M street_4 170 81 34.2 A
2203 S_2 C_2 M street_4 155 91 73.8 A+
3.5 快速标量索引
如果只是取单个元素,那么at[]和iat[]在时间上要远好于loc[]和iloc[]。
display(df.at[1101,'School'])
display(df.loc[1101,'School'])
display(df.iat[0,0])
display(df.iloc[0,0])
3.6 区间索引
3.6.1 构造区间
print(pd.interval_range(start = 0, end = 5))
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
closed='right',
dtype='interval[int64]')
print(pd.interval_range(start = 0, periods = 8, freq = 5))
IntervalIndex([(0, 5], (5, 10], (10, 15], (15, 20], (20, 25], (25, 30], (30, 35], (35, 40]],
closed='right',
dtype='interval[int64]')
3.6.2 利用cut将数值列转为区间
math_interval = pd.cut(df['Math'],bins = [0, 40, 60, 80, 100])
#注意,如果没有类型转换,此时并不是区间类型,而是category类型
math_interval.head()
ID
1101 (0, 40]
1102 (0, 40]
1103 (80, 100]
1104 (80, 100]
1105 (80, 100]
Name: Math, dtype: category
Categories (4, interval[int64]): [(0, 40] < (40, 60] < (60, 80] < (80, 100]]
3.6.3 区间索引的选取
df_i = df.join(math_interval,rsuffix = \
'_interval')[['Math', \
'Math_interval']].reset_index().set_index('Math_interval')
print(df_i.head())
ID Math
Math_interval
(0, 40] 1101 34.0
(0, 40] 1102 32.5
(80, 100] 1103 87.2
(80, 100] 1104 80.4
(80, 100] 1105 84.8
print(df_i.loc[65].head())
ID Math
Math_interval
(60, 80] 1202 63.5
(60, 80] 1205 68.4
(60, 80] 1305 61.7
(60, 80] 2104 72.2
(60, 80] 2202 68.5
print(df_i.loc[[45, 65]])
ID Math
Math_interval
(40, 60] 1203 58.8
(40, 60] 1303 49.7
(40, 60] 2102 50.6
(40, 60] 2103 52.5
(40, 60] 2204 47.2
(40, 60] 2305 48.9
(40, 60] 2401 45.3
(40, 60] 2402 48.7
(40, 60] 2403 59.7
(40, 60] 2405 47.6
(60, 80] 1202 63.5
(60, 80] 1205 68.4
(60, 80] 1305 61.7
(60, 80] 2104 72.2
(60, 80] 2202 68.5
(60, 80] 2203 73.8
(60, 80] 2301 72.3
(60, 80] 2303 65.9
(60, 80] 2404 67.7
# 如果想要选取某个区间,先要把分类变量转为区间变量,再使用overlap方法
print(df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head())
ID Math
Math_interval
(80, 100] 1103 87.2
(80, 100] 1104 80.4
(80, 100] 1105 84.8
(80, 100] 1201 97.0
(60, 80] 1202 63.5
4. pandas的多级索引操作
4.1 创建多级索引
4.1.1 从元组和数组中创建
以下的三段代码是等效的。
tuples = [('A','a'),('A','b'),('B','a'),('B','b')]
mul_index = pd.MultiIndex.from_tuples(tuples, names = ('Upper', 'Lower'))
print(pd.DataFrame({'Score':['perfect','good','fair','bad']},index = mul_index))
L1 = list('AABB')
L2 = list('abab')
tuples = list(zip(L1, L2))
mul_index = pd.MultiIndex.from_tuples(tuples, names = ('Upper', 'Lower'))
print(pd.DataFrame({'Score':['perfect', 'good', 'fair', 'bad']},index = mul_index))
arrays = [['A', 'a'], ['A', 'b'], ['B', 'a'], ['B', 'b']]
mul_index = pd.MultiIndex.from_tuples(arrays, names = ('Upper', 'Lower'))
print(pd.DataFrame({'Score':['perfect', 'good', 'fair', 'bad']},index = mul_index))
Score
Upper Lower
A a perfect
b good
B a fair
b bad
如果将mul_index打印出来的话,结果是这样的:
MultiIndex([('A', 'a'),
('A', 'b'),
('B', 'a'),
('B', 'b')],
names=['Upper', 'Lower'])
4.1.2 从笛卡尔积中创建
L1 = ['A', 'B']
L2 = ['a', 'b']
mul_index = pd.MultiIndex.from_product([L1, L2],names = ('Upper', 'Lower'))
print(pd.DataFrame({'Score':['perfect', 'good', 'fair', 'bad']},index = mul_index))
实质上这里的mul_index跟上面是一模一样的。
4.1.3 指定表格中的列创建
df_using_mul = df.set_index(['Class', 'Address'])
print(df_using_mul.head())
School Gender Height Weight Math Physics
Class Address
C_1 street_1 S_1 M 173 63 34.0 A+
street_2 S_1 F 192 73 32.5 B+
street_2 S_1 M 186 82 87.2 B+
street_2 S_1 F 167 81 80.4 B-
street_4 S_1 F 159 64 84.8 B+
4.2 多级索引切片
4.2.1 一般情况
在使用多级索引之前,必须要保证数据的索引是已经排好序的。否则的话会出错。
# 在未保证索引有序的前提下进行多级索引切片是会出错的
df_using_mul.loc['C_2','street_5']
因此要先对索引进行排序然后再进行多级索引的相关操作。
print(df_using_mul.sort_index().loc['C_2','street_5'])
School Gender Height Weight Math Physics
Class Address
C_2 street_5 S_1 M 188 68 97.0 A-
street_5 S_1 F 162 63 33.8 B
street_5 S_2 M 193 100 39.1 B
至于索引是否有序,可以用以下语句来检验:
print(df_using_mul.index.is_lexsorted())
多级索引切片的相关语句如下(同理要先保证索引有序):
print(df_using_mul.sort_index().loc[('C_2','street_6'):('C_3','street_4')])
School Gender Height Weight Math Physics
Class Address
C_2 street_6 S_1 M 160 53 58.8 A+
street_6 S_1 F 167 63 68.4 B-
street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_1 S_1 F 175 57 87.7 A-
street_2 S_1 M 195 70 85.2 A
street_4 S_1 M 161 68 31.5 B+
street_4 S_2 F 157 78 72.3 B+
street_4 S_2 M 187 73 48.9 B
print(df_using_mul.sort_index().loc[('C_2','street_7'):('C_3')].head())
School Gender Height Weight Math Physics
Class Address
C_2 street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_1 S_1 F 175 57 87.7 A-
street_2 S_1 M 195 70 85.2 A
street_4 S_1 M 161 68 31.5 B+
4.2.2 第一类特殊情况:元组构成的列表
print(df_using_mul.sort_index().loc[[('C_2','street_7'), ('C_3','street_2')]])
School Gender Height Weight Math Physics
Class Address
C_2 street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_2 S_1 M 195 70 85.2 A
4.2.3 第二类特殊情况:列表构成的元组
print(df_using_mul.sort_index().loc[(['C_2','C_3'],['street_4','street_7']), :])
School Gender Height Weight Math Physics
Class Address
C_2 street_4 S_1 F 176 94 63.5 B-
street_4 S_2 M 155 91 73.8 A+
street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_4 S_1 M 161 68 31.5 B+
street_4 S_2 F 157 78 72.3 B+
street_4 S_2 M 187 73 48.9 B
street_7 S_1 M 188 82 49.7 B
street_7 S_2 F 190 99 65.9 C
4.3 多级索引中的slice对象
L1,L2 = ['A','B','C'], ['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names = ('Upper', 'Lower'))
L3,L4 = ['D','E','F'], ['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names = ('Big', 'Small'))
df_s = pd.DataFrame(np.random.rand(9,9), index = mul_index1, columns = mul_index2)
print(df_s)
Big D E \
Small d e f d e f
Upper Lower
A a 0.419165 0.610234 0.584066 0.365608 0.869079 0.966130
b 0.423426 0.883564 0.456577 0.711615 0.423257 0.170660
c 0.849075 0.194531 0.062969 0.218767 0.774477 0.036436
B a 0.981672 0.531616 0.928525 0.397371 0.273752 0.701575
b 0.842900 0.607972 0.317989 0.705531 0.938026 0.488817
c 0.743939 0.607704 0.966594 0.552271 0.702859 0.459315
C a 0.831294 0.688231 0.069070 0.097113 0.771693 0.392254
b 0.268481 0.044965 0.620735 0.745501 0.062748 0.537725
c 0.050948 0.562608 0.358885 0.283099 0.169198 0.940639
Big F
Small d e f
Upper Lower
A a 0.331276 0.033756 0.698122
b 0.691064 0.212999 0.151984
c 0.044488 0.088146 0.365206
B a 0.484372 0.734307 0.262505
b 0.485355 0.005099 0.839988
c 0.867462 0.343085 0.461860
C a 0.003566 0.797579 0.249657
b 0.923007 0.627113 0.050880
c 0.907182 0.534797 0.822204
idx = pd.IndexSlice
print(df_s.loc[idx['B':, df_s['D']['d'] > 0.3], idx[df_s.sum() > 4]])
Big D E \
Small d e f d e f
Upper Lower
B a 0.981672 0.531616 0.928525 0.397371 0.273752 0.701575
b 0.842900 0.607972 0.317989 0.705531 0.938026 0.488817
c 0.743939 0.607704 0.966594 0.552271 0.702859 0.459315
C a 0.831294 0.688231 0.069070 0.097113 0.771693 0.392254
Big F
Small d
Upper Lower
B a 0.484372
b 0.485355
c 0.867462
C a 0.003566
4.4 多级索引中的索引交换
print(df_using_mul.head())
School Gender Height Weight Math Physics
Class Address
C_1 street_1 S_1 M 173 63 34.0 A+
street_2 S_1 F 192 73 32.5 B+
street_2 S_1 M 186 82 87.2 B+
street_2 S_1 F 167 81 80.4 B-
street_4 S_1 F 159 64 84.8 B+
4.4.1 两层交换
print(df_using_mul.swaplevel(i = 1, j = 0, axis = 0).sort_index().head())
School Gender Height Weight Math Physics
Address Class
street_1 C_1 S_1 M 173 63 34.0 A+
C_2 S_2 M 175 74 47.2 B-
C_3 S_1 F 175 57 87.7 A-
street_2 C_1 S_1 F 192 73 32.5 B+
C_1 S_1 M 186 82 87.2 B+
4.4.2 多层交换
df_muls = df.set_index(['School','Class','Address'])
print(df_muls.head())
Gender Height Weight Math Physics
School Class Address
S_1 C_1 street_1 M 173 63 34.0 A+
street_2 F 192 73 32.5 B+
street_2 M 186 82 87.2 B+
street_2 F 167 81 80.4 B-
street_4 F 159 64 84.8 B+
print(df_muls.reorder_levels([2,0,1],axis=0).sort_index().head())
Gender Height Weight Math Physics
Address School Class
street_1 S_1 C_1 M 173 63 34.0 A+
C_3 F 175 57 87.7 A-
S_2 C_2 M 175 74 47.2 B-
street_2 S_1 C_1 F 192 73 32.5 B+
C_1 M 186 82 87.2 B+
print(df_muls.reorder_levels(['Address', 'School', 'Class'], \
axis = 0).sort_index().head())
Gender Height Weight Math Physics
Address School Class
street_1 S_1 C_1 M 173 63 34.0 A+
C_3 F 175 57 87.7 A-
S_2 C_2 M 175 74 47.2 B-
street_2 S_1 C_1 F 192 73 32.5 B+
C_1 M 186 82 87.2 B+
5. 索引设定
5.1 index_col参数
index_col参数是read_csv()中的一个参数,用来指定将某些原数据的列作为索引。
print(pd.read_csv(r'./data/table.csv',index_col = ['Address','School']).head())
Class ID Gender Height Weight Math Physics
Address School
street_1 S_1 C_1 1101 M 173 63 34.0 A+
street_2 S_1 C_1 1102 F 192 73 32.5 B+
S_1 C_1 1103 M 186 82 87.2 B+
S_1 C_1 1104 F 167 81 80.4 B-
street_4 S_1 C_1 1105 F 159 64 84.8 B+
5.2 reindex()方法
5.2.1 一般用法
print(df.head())
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
print(df.reindex(index = [1101,1203,1206,2402]))
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173.0 63.0 34.0 A+
1203 S_1 C_2 M street_6 160.0 53.0 58.8 A+
1206 NaN NaN NaN NaN NaN NaN NaN NaN
2402 S_2 C_4 M street_7 166.0 82.0 48.7 B
print(df.reindex(columns = ['Height', 'Gender', 'Average']).head())
Height Gender Average
ID
1101 173 M NaN
1102 192 F NaN
1103 186 M NaN
1104 167 F NaN
1105 159 F NaN
5.2.2 自动填充缺失值
可以选择缺失值的填充方法:fill_value和method(bfill/ffill/nearest),其中method参数必须索引单调。
print(df.reindex(index = [1101, 1203, 1206, 2402], method = 'bfill'))
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1203 S_1 C_2 M street_6 160 53 58.8 A+
1206 S_1 C_3 M street_4 161 68 31.5 B+
2402 S_2 C_4 M street_7 166 82 48.7 B
5.3 reindex_like()方法
reindex_like()方法会生成一个横纵索引完全与参数列表一致的表格。
df_temp = pd.DataFrame({'Weight':np.zeros(5),
'Height':np.zeros(5),
'ID':[1101,1104,1103,1106,1102]}).set_index('ID')
print(df_temp)
print(df_temp.reindex_like(df[0:5][['Weight','Height']]))
Weight Height
ID
1101 0.0 0.0
1104 0.0 0.0
1103 0.0 0.0
1106 0.0 0.0
1102 0.0 0.0
Weight Height
ID
1101 0.0 0.0
1102 0.0 0.0
1103 0.0 0.0
1104 0.0 0.0
1105 NaN NaN
如果df_temp单调还可以使用method参数。
df_temp = pd.DataFrame({'Weight':range(5),
'Height':range(5),
'ID':[1101, 1104, 1103, \
1106, 1102]}).set_index('ID').sort_index()
print(df_temp.reindex_like(df[0:5][['Weight','Height']], method = 'bfill'))
Weight Height
ID
1101 0 0
1102 4 4
1103 2 2
1104 1 1
1105 3 3
5.4 set_index()方法
5.4.1 设置表内的列作为索引
print(df.head())
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
print(df.set_index('Class').head())
School Gender Address Height Weight Math Physics
Class
C_1 S_1 M street_1 173 63 34.0 A+
C_1 S_1 F street_2 192 73 32.5 B+
C_1 S_1 M street_2 186 82 87.2 B+
C_1 S_1 F street_2 167 81 80.4 B-
C_1 S_1 F street_4 159 64 84.8 B+
5.4.2 利用append参数将当前索引维持不变
print(df.set_index('Class', append = True).head())
School Gender Address Height Weight Math Physics
ID Class
1101 C_1 S_1 M street_1 173 63 34.0 A+
1102 C_1 S_1 F street_2 192 73 32.5 B+
1103 C_1 S_1 M street_2 186 82 87.2 B+
1104 C_1 S_1 F street_2 167 81 80.4 B-
1105 C_1 S_1 F street_4 159 64 84.8 B+
5.4.3 使用与表长相同的列作为索引
# 必须要先将参数转化为序列
print(df.set_index(pd.Series(range(df.shape[0]))).head())
School Class Gender Address Height Weight Math Physics
0 S_1 C_1 M street_1 173 63 34.0 A+
1 S_1 C_1 F street_2 192 73 32.5 B+
2 S_1 C_1 M street_2 186 82 87.2 B+
3 S_1 C_1 F street_2 167 81 80.4 B-
4 S_1 C_1 F street_4 159 64 84.8 B+
5.4.4 添加多级索引
print(df.set_index([pd.Series(range(df.shape[0])), \
pd.Series(np.ones(df.shape[0]))]).head())
School Class Gender Address Height Weight Math Physics
0 1.0 S_1 C_1 M street_1 173 63 34.0 A+
1 1.0 S_1 C_1 F street_2 192 73 32.5 B+
2 1.0 S_1 C_1 M street_2 186 82 87.2 B+
3 1.0 S_1 C_1 F street_2 167 81 80.4 B-
4 1.0 S_1 C_1 F street_4 159 64 84.8 B+
5.5 reset_index()方法
5.5.1 默认状态直接恢复到自然数索引
print(df.reset_index().head())
ID School Class Gender Address Height Weight Math Physics
0 1101 S_1 C_1 M street_1 173 63 34.0 A+
1 1102 S_1 C_1 F street_2 192 73 32.5 B+
2 1103 S_1 C_1 M street_2 186 82 87.2 B+
3 1104 S_1 C_1 F street_2 167 81 80.4 B-
4 1105 S_1 C_1 F street_4 159 64 84.8 B+
5.5.2 指定某级进行重置
L1,L2 = ['A','B','C'], ['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names = ('Upper', 'Lower'))
L3,L4 = ['D','E','F'], ['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names = ('Big', 'Small'))
df_temp = pd.DataFrame(np.random.rand(9, 9),index = mul_index1,columns = mul_index2)
print(df_temp.head())
Big D E \
Small d e f d e f
Upper Lower
A a 0.996992 0.239206 0.466831 0.317890 0.534678 0.692468
b 0.768305 0.288890 0.937617 0.727311 0.634722 0.213160
c 0.424430 0.188946 0.576724 0.934759 0.466979 0.319530
B a 0.778316 0.394294 0.373620 0.792681 0.614062 0.480871
b 0.285218 0.202051 0.780276 0.531896 0.939366 0.167180
Big F
Small d e f
Upper Lower
A a 0.386664 0.441064 0.539764
b 0.744900 0.249924 0.741903
c 0.415877 0.426759 0.080056
B a 0.624051 0.955153 0.464689
b 0.355089 0.306564 0.428365
df_temp1 = df_temp.reset_index(level=1,col_level=1)
print(df_temp1.head())
Big D E \
Small Lower d e f d e f
Upper
A a 0.996992 0.239206 0.466831 0.317890 0.534678 0.692468
A b 0.768305 0.288890 0.937617 0.727311 0.634722 0.213160
A c 0.424430 0.188946 0.576724 0.934759 0.466979 0.319530
B a 0.778316 0.394294 0.373620 0.792681 0.614062 0.480871
B b 0.285218 0.202051 0.780276 0.531896 0.939366 0.167180
Big F
Small d e f
Upper
A 0.386664 0.441064 0.539764
A 0.744900 0.249924 0.741903
A 0.415877 0.426759 0.080056
B 0.624051 0.955153 0.464689
B 0.355089 0.306564 0.428365
df_temp1.columns
MultiIndex([( '', 'Lower'),
('D', 'd'),
('D', 'e'),
('D', 'f'),
('E', 'd'),
('E', 'e'),
('E', 'f'),
('F', 'd'),
('F', 'e'),
('F', 'f')],
names=['Big', 'Small'])
df_temp1.index
Index(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'], dtype='object', name='Upper')
5.6 rename_axis()和rename()方法
5.6.1 rename_axis()用于修改多级索引中的索引名
print(df_temp.rename_axis(index = {'Lower':'LowerLower'}, columns = {'Big':'BigBig'}))
BigBig D E \
Small d e f d e f
Upper LowerLower
A a 0.996992 0.239206 0.466831 0.317890 0.534678 0.692468
b 0.768305 0.288890 0.937617 0.727311 0.634722 0.213160
c 0.424430 0.188946 0.576724 0.934759 0.466979 0.319530
B a 0.778316 0.394294 0.373620 0.792681 0.614062 0.480871
b 0.285218 0.202051 0.780276 0.531896 0.939366 0.167180
c 0.627673 0.650161 0.279214 0.256610 0.718082 0.581464
C a 0.557991 0.005820 0.713160 0.196349 0.982871 0.157749
b 0.213228 0.102764 0.284405 0.486068 0.798130 0.634954
c 0.761508 0.048896 0.907974 0.973812 0.954789 0.693932
BigBig F
Small d e f
Upper LowerLower
A a 0.386664 0.441064 0.539764
b 0.744900 0.249924 0.741903
c 0.415877 0.426759 0.080056
B a 0.624051 0.955153 0.464689
b 0.355089 0.306564 0.428365
c 0.272150 0.907038 0.112153
C a 0.491743 0.853777 0.330861
b 0.438192 0.880983 0.126540
c 0.862145 0.444185 0.237926
5.6.2 rename()用于修改列或者行索引标签而非索引名
print(df_temp.rename(index = {'A':'T'}, columns = {'e':'changed_e'}).head())
Big D E \
Small d changed_e f d changed_e f
Upper Lower
T a 0.996992 0.239206 0.466831 0.317890 0.534678 0.692468
b 0.768305 0.288890 0.937617 0.727311 0.634722 0.213160
c 0.424430 0.188946 0.576724 0.934759 0.466979 0.319530
B a 0.778316 0.394294 0.373620 0.792681 0.614062 0.480871
b 0.285218 0.202051 0.780276 0.531896 0.939366 0.167180
Big F
Small d changed_e f
Upper Lower
T a 0.386664 0.441064 0.539764
b 0.744900 0.249924 0.741903
c 0.415877 0.426759 0.080056
B a 0.624051 0.955153 0.464689
b 0.355089 0.306564 0.428365
6. 常用索引型函数
6.1 where()函数
print(df.head())
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
6.1.1 对条件为False的单元进行填充
print(df.where(df['Gender'] == 'M').head())
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173.0 63.0 34.0 A+
1102 NaN NaN NaN NaN NaN NaN NaN NaN
1103 S_1 C_1 M street_2 186.0 82.0 87.2 B+
1104 NaN NaN NaN NaN NaN NaN NaN NaN
1105 NaN NaN NaN NaN NaN NaN NaN NaN
6.1.2 对条件为False的单元进行指定值填充
print(df.where(df['Gender'] == 'M', \
np.random.rand(df.shape[0], df.shape[1])).head())
School Class Gender Address Height Weight \
ID
1101 S_1 C_1 M street_1 173.000000 63.000000
1102 0.531856 0.588896 0.40665 0.936004 0.473496 0.674745
1103 S_1 C_1 M street_2 186.000000 82.000000
1104 0.203294 0.0809859 0.423323 0.126764 0.965819 0.664274
1105 0.82605 0.541455 0.714742 0.455908 0.019623 0.906721
Math Physics
ID
1101 34.000000 A+
1102 0.203038 0.831944
1103 87.200000 B+
1104 0.352529 0.836464
1105 0.352698 0.156627
6.2 mask()函数
mask()函数与where()功能上相反,其余完全一致,即对条件为True的单元进行填充。
print(df.mask(df['Gender'] == 'M').dropna().head())
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192.0 73.0 32.5 B+
1104 S_1 C_1 F street_2 167.0 81.0 80.4 B-
1105 S_1 C_1 F street_4 159.0 64.0 84.8 B+
1202 S_1 C_2 F street_4 176.0 94.0 63.5 B-
1204 S_1 C_2 F street_5 162.0 63.0 33.8 B
6.3 query()函数
query()函数中的布尔表达式中,下面的符号都是合法的:行列索引名、字符串、and/not/or/&/|/~/not in/in/==/!=、四则运算符
print(df.query('''(Address in ["street_6","street_7"])& \
(Weight>(70+10))&(ID in [1303,2304,2402])'''))
School Class Gender Address Height Weight Math Physics
ID
1303 S_1 C_3 M street_7 188 82 49.7 B
2304 S_2 C_3 F street_6 164 81 95.5 A-
2402 S_2 C_4 M street_7 166 82 48.7 B
7. 重复元素处理
7.1 确认重复元素存在
print(df.duplicated('Class').head())
ID
1101 False
1102 True
1103 True
1104 True
1105 True
dtype: bool
可选参数keep默认为first,即首次出现设为不重复,若为last,则最后一次设为不重复,若为False,则所有重复项为False。
print(df.duplicated('Class', keep = 'last').tail())
ID
2401 True
2402 True
2403 True
2404 True
2405 False
dtype: bool
print(df.duplicated('Class', keep = False).head())
ID
1101 True
1102 True
1103 True
1104 True
1105 True
dtype: bool
7.2 去除重复元素
print(df.drop_duplicates('Class'))
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1301 S_1 C_3 M street_4 161 68 31.5 B+
2401 S_2 C_4 F street_2 192 62 45.3 A
print(df.drop_duplicates('Class', keep = 'last'))
School Class Gender Address Height Weight Math Physics
ID
2105 S_2 C_1 M street_4 170 81 34.2 A
2205 S_2 C_2 F street_7 183 76 85.4 B
2305 S_2 C_3 M street_4 187 73 48.9 B
2405 S_2 C_4 F street_6 193 54 47.6 B
在传入多列时等价于将多列共同视作一个多级索引,比较重复项。
print(df.drop_duplicates(['School', 'Class']))
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1301 S_1 C_3 M street_4 161 68 31.5 B+
2101 S_2 C_1 M street_7 174 84 83.3 C
2201 S_2 C_2 M street_5 193 100 39.1 B
2301 S_2 C_3 F street_4 157 78 72.3 B+
2401 S_2 C_4 F street_2 192 62 45.3 A
8. 抽样函数sample()
8.1 按个数抽样
print(df.sample(n = 5))
School Class Gender Address Height Weight Math Physics
ID
2402 S_2 C_4 M street_7 166 82 48.7 B
2401 S_2 C_4 F street_2 192 62 45.3 A
2205 S_2 C_2 F street_7 183 76 85.4 B
2305 S_2 C_3 M street_4 187 73 48.9 B
1104 S_1 C_1 F street_2 167 81 80.4 B-
8.2 按比例抽样
df.sample(frac = 0.05)
School Class Gender Address Height Weight Math Physics
ID
1304 S_1 C_3 M street_2 195 70 85.2 A
1203 S_1 C_2 M street_6 160 53 58.8 A+
8.3 有放回抽样
print(df.sample(n = df.shape[0], replace = True).head())
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
2202 S_2 C_2 F street_7 194 77 68.5 B+
2103 S_2 C_1 M street_4 157 61 52.5 B-
2203 S_2 C_2 M street_4 155 91 73.8 A+
1304 S_1 C_3 M street_2 195 70 85.2 A
print(df.sample(n = 35, replace = True).index.is_unique)
False
8.4 对列随机抽样
print(df.sample(n = 3, axis = 1).head())
Class Gender School
ID
1101 C_1 M S_1
1102 C_1 F S_1
1103 C_1 M S_1
1104 C_1 F S_1
1105 C_1 F S_1
8.5 带权重抽样
8.5.1 对行赋予权值
print(df.sample(n = 3, weights = np.random.rand(df.shape[0])).head())
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
2405 S_2 C_4 F street_6 193 54 47.6 B
1205 S_1 C_2 F street_6 167 63 68.4 B-
8.5.2 以某列值为权值(自动归一化)
print(df.sample(n = 3, weights = df['Math']).head())
School Class Gender Address Height Weight Math Physics
ID
2105 S_2 C_1 M street_4 170 81 34.2 A
2403 S_2 C_4 F street_6 158 60 59.7 B+
1201 S_1 C_2 M street_5 188 68 97.0 A-
9. 问题与练习
9.1 问题
【问题一】 如何更改列或行的顺序?如何交换奇偶行或列的顺序?
更改行的次序可以参考[2],更改列的次序可以参考[3]。交换奇偶行或列的操作可以用更改次序的操作来完成。
另外,reindex()方法也可以实现更改行列次序。
【问题二】 如果要选出DataFrame的某个子集,请给出尽可能多的方法实现。
参考前面loc[]、iloc[]和[]操作符的用法。
【问题三】 query()函数比其他索引方法的速度更慢吗?在什么场合使用什么索引最高效?
是的。
【问题四】 单级索引能使用Slice对象吗?能的话怎么使用,请给出一个例子。
可以。
【问题五】 如何快速找出某一列的缺失值所在索引?
利用isnull()方法对该列进行布尔索引然后用index提取索引。
【问题六】 索引设定中的所有方法分别适用于哪些场合?怎么直接把某个表格的索引换成任意给定同长度的索引?
索引重排。索引换成任意长度的话可以参考[4]。
【问题七】 多级索引有什么适用场合?
多级限制条件下的查找和比较。
【问题八】 什么时候需要重复元素处理?
数据分组。
9.2 练习
略。
10. 参考文献
1. https://blog.youkuaiyun.com/Zee_Chao/article/details/105646577
2. https://blog.youkuaiyun.com/weixin_45531245/article/details/100659940
3. https://blog.youkuaiyun.com/a19990412/article/details/81945315