import pandas as pd
import numpy as np
Indexing Series
Indexing a Series works much like indexing a NumPy array, except that in addition to integers you can also use the labels in the Series index:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
"""
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
"""
obj['b']
"""
1.0
"""
obj[1]
"""
1.0
"""
Passing a list of labels works like NumPy fancy indexing:
obj[['b', 'a', 'd']]
"""
b 1.0
a 0.0
d 3.0
dtype: float64
"""
You can also slice with index labels. Unlike ordinary Python slicing, the right endpoint is included (when the index is not integer-valued):
obj['b':'c']
"""
b 1.0
c 2.0
dtype: float64
"""
obj[1:2]
"""
b 1.0
dtype: float64
"""
Indexing DataFrame
Indexing columns
df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['Beijing', 'Shanghai', 'Guangzhou', 'Xian'],
                  columns=['one', 'two', 'three', 'four'])
print(df)
"""
one two three four
Beijing 0 1 2 3
Shanghai 4 5 6 7
Guangzhou 8 9 10 11
Xian 12 13 14 15
"""
Select the column two:
df['two']
"""
Beijing 1
Shanghai 5
Guangzhou 9
Xian 13
Name: two, dtype: int32
"""
Attribute access is equivalent:
df.two
"""
Beijing 1
Shanghai 5
Guangzhou 9
Xian 13
Name: two, dtype: int32
"""
Select multiple columns:
print(df[['three', 'one']])
"""
three one
Beijing 2 0
Shanghai 6 4
Guangzhou 10 8
Xian 14 12
"""
Selecting rows
print(df[:2])
"""
one two three four
Beijing 0 1 2 3
Shanghai 4 5 6 7
"""
Using a boolean array:
df['three'] > 5
"""
Beijing False
Shanghai True
Guangzhou True
Xian True
Name: three, dtype: bool
"""
print(df[df['three'] > 5])
"""
one two three four
Shanghai 4 5 6 7
Guangzhou 8 9 10 11
Xian 12 13 14 15
"""
Indexing using a bool DataFrame
print(df < 5)
"""
one two three four
Beijing True True True True
Shanghai True False False False
Guangzhou False False False False
Xian False False False False
"""
df[df < 5] = 0
print(df)
"""
one two three four
Beijing 0 0 0 0
Shanghai 0 5 6 7
Guangzhou 8 9 10 11
Xian 12 13 14 15
"""
Selection with loc and iloc
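Compared with the indexing operations above, loc and iloc let us index a DataFrame more flexibly and select arbitrary combinations of rows and columns.
.loc indexes primarily by label. For example, to select one row and several columns: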
df.loc['Beijing', ['two', 'three']]
"""
two 0
three 0
Name: Beijing, dtype: int32
"""
.iloc indexes primarily by integer position. Again selecting one row and several columns:
df.iloc[0, [3, 0, 1]]
"""
four 0
one 0
two 0
Name: Beijing, dtype: int32
"""
Select the third row:
df.iloc[2]
"""
one 8
two 9
three 10
four 11
Name: Guangzhou, dtype: int32
"""
Select arbitrary rows and columns:
print(df.iloc[[1, 2], [3, 0, 1]])
"""
four one two
Shanghai 7 0 5
Guangzhou 11 8 9
"""
Note that in the one-row, multi-column cases above, both .loc and .iloc return a Series. However, if we rewrite
df.loc['Beijing', ['two', 'three']]
df.iloc[0, [3, 0, 1]]
as
df.loc[['Beijing'], ['two', 'three']]
df.iloc[[0], [3, 0, 1]]
the result is a DataFrame.
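The difference in return type is easy to verify. The sketch below builds a fresh frame named `cities` (an illustrative name, with the original unmodified values) and compares the type and shape of both forms.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['Beijing', 'Shanghai', 'Guangzhou', 'Xian'],
                      columns=['one', 'two', 'three', 'four'])

as_series = cities.loc['Beijing', ['two', 'three']]   # scalar row label
as_frame = cities.loc[['Beijing'], ['two', 'three']]  # row label wrapped in a list

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (1, 2)
```

Wrapping the row selector in a list is useful when downstream code expects a two-dimensional object regardless of how many rows match.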
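Slices can also be used inside .loc and .iloc. Note that a .loc slice includes its end label, whereas an .iloc slice is half-open (start included, end excluded):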
df.loc[:'Xian', 'two']
"""
Beijing 0
Shanghai 5
Guangzhou 9
Xian 13
Name: two, dtype: int32
"""
print(df.iloc[:, :3][df.three > 5])
"""
one two three
Shanghai 0 5 6
Guangzhou 8 9 10
Xian 12 13 14
"""
Summary
Indexing options for a DataFrame:
Type | Description |
---|---|
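df[val] | Usually selects a single column or a list of columns (by column label); with a slice it selects rows instead (by row label or position) |
df.loc[val] | Selects one or more rows by row label |
df.loc[:, val] | Selects one or more columns by column label |
df.loc[val1, val2] | Selects rows and columns at the same time (by label) |
df.iloc[where] | Selects one or more rows by integer position |
df.iloc[:, where] | Selects one or more columns by integer position |
df.iloc[where_i, where_j] | Selects rows and columns at the same time (by integer position) |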
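Each row of the table can be exercised on a small frame. The sketch below uses a fresh frame named `cities` with the same labels as the examples above. The variable names are illustrative and the comments indicate which table row each line corresponds to.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['Beijing', 'Shanghai', 'Guangzhou', 'Xian'],
                      columns=['one', 'two', 'three', 'four'])

col = cities['two']                         # df[val]: one column, returns a Series
rows_slice = cities['Beijing':'Shanghai']   # df[val] with a slice: rows, end label included
row = cities.loc['Xian']                    # df.loc[val]: one row by label
cols = cities.loc[:, ['one', 'four']]       # df.loc[:, val]: columns by label
cell = cities.loc['Xian', 'four']           # df.loc[val1, val2]: row and column by label
row_i = cities.iloc[0]                      # df.iloc[where]: one row by position
cols_i = cities.iloc[:, [0, 3]]             # df.iloc[:, where]: columns by position
block = cities.iloc[:2, 1:3]                # df.iloc[where_i, where_j]: rows and columns by position

print(rows_slice.shape, cols.shape, block.shape)  # (2, 4) (4, 2) (2, 2)
print(cols.equals(cols_i))                        # True
```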
References
Python for Data Analysis, 2nd edition. Wes McKinney.