Notes:
A DataFrame has both a row index and a column index. It can be viewed as a dict of Series that all share one index.
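The "dict of Series" view can be made concrete with a small sketch. The Series names `a` and `b` and their values are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels.
a = pd.Series([1.0, 2.0, 3.0], index=['one', 'two', 'three'])
b = pd.Series([10.0, 20.0, 30.0], index=['one', 'two', 'three'])

# Building a DataFrame from a dict of Series: the keys become column names.
frame = pd.DataFrame({'a': a, 'b': b})

# Selecting a column gives back a Series carrying the shared row index.
col = frame['a']
print(type(col).__name__)
print(list(frame.index), list(frame.columns))
```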
1. Selecting columns
1.1 df[]
df[] is mainly used to select columns (this is the default). It can also select rows, but only with a slice.
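import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index = ['one','two','three'],
                  columns = ['a','b','c','d'])
print(df)
data1 = df['a']          # a single column name returns a Series
data2 = df[['b','c']]    # a list of column names returns a DataFrame
print(data1)
print(data2)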
–> Output:
a b c d
one 58.508966 95.955052 21.001119 11.598748
two 39.940444 4.822591 63.117561 24.915640
three 10.141366 42.279737 81.585248 99.513415
one 58.508966
two 39.940444
three 10.141366
Name: a, dtype: float64
b c
one 95.955052 21.001119
two 4.822591 63.117561
three 42.279737 81.585248
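The return type depends on whether a single name or a list of names is passed. The following sketch checks this on a small frame with made-up integer values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

as_series = df['a']         # single name -> Series
as_frame = df[['a']]        # one-element list -> DataFrame with one column
two_cols = df[['b', 'c']]   # list of names -> DataFrame, in the listed order

print(type(as_series).__name__, type(as_frame).__name__)
print(list(two_cols.columns))
```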
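1.2 df[] with a slice
df[] can also select rows, but only through a slice such as df[:1]. This is rarely done in practice; the dedicated row-selection methods are covered later.
A row cannot be selected by its index label this way: df['one'] is looked up among the columns and raises a KeyError.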
data3 = df[:1]
print(data3)
print(type(data3))
–> Output:
a b c d
one 58.508966 95.955052 21.001119 11.598748
<class 'pandas.core.frame.DataFrame'>
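A short sketch of the two behaviors of df[] with rows, using a small frame with made-up values: a slice selects rows, while a row label is treated as a column name and fails.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

# A slice inside df[] is applied to the rows and returns a DataFrame.
first_row = df[:1]

# A plain label inside df[] is looked up among the columns, not the rows,
# so a row label raises KeyError.
try:
    df['one']
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

print(first_row)
print(row_lookup_failed)
```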
2. Selecting rows
2.1 df.loc[]
- Select rows by index label
df.loc[label]
Selects rows by index label. It works both with a custom index and with the default integer index.
2.1.1 Create the DataFrames first
df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
columns = ['a','b','c','d'])
print(df1)
print(df2)
–> Output:
a b c d
one 32.739293 74.631681 57.738041 64.283459
two 49.329576 96.607287 37.576970 21.803517
three 62.766459 49.264659 71.193031 22.111200
four 48.914713 84.778627 49.706254 7.874963
a b c d
0 79.514782 45.871142 57.086445 11.709671
1 3.236386 61.162491 18.101219 38.525494
2 46.595874 13.619774 15.503499 0.832061
3 52.592679 18.123406 54.248833 59.938835
2.1.2 Single-label indexing (by label name if the index has one, otherwise by the default integer label); returns a Series
data1 = df1.loc['one']
data2 = df2.loc[1]
print(data1)
print(data2)
–> Output: (the name of each Series is the index label of the selected row)
a 32.739293
b 74.631681
c 57.738041
d 64.283459
Name: one, dtype: float64
a 3.236386
b 61.162491
c 18.101219
d 38.525494
Name: 1, dtype: float64
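2.1.3 Multi-label indexing; the labels can be given in any order. In older pandas a label that does not exist produced a row of NaN; since pandas 1.0, .loc raises a KeyError instead, so reindex is used for data3
data3 = df1.reindex(['two','three','five'])  # 'five' is not in the index, so its row is NaN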
data4 = df2.loc[[3,2,1]]
print(data3)
print(data4)
–> Output: (note: how missing labels are handled depends on the pandas version)
a b c d
two 49.329576 96.607287 37.576970 21.803517
three 62.766459 49.264659 71.193031 22.111200
five NaN NaN NaN NaN
a b c d
3 52.592679 18.123406 54.248833 59.938835
2 46.595874 13.619774 15.503499 0.832061
1 3.236386 61.162491 18.101219 38.525494
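The version note can be checked with the sketch below. It assumes pandas 1.0 or later, where .loc raises KeyError for a list containing a missing label; reindex is the documented way to get a NaN row instead. The frame values are made up.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4),
                   index=['one', 'two', 'three', 'four'],
                   columns=['a', 'b', 'c', 'd'])

labels = ['two', 'three', 'five']   # 'five' is not in the index

# Assumption: pandas >= 1.0, where .loc raises KeyError for a missing label.
try:
    df1.loc[labels]
    loc_raised = False
except KeyError:
    loc_raised = True

# reindex() keeps the requested order and fills the missing label with NaN.
filled = df1.reindex(labels)

print(loc_raised)
print(filled)
```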
2.1.4 Slice indexing; the end label is included
data5 = df1.loc['one':'three']
data6 = df2.loc[1:3]
print(data5)
print(data6)
–> Output:
a b c d
one 32.739293 74.631681 57.738041 64.283459
two 49.329576 96.607287 37.576970 21.803517
three 62.766459 49.264659 71.193031 22.111200
a b c d
1 3.236386 61.162491 18.101219 38.525494
2 46.595874 13.619774 15.503499 0.832061
3 52.592679 18.123406 54.248833 59.938835
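With an integer index, .loc still treats the numbers as labels, not positions. The sketch below makes that visible by using integer labels (10, 20, 30, chosen only for illustration) that differ from the positions.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
df = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 20, 30])

by_label = df.loc[10]       # label 10 is the first row
by_slice = df.loc[10:20]    # label slice, both ends included

# 0 is a valid position but not a label here, so .loc raises KeyError.
try:
    df.loc[0]
    zero_is_label = True
except KeyError:
    zero_is_label = False

print(by_slice)
print(zero_is_label)
```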
2.2 df.iloc[]
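- Select rows by integer position
Works like list indexing: rows are addressed by their integer position along the axis (from 0 to length-1).
2.2.1 Create the DataFrame first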
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print(df)
–> Output:
a b c d
one 64.196153 3.181391 71.407232 66.672682
two 46.100913 51.140302 92.888548 12.207747
three 55.724660 28.906997 21.150581 6.250792
four 80.663114 36.770303 88.255988 21.949060
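2.2.2 Single-position indexing
Unlike loc[], iloc[] takes an integer position, and a position beyond the number of rows is not allowed; for example, .iloc[4] below raises an IndexError.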
print(df.iloc[0])
print(df.iloc[-1])
#print(df.iloc[4])  # IndexError: position 4 is out of bounds for 4 rows
–> Output:
a 64.196153
b 3.181391
c 71.407232
d 66.672682
Name: one, dtype: float64
a 80.663114
b 36.770303
c 88.255988
d 21.949060
Name: four, dtype: float64
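The out-of-bounds case can be tried safely by catching the exception, as in this sketch on a 4-row frame with made-up values. It also shows that a slice running past the end is truncated rather than rejected.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

last_row = df.iloc[-1]      # negative positions count from the end

# Position 4 does not exist in a 4-row frame, so iloc raises IndexError.
try:
    df.iloc[4]
    out_of_bounds_raised = False
except IndexError:
    out_of_bounds_raised = True

# A slice that runs past the end is truncated instead of raising.
tail = df.iloc[2:10]

print(last_row.name, out_of_bounds_raised, list(tail.index))
```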
2.2.3 Multi-position indexing; the positions can be given in any order
print(df.iloc[[0,2]])
print(df.iloc[[3,2,1]])
–> Output:
a b c d
one 64.196153 3.181391 71.407232 66.672682
three 55.724660 28.906997 21.150581 6.250792
a b c d
four 80.663114 36.770303 88.255988 21.949060
three 55.724660 28.906997 21.150581 6.250792
two 46.100913 51.140302 92.888548 12.207747
2.2.4 Slice indexing; the end position is excluded (unlike the loc[] slice above)
print(df.iloc[1:3])
print(df.iloc[::2])
–> Output:
a b c d
two 46.100913 51.140302 92.888548 12.207747
three 55.724660 28.906997 21.150581 6.250792
a b c d
one 64.196153 3.181391 71.407232 66.672682
three 55.724660 28.906997 21.150581 6.250792
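The difference between the two slice conventions is easy to get wrong, so here is a side-by-side sketch on a small frame with made-up values: the label slice includes its end, the position slice does not.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

by_label = df.loc['two':'three']   # end label 'three' is included
by_position = df.iloc[1:3]         # end position 3 is excluded

# Both selections cover the rows 'two' and 'three'.
same_rows = by_label.equals(by_position)

# Stopping the position slice at 2 yields only the row 'two'.
only_two = df.iloc[1:2]

print(same_rows, list(only_two.index))
```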
3. Boolean indexing
3.1 Prepare the data
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print(df)
–> Output:
a b c d
one 38.986549 81.009721 57.779180 6.768009
two 61.818468 24.443819 72.064397 87.910932
three 66.612955 48.643065 36.655897 37.299216
four 3.155591 25.298921 1.175081 49.936492
3.2 Indexing on the whole DataFrame (values that fail the condition become NaN)
b1 = df < 20
print(b1,type(b1))
print(df[b1])
# Can also be written as df[df < 20]
–> Output:
a b c d
one False False False True
two False False False False
three False False False False
four True False True False <class 'pandas.core.frame.DataFrame'>
a b c d
one NaN NaN NaN 6.768009
two NaN NaN NaN NaN
three NaN NaN NaN NaN
four 3.155591 NaN 1.175081 NaN
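Indexing with a Boolean DataFrame keeps the original shape and blanks out the non-matching cells, which is the same result `DataFrame.where` gives. The sketch below checks this on a 2x2 frame with made-up values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[5.0, 30.0], [40.0, 10.0]]),
                  index=['one', 'two'],
                  columns=['a', 'b'])

mask = df < 20          # Boolean DataFrame with the same shape as df
masked = df[mask]       # cells where mask is False become NaN

# df[mask] and df.where(mask) produce the same frame.
same_as_where = masked.equals(df.where(mask))

print(masked)
print(same_as_where)
```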
3.3 Indexing by a condition on a single column (keeps the whole rows where the condition is True)
b2 = df['a'] > 50
print(b2,type(b2))
print(df[b2])
# Can also be written as df[df['a'] > 50]
–> Output:
one False
two True
three True
four False
Name: a, dtype: bool <class 'pandas.core.series.Series'>
a b c d
two 61.818468 24.443819 72.064397 87.910932
three 66.612955 48.643065 36.655897 37.299216
3.4 Indexing by a condition on multiple columns
b3 = df[['a','b']] > 50
print(b3,type(b3))
print(df[b3])
# Can also be written as df[df[['a','b']] > 50]
–> Output:
a b
one False True
two True False
three True False
four False False <class 'pandas.core.frame.DataFrame'>
a b c d
one NaN 81.009721 NaN NaN
two 61.818468 NaN NaN NaN
three 66.612955 NaN NaN NaN
four NaN NaN NaN NaN
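Sections 3.3 and 3.4 behave differently because of the type of the mask: a Boolean Series filters rows, while a Boolean DataFrame masks individual cells, and columns missing from the mask are treated as False. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'a': [60.0, 10.0], 'b': [20.0, 70.0], 'c': [1.0, 2.0]},
                  index=['one', 'two'])

row_mask = df['a'] > 50            # Boolean Series: selects whole rows
cell_mask = df[['a', 'b']] > 50    # Boolean DataFrame: masks single cells

rows = df[row_mask]     # only the row 'one' survives, all columns kept
cells = df[cell_mask]   # same shape as df; column 'c' is not in the mask

print(rows)
print(cells)
```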
3.5 Indexing by a condition on multiple rows
b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4])
# Can also be written as df[df.loc[['one','three']] < 50]
–> Output:
a b c d
one True False False True
three False True True True <class 'pandas.core.frame.DataFrame'>
a b c d
one 38.986549 NaN NaN 6.768009
two NaN NaN NaN NaN
three NaN 48.643065 36.655897 37.299216
four NaN NaN NaN NaN
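Rows that are absent from the mask come back entirely NaN, as 'two' and 'four' do above. If those rows are not wanted, one option (a suggestion, not something the original text does) is `dropna(how='all')`. A sketch with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[10.0, 60.0], [70.0, 80.0], [20.0, 30.0]]),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b'])

# The mask only covers the rows 'one' and 'three'.
mask = df.loc[['one', 'three']] < 50
masked = df[mask]                   # row 'two' is not in the mask -> all NaN

# Drop the rows in which every value is NaN.
kept = masked.dropna(how='all')

print(masked)
print(kept)
```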