第三章 索引
一、索引器
1、表的列索引
通过类似字典标记的方式或属性的方式,可以将DataFramed的列获取为一个Series。
- 只取出一列
frame2['state']
,列名中不包含空格时,可以直接用.loc
取出:frame2.state
- 取出多个列 则可以通过
[列名组成的列表]
,得到一个DataFrame
2、序列的行索引
以字符串为索引的Series (也可以理解为必须给索引的值的显示索引)
- 只取出单个索引 使用
[item]
,得到该item对应的所有值 - 取出多个索引 使用
[item的列表]
- 取出某两个索引之间的元素 (索引唯一出现)这里的切片会包含左右两个端点值
s['c': 'b': -2]
以整数为索引的Series(不给出索引的值,会自动生成整数)
- 取出单个索引或多个索引 使用
[int]
和[int_list]
,则可以取出对应索引元素的值 - 取出某两个索引之间的元素 这里的整数切片不包含右端点
3、loc索引器
对DataFrame的行进行选取,一种是基于元素的loc索引器。
以函数为索引
In [42]: def condition(x):
....: condition_1_1 = x.School == 'Fudan University'
....: condition_1_2 = x.Grade == 'Senior'
....: condition_1_3 = x.Weight > 70
....: condition_1 = condition_1_1 & condition_1_2 & condition_1_3
....: condition_2_1 = x.School == 'Peking University'
....: condition_2_2 = x.Grade == 'Senior'
....: condition_2_3 = x.Weight > 80
....: condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
....: result = condition_1 | condition_2
....: return result
....:
In [43]: df_demo.loc[condition]
Out[43]:
School Grade Gender Weight Transfer
Name
Qiang Han Peking University Freshman Male 87.0 N
Chengpeng Zhou Fudan University Senior Male 81.0 N
Changpeng Zhao Peking University Freshman Male 83.0 N
Chengpeng Qian Fudan University Senior Male 73.0 Y
这里主要记录一下,在进行赋值操作时,容易出现的一种报错及解决方案。
# 之前自己使用过程中的问题
cases_over_10000_china = cases_china_copy[cases_china_copy.cases > 10000]
cases_not_china_copy[cases_not_china_copy['date']==cases_over_10000_china['date'].squeeze()]['cases']
# 其实我不记得是在哪用的时候报错了,找不到赋值的那句了
SettingWithCopyWarning:A value is trying to be set on a copy of slice from a DataFrame
Try using .loc[row_indexer,col_indexer] = value instead
pandas无法判断对原始DataFrame进行选取时,产生的是视图还是副本,如果是视图,则赋值会修改原始DataFrame,如果产生的是副本,则不会修改原始的DataFrame。
使用.loc[*,*]
则会产生一个新的DataFrame,可以解决以上的问题
或者,使用.copy()
创建一个副本
cases_over_10000_china = cases_china_copy[cases_china_copy.cases > 10000].copy()
练一练
select_dtypes
是一个实用函数,它能够从表中选出相应类型的列,若要选出所有数值型的列,只需使用 .select_dtypes('number')
,请利用布尔列表选择的方法结合 DataFrame
的 dtypes
属性在 learn_pandas
数据集上实现这个功能。
df = pd.read_csv('joyful-pandas/data/learn_pandas.csv')
df.select_dtypes('number').head()
# Height Weight Test_Number
# 0 158.9 46.0 1
# 1 166.5 70.0 1
# 2 188.9 89.0 2
# 3 NaN 41.0 2
# 4 174.0 74.0 2
# dtypes 中数值型有int64和float64
condition = (df.dtypes == 'int64') | (df.dtypes == 'float64')
df.loc[:,condition].head()
# Height Weight Test_Number
# 0 158.9 46.0 1
# 1 166.5 70.0 1
# 2 188.9 89.0 2
# 3 NaN 41.0 2
# 4 174.0 74.0 2
4、iloc索引器
iloc的使用方法与loc完全类似,只不过是针对位置进行筛选。
5、query方法
在pandas,支持把字符串形式的查询表达式传入query方法来查询数据,其表达式的执行结果必须返回布尔列表。在进行复杂索引时,由于这种检索方式无需像普通方法一样重复使用DataFrame的名字来引用列名,一般而言会使代码长度在不降低可读性的情况下有所减少。
将loc一节中的复合条件查询进行改写:
In [61]: df.query('((School == "Fudan University")&'
....: ' (Grade == "Senior")&'
....: ' (Weight > 70))|'
....: '((School == "Peking University")&'
....: ' (Grade != "Senior")&'
....: ' (Weight > 80))')
....:
Out[61]:
School Grade Name Gender Weight Transfer
38 Peking University Freshman Qiang Han Male 87.0 N
66 Fudan University Senior Chengpeng Zhou Male 81.0 N
99 Peking University Freshman Changpeng Zhao Male 83.0 N
131 Fudan University Senior Chengpeng Qian Male 73.0 Y
对于含有空格的列名,需要使用col name
的方式进行引用
这几天在外面旅游,把习题做一下好了,之前的内容以后有时间再补。
五、练习
1、公司员工数据集
现有一份公司员工数据集:
df = pd.read_csv('joyful-pandas/data/company.csv')
df.head(3)
# EmployeeID birthdate_key age city_name department job_title gender
# 0 1318 1/3/1954 61 Vancouver Executive CEO M
# 1 1319 1/3/1957 58 Vancouver Executive VP Stores F
# 2 1320 1/2/1955 60 Vancouver Executive Legal Counsel F
#1 分别只使用 query
和 loc
选出年龄不超过四十岁且工作部门为 Dairy
或 Bakery
的男性。
# 使用query
df.query('(department == ["Dairy", "Bakery"])&(age <= 40) &(gender == "M")').head()
# EmployeeID birthdate_key age city_name department job_title gender
# 3611 5791 1/14/1975 40 Kelowna Dairy Dairy Person M
# 3613 5793 1/22/1975 40 Richmond Bakery Baker M
# 3615 5795 1/30/1975 40 Nanaimo Dairy Dairy Person M
# 3617 5797 2/3/1975 40 Nanaimo Dairy Dairy Person M
# 3618 5798 2/4/1975 40 Surrey Dairy Dairy Person M
#使用loc
df.loc[(df.age<=40)&df.department.isin(["Dairy", "Bakery"])&(df.gender=='M')].head()
# EmployeeID birthdate_key age city_name department job_title gender
# 3611 5791 1/14/1975 40 Kelowna Dairy Dairy Person M
# 3613 5793 1/22/1975 40 Richmond Bakery Baker M
# 3615 5795 1/30/1975 40 Nanaimo Dairy Dairy Person M
# 3617 5797 2/3/1975 40 Nanaimo Dairy Dairy Person M
# 3618 5798 2/4/1975 40 Surrey Dairy Dairy Person M
#2 选出员工 ID
号 为奇数所在行的第1、第3和倒数第2列。
df.iloc[(df.EmployeeID %2 ==1).values,[0,2,-2]].head()
# EmployeeID age job_title
# 1 1319 58 VP Stores
# 3 1321 56 VP Human Resources
# 5 1323 53 Exec Assistant, VP Stores
# 6 1325 51 Exec Assistant, Legal Counsel
# 8 1329 48 Store Manager
#3 按照以下步骤进行索引操作:
- 把后三列设为索引后交换内外两层
- 恢复中间一层
- 修改外层索引名为
Gender
- 用下划线合并两层行索引
- 把行索引拆分为原状态
- 修改索引名为原表名称
- 恢复默认索引并将列保持为原表的相对位置
# 把后三列设为索引后交换内外两层
df.set_index(['department','job_title','gender']).swaplevel(0,2,axis=0).head()
# EmployeeID birthdate_key age city_name
# gender job_title department
# M CEO Executive 1318 1/3/1954 61 Vancouver
# F VP Stores Executive 1319 1/3/1957 58 Vancouver
# Legal Counsel Executive 1320 1/2/1955 60 Vancouver
# M VP Human Resources Executive 1321 1/2/1959 56 Vancouver
# VP Finance Executive 1322 1/9/1958 57 Vancouver
# 恢复中间一层
df_copy.reset_index('job_title',drop=False,inplace=True)
df_copy.head()
# EmployeeID birthdate_key age city_name
# gender job_title department
# M CEO Executive 1318 1/3/1954 61 Vancouver
# F VP Stores Executive 1319 1/3/1957 58 Vancouver
# Legal Counsel Executive 1320 1/2/1955 60 Vancouver
# M VP Human Resources Executive 1321 1/2/1959 56 Vancouver
# VP Finance Executive 1322 1/9/1958 57 Vancouver
# 修改外层索引名为 `Gender`
df_copy.rename_axis(index={'gender':'Gender'},inplace=True)
df_copy.head()
# job_title EmployeeID birthdate_key age city_name
# Gender department
# M Executive CEO 1318 1/3/1954 61 Vancouver
# F Executive VP Stores 1319 1/3/1957 58 Vancouver
# Executive Legal Counsel 1320 1/2/1955 60 Vancouver
# M Executive VP Human Resources 1321 1/2/1959 56 Vancouver
# Executive VP Finance 1322 1/9/1958 57 Vancouver
#用下划线合并两层行索引
df_copy.index = df_copy.index.map(lambda x:(x[0]+'_'+x[1]))
df_copy.head()
# job_title EmployeeID birthdate_key age city_name
# M_Executive CEO 1318 1/3/1954 61 Vancouver
# F_Executive VP Stores 1319 1/3/1957 58 Vancouver
# F_Executive Legal Counsel 1320 1/2/1955 60 Vancouver
# M_Executive VP Human Resources 1321 1/2/1959 56 Vancouver
# M_Executive VP Finance 1322 1/9/1958 57 Vancouver
# 把行索引拆分为原状态
df_copy.index = df_copy.index.map(lambda x:tuple(x.split('_')))
df_copy.head()
# job_title EmployeeID birthdate_key age city_name
# M Executive CEO 1318 1/3/1954 61 Vancouver
# F Executive VP Stores 1319 1/3/1957 58 Vancouver
# Executive Legal Counsel 1320 1/2/1955 60 Vancouver
# M Executive VP Human Resources 1321 1/2/1959 56 Vancouver
# Executive VP Finance 1322 1/9/1958 57 Vancouver
# 修改索引名为原表名称
df_op.rename_axis(index=['gender', 'department'],inplace=True)
print(df_copy.head())
# job_title EmployeeID birthdate_key age city_name
# gender department
# M Executive CEO 1318 1/3/1954 61 Vancouver
# F Executive VP Stores 1319 1/3/1957 58 Vancouver
# Executive Legal Counsel 1320 1/2/1955 60 Vancouver
# M Executive VP Human Resources 1321 1/2/1959 56 Vancouver
# Executive VP Finance 1322 1/9/1958 57 Vancouver
# 恢复默认索引并将列保持为原表的相对位置
df_copy.reset_index().reindex(columns=df.columns).equals(df)
# True
2、巧克力数据集
chocolate = pd.read_csv('joyful-pandas/data/chocolate.csv')
chocolate.head()
# Company Review\nDate Cocoa\nPercent Company\nLocation Rating
# 0 A. Morin 2016 63% France 3.75
# 1 A. Morin 2015 70% France 2.75
# 2 A. Morin 2015 70% France 3.00
# 3 A. Morin 2015 70% France 3.50
# 4 A. Morin 2015 70% France 3.50
#1 把列索引名中的 \n
替换为空格。
chocolate.rename(columns=lambda x:x.replace('\n',' '), inplace=True)
chocolate.head()
# Company Review Date Cocoa Percent Company Location Rating
# 0 A. Morin 2016 63% France 3.75
# 1 A. Morin 2015 70% France 2.75
# 2 A. Morin 2015 70% France 3.00
# 3 A. Morin 2015 70% France 3.50
# 4 A. Morin 2015 70% France 3.50
#2 巧克力 Rating
评分为1至5,每0.25分一档,请选出2.75分及以下且可可含量 Cocoa Percent
高于中位数的样本。
chocolate['Cocoa Percent'] = chocolate['Cocoa Percent'].apply(lambda x:float(x[:-1])/100)
chocolate.query('(Rating<3)&(`Cocoa Percent`>`Cocoa Percent`.median())').head()
# Company Review Date Cocoa Percent Company Location Rating
# 33 Akesson's (Pralus) 2010 0.75 Switzerland 2.75
# 34 Akesson's (Pralus) 2010 0.75 Switzerland 2.75
# 36 Alain Ducasse 2014 0.75 France 2.75
# 38 Alain Ducasse 2013 0.75 France 2.50
3 39 Alain Ducasse 2013 0.75 France 2.50
#3 将 Review Date
和 Company Location
设为索引后,选出 Review Date
在2012年之后且 Company Location
不属于 France, Canada, Amsterdam, Belgium
的样本。