第三章 索引

本文详细介绍了Pandas中的索引操作,包括表的列索引、序列的行索引(字符串和整数类型)、loc和iloc索引器的使用,以及query方法。通过实例展示了如何选取特定列、行,以及如何处理包含空格的列名。文章还提出了在赋值操作中避免修改原始DataFrame的策略,并提供了多个练习题以巩固所学知识。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

第三章 索引

一、索引器

1、表的列索引

通过类似字典标记的方式或属性的方式,可以将DataFramed的列获取为一个Series。

  • 只取出一列 frame2['state'],列名中不包含空格时,可以直接用.loc取出:frame2.state
  • 取出多个列 则可以通过[列名组成的列表],得到一个DataFrame

2、序列的行索引

以字符串为索引的Series (也可以理解为必须给索引的值的显示索引)
  • 只取出单个索引 使用[item],得到该item对应的所有值
  • 取出多个索引 使用[item的列表]
  • 取出某两个索引之间的元素 (索引唯一出现)这里的切片会包含左右两个端点值s['c': 'b': -2]
以整数为索引的Series(不给出索引的值,会自动生成整数)
  • 取出单个索引或多个索引 使用[int][int_list],则可以取出对应索引元素的值
  • 取出某两个索引之间的元素 这里的整数切片不包含右端点

3、loc索引器

对DataFrame的行进行选取,一种是基于元素的loc索引器。

以函数为索引

In [42]: def condition(x):
   ....:     condition_1_1 = x.School == 'Fudan University'
   ....:     condition_1_2 = x.Grade == 'Senior'
   ....:     condition_1_3 = x.Weight > 70
   ....:     condition_1 = condition_1_1 & condition_1_2 & condition_1_3
   ....:     condition_2_1 = x.School == 'Peking University'
   ....:     condition_2_2 = x.Grade == 'Senior'
   ....:     condition_2_3 = x.Weight > 80
   ....:     condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
   ....:     result = condition_1 | condition_2
   ....:     return result
   ....: 

In [43]: df_demo.loc[condition]
Out[43]: 
                           School     Grade Gender  Weight Transfer
Name                                                               
Qiang Han       Peking University  Freshman   Male    87.0        N
Chengpeng Zhou   Fudan University    Senior   Male    81.0        N
Changpeng Zhao  Peking University  Freshman   Male    83.0        N
Chengpeng Qian   Fudan University    Senior   Male    73.0        Y

这里主要记录一下,在进行赋值操作时,容易出现的一种报错及解决方案。

# 之前自己使用过程中的问题
cases_over_10000_china = cases_china_copy[cases_china_copy.cases > 10000]
cases_not_china_copy[cases_not_china_copy['date']==cases_over_10000_china['date'].squeeze()]['cases']
# 其实我不记得是在哪用的时候报错了,找不到赋值的那句了
SettingWithCopyWarning:A value is trying to be set on a copy of slice from a DataFrame

Try using .loc[row_indexer,col_indexer] = value instead

pandas无法判断对原始DataFrame进行选取时,产生的是视图还是副本,如果是视图,则赋值会修改原始DataFrame,如果产生的是副本,则不会修改原始的DataFrame。

使用.loc[*,*]则会产生一个新的DataFrame,可以解决以上的问题

或者,使用.copy()创建一个副本

cases_over_10000_china = cases_china_copy[cases_china_copy.cases > 10000].copy()

练一练

select_dtypes 是一个实用函数,它能够从表中选出相应类型的列,若要选出所有数值型的列,只需使用 .select_dtypes('number') ,请利用布尔列表选择的方法结合 DataFramedtypes 属性在 learn_pandas 数据集上实现这个功能。

df = pd.read_csv('joyful-pandas/data/learn_pandas.csv')
df.select_dtypes('number').head()
# Height	Weight	Test_Number
# 0	158.9	46.0	1
# 1	166.5	70.0	1
# 2	188.9	89.0	2
# 3	NaN	41.0	2
# 4	174.0	74.0	2

# dtypes 中数值型有int64和float64
condition = (df.dtypes == 'int64') | (df.dtypes == 'float64')
df.loc[:,condition].head()
# Height	Weight	Test_Number
# 0	158.9	46.0	1
# 1	166.5	70.0	1
# 2	188.9	89.0	2
# 3	NaN	41.0	2
# 4	174.0	74.0	2
4、iloc索引器

iloc的使用方法与loc完全类似,只不过是针对位置进行筛选。

5、query方法

在pandas,支持把字符串形式的查询表达式传入query方法来查询数据,其表达式的执行结果必须返回布尔列表。在进行复杂索引时,由于这种检索方式无需像普通方法一样重复使用DataFrame的名字来引用列名,一般而言会使代码长度在不降低可读性的情况下有所减少。

将loc一节中的复合条件查询进行改写:

In [61]: df.query('((School == "Fudan University")&'
   ....:          ' (Grade == "Senior")&'
   ....:          ' (Weight > 70))|'
   ....:          '((School == "Peking University")&'
   ....:          ' (Grade != "Senior")&'
   ....:          ' (Weight > 80))')
   ....: 
Out[61]: 
                School     Grade            Name Gender  Weight Transfer
38   Peking University  Freshman       Qiang Han   Male    87.0        N
66    Fudan University    Senior  Chengpeng Zhou   Male    81.0        N
99   Peking University  Freshman  Changpeng Zhao   Male    83.0        N
131   Fudan University    Senior  Chengpeng Qian   Male    73.0        Y

对于含有空格的列名,需要使用col name的方式进行引用

这几天在外面旅游,把习题做一下好了,之前的内容以后有时间再补。

五、练习

1、公司员工数据集

现有一份公司员工数据集:

df = pd.read_csv('joyful-pandas/data/company.csv')
df.head(3)
# EmployeeID	birthdate_key	age	city_name	department	job_title	gender
# 0	1318	1/3/1954	61	Vancouver	Executive	CEO	M
# 1	1319	1/3/1957	58	Vancouver	Executive	VP Stores	F
# 2	1320	1/2/1955	60	Vancouver	Executive	Legal Counsel	F

#1 分别只使用 queryloc 选出年龄不超过四十岁且工作部门为 DairyBakery 的男性。

# 使用query
df.query('(department == ["Dairy", "Bakery"])&(age <= 40) &(gender == "M")').head()
# EmployeeID	birthdate_key	age	city_name	department	job_title	gender
# 3611	5791	1/14/1975	40	Kelowna	Dairy	Dairy Person	M
# 3613	5793	1/22/1975	40	Richmond	Bakery	Baker	M
# 3615	5795	1/30/1975	40	Nanaimo	Dairy	Dairy Person	M
# 3617	5797	2/3/1975	40	Nanaimo	Dairy	Dairy Person	M
# 3618	5798	2/4/1975	40	Surrey	Dairy	Dairy Person	M

#使用loc
df.loc[(df.age<=40)&df.department.isin(["Dairy", "Bakery"])&(df.gender=='M')].head()
# EmployeeID	birthdate_key	age	city_name	department	job_title	gender
# 3611	5791	1/14/1975	40	Kelowna	Dairy	Dairy Person	M
# 3613	5793	1/22/1975	40	Richmond	Bakery	Baker	M
# 3615	5795	1/30/1975	40	Nanaimo	Dairy	Dairy Person	M
# 3617	5797	2/3/1975	40	Nanaimo	Dairy	Dairy Person	M
# 3618	5798	2/4/1975	40	Surrey	Dairy	Dairy Person	M

#2 选出员工 ID 号 为奇数所在行的第1、第3和倒数第2列。

df.iloc[(df.EmployeeID %2 ==1).values,[0,2,-2]].head()
# 	EmployeeID	age	job_title
# 1	1319	58	VP Stores
# 3	1321	56	VP Human Resources
# 5	1323	53	Exec Assistant, VP Stores
# 6	1325	51	Exec Assistant, Legal Counsel
# 8	1329	48	Store Manager

#3 按照以下步骤进行索引操作:

  • 把后三列设为索引后交换内外两层
  • 恢复中间一层
  • 修改外层索引名为 Gender
  • 用下划线合并两层行索引
  • 把行索引拆分为原状态
  • 修改索引名为原表名称
  • 恢复默认索引并将列保持为原表的相对位置
# 把后三列设为索引后交换内外两层
df.set_index(['department','job_title','gender']).swaplevel(0,2,axis=0).head()

#                                      EmployeeID birthdate_key  age  city_name
# gender job_title          department                                          
# M      CEO                Executive         1318      1/3/1954   61  Vancouver
# F      VP Stores          Executive         1319      1/3/1957   58  Vancouver
#        Legal Counsel      Executive         1320      1/2/1955   60  Vancouver
# M      VP Human Resources Executive         1321      1/2/1959   56  Vancouver
#        VP Finance         Executive         1322      1/9/1958   57  Vancouver

# 恢复中间一层
df_copy.reset_index('job_title',drop=False,inplace=True)
df_copy.head()
#                                       EmployeeID birthdate_key  age  city_name
# gender job_title          department                                          
# M      CEO                Executive         1318      1/3/1954   61  Vancouver
# F      VP Stores          Executive         1319      1/3/1957   58  Vancouver
#        Legal Counsel      Executive         1320      1/2/1955   60  Vancouver
# M      VP Human Resources Executive         1321      1/2/1959   56  Vancouver
#        VP Finance         Executive         1322      1/9/1958   57  Vancouver

# 修改外层索引名为 `Gender`
df_copy.rename_axis(index={'gender':'Gender'},inplace=True)
df_copy.head()
#                             job_title  EmployeeID birthdate_key  age   city_name 
# Gender department                                                      
# M      Executive                  CEO        1318      1/3/1954   61   Vancouver
# F      Executive            VP Stores        1319      1/3/1957   58   Vancouver
#        Executive        Legal Counsel        1320      1/2/1955   60   Vancouver
# M      Executive   VP Human Resources        1321      1/2/1959   56   Vancouver
#        Executive           VP Finance        1322      1/9/1958   57   Vancouver

#用下划线合并两层行索引
df_copy.index = df_copy.index.map(lambda x:(x[0]+'_'+x[1]))
df_copy.head()
#                       job_title  EmployeeID birthdate_key  age  city_name
# M_Executive                 CEO        1318      1/3/1954   61  Vancouver
# F_Executive           VP Stores        1319      1/3/1957   58  Vancouver
# F_Executive       Legal Counsel        1320      1/2/1955   60  Vancouver
# M_Executive  VP Human Resources        1321      1/2/1959   56  Vancouver
# M_Executive          VP Finance        1322      1/9/1958   57  Vancouver

# 把行索引拆分为原状态
df_copy.index = df_copy.index.map(lambda x:tuple(x.split('_')))
df_copy.head()
#                       job_title  EmployeeID birthdate_key  age  city_name
# M Executive                 CEO        1318      1/3/1954   61  Vancouver
# F Executive           VP Stores        1319      1/3/1957   58  Vancouver
#   Executive       Legal Counsel        1320      1/2/1955   60  Vancouver
# M Executive  VP Human Resources        1321      1/2/1959   56  Vancouver
#   Executive          VP Finance        1322      1/9/1958   57  Vancouver

# 修改索引名为原表名称
df_op.rename_axis(index=['gender', 'department'],inplace=True)
print(df_copy.head())
#                             job_title  EmployeeID birthdate_key  age   city_name 
# gender department                                                      
# M      Executive                  CEO        1318      1/3/1954   61   Vancouver
# F      Executive            VP Stores        1319      1/3/1957   58   Vancouver
#        Executive        Legal Counsel        1320      1/2/1955   60   Vancouver
# M      Executive   VP Human Resources        1321      1/2/1959   56   Vancouver
#        Executive           VP Finance        1322      1/9/1958   57   Vancouver

# 恢复默认索引并将列保持为原表的相对位置
df_copy.reset_index().reindex(columns=df.columns).equals(df)
# True

2、巧克力数据集

chocolate = pd.read_csv('joyful-pandas/data/chocolate.csv')
chocolate.head()
#     Company  Review\nDate Cocoa\nPercent Company\nLocation  Rating
# 0  A. Morin          2016            63%            France    3.75
# 1  A. Morin          2015            70%            France    2.75
# 2  A. Morin          2015            70%            France    3.00
# 3  A. Morin          2015            70%            France    3.50
# 4  A. Morin          2015            70%            France    3.50

#1 把列索引名中的 \n 替换为空格。

chocolate.rename(columns=lambda x:x.replace('\n',' '), inplace=True)
chocolate.head()
#     Company  Review Date Cocoa Percent Company Location  Rating
# 0  A. Morin         2016           63%           France    3.75
# 1  A. Morin         2015           70%           France    2.75
# 2  A. Morin         2015           70%           France    3.00
# 3  A. Morin         2015           70%           France    3.50
# 4  A. Morin         2015           70%           France    3.50

#2 巧克力 Rating 评分为1至5,每0.25分一档,请选出2.75分及以下且可可含量 Cocoa Percent 高于中位数的样本。

chocolate['Cocoa Percent'] = chocolate['Cocoa Percent'].apply(lambda x:float(x[:-1])/100)
chocolate.query('(Rating<3)&(`Cocoa Percent`>`Cocoa Percent`.median())').head()
#                Company  Review Date  Cocoa Percent Company Location  Rating
# 33  Akesson's (Pralus)         2010           0.75      Switzerland    2.75
# 34  Akesson's (Pralus)         2010           0.75      Switzerland    2.75
# 36       Alain Ducasse         2014           0.75           France    2.75
# 38       Alain Ducasse         2013           0.75           France    2.50
3 39       Alain Ducasse         2013           0.75           France    2.50

#3 将 Review DateCompany Location 设为索引后,选出 Review Date 在2012年之后且 Company Location 不属于 France, Canada, Amsterdam, Belgium 的样本。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值