Data Whale第20期组队学习 Pandas学习—Task Special & 综合练习
一、企业收入的多样性
【题目描述】一个企业的产业收入多样性可以仿照信息熵的概念来定义收入熵指标: I = − ∑ i p ( x i ) l o g ( p ( x i ) ) I=-\sum_{i} p(x_i)log(p(x_i)) I=−i∑p(xi)log(p(xi))
其中 p ( x i ) p(x_i) p(xi)是企业该年某产业收入额占该年所有产业总收入的比重。在company.csv中存有需要计算的企业和年份,在company_data.csv中存有企业、各类收入额和收入年份的信息。现请利用后一张表中的数据,在前一张表中增加一列表示该公司该年份的收入熵指标 I。
【数据下载】链接:https://pan.baidu.com/s/1leZZctxMUSW55kZY5WwgIw 20 密码:u6fd
解题思路:
1、获取表格中的企业证劵代码,并进行筛选删除重复行
2、根据获取的证劵代码筛选出每个企业相关数据
3、按照年份筛选企业每一年收入数据,然后求和与计算
p
(
x
i
)
p(x_i)
p(xi),接着计算
I
I
I
4、将计算结果进行汇总,并添加到第一张表。
代码:
import pandas as pd
import numpy as np
df1 = pd.read_csv('D:/binchen/txzq/datatask/company.csv')
df2 = pd.read_csv('D:/binchen/txzq/datatask/company_data.csv')
print("df1.head(9)=\n",df1.head(9))
# df1.head(9)=
# 证券代码 日期
# 0 #000007 2014
# 1 #000403 2015
# 2 #000408 2016
# 3 #000408 2017
# 4 #000426 2015
# 5 #000426 2016
# 6 #000426 2017
# 7 #000511 2012
# 8 #000511 2013
print("df2.head(9)=\n",df2.head(9))
# df2.head(9)=
# 证券代码 日期 收入类型 收入额
# 0 1 2008/12/31 1 1.084218e+10
# 1 1 2008/12/31 2 1.259789e+10
# 2 1 2008/12/31 3 1.451312e+10
# 3 1 2008/12/31 4 1.063843e+09
# 4 1 2008/12/31 5 8.513880e+08
# 5 1 2008/12/31 6 2.062714e+09
# 6 1 2008/12/31 7 -3.070010e+08
# 7 1 2008/12/31 8 9.605849e+09
# 8 1 2008/12/31 9 -8.697750e+08
df2['证券代码']=df2['证券代码'].map(lambda x:('#'+'0'*(6-len(str(x)))+str(x)))
df2['日期'] = df2['日期'].str[:4].astype('int64')
print("df2.head(10)=\n",df2.head(10))
# df2.head(10)=
# 证券代码 日期 收入类型 收入额
# 0 #000001 2008 1 1.084218e+10
# 1 #000001 2008 2 1.259789e+10
# 2 #000001 2008 3 1.451312e+10
# 3 #000001 2008 4 1.063843e+09
# 4 #000001 2008 5 8.513880e+08
# 5 #000001 2008 6 2.062714e+09
# 6 #000001 2008 7 -3.070010e+08
# 7 #000001 2008 8 9.605849e+09
# 8 #000001 2008 9 -8.697750e+08
# 9 #000001 2008 10 9.174519e+09
df_merge = df1.merge(df2, on=['证券代码','日期'], how='left')
print("df_merge=\n",df_merge)
# df_merge=
# 证券代码 日期 收入类型 收入额
# 0 #000007 2014 1.0 6.146172e+07
# 1 #000007 2014 2.0 0.000000e+00
# 2 #000007 2014 3.0 6.146172e+07
# 3 #000007 2014 4.0 9.863111e+07
# 4 #000007 2014 5.0 8.000000e+07
# ... ... ... ... ...
# 51541 #600978 2016 33.0 4.591668e+09
# 51542 #600978 2016 34.0 3.554853e+09
# 51543 #600978 2016 35.0 1.728503e+08
# 51544 #600978 2016 36.0 3.727704e+09
# 51545 #600978 2017 NaN NaN
#
# [51546 rows x 4 columns]
def entropy(x):
if x.any(): # any函数指当序列至少 存在一个 True 或非零元素时返回 True ,否则返回 False
p = x/x.sum()
return -(p*np.log2(p)).sum()
return np.nan
df_m = df_merge.groupby(['证券代码','日期'])['收入额'].apply(entropy).reset_index()
print("df_m=\n",df_m)
# df_m=
# 证券代码 日期 收入额
# 0 #000007 2014 4.429740
# 1 #000403 2015 4.025963
# 2 #000408 2016 4.066295
# 3 #000408 2017 NaN
# 4 #000426 2015 4.449655
# ... ... ... ...
# 1043 #600978 2011 4.788391
# 1044 #600978 2014 4.022378
# 1045 #600978 2015 4.346303
# 1046 #600978 2016 4.358608
# 1047 #600978 2017 NaN
#
# [1048 rows x 3 columns]
df_m.rename(columns={'收入额':'收入熵指标'},inplace=True)
print("df_m=\n",df_m)
# df_m=
# 证券代码 日期 收入熵指标
# 0 #000007 2014 4.429740
# 1 #000403 2015 4.025963
# 2 #000408 2016 4.066295
# 3 #000408 2017 NaN
# 4 #000426 2015 4.449655
# ... ... ... ...
# 1043 #600978 2011 4.788391
# 1044 #600978 2014 4.022378
# 1045 #600978 2015 4.346303
# 1046 #600978 2016 4.358608
# 1047 #600978 2017 NaN
#
# [1048 rows x 3 columns]
二、组队学习信息表的变换
【题目描述】请把组队学习的队伍信息表变换为如下形态,其中“是否队长”一列取1表示队长,否则为0
是否队长 队伍名称 昵称 编号
0 1 你说的都对队 山枫叶纷飞 5
1 0 你说的都对队 蔡 6
2 0 你说的都对队 安慕希 7
3 0 你说的都对队 信仰 8
4 0 你说的都对队 biubiu🙈🙈 20
… … … … …
141 0 七星联盟 Daisy 63
142 0 七星联盟 One Better 131
143 0 七星联盟 rain 112
144 1 应如是 思无邪 54
145 0 应如是 Justzer0 58
【数据下载】链接:https://pan.baidu.com/s/1ses24cTwUCbMx3rvYXaz-Q 9 密码:iz57
解题思路:
1、读取表格数据,利用wide_to_long函数将长表转为宽表
2、对表columns进行重命名,对照结果表中的名字 , 分别将队长和队员用Captain和Player区分 , 结果表中队长和队员分别用1和0分类。
代码:
import pandas as pd
import numpy as np
data=pd.read_excel('D:/binchen/txzq/datatask/组队信息汇总表(Pandas).xlsx',engine='openpyxl')
print("data=\n",data)
print("data.head(6)=\n",data.head(6))
data=data.drop(['所在群'], axis=1)
col_1 = np.array(['队伍名称','编号_captain01','昵称_captain01'])
col_2 = np.array([[f'编号_player{i}0', f'昵称_player{i}0']for i in range(1,11)]).flatten()
data.columns = np.r_[col_1,col_2]
print("data=\n",data)
res = pd.wide_to_long( data.reset_index(),stubnames = ['昵称','编号'],i = ['index','队伍名称'],
j = '是否队长',sep = '_',suffix = '.+').dropna().reset_index().drop(columns='index')
res['是否队长'],res['编号'] = res['是否队长'].str[-1],res['编号'].astype('int64')
print("res.reindex(columns=['是否队长','队伍名称','昵称','编号'])=\n",
res.reindex(columns=['是否队长','队伍名称','昵称','编号']))
# res.reindex(columns=['是否队长','队伍名称','昵称','编号'])=
# 是否队长 队伍名称 昵称 编号
# 0 1 你说的都对队 山枫叶纷飞 5
# 1 0 你说的都对队 蔡 6
# 2 0 你说的都对队 安慕希 7
# 3 0 你说的都对队 信仰 8
# 4 0 你说的都对队 biubiu🙈🙈 20
# .. ... ... ... ...
# 141 0 七星联盟 Daisy 63
# 142 0 七星联盟 One Better 131
# 143 0 七星联盟 rain 112
# 144 1 应如是 思无邪 54
# 145 0 应如是 Justzer0 58
#
# [146 rows x 4 columns]
三、美国大选投票情况
【题目描述】两张数据表中分别给出了美国各县(county)的人口数以及大选的投票情况,请解决以下问题:
有多少县满足总投票数超过县人口数的一半
把州(state)作为行索引,把投票候选人作为列名,列名的顺序按照候选人在全美的总票数由高到低排序,行列对应的元素为该候选人在该州获得的总票数
此处是一个样例,实际的州或人名用原表的英语代替
拜登 川普
威斯康星州 2 1
德克萨斯州 3 4
每一个州下设若干县,定义拜登在该县的得票率减去川普在该县的得票率为该县的BT指标,若某个州所有县BT指标的中位数大于0,则称该州为Biden State,请找出所有的Biden State
【数据下载】链接:https://pan.baidu.com/s/182rr3CpstVux2CFdFd_Pcg 11 提取码:q674
代码:
import pandas as pd
import numpy as np
df1 = pd.read_csv('D:/binchen/txzq/datatask/county_population.csv')
df2 = pd.read_csv('D:/binchen/txzq/datatask/president_county_candidate.csv')
res=[x.strip('.') for x in df1['US County']]
state = [x.split(', ')[1] for x in res]
county = [x.split(', ')[0] for x in res]
print("df1.head(10)=\n",df1.head(10))
# df1.head(10)=
# US County Population
# 0 .Autauga County, Alabama 55869
# 1 .Baldwin County, Alabama 223234
# 2 .Barbour County, Alabama 24686
# 3 .Bibb County, Alabama 22394
# 4 .Blount County, Alabama 57826
# 5 .Bullock County, Alabama 10101
# 6 .Butler County, Alabama 19448
# 7 .Calhoun County, Alabama 113605
# 8 .Chambers County, Alabama 33254
# 9 .Cherokee County, Alabama 26196
df1['state'],df1['county'] = state, county
df1 = df1.iloc[:,[2,3,1]]
print("df1.head(10)=\n",df1.head(10))
# df1.head(10)=
# state county Population
# 0 Alabama Autauga County 55869
# 1 Alabama Baldwin County 223234
# 2 Alabama Barbour County 24686
# 3 Alabama Bibb County 22394
# 4 Alabama Blount County 57826
# 5 Alabama Bullock County 10101
# 6 Alabama Butler County 19448
# 7 Alabama Calhoun County 113605
# 8 Alabama Chambers County 33254
# 9 Alabama Cherokee County 26196
print("df2.head()=\n",df2.head())
# df2.head()=
# state county candidate party total_votes won
# 0 Delaware Kent County Joe Biden DEM 44552 True
# 1 Delaware Kent County Donald Trump REP 41009 False
# 2 Delaware Kent County Jo Jorgensen LIB 1044 False
# 3 Delaware Kent County Howie Hawkins GRN 420 False
# 4 Delaware New Castle County Joe Biden DEM 195034 True
df_m = df2.merge(df1, on = ['state','county'], how = 'left')
print("df_m.tail()=\n",df_m.tail())
# df_m.tail()=
# state county candidate ... total_votes won Population
# 32172 Arizona Maricopa County Write-ins ... 1331 False 4485414.0
# 32173 Arizona Mohave County Donald Trump ... 78535 True 212181.0
# 32174 Arizona Mohave County Joe Biden ... 24831 False 212181.0
# 32175 Arizona Mohave County Jo Jorgensen ... 1302 False 212181.0
# 32176 Arizona Mohave County Write-ins ... 37 False 212181.0
#
# [5 rows x 7 columns]
#将州县设为行索引
df_m['total_county'] = df_m.groupby(['state','county'])['total_votes'].transform('sum')
vote_c = df_m.groupby(['state','county'])['total_votes'].agg('sum').reset_index()
df1 = df1.merge(vote_c,on=['state','county'], how = 'inner')
df1['prop']=df1.total_votes / df1.Population
print("df1.loc[df1.prop > 0.5].count()=\n",df1.loc[df1.prop > 0.5].count())
# df1.loc[df1.prop > 0.5].count()=
# state 1435
# county 1435
# Population 1435
# total_votes 1435
# prop 1435
# dtype: int64
print("df2.head(10)=\n",df2.head(10))
# df2.head(10)=
# state county candidate party total_votes won
# 0 Delaware Kent County Joe Biden DEM 44552 True
# 1 Delaware Kent County Donald Trump REP 41009 False
# 2 Delaware Kent County Jo Jorgensen LIB 1044 False
# 3 Delaware Kent County Howie Hawkins GRN 420 False
# 4 Delaware New Castle County Joe Biden DEM 195034 True
# 5 Delaware New Castle County Donald Trump REP 88364 False
# 6 Delaware New Castle County Jo Jorgensen LIB 2953 False
# 7 Delaware New Castle County Howie Hawkins GRN 1282 False
# 8 Delaware Sussex County Donald Trump REP 71230 True
# 9 Delaware Sussex County Joe Biden DEM 56682 False
res=df2.groupby(['state','candidate'])['total_votes'].agg('sum').reset_index()
print("res.head(10)=\n",res.head(10))
# res.head(10)=
# state candidate total_votes
# 0 Alabama Write-ins 7312
# 1 Alabama Donald Trump 1441168
# 2 Alabama Jo Jorgensen 25176
# 3 Alabama Joe Biden 849648
# 4 Alaska Write-ins 34210
# 5 Alaska Brock Pierce 825
# 6 Alaska Don Blankenship 1127
# 7 Alaska Donald Trump 189892
# 8 Alaska Jesse Ventura 2673
# 9 Alaska Jo Jorgensen 8896
# 将长表转为宽表
res= res.pivot(index='state',columns='candidate',values='total_votes')
res_total=res.sum()
res_total.name='US'
res_total = res_total.to_frame().T
res= res.append(res_total)
res_final=res.T.sort_values('US',ascending=False).T.iloc[:-1,:]
print("res_final.head(10)=\n",res_final.head(10))
# res_final.head(10)=
# candidate Joe Biden Donald Trump ... Keith McCormic Zachary Scalf
# Alabama 849648.0 1441168.0 ... NaN NaN
# Alaska 153405.0 189892.0 ... NaN NaN
# Arizona 1672143.0 1661686.0 ... NaN NaN
# Arkansas 423932.0 760647.0 ... NaN NaN
# California 11109764.0 6005961.0 ... NaN NaN
# Colorado 1804352.0 1364607.0 ... NaN NaN
# Connecticut 1080680.0 715291.0 ... NaN NaN
# Delaware 296268.0 200603.0 ... NaN NaN
# District of Columbia 317323.0 18586.0 ... NaN NaN
# Florida 5297045.0 5668731.0 ... NaN NaN
#
# [10 rows x 38 columns]
df2['total_county'] = df2.groupby(['state','county'])['total_votes'].transform('sum')
print("df2=\n",df2)
# df2=
# state county ... won total_county
# 0 Delaware Kent County ... True 87025
# 1 Delaware Kent County ... False 87025
# 2 Delaware Kent County ... False 87025
# 3 Delaware Kent County ... False 87025
# 4 Delaware New Castle County ... True 287633
# ... ... ... ... ... ...
# 32172 Arizona Maricopa County ... False 2069475
# 32173 Arizona Mohave County ... True 104705
# 32174 Arizona Mohave County ... False 104705
# 32175 Arizona Mohave County ... False 104705
# 32176 Arizona Mohave County ... False 104705
#
# [32177 rows x 7 columns]
# 仅保留拜登和川普
BT = df2.loc[df2.candidate.isin(['Joe Biden','Donald Trump'])].reset_index(drop=True)
BT['rate'] = BT.total_votes / BT.total_county
# 保留有效信息
BT = BT[['state','county','candidate','rate']]
# 长变宽
BT = BT.pivot(index = ['state','county'], columns = 'candidate', values = 'rate')
print("BT.head()=\n",BT.head())
# BT.head()=
# candidate Donald Trump Joe Biden
# state county
# Alabama Autauga County 0.714368 0.270184
# Baldwin County 0.761714 0.224090
# Barbour County 0.534512 0.457882
# Bibb County 0.784263 0.206983
# Blount County 0.895716 0.095694
BT['BT'] = BT['Joe Biden'] - BT['Donald Trump']
btc = BT.reset_index().groupby('state')['BT'].agg(lambda x: x.median()>0)
Biden_Sate = btc[btc].index.to_list()
for x in Biden_Sate: print("x=",x)
# x= California
# x= Connecticut
# x= Delaware
# x= District of Columbia
# x= Hawaii
# x= Massachusetts
# x= New Jersey
# x= Rhode Island
# x= Vermont
参考文献
1、http://datawhale.club/t/topic/579/2
2、https://blog.youkuaiyun.com/qq_41358220/article/details/112059598
3、https://github.com/SongzeLiCEMA/DataWhale_Pandas/blob/main/Task%20Special.ipynb
本文介绍了如何使用Python Pandas计算企业收入的多样性指标,通过合并数据并实现组队学习信息表的结构转换,同时解析美国大选投票情况,涉及数据清洗、聚合和复杂数据操作技巧。
259

被折叠的 条评论
为什么被折叠?



