[ github ] 10 Exercise Sets That Tie Together the Most Common Pandas Knowledge

GitHub repo: https://github.com/guipsamora/pandas_exercises
Dataset: https://www.aliyundrive.com/s/Wk6L6R67rrn

1. Getting to Know the Data
1.1 Import the dataset and take a first look
import pandas as pd
chipotle = pd.read_csv('001_chipotle.tsv',sep='\t')
chipotle.head()
   order_id  quantity                              item_name                                  choice_description item_price
0         1         1           Chips and Fresh Tomato Salsa                                                 NaN      $2.39
1         1         1                                   Izze                                        [Clementine]      $3.39
2         1         1                       Nantucket Nectar                                             [Apple]      $3.39
3         1         1  Chips and Tomatillo-Green Chili Salsa                                                 NaN      $2.39
4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...     $16.98
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB
chipotle.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')
chipotle.index
RangeIndex(start=0, stop=4622, step=1)
1.2 Which item was ordered the most?
  • as_index=False keeps the default integer index,
  • so in the displayed result the column names stay on a single row ("no merged cells"), which is closer to typical usage
item_buy_count = chipotle[['item_name','quantity']].groupby('item_name').sum()['quantity'].nlargest(1)
item_buy_count
item_name
Chicken Bowl    761
Name: quantity, dtype: int64
  • Original answer
item_buy_count_1 = chipotle[['item_name','quantity']].groupby('item_name', as_index=False).agg({'quantity':'sum'})
item_buy_count_1.sort_values('quantity', ascending=False, inplace=True)
item_buy_count_1.head()
              item_name  quantity
17         Chicken Bowl       761
18      Chicken Burrito       591
25  Chips and Guacamole       506
39        Steak Burrito       386
10    Canned Soft Drink       351
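If only the winner is needed, idxmax() on the grouped sum is a compact alternative; a minimal sketch on a toy frame with the same two columns (the values here are made up, not the real chipotle data):

```python
import pandas as pd

# toy stand-in for the chipotle item_name / quantity columns
df = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Izze', 'Chicken Bowl', 'Izze'],
    'quantity': [2, 1, 3, 1],
})

totals = df.groupby('item_name')['quantity'].sum()
top_item = totals.idxmax()   # label of the best seller
top_count = totals.max()     # how many units it sold
print(top_item, top_count)
```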
1.3 How many different items were ordered?
  • nunique() counts distinct values
chipotle['item_name'].nunique()
50
# quick bar chart of the per-item counts
s = chipotle['item_name'].value_counts()

from matplotlib import pyplot as plt
plt.figure(figsize=(20,4),dpi=100)
plt.xticks(fontsize=13, rotation='vertical')
plt.bar(s.index,s.values)
<BarContainer object of 50 artists>


(bar chart: order counts per item_name)

1.4 Rank the choice_description values among the ordered items
  • not sure what exactly this question is asking
chipotle['choice_description'].value_counts().head()
[Diet Coke]                                                                          134
[Coke]                                                                               123
[Sprite]                                                                              77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]]                42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]]     40
Name: choice_description, dtype: int64
1.5 How many items were ordered in total?
chipotle['quantity'].sum()
4972
1.6 Convert item_price to float
chipotle['item_price'] = chipotle['item_price'].str.lstrip('$').astype('float')
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   order_id            4622 non-null   int64  
 1   quantity            4622 non-null   int64  
 2   item_name           4622 non-null   object 
 3   choice_description  3376 non-null   object 
 4   item_price          4622 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 180.7+ KB
  • Original solution: it referenced the variable name chipo (from the original notebook) and raised NameError: name 'chipo' is not defined here. With the name fixed it works on the raw string column (run it before the conversion above); the slice [1:-1] strips the leading '$' and the trailing space:
chipotle['item_price'] = chipotle['item_price'].apply(lambda x: float(x[1:-1]))
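A vectorized alternative is pd.to_numeric after stripping the '$'; a minimal sketch on a toy Series (the values mimic the raw item_price strings, including their trailing space):

```python
import pandas as pd

# toy item_price values as they appear in the raw tsv
prices = pd.Series(['$2.39 ', '$3.39 ', '$16.98 '])

# strip whitespace, drop the leading '$', then convert the whole Series at once
as_float = pd.to_numeric(prices.str.strip().str.lstrip('$'))
print(as_float.sum())
```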
1.7 Total revenue
chipotle['sub_total'] = (chipotle['item_price'] * chipotle['quantity'])
chipotle['sub_total'].sum()
39237.02
1.8 Total number of orders
chipotle['order_id'].nunique()
1834
1.9 Average revenue per order
chipotle['sub_total'].sum()/chipotle['order_id'].nunique()
21.39423118865867
  • Original answer
chipotle[['order_id','sub_total']].groupby(by=['order_id']).agg({'sub_total':'sum'})['sub_total'].mean()
21.39423118865867
2. Filtering and Sorting
2.1 Import the data and take a first look
import pandas as pd
euro = pd.read_csv('Euro2012_stats.csv')
euro.head()
(head() output omitted: 5 rows × 35 columns, from Team, Goals, Shots on target, ... through Players Used; the values ran together in this export)

euro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Team                        16 non-null     object 
 1   Goals                       16 non-null     int64  
 2   Shots on target             16 non-null     int64  
 3   Shots off target            16 non-null     int64  
 4   Shooting Accuracy           16 non-null     object 
 5   % Goals-to-shots            16 non-null     object 
 6   Total shots (inc. Blocked)  16 non-null     int64  
 7   Hit Woodwork                16 non-null     int64  
 8   Penalty goals               16 non-null     int64  
 9   Penalties not scored        16 non-null     int64  
 10  Headed goals                16 non-null     int64  
 11  Passes                      16 non-null     int64  
 12  Passes completed            16 non-null     int64  
 13  Passing Accuracy            16 non-null     object 
 14  Touches                     16 non-null     int64  
 15  Crosses                     16 non-null     int64  
 16  Dribbles                    16 non-null     int64  
 17  Corners Taken               16 non-null     int64  
 18  Tackles                     16 non-null     int64  
 19  Clearances                  16 non-null     int64  
 20  Interceptions               16 non-null     int64  
 21  Clearances off line         15 non-null     float64
 22  Clean Sheets                16 non-null     int64  
 23  Blocks                      16 non-null     int64  
 24  Goals conceded              16 non-null     int64  
 25  Saves made                  16 non-null     int64  
 26  Saves-to-shots ratio        16 non-null     object 
 27  Fouls Won                   16 non-null     int64  
 28  Fouls Conceded              16 non-null     int64  
 29  Offsides                    16 non-null     int64  
 30  Yellow Cards                16 non-null     int64  
 31  Red Cards                   16 non-null     int64  
 32  Subs on                     16 non-null     int64  
 33  Subs off                    16 non-null     int64  
 34  Players Used                16 non-null     int64  
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
2.1 How many teams took part in Euro 2012?
euro['Team'].nunique()
16
2.2 Store the columns Team, Yellow Cards and Red Cards in a separate DataFrame called discripline
discripline = euro[['Team','Yellow Cards', 'Red Cards']]
discripline
                   Team  Yellow Cards  Red Cards
0               Croatia             9          0
1        Czech Republic             7          0
2               Denmark             4          0
3               England             5          0
4                France             6          0
5               Germany             4          0
6                Greece             9          1
7                 Italy            16          0
8           Netherlands             5          0
9                Poland             7          1
10             Portugal            12          0
11  Republic of Ireland             6          1
12               Russia             6          0
13                Spain            11          0
14               Sweden             7          0
15              Ukraine             5          0
2.3 Sort discripline by card counts
discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=[False, False], inplace=True)
discripline
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
(discripline was created as a slice of euro; building it with .copy() avoids this warning)
                   Team  Yellow Cards  Red Cards
7                 Italy            16          0
10             Portugal            12          0
13                Spain            11          0
6                Greece             9          1
0               Croatia             9          0
9                Poland             7          1
1        Czech Republic             7          0
14               Sweden             7          0
11  Republic of Ireland             6          1
4                France             6          0
12               Russia             6          0
3               England             5          0
8           Netherlands             5          0
15              Ukraine             5          0
2               Denmark             4          0
5               Germany             4          0
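The warning can be avoided by materializing the sub-frame with .copy() before sorting in place; a minimal sketch on a toy version of euro (three made-up rows):

```python
import pandas as pd

euro_toy = pd.DataFrame({
    'Team': ['Italy', 'Denmark', 'Spain'],
    'Yellow Cards': [16, 4, 11],
    'Red Cards': [0, 0, 0],
})

# .copy() makes discipline an independent DataFrame, not a view of euro_toy,
# so the in-place sort raises no SettingWithCopyWarning
discipline = euro_toy[['Team', 'Yellow Cards', 'Red Cards']].copy()
discipline.sort_values(['Yellow Cards', 'Red Cards'],
                       ascending=[False, False], inplace=True)
print(discipline['Team'].tolist())
```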
2.4 Mean number of yellow cards per team
discripline['Yellow Cards'].mean()
7.4375
2.5 Teams with more than 6 yellow cards (the original prompt said goals, but the code filters yellow cards)
discripline[discripline['Yellow Cards']>6]
# dot access (discripline.Yellow Cards) fails here: the column name contains a space
                   Team  Yellow Cards  Red Cards
7                 Italy            16          0
10             Portugal            12          0
13                Spain            11          0
6                Greece             9          1
0               Croatia             9          0
9                Poland             7          1
1        Czech Republic             7          0
14               Sweden             7          0
2.6 Select the teams whose name starts with G
euro[euro['Team'].str.startswith('G')]
(output omitted: the Germany and Greece rows, 2 rows × 35 columns, flattened in this export)
2.7 Select the first 7 columns
euro.iloc[:,0:7]
                   Team  Goals  Shots on target  Shots off target Shooting Accuracy % Goals-to-shots  Total shots (inc. Blocked)
0               Croatia      4               13                12             51.9%            16.0%                          32
1        Czech Republic      4               13                18             41.9%            12.9%                          39
2               Denmark      4               10                10             50.0%            20.0%                          27
3               England      5               11                18             50.0%            17.2%                          40
4                France      3               22                24             37.9%             6.5%                          65
5               Germany     10               32                32             47.8%            15.6%                          80
6                Greece      5                8                18             30.7%            19.2%                          32
7                 Italy      6               34                45             43.0%             7.5%                         110
8           Netherlands      2               12                36             25.0%             4.1%                          60
9                Poland      2               15                23             39.4%             5.2%                          48
10             Portugal      6               22                42             34.3%             9.3%                          82
11  Republic of Ireland      1                7                12             36.8%             5.2%                          28
12               Russia      5                9                31             22.5%            12.5%                          59
13                Spain     12               42                33             55.9%            16.0%                         100
14               Sweden      5               17                19             47.2%            13.8%                          39
15              Ukraine      2                7                26             21.2%             6.0%                          38
2.8 Select every column except the last 3
euro.iloc[:,:-3]
(output omitted: 16 rows × 32 columns, flattened in this export)

2.9 Find the Shooting Accuracy of England, Italy and Russia
euro[euro['Team'].isin(['England', 'Italy', 'Russia'])][['Team','Shooting Accuracy']]
       Team Shooting Accuracy
3   England             50.0%
7     Italy             43.0%
12   Russia             22.5%
  • Original solution
euro.loc[euro['Team'].isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
       Team Shooting Accuracy
3   England             50.0%
7     Italy             43.0%
12   Russia             22.5%
3. Grouping: Exploring Alcohol Consumption Data
import pandas as pd

drinks = pd.read_csv('drinks.csv')
drinks.head()
       country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol continent
0  Afghanistan              0                0              0                           0.0        AS
1      Albania             89              132             54                           4.9        EU
2      Algeria             25                0             14                           0.7        AF
3      Andorra            245              138            312                          12.4        EU
4       Angola            217               57             45                           5.9        AF
drinks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     170 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
3.1 Which continent consumes the most beer on average?
  • note: plain .sort_values().head(1) sorts ascending and would return the continent that drinks the least; use nlargest(1) (or ascending=False)
drinks.groupby('continent')['beer_servings'].mean().nlargest(1)
continent
EU    193.777778
Name: beer_servings, dtype: float64
3.2 Print the descriptive statistics of wine consumption for each continent
drinks.groupby('continent')['wine_servings'].describe()
           count        mean        std  min   25%    50%     75%    max
continent
AF          53.0   16.264151  38.846419  0.0   1.0    2.0   13.00  233.0
AS          44.0    9.068182  21.667034  0.0   0.0    1.0    8.00  123.0
EU          45.0  142.222222  97.421738  0.0  59.0  128.0  195.00  370.0
OC          16.0   35.625000  64.555790  0.0   1.0    8.5   23.25  212.0
SA          12.0   62.416667  88.620189  1.0   3.0   12.0   98.50  221.0
3.3 Print the mean consumption of every drink type for each continent
drinks.groupby('continent').mean(numeric_only=True)
           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
continent
AF             61.471698        16.339623      16.264151                      3.007547
AS             37.045455        60.840909       9.068182                      2.170455
EU            193.777778       132.555556     142.222222                      8.617778
OC             89.687500        58.437500      35.625000                      3.381250
SA            175.083333       114.750000      62.416667                      6.308333
3.4 Print the median consumption of every drink type for each continent
drinks.groupby('continent').median(numeric_only=True)
           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
continent
AF                  32.0              3.0            2.0                          2.30
AS                  17.5             16.0            1.0                          1.20
EU                 219.0            122.0          128.0                         10.00
OC                  52.5             37.0            8.5                          1.75
SA                 162.5            108.5           12.0                          6.85
3.5 Print the mean, max and min spirit consumption for each continent
drinks.groupby('continent')['spirit_servings'].agg(['mean','max','min'])
                 mean  max  min
continent
AF          16.339623  152    0
AS          60.840909  326    0
EU         132.555556  373    0
OC          58.437500  254    0
SA         114.750000  302   25
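The same result can be labelled more readably with named aggregation; a minimal sketch on a toy frame shaped like drinks (made-up values, and the output column names are my own):

```python
import pandas as pd

toy = pd.DataFrame({
    'continent': ['AF', 'AF', 'EU', 'EU'],
    'spirit_servings': [10, 20, 100, 200],
})

# named aggregation: each keyword becomes a column in the result
stats = toy.groupby('continent')['spirit_servings'].agg(
    mean_servings='mean', max_servings='max', min_servings='min')
print(stats)
```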
4. The apply Function (Exploring US Crime Data, 1960-2014)
4.1 Import the data and take a first look
import pandas as pd

crime = pd.read_csv('US_Crime_Rates_1960_2014.csv')

crime.head()
   Year  Population    Total  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
0  1960   179323175  3384200   288460   3095700    9110          17190   107840              154320    912100        1855400         328200
1  1961   182992000  3488000   289390   3198600    8740          17220   106670              156760    949600        1913000         336000
2  1962   185771000  3752200   301510   3450700    8530          17550   110860              164570    994300        2089600         366800
3  1963   188483000  4109500   316970   3792500    8640          17650   116470              174210   1086400        2297800         408300
4  1964   191141000  4564600   364220   4200400    9360          21420   130390              203050   1213200        2514400         472800
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Year                55 non-null     int64
 1   Population          55 non-null     int64
 2   Total               55 non-null     int64
 3   Violent             55 non-null     int64
 4   Property            55 non-null     int64
 5   Murder              55 non-null     int64
 6   Forcible_Rape       55 non-null     int64
 7   Robbery             55 non-null     int64
 8   Aggravated_assault  55 non-null     int64
 9   Burglary            55 non-null     int64
 10  Larceny_Theft       55 non-null     int64
 11  Vehicle_Theft       55 non-null     int64
dtypes: int64(12)
memory usage: 5.3 KB
4.2 Convert Year to datetime
crime['Year'] = pd.to_datetime(crime['Year'], format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Year                55 non-null     datetime64[ns]
 1   Population          55 non-null     int64         
 2   Total               55 non-null     int64         
 3   Violent             55 non-null     int64         
 4   Property            55 non-null     int64         
 5   Murder              55 non-null     int64         
 6   Forcible_Rape       55 non-null     int64         
 7   Robbery             55 non-null     int64         
 8   Aggravated_assault  55 non-null     int64         
 9   Burglary            55 non-null     int64         
 10  Larceny_Theft       55 non-null     int64         
 11  Vehicle_Theft       55 non-null     int64         
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.3 KB
4.3 Set Year as the index
# seemingly no need for drop=True; the old index does not become a new column anyway
crime.set_index('Year', inplace=True)
crime.head()
            Population    Total  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01   179323175  3384200   288460   3095700    9110          17190   107840              154320    912100        1855400         328200
1961-01-01   182992000  3488000   289390   3198600    8740          17220   106670              156760    949600        1913000         336000
1962-01-01   185771000  3752200   301510   3450700    8530          17550   110860              164570    994300        2089600         366800
1963-01-01   188483000  4109500   316970   3792500    8640          17650   116470              174210   1086400        2297800         408300
1964-01-01   191141000  4564600   364220   4200400    9360          21420   130390              203050   1213200        2514400         472800
4.4 Delete the column named Total
del crime['Total']
4.5 Group the data by decade; resample is very important
  • resample is pandas' convenience method for resampling and frequency conversion of regular time series
  • the rule string sets the target frequency; common aliases: A year, M month, W week, D day, H hour, T minute, S second
  • here we aggregate in ten-year bins
crime_10Ys = crime.resample('10AS').sum()
crime_10Ys
            Population   Violent   Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01  1915053175   4134930   45160900  106180         236720  1633510             2158520  13321100       26547700        5292100
1970-01-01  2121193298   9607930   91383800  192230         554570  4159020             4702120  28486000       53157800        9739900
1980-01-01  2371370069  14074328  117048900  206439         865639  5383109             7619130  33073494       72040253       11935411
1990-01-01  2612825258  17527048  119053499  211664         998827  5748930            10568963  26750015       77679366       14624418
2000-01-01  2947969117  13968056  100944369  163068         922499  4230366             8652124  21565176       67970291       11412834
2010-01-01  1570146307   6072017   44095950   72867         421059  1749809             3764142  10125170       30401698        3569080
  • Summing Population over each decade is meaningless,
  • so replace the aggregated value with the maximum population within that decade
crime_10Ys['Population'] = crime['Population'].resample('10AS').max()
crime_10Ys.head()
            Population   Violent   Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01   201385000   4134930   45160900  106180         236720  1633510             2158520  13321100       26547700        5292100
1970-01-01   220099000   9607930   91383800  192230         554570  4159020             4702120  28486000       53157800        9739900
1980-01-01   248239000  14074328  117048900  206439         865639  5383109             7619130  33073494       72040253       11935411
1990-01-01   272690813  17527048  119053499  211664         998827  5748930            10568963  26750015       77679366       14624418
2000-01-01   307006550  13968056  100944369  163068         922499  4230366             8652124  21565176       67970291       11412834
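How the ten-year bins behave can be seen on a tiny synthetic series (one event per year, hypothetical data):

```python
import pandas as pd

# 20 year-start timestamps, 1960 through 1979
idx = pd.date_range('1960', periods=20, freq='AS')
s = pd.Series(1, index=idx)  # one "event" per year

# ten-year bins anchored at year starts: 1960-1969 and 1970-1979
per_decade = s.resample('10AS').sum()
print(per_decade)
```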
4.6 When was the most dangerous time to live in the US?
  • idxmax() returns the index label of each column's maximum
crime
(full output omitted: 55 rows × 10 columns, one row per year from 1960-01-01 to 2014-01-01; the values ran together in this export)
crime.idxmax()
Population           2014-01-01
Violent              1992-01-01
Property             1991-01-01
Murder               1991-01-01
Forcible_Rape        1992-01-01
Robbery              1991-01-01
Aggravated_assault   1993-01-01
Burglary             1980-01-01
Larceny_Theft        1991-01-01
Vehicle_Theft        1991-01-01
dtype: datetime64[ns]
  • The maxima of the individual crime counts cluster in 1991-1993. Since the population grows steadily across the dataset (from about 180 million in 1960 to about 320 million in 2014), crime rates in that period were higher than in any other, though this alone does not prove the entire 1990s were the most dangerous decade.
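A rate-based check makes this explicit: divide counts by population before taking idxmax(). The sketch below uses two hypothetical year rows, not the real figures:

```python
import pandas as pd

# two made-up rows: a smaller population with more violent crime vs. the reverse
toy = pd.DataFrame(
    {'Population': [200_000_000, 300_000_000],
     'Violent': [1_000_000, 1_200_000]},
    index=pd.to_datetime(['1970-01-01', '2010-01-01']))

# violent crimes per 100,000 inhabitants
toy['violent_rate'] = toy['Violent'] / toy['Population'] * 100_000
print(toy['violent_rate'].idxmax())
```

Here the absolute count peaks in 2010, but the rate peaks in 1970, which is exactly the distortion the note above describes.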
5. Merging (Exploring Fictitious Name Data)
5.1 Build the dummy data
import numpy as np
import pandas as pd

raw_data_1 = {
 'subject_id': ['1', '2', '3', '4', '5'],
 'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
 'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
 'subject_id': ['4', '5', '6', '7', '8'],
 'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
 'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
 'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
 'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

data_1 = pd.DataFrame(raw_data_1)
data_2 = pd.DataFrame(raw_data_2)
data_3 = pd.DataFrame(raw_data_3)

print(data_1)
print(data_2)
print(data_3)
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan
  subject_id  test_id
0          1       51
1          2       15
2          3       15
3          4       61
4          5       16
5          7       14
6          8       15
7          9        1
8         10       61
9         11       16
5.2 Stack vertically, axis=0 ("the direction an egg falls")
all_data = pd.concat([data_1,data_2], ignore_index=True)
all_data
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
5          4      Billy    Bonder
6          5      Brian     Black
7          6       Bran   Balwner
8          7      Bryce     Brice
9          8      Betty    Btisan
5.3 Concatenate side by side, axis=1
by_col = pd.concat([data_1,data_2], axis=1)
by_col
  subject_id first_name last_name subject_id first_name last_name
0          1       Alex  Anderson          4      Billy    Bonder
1          2        Amy  Ackerman          5      Brian     Black
2          3      Allen       Ali          6       Bran   Balwner
3          4      Alice      Aoni          7      Bryce     Brice
4          5     Ayoung   Atiches          8      Betty    Btisan
5.4 Merge all_data and data_3 on the value of subject_id
all_data = pd.merge(all_data,data_3,on='subject_id')
all_data
  subject_id first_name last_name  test_id
0          1       Alex  Anderson       51
1          2        Amy  Ackerman       15
2          3      Allen       Ali       15
3          4      Alice      Aoni       61
4          4      Billy    Bonder       61
5          5     Ayoung   Atiches       16
6          5      Brian     Black       16
7          7      Bryce     Brice       14
8          8      Betty    Btisan       15
5.5 Inner join data_1 and data_2 on subject_id
inner_join = pd.merge(data_1,data_2, on='subject_id', how='inner')
inner_join
  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black
5.6 Outer join data_1 and data_2 on subject_id
outer_join = pd.merge(data_1,data_2, on='subject_id', how='outer')
outer_join
  subject_id first_name_x last_name_x first_name_y last_name_y
0          1         Alex    Anderson          NaN         NaN
1          2          Amy    Ackerman          NaN         NaN
2          3        Allen         Ali          NaN         NaN
3          4        Alice        Aoni        Billy      Bonder
4          5       Ayoung     Atiches        Brian       Black
5          6          NaN         NaN         Bran     Balwner
6          7          NaN         NaN        Bryce       Brice
7          8          NaN         NaN        Betty      Btisan
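When auditing an outer join, merge's indicator=True adds a _merge column telling which side each row came from; a sketch on cut-down versions of data_1 and data_2 (only three subject_ids each):

```python
import pandas as pd

d1 = pd.DataFrame({'subject_id': ['1', '4', '5'],
                   'first_name': ['Alex', 'Alice', 'Ayoung']})
d2 = pd.DataFrame({'subject_id': ['4', '5', '6'],
                   'first_name': ['Billy', 'Brian', 'Bran']})

# _merge takes the values 'left_only', 'right_only', or 'both'
audit = pd.merge(d1, d2, on='subject_id', how='outer', indicator=True)
print(audit['_merge'].value_counts())
```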
6. Statistics: Exploring Wind Speed Data
6.1 Import the data
import pandas as pd
import datetime

wind = pd.read_table('wind.data', sep=r'\s+', parse_dates=[[0,1,2]])
wind.head()
     Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
0  2061-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1  2061-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
2  2061-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
3  2061-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88
4  2061-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83
wind.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Yr_Mo_Dy  6574 non-null   datetime64[ns]
 1   RPT       6568 non-null   float64       
 2   VAL       6571 non-null   float64       
 3   ROS       6572 non-null   float64       
 4   KIL       6569 non-null   float64       
 5   SHA       6572 non-null   float64       
 6   BIR       6574 non-null   float64       
 7   DUB       6571 non-null   float64       
 8   CLA       6572 non-null   float64       
 9   MUL       6571 non-null   float64       
 10  CLO       6573 non-null   float64       
 11  BEL       6574 non-null   float64       
 12  MAL       6570 non-null   float64       
dtypes: datetime64[ns](1), float64(12)
memory usage: 667.8 KB
6.2 Years like 2061 are bad data; repair them
  • the original answer was missing the outer pd.to_datetime(), so after apply() the wind['Yr_Mo_Dy'] column would no longer have a datetime dtype
    • for a single element you can use x.year directly
def fix_year(x):
    year=x.year-100 if x.year>1978 else x.year
    return pd.to_datetime(datetime.date(year, x.month, x.day),format='%Y/%m/%d')

wind['Yr_Mo_Dy']  = wind['Yr_Mo_Dy'].apply(fix_year)
wind['Yr_Mo_Dy'].dt.year.unique()
array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
       1972, 1973, 1974, 1975, 1976, 1977, 1978], dtype=int64)
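The repair can also be vectorized without apply(), by shifting only the offending rows back 100 years with pd.DateOffset; a sketch on a toy Series of dates (three made-up values):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2061-01-01', '1978-06-15', '2063-12-31']))

# rows parsed into the impossible future get 100 years subtracted
mask = dates.dt.year > 1978
dates.loc[mask] = dates.loc[mask] - pd.DateOffset(years=100)
print(sorted(dates.dt.year.unique()))
```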
6.3 Set the date as the index; note the dtype, which should be datetime64
wind.set_index('Yr_Mo_Dy', inplace=True)
wind.head()
              RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
1961-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88
1961-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83
6.4 How many values are missing for each location?
wind.isna().sum()
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64
6.5 How many complete (non-missing) values are there for each location?
  • both notna() and info() report the non-null counts
wind.notna().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
# original answer
wind.shape[0] - wind.isnull().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
6.6 Compute the mean wind speed over the whole dataset
wind.mean().mean()
10.227982360836938
6.7 Create a DataFrame holding the min, max, mean and standard deviation of the wind speed at each location
  • describe() works column-wise
wind_des = wind.describe().T.iloc[:,[3,7,1,2]]
wind_des
      min    max       mean       std
RPT  0.67  35.80  12.362987  5.618413
VAL  0.21  33.37  10.644314  5.267356
ROS  1.50  33.84  11.660526  5.008450
KIL  0.00  28.46   6.306468  3.605811
SHA  0.13  37.54  10.455834  4.936125
BIR  0.00  26.16   7.092254  3.968683
DUB  0.00  30.37   9.797343  4.977555
CLA  0.00  31.08   8.495053  4.499449
MUL  0.00  25.88   8.493590  4.166872
CLO  0.04  28.21   8.707332  4.503954
BEL  0.13  42.38  13.121007  5.835037
MAL  0.67  42.54  15.599079  6.699794
# generic approach; note axis=1 gives per-day statistics across locations, not per-location statistics
day_stats = pd.DataFrame()
day_stats['min'] = wind.min(axis=1)
day_stats['max'] = wind.max(axis=1)
day_stats['mean'] = wind.mean(axis=1)
day_stats['std'] = wind.std(axis=1)
day_stats.head()
6.8 For each location, compute the mean wind speed in January
  • note: the original author treats January 1961, January 1962, etc. all as "January"

  • note the usage of query()
wind['date'] = wind.index
wind.head()
              RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL       date
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04 1961-01-01
1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83 1961-01-02
1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71 1961-01-03
1961-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88 1961-01-04
1961-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83 1961-01-05
wind['year']= wind['date'].dt.year
wind['month']= wind['date'].dt.month
wind['day']= wind['date'].dt.day
wind.head()
              RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL       date  year  month  day
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04 1961-01-01  1961      1    1
1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83 1961-01-02  1961      1    2
1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71 1961-01-03  1961      1    3
1961-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88 1961-01-04  1961      1    4
1961-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83 1961-01-05  1961      1    5
# original
wind['year'] = wind['date'].apply(lambda x: x.year)
wind['month'] = wind['date'].apply(lambda x: x.month)
wind['day'] = wind['date'].apply(lambda x: x.day)
wind[wind['month']==1].iloc[:,0:12].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
  • Alternative solution: locate with loc
wind.loc[wind.index.month==1].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
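With the helper month column in place, the same filter can also be written with query(), which the note above mentions; a sketch on a toy frame (hypothetical RPT values):

```python
import pandas as pd

toy = pd.DataFrame({
    'RPT': [15.04, 14.71, 9.50],
    'month': [1, 1, 2],
})

# query() evaluates a boolean expression over the columns
january = toy.query('month == 1')
print(january['RPT'].mean())
```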
6.9 Downsample the records to annual frequency for each location
  • DatetimeIndex: a collection of individual timestamps
  • PeriodIndex: a DatetimeIndex with a fixed frequency attached, obtained via DatetimeIndex.to_period(freq)
wind.groupby(wind.index.to_period('A')).mean()
            RPT        VAL        ROS       KIL        SHA       BIR        DUB        CLA       MUL        CLO        BEL        MAL
Yr_Mo_Dy
1961  12.299583  10.351796  11.362369  6.958227  10.881763  7.729726   9.733923   8.858788  8.647652   9.835577  13.502795  13.680773
1962  12.246923  10.110438  11.732712  6.960440  10.657918  7.393068  11.020712   8.793753  8.316822   9.676247  12.930685  14.323956
1963  12.813452  10.836986  12.541151  7.330055  11.724110  8.434712  11.075699  10.336548  8.903589  10.224438  13.638877  14.999014
1964  12.363661  10.920164  12.104372  6.787787  11.454481  7.570874  10.259153   9.467350  7.789016  10.207951  13.740546  14.910301
1965  12.451370  11.075534  11.848767  6.858466  11.024795  7.478110  10.618712   8.879918  7.907425   9.918082  12.964247  15.591644
1966  13.461973  11.557205  12.020630  7.345726  11.805041  7.793671  10.579808   8.835096  8.514438   9.768959  14.265836  16.307260
1967  12.737151  10.990986  11.739397  7.143425  11.630740  7.368164  10.652027   9.325616  8.645014   9.547425  14.774548  17.135945
1968  11.835628  10.468197  11.409754  6.477678  10.760765  6.067322   8.859180   8.255519  7.224945   7.832978  12.808634  15.017486
1969  11.166356   9.723699  10.902000  5.767973   9.873918  6.189973   8.564493   7.711397  7.924521   7.754384  12.621233  15.762904
1970  12.600329  10.726932  11.730247  6.217178  10.567370  7.609452   9.609890   8.334630  9.297616   8.289808  13.183644  16.456027
1971  11.273123   9.095178  11.088329  5.241507   9.440329  6.097151   8.385890   6.757315  7.915370   7.229753  12.208932  15.025233
1972  12.463962  10.561311  12.058333  5.929699   9.430410  6.358825   9.704508   7.680792  8.357295   7.515273  12.727377  15.028716
1973  11.828466  10.680493  10.680493  5.547863   9.640877  6.548740   8.482110   7.614274  8.245534   7.812411  12.169699  15.441096
1974  13.643096  11.811781  12.336356  6.427041  11.110986  6.809781  10.084603   9.896986  9.331753   8.736356  13.252959  16.947671
1975  12.008575  10.293836  11.564712  5.269096   9.190082  5.668521   8.562603   7.843836  8.797945   7.382822  12.631671  15.307863
1976  11.737842  10.203115  10.761230  5.109426   8.846339  6.311038   9.149126   7.146202  8.883716   7.883087  12.332377  15.471448
1977  13.099616  11.144493  12.627836  6.073945  10.003836  8.586438  11.523205   8.378384  9.098192   8.821616  13.459068  16.590849
1978  12.504356  11.044274  11.380000  6.082356  10.167233  7.650658   9.489342   8.800466  9.089753   8.301699  12.967397  16.771370
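The `to_period` idea can be checked on a small synthetic series; the `'Y'` alias used here is the modern spelling of `'A'`:

```python
import pandas as pd

# 730 daily values starting 1961-01-01: exactly two non-leap years
idx = pd.date_range('1961-01-01', periods=730, freq='D')
s = pd.Series(1.0, index=idx)

# DatetimeIndex -> PeriodIndex at yearly frequency, then aggregate per year
yearly = s.groupby(s.index.to_period('Y')).sum()
```

Each yearly bucket sums its 365 daily values, confirming the grouping boundary falls on the calendar year.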
6.10 Downsample each location's records to monthly frequency
wind.groupby(wind.index.to_period('M')).mean()
               RPT        VAL        ROS       KIL        SHA        BIR        DUB        CLA        MUL        CLO        BEL        MAL
Yr_Mo_Dy
1961-01  14.841333  11.988333  13.431613  7.736774  11.072759   8.588065  11.184839   9.245333   9.085806  10.107419  13.880968  14.703226
1961-02  16.269286  14.975357  14.441481  9.230741  13.852143  10.937500  11.890714  11.846071  11.821429  12.714286  18.583214  15.411786
1961-03  10.890000  11.296452  10.752903  7.284000  10.509355   8.866774   9.644194   9.829677  10.294138  11.251935  16.410968  15.720000
1961-04  10.722667   9.427667   9.998000  5.830667   8.435000   6.495000   6.925333   7.094667   7.342333   7.237000  11.147333  10.278333
1961-05   9.860968   8.850000  10.818065  5.905333   9.490323   6.574839   7.604000   8.177097   8.039355   8.499355  11.900323  12.011613
...            ...        ...        ...       ...        ...        ...        ...        ...        ...        ...        ...        ...
1978-08   9.645161   8.259355   9.032258  4.502903   7.368065   5.935161   5.650323   5.417742   7.241290   5.536774  10.466774  12.054194
1978-09  10.913667  10.895000  10.635000  5.725000  10.372000   9.278333  10.790333   9.583000  10.069333   8.939000  15.680333  19.391333
1978-10   9.897742   8.670968   9.295806  4.721290   8.525161   6.774194   8.115484   7.337742   8.297742   8.243871  13.776774  17.150000
1978-11  16.151667  14.802667  13.508000  7.317333  11.475000   8.743000  11.492333   9.657333  10.701333  10.676000  17.404667  20.723000
1978-12  16.175484  13.748065  15.635161  7.094839  11.398710   9.241613  12.077419  10.194839  10.616774  11.028710  13.859677  21.371613

216 rows × 12 columns

6.11 Downsample each location's records to weekly frequency
wind.groupby(wind.index.to_period('W')).mean()
                             RPT        VAL        ROS        KIL        SHA        BIR        DUB        CLA        MUL        CLO        BEL        MAL
Yr_Mo_Dy
1960-12-26/1961-01-01  15.040000  14.960000  13.170000   9.290000        NaN   9.870000  13.670000  10.250000  10.830000  12.580000  18.500000  15.040000
1961-01-02/1961-01-08  13.541429  11.486667  10.487143   6.417143   9.474286   6.435714  11.061429   6.616667   8.434286   8.497143  12.481429  13.238571
1961-01-09/1961-01-15  12.468571   8.967143  11.958571   4.630000   7.351429   5.072857   7.535714   6.820000   5.712857   7.571429  11.125714  11.024286
1961-01-16/1961-01-22  13.204286   9.862857  12.982857   6.328571   8.966667   7.417143   9.257143   7.875714   7.145714   8.124286   9.821429  11.434286
1961-01-23/1961-01-29  19.880000  16.141429  18.225714  12.720000  17.432857  14.828571  15.528571  15.160000  14.480000  15.640000  20.930000  22.530000
...                          ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
1978-11-27/1978-12-03  14.934286  11.232857  13.941429   5.565714  10.215714   8.618571   9.642857   7.685714   9.011429   9.547143  11.835714  18.728571
1978-12-04/1978-12-10  20.740000  19.190000  17.034286   9.777143  15.287143  12.774286  14.437143  12.488571  13.870000  14.082857  18.517143  23.061429
1978-12-11/1978-12-17  16.758571  14.692857  14.987143   6.917143  11.397143   7.272857  10.208571   7.967143   9.168571   8.565714  11.102857  15.562857
1978-12-18/1978-12-24  11.155714   8.008571  13.172857   4.004286   7.825714   6.290000   7.798571   8.667143   7.151429   8.072857  11.845714  18.977143
1978-12-25/1978-12-31  14.951429  11.801429  16.035714   6.507143   9.660000   8.620000  13.708571  10.477143  10.868571  11.471429  12.947143  26.844286

940 rows × 12 columns

6.12 Compute the min, max, mean, and standard deviation of the wind speed across all locations for each of the first 52 weeks (assume the first week starts on January 2, 1961).
# resample to weekly frequency and aggregate
weekly = wind.resample('W').agg(['min','max','mean','std'])
weekly
              RPT                               VAL                               ROS          ...        CLO                    BEL                               MAL
              min    max       mean       std   min    max       mean       std   min    max   ...       mean       std   min    max       mean       std    min    max       mean       std
Yr_Mo_Dy
1961-01-01  15.04  15.04  15.040000       NaN  14.96  14.96  14.960000       NaN  13.17  13.17  ...  12.580000       NaN  18.50  18.50  18.500000       NaN  15.04  15.04  15.040000       NaN
1961-01-08  10.58  18.50  13.541429  2.631321   6.63  16.88  11.486667  3.949525   7.62  12.33  ...   8.497143  1.704941   5.46  17.54  12.481429  4.349139  10.88  16.46  13.238571  1.773062
1961-01-15   9.04  19.75  12.468571  3.555392   3.54  12.08   8.967143  3.148945   7.08  19.50  ...   7.571429  4.084293   5.25  20.71  11.125714  5.552215   5.17  16.92  11.024286  4.692355
1961-01-22   4.92  19.83  13.204286  5.337402   3.42  14.37   9.862857  3.837785   7.29  20.79  ...   8.124286  4.783952   6.50  15.92   9.821429  3.626584   6.79  17.96  11.434286  4.237239
1961-01-29  13.62  25.04  19.880000  4.619061   9.96  23.91  16.141429  5.170224  12.67  25.84  ...  15.640000  3.713368  14.04  27.71  20.930000  5.210726  17.50  27.63  22.530000  3.874721
...           ...    ...        ...       ...    ...    ...        ...       ...    ...    ...  ...        ...       ...    ...    ...        ...       ...    ...    ...        ...       ...
1978-12-03   9.08  21.29  14.934286  4.931754   4.54  21.34  11.232857  5.978968   8.21  24.04  ...   9.547143  6.284973   4.92  21.42  11.835714  5.950112  11.50  25.75  18.728571  6.393188
1978-12-10   9.92  29.33  20.740000  7.215012  12.54  24.79  19.190000  4.953060   7.21  25.37  ...  14.082857  5.516405   9.54  26.08  18.517143  5.600389  15.34  34.59  23.061429  8.093976
1978-12-17   9.87  23.13  16.758571  4.499431   3.21  24.04  14.692857  7.578665   8.04  18.05  ...   8.565714  5.487801   5.00  21.50  11.102857  6.631925   6.92  22.83  15.562857  6.005594
1978-12-24   6.21  16.62  11.155714  3.522759   3.63  13.29   8.008571  3.882900   8.50  22.21  ...   8.072857  3.023131   3.21  19.79  11.845714  5.750301  10.29  31.71  18.977143  7.194108
1978-12-31   7.21  20.33  14.951429  4.350400   5.46  17.41  11.801429  4.705392   7.83  27.29  ...  11.471429  5.533397   1.21  21.79  12.947143  7.523148  11.96  41.46  26.844286 11.621233

940 rows × 48 columns

weekly.loc[weekly.index[1:53],:].head()
              RPT                               VAL                               ROS          ...        CLO                    BEL                               MAL
              min    max       mean       std   min    max       mean       std   min    max   ...       mean       std   min    max       mean       std    min    max       mean       std
Yr_Mo_Dy
1961-01-08  10.58  18.50  13.541429  2.631321   6.63  16.88  11.486667  3.949525   7.62  12.33  ...   8.497143  1.704941   5.46  17.54  12.481429  4.349139  10.88  16.46  13.238571  1.773062
1961-01-15   9.04  19.75  12.468571  3.555392   3.54  12.08   8.967143  3.148945   7.08  19.50  ...   7.571429  4.084293   5.25  20.71  11.125714  5.552215   5.17  16.92  11.024286  4.692355
1961-01-22   4.92  19.83  13.204286  5.337402   3.42  14.37   9.862857  3.837785   7.29  20.79  ...   8.124286  4.783952   6.50  15.92   9.821429  3.626584   6.79  17.96  11.434286  4.237239
1961-01-29  13.62  25.04  19.880000  4.619061   9.96  23.91  16.141429  5.170224  12.67  25.84  ...  15.640000  3.713368  14.04  27.71  20.930000  5.210726  17.50  27.63  22.530000  3.874721
1961-02-05  10.58  24.21  16.827143  5.251408   9.46  24.21  15.460000  5.187395   9.04  19.70  ...   9.460000  2.839501   9.17  19.33  14.012857  4.210858   7.17  19.25  11.935714  4.336104

5 rows × 48 columns
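The resample-then-aggregate pattern can be sketched on a minimal series (two full Monday-to-Sunday weeks; `'W'` bins end on Sunday by default):

```python
import pandas as pd
import numpy as np

# 1961-01-02 was a Monday, so this covers exactly two calendar weeks
idx = pd.date_range('1961-01-02', periods=14, freq='D')
s = pd.Series(np.arange(14, dtype=float), index=idx)

# Several aggregations in one pass, as in the wind exercise
weekly = s.resample('W').agg(['min', 'max', 'mean', 'std'])
```

Each row of `weekly` summarizes one week; the first bin holds values 0-6, the second 7-13.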

7. Exploring the Titanic Disaster Data (Visualization)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

plt.rcParams["font.sans-serif"]=['SimHei']  # render Chinese glyphs correctly
plt.rcParams["axes.unicode_minus"]=False   # render minus signs correctly

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
7.1 Import the data
titanic = pd.read_csv('train.csv')
titanic.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
7.2 Draw a pie chart of the passengers' sex proportion
males = (titanic['Sex']=='male').sum()
females = (titanic['Sex']=='female').sum()

proportions = [males, females]
#plt.figure(figsize=(8,8), dpi=100)
#plt.Subplot(111,facecolor='#F0F0F0')

plt.pie(
    proportions,
    labels = ['Males', 'Females'],
    shadow = True,
    colors = ['#E76278','#FAC699'],
    explode = (0.15,0),
    startangle = 90,
    autopct = '%1.1f%%'
    )

plt.axis('equal')
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()


(figure: pie chart of passenger sex proportion)

7.3 Draw a scatter plot of ticket fare against passenger age
lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)
lm.set(title='Fare_Age Scatter')

plt.tight_layout()
plt.show()

(figure: fare vs. age scatter plot, colored by sex)

7.4 How many passengers survived?
titanic['Survived'].sum()
342
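Because `Survived` is coded 0/1, `sum()` counts survivors and `mean()` gives the survival rate directly; a hypothetical mini-sample with the same column layout as `train.csv`:

```python
import pandas as pd

# Hypothetical four-passenger sample, same encoding as the Titanic data
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Survived': [0, 1, 1, 1],
})

total_survived = df['Survived'].sum()                 # survivor count
rate_by_sex = df.groupby('Sex')['Survived'].mean()    # survival rate per sex
```

The same `groupby` on the real data is a common next step after the raw count.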
7.5 Draw a histogram of ticket fares
plt.figure(figsize=(10,4), dpi=100)
plt.subplot(111, facecolor='#F0F0F0')

plt.title('船票价格分布')
plt.xlabel('船票价格')
plt.ylabel('人数')

plt.hist(titanic['Fare'], bins=50, color='#447C69')
plt.xticks(np.arange(0,300,20), fontsize=10)
plt.yticks(fontsize=10)

plt.tight_layout()

(figure: histogram of ticket fare distribution)

8. Exploring the Pokemon Data
8.1 Import the data
import pandas as pd
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
 "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
 "type": ['grass', 'fire', 'water', 'bug'],
 "hp": [45, 39, 44, 45],
 "pokedex": ['yes', 'no','yes','no']
 }
8.2 Create a DataFrame
pokemon = pd.DataFrame(raw_data)
pokemon.head()
         name   evolution   type  hp pokedex
0   Bulbasaur     Ivysaur  grass  45     yes
1  Charmander  Charmeleon   fire  39      no
2    Squirtle   Wartortle  water  44     yes
3    Caterpie     Metapod    bug  45      no
8.3 Reorder the columns
pokemon = pokemon[['name','type','hp','evolution','pokedex']]
pokemon.head()
         name   type  hp   evolution pokedex
0   Bulbasaur  grass  45     Ivysaur     yes
1  Charmander   fire  39  Charmeleon      no
2    Squirtle  water  44   Wartortle     yes
3    Caterpie    bug  45     Metapod      no
8.4 Add a place column
pokemon['place'] = ['park','street','lake','forest']
pokemon
         name   type  hp   evolution pokedex   place
0   Bulbasaur  grass  45     Ivysaur     yes    park
1  Charmander   fire  39  Charmeleon      no  street
2    Squirtle  water  44   Wartortle     yes    lake
3    Caterpie    bug  45     Metapod      no  forest
8.5 Check the dtype of each column
pokemon.dtypes
name         object
type         object
hp            int64
evolution    object
pokedex      object
place        object
dtype: object
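String flags like the yes/no `pokedex` column are often more useful as booleans; a small sketch of converting with `map` (the `pokedex_flag` name is just illustrative):

```python
import pandas as pd

pokemon = pd.DataFrame({
    'name': ['Bulbasaur', 'Charmander'],
    'hp': [45, 39],
    'pokedex': ['yes', 'no'],
})

# Map yes/no strings onto a proper boolean column
pokemon['pokedex_flag'] = pokemon['pokedex'].map({'yes': True, 'no': False})
```

After the conversion, `dtypes` shows `bool` for the new column and boolean filtering works directly.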
9. Exploring Apple Stock Price Data
9.1 Import the data
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
apple = pd.read_csv('Apple_stock.csv')
apple.head()
         Date   Open   High    Low  Close    Volume  Adj Close
0  2014-07-08  96.27  96.80  93.92  95.35  65130000      95.35
1  2014-07-07  94.14  95.99  94.10  95.97  56305400      95.97
2  2014-07-03  93.67  94.10  93.20  94.03  22891800      94.03
3  2014-07-02  93.87  94.06  93.09  93.48  28420900      93.48
4  2014-07-01  93.52  94.07  93.13  93.52  38170200      93.52
apple.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8465 entries, 1980-12-12 to 2014-07-08
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       8465 non-null   float64
 1   High       8465 non-null   float64
 2   Low        8465 non-null   float64
 3   Close      8465 non-null   float64
 4   Volume     8465 non-null   int64  
 5   Adj Close  8465 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 721.0 KB
9.2 Are there any duplicate dates?
  • Two approaches
apple['Date'].duplicated().sum()
# no duplicate dates
0
apple['Date'].is_unique
True
# is_unique also works on an index
apple.index.is_unique
True
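The two duplicate checks behave as a pair: `duplicated().sum()` counts the extra occurrences, while `is_unique` is a single yes/no. A minimal sketch:

```python
import pandas as pd

dates = pd.Series(['2014-07-01', '2014-07-02', '2014-07-02'])

dup_count = dates.duplicated().sum()   # counts repeats after the first occurrence
unique_flag = dates.is_unique          # False as soon as any value repeats
```

On the Apple data both report no duplicates, which is why setting `Date` as the index is safe.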
  • Set the date column as the index
apple['Date'] = pd.to_datetime(apple['Date'])
apple.set_index('Date', inplace=True)
apple.head()
             Open   High    Low  Close    Volume  Adj Close
Date
2014-07-08  96.27  96.80  93.92  95.35  65130000      95.35
2014-07-07  94.14  95.99  94.10  95.97  56305400      95.97
2014-07-03  93.67  94.10  93.20  94.03  22891800      94.03
2014-07-02  93.87  94.06  93.09  93.48  28420900      93.48
2014-07-01  93.52  94.07  93.13  93.52  38170200      93.52
9.3 Sort the index in ascending order
apple.sort_index(ascending=True, inplace=True)
apple.head()
             Open   High    Low  Close     Volume  Adj Close
Date
1980-12-12  28.75  28.87  28.75  28.75  117258400       0.45
1980-12-15  27.38  27.38  27.25  27.25   43971200       0.42
1980-12-16  25.37  25.37  25.25  25.25   26432000       0.39
1980-12-17  25.87  26.00  25.87  25.87   21610400       0.40
1980-12-18  26.63  26.75  26.63  26.63   18362400       0.41
9.4 Find the last business day of each month
# 'BM' business month end frequency
apple_month = apple.resample('BM').mean()
apple_month.head()
                 Open       High        Low      Close        Volume  Adj Close
Date
1980-12-31  30.481538  30.567692  30.443077  30.443077  2.586252e+07   0.473077
1981-01-30  31.754762  31.826667  31.654762  31.654762  7.249867e+06   0.493810
1981-02-27  26.480000  26.572105  26.407895  26.407895  4.231832e+06   0.411053
1981-03-31  24.937727  25.016818  24.836364  24.836364  7.962691e+06   0.387727
1981-04-30  27.286667  27.368095  27.227143  27.227143  6.392000e+06   0.423333
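Note that `resample('BM').mean()` labels each *monthly average* at the business-month end; to keep the actual values from the last business day of each month, `.last()` is the aggregation to use. A sketch on synthetic business-day prices:

```python
import pandas as pd
import numpy as np

# Business days only, spanning January and February 1981
idx = pd.bdate_range('1981-01-01', '1981-02-27')
prices = pd.Series(np.arange(len(idx), dtype=float), index=idx)

monthly_mean = prices.resample('BM').mean()   # average over the month
monthly_last = prices.resample('BM').last()   # value on the last business day
```

January 1981 has 22 business days (values 0-21), so its mean is 10.5 while its last value is 21.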
9.5 How many days separate the earliest and latest dates in the dataset?
(apple.index.max()-apple.index.min()).days
12261
9.6 How many months does the dataset span?
len(apple.resample('BM').mean())
404
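Counting months via `resample` counts every bin in the date range, including months with no rows; counting distinct calendar periods directly avoids that. A sketch with a deliberate gap (no data in 1981-02):

```python
import pandas as pd

idx = pd.to_datetime(['1980-12-12', '1980-12-15', '1981-01-05', '1981-03-02'])
s = pd.Series(1.0, index=idx)

# Distinct calendar months actually present in the data
n_months = s.index.to_period('M').nunique()
```

Here `n_months` is 3, while `len(s.resample('M').mean())` would be 4 because the empty February still gets a bin; for a gap-free daily dataset like the Apple prices the two agree.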
9.7 Plot the Adj Close values in chronological order
plt.figure(figsize=(8,3), dpi=100)
apple['Adj Close'].plot(title='Apple Stock')
<AxesSubplot:title={'center':'Apple Stock'}, xlabel='Date'>

(figure: Apple Adj Close over time)

10. Exploring the Iris Flower Dataset
10.1 Import the data
import pandas as pd
iris = pd.read_csv('iris.csv')
iris.head()
   5.1  3.5  1.4  0.2  Iris-setosa
0  4.9  3.0  1.4  0.2  Iris-setosa
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
3  5.0  3.6  1.4  0.2  Iris-setosa
4  5.4  3.9  1.7  0.4  Iris-setosa
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   5.1          149 non-null    float64
 1   3.5          149 non-null    float64
 2   1.4          149 non-null    float64
 3   0.2          149 non-null    float64
 4   Iris-setosa  149 non-null    object 
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
10.2 Add column names
  • Note: iris.csv has no header row, so read_csv consumed the first record (5.1, 3.5, 1.4, 0.2, Iris-setosa) as column names, leaving 149 rows instead of 150. Reading with pd.read_csv('iris.csv', header=None, names=[...]) would keep all 150 rows.
iris.columns = ['sepal_length','sepal_width', 'petal_length','petal_width', 'class']
iris.head()
   sepal_length  sepal_width  petal_length  petal_width        class
0           4.9          3.0           1.4          0.2  Iris-setosa
1           4.7          3.2           1.3          0.2  Iris-setosa
2           4.6          3.1           1.5          0.2  Iris-setosa
3           5.0          3.6           1.4          0.2  Iris-setosa
4           5.4          3.9           1.7          0.4  Iris-setosa
10.3 Does the dataset have any missing values?
iris.isna().sum()
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64
10.4 Set rows 10 through 19 of petal_length to missing values
  • Python's range() is half-open: stop=20 below, so 20 itself is excluded
import numpy as np

for i in range(10, 20):
    iris.at[i, 'petal_length'] = np.nan

iris.head(21)
    sepal_length  sepal_width  petal_length  petal_width        class
0            4.9          3.0           1.4          0.2  Iris-setosa
1            4.7          3.2           1.3          0.2  Iris-setosa
2            4.6          3.1           1.5          0.2  Iris-setosa
3            5.0          3.6           1.4          0.2  Iris-setosa
4            5.4          3.9           1.7          0.4  Iris-setosa
5            4.6          3.4           1.4          0.3  Iris-setosa
6            5.0          3.4           1.5          0.2  Iris-setosa
7            4.4          2.9           1.4          0.2  Iris-setosa
8            4.9          3.1           1.5          0.1  Iris-setosa
9            5.4          3.7           1.5          0.2  Iris-setosa
10           4.8          3.4           NaN          0.2  Iris-setosa
11           4.8          3.0           NaN          0.1  Iris-setosa
12           4.3          3.0           NaN          0.1  Iris-setosa
13           5.8          4.0           NaN          0.2  Iris-setosa
14           5.7          4.4           NaN          0.4  Iris-setosa
15           5.4          3.9           NaN          0.4  Iris-setosa
16           5.1          3.5           NaN          0.3  Iris-setosa
17           5.7          3.8           NaN          0.3  Iris-setosa
18           5.1          3.8           NaN          0.3  Iris-setosa
19           5.4          3.4           NaN          0.2  Iris-setosa
20           5.1          3.7           1.5          0.4  Iris-setosa
# iloc also works, but note: column index 1 is sepal_width, so this snippet
# blanks a different column; targeting petal_length would be iris.iloc[10:20, 2:3]
iris.iloc[10:20, 1:2] = np.nan
iris.head(21)
    sepal_length  sepal_width  petal_length  petal_width        class
0            4.9          3.0           1.4          0.2  Iris-setosa
1            4.7          3.2           1.3          0.2  Iris-setosa
2            4.6          3.1           1.5          0.2  Iris-setosa
3            5.0          3.6           1.4          0.2  Iris-setosa
4            5.4          3.9           1.7          0.4  Iris-setosa
5            4.6          3.4           1.4          0.3  Iris-setosa
6            5.0          3.4           1.5          0.2  Iris-setosa
7            4.4          2.9           1.4          0.2  Iris-setosa
8            4.9          3.1           1.5          0.1  Iris-setosa
9            5.4          3.7           1.5          0.2  Iris-setosa
10           4.8          NaN           1.6          0.2  Iris-setosa
11           4.8          NaN           1.4          0.1  Iris-setosa
12           4.3          NaN           1.1          0.1  Iris-setosa
13           5.8          NaN           1.2          0.2  Iris-setosa
14           5.7          NaN           1.5          0.4  Iris-setosa
15           5.4          NaN           1.3          0.4  Iris-setosa
16           5.1          NaN           1.4          0.3  Iris-setosa
17           5.7          NaN           1.7          0.3  Iris-setosa
18           5.1          NaN           1.5          0.3  Iris-setosa
19           5.4          NaN           1.7          0.2  Iris-setosa
20           5.1          3.7           1.5          0.4  Iris-setosa
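A third option worth knowing: `loc` slices by *label* and is inclusive on both ends, so rows 10 through 19 can be blanked in one vectorized assignment without a loop. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'petal_length': np.arange(30, dtype=float)})

# Label-based slice: unlike iloc/range, loc's 10:19 includes row 19
df.loc[10:19, 'petal_length'] = np.nan
n_missing = df['petal_length'].isna().sum()
```

This names the column explicitly, which avoids the off-by-one-column mistake that positional `iloc` slices invite.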
10.5 Replace all missing values with 1.0
iris.fillna(1, inplace=True)
iris.head(21)
    sepal_length  sepal_width  petal_length  petal_width        class
0            4.9          3.0           1.4          0.2  Iris-setosa
1            4.7          3.2           1.3          0.2  Iris-setosa
2            4.6          3.1           1.5          0.2  Iris-setosa
3            5.0          3.6           1.4          0.2  Iris-setosa
4            5.4          3.9           1.7          0.4  Iris-setosa
5            4.6          3.4           1.4          0.3  Iris-setosa
6            5.0          3.4           1.5          0.2  Iris-setosa
7            4.4          2.9           1.4          0.2  Iris-setosa
8            4.9          3.1           1.5          0.1  Iris-setosa
9            5.4          3.7           1.5          0.2  Iris-setosa
10           4.8          3.4           1.0          0.2  Iris-setosa
11           4.8          3.0           1.0          0.1  Iris-setosa
12           4.3          3.0           1.0          0.1  Iris-setosa
13           5.8          4.0           1.0          0.2  Iris-setosa
14           5.7          4.4           1.0          0.4  Iris-setosa
15           5.4          3.9           1.0          0.4  Iris-setosa
16           5.1          3.5           1.0          0.3  Iris-setosa
17           5.7          3.8           1.0          0.3  Iris-setosa
18           5.1          3.8           1.0          0.3  Iris-setosa
19           5.4          3.4           1.0          0.2  Iris-setosa
20           5.1          3.7           1.5          0.4  Iris-setosa
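`fillna` also accepts a per-column mapping instead of a single scalar, which is handy when different columns need different fill values. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'petal_length': [1.4, np.nan, 1.5],
    'sepal_width': [np.nan, 3.0, 3.2],
})

# Constant fill for one column, mean-imputation for another
filled = df.fillna({'petal_length': 1.0,
                    'sepal_width': df['sepal_width'].mean()})
```

The scalar form used above (`fillna(1)`) is just the special case where every column gets the same value.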
10.6 Delete the class column
del iris['class']
iris.head()
   sepal_length  sepal_width  petal_length  petal_width
0           4.9          3.0           1.4          0.2
1           4.7          3.2           1.3          0.2
2           4.6          3.1           1.5          0.2
3           5.0          3.6           1.4          0.2
4           5.4          3.9           1.7          0.4
10.7 Set the first three rows to missing values
iris.iloc[0:3,:] = np.nan
iris.head()
   sepal_length  sepal_width  petal_length  petal_width
0           NaN          NaN           NaN          NaN
1           NaN          NaN           NaN          NaN
2           NaN          NaN           NaN          NaN
3           5.0          3.6           1.4          0.2
4           5.4          3.9           1.7          0.4
10.8 Drop rows with missing values
  • Note: the original answer used drop_duplicates(), which only removes repeated rows (and left one all-NaN row behind); dropna() is what actually drops rows containing missing values
iris.dropna(inplace=True)
iris.head()
   sepal_length  sepal_width  petal_length  petal_width
3           5.0          3.6           1.4          0.2
4           5.4          3.9           1.7          0.4
5           4.6          3.4           1.4          0.3
6           5.0          3.4           1.5          0.2
7           4.4          2.9           1.4          0.2
10.9 Reset the index
iris.reset_index(inplace=True, drop=True)
iris.head()
   sepal_length  sepal_width  petal_length  petal_width
0           5.0          3.6           1.4          0.2
1           5.4          3.9           1.7          0.4
2           4.6          3.4           1.4          0.3
3           5.0          3.4           1.5          0.2
4           4.4          2.9           1.4          0.2
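`dropna` has a `how` parameter that controls how aggressive the drop is; a minimal sketch of the difference:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [np.nan, 1.0, 2.0],
    'b': [np.nan, np.nan, 3.0],
})

all_missing_dropped = df.dropna(how='all')   # drops only rows that are entirely NaN
any_missing_dropped = df.dropna()            # default how='any': any NaN drops the row
```

For the iris exercise, the default `how='any'` is what the task asks for, since the first three rows are entirely NaN anyway.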