GitHub repo: https://github.com/guipsamora/pandas_exercises  Dataset: https://www.aliyundrive.com/s/Wk6L6R67rrn
1. Getting to know the data
1.1 Import the dataset and take a first look
import pandas as pd
chipotle = pd.read_csv('001_chipotle.tsv', sep='\t')
chipotle.head()
order_id quantity item_name choice_description item_price 0 1 1 Chips and Fresh Tomato Salsa NaN $2.39 1 1 1 Izze [Clementine] $3.39 2 1 1 Nantucket Nectar [Apple] $3.39 3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39 4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 4622 non-null int64
1 quantity 4622 non-null int64
2 item_name 4622 non-null object
3 choice_description 3376 non-null object
4 item_price 4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB
chipotle.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
chipotle.index
RangeIndex(start=0, stop=4622, step=1)
1.2 Which item was ordered the most?
With as_index=False the grouping column stays a regular column and the default integer index is kept; in the displayed result all column names sit on one row (no "merged cells"), which is usually easier to work with.
item_buy_count = chipotle[['item_name', 'quantity']].groupby('item_name').sum()['quantity'].nlargest(1)
item_buy_count
item_name
Chicken Bowl 761
Name: quantity, dtype: int64
item_buy_count_1 = chipotle[['item_name', 'quantity']].groupby('item_name', as_index=False).agg({'quantity': 'sum'})
item_buy_count_1.sort_values('quantity', ascending=False, inplace=True)
item_buy_count_1.head()
item_name quantity 17 Chicken Bowl 761 18 Chicken Burrito 591 25 Chips and Guacamole 506 39 Steak Burrito 386 10 Canned Soft Drink 351
1.3 How many different items were ordered?
chipotle['item_name'].nunique()
50
s = chipotle['item_name'].value_counts()
from matplotlib import pyplot as plt
plt.figure(figsize=(20, 4), dpi=100)
plt.xticks(fontsize=13, rotation='vertical')
plt.bar(s.index, s.values)
<BarContainer object of 50 artists>
1.4 Ranking of choice_description values among the ordered items
chipotle['choice_description'].value_counts().head()
[Diet Coke] 134
[Coke] 123
[Sprite] 77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]] 42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]] 40
Name: choice_description, dtype: int64
1.5 How many items were ordered in total?
chipotle['quantity'].sum()
4972
1.6 Convert item_price to float
chipotle['item_price'] = chipotle['item_price'].str.lstrip('$').astype('float')
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 4622 non-null int64
1 quantity 4622 non-null int64
2 item_name 4622 non-null object
3 choice_description 3376 non-null object
4 item_price 4622 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 180.7+ KB
# Alternative, applied to the original string column (run this instead of the conversion above, not after it):
chipotle['item_price'] = chipotle['item_price'].apply(lambda x: float(x[1:-1]))
1.7 Total revenue
chipotle['sub_total'] = chipotle['item_price'] * chipotle['quantity']
chipotle['sub_total'].sum()
39237.02
1.8 Total number of orders
chipotle['order_id'].nunique()
1834
1.9 Average order value
chipotle['sub_total'].sum() / chipotle['order_id'].nunique()
21.39423118865867
chipotle[['order_id', 'sub_total']].groupby(by=['order_id']).agg({'sub_total': 'sum'})['sub_total'].mean()
21.39423118865867
2. Filtering and sorting data
2.1 Import the data and take a first look
import pandas as pd
euro = pd.read_csv('Euro2012_stats.csv')
euro.head()
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used 0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 13 81.3% 41 62 2 9 0 9 9 16 1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 9 60.1% 53 73 8 7 0 11 11 19 2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 10 66.7% 25 38 8 4 0 7 7 15 3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 22 88.1% 43 45 6 5 0 11 11 16 4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 6 54.6% 36 51 5 6 0 11 11 19
5 rows × 35 columns
euro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Team 16 non-null object
1 Goals 16 non-null int64
2 Shots on target 16 non-null int64
3 Shots off target 16 non-null int64
4 Shooting Accuracy 16 non-null object
5 % Goals-to-shots 16 non-null object
6 Total shots (inc. Blocked) 16 non-null int64
7 Hit Woodwork 16 non-null int64
8 Penalty goals 16 non-null int64
9 Penalties not scored 16 non-null int64
10 Headed goals 16 non-null int64
11 Passes 16 non-null int64
12 Passes completed 16 non-null int64
13 Passing Accuracy 16 non-null object
14 Touches 16 non-null int64
15 Crosses 16 non-null int64
16 Dribbles 16 non-null int64
17 Corners Taken 16 non-null int64
18 Tackles 16 non-null int64
19 Clearances 16 non-null int64
20 Interceptions 16 non-null int64
21 Clearances off line 15 non-null float64
22 Clean Sheets 16 non-null int64
23 Blocks 16 non-null int64
24 Goals conceded 16 non-null int64
25 Saves made 16 non-null int64
26 Saves-to-shots ratio 16 non-null object
27 Fouls Won 16 non-null int64
28 Fouls Conceded 16 non-null int64
29 Offsides 16 non-null int64
30 Yellow Cards 16 non-null int64
31 Red Cards 16 non-null int64
32 Subs on 16 non-null int64
33 Subs off 16 non-null int64
34 Players Used 16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
2.2 How many teams took part in Euro 2012?
euro['Team'].nunique()
16
2.3 Store the columns Team, Yellow Cards and Red Cards in a separate DataFrame named discripline
discripline = euro[['Team', 'Yellow Cards', 'Red Cards']].copy()
discripline
Team Yellow Cards Red Cards 0 Croatia 9 0 1 Czech Republic 7 0 2 Denmark 4 0 3 England 5 0 4 France 6 0 5 Germany 4 0 6 Greece 9 1 7 Italy 16 0 8 Netherlands 5 0 9 Poland 7 1 10 Portugal 12 0 11 Republic of Ireland 6 1 12 Russia 6 0 13 Spain 11 0 14 Sweden 7 0 15 Ukraine 5 0
2.4 Sort discripline
discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=[False, False], inplace=True)
discripline
Team Yellow Cards Red Cards 7 Italy 16 0 10 Portugal 12 0 13 Spain 11 0 6 Greece 9 1 0 Croatia 9 0 9 Poland 7 1 1 Czech Republic 7 0 14 Sweden 7 0 11 Republic of Ireland 6 1 4 France 6 0 12 Russia 6 0 3 England 5 0 8 Netherlands 5 0 15 Ukraine 5 0 2 Denmark 4 0 5 Germany 4 0
2.5 Average number of yellow cards per team
discripline['Yellow Cards'].mean()
7.4375
2.6 Teams with more than 6 yellow cards
discripline[discripline['Yellow Cards'] > 6]
Team Yellow Cards Red Cards 7 Italy 16 0 10 Portugal 12 0 13 Spain 11 0 6 Greece 9 1 0 Croatia 9 0 9 Poland 7 1 1 Czech Republic 7 0 14 Sweden 7 0
2.7 Select the teams whose name starts with G
euro[euro['Team'].str.startswith('G')]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used 5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17 6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 13 65.1% 67 48 12 9 1 12 12 20
2 rows × 35 columns
2.8 Select the first 7 columns
euro.iloc[:, 0:7]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) 0 Croatia 4 13 12 51.9% 16.0% 32 1 Czech Republic 4 13 18 41.9% 12.9% 39 2 Denmark 4 10 10 50.0% 20.0% 27 3 England 5 11 18 50.0% 17.2% 40 4 France 3 22 24 37.9% 6.5% 65 5 Germany 10 32 32 47.8% 15.6% 80 6 Greece 5 8 18 30.7% 19.2% 32 7 Italy 6 34 45 43.0% 7.5% 110 8 Netherlands 2 12 36 25.0% 4.1% 60 9 Poland 2 15 23 39.4% 5.2% 48 10 Portugal 6 22 42 34.3% 9.3% 82 11 Republic of Ireland 1 7 12 36.8% 5.2% 28 12 Russia 5 9 31 22.5% 12.5% 59 13 Spain 12 42 33 55.9% 16.0% 100 14 Sweden 5 17 19 47.2% 13.8% 39 15 Ukraine 2 7 26 21.2% 6.0% 38
2.9 Select all columns except the last 3
euro.iloc[:, :-3]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Clean Sheets Blocks Goals conceded Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards 0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 0 10 3 13 81.3% 41 62 2 9 0 1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 1 10 6 9 60.1% 53 73 8 7 0 2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 1 10 5 10 66.7% 25 38 8 4 0 3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 2 29 3 22 88.1% 43 45 6 5 0 4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 1 7 5 6 54.6% 36 51 5 6 0 5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 1 11 6 10 62.6% 63 49 12 4 0 6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 1 23 7 13 65.1% 67 48 12 9 1 7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 2 18 7 20 74.1% 101 89 16 16 0 8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 0 9 5 12 70.6% 35 30 3 5 0 9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 0 8 3 6 66.7% 48 56 3 7 1 10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 2 11 4 10 71.5% 73 90 10 12 0 11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 0 23 9 17 65.4% 43 51 11 6 1 12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 0 8 3 10 77.0% 34 43 4 6 0 13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 5 8 1 15 93.8% 102 83 19 11 0 14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 1 12 5 8 61.6% 35 51 7 7 0 15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 0 4 4 13 76.5% 48 31 4 5 0
16 rows × 32 columns
2.10 Find the Shooting Accuracy of England, Italy and Russia
euro[euro['Team'].isin(['England', 'Italy', 'Russia'])][['Team', 'Shooting Accuracy']]
Team Shooting Accuracy 3 England 50.0% 7 Italy 43.0% 12 Russia 22.5%
euro.loc[euro['Team'].isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']]
Team Shooting Accuracy 3 England 50.0% 7 Italy 43.0% 12 Russia 22.5%
3. Grouping: exploring alcohol consumption data
import pandas as pd
drinks = pd.read_csv('drinks.csv')
drinks.head()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent 0 Afghanistan 0 0 0 0.0 AS 1 Albania 89 132 54 4.9 EU 2 Algeria 25 0 14 0.7 AF 3 Andorra 245 138 312 12.4 EU 4 Angola 217 57 45 5.9 AF
drinks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 193 non-null object
1 beer_servings 193 non-null int64
2 spirit_servings 193 non-null int64
3 wine_servings 193 non-null int64
4 total_litres_of_pure_alcohol 193 non-null float64
5 continent 170 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
3.1 Which continent drinks the most beer on average?
drinks.groupby('continent')['beer_servings'].mean().sort_values(ascending=False).head(1)
continent
EU 193.777778
Name: beer_servings, dtype: float64
3.2 Descriptive statistics of wine consumption for each continent
drinks.groupby('continent')['wine_servings'].describe()
count mean std min 25% 50% 75% max continent AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0 AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0 EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0 OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0 SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
3.3 Mean consumption of each kind of drink for each continent
drinks.groupby('continent').mean()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent AF 61.471698 16.339623 16.264151 3.007547 AS 37.045455 60.840909 9.068182 2.170455 EU 193.777778 132.555556 142.222222 8.617778 OC 89.687500 58.437500 35.625000 3.381250 SA 175.083333 114.750000 62.416667 6.308333
3.4 Median consumption of each kind of drink for each continent
drinks.groupby('continent').median()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent AF 32.0 3.0 2.0 2.30 AS 17.5 16.0 1.0 1.20 EU 219.0 122.0 128.0 10.00 OC 52.5 37.0 8.5 1.75 SA 162.5 108.5 12.0 6.85
3.5 Mean, max and min spirit consumption for each continent
drinks.groupby('continent')['spirit_servings'].agg(['mean', 'max', 'min'])
mean max min continent AF 16.339623 152 0 AS 60.840909 326 0 EU 132.555556 373 0 OC 58.437500 254 0 SA 114.750000 302 25
4. The apply function (exploring US crime data, 1960-2014)
4.1 Import the data and take a first look
import pandas as pd
crime = pd.read_csv('US_Crime_Rates_1960_2014.csv')
crime.head()
Year Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft 0 1960 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200 1 1961 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000 2 1962 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800 3 1963 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300 4 1964 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 55 non-null int64
1 Population 55 non-null int64
2 Total 55 non-null int64
3 Violent 55 non-null int64
4 Property 55 non-null int64
5 Murder 55 non-null int64
6 Forcible_Rape 55 non-null int64
7 Robbery 55 non-null int64
8 Aggravated_assault 55 non-null int64
9 Burglary 55 non-null int64
10 Larceny_Theft 55 non-null int64
11 Vehicle_Theft 55 non-null int64
dtypes: int64(12)
memory usage: 5.3 KB
4.2 Convert Year to datetime
crime['Year'] = pd.to_datetime(crime['Year'], format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 55 non-null datetime64[ns]
1 Population 55 non-null int64
2 Total 55 non-null int64
3 Violent 55 non-null int64
4 Property 55 non-null int64
5 Murder 55 non-null int64
6 Forcible_Rape 55 non-null int64
7 Robbery 55 non-null int64
8 Aggravated_assault 55 non-null int64
9 Burglary 55 non-null int64
10 Larceny_Theft 55 non-null int64
11 Vehicle_Theft 55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.3 KB
4.3 Set Year as the index
crime.set_index('Year', inplace=True)
crime.head()
Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft Year 1960-01-01 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200 1961-01-01 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000 1962-01-01 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800 1963-01-01 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300 1964-01-01 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
4.4 Delete the Total column
del crime['Total']
4.5 Resample the data by decade (resample is very important)
resample in pandas re-samples a time series at a new frequency; it is a convenient way to change the frequency of regular time-series data, and its argument is the sampling rule. Common frequency aliases: A year, M month, W week, D day, H hour, T minute, S second. Here the data is aggregated into ten-year bins.
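As a quick illustration of these frequency strings, a minimal sketch on a synthetic daily series (the demo variable below is only for illustration and is not part of the exercise):
demo = pd.Series(1, index=pd.date_range('2000-01-01', '2002-12-31', freq='D'))
demo.resample('A').sum()     # yearly totals: 366, 365, 365
demo.resample('M').sum()     # monthly totals
demo.resample('10AS').sum()  # ten-year bins, labelled by the start of each bin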
crime_10Ys = crime.resample('10AS').sum()
crime_10Ys
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft Year 1960-01-01 1915053175 4134930 45160900 106180 236720 1633510 2158520 13321100 26547700 5292100 1970-01-01 2121193298 9607930 91383800 192230 554570 4159020 4702120 28486000 53157800 9739900 1980-01-01 2371370069 14074328 117048900 206439 865639 5383109 7619130 33073494 72040253 11935411 1990-01-01 2612825258 17527048 119053499 211664 998827 5748930 10568963 26750015 77679366 14624418 2000-01-01 2947969117 13968056 100944369 163068 922499 4230366 8652124 21565176 67970291 11412834 2010-01-01 1570146307 6072017 44095950 72867 421059 1749809 3764142 10125170 30401698 3569080
Replace the resampled Population values with the maximum: summing Population over each decade above is meaningless, so use the largest value within each decade instead.
crime_10Ys['Population'] = crime['Population'].resample('10AS').max()
crime_10Ys.head()
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft Year 1960-01-01 201385000 4134930 45160900 106180 236720 1633510 2158520 13321100 26547700 5292100 1970-01-01 220099000 9607930 91383800 192230 554570 4159020 4702120 28486000 53157800 9739900 1980-01-01 248239000 14074328 117048900 206439 865639 5383109 7619130 33073494 72040253 11935411 1990-01-01 272690813 17527048 119053499 211664 998827 5748930 10568963 26750015 77679366 14624418 2000-01-01 307006550 13968056 100944369 163068 922499 4230366 8652124 21565176 67970291 11412834
4.6 When was the most dangerous time to live in US history?
crime
(Output: the full 55-row crime DataFrame, 1960-2014, omitted here for brevity.)
crime.idxmax()
Population 2014-01-01
Violent 1992-01-01
Property 1991-01-01
Murder 1991-01-01
Forcible_Rape 1992-01-01
Robbery 1991-01-01
Aggravated_assault 1993-01-01
Burglary 1980-01-01
Larceny_Theft 1991-01-01
Vehicle_Theft 1991-01-01
dtype: datetime64[ns]
The maxima of the individual crime counts cluster in 1991-1993, while the population grows steadily across the whole dataset (from roughly 180 million in 1960 to 320 million in 2014), so the crime rate in that period was higher than in other periods (though this alone does not prove that the whole 1990s were the most dangerous decade).
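To sanity-check this on a per-capita basis, a quick sketch (assuming crime is the Year-indexed DataFrame from 4.3) that normalizes each count by population and finds the year with the highest rate per 100,000 inhabitants:
crime_rate = crime.div(crime['Population'], axis=0) * 100000  # incidents per 100,000 inhabitants
crime_rate.drop(columns='Population').idxmax()                # year of the peak rate for each crime type
If these peaks also fall in the early 1990s, the conclusion above holds per capita as well.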
5. Merging (exploring fictitious name data)
5.1 Build the dummy data
import numpy as np
import pandas as pd
raw_data_1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
data_1 = pd.DataFrame(raw_data_1)
data_2 = pd.DataFrame(raw_data_2)
data_3 = pd.DataFrame(raw_data_3)
print(data_1)
print(data_2)
print(data_3)
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
subject_id first_name last_name
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16
5.2 Concatenate top-to-bottom (axis=0, the direction an egg falls)
all_data = pd.concat([data_1, data_2], ignore_index=True)
all_data
subject_id first_name last_name 0 1 Alex Anderson 1 2 Amy Ackerman 2 3 Allen Ali 3 4 Alice Aoni 4 5 Ayoung Atiches 5 4 Billy Bonder 6 5 Brian Black 7 6 Bran Balwner 8 7 Bryce Brice 9 8 Betty Btisan
5.3 Concatenate side-by-side, axis=1
by_col = pd.concat([data_1, data_2], axis=1)
by_col
subject_id first_name last_name subject_id first_name last_name 0 1 Alex Anderson 4 Billy Bonder 1 2 Amy Ackerman 5 Brian Black 2 3 Allen Ali 6 Bran Balwner 3 4 Alice Aoni 7 Bryce Brice 4 5 Ayoung Atiches 8 Betty Btisan
5.4 Merge all_data and data_3 on subject_id
all_data = pd.merge(all_data, data_3, on='subject_id')
all_data
subject_id first_name last_name test_id 0 1 Alex Anderson 51 1 2 Amy Ackerman 15 2 3 Allen Ali 15 3 4 Alice Aoni 61 4 4 Billy Bonder 61 5 5 Ayoung Atiches 16 6 5 Brian Black 16 7 7 Bryce Brice 14 8 8 Betty Btisan 15
5.5 Inner join data_1 and data_2 on subject_id
inner_join = pd.merge(data_1, data_2, on='subject_id', how='inner')
inner_join
subject_id first_name_x last_name_x first_name_y last_name_y 0 4 Alice Aoni Billy Bonder 1 5 Ayoung Atiches Brian Black
5.6 Outer join data_1 and data_2 on subject_id
outer_join = pd.merge(data_1, data_2, on='subject_id', how='outer')
outer_join
subject_id first_name_x last_name_x first_name_y last_name_y 0 1 Alex Anderson NaN NaN 1 2 Amy Ackerman NaN NaN 2 3 Allen Ali NaN NaN 3 4 Alice Aoni Billy Bonder 4 5 Ayoung Atiches Brian Black 5 6 NaN NaN Bran Balwner 6 7 NaN NaN Bryce Brice 7 8 NaN NaN Betty Btisan
6. Statistics: exploring wind speed data
6.1 Import the data
import pandas as pd
import datetime
wind = pd.read_table('wind.data', sep=r'\s+', parse_dates=[[0, 1, 2]])
wind.head()
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL 0 2061-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1 2061-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 2 2061-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 3 2061-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 4 2061-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
wind.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Yr_Mo_Dy 6574 non-null datetime64[ns]
1 RPT 6568 non-null float64
2 VAL 6571 non-null float64
3 ROS 6572 non-null float64
4 KIL 6569 non-null float64
5 SHA 6572 non-null float64
6 BIR 6574 non-null float64
7 DUB 6571 non-null float64
8 CLA 6572 non-null float64
9 MUL 6571 non-null float64
10 CLO 6573 non-null float64
11 BEL 6574 non-null float64
12 MAL 6570 non-null float64
dtypes: datetime64[ns](1), float64(12)
memory usage: 667.8 KB
6.2 The year 2061 is bad data; try to fix it
The original answer is missing the pd.to_datetime() wrapper; without it, wind['Yr_Mo_Dy'] is no longer of datetime dtype after the function is applied.
def fix_year(x):
    year = x.year - 100 if x.year > 1978 else x.year
    return pd.to_datetime(datetime.date(year, x.month, x.day), format='%Y/%m/%d')

wind['Yr_Mo_Dy'] = wind['Yr_Mo_Dy'].apply(fix_year)
wind['Yr_Mo_Dy'].dt.year.unique()
array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
1972, 1973, 1974, 1975, 1976, 1977, 1978], dtype=int64)
6.3 Set the date as the index; note the dtype should be datetime64
wind.set_index('Yr_Mo_Dy', inplace=True)
wind.head()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL Yr_Mo_Dy 1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
6.4 How many values are missing for each location?
wind.isna().sum()
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
6.5 How many complete (non-missing) values are there for each location?
Both notna() and info() can show the non-null counts.
wind.notna().sum()
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
wind.shape[0] - wind.isnull().sum()
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
6.6 Compute the mean wind speed over all the data
wind.mean().mean()
10.227982360836938
6.7 Create a DataFrame with the minimum, maximum, mean and standard deviation of the wind speed at each location
wind_des = wind.describe().T.iloc[:, [3, 7, 1, 2]]
wind_des
min max mean std RPT 0.67 35.80 12.362987 5.618413 VAL 0.21 33.37 10.644314 5.267356 ROS 1.50 33.84 11.660526 5.008450 KIL 0.00 28.46 6.306468 3.605811 SHA 0.13 37.54 10.455834 4.936125 BIR 0.00 26.16 7.092254 3.968683 DUB 0.00 30.37 9.797343 4.977555 CLA 0.00 31.08 8.495053 4.499449 MUL 0.00 25.88 8.493590 4.166872 CLO 0.04 28.21 8.707332 4.503954 BEL 0.13 42.38 13.121007 5.835037 MAL 0.67 42.54 15.599079 6.699794
# The same statistics can also be computed per day (row-wise):
day_stats = pd.DataFrame()
day_stats['min'] = wind.min(axis=1)
day_stats['max'] = wind.max(axis=1)
day_stats['mean'] = wind.mean(axis=1)
day_stats['std'] = wind.std(axis=1)
day_stats.head()
6.8 Compute the mean January wind speed for each location
wind['date'] = wind.index
wind.head()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date Yr_Mo_Dy 1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01 1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 1961-01-02 1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 1961-01-03 1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 1961-01-04 1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83 1961-01-05
wind['year'] = wind['date'].dt.year
wind['month'] = wind['date'].dt.month
wind['day'] = wind['date'].dt.day
wind.head()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date year month day Yr_Mo_Dy 1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01 1961 1 1 1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 1961-01-02 1961 1 2 1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 1961-01-03 1961 1 3 1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 1961-01-04 1961 1 4 1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83 1961-01-05 1961 1 5
# Equivalently, using apply:
wind['year'] = wind['date'].apply(lambda x: x.year)
wind['month'] = wind['date'].apply(lambda x: x.month)
wind['day'] = wind['date'].apply(lambda x: x.day)
wind[wind['month'] == 1].iloc[:, 0:12].mean()
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
wind.loc[wind.index.month == 1].mean()
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
6.9 Downsample the records to yearly frequency for each location
A DatetimeIndex is a collection of individual timestamps; a PeriodIndex is essentially a DatetimeIndex with a fixed frequency, obtained via DatetimeIndex.to_period(freq).
wind.groupby(wind.index.to_period('A')).mean()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL Yr_Mo_Dy 1961 12.299583 10.351796 11.362369 6.958227 10.881763 7.729726 9.733923 8.858788 8.647652 9.835577 13.502795 13.680773 1962 12.246923 10.110438 11.732712 6.960440 10.657918 7.393068 11.020712 8.793753 8.316822 9.676247 12.930685 14.323956 1963 12.813452 10.836986 12.541151 7.330055 11.724110 8.434712 11.075699 10.336548 8.903589 10.224438 13.638877 14.999014 1964 12.363661 10.920164 12.104372 6.787787 11.454481 7.570874 10.259153 9.467350 7.789016 10.207951 13.740546 14.910301 1965 12.451370 11.075534 11.848767 6.858466 11.024795 7.478110 10.618712 8.879918 7.907425 9.918082 12.964247 15.591644 1966 13.461973 11.557205 12.020630 7.345726 11.805041 7.793671 10.579808 8.835096 8.514438 9.768959 14.265836 16.307260 1967 12.737151 10.990986 11.739397 7.143425 11.630740 7.368164 10.652027 9.325616 8.645014 9.547425 14.774548 17.135945 1968 11.835628 10.468197 11.409754 6.477678 10.760765 6.067322 8.859180 8.255519 7.224945 7.832978 12.808634 15.017486 1969 11.166356 9.723699 10.902000 5.767973 9.873918 6.189973 8.564493 7.711397 7.924521 7.754384 12.621233 15.762904 1970 12.600329 10.726932 11.730247 6.217178 10.567370 7.609452 9.609890 8.334630 9.297616 8.289808 13.183644 16.456027 1971 11.273123 9.095178 11.088329 5.241507 9.440329 6.097151 8.385890 6.757315 7.915370 7.229753 12.208932 15.025233 1972 12.463962 10.561311 12.058333 5.929699 9.430410 6.358825 9.704508 7.680792 8.357295 7.515273 12.727377 15.028716 1973 11.828466 10.680493 10.680493 5.547863 9.640877 6.548740 8.482110 7.614274 8.245534 7.812411 12.169699 15.441096 1974 13.643096 11.811781 12.336356 6.427041 11.110986 6.809781 10.084603 9.896986 9.331753 8.736356 13.252959 16.947671 1975 12.008575 10.293836 11.564712 5.269096 9.190082 5.668521 8.562603 7.843836 8.797945 7.382822 12.631671 15.307863 1976 11.737842 10.203115 10.761230 5.109426 8.846339 6.311038 9.149126 7.146202 8.883716 7.883087 12.332377 15.471448 1977 13.099616 11.144493 12.627836 6.073945 10.003836 8.586438 11.523205 8.378384 9.098192 8.821616 13.459068 16.590849 1978 12.504356 11.044274 11.380000 6.082356 10.167233 7.650658 9.489342 8.800466 9.089753 8.301699 12.967397 16.771370
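For comparison, the same yearly means can also be obtained with resample; the only real difference is the index type. A minimal sketch, restricted to the 12 original station columns (the date/year/month/day helper columns added in 6.8 are excluded here):
stations = wind.iloc[:, :12]                             # the 12 wind-speed columns only
stations.resample('A').mean()                            # yearly means, DatetimeIndex labels
stations.groupby(stations.index.to_period('A')).mean()   # same numbers, PeriodIndex labels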
6.10 Downsample the records to monthly frequency for each location
wind.groupby(wind.index.to_period('M')).mean()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL Yr_Mo_Dy 1961-01 14.841333 11.988333 13.431613 7.736774 11.072759 8.588065 11.184839 9.245333 9.085806 10.107419 13.880968 14.703226 1961-02 16.269286 14.975357 14.441481 9.230741 13.852143 10.937500 11.890714 11.846071 11.821429 12.714286 18.583214 15.411786 1961-03 10.890000 11.296452 10.752903 7.284000 10.509355 8.866774 9.644194 9.829677 10.294138 11.251935 16.410968 15.720000 1961-04 10.722667 9.427667 9.998000 5.830667 8.435000 6.495000 6.925333 7.094667 7.342333 7.237000 11.147333 10.278333 1961-05 9.860968 8.850000 10.818065 5.905333 9.490323 6.574839 7.604000 8.177097 8.039355 8.499355 11.900323 12.011613 ... ... ... ... ... ... ... ... ... ... ... ... ... 1978-08 9.645161 8.259355 9.032258 4.502903 7.368065 5.935161 5.650323 5.417742 7.241290 5.536774 10.466774 12.054194 1978-09 10.913667 10.895000 10.635000 5.725000 10.372000 9.278333 10.790333 9.583000 10.069333 8.939000 15.680333 19.391333 1978-10 9.897742 8.670968 9.295806 4.721290 8.525161 6.774194 8.115484 7.337742 8.297742 8.243871 13.776774 17.150000 1978-11 16.151667 14.802667 13.508000 7.317333 11.475000 8.743000 11.492333 9.657333 10.701333 10.676000 17.404667 20.723000 1978-12 16.175484 13.748065 15.635161 7.094839 11.398710 9.241613 12.077419 10.194839 10.616774 11.028710 13.859677 21.371613
216 rows × 12 columns
6.11 Downsample the records to weekly frequency for each location
wind.groupby(wind.index.to_period('W')).mean()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL Yr_Mo_Dy 1960-12-26/1961-01-01 15.040000 14.960000 13.170000 9.290000 NaN 9.870000 13.670000 10.250000 10.830000 12.580000 18.500000 15.040000 1961-01-02/1961-01-08 13.541429 11.486667 10.487143 6.417143 9.474286 6.435714 11.061429 6.616667 8.434286 8.497143 12.481429 13.238571 1961-01-09/1961-01-15 12.468571 8.967143 11.958571 4.630000 7.351429 5.072857 7.535714 6.820000 5.712857 7.571429 11.125714 11.024286 1961-01-16/1961-01-22 13.204286 9.862857 12.982857 6.328571 8.966667 7.417143 9.257143 7.875714 7.145714 8.124286 9.821429 11.434286 1961-01-23/1961-01-29 19.880000 16.141429 18.225714 12.720000 17.432857 14.828571 15.528571 15.160000 14.480000 15.640000 20.930000 22.530000 ... ... ... ... ... ... ... ... ... ... ... ... ... 1978-11-27/1978-12-03 14.934286 11.232857 13.941429 5.565714 10.215714 8.618571 9.642857 7.685714 9.011429 9.547143 11.835714 18.728571 1978-12-04/1978-12-10 20.740000 19.190000 17.034286 9.777143 15.287143 12.774286 14.437143 12.488571 13.870000 14.082857 18.517143 23.061429 1978-12-11/1978-12-17 16.758571 14.692857 14.987143 6.917143 11.397143 7.272857 10.208571 7.967143 9.168571 8.565714 11.102857 15.562857 1978-12-18/1978-12-24 11.155714 8.008571 13.172857 4.004286 7.825714 6.290000 7.798571 8.667143 7.151429 8.072857 11.845714 18.977143 1978-12-25/1978-12-31 14.951429 11.801429 16.035714 6.507143 9.660000 8.620000 13.708571 10.477143 10.868571 11.471429 12.947143 26.844286
940 rows × 12 columns
6.12 Compute the minimum, maximum, mean and standard deviation of the wind speed across all locations for the first 52 weeks (assuming the first week starts on 1961-01-02).
weekly = wind.resample('W').agg(['min', 'max', 'mean', 'std'])
weekly
RPT VAL ROS ... CLO BEL MAL min max mean std min max mean std min max ... mean std min max mean std min max mean std Yr_Mo_Dy 1961-01-01 15.04 15.04 15.040000 NaN 14.96 14.96 14.960000 NaN 13.17 13.17 ... 12.580000 NaN 18.50 18.50 18.500000 NaN 15.04 15.04 15.040000 NaN 1961-01-08 10.58 18.50 13.541429 2.631321 6.63 16.88 11.486667 3.949525 7.62 12.33 ... 8.497143 1.704941 5.46 17.54 12.481429 4.349139 10.88 16.46 13.238571 1.773062 1961-01-15 9.04 19.75 12.468571 3.555392 3.54 12.08 8.967143 3.148945 7.08 19.50 ... 7.571429 4.084293 5.25 20.71 11.125714 5.552215 5.17 16.92 11.024286 4.692355 1961-01-22 4.92 19.83 13.204286 5.337402 3.42 14.37 9.862857 3.837785 7.29 20.79 ... 8.124286 4.783952 6.50 15.92 9.821429 3.626584 6.79 17.96 11.434286 4.237239 1961-01-29 13.62 25.04 19.880000 4.619061 9.96 23.91 16.141429 5.170224 12.67 25.84 ... 15.640000 3.713368 14.04 27.71 20.930000 5.210726 17.50 27.63 22.530000 3.874721 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 1978-12-03 9.08 21.29 14.934286 4.931754 4.54 21.34 11.232857 5.978968 8.21 24.04 ... 9.547143 6.284973 4.92 21.42 11.835714 5.950112 11.50 25.75 18.728571 6.393188 1978-12-10 9.92 29.33 20.740000 7.215012 12.54 24.79 19.190000 4.953060 7.21 25.37 ... 14.082857 5.516405 9.54 26.08 18.517143 5.600389 15.34 34.59 23.061429 8.093976 1978-12-17 9.87 23.13 16.758571 4.499431 3.21 24.04 14.692857 7.578665 8.04 18.05 ... 8.565714 5.487801 5.00 21.50 11.102857 6.631925 6.92 22.83 15.562857 6.005594 1978-12-24 6.21 16.62 11.155714 3.522759 3.63 13.29 8.008571 3.882900 8.50 22.21 ... 8.072857 3.023131 3.21 19.79 11.845714 5.750301 10.29 31.71 18.977143 7.194108 1978-12-31 7.21 20.33 14.951429 4.350400 5.46 17.41 11.801429 4.705392 7.83 27.29 ... 11.471429 5.533397 1.21 21.79 12.947143 7.523148 11.96 41.46 26.844286 11.621233
940 rows × 48 columns
weekly.loc[weekly.index[1:53], :].head()
RPT VAL ROS ... CLO BEL MAL min max mean std min max mean std min max ... mean std min max mean std min max mean std Yr_Mo_Dy 1961-01-08 10.58 18.50 13.541429 2.631321 6.63 16.88 11.486667 3.949525 7.62 12.33 ... 8.497143 1.704941 5.46 17.54 12.481429 4.349139 10.88 16.46 13.238571 1.773062 1961-01-15 9.04 19.75 12.468571 3.555392 3.54 12.08 8.967143 3.148945 7.08 19.50 ... 7.571429 4.084293 5.25 20.71 11.125714 5.552215 5.17 16.92 11.024286 4.692355 1961-01-22 4.92 19.83 13.204286 5.337402 3.42 14.37 9.862857 3.837785 7.29 20.79 ... 8.124286 4.783952 6.50 15.92 9.821429 3.626584 6.79 17.96 11.434286 4.237239 1961-01-29 13.62 25.04 19.880000 4.619061 9.96 23.91 16.141429 5.170224 12.67 25.84 ... 15.640000 3.713368 14.04 27.71 20.930000 5.210726 17.50 27.63 22.530000 3.874721 1961-02-05 10.58 24.21 16.827143 5.251408 9.46 24.21 15.460000 5.187395 9.04 19.70 ... 9.460000 2.839501 9.17 19.33 14.012857 4.210858 7.17 19.25 11.935714 4.336104
5 rows × 48 columns
7. Exploring the Titanic disaster data (visualization)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams["font.sans-serif"] = ['SimHei']
plt.rcParams["axes.unicode_minus"] = False
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
7.1 Import the data
titanic = pd.read_csv('train.csv')
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
7.2 Draw a pie chart of the passengers' sex ratio
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
proportions = [males, females]
plt.pie(
    proportions,
    labels=['Males', 'Females'],
    shadow=True,
    colors=['#E76278', '#FAC699'],
    explode=(0.15, 0),
    startangle=90,
    autopct='%1.1f%%'
)
plt.axis('equal')
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()
7.3 Draw a scatter plot of ticket fare versus passenger age
lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)
lm.set(title='Fare_Age Scatter')
plt.tight_layout()
plt.show()
7.4 How many passengers survived?
titanic['Survived'].sum()
342
7.5 Draw a histogram of ticket fares
plt.figure(figsize=(10, 4), dpi=100)
plt.subplot(111, facecolor='#F0F0F0')
plt.title('船票价格分布')
plt.xlabel('船票价格')
plt.ylabel('人数')
plt.hist(titanic['Fare'], bins=50, color='#447C69')
plt.xticks(np.arange(0, 300, 20), fontsize=10)
plt.yticks(fontsize=10)
plt.tight_layout()
8. Exploring Pokemon data
8.1 Set up the data
import pandas as pd
raw_data = {"name": ['Bulbasaur', 'Charmander', 'Squirtle', 'Caterpie'],
            "evolution": ['Ivysaur', 'Charmeleon', 'Wartortle', 'Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no', 'yes', 'no']
            }
8.2 Create the DataFrame
pokemon = pd.DataFrame(raw_data)
pokemon.head()
name evolution type hp pokedex 0 Bulbasaur Ivysaur grass 45 yes 1 Charmander Charmeleon fire 39 no 2 Squirtle Wartortle water 44 yes 3 Caterpie Metapod bug 45 no
8.3 Reorder the columns
pokemon = pokemon[['name', 'type', 'hp', 'evolution', 'pokedex']]
pokemon.head()
name type hp evolution pokedex 0 Bulbasaur grass 45 Ivysaur yes 1 Charmander fire 39 Charmeleon no 2 Squirtle water 44 Wartortle yes 3 Caterpie bug 45 Metapod no
8.4 Add a place column
pokemon['place'] = ['park', 'street', 'lake', 'forest']
pokemon
name type hp evolution pokedex place 0 Bulbasaur grass 45 Ivysaur yes park 1 Charmander fire 39 Charmeleon no street 2 Squirtle water 44 Wartortle yes lake 3 Caterpie bug 45 Metapod no forest
8.5 Check the dtype of each column
pokemon.dtypes
name object
type object
hp int64
evolution object
pokedex object
place object
dtype: object
9. Exploring Apple stock price data
9.1 Import the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
apple = pd.read_csv('Apple_stock.csv')
apple.head()
Date Open High Low Close Volume Adj Close 0 2014-07-08 96.27 96.80 93.92 95.35 65130000 95.35 1 2014-07-07 94.14 95.99 94.10 95.97 56305400 95.97 2 2014-07-03 93.67 94.10 93.20 94.03 22891800 94.03 3 2014-07-02 93.87 94.06 93.09 93.48 28420900 93.48 4 2014-07-01 93.52 94.07 93.13 93.52 38170200 93.52
apple.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8465 entries, 1980-12-12 to 2014-07-08
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Open 8465 non-null float64
1 High 8465 non-null float64
2 Low 8465 non-null float64
3 Close 8465 non-null float64
4 Volume 8465 non-null int64
5 Adj Close 8465 non-null float64
dtypes: float64(5), int64(1)
memory usage: 721.0 KB
9.2 Are there any duplicate dates?
apple['Date'].duplicated().sum()
0
apple['Date'].is_unique
True
apple.index.is_unique
True
apple['Date'] = pd.to_datetime(apple['Date'])
apple.set_index('Date', inplace=True)
apple.head()
Open High Low Close Volume Adj Close Date 2014-07-08 96.27 96.80 93.92 95.35 65130000 95.35 2014-07-07 94.14 95.99 94.10 95.97 56305400 95.97 2014-07-03 93.67 94.10 93.20 94.03 22891800 94.03 2014-07-02 93.87 94.06 93.09 93.48 28420900 93.48 2014-07-01 93.52 94.07 93.13 93.52 38170200 93.52
9.3 Sort the index in ascending order
apple.sort_index(ascending=True, inplace=True)
apple.head()
Open High Low Close Volume Adj Close Date 1980-12-12 28.75 28.87 28.75 28.75 117258400 0.45 1980-12-15 27.38 27.38 27.25 27.25 43971200 0.42 1980-12-16 25.37 25.37 25.25 25.25 26432000 0.39 1980-12-17 25.87 26.00 25.87 25.87 21610400 0.40 1980-12-18 26.63 26.75 26.63 26.63 18362400 0.41
9.4 Find the last business day of each month
apple_month = apple.resample('BM').mean()
apple_month.head()
Open High Low Close Volume Adj Close Date 1980-12-31 30.481538 30.567692 30.443077 30.443077 2.586252e+07 0.473077 1981-01-30 31.754762 31.826667 31.654762 31.654762 7.249867e+06 0.493810 1981-02-27 26.480000 26.572105 26.407895 26.407895 4.231832e+06 0.411053 1981-03-31 24.937727 25.016818 24.836364 24.836364 7.962691e+06 0.387727 1981-04-30 27.286667 27.368095 27.227143 27.227143 6.392000e+06 0.423333
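Note that resample('BM').mean() returns monthly means labelled with the last business day of each calendar month; if the goal is the last trading day actually present in the data for each month, a small sketch like the following would do:
apple.index.to_series().resample('M').max().head()  # latest date observed in each month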
9.5 How many days are there between the earliest and the latest date in the dataset?
(apple.index.max() - apple.index.min()).days
12261
9.6 How many months are there in the dataset?
len(apple.resample('BM').mean())
404
9.7 Plot the Adj Close values in chronological order
plt.figure(figsize=(8, 3), dpi=100)
apple['Adj Close'].plot(title='Apple Stock')
<AxesSubplot:title={'center':'Apple Stock'}, xlabel='Date'>
10. Exploring the Iris dataset
10.1 Import the data
import pandas as pd
import numpy as np
iris = pd.read_csv('iris.csv')
iris.head()
5.1 3.5 1.4 0.2 Iris-setosa 0 4.9 3.0 1.4 0.2 Iris-setosa 1 4.7 3.2 1.3 0.2 Iris-setosa 2 4.6 3.1 1.5 0.2 Iris-setosa 3 5.0 3.6 1.4 0.2 Iris-setosa 4 5.4 3.9 1.7 0.4 Iris-setosa
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 5.1 149 non-null float64
1 3.5 149 non-null float64
2 1.4 149 non-null float64
3 0.2 149 non-null float64
4 Iris-setosa 149 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
10.2 Add column names
Note that because the file was read without header=None, the first data row was consumed as the header, which is why only 149 of the 150 rows remain.
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
iris.head()
sepal_length sepal_width petal_length petal_width class 0 4.9 3.0 1.4 0.2 Iris-setosa 1 4.7 3.2 1.3 0.2 Iris-setosa 2 4.6 3.1 1.5 0.2 Iris-setosa 3 5.0 3.6 1.4 0.2 Iris-setosa 4 5.4 3.9 1.7 0.4 Iris-setosa
10.3 Does the dataset have any missing values?
iris.isna().sum()
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
class 0
dtype: int64
10.4 Set rows 10 to 19 of petal_length to missing
Python's range() is half-open: with stop=20 below, 20 itself is excluded.
for i in range(10, 20):
    iris['petal_length'].at[i] = np.nan
iris.head(21)
sepal_length sepal_width petal_length petal_width class 0 4.9 3.0 1.4 0.2 Iris-setosa 1 4.7 3.2 1.3 0.2 Iris-setosa 2 4.6 3.1 1.5 0.2 Iris-setosa 3 5.0 3.6 1.4 0.2 Iris-setosa 4 5.4 3.9 1.7 0.4 Iris-setosa 5 4.6 3.4 1.4 0.3 Iris-setosa 6 5.0 3.4 1.5 0.2 Iris-setosa 7 4.4 2.9 1.4 0.2 Iris-setosa 8 4.9 3.1 1.5 0.1 Iris-setosa 9 5.4 3.7 1.5 0.2 Iris-setosa 10 4.8 3.4 NaN 0.2 Iris-setosa 11 4.8 3.0 NaN 0.1 Iris-setosa 12 4.3 3.0 NaN 0.1 Iris-setosa 13 5.8 4.0 NaN 0.2 Iris-setosa 14 5.7 4.4 NaN 0.4 Iris-setosa 15 5.4 3.9 NaN 0.4 Iris-setosa 16 5.1 3.5 NaN 0.3 Iris-setosa 17 5.7 3.8 NaN 0.3 Iris-setosa 18 5.1 3.8 NaN 0.3 Iris-setosa 19 5.4 3.4 NaN 0.2 Iris-setosa 20 5.1 3.7 1.5 0.4 Iris-setosa
iris.iloc[10:20, 2:3] = np.nan  # equivalent positional indexing; petal_length is column 2, not column 1
10.5 Replace all missing values with 1.0
iris.fillna(1, inplace=True)
iris.head(21)
sepal_length sepal_width petal_length petal_width class 0 4.9 3.0 1.4 0.2 Iris-setosa 1 4.7 3.2 1.3 0.2 Iris-setosa 2 4.6 3.1 1.5 0.2 Iris-setosa 3 5.0 3.6 1.4 0.2 Iris-setosa 4 5.4 3.9 1.7 0.4 Iris-setosa 5 4.6 3.4 1.4 0.3 Iris-setosa 6 5.0 3.4 1.5 0.2 Iris-setosa 7 4.4 2.9 1.4 0.2 Iris-setosa 8 4.9 3.1 1.5 0.1 Iris-setosa 9 5.4 3.7 1.5 0.2 Iris-setosa 10 4.8 3.4 1.0 0.2 Iris-setosa 11 4.8 3.0 1.0 0.1 Iris-setosa 12 4.3 3.0 1.0 0.1 Iris-setosa 13 5.8 4.0 1.0 0.2 Iris-setosa 14 5.7 4.4 1.0 0.4 Iris-setosa 15 5.4 3.9 1.0 0.4 Iris-setosa 16 5.1 3.5 1.0 0.3 Iris-setosa 17 5.7 3.8 1.0 0.3 Iris-setosa 18 5.1 3.8 1.0 0.3 Iris-setosa 19 5.4 3.4 1.0 0.2 Iris-setosa 20 5.1 3.7 1.5 0.4 Iris-setosa
10.6 Delete the class column
del iris['class']
iris.head()
sepal_length sepal_width petal_length petal_width 0 4.9 3.0 1.4 0.2 1 4.7 3.2 1.3 0.2 2 4.6 3.1 1.5 0.2 3 5.0 3.6 1.4 0.2 4 5.4 3.9 1.7 0.4
10.7 Set the first three rows to missing
iris.iloc[0:3, :] = np.nan
iris.head()
sepal_length sepal_width petal_length petal_width 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 5.0 3.6 1.4 0.2 4 5.4 3.9 1.7 0.4
10.8 Drop the rows that contain missing values
iris.dropna(inplace=True)
iris.head()
sepal_length sepal_width petal_length petal_width 3 5.0 3.6 1.4 0.2 4 5.4 3.9 1.7 0.4 5 4.6 3.4 1.4 0.3 6 5.0 3.4 1.5 0.2 7 4.4 2.9 1.4 0.2
10.9 Reset the index
iris.reset_index(inplace=True, drop=True)
iris.head()
sepal_length sepal_width petal_length petal_width 0 5.0 3.6 1.4 0.2 1 5.4 3.9 1.7 0.4 2 4.6 3.4 1.4 0.3 3 5.0 3.4 1.5 0.2 4 4.4 2.9 1.4 0.2