透视表
引入相关库
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
数据获取
df=pd.read_excel('../homework/sales-funnel.xlsx')
df
Acount代表客户账户,Name代表客户名字,Rep代表销售代表名字,Manager代表销售代表老板名字,Product代表卖出的产品,Quantity代表产品质量,Price代表产品价格,Status代表状态
Account | Name | Rep | Manager | Product | Quantity | Price | Status | |
---|---|---|---|---|---|---|---|---|
0 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | CPU | 1 | 30000 | presented |
1 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Software | 1 | 10000 | presented |
2 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Maintenance | 2 | 5000 | pending |
3 | 737550 | Fritsch, Russel and Anderson | Craig Booker | Debra Henley | CPU | 1 | 35000 | declined |
4 | 146832 | Kiehn-Spinka | Daniel Hilton | Debra Henley | CPU | 2 | 65000 | won |
5 | 218895 | Kulas Inc | Daniel Hilton | Debra Henley | CPU | 2 | 40000 | pending |
6 | 218895 | Kulas Inc | Daniel Hilton | Debra Henley | Software | 1 | 10000 | presented |
7 | 412290 | Jerde-Hilpert | John Smith | Debra Henley | Maintenance | 2 | 5000 | pending |
8 | 740150 | Barton LLC | John Smith | Debra Henley | CPU | 1 | 35000 | declined |
9 | 141962 | Herman LLC | Cedric Moss | Fred Anderson | CPU | 2 | 65000 | won |
10 | 163416 | Purdy-Kunde | Cedric Moss | Fred Anderson | CPU | 1 | 30000 | presented |
11 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Maintenance | 1 | 5000 | pending |
12 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Software | 1 | 10000 | presented |
13 | 307599 | Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | Maintenance | 3 | 7000 | won |
14 | 688981 | Keeling LLC | Wendy Yule | Fred Anderson | CPU | 5 | 100000 | won |
15 | 729833 | Koepp Ltd | Wendy Yule | Fred Anderson | CPU | 2 | 65000 | declined |
16 | 729833 | Koepp Ltd | Wendy Yule | Fred Anderson | Monitor | 2 | 5000 | presented |
生成透视表
使用pivot_table方法生成透视表,这个函数第一个参数表示我们要对什么数据进行一个透视,index表示要对哪一个columns做一个透视
下面对‘name’这一column做透视表,生成了新的DataFrame,index使用原始表里面‘name’,且经过了去重处理,把同一‘name’的行使用取均值的方法(默认聚合方法)进行了合并(例如Quantiy:(1+1+2/3)=1.3)
pd.pivot_table(df,index=['Name'])
Account | Price | Quantity | |
---|---|---|---|
Name | |||
Barton LLC | 740150 | 35000 | 1.000000 |
Fritsch, Russel and Anderson | 737550 | 35000 | 1.000000 |
Herman LLC | 141962 | 65000 | 2.000000 |
Jerde-Hilpert | 412290 | 5000 | 2.000000 |
Kassulke, Ondricka and Metz | 307599 | 7000 | 3.000000 |
Keeling LLC | 688981 | 100000 | 5.000000 |
Kiehn-Spinka | 146832 | 65000 | 2.000000 |
Koepp Ltd | 729833 | 35000 | 2.000000 |
Kulas Inc | 218895 | 25000 | 1.500000 |
Purdy-Kunde | 163416 | 30000 | 1.000000 |
Stokes LLC | 239344 | 7500 | 1.000000 |
Trantow-Barrows | 714466 | 15000 | 1.333333 |
把聚合方法改为sum,Quntity:1+1+2=4
pd.pivot_table(df,index=['Name'],aggfunc='sum')
Account | Price | Quantity | |
---|---|---|---|
Name | |||
Barton LLC | 740150 | 35000 | 1 |
Fritsch, Russel and Anderson | 737550 | 35000 | 1 |
Herman LLC | 141962 | 65000 | 2 |
Jerde-Hilpert | 412290 | 5000 | 2 |
Kassulke, Ondricka and Metz | 307599 | 7000 | 3 |
Keeling LLC | 688981 | 100000 | 5 |
Kiehn-Spinka | 146832 | 65000 | 2 |
Koepp Ltd | 1459666 | 70000 | 4 |
Kulas Inc | 437790 | 50000 | 3 |
Purdy-Kunde | 163416 | 30000 | 1 |
Stokes LLC | 478688 | 15000 | 2 |
Trantow-Barrows | 2143398 | 45000 | 4 |
添加两个另外的index,‘Rep’和‘Manager’,新的透视表的index变成了三个
pd.pivot_table(df,index=['Name','Rep','Manager'])
Account | Price | Quantity | |||
---|---|---|---|---|---|
Name | Rep | Manager | |||
Barton LLC | John Smith | Debra Henley | 740150 | 35000 | 1.000000 |
Fritsch, Russel and Anderson | Craig Booker | Debra Henley | 737550 | 35000 | 1.000000 |
Herman LLC | Cedric Moss | Fred Anderson | 141962 | 65000 | 2.000000 |
Jerde-Hilpert | John Smith | Debra Henley | 412290 | 5000 | 2.000000 |
Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | 307599 | 7000 | 3.000000 |
Keeling LLC | Wendy Yule | Fred Anderson | 688981 | 100000 | 5.000000 |
Kiehn-Spinka | Daniel Hilton | Debra Henley | 146832 | 65000 | 2.000000 |
Koepp Ltd | Wendy Yule | Fred Anderson | 729833 | 35000 | 2.000000 |
Kulas Inc | Daniel Hilton | Debra Henley | 218895 | 25000 | 1.500000 |
Purdy-Kunde | Cedric Moss | Fred Anderson | 163416 | 30000 | 1.000000 |
Stokes LLC | Cedric Moss | Fred Anderson | 239344 | 7500 | 1.000000 |
Trantow-Barrows | Craig Booker | Debra Henley | 714466 | 15000 | 1.333333 |
把‘name’去掉,更换‘Rep’和‘Manager’顺序,可以看到一个Manager下面有多个Rep
pd.pivot_table(df,index=['Manager','Rep'])
Account | Price | Quantity | ||
---|---|---|---|---|
Manager | Rep | |||
Debra Henley | Craig Booker | 720237.0 | 20000.000000 | 1.250000 |
Daniel Hilton | 194874.0 | 38333.333333 | 1.666667 | |
John Smith | 576220.0 | 20000.000000 | 1.500000 | |
Fred Anderson | Cedric Moss | 196016.5 | 27500.000000 | 1.250000 |
Wendy Yule | 614061.5 | 44250.000000 | 3.000000 |
通过values参数来从‘Account Price Quantity‘指定哪个values,例下面只关心Price
pd.pivot_table(df,index=['Manager','Rep'],values=['Price'])
Price | ||
---|---|---|
Manager | Rep | |
Debra Henley | Craig Booker | 20000.000000 |
Daniel Hilton | 38333.333333 | |
John Smith | 20000.000000 | |
Fred Anderson | Cedric Moss | 27500.000000 |
Wendy Yule | 44250.000000 |
改变聚合方法变成求和
pd.pivot_table(df,index=['Manager','Rep'],values=['Price'],aggfunc='sum')
Price | ||
---|---|---|
Manager | Rep | |
Debra Henley | Craig Booker | 80000 |
Daniel Hilton | 115000 | |
John Smith | 40000 | |
Fred Anderson | Cedric Moss | 110000 |
Wendy Yule | 177000 |
还可以指定多个values
pd.pivot_table(df,index=['Manager','Rep'],values=['Price','Quantity'],aggfunc='sum')
Price | Quantity | ||
---|---|---|---|
Manager | Rep | ||
Debra Henley | Craig Booker | 80000 | 5 |
Daniel Hilton | 115000 | 5 | |
John Smith | 40000 | 3 | |
Fred Anderson | Cedric Moss | 110000 | 5 |
Wendy Yule | 177000 | 12 |
columns方法对于每一个values还可以看不同的columns,例如对于Price和Quantity都可以看product的’CPU Maintenance Monitor Software‘等参数
pd.pivot_table(df,index=['Manager','Rep'],values=['Price','Quantity'],columns=['Product'],aggfunc='sum')
Price | Quantity | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Rep | ||||||||
Debra Henley | Craig Booker | 65000.0 | 5000.0 | NaN | 10000.0 | 2.0 | 2.0 | NaN | 1.0 |
Daniel Hilton | 105000.0 | NaN | NaN | 10000.0 | 4.0 | NaN | NaN | 1.0 | |
John Smith | 35000.0 | 5000.0 | NaN | NaN | 1.0 | 2.0 | NaN | NaN | |
Fred Anderson | Cedric Moss | 95000.0 | 5000.0 | NaN | 10000.0 | 3.0 | 1.0 | NaN | 1.0 |
Wendy Yule | 165000.0 | 7000.0 | 5000.0 | NaN | 7.0 | 3.0 | 2.0 | NaN |
fill_value可以同于NaN数值的填充
pd.pivot_table(df,index=['Manager','Rep'],values=['Price','Quantity'],columns=['Product'],fill_value=0,aggfunc='sum')
Price | Quantity | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Rep | ||||||||
Debra Henley | Craig Booker | 65000 | 5000 | 0 | 10000 | 2 | 2 | 0 | 1 |
Daniel Hilton | 105000 | 0 | 0 | 10000 | 4 | 0 | 0 | 1 | |
John Smith | 35000 | 5000 | 0 | 0 | 1 | 2 | 0 | 0 | |
Fred Anderson | Cedric Moss | 95000 | 5000 | 0 | 10000 | 3 | 1 | 0 | 1 |
Wendy Yule | 165000 | 7000 | 5000 | 0 | 7 | 3 | 2 | 0 |