Table of Contents
1. Creating data tables
Reading CSV files
pd.read_csv('path/filename.csv',sep=',',index_col=None,header=1,encoding='utf-8',parse_dates=['date'],na_values=['N/A'],dtype={'column1':str})
#sep: field separator
#index_col: which column to use as the row index
#header: which row to take the column labels from, default is row 0
#parse_dates: columns that contain dates are parsed into datetime format
#na_values: extra strings to recognize as missing values
#dtype: force a column to a given type
pd.read_csv('data/bikes.csv', sep=';', parse_dates=['Date'],encoding='latin1',header=0, dayfirst=True, index_col='Date')
(filename, sep: separator, parse_dates: date columns to parse, encoding: file encoding, header: row used for the column labels, dayfirst: interpret dates as day-first (DD/MM/YYYY), index_col: column used as the row index)
The default encoding is 'utf-8'; this file needs encoding='latin1'
This file uses ';' as its separator, so sep=';' must be passed; read_csv then returns a DataFrame
In [125]: pd.read_csv('bikes.csv',encoding='latin1',sep=';').head(5)
Out[125]:
Date ... St-Urbain (données non disponibles)
0 01/01/2012 ... NaN
1 02/01/2012 ... NaN
2 03/01/2012 ... NaN
3 04/01/2012 ... NaN
4 05/01/2012 ... NaN
[5 rows x 10 columns]
Constructing a DataFrame
1.2 Creating a DataFrame from a dict
pd.DataFrame(dict,index=[...],columns=[...])#dict is a dictionary whose values are 1-D arrays or lists
1.3 Creating a DataFrame from a list or array
pd.DataFrame(array_like,index=[...],columns=[...])
1.4 Creating a Series from a dict, list, or 1-D array
pd.Series(data,index=[...])#data can be a dict, a list, or a 1-D array
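A minimal sketch tying the three constructors above together; the names, labels, and values here are made up for illustration:
import numpy as np
import pandas as pd
# 1.2 from a dict: each key becomes a column name, each value a column
df1 = pd.DataFrame({'name': ['Ann', 'Bob', 'Cai'], 'score': [90, 85, 77]}, index=['a', 'b', 'c'])
# 1.3 from a 2-D array (or list of lists): pass index and columns explicitly
df2 = pd.DataFrame(np.arange(12).reshape(3, 4), index=['r1', 'r2', 'r3'], columns=list('ABCD'))
# 1.4 Series from a dict (keys become the index) or from a list/array plus an index
s1 = pd.Series({'a': 1, 'b': 2, 'c': 3})
s2 = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
print(df1, df2, s1, s2, sep='\n')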
2. Inspecting data and indexing
Viewing basic information about a DataFrame
df.head(), method, view the first few rows, first five by default
df.tail(), method, view the last few rows, last five by default
df.index, attribute, the row index of the DataFrame
df.columns, attribute, the column index of the DataFrame
df.values, attribute, the underlying values, returned as an array
df.shape, attribute, the shape of the DataFrame
df.dtypes, attribute, the data type of each column
df.info(), method, summary information (column types, non-null count per column, memory usage of the df, etc.)
In [125]: pd.read_csv('bikes.csv',encoding='latin1',sep=';').head(5)
Out[125]:
Date ... St-Urbain (données non disponibles)
0 01/01/2012 ... NaN
1 02/01/2012 ... NaN
2 03/01/2012 ... NaN
3 04/01/2012 ... NaN
4 05/01/2012 ... NaN
[5 rows x 10 columns]
In [126]: pd.read_csv('bikes.csv',encoding='latin1',sep=';').tail(5)
Out[126]:
Date ... St-Urbain (données non disponibles)
305 01/11/2012 ... NaN
306 02/11/2012 ... NaN
307 03/11/2012 ... NaN
308 04/11/2012 ... NaN
309 05/11/2012 ... NaN
[5 rows x 10 columns]
In [127]: pd.read_csv('bikes.csv',encoding='latin1',sep=';').index
Out[127]: RangeIndex(start=0, stop=310, step=1)
In [70]: df=pd.read_csv('bikes.csv',encoding='latin1',sep=';').head(5)
In [71]: df
Out[71]:
Date ... St-Urbain (données non disponibles)
0 01/01/2012 ... NaN
1 02/01/2012 ... NaN
2 03/01/2012 ... NaN
3 04/01/2012 ... NaN
4 05/01/2012 ... NaN
[5 rows x 10 columns]
In [72]: df.shape
Out[72]: (5, 10)
In [73]: df.values
Out[73]:
array([['01/01/2012', 35, nan, 0, 38, 51, 26, 10, 16, nan],
['02/01/2012', 83, nan, 1, 68, 153, 53, 6, 43, nan],
['03/01/2012', 135, nan, 2, 104, 248, 89, 3, 58, nan],
['04/01/2012', 144, nan, 1, 116, 318, 111, 8, 61, nan],
['05/01/2012', 197, nan, 2, 124, 330, 97, 13, 95, nan]],
dtype=object)
In [74]: df.columns
Out[74]:
Index(['Date', 'Berri 1', 'Brébeuf (données non disponibles)',
'Côte-Sainte-Catherine', 'Maisonneuve 1', 'Maisonneuve 2', 'du Parc',
'Pierre-Dupuy', 'Rachel1', 'St-Urbain (données non disponibles)'],
dtype='object')
In [75]: df.index
Out[75]: RangeIndex(start=0, stop=5, step=1)
In [76]: df.dtypes
Out[76]:
Date object
Berri 1 int64
Brébeuf (données non disponibles) float64
Côte-Sainte-Catherine int64
Maisonneuve 1 int64
Maisonneuve 2 int64
du Parc int64
Pierre-Dupuy int64
Rachel1 int64
St-Urbain (données non disponibles) float64
dtype: object
In [77]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Date 5 non-null object
Berri 1 5 non-null int64
Brébeuf (données non disponibles) 0 non-null float64
Côte-Sainte-Catherine 5 non-null int64
Maisonneuve 1 5 non-null int64
Maisonneuve 2 5 non-null int64
du Parc 5 non-null int64
Pierre-Dupuy 5 non-null int64
Rachel1 5 non-null int64
St-Urbain (données non disponibles) 0 non-null float64
dtypes: float64(2), int64(7), object(1)
memory usage: 480.0+ bytes
Indexing
Selecting one or more columns
df['column1']#view the column named 'column1'
df[['col1','col2']]#select several columns
In [12]: df
Out[12]:
o p q r s
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
In [13]: df['o']
Out[13]:
a 0
b 5
c 10
d 15
Name: o, dtype: int64
In [14]: df[['o','p']]
Out[14]:
o p
a 0 1
b 5 6
c 10 11
d 15 16
Selecting one or more rows
loc/iloc[rows, columns]
df.loc[['index1','index2'],['col1','col2']]#select by index labels
df.iloc[[0,3,5],[2,4,1]]#select by integer positions
In [16]: df.loc['a']
Out[16]:
o 0
p 1
q 2
r 3
s 4
Name: a, dtype: int64
In [17]: df.loc['a','o']
Out[17]: 0
In [18]: df.loc[['a','b'],['o','p']]
Out[18]:
o p
a 0 1
b 5 6
In [19]: df.iloc[1]
Out[19]:
o 5
p 6
q 7
r 8
s 9
Name: b, dtype: int64
In [20]: df.iloc[[1],[1]]
Out[20]:
p
b 6
3.3 Boolean indexing
df[bool_arr]#selects rows; bool_arr is a boolean array or Series whose length equals the number of rows
df.loc[bool_arr,['col1']]#boolean indexing applies to the rows
In [27]: df
Out[27]:
o p q r s
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
In [28]: df[[True,False,True,False]]#needs a boolean sequence with the same length as the number of rows
Out[28]:
o p q r s
a 0 1 2 3 4
c 10 11 12 13 14
In [34]: df.loc[[True,False,True,False],['o','q']]#used together with loc
Out[34]:
o q
a 0 2
c 10 12
4. value_counts(): only usable on column objects, i.e. frequency counts for a Series
df['col1'].value_counts()#frequency counts, returned in descending order; not applicable to a DataFrame
In [39]: df['o'].value_counts()
Out[39]:
15 1
5 1
10 1
0 1
Name: o, dtype: int64
4.1 isna() and notna(): checks for missing values, usable on both column (Series) and DataFrame objects
df['col1'].isna()#for each position in 'col1', returns True if the value is missing, otherwise False
df['col1'].notna()#for each position in 'col1', returns False if the value is missing, otherwise True
In [40]: df.isna()
Out[40]:
o p q r s
a False False False False False
b False False False False False
c False False False False False
d False False False False False
In [42]: df['o'].notna()
Out[42]:
a True
b True
c True
d True
Name: o, dtype: bool
df['col1'].unique(): the unique values of a column (a Series method)
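A quick illustration of unique() (and the related nunique()); the values are made up:
import pandas as pd
ser = pd.Series(['a', 'b', 'a', 'c', 'b'])
ser.unique()   # array(['a', 'b', 'c'], dtype=object) -- distinct values in order of appearance
ser.nunique()  # 3 -- the number of distinct values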
3. Data visualization
Visualization: plotting with matplotlib.pyplot
import matplotlib.pyplot as plt
plt.plot(x,y,style)#line plot; style combines the color and the line style
plt.bar(x,y,color=['r','g'],alpha=1)#bar chart; alpha is the transparency
plt.hist(x,bins=30,density=True)#histogram; bins is how many intervals x is split into, density plots frequencies instead of counts
plt.pie(x,labels,autopct='%.2f%%')#pie chart; labels are the text labels, autopct formats the numeric labels
#subplots
fig = plt.figure(figsize=(10,8))#figsize sets the size of the figure
ax=fig.subplots(2,2)#number of subplots and their layout
ax[0][0].plot()#draw a line plot on the first subplot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(8,8))#declare the figure size
((ax1, ax2), (ax3, ax4)) = fig.subplots(2, 2)#split the figure into a 2x2 grid
ax1.hist(np.random.rand(100),bins=20,color='r',alpha=1)#histogram
ax2.scatter(np.arange(30),np.arange(30))#scatter plot
ax3.plot(np.random.randn(50).cumsum(),'k--')#line plot
labels = ['Frogs', 'Hogs', 'Dogs', 'Logs']
fracs = [15, 30, 45, 10]
explode = [0.00, 0.00, 0.05, 0]
ax4.pie(fracs, explode=explode, labels=labels, autopct='%.0f%%')#pie chart
plt.savefig('图片/MathPlot12.png',dpi=400,bbox_inches='tight')#save the figure
plt.show()#needed to display the figure when running from a terminal
The generated figure (image omitted).
plt.title('title',fontsize=20)#title
plt.xlabel('label',fontsize=15)#x-axis label
plt.xlim(0,4)#restrict the x-axis display range to [0,4]
plt.xticks(x,label,rotation=0)#change the tick positions and/or tick labels
plt.text(x,y,'text',fontsize=10)#insert text at point (x,y) in the figure
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False
fig1 = plt.figure(figsize=(10,5))#new figure, set the figure size
t=np.arange(0.0,4.0,0.01)
s=np.sin(2*np.pi*t)
plt.plot(t,s)
plt.title('learn',fontsize=25)
plt.text(1,0,r'sin(2*pi*x)',fontsize=15)#text position, text content, font size
plt.text(0.4, 0.6, r'$\mathcal{A}\mathrm{sin}(2 \omega t)$',fontsize=15)#text with TeX markup
plt.xlabel('value(x)',fontsize=20)#x-axis label
plt.ylabel('value(y)',fontsize=20)#y-axis label
plt.grid(True)#add a grid
plt.xlim(0,4)#x-axis range
plt.ylim(-1,1)#y-axis range
plt.savefig('learn',dpi=400,bbox_inches='tight')#save the figure, png format by default
The generated figure (image omitted).
Visualization: plotting with pandas
df = pd.DataFrame(np.random.randn(10,4).cumsum(0),columns=list('ABCD'),index = np.arange(0,100,10))
print(df)
df.plot(grid=True,style=['r-','g-.','b--','k:'],figsize=(8,5))#draw a line plot
Generated figure (image omitted): line plot
fig2=plt.figure(figsize=(10,8))
ax= fig2.subplots(2, 1)
data = pd.Series(np.random.rand(16),index = list('abcdefghijklmnop'))
data.plot(kind='bar',ax=ax[0],color='r',alpha=0.7,title='bar chart')#vertical bars
data.plot(kind='barh',ax=ax[1],color='b',alpha=0.7)#horizontal bars
Generated figure (image omitted): bar charts
fig4 = plt.figure(figsize=(10,8))
ax=fig4.subplots(1,2)
df = pd.DataFrame(np.random.rand(6,4),index=['one','two','three','four','five','six'],columns = list('ABCD'))
print(df)
df.plot(kind='barh',ax=ax[0])
df.plot(kind='bar',ax=ax[1],stacked=True)#draw a stacked bar chart
ax[1].set_xticklabels(labels=list('abcdef'),rotation=30,fontsize=20)#set the x-axis tick labels, rotation, and font size
Generated figure (image omitted): stacked bar chart
comp1 = np.random.normal(0,1,size=200)
comp2 = np.random.normal(10,2,size=200)
values = pd.Series(np.concatenate([comp1,comp2]))
values.hist(bins=200,color='g',density=True)
#values.plot(kind='kde',style='b--')
plt.grid(True)
Generated figure (image omitted): histogram and density plot
Plotting example
import matplotlib.pyplot as plt #load the matplotlib.pyplot module
import numpy as np
import pandas as pd
fig = plt.figure(figsize=(8,5))#figure
complaints = pd.read_csv('311-service-requests.csv')#read the csv data
complaints['Complaint Type'].value_counts().head().plot(kind='bar')#count and plot
data=complaints['Complaint Type'].value_counts().head()#keep the top-5 counts
print(data)
plt.xticks(range(len(data.index)),['HEATING','CONSTRUCTION','LIGHT CONDITION','PLUMBING','DOF REQUEST'],rotation=30,fontsize=10)#change the tick positions and labels
x=range(len(data.index))
y=list(data.values)
print(y)
#add the count above each bar as a text label
for a,b in zip(x,y):
    plt.text(a, b+100,str(b), ha='center', va= 'bottom',fontsize=12)
plt.title('Complain Type Top5',fontsize=20)
plt.ylim(0,15000)
plt.savefig('图片/Image.png',dpi=300,bbox_inches='tight')
plt.show()
Generated figure (image omitted): bar chart
4. Data cleaning
Handling missing values
dropna: drops rows containing missing values, returns a new object
In [6]: df
Out[6]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
two 1002 2013-01-03 NaN 100-B 44 NaN
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 5433.0
five 1005 2013-01-06 shanghai 210-A 34 NaN
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
In [7]: df.dropna()
Out[7]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 5433.0
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
In [8]: df
Out[8]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
two 1002 2013-01-03 NaN 100-B 44 NaN
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 5433.0
five 1005 2013-01-06 shanghai 210-A 34 NaN
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
df.dropna(how='all'): drop a row only if all of its values are missing; how is one of {'any', 'all'}, default 'any'
In [9]: df.dropna(how='all')
Out[9]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
two 1002 2013-01-03 NaN 100-B 44 NaN
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 5433.0
five 1005 2013-01-06 shanghai 210-A 34 NaN
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
df=df.dropna() or df.dropna(how='all',inplace=True) modify the source data; inplace: bool, default False
In [10]: df.dropna(inplace=False)
Out[10]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 5433.0
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
In [11]: df
Out[11]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
two 1002 2013-01-03 NaN 100-B 44 NaN
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 5433.0
five 1005 2013-01-06 shanghai 210-A 34 NaN
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
In [12]: df.dropna(inplace=True)
In [13]: df
Out[13]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 5433.0
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
df.fillna(300): returns new data and does not modify the source
In [17]: df
Out[17]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 NaN
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
In [18]: df.fillna(110)
Out[18]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 110.0
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
df['column'].fillna(300): fill a single column
df.fillna({'price':300})
In [19]: df
Out[19]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 NaN
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
In [20]: df['price'].fillna(110)
Out[20]:
one 1200.0
three 2133.0
four 110.0
six 4432.0
Name: price, dtype: float64
In [22]: df.fillna({'price':110})
Out[22]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 1200.0
three 1003 2013-01-04 guangzhou 110-A 54 2133.0
four 1004 2013-01-05 Shenzhen 110-C 32 110.0
six 1006 2013-01-07 BEIJING 130-F 32 4432.0
method parameter: one of {'backfill', 'bfill', 'pad', 'ffill', None}, default None
backfill / bfill: backward fill (propagate the next valid value backwards)
ffill / pad: forward fill (propagate the last valid value forwards)
df.fillna(method='ffill',limit=1): fill at most one consecutive NaN
In [30]: df
Out[30]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 NaN
two 1002 2013-01-03 SH 100-B 44 NaN
three 1003 2013-01-04 guangzhou 110-A 54 NaN
four 1004 2013-01-05 Shenzhen 110-C 32 NaN
five 1005 2013-01-06 shanghai 210-A 34 NaN
six 1006 2013-01-07 BEIJING 130-F 32 2233.0
In [31]: df.fillna(method='bfill',limit=2)
Out[31]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 NaN
two 1002 2013-01-03 SH 100-B 44 NaN
three 1003 2013-01-04 guangzhou 110-A 54 NaN
four 1004 2013-01-05 Shenzhen 110-C 32 2233.0
five 1005 2013-01-06 shanghai 210-A 34 2233.0
six 1006 2013-01-07 BEIJING 130-F 32 2233.0
df['column'].replace(old_value, new_value): replace values in a column
In [35]: df
Out[35]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 NaN
two 1002 2013-01-03 SH 100-B 44 NaN
three 1003 2013-01-04 guangzhou 110-A 54 NaN
four 1004 2013-01-05 Shenzhen 110-C 32 NaN
five 1005 2013-01-06 shanghai 210-A 34 NaN
six 1006 2013-01-07 BEIJING 130-F 32 2233.0
In [36]: df['price'].replace(np.nan,222)
Out[36]:
one 222.0
two 222.0
three 222.0
four 222.0
five 222.0
six 2233.0
Name: price, dtype: float64
String handling
These methods operate on Series objects, i.e. DataFrame columns (a DataFrame itself has no .str accessor: 'DataFrame' object has no attribute 'str'); they return new data
ser.str.upper()
ser.str.lower()
In [38]: df['city'].str.upper()
Out[38]:
one BEIJING
two SH
three GUANGZHOU
four SHENZHEN
five SHANGHAI
six BEIJING
Name: city, dtype: object
In [39]: df
Out[39]:
id date city category age price
one 1001 2013-01-02 Beijing 100-A 23 NaN
two 1002 2013-01-03 SH 100-B 44 NaN
three 1003 2013-01-04 guangzhou 110-A 54 NaN
four 1004 2013-01-05 Shenzhen 110-C 32 NaN
five 1005 2013-01-06 shanghai 210-A 34 NaN
six 1006 2013-01-07 BEIJING 130-F 32 2233.0
ser.str.len(): the length of each string
len(ser): the length of the Series itself
In [43]: df['city'].str.len()
Out[43]:
one 8
two 2
three 11
four 8
five 8
six 8
Name: city, dtype: int64
In [44]: len(df['city'])
Out[44]: 6
ser.str.strip(): strip whitespace from both ends
df['price'].replace(1200,100009)
ser.str.lower().str.replace('a','b')#string methods can be chained
ser.str.split('_'): split strings, returns new data
ser.str.join('_'): join the split pieces back into strings, returns new data -- the inverse of split
In [51]: ser=pd.Series(['a-b-c','c-d-e'])
In [52]: ser.str.split('-')
Out[52]:
0 [a, b, c]
1 [c, d, e]
dtype: object
In [53]: ser
Out[53]:
0 a-b-c
1 c-d-e
dtype: object
In [54]: ser.str.split('-').str.join('+++')
Out[54]:
0 a+++b+++c
1 c+++d+++e
dtype: object
ser.str.contains('a',na=True): whether each string contains 'a'; na=True makes missing values return True
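A short illustration of the na argument; the values are made up:
import pandas as pd
ser = pd.Series(['apple', 'banana', None, 'cherry'])
ser.str.contains('an')            # False, True, NaN, False -- missing values stay NaN by default
ser.str.contains('an', na=True)   # missing values counted as True
ser.str.contains('an', na=False)  # missing values counted as False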
Data transformation: type conversion and outlier handling
Duplicates
data.duplicated(): flags whether each row duplicates an earlier row (the whole row, not individual elements)
data.drop_duplicates(keep='last'): drop duplicate rows
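A minimal sketch of duplicated() and drop_duplicates() with made-up data:
import pandas as pd
data = pd.DataFrame({'k1': ['one', 'one', 'two', 'two'], 'k2': [1, 1, 2, 3]})
data.duplicated()                    # True where the whole row repeats an earlier row
data.drop_duplicates()               # keep the first occurrence of each duplicated row
data.drop_duplicates(keep='last')    # keep the last occurrence instead
data.drop_duplicates(subset=['k1'])  # judge duplicates by column 'k1' only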
Renaming indexes
data.index=data.index.str.upper(): rename the index; the expression returns new labels, not a view, so assignment is needed
data.rename(index=str.lower,columns=str.upper): rename row and column labels
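A minimal sketch of renaming; the labels are made up:
import numpy as np
import pandas as pd
data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    index=['ohio', 'colorado', 'new york'],
                    columns=['one', 'two', 'three', 'four'])
data.index = data.index.str.upper()                   # returns new labels, so assign them back
data.rename(index=str.lower, columns=str.upper)       # returns a renamed copy, source unchanged
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)  # a dict renames selected labels; inplace modifies data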
Discretization
Continuous data often has to be discretized, i.e. split into bins; use cut and qcut (bins are open on the left, closed on the right by default)
pd.cut(data,bins=4).value_counts(): equal-width binning
pd.cut(data,bins=[20,40,50]).value_counts(): explicit bin edges
pd.cut(data,bins=4,right=False): the default is (left, right]; right=False switches to [left, right)
qcut bins by quantiles, giving bins with roughly equal counts
pd.qcut(data,q=5)
pd.qcut(data,q=[0,0.05,0.2,0.5,0.75,1]): q can also be a list of quantile edges, each value between 0 and 1
labels: parameter of cut/qcut that gives the bins custom names (see the sketch below)
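A minimal sketch of cut and qcut with made-up ages:
import pandas as pd
ages = pd.Series([21, 25, 33, 38, 45, 52, 61, 70])
pd.cut(ages, bins=4).value_counts()                   # 4 equal-width bins, (left, right] by default
pd.cut(ages, bins=[20, 40, 60, 80]).value_counts()    # explicit bin edges
pd.cut(ages, bins=4, right=False)                     # [left, right) intervals instead
pd.cut(ages, bins=3, labels=['young', 'mid', 'old'])  # name the bins via labels
pd.qcut(ages, q=4).value_counts()                     # quantile bins with roughly equal counts
pd.qcut(ages, q=[0, 0.1, 0.5, 0.9, 1.0])              # explicit quantile edges, all within [0, 1]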
data[~bool_arr]: negate a boolean mask
iloc does not take boolean indexing, loc does
string operations must go through the .str accessor
Continuous values: fill (e.g. with the mean) or interpolate (with a model)
Discrete values: add an extra category for missing
Variable: one column of data
Outliers
Chebyshev's theorem: at least 1 - 1/k² of the data lies within k standard deviations of the mean
Standardization
Centering (subtracting the mean)
Box plot
5. Data wrangling
Dropping data
dataframe.drop(): drop entries along a given axis; returns new data, does not change the source
In [12]: data
Out[12]:
one two three four
Shenzhen 0 1 2 3
Guangzhou 4 5 6 7
Beijing 8 9 10 11
Shanghai 12 13 14 15
In [13]: data.drop(['Shenzhen','Guangzhou'])
Out[13]:
one two three four
Beijing 8 9 10 11
Shanghai 12 13 14 15
In [14]: data
Out[14]:
one two three four
Shenzhen 0 1 2 3
Guangzhou 4 5 6 7
Beijing 8 9 10 11
Shanghai 12 13 14 15
In [15]: data.drop(['one'],axis=1)
Out[15]:
two three four
Shenzhen 1 2 3
Guangzhou 5 6 7
Beijing 9 10 11
Shanghai 13 14 15
Function application and mapping
dataframe.apply(afunc): applies afunc column by column
dataframe.applymap(afunc): applies afunc element-wise and produces a new DataFrame of the same shape
series.map(func): element-wise on a Series object; apply can also be used on a Series and then behaves like map (a sketch follows)
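A minimal sketch of apply, applymap, and map with made-up data (newer pandas versions also offer DataFrame.map as a replacement for applymap):
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'))
frame.apply(lambda x: x.max() - x.min())          # column-wise: one value per column
frame.apply(lambda x: x.max() - x.min(), axis=1)  # row-wise instead
frame.applymap(lambda x: round(x, 2))             # element-wise over the whole DataFrame
frame['b'].map(lambda x: round(x, 2))             # element-wise over one column (a Series)
frame['b'].apply(lambda x: round(x, 2))           # apply on a Series behaves like map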
Sorting and ranking
To sort by row or column index, use sort_index(); it returns a new, already-sorted object
To sort by the values of specified columns, use sort_values()
sort_index(): sort by index
sort_values(): sort by values; you must specify which column to sort by
sort_values([col1,col2]): sort primarily by the first column; rows with equal values are then sorted by the second column (see the sketch below)
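A minimal sketch with made-up data, including rank() since the heading mentions ranking:
import pandas as pd
df = pd.DataFrame({'a': [3, 1, 2, 1], 'b': [9, 7, 8, 6]}, index=['d', 'b', 'a', 'c'])
df.sort_index()                         # sort rows by the row index
df.sort_index(axis=1, ascending=False)  # sort columns by the column labels
df.sort_values(by='a')                  # sort rows by the values in column 'a'
df.sort_values(by=['a', 'b'])           # 'a' first; ties broken by 'b'
df['a'].rank()                          # ranks; ties get the average rank by default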
Merging data
merge(): merge data sets on keys
inner: inner join, keep only keys present in both
outer: outer join, keep all keys from both sides
left: left join, keep all keys from the left frame
right: right join, keep all keys from the right frame
concat(): concatenation
columns with the same name are aligned; columns that differ are added as new columns
combine_first(): patch missing values in one object with values from another (a sketch follows)
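A minimal sketch of merge, concat, and combine_first with made-up frames:
import numpy as np
import pandas as pd
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [4, 5, 6]})
pd.merge(left, right, on='key', how='inner')  # only keys present in both
pd.merge(left, right, on='key', how='outer')  # all keys; the missing side becomes NaN
pd.merge(left, right, on='key', how='left')   # all keys from the left frame
pd.concat([left, right], sort=False)          # stack frames; non-shared columns become NaN
a = pd.DataFrame({'x': [1, np.nan, 3], 'y': [np.nan, 5, 6]})
b = pd.DataFrame({'x': [10, 20, 30], 'y': [40, 50, 60]})
a.combine_first(b)                            # NaN positions in a are patched with values from b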
Reshaping data
Hierarchical indexing provides a consistent way to rearrange DataFrame data. There are two main operations (a short sketch follows the list):
- stack(): 'rotates' the columns of the data into rows;
- unstack(): 'rotates' the rows of the data into columns;
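A minimal sketch of stack/unstack with made-up data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=['Beijing', 'Shanghai'], columns=['one', 'two', 'three'])
stacked = df.stack()  # columns become an inner row level -> a Series with a MultiIndex
stacked.unstack()     # the innermost row level goes back into columns
stacked.unstack(0)    # pivot the outer level (the city names) into columns instead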
6. Computation and statistics
Arithmetic operations
Addition: add
df.add(other,level,axis,fill_value)
When adding, positions that do not overlap produce NA values. If, for the non-overlapping part, we want to keep the value from one of the DataFrames, we can specify a fill value:
In [16]: df1 = pd.DataFrame(np.arange(1,13).reshape(3,4),columns=list('abcd'))
...: df2 = pd.DataFrame(np.arange(1,21).reshape(4,5),columns=list('abcde'))
In [17]: df1.add(df2)
Out[17]:
a b c d e
0 2.0 4.0 6.0 8.0 NaN
1 11.0 13.0 15.0 17.0 NaN
2 20.0 22.0 24.0 26.0 NaN
3 NaN NaN NaN NaN NaN
In [19]: df1.add(df2,fill_value=0)
Out[19]:
a b c d e
0 2.0 4.0 6.0 8.0 5.0
1 11.0 13.0 15.0 17.0 10.0
2 20.0 22.0 24.0 26.0 15.0
3 16.0 17.0 18.0 19.0 20.0
Subtraction: sub
df.sub(other,level,axis,fill_value)
Multiplication: mul
df.mul(other,level,axis,fill_value)
Division: div
df.div(other,level,axis,fill_value)
Using +, -, *, / directly loses data where the indexes do not overlap, since there is no way to fill the NA values
Numeric: 1, 2, 3, 4, ... etc.
Discrete/categorical: 1, 2, c, f, s, ... etc.
Descriptive statistics
Reductions: sum, mean, median, max, min, etc.
In [21]: df
Out[21]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [22]: df.sum()
Out[22]:
one 9.25
two -5.80
dtype: float64
In [23]: df.sum(axis=1)
Out[23]:
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
In [24]: df.sum(axis=1,skipna=False)
Out[24]:
a NaN
b 2.60
c NaN
d -0.55
dtype: float64
Correlation and covariance
The pairwise forms data1.corr(data2) and data1.cov(data2) are Series methods; they are not called this way on a DataFrame
Correlation: data1.corr(data2), whether two sets of data are related
Variance: data1.var(), the average squared deviation of one set of data from its mean
Covariance: data1.cov(data2), the average of the products of the two series' deviations from their respective means
Standard deviation: data1.std()
In [33]: df1=pd.DataFrame(np.arange(12).reshape((3,4)))
In [34]: df2=pd.DataFrame(np.arange(12).reshape((3,4)))
#相关系数[-1,1]
In [36]: df1[1].corr(df2[1])
Out[36]: 1.0
In [37]: df1[1].corr(df2[2])
Out[37]: 1.0
In [38]: df1[1].corr(df2[3])
Out[38]: 1.0
#协方差
In [51]: df1[1].cov(df2[1])
Out[51]: 16.0
#var方差
In [46]: df2.var()
Out[46]:
0 16.0
1 16.0
2 16.0
3 16.0
dtype: float64
In [47]: df2[1].var()
Out[47]: 16.0
#标准差
In [52]: df1.std()
Out[52]:
0 4.0
1 4.0
2 4.0
3 4.0
dtype: float64
A correlation coefficient of 1 means perfect positive correlation; -1 means perfect negative correlation, i.e. a perfectly linear relationship
Variance: a series with itself (x, x)
Covariance: two series (x, y)
What is linear correlation good for? e.g. the correlation coefficient between height and weight
In [25]: df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006],
...: "date":pd.date_range('20130102', periods=6),
...: "city":['Beijing ', 'ShangHai', ' guangzhou ', 'Shenzhen', 'shanghai'
...: , 'BEIJING '],
...: "age":[23,44,54,32,34,32],
...: "category":['100-A','100-B','110-A','110-C','210-A','130-F'],
...: "price":[1200,np.nan,2133,5433,np.nan,4432]},
...: columns =['id','date','city','category','age','price'])
...:
In [26]: df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008],
...: "gender":['male','female','male','female','male','female','male','femal
...: e'],
...: "pay":['Y','N','Y','Y','N','Y','N','Y',],
...: "m-point":[10,12,20,40,40,40,30,20]})
In [27]:
In [27]: df
Out[27]:
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 ShangHai 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
In [28]: df1
Out[28]:
id gender pay m-point
0 1001 male Y 10
1 1002 female N 12
2 1003 male Y 20
3 1004 female Y 40
4 1005 male N 40
5 1006 female Y 40
6 1007 male N 30
7 1008 female Y 20
In [29]: df_inner=pd.merge(df,df1,how='inner')
In [30]: df_inner
Out[30]:
id date city category ... price gender pay m-point
0 1001 2013-01-02 Beijing 100-A ... 1200.0 male Y 10
1 1002 2013-01-03 ShangHai 100-B ... NaN female N 12
2 1003 2013-01-04 guangzhou 110-A ... 2133.0 male Y 20
3 1004 2013-01-05 Shenzhen 110-C ... 5433.0 female Y 40
4 1005 2013-01-06 shanghai 210-A ... NaN male N 40
5 1006 2013-01-07 BEIJING 130-F ... 4432.0 female Y 40
[6 rows x 9 columns]
In [32]: df_inner['price'].cov(df_inner['m-point'])
Out[32]: 28771.666666666664
7. Data summarization
Aggregation
df.groupby(['column1','column2'])[column].ufunc()
In [53]: df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006],
...: "date":pd.date_range('20130102', periods=6),
...: "city":['Beijing ', 'ShangHai', ' guangzhou ', 'Shenzhen', 'shanghai'
...: , 'BEIJING '],
...: "age":[23,44,54,32,34,32],
...: "category":['100-A','100-B','110-A','110-C','210-A','130-F'],
...: "price":[1200,np.nan,2133,5433,np.nan,4432]},
...: columns =['id','date','city','category','age','price'])
...:
In [54]: df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008],
...: "gender":['male','female','male','female','male','female','male','femal
...: e'],
...: "pay":['Y','N','Y','Y','N','Y','N','Y',],
...: "m-point":[10,12,20,40,40,40,30,20]})
In [55]: df_inner=pd.merge(df,df1,how='inner')
#aggregate all other columns, grouped by a single column
In [56]: df_inner['city']=df_inner['city'].str.title()#capitalize the first letter of each word
In [63]: df_inner
Out[63]:
id date city category ... price gender pay m-point
0 1001 2013-01-02 Beijing 100-A ... 1200.0 male Y 10
1 1002 2013-01-03 Shanghai 100-B ... NaN female N 12
2 1003 2013-01-04 Guangzhou 110-A ... 2133.0 male Y 20
3 1004 2013-01-05 Shenzhen 110-C ... 5433.0 female Y 40
4 1005 2013-01-06 Shanghai 210-A ... NaN male N 40
5 1006 2013-01-07 Beijing 130-F ... 4432.0 female Y 40
[6 rows x 9 columns]
In [64]: df_inner.groupby('city').count()
Out[64]:
id date category age price gender pay m-point
city
Guangzhou 1 1 1 1 1 1 1 1
Beijing 2 2 2 2 2 2 2 2
Shanghai 2 2 2 2 0 2 2 2
Shenzhen 1 1 1 1 1 1 1 1
In [65]: df_inner.groupby('city')['price'].sum()
Out[65]:
city
Guangzhou 2133.0
Beijing 5632.0
Shanghai 0.0
Shenzhen 5433.0
Name: price, dtype: float64
In [66]: df_inner['city']=df_inner['city'].str.strip()
In [67]: df_inner.groupby('city')['price'].sum()
Out[67]:
city
Beijing 5632.0
Guangzhou 2133.0
Shanghai 0.0
Shenzhen 5433.0
Name: price, dtype: float64
#aggregate other columns, grouped by multiple columns
In [68]: df_inner.groupby(['city','pay'])['id'].count().unstack(0)
Out[68]:
city Beijing Guangzhou Shanghai Shenzhen
pay
N NaN NaN 2.0 NaN
Y 2.0 1.0 NaN 1.0
Applying multiple aggregation functions
In [69]: df_inner.groupby('city')['price'].agg([len,np.sum,np.mean])
Out[69]:
len sum mean
city
Beijing 2.0 5632.0 2816.0
Guangzhou 1.0 2133.0 2133.0
Shanghai 2.0 0.0 NaN
Shenzhen 1.0 5433.0 5433.0
apply(): the general "split-apply-combine" pattern
tips = pd.read_csv('data/tips.csv')
tips
tips['tip_pac']=tips['tip']/tips['total_bill']#add a column: tip as a percentage of the total bill
tips.head()
#find the top N
def top(df,n=5,column='tip_pac'):
    return df.sort_values(by=column)[-n:]
tips.groupby('smoker').apply(top,n=3,column='total_bill')
Pivot tables and crosstabs
pivot_table(): pivot table, computes the mean by default
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
crosstab(): cross-tabulation, a special pivot table that computes group frequencies
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
tips.pivot_table(index=['sex','smoker'])#computes group means
pd.crosstab(index=[tips.time,tips.day],columns=tips.smoker,margins=True)