Pandas in Python

These notes walk through data processing with Python's pandas library: reading CSV files, constructing DataFrames, inspecting data, cleaning, visualization, wrangling, computation and statistics, and aggregation. They cover basic DataFrame operations, missing-value handling, data conversion, chart making, and more advanced techniques such as grouped aggregation and pivot tables.


Table of Contents

1. Generating data tables

Reading CSV files

Constructing a DataFrame

2. Inspecting and indexing data

Viewing basic DataFrame information

Indexing

3. Data visualization

Plotting with matplotlib.pyplot

Plotting with pandas

Plotting example

4. Data cleaning

Handling missing values

String handling

Data conversion: type conversion and outlier handling

Duplicates

Renaming indexes

Discretization

5. Data wrangling

Dropping data

Function application and mapping

Sorting and ranking

Merging data

Reshaping data

6. Computation and statistics

Arithmetic operations

Summary statistics

Correlation and covariance

7. Data aggregation

Grouped aggregation

Applying multiple aggregation functions

apply(): general "split-apply-combine"

Pivot tables and cross-tabulations


 

1. Generating data tables

Reading CSV files

pd.read_csv('path/filename.csv',sep=',',index_col=None,header=1,encoding='utf-8',parse_dates=['date'],na_values=['N/A'],dtype={'column1':str})

#sep: field separator
#index_col: which column to use as the row index
#header: which row to use as the column labels, default row 0
#parse_dates: columns holding dates/times are parsed into datetime format
#na_values: extra strings to recognize as NaN
#dtype: force the type of a given column

pd.read_csv('data/bikes.csv', sep=';',  parse_dates=['Date'],encoding='latin1',header=0, dayfirst=True, index_col='Date')

(filename; sep: separator; parse_dates: date parsing; encoding: character encoding; header: row used for column labels; dayfirst=True: parse dates day-first, e.g. DD/MM/YYYY; index_col: column used for the row index)

The default encoding is 'utf-8'; this file needs encoding='latin1'.

This CSV uses ';' as its separator, so sep=';' must be passed; read_csv then returns a DataFrame.

In [125]: pd.read_csv('bikes.csv',encoding='latin1',sep=';').head(5)
Out[125]: 
         Date                 ...                   St-Urbain (données non disponibles)
0  01/01/2012                 ...                                                   NaN
1  02/01/2012                 ...                                                   NaN
2  03/01/2012                 ...                                                   NaN
3  04/01/2012                 ...                                                   NaN
4  05/01/2012                 ...                                                   NaN

[5 rows x 10 columns]

Constructing a DataFrame

From a dict:
pd.DataFrame(dict,index=[...],columns=[...])#dict maps column names to one-dimensional arrays or lists
From a list or array:
pd.DataFrame(like_array,index=[...],columns=[...])
A Series from a dict, list, or one-dimensional array:
pd.Series(data,index=[...])#data can be a dict, a list, or a 1-D array
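A small self-contained sketch of these constructors (all column names and values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# From a dict: keys become column names, values are equal-length lists/arrays
df = pd.DataFrame({'city': ['Beijing', 'Shanghai'], 'pop': [2154, 2424]},
                  index=['a', 'b'])

# From a 2-D array: index and columns are supplied explicitly
df2 = pd.DataFrame(np.arange(6).reshape(2, 3),
                   index=['r1', 'r2'], columns=['x', 'y', 'z'])

# Series from a dict (keys become the index) or from a plain list
s = pd.Series({'u': 1, 'v': 2})
```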

2. Inspecting and indexing data

Viewing basic DataFrame information

df.head(): method, view the first few rows, default first five

df.tail(): method, view the last few rows, default last five

df.index: attribute, the row index of the DataFrame

df.columns: attribute, the column index of the DataFrame

df.values: attribute, the underlying elements as a NumPy array

df.shape: attribute, the shape of the DataFrame

df.dtypes: attribute, the data type of each column

df.info(): method, basic information (column types, non-null counts per column, memory usage, etc.)

In [125]: pd.read_csv('bikes.csv',encoding='latin1',sep=';').head(5)
Out[125]: 
         Date                 ...                   St-Urbain (données non disponibles)
0  01/01/2012                 ...                                                   NaN
1  02/01/2012                 ...                                                   NaN
2  03/01/2012                 ...                                                   NaN
3  04/01/2012                 ...                                                   NaN
4  05/01/2012                 ...                                                   NaN

[5 rows x 10 columns]

In [126]: pd.read_csv('bikes.csv',encoding='latin1',sep=';').tail(5)
Out[126]: 
           Date                 ...                   St-Urbain (données non disponibles)
305  01/11/2012                 ...                                                   NaN
306  02/11/2012                 ...                                                   NaN
307  03/11/2012                 ...                                                   NaN
308  04/11/2012                 ...                                                   NaN
309  05/11/2012                 ...                                                   NaN

[5 rows x 10 columns]

In [127]: pd.read_csv('bikes.csv',encoding='latin1',sep=';').index
Out[127]: RangeIndex(start=0, stop=310, step=1)

In [70]: df=pd.read_csv('bikes.csv',encoding='latin1',sep=';').head(5)

In [71]: df
Out[71]: 
         Date                 ...                   St-Urbain (données non disponibles)
0  01/01/2012                 ...                                                   NaN
1  02/01/2012                 ...                                                   NaN
2  03/01/2012                 ...                                                   NaN
3  04/01/2012                 ...                                                   NaN
4  05/01/2012                 ...                                                   NaN

[5 rows x 10 columns]

In [72]: df.shape
Out[72]: (5, 10)

In [73]: df.values
Out[73]: 
array([['01/01/2012', 35, nan, 0, 38, 51, 26, 10, 16, nan],
       ['02/01/2012', 83, nan, 1, 68, 153, 53, 6, 43, nan],
       ['03/01/2012', 135, nan, 2, 104, 248, 89, 3, 58, nan],
       ['04/01/2012', 144, nan, 1, 116, 318, 111, 8, 61, nan],
       ['05/01/2012', 197, nan, 2, 124, 330, 97, 13, 95, nan]],
      dtype=object)

In [74]: df.columns
Out[74]: 
Index(['Date', 'Berri 1', 'Brébeuf (données non disponibles)',
       'Côte-Sainte-Catherine', 'Maisonneuve 1', 'Maisonneuve 2', 'du Parc',
       'Pierre-Dupuy', 'Rachel1', 'St-Urbain (données non disponibles)'],
      dtype='object')

In [75]: df.index
Out[75]: RangeIndex(start=0, stop=5, step=1)

In [76]: df.dtypes
Out[76]: 
Date                                    object
Berri 1                                  int64
Brébeuf (données non disponibles)      float64
Côte-Sainte-Catherine                    int64
Maisonneuve 1                            int64
Maisonneuve 2                            int64
du Parc                                  int64
Pierre-Dupuy                             int64
Rachel1                                  int64
St-Urbain (données non disponibles)    float64
dtype: object
In [77]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Date                                   5 non-null object
Berri 1                                5 non-null int64
Brébeuf (données non disponibles)      0 non-null float64
Côte-Sainte-Catherine                  5 non-null int64
Maisonneuve 1                          5 non-null int64
Maisonneuve 2                          5 non-null int64
du Parc                                5 non-null int64
Pierre-Dupuy                           5 non-null int64
Rachel1                                5 non-null int64
St-Urbain (données non disponibles)    0 non-null float64
dtypes: float64(2), int64(7), object(1)
memory usage: 480.0+ bytes

Indexing

Selecting one or more columns
df['column1']#view the data in column 'column1'
df[['col1','col2']]#select multiple columns

In [12]: df
Out[12]: 
    o   p   q   r   s
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19

In [13]: df['o']
Out[13]: 
a     0
b     5
c    10
d    15
Name: o, dtype: int64

In [14]: df[['o','p']]
Out[14]: 
    o   p
a   0   1
b   5   6
c  10  11
d  15  16

Selecting one or more rows

loc[rows, cols] / iloc[rows, cols]
df.loc[['index1','index2'],['col1','col2']]#select by index label
df.iloc[[0,3,5],[2,4,1]]#select by integer position

In [16]: df.loc['a']
Out[16]: 
o    0
p    1
q    2
r    3
s    4
Name: a, dtype: int64

In [17]: df.loc['a','o']
Out[17]: 0

In [18]: df.loc[['a','b'],['o','p']]
Out[18]: 
   o  p
a  0  1
b  5  6

In [19]: df.iloc[1]
Out[19]: 
o    5
p    6
q    7
r    8
s    9
Name: b, dtype: int64

In [20]: df.iloc[[1],[1]]
Out[20]: 
   p
b  6


Boolean indexing
df[bool_arr]#selects rows; bool_arr is a boolean array or Series with the same length as the number of rows
df.loc[bool_arr,['col1']]#boolean indexing applies to the rows

In [27]: df
Out[27]: 
    o   p   q   r   s
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19

In [28]: df[[True,False,True,False]]#a boolean sequence with the same length as the rows
Out[28]: 
    o   p   q   r   s
a   0   1   2   3   4
c  10  11  12  13  14
In [34]: df.loc[[True,False,True,False],['o','q']]#use with loc to also select columns
Out[34]: 
    o   q
a   0   2
c  10  12

value_counts(): frequency counts for a Series (a single column), returned in descending order; it does not work on a whole DataFrame
df['col1'].value_counts()#frequency counts, descending

In [39]: df['o'].value_counts()
Out[39]: 
15    1
5     1
10    1
0     1
Name: o, dtype: int64


isna() and notna(): missing-value checks; they work on both Series and DataFrame objects
df['col1'].isna()#True where the value in 'col1' is missing, False otherwise
df['col1'].notna()#False where the value in 'col1' is missing, True otherwise

In [40]: df.isna()
Out[40]: 
       o      p      q      r      s
a  False  False  False  False  False
b  False  False  False  False  False
c  False  False  False  False  False
d  False  False  False  False  False

In [42]: df['o'].notna()
Out[42]: 
a    True
b    True
c    True
d    True
Name: o, dtype: bool

ser.unique(): unique values; a Series method, e.g. df['col'].unique()

3. Data visualization

Plotting with matplotlib.pyplot

import matplotlib.pyplot as plt


plt.plot(x,y,style)#line plot; style encodes color and line style
plt.bar(x,y,color=['r','g'],alpha=1)#bar chart; alpha is the opacity
plt.hist(x,bins=30,density=True)#histogram; bins is the number of intervals x is split into, density plots frequency density
plt.pie(x,labels,autopct='%.2f%%')#pie chart; labels are the text labels, autopct formats the numeric labels

#subplots
fig = plt.figure(figsize=(10,8))#figsize sets the figure size
ax=fig.subplots(2,2)#number and layout of subplots
ax[0][0].plot()#draw a line plot on the first subplot

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

fig=plt.figure(figsize=(8,8))#set the canvas size
((ax1, ax2), (ax3, ax4)) = fig.subplots(2, 2)#split into a 2x2 grid
ax1.hist(np.random.rand(100),bins=20,color='r',alpha=1)#histogram
ax2.scatter(np.arange(30),np.arange(30))#scatter plot
ax3.plot(np.random.randn(50).cumsum(),'k--')#line plot
labels = ['Frogs', 'Hogs', 'Dogs', 'Logs']
fracs = [15, 30, 45, 10]
explode = [0.00, 0.00, 0.05, 0]
ax4.pie(fracs, explode=explode, labels=labels, autopct='%.0f%%')#pie chart
plt.savefig('图片/MathPlot12.png',dpi=400,bbox_inches='tight')#save the figure
plt.show()#needed to display the figure when running from a terminal

The generated figure is shown below:

plt.title('title',fontsize=20)#title
plt.xlabel('label',fontsize=15)#x-axis label
plt.xlim(0,4)#restrict the x-axis display range to [0,4]
plt.xticks(x,label,rotation=0)#change the tick positions or tick labels
plt.text(x,y,'text',fontsize=10)#insert text at point (x,y) in the figure

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.serif'] = ['KaiTi']#font that can render CJK characters
plt.rcParams['axes.unicode_minus'] = False
fig1 = plt.figure(figsize=(10,5))#new figure with a given size

t=np.arange(0.0,4.0,0.01)
s=np.sin(2*np.pi*t)
plt.plot(t,s)
plt.title('learn',fontsize=25)
plt.text(1,0,r'sin(2*pi*x)',fontsize=15)#text position, content, font size
plt.text(0.4, 0.6, r'$\mathcal{A}\mathrm{sin}(2 \omega t)$',fontsize=15)#text with LaTeX markup
plt.xlabel('value(x)',fontsize=20)#x-axis label
plt.ylabel('value(y)',fontsize=20)#y-axis label
plt.grid(True)#add a grid
plt.xlim(0,4)#x-axis range
plt.ylim(-1,1)#y-axis range
plt.savefig('learn',dpi=400,bbox_inches='tight')#save the figure, PNG by default

The generated figure is shown below:

Plotting with pandas

df = pd.DataFrame(np.random.randn(10,4).cumsum(0),columns=list('ABCD'),index = np.arange(0,100,10))
print(df)
df.plot(grid=True,style=['r-','g-.','b--','k:'],figsize=(8,5))#line plot

The generated figure is shown below: line plot

fig2=plt.figure(figsize=(10,8))
ax= fig2.subplots(2, 1)
data = pd.Series(np.random.rand(16),index = list('abcdefghijklmnop'))
data.plot(kind='bar',ax=ax[0],color='r',alpha=0.7,title='bar chart')#vertical
data.plot(kind='barh',ax=ax[1],color='b',alpha=0.7)#horizontal

The generated figure is shown below: bar charts

fig4 = plt.figure(figsize=(10,8))
ax=fig4.subplots(1,2)
df = pd.DataFrame(np.random.rand(6,4),index=['one','two','three','four','five','six'],columns = list('ABCD'))
print(df)
df.plot(kind='barh',ax=ax[0])
df.plot(kind='bar',ax=ax[1],stacked=True)#stacked bar chart
ax[1].set_xticklabels(labels=list('abcdef'),rotation=30,fontsize=20)#set the x-axis tick labels, rotation, and font size

The generated figure is shown below: stacked bar chart

comp1 = np.random.normal(0,1,size=200)
comp2 = np.random.normal(10,2,size=200)
values = pd.Series(np.concatenate([comp1,comp2]))
values.hist(bins=200,color='g',density=True)
#values.plot(kind='kde',style='b--')
plt.grid(True)

The generated figure is shown below: histogram and density plot

Plotting example

import matplotlib.pyplot as plt #load matplotlib.pyplot
import numpy as np
import pandas as pd

fig = plt.figure(figsize=(8,5))#canvas
complaints = pd.read_csv('311-service-requests.csv')#read the CSV data
complaints['Complaint Type'].value_counts().head().plot(kind='bar')#count and plot
data=complaints['Complaint Type'].value_counts().head()#declare the data
print(data)
plt.xticks(range(len(data.index)),['HEATING','CONSTRUCTION','LIGHT CONDITION','PLUMBING','DOF REQUEST'],rotation=30,fontsize=10)#change the ticks and tick labels
x=range(len(data.index))
y=list(data.values)
print(y)
#annotate each bar with its count
for a,b in zip(x,y):
    plt.text(a, b+100,str(b), ha='center', va= 'bottom',fontsize=12)
plt.title('Complain Type Top5',fontsize=20)
plt.ylim(0,15000)
plt.savefig('图片/Image.png',dpi=300,bbox_inches='tight')
plt.show()

 

The generated figure is shown below: bar chart

 

4. Data cleaning

Handling missing values

dropna: drop rows containing missing values; returns a new object

In [6]: df
Out[6]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
two    1002 2013-01-03          NaN    100-B   44     NaN
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32  5433.0
five   1005 2013-01-06     shanghai    210-A   34     NaN
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

In [7]: df.dropna()
Out[7]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32  5433.0
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

In [8]: df
Out[8]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
two    1002 2013-01-03          NaN    100-B   44     NaN
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32  5433.0
five   1005 2013-01-06     shanghai    210-A   34     NaN
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

df.dropna(how='all'): drop a row only when all of its values are missing; how is one of {'any', 'all'}, default 'any'

In [9]: df.dropna(how='all')
Out[9]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
two    1002 2013-01-03          NaN    100-B   44     NaN
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32  5433.0
five   1005 2013-01-06     shanghai    210-A   34     NaN
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

df=df.dropna() or df.dropna(how='all',inplace=True) modify the source data; inplace is bool, default False

In [10]: df.dropna(inplace=False)
Out[10]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32  5433.0
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

In [11]: df
Out[11]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
two    1002 2013-01-03          NaN    100-B   44     NaN
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32  5433.0
five   1005 2013-01-06     shanghai    210-A   34     NaN
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

In [12]: df.dropna(inplace=True)

In [13]: df
Out[13]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32  5433.0
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

df.fillna(300): returns new data, does not change the source

In [17]: df
Out[17]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32     NaN
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

In [18]: df.fillna(110)
Out[18]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32   110.0
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

df['column'].fillna(300) fills a single column

df.fillna({'price':300}) fills per-column values via a dict

In [19]: df
Out[19]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32     NaN
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

In [20]: df['price'].fillna(110)
Out[20]: 
one      1200.0
three    2133.0
four      110.0
six      4432.0
Name: price, dtype: float64


In [22]: df.fillna({'price':110})
Out[22]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23  1200.0
three  1003 2013-01-04   guangzhou     110-A   54  2133.0
four   1004 2013-01-05     Shenzhen    110-C   32   110.0
six    1006 2013-01-07     BEIJING     130-F   32  4432.0

The method parameter: {'backfill', 'bfill', 'pad', 'ffill', None}, default None

backfill / bfill: backward fill (propagate the next valid value)

ffill / pad: forward fill (propagate the previous valid value)

df.fillna(method='ffill',limit=1) fills at most one consecutive NaN

In [30]: df
Out[30]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23     NaN
two    1002 2013-01-03           SH    100-B   44     NaN
three  1003 2013-01-04   guangzhou     110-A   54     NaN
four   1004 2013-01-05     Shenzhen    110-C   32     NaN
five   1005 2013-01-06     shanghai    210-A   34     NaN
six    1006 2013-01-07     BEIJING     130-F   32  2233.0

In [31]: df.fillna(method='bfill',limit=2)
Out[31]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23     NaN
two    1002 2013-01-03           SH    100-B   44     NaN
three  1003 2013-01-04   guangzhou     110-A   54     NaN
four   1004 2013-01-05     Shenzhen    110-C   32  2233.0
five   1005 2013-01-06     shanghai    210-A   34  2233.0
six    1006 2013-01-07     BEIJING     130-F   32  2233.0

df['column'].replace(old_value, new_value) substitutes values in a column

In [35]: df
Out[35]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23     NaN
two    1002 2013-01-03           SH    100-B   44     NaN
three  1003 2013-01-04   guangzhou     110-A   54     NaN
four   1004 2013-01-05     Shenzhen    110-C   32     NaN
five   1005 2013-01-06     shanghai    210-A   34     NaN
six    1006 2013-01-07     BEIJING     130-F   32  2233.0

In [36]: df['price'].replace(np.nan,222)
Out[36]: 
one       222.0
two       222.0
three     222.0
four      222.0
five      222.0
six      2233.0
Name: price, dtype: float64

String handling

These operate on Series objects, i.e. DataFrame columns (a DataFrame itself has no .str attribute: 'DataFrame' object has no attribute 'str'); they return new data

ser.str.upper()

ser.str.lower()

In [38]: df['city'].str.upper()
Out[38]: 
one         BEIJING 
two               SH
three     GUANGZHOU 
four        SHENZHEN
five        SHANGHAI
six         BEIJING 
Name: city, dtype: object

In [39]: df
Out[39]: 
         id       date         city category  age   price
one    1001 2013-01-02     Beijing     100-A   23     NaN
two    1002 2013-01-03           SH    100-B   44     NaN
three  1003 2013-01-04   guangzhou     110-A   54     NaN
four   1004 2013-01-05     Shenzhen    110-C   32     NaN
five   1005 2013-01-06     shanghai    210-A   34     NaN
six    1006 2013-01-07     BEIJING     130-F   32  2233.0

ser.str.len(): the length of each string

len(ser): the length of the Series itself

In [43]: df['city'].str.len()
Out[43]: 
one       8
two       2
three    11
four      8
five      8
six       8
Name: city, dtype: int64

In [44]: len(df['city'])
Out[44]: 6

ser.str.strip(): strip whitespace from both ends

df['price'].replace(1200,100009)

ser.str.lower().str.replace('old','new')#str.replace takes a pattern and a replacement string

ser.str.split('_'): split strings, returns new data

ser.str.join('_'): join the pieces of each element, returns new data; the inverse of split

In [51]: ser=pd.Series(['a-b-c','c-d-e'])

In [52]: ser.str.split('-')
Out[52]: 
0    [a, b, c]
1    [c, d, e]
dtype: object

In [53]: ser
Out[53]: 
0    a-b-c
1    c-d-e
dtype: object

In [54]: ser.str.split('-').str.join('+++')
Out[54]: 
0    a+++b+++c
1    c+++d+++e
dtype: object

ser.str.contains('a',na=True): whether each string contains 'a'; NaN entries return the na value (here True)

Data conversion: type conversion and outlier handling
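This section has no body in the original notes; as a hedged sketch, type conversion is usually done with astype() or pd.to_numeric(), and one crude form of outlier handling caps values to a plausible range (the column name and threshold below are invented):

```python
import pandas as pd

df = pd.DataFrame({'price': ['1200', '2133', '99999', '4432']})

# Type conversion: parse strings into numbers; errors='coerce' turns
# unparseable entries into NaN instead of raising
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Outlier handling (one simple option): cap values above an assumed
# plausible maximum of 10000
df['price'] = df['price'].clip(upper=10000)
```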

Duplicates

data.duplicated(): flags whether each whole row (not individual elements) duplicates an earlier row

data.drop_duplicates(keep='last'): drop duplicate rows, keeping the last occurrence
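A minimal sketch of the two calls (data invented):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Beijing', 'Beijing', 'Shanghai'],
                   'n': [1, 1, 2]})

dup = df.duplicated()                      # whole-row comparison: only row 1 repeats row 0
deduped = df.drop_duplicates(keep='last')  # keep the last occurrence of each duplicate group
```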

Renaming indexes

data.index=data.index.str.upper(): rename the index; the method returns new data, not a view, so it must be assigned back

data.rename(index=str.lower,columns=str.upper): rename index and column labels
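A small sketch of both renaming styles (labels invented):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])

# Index.str.upper() returns a new Index, so it must be assigned back
df.index = df.index.str.upper()

# rename() takes functions (or dicts) per axis and also returns a new object
df2 = df.rename(index=str.lower, columns=str.lower)
```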

Discretization

Continuous data often needs to be discretized, i.e. split into bins, using cut and qcut; intervals are left-open, right-closed by default

pd.cut(data,bins=4).value_counts(): equal-width bins

pd.cut(data,bins=[20,40,50]).value_counts(): explicit bin edges

pd.cut(data,bins=4,right=False): switch from the default right-closed intervals to left-closed, right-open

qcut bins by quantiles, giving groups of roughly equal size

pd.qcut(data,q=5)

pd.qcut(data,q=[0,0.05,0.2,0.5,0.75,1]): explicit quantile edges between 0 and 1; the data is ranked internally, and the edges need not be evenly spaced
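A sketch of cut and qcut on made-up ages:

```python
import pandas as pd

ages = pd.Series([21, 25, 33, 41, 58])

# Equal-width bins; intervals are right-closed by default, e.g. (20.963, 30.25]
equal_width = pd.cut(ages, bins=4)

# Explicit bin edges: (20, 40] and (40, 60]
fixed = pd.cut(ages, bins=[20, 40, 60])

# Quantile-based bins: two halves of roughly equal size
quantiled = pd.qcut(ages, q=2)
```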

 

Misc notes:

data[~bool_arr] negates a boolean selection

iloc does not accept a label-based boolean Series for indexing; loc does

All string operations go through the .str accessor

Missing continuous values: fill (e.g. with the mean) or interpolate (with a model); missing discrete values: add a "missing" category

A "variable" here means one column of data

Outliers:

Chebyshev's theorem: most of the data lies within a few standard deviations of the mean

Standardization

Centering (de-meaning)

Box plots

 

 

5. Data wrangling

Dropping data

dataframe.drop(): drop data along a given axis; returns a new object, does not change the source

In [12]: data
Out[12]: 
           one  two  three  four
Shenzhen     0    1      2     3
Guangzhou    4    5      6     7
Beijing      8    9     10    11
Shanghai    12   13     14    15

In [13]: data.drop(['Shenzhen','Guangzhou'])
Out[13]: 
          one  two  three  four
Beijing     8    9     10    11
Shanghai   12   13     14    15

In [14]: data
Out[14]: 
           one  two  three  four
Shenzhen     0    1      2     3
Guangzhou    4    5      6     7
Beijing      8    9     10    11
Shanghai    12   13     14    15

In [15]: data.drop(['one'],axis=1)
Out[15]: 
           two  three  four
Shenzhen     1      2     3
Guangzhou    5      6     7
Beijing      9     10    11
Shanghai    13     14    15

Function application and mapping

dataframe.apply(afunc): applies afunc column by column (each column is passed in as a Series)

dataframe.applymap(afunc): applies afunc element-wise, producing a new DataFrame of the same shape

series.map(func): element-wise on a Series object; apply also works on a Series, where it behaves like map
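A sketch distinguishing the three (note: in newer pandas, applymap() has been renamed DataFrame.map(), but the older name still works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list('abc'))

# apply: the function sees one whole column (a Series) at a time
col_range = df.apply(lambda col: col.max() - col.min())

# applymap: the function sees each scalar element
doubled = df.applymap(lambda x: x * 2)

# Series.map: element-wise over a single column
shifted = df['a'].map(lambda x: x + 100)
```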

Sorting and ranking

To sort by row or column index, use sort_index(); it returns a new, sorted object

To sort by the values of a given column, use sort_values()

sort_index(): sort by index

sort_values(): sort by values; you must specify which column to sort by

sort_values(['col1','col2']): sort by the first column, breaking ties with the second
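A sketch of the three sorts (values invented):

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 2, 1], 'b': [9, 7, 8]}, index=['z', 'x', 'y'])

by_index = df.sort_index()           # rows ordered by index label: x, y, z
by_value = df.sort_values('b')       # rows ordered by column 'b': 7, 8, 9
by_two = df.sort_values(['a', 'b'])  # 'a' first, ties broken by 'b'
```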

Merging data

merge(): database-style joins

inner join: keep only keys present in both

outer join: keep all keys

left join: keep the keys of the left frame

right join: keep the keys of the right frame

concat(): concatenation

Columns with matching names are stacked; unmatched columns are added as new columns

combine_first(): patch overlapping data, filling the caller's missing values from the argument
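A sketch of concat and combine_first (frames invented):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3], 'y': [4]})

# concat stacks along rows; the column missing from `a` is filled with NaN
stacked = pd.concat([a, b], ignore_index=True, sort=False)

# combine_first "patches" the caller's NaN with values from the argument
left = pd.DataFrame({'x': [np.nan, 2.0]})
right = pd.DataFrame({'x': [10.0, 20.0]})
patched = left.combine_first(right)
```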

Reshaping data

Hierarchical indexing gives DataFrame reshaping a consistent interface, through two operations:

  • stack(): 'rotates' the columns into rows;
  • unstack(): 'rotates' the rows into columns;
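A minimal sketch of the pair:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(2, 2),
                  index=['r1', 'r2'], columns=['c1', 'c2'])

stacked = df.stack()          # columns rotate into an inner index level -> Series
restored = stacked.unstack()  # the inner level rotates back into columns
```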

 

6. Computation and statistics

 

Arithmetic operations

Addition: add

df.add(other, level, axis, fill_value)

Where the two frames don't overlap, the result is NaN. To take the value from one frame instead, pass a fill_value:

In [16]: df1 = pd.DataFrame(np.arange(1,13).reshape(3,4),columns=list('abcd'))
    ...: df2 = pd.DataFrame(np.arange(1,21).reshape(4,5),columns=list('abcde'))


In [17]: df1.add(df2)
Out[17]: 
      a     b     c     d   e
0   2.0   4.0   6.0   8.0 NaN
1  11.0  13.0  15.0  17.0 NaN
2  20.0  22.0  24.0  26.0 NaN
3   NaN   NaN   NaN   NaN NaN


In [19]: df1.add(df2,fill_value=0)
Out[19]: 
      a     b     c     d     e
0   2.0   4.0   6.0   8.0   5.0
1  11.0  13.0  15.0  17.0  10.0
2  20.0  22.0  24.0  26.0  15.0
3  16.0  17.0  18.0  19.0  20.0

Subtraction: sub

df.sub(other, level, axis, fill_value)

Multiplication: mul

df.mul(other, level, axis, fill_value)

Division: div

df.div(other, level, axis, fill_value)

Using +, -, *, / directly leaves NaN wherever labels don't overlap; there is no way to supply a fill value

Numeric data: 1, 2, 3, 4, ...

Categorical data: c, f, s, ...

Summary statistics

Reductions: sum, mean, median, max, min, etc.

In [21]: df
Out[21]: 
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

In [22]: df.sum()
Out[22]: 
one    9.25
two   -5.80
dtype: float64

In [23]: df.sum(axis=1)
Out[23]: 
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [24]: df.sum(axis=1,skipna=False)
Out[24]: 
a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

 

Correlation and covariance

Between two columns these are Series methods; called on a whole DataFrame, corr()/cov() instead return the pairwise matrix

Correlation: data1.corr(data2), whether two series move together

Variance: data1.var(), the averaged squared deviation of a series from its mean

Covariance: data1.cov(data2), the averaged product of the two series' deviations from their means

Standard deviation: data1.std()

In [33]: df1=pd.DataFrame(np.arange(12).reshape((3,4)))

In [34]: df2=pd.DataFrame(np.arange(12).reshape((3,4)))
#the correlation coefficient lies in [-1,1]

In [36]: df1[1].corr(df2[1])
Out[36]: 1.0

In [37]: df1[1].corr(df2[2])
Out[37]: 1.0

In [38]: df1[1].corr(df2[3])
Out[38]: 1.0
#covariance
In [51]: df1[1].cov(df2[1])
Out[51]: 16.0

#var: variance
In [46]: df2.var()
Out[46]: 
0    16.0
1    16.0
2    16.0
3    16.0
dtype: float64

In [47]: df2[1].var()
Out[47]: 16.0

#standard deviation
In [52]: df1.std()
Out[52]: 
0    4.0
1    4.0
2    4.0
3    4.0
dtype: float64

 

 

 

 

 

A correlation of 1 is perfect positive correlation; -1 is perfect negative correlation, i.e. an exact linear relationship

Variance: one series against itself (x, x)

Covariance: two series (x, y)

What linear correlation is good for: e.g. the correlation between height and weight

 

In [25]: df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], 
    ...:  "date":pd.date_range('20130102', periods=6),
    ...:   "city":['Beijing ', 'ShangHai', ' guangzhou ', 'Shenzhen', 'shanghai'
    ...: , 'BEIJING '],
    ...:  "age":[23,44,54,32,34,32],
    ...:  "category":['100-A','100-B','110-A','110-C','210-A','130-F'],
    ...:   "price":[1200,np.nan,2133,5433,np.nan,4432]},
    ...:   columns =['id','date','city','category','age','price'])
    ...:   

In [26]: df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008], 
    ...: "gender":['male','female','male','female','male','female','male','femal
    ...: e'],
    ...: "pay":['Y','N','Y','Y','N','Y','N','Y',],
    ...: "m-point":[10,12,20,40,40,40,30,20]})

In [27]: 

In [27]: df
Out[27]: 
     id       date         city category  age   price
0  1001 2013-01-02     Beijing     100-A   23  1200.0
1  1002 2013-01-03     ShangHai    100-B   44     NaN
2  1003 2013-01-04   guangzhou     110-A   54  2133.0
3  1004 2013-01-05     Shenzhen    110-C   32  5433.0
4  1005 2013-01-06     shanghai    210-A   34     NaN
5  1006 2013-01-07     BEIJING     130-F   32  4432.0

In [28]: df1
Out[28]: 
     id  gender pay  m-point
0  1001    male   Y       10
1  1002  female   N       12
2  1003    male   Y       20
3  1004  female   Y       40
4  1005    male   N       40
5  1006  female   Y       40
6  1007    male   N       30
7  1008  female   Y       20

In [29]: df_inner=pd.merge(df,df1,how='inner')

In [30]: df_inner
Out[30]: 
     id       date         city category   ...     price  gender pay m-point
0  1001 2013-01-02     Beijing     100-A   ...    1200.0    male   Y      10
1  1002 2013-01-03     ShangHai    100-B   ...       NaN  female   N      12
2  1003 2013-01-04   guangzhou     110-A   ...    2133.0    male   Y      20
3  1004 2013-01-05     Shenzhen    110-C   ...    5433.0  female   Y      40
4  1005 2013-01-06     shanghai    210-A   ...       NaN    male   N      40
5  1006 2013-01-07     BEIJING     130-F   ...    4432.0  female   Y      40

[6 rows x 9 columns]


In [32]: df_inner['price'].cov(df_inner['m-point'])
Out[32]: 28771.666666666664

7. Data aggregation

Grouped aggregation

df.groupby(['column1','column2'])['column'].ufunc()

In [53]: df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], 
    ...:  "date":pd.date_range('20130102', periods=6),
    ...:   "city":['Beijing ', 'ShangHai', ' guangzhou ', 'Shenzhen', 'shanghai'
    ...: , 'BEIJING '],
    ...:  "age":[23,44,54,32,34,32],
    ...:  "category":['100-A','100-B','110-A','110-C','210-A','130-F'],
    ...:   "price":[1200,np.nan,2133,5433,np.nan,4432]},
    ...:   columns =['id','date','city','category','age','price'])
    ...:   

In [54]: df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008], 
    ...: "gender":['male','female','male','female','male','female','male','femal
    ...: e'],
    ...: "pay":['Y','N','Y','Y','N','Y','N','Y',],
    ...: "m-point":[10,12,20,40,40,40,30,20]})

In [55]: df_inner=pd.merge(df,df1,how='inner')
#group by a single column and aggregate all other columns
In [56]: df_inner['city']=df_inner['city'].str.title()#capitalize the first letter of each word
In [63]: df_inner
Out[63]: 
     id       date         city category   ...     price  gender pay m-point
0  1001 2013-01-02     Beijing     100-A   ...    1200.0    male   Y      10
1  1002 2013-01-03     Shanghai    100-B   ...       NaN  female   N      12
2  1003 2013-01-04   Guangzhou     110-A   ...    2133.0    male   Y      20
3  1004 2013-01-05     Shenzhen    110-C   ...    5433.0  female   Y      40
4  1005 2013-01-06     Shanghai    210-A   ...       NaN    male   N      40
5  1006 2013-01-07     Beijing     130-F   ...    4432.0  female   Y      40

[6 rows x 9 columns]

In [64]: df_inner.groupby('city').count()
Out[64]: 
             id  date  category  age  price  gender  pay  m-point
city                                                             
 Guangzhou    1     1         1    1      1       1    1        1
Beijing       2     2         2    2      2       2    2        2
Shanghai      2     2         2    2      0       2    2        2
Shenzhen      1     1         1    1      1       1    1        1

In [65]: df_inner.groupby('city')['price'].sum()
Out[65]: 
city
 Guangzhou     2133.0
Beijing        5632.0
Shanghai          0.0
Shenzhen       5433.0
Name: price, dtype: float64

In [66]: df_inner['city']=df_inner['city'].str.strip()

In [67]: df_inner.groupby('city')['price'].sum()
Out[67]: 
city
Beijing      5632.0
Guangzhou    2133.0
Shanghai        0.0
Shenzhen     5433.0
Name: price, dtype: float64

#group by multiple columns and aggregate the others
In [68]: df_inner.groupby(['city','pay'])['id'].count().unstack(0)
Out[68]: 
city  Beijing  Guangzhou  Shanghai  Shenzhen
pay                                         
N         NaN        NaN       2.0       NaN
Y         2.0        1.0       NaN       1.0

Applying multiple aggregation functions

In [69]: df_inner.groupby('city')['price'].agg([len,np.sum,np.mean]) 
Out[69]: 
           len     sum    mean
city                          
Beijing    2.0  5632.0  2816.0
Guangzhou  1.0  2133.0  2133.0
Shanghai   2.0     0.0     NaN
Shenzhen   1.0  5433.0  5433.0

apply(): general "split-apply-combine"

tips = pd.read_csv('data/tips.csv')
tips
tips['tip_pac']=tips['tip']/tips['total_bill']#add a 'tip as a fraction of the bill' column
tips.head()
#find the top N rows
def top(df,n=5,column='tip_pac'):
    return df.sort_values(by=column)[-n:]

tips.groupby('smoker').apply(top,n=3,column='total_bill')

 

Pivot tables and cross-tabulations

pivot_table(): pivot table; aggregates with the mean by default

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

crosstab(): cross-tabulation, a special pivot table that computes group frequencies

pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)

tips.pivot_table(index=['sex','smoker'])#computes group means

pd.crosstab(index=[tips.time,tips.day],columns=tips.smoker,margins=True)
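The tips examples above depend on a CSV file; a self-contained sketch of both calls on an invented frame:

```python
import pandas as pd

df = pd.DataFrame({'sex': ['M', 'M', 'F', 'F'],
                   'smoker': ['Y', 'N', 'Y', 'N'],
                   'bill': [10.0, 20.0, 30.0, 40.0]})

# pivot_table: mean 'bill' per sex/smoker cell (aggfunc defaults to the mean)
pivot = df.pivot_table(values='bill', index='sex', columns='smoker')

# crosstab: frequency counts of the combinations, with row/column totals
freq = pd.crosstab(index=df.sex, columns=df.smoker, margins=True)
```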

 

 

 

 

 

 

 

 

 
