Pandas

最新推荐文章于 2025-06-25 00:03:24 发布

iso少年

最新推荐文章于 2025-06-25 00:03:24 发布

阅读量288

点赞数

CC 4.0 BY-SA版权

文章标签： pandas python 数据分析

本文链接：https://blog.youkuaiyun.com/l_ll888/article/details/94555489

Pandas介绍

Python长期以来一直非常适合数据整理和准备，但对于数据分析和建模则不那么重要。pandas有助于填补这一空白，使您能够在Python中执行整个数据分析工作流程，而无需切换到更像域的特定语言，如R. pandas是一个Python包，提供快速，灵活和富有表现力的数据结构

创建对象

导入包：

import numpy as np
import pandas as pd

创建Series

和 NumPy 数组不同，Pandas 的 Series 能存放各种不同类型的对象。
从字典创建Series，pandas 会自动把字典的键值设置成 Series 的 index，并将对应的 values 放在和索引对应的 data 里。

例子代码：

"""创建Series"""
a=pd.Series([12,4,5,np.NAN,23]) #可以添加index=,但是不能添加columns=
a2=pd.Series(1,index=[0,1,2,3],dtype="float32")  #不够的自动补齐
b_dict={'a':50,'b':24,'c':89,'d':34}
b=pd.Series(b_dict)
type(b) #<class 'pandas.core.series.Series'>
print(a,a2,b)
"""
0    12.0
1     4.0
2     5.0
3     NaN
4    23.0
dtype: float64 

0    1.0
1    1.0
2    1.0
3    1.0
dtype: float32 

a    50
b    24
c    89
d    34
dtype: int64
"""

创建Dataframe

例子代码：

c=pd.DataFrame([[23,5,6],[2,3,5]])
d=pd.DataFrame([[23,5,6],[2,3,5]],index=["第一行","第二行"],columns=["第一列","第二列","第三列"])#或者用下面的np数组创建：
e=pd.DataFrame(np.array([[23,5,6],[2,3,5]]),index=["第一行","第二行"],columns=["第一列","第二列","第三列"])
print(c,"\n\n",d,"\n\n",e)
"""
	0  1  2
0  23  5  6
1   2  3  5 

      第一列  第二列  第三列
第一行   23    5    6
第二行    2    3    5 

      第一列  第二列  第三列
第一行   23    5    6
第二行    2    3    5
"""


"""通过传递带有日期时间索引和标记列的NumPy数组来创建"""
dates = pd.date_range('20130101', periods=6)
f=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list("ABCD"))
print(dates,f)
"""
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
                                  
              A         B         C         D
2013-01-01  1.176395 -0.566222 -0.995993  0.686690
2013-01-02 -0.161430 -0.020617 -1.256424  1.061952
2013-01-03  0.831867 -0.814536  1.846343 -1.006396
2013-01-04  0.912144 -0.007161 -0.344765 -0.567658
2013-01-05  0.289082 -0.173980 -0.410463 -0.895588
2013-01-06 -0.142269  0.380692 -0.691355  0.474019
"""


"""通过传递可以转换为类似系列的对象的dict来创建。"""
g=pd.DataFrame({'A':3.,'B':pd.Series(1,index=[0,1,2,3],dtype="float32"),
				'C':np.array([4,23,2,2],dtype="int32")})  #A列数目不够，自动填充补齐
print(g,"\n",g.dtypes)
"""
     A    B   C
0  3.0  1.0   4
1  3.0  1.0  23
2  3.0  1.0   2
3  3.0  1.0   2 

A    float64
B    float32
C      int32
dtype: object
"""

数据查看

Dataframe.head(),Dataframe.tail()，输出头几行或后几行
Dataframe.index,Dataframe.columns，显示index和columns
Dataframe.describe()，显示数据的快速统计摘要（计数、均值、最大最小值、四分位数）
Dataframe.T，转置
Dataframe.sort_index(axis=1,ascending=False) ，按轴排序
Dataframe.sort_values(by=‘B’) ，按照列值排序（'B’列）
Dataframe.to_numpy()，转成numpy数组

例子代码：

a=pd.DataFrame({'A':[1,2,4,5,6],'B':[3,4,6,2,5],'C':[3,5,24,1,2]},index=["first","second","third","fourth","fifth"])
a.head(1)
a.tail(2)
print(a.index,"\n",a.columns)
a.describe()
a.T
a.sort_index(axis=1,ascending=False)
a.sort_index(axis=0,ascending=True)
a.sort_values('B')
a.to_numpy()
"""
       A  B  C				
first  1  3  3

        A  B  C
fourth  5  2  1
fifth   6  5  2

Index(['first', 'second', 'third', 'fourth', 'fifth'], dtype='object') 
Index(['A', 'B', 'C'], dtype='object')


              A         B          C
count  5.000000  5.000000   5.000000
mean   3.600000  4.000000   7.000000
std    2.073644  1.581139   9.617692
min    1.000000  2.000000   1.000000
25%    2.000000  3.000000   2.000000
50%    4.000000  4.000000   3.000000
75%    5.000000  5.000000   5.000000
max    6.000000  6.000000  24.000000

   first  second  third  fourth  fifth
A      1       2      4       5      6
B      3       4      6       2      5
C      3       5     24       1      2

         C  B  A
first    3  3  1
second   5  4  2
third   24  6  4
fourth   1  2  5
fifth    2  5  6

        A  B   C
fifth   6  5   2
first   1  3   3
fourth  5  2   1
second  2  4   5
third   4  6  24

        A  B   C
fourth  5  2   1
first   1  3   3
second  2  4   5
fifth   6  5   2
third   4  6  24

array([[ 1,  3,  3],
       [ 2,  4,  5],
       [ 4,  6, 24],
       [ 5,  2,  1],
       [ 6,  5,  2]], dtype=int64)
"""

数据选择

Dataframe[‘A’]，选择一个名为”A“的列，产生一个Series
Dataframe[0:2]，对行进行切片
DataFrame.loc，通过标签或布尔数字访问行和列
DataFrame.iloc， 按照位置选择
DataFrame.iat，快速访问标量
布尔索引：
- DataFrame[DataFrame.A > 0]，通过单个列的值来选择数据
- DataFrame[DataFrame[“A”].isin([0,1]) 选取A列值为0或1的

例子代码：

a=pd.DataFrame({'A':[1,2,4,5,6],'B':[3,4,6,2,5],'C':[3,5,24,1,2]},index=["first","second","third","fourth","fifth"])
b=a["A"]
c=a[0:2]
d=a["first":"third"]
print(b,c,d)
"""
first     1
second    2
third     4
fourth    5
fifth     6
Name: A, dtype: int64

         A  B  C
first   1  3  3
second  2  4  5

        A  B   C
first   1  3   3
second  2  4   5
third   4  6  24
"""


"""进行切片查找"""
e=a.loc["first"]
f=a.loc[["first","second"]]
g=a.loc[:,"A"]
h=a.loc[:,["A","B"]]
i=a.loc[["first","fifth"],["A","C"]]
print(e,f,g,h,i)
"""
A    1
B    3
C    3
Name: first, dtype: int64         
		
		A  B  C
first   1  3  3
second  2  4  5 

first     1
second    2
third     4
fourth    5
fifth     6
Name: A, dtype: int64         
		
		A  B
first   1  3
second  2  4
third   4  6
fourth  5  2
fifth   6  5        
		
		A  C
first  1  3
fifth  6  2
"""


"""按照位置查找"""
j=a.iloc[3]  #选取第4行
k=a.iloc[3:4] #左开右闭，还是选取第4行
l=a.iloc[[1,3,4],[0,2]]
m=a.iloc[:,[1,2]]
"""
A    5
B    2
C    1
Name: fourth, dtype: int64         
		
		A  B  C
fourth  5  2  1         
		
		A  C
second  2  5
fourth  5  1
fifth   6  2         
		
		B   C
first   3   3
second  4   5
third   6  24
fourth  2   1
fifth   5   2
"""

n=a.iat[2,2]  #查找标量
o=a[a["A"].isin([0,1])]  #选取A列值为0或者1的
p=a[a.A>0]  #布尔值查找
"""
24        

		A  B  C
first  1  3  3

        A  B   C
first   1  3   3
second  2  4   5
third   4  6  24
fourth  5  2   1
fifth   6  5   2
"""

设定值

例子代码：

a=pd.DataFrame({'A':[1,2,4,5,6],'B':[3,4,6,2,5],'C':[3,5,24,1,2]},index=["first","second","third","fourth","fifth"])
print(a)
"""
        A  B   C
first   1  3   3
second  2  4   5
third   4  6  24
fourth  5  2   1
fifth   6  5   2"""

b=pd.Series([23,4,5,6,3],index=["second","third","fourth","fifth","six"])
a["D"]=b  #会自动根据index对齐数据，其余的扔掉
a.at["first","A"]=0   """按照标签设定值"""
a.iat[0,1]=0   """按照位置设定值"""
a.loc[:,"F"]=np.arange(5)  """使用numpy数组设定值"""
print(a)
"""
        A  B   C     D  F
first   0  0   3   NaN  0
second  2  4   5  23.0  1
third   4  6  24   4.0  2
fourth  5  2   1   5.0  3
fifth   6  5   2   6.0  4"""

缺少数据处理

pandas使用np.nan来表示缺失的数据
Dataframe.dropna(how=“any”)，删除任何缺少数据的行
Dataframe.fillna(value=5)，填写缺失的数据
pandas.isna(Dataframe)，判断是否为nan值，返回布尔值

例子代码：

a=pd.DataFrame([[12,23,52,np.nan],[23,np.nan,2,5],[32,5,3,2]],columns=["A","B","C","D"])
print(a)
"""
    A     B   C    D
0  12  23.0  52  NaN
1  23   NaN   2  5.0
2  32   5.0   3  2.0"""

b=a.dropna(how="any")  #等价于b=a.dropna()
c=a.fillna(value=100)
d=pd.isna(a)
print(b,c,d)
"""
 A    B  C    D
2  32  5.0  3  2.0         
		
		A  B  C
first   1  3  3
second  2  4  5        
	
	A      B      C      D
0  False  False  False   True
1  False   True  False  False
2  False  False  False  False"""

一些操作（通常排除丢失的数据）

执行描述性统计：对列：df.mean()、对行：df.mean(1)
将函数应用于数据（申请），如：df.apply(lambda x: x.max() - x.min())
直方图：Dataframe.value_counts()
- 查看更多信息：https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-discretization
字符串方法，如：Dataframe.str.lower()/upper()/len()…
- 查看更多字符串方法：https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-string-methods

例子代码：

a=pd.DataFrame([[23,3,5,2,6],[2,3,5,6,3],[34,6,5,7,4]])
b=a.mean()  #对列进行求均值
c=a.mean(1)	#对行进行求均值
d=a.apply(lambda x:x.max()-x.min()) #可以加上axis=参数来改变函数应用方向
print(b,c,d)
"""
A    19.666667
B     4.000000
C     5.000000
D     5.000000
E     4.333333
dtype: float64 

0     7.8
1     3.8
2    11.2
dtype: float64 

A    32
B     3
C     0
D     5
E     3
dtype: int64"""


e=pd.Series([234,23,5,2,7,4,5,5,6,7,76,6,3,3])
f=e.value_counts()  #直方图
print(f)
"""
5      3
6      2
3      2
7      2
76     1
234    1
23     1
4      1
2      1
dtype: int64"""

合并

pd.concat((Dataframe1/Series1,Dataframe2/Series2…)），将pandas对象（Series、Dataframe）连接在一起
- https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging
pd.merge()，SQL样式合并
- https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-join
df.append()，追加
- https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-concatenation

例子代码：

a=pd.DataFrame([[23,3,5,2,6],[2,3,5,6,3],[34,6,5,7,4]],columns=["A","B","C","D","E"])
b=a.loc[0:1,"A"]
c=a.iloc[0:2,[1,2]]
print(a,"\n\n",b,"\n\n",c)
"""
 	A  B  C  D  E
0  23  3  5  2  6
1   2  3  5  6  3
2  34  6  5  7  4 

0    23
1     2
Name: A, dtype: int64 
    
   B  C
0  3  5
1  3  5"""
print(a.append(a.iloc[1,:],ignore_index=True))
"""
    A  B  C  D  E
0  23  3  5  2  6
1   2  3  5  6  3
2  34  6  5  7  4
3   2  3  5  6  3"""
d1=pd.concat((b,c))
d2=pd.concat((b,c),axis=1)
print(d1,"\n\n",d2)
"""
      0    B    C
0  23.0  NaN  NaN
1   2.0  NaN  NaN
0   NaN  3.0  5.0
1   NaN  3.0  5.0 

    A  B  C
0  23  3  5
1   2  3  5"""


e1=pd.DataFrame({"key":["foo","foo"],"nba":[23,69]})
e2=pd.DataFrame({"key":["foo","foo"],"cba":[97,8]})
e=pd.merge(e1,e2,on="key")
f1=pd.DataFrame({"key":["foo","bar"],"nba":[23,69]})
f2=pd.DataFrame({"key":["foo","bar"],"cba":[97,8]})
f=pd.merge(f1,f2,on="key")
print(e,"\n\n",f)
"""
   key  nba  cba
0  foo   23   97
1  foo   23    8
2  foo   69   97
3  foo   69    8 

    key  nba  cba
0  foo   23   97
1  bar   69    8"""

分组

例子代码：

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
              	   'D': np.random.randn(8)})
df.groupby("A").sum()
"""
        C         D
A                      
bar -1.818018  0.920887
foo -1.083928  2.312676  """

df.groupby(["A","B"]).sum()
"""
              C         D
A   B                        
bar one    0.211772 -0.468906
    three -1.604844  0.910890
    two   -0.424946  0.478903
foo one   -1.435132  1.477868
    three -0.333705  0.011274
    two    0.684909  0.823534  """

数据透视表

例子代码：

a = pd.DataFrame({"A": ["A", "B", "C"] * 4,
				  "B": ["one", "two", "three", "four"] * 3,
				  "C": ["foo", "bar"] * 6,
                  "D": np.random.randn(12) * 10,
                  "E": np.random.randn(12) * 10})
b1 = pd.pivot_table(a, values=["D"], index=["A", "B"], columns="C")
b2 = pd.pivot_table(a, values=["D","E"], index=["A", "B"], columns="C")
print(a,"\n\n",b1,"\n\n",b2)
"""
A      B    C          D          E
0   A    one  foo  -4.914913  14.486079
1   B    two  bar  -8.004826 -18.946289
2   C  three  foo -11.676869 -16.178788
3   A   four  bar  23.877449  -4.409281
4   B    one  foo   6.695333  -4.126402
5   C    two  bar  11.015809 -16.389535
6   A  three  foo   3.195449   2.012969
7   B   four  bar  -0.299516   7.940081
8   C    one  foo  11.646412  12.938037
9   A    two  bar  -1.989591  10.057017
10  B  three  foo  -8.995876  -2.928453
11  C   four  bar  -2.779193 -12.932097 
                  D           
C              bar        foo
A B                          
A four   23.877449        NaN
  one          NaN  -4.914913
  three        NaN   3.195449
  two    -1.989591        NaN
B four   -0.299516        NaN
  one          NaN   6.695333
  three        NaN  -8.995876
  two    -8.004826        NaN
C four   -2.779193        NaN
  one          NaN  11.646412
  three        NaN -11.676869
  two    11.015809        NaN 
                  D                     E           
C              bar        foo        bar        foo
A B                                                
A four   23.877449        NaN  -4.409281        NaN
  one          NaN  -4.914913        NaN  14.486079
  three        NaN   3.195449        NaN   2.012969
  two    -1.989591        NaN  10.057017        NaN
B four   -0.299516        NaN   7.940081        NaN
  one          NaN   6.695333        NaN  -4.126402
  three        NaN  -8.995876        NaN  -2.928453
  two    -8.004826        NaN -18.946289        NaN
C four   -2.779193        NaN -12.932097        NaN
  one          NaN  11.646412        NaN  12.938037
  three        NaN -11.676869        NaN -16.178788
  two    11.015809        NaN -16.389535        NaN
"""

时间序列

例子代码：

a=pd.date_range("1/2/2019",periods=30,freq="D")
print(a)
DatetimeIndex(['2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05',
               '2019-01-06', '2019-01-07', '2019-01-08', '2019-01-09',
               '2019-01-10', '2019-01-11', '2019-01-12', '2019-01-13',
               '2019-01-14', '2019-01-15', '2019-01-16', '2019-01-17',
               '2019-01-18', '2019-01-19', '2019-01-20', '2019-01-21',
               '2019-01-22', '2019-01-23', '2019-01-24', '2019-01-25',
               '2019-01-26', '2019-01-27', '2019-01-28', '2019-01-29',
               '2019-01-30', '2019-01-31'],
              dtype='datetime64[ns]', freq='D')

数据读取/导出

pd.read_excel()
pd.read_csv()
Dataframe.to_excel("",sheet_name="")#导出excel
Dataframe.to_csv("") #导出csv