Python data analysis: pandas data structures and operations

Licensed under CC 4.0 BY-SA

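This article introduces the two core data structures of the pandas library, Series and DataFrame. It explains how to create them, including how to use indexes and handle missing values, and shows how to build them from dictionaries and lists.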

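pandas has two core data structures: Series and DataFrame. This post walks through both of them.
A Series resembles a one-dimensional array and also a dictionary: it consists of a sequence of values and an associated index.
We can create a Series simply by passing in some data. If no index is given, pandas assigns the default integer index 0, 1, 2, and so on.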

In [1]: import pandas as pd
   ...: from pandas import Series, DataFrame

In [2]: test=Series([1,2,3,4])

In [3]: test
Out[3]: 
0    1
1    2
2    3
3    4
dtype: int64
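
The two parts of a Series described above (the values and the index) can be inspected separately. The following is a minimal sketch using the same data as `test`; the variable names `values` and `index` are chosen here for illustration only.

```python
import pandas as pd

# Same data as `test` in the transcript above.
test = pd.Series([1, 2, 3, 4])

# The underlying data as a NumPy array.
values = test.values

# The default index is a RangeIndex counting from 0.
index = test.index

print(values)
print(index)
```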

The index does not have to consist of integers; we can supply our own labels when creating the Series.

In [4]: test2=Series([1,2,3,4],index=['a','b','c','d'])

In [5]: test2
Out[5]: 
a    1
b    2
c    3
d    4
dtype: int64

In [6]: test2['a']
Out[6]: 1

In [7]: test2.index
Out[7]: Index(['a', 'b', 'c', 'd'], dtype='object')
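
Beyond looking up a single label, a labeled Series supports a few other dictionary-like operations. The sketch below reuses the `test2` data; the names `subset`, `has_a`, and `has_z` are illustrative and not part of the original post.

```python
import pandas as pd

# Same data as `test2` in the transcript above.
test2 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# A list of labels returns a sub-Series in the requested order.
subset = test2[["c", "a"]]

# Membership tests check the index labels, as `in` does for dict keys.
has_a = "a" in test2
has_z = "z" in test2

# Assigning through a label updates the stored value.
test2["d"] = 40

print(subset)
print(has_a, has_z)
print(test2["d"])
```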

Values are looked up by label much as in a dictionary, and many of the operations we learned for NumPy arrays also work on Series objects.
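
As a rough illustration of the NumPy-style operations mentioned above, the sketch below applies a boolean mask, elementwise arithmetic, and a NumPy ufunc to the `test2` data. The result names (`big`, `doubled`, `roots`) are assumptions made for this example.

```python
import numpy as np
import pandas as pd

test2 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Boolean-mask filtering keeps only the entries greater than 2.
big = test2[test2 > 2]

# Arithmetic is applied elementwise, and the labels are preserved.
doubled = test2 * 2

# NumPy ufuncs accept a Series and return a Series with the same index.
roots = np.sqrt(test2)

print(big)
print(doubled)
print(roots)
```

In each case the index travels with the values, which is the main difference from working with a bare NumPy array.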

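Because the mapping works the same way, a Python dictionary can be passed directly to create a Series; its keys become the index. The output below comes from an older pandas release that sorts the keys, whereas recent releases keep the dictionary's insertion order.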

In [12]: height={'tom':170,'david':175,'harry':180,'mary':170}

In [13]: test3=Series(height)

In [14]: test3
Out[14]: 
david    175
harry    180
mary     170
tom      170
dtype: int64

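We can also pass a list of labels as the index. Only the dictionary entries whose keys appear in the list are kept, and a label with no matching key (here 'jack') gets NaN, which also changes the dtype to float64.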

In [16]: name=['tom','david','harry','jack']

In [17]: test4=Series(height,index=name)

In [18]: test4
Out[18]: 
tom      170.0
david    175.0
harry    180.0
jack       NaN
dtype: float64
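
One consequence of labeled indexes is that arithmetic between two Series aligns on labels rather than on positions. The sketch below rebuilds `test3` and `test4` from the same data and adds them; the name `total` is introduced here purely for illustration.

```python
import pandas as pd

height = {"tom": 170, "david": 175, "harry": 180, "mary": 170}
name = ["tom", "david", "harry", "jack"]

test3 = pd.Series(height)              # index: the dict keys
test4 = pd.Series(height, index=name)  # "mary" dropped, "jack" is NaN

# Values are matched by label. A label missing on either side,
# or holding NaN on either side, yields NaN in the result.
total = test3 + test4

print(total)
```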

Missing values can be detected with the pandas functions isnull and notnull.

In [19]: pd.isnull(test4)
Out[19]: 
tom      False
david    False
harry    False
jack      True
dtype: bool

In [20]: pd.notnull(test4)
Out[20]: 
tom       True
david     True
harry     True
jack     False
dtype: bool
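
Once missing values have been detected, they are usually counted, dropped, or filled. The following sketch shows one way to do each with the `test4` data. Filling with the mean is only an example strategy chosen here, not something the original post prescribes.

```python
import pandas as pd

height = {"tom": 170, "david": 175, "harry": 180, "mary": 170}
test4 = pd.Series(height, index=["tom", "david", "harry", "jack"])

# Method form of the check; a boolean Series sums to the number of True values.
n_missing = int(test4.isna().sum())

# Drop the entries that are missing.
dropped = test4.dropna()

# Or fill them, here with the mean of the known values (NaN is skipped by mean()).
filled = test4.fillna(test4.mean())

print(n_missing)
print(dropped)
print(filled)
```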

Both the Series object and its index have a name attribute, which makes the result look a little like a labeled column in an Excel sheet.

In [21]: test3.name='height'

In [22]: test3.index.name='Name'

In [23]: test3
Out[23]: 
Name
david    175
harry    180
mary     170
tom      170
Name: height, dtype: int64

The index can also be replaced by assignment; the new list must have the same length as the Series.

In [24]: test3.index=['bob','bob','bob','bob']

In [25]: test3
Out[25]: 
bob    175
bob    180
bob    170
bob    170
Name: height, dtype: int64
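
A few side effects of assigning a new index are easy to miss: the old index name is discarded, duplicate labels are allowed, and a list of the wrong length is rejected. The sketch below demonstrates these under the same data; the names `index_name_after`, `bobs`, and `length_error` are illustrative only.

```python
import pandas as pd

test3 = pd.Series({"tom": 170, "david": 175, "harry": 180, "mary": 170},
                  name="height")
test3.index.name = "Name"

# Replacing the index with a plain list discards the old index name.
test3.index = ["bob", "bob", "bob", "bob"]
index_name_after = test3.index.name

# With duplicate labels, a label lookup returns every matching entry as a Series.
bobs = test3["bob"]

# The new index must have exactly one label per value.
try:
    test3.index = ["only", "three", "labels"]
    length_error = False
except ValueError:
    length_error = True

print(index_name_after)
print(len(bobs))
print(length_error)
```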

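A DataFrame is the table-like data structure in pandas. It has an ordered set of columns, each of which can hold a different value type, and it can be thought of as a collection of Series sharing one index, storing data in two dimensions.
There are many ways to build a DataFrame; one of the most common is from a dictionary of equal-length lists. The older pandas release used here sorts the columns by name, whereas recent releases keep the dictionary's key order.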

In [3]: data={'name':['tom','harry','jack','mary'],'height':['176','170','180','160'],'weight':['100','80','90','50']}

In [4]: frame=DataFrame(data)

In [5]: frame
Out[5]: 
  height   name weight
0    176    tom    100
1    170  harry     80
2    180   jack     90
3    160   mary     50

We can also fix the column order with the columns argument and set the row labels with index.

In [7]: DataFrame(data,columns=['name','height','weight'],index=['first','second','third','four'])
Out[7]: 
         name height weight
first     tom    176    100
second  harry    170     80
third    jack    180     90
four     mary    160     50

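If a requested column has no corresponding entry in the data, it is filled with NA values, just as with a Series.
When a column sequence is given, the columns appear in exactly that order.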
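
To make the NA behaviour concrete, the sketch below requests a column that the data does not contain. The column name "age" and the variable `frame2` are hypothetical and were chosen only for this example.

```python
import pandas as pd

data = {
    "name": ["tom", "harry", "jack", "mary"],
    "height": ["176", "170", "180", "160"],
    "weight": ["100", "80", "90", "50"],
}

# "age" is a hypothetical column that does not appear in `data`,
# so every row of that column is filled with a missing value.
frame2 = pd.DataFrame(
    data,
    columns=["name", "height", "weight", "age"],
    index=["first", "second", "third", "four"],
)

print(frame2)
```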

Indexing a DataFrame by column name returns that column as a Series.

In [9]: frame=DataFrame(data,columns=['name','height','weight'],index=['first','second','third','four'])

In [10]: frame['name']
Out[10]: 
first       tom
second    harry
third      jack
four       mary
Name: name, dtype: object
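
Columns can also be assigned, not just read. Since the height and weight values in `data` are stored as strings, a natural follow-up is to convert them to numbers before computing anything. The sketch below does this with `pd.to_numeric`; the derived column "tall" and its 170 threshold are hypothetical.

```python
import pandas as pd

data = {
    "name": ["tom", "harry", "jack", "mary"],
    "height": ["176", "170", "180", "160"],
    "weight": ["100", "80", "90", "50"],
}
frame = pd.DataFrame(data, columns=["name", "height", "weight"],
                     index=["first", "second", "third", "four"])

# Assigning to an existing column name replaces that column.
frame["height"] = pd.to_numeric(frame["height"])
frame["weight"] = pd.to_numeric(frame["weight"])

# Assigning to a new name adds a column ("tall" is a made-up example).
frame["tall"] = frame["height"] > 170

print(frame)
```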

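The label-based indexer loc returns an entire row. (The ix indexer used in older tutorials was deprecated in pandas 0.20 and removed in 1.0.)

In [12]: frame.loc['third']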
Out[12]: 
name      jack
height     180
weight      90
Name: third, dtype: object
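
Row selection comes in a label-based and a position-based flavour. The sketch below contrasts `loc` and `iloc` on the same frame; the result names are illustrative only.

```python
import pandas as pd

data = {
    "name": ["tom", "harry", "jack", "mary"],
    "height": ["176", "170", "180", "160"],
    "weight": ["100", "80", "90", "50"],
}
frame = pd.DataFrame(data, columns=["name", "height", "weight"],
                     index=["first", "second", "third", "four"])

# Select a row by its label.
row_by_label = frame.loc["third"]

# Select the same row by its integer position (counting from 0).
row_by_position = frame.iloc[2]

# A row label plus a column label selects a single cell.
cell = frame.loc["third", "name"]

# Lists of labels select a sub-table.
subset = frame.loc[["first", "third"], ["name", "height"]]

print(row_by_label)
print(cell)
print(subset)
```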

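Another common form of data is a nested dictionary. When it is passed to DataFrame, the outer keys become the columns and the inner keys become the row index.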

In [13]: pop={'Nevada':{2001:2.4,2002:2.9},'ohio':{2000:1.5,2001:1.7,2002:3.6}}

In [14]: frame=DataFrame(pop)

In [15]: frame
Out[15]: 
      Nevada  ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

The result can be transposed, which swaps rows and columns.

In [16]: frame.T
Out[16]: 
        2000  2001  2002
Nevada   NaN   2.4   2.9
ohio     1.5   1.7   3.6

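The keys of the inner dictionaries are combined (and, in the output above, sorted) to form the row index. If an index is specified explicitly, that index is used as given instead.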

In [18]: DataFrame(pop,index=[2001,2002,2003])
Out[18]: 
      Nevada  ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
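
The effect of an explicit index can be verified directly: labels that are not requested disappear, and requested labels with no data become all-NaN rows. A minimal sketch with the same `pop` data follows, where `restricted` is a name chosen for this example.

```python
import pandas as pd

pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

# Only the requested years are kept, in the order given.
restricted = pd.DataFrame(pop, index=[2001, 2002, 2003])

# 2000 was not requested, so it is absent; 2003 has no data, so it is all NaN.
has_2000 = 2000 in restricted.index
all_nan_2003 = restricted.loc[2003].isna().all()

print(restricted)
print(has_2000, all_nan_2003)
```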