Data Analysis (2): pandas

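This article walks through the two main pandas data structures, Series and DataFrame: how to create, index, and slice them, how data alignment works, and how they support efficient data processing.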


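Overview
  1. Data structures with automatic or explicit data alignment along an axis.
  2. Integrated time-series functionality: the same data structures handle both time-series and non-time-series data.
  3. Arithmetic operations and reductions (such as summing along an axis) are driven by metadata (the axis number), and missing data is handled flexibly.
  4. Merges and other relational operations found in common databases (for example, SQL-based ones).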
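
As a rough illustration of points 3 and 4, the sketch below sums along each axis of a small frame that contains missing values and then performs a SQL-style join. The frames, column names, and values (`sales`, `left`, `right`) are invented for this example and are not part of the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
sales = pd.DataFrame(
    {"q1": [1.0, np.nan, 3.0], "q2": [4.0, 5.0, np.nan]},
    index=["east", "west", "north"],
)

# Reductions take an axis and skip missing values by default (skipna=True).
col_totals = sales.sum(axis=0)   # one total per column
row_totals = sales.sum(axis=1)   # one total per row

# A SQL-style inner join on a shared key column.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})
joined = pd.merge(left, right, on="key", how="inner")

print(col_totals)
print(row_totals)
print(joined)
```

Only the keys present in both frames (`b` and `c`) survive the inner join; passing `how="outer"` instead would keep every key and fill the gaps with NaN.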
Usage
import numpy as np
from numpy import nan as NA
from pandas import Series, DataFrame, MultiIndex
Data structure: Series

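A Series is a one-dimensional, array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels, called its index.
The string representation of a Series shows the index on the left and the values on the right.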

>>> from pandas import Series
>>> obj = Series([4, 7, -5, 3])  # build a Series from a list
>>> obj
0    4
1    7
2   -5
3    3
dtype: int64
>>> obj.values
array([ 4,  7, -5,  3], dtype=int64)
>>> obj.index
Int64Index([0, 1, 2, 3], dtype='int64')
>>> obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])  # specify the index explicitly
>>> obj2
d    4
b    7
a   -5
c    3
dtype: int64
>>> obj2['a']
-5
>>> obj2['d'] = 6
>>> obj2[['c', 'a', 'd']]
c    3
a   -5
d    6
dtype: int64
>>> obj2[obj2 > 0]  # select the elements greater than 0
d    6
b    7
c    3
dtype: int64
>>> 'b' in obj2  # test whether a label is in the index
True
>>> 'e' in obj2
False
>>> sdata = {'Ohio': 45000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}  # build a Series from a dict
>>> obj3 = Series(sdata)
>>> obj3
Ohio      45000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
>>> # Build a Series from a dict with an explicit index; labels missing from the dict become NaN.
>>> states = ['California', 'Ohio', 'Oregon', 'Texas']
>>> obj4 = Series(sdata, index = states)
>>> obj4
California      NaN
Ohio          45000
Oregon        16000
Texas         71000
dtype: float64
>>> obj3 + obj4  # addition aligns on labels; a label present in only one operand gives NaN
California       NaN
Ohio           90000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64
>>> obj4.name = 'population'  # name the Series and its index
>>> obj4.index.name = 'state'
>>> obj4
state
California      NaN
Ohio          45000
Oregon        16000
Texas         71000
Name: population, dtype: float64
>>> obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']  # replace the index in place
>>> obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
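
The NaN results produced by adding Series with partly different labels are not always what you want. As a minimal sketch, `Series.add` with `fill_value` treats a label missing from one operand as a given value instead. The figure for California below is invented purely for illustration.

```python
import pandas as pd

a = pd.Series({"Ohio": 45000, "Texas": 71000, "Utah": 5000})
# The California value is a made-up number for this example.
b = pd.Series({"Ohio": 45000, "Texas": 71000, "California": 10000})

# Plain + aligns on labels; labels present on only one side give NaN.
plain = a + b

# add() with fill_value treats a label missing on one side as 0 instead.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

A label that is missing from both operands would still produce NaN; `fill_value` only covers the one-sided case.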
Data structure: DataFrame

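A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.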
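
To make the "dict of Series sharing one index" view concrete, here is a small sketch. The Series names and the `area` values are assumptions made up for this example.

```python
import pandas as pd

# Two Series with partly different labels (illustrative values).
pop = pd.Series([1.5, 1.7, 3.6], index=[2000, 2001, 2002])
area = pd.Series([10, 20], index=[2001, 2002])

# A dict of Series becomes a DataFrame: the keys are the column names and
# the row index is the union of the Series indexes.
frame = pd.DataFrame({"pop": pop, "area": area})

column = frame["pop"]                          # each column comes back as a Series
same_index = column.index.equals(frame.index)  # and it carries the shared row index

print(frame)
print(same_index)
```

The `area` column has no entry for 2000, so that cell is NaN after alignment.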

>>> # Build a DataFrame from a dict; the keys become the column names.
>>> data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
...         'year': [2000, 2001, 2002, 2001, 2002],
...         'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
>>> DataFrame(data)
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
[5 rows x 3 columns]
>>> DataFrame(data, columns=['year', 'state', 'pop'])  # specify the column order
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
[5 rows x 3 columns]
>>> # Specify the index; a requested column that is not in the data is filled with NaN.
>>> frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
...                    index=['one', 'two', 'three', 'four', 'five'])
>>> frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
[5 rows x 4 columns]
>>> frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
>>> frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
>>> frame2.loc['three']  # select a row by label (.ix, used by older pandas, was removed in 1.0)
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
>>> frame2['debt'] = 16.5  # assign a scalar to the whole column
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
[5 rows x 4 columns]
>>> frame2.debt = np.arange(5)  # assign a NumPy array to the column
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
[5 rows x 4 columns]
>>> # Assign a Series: values are matched by index label, and rows without a match get NaN.
>>> val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
>>> frame2['debt'] = val
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
[5 rows x 4 columns]
>>> frame2['eastern'] = (frame2.state == 'Ohio')  # create a new column: True where state is 'Ohio'
>>> frame2
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False
[5 rows x 5 columns]
>>> frame2.columns
Index([u'year', u'state', u'pop', u'debt', u'eastern'], dtype='object')
>>> # Build a DataFrame from a nested dict: outer keys become columns, inner keys become row labels.
>>> pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
>>> frame3 = DataFrame(pop)
>>> frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
[3 rows x 2 columns]
>>> frame3.T  # transpose the DataFrame
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
[2 rows x 3 columns]
>>> DataFrame(pop, index=[2001, 2002, 2003])  # specify the row index explicitly
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
[3 rows x 2 columns]
>>> pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}  # initialize from slices of existing columns
>>> DataFrame(pdata)
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
[2 rows x 2 columns]
>>> frame3.index.name = 'year'  # name the index and the columns
>>> frame3.columns.name = 'state'
>>> frame3
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
[3 rows x 2 columns]
>>> frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
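Basic functionality

Reindexing
reindex creates a new object that conforms to a new index. Calling reindex on a Series rearranges the data according to the new index, and any index value that was not already present is introduced with a missing value. For ordered data such as time series, you may want to interpolate or fill values when reindexing; the method option does this.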

>>> obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])  # a Series to reindex with new labels in a new order
>>> obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> obj2 = obj.reindex(['a', 'b', 'd', 'c', 'e'])
>>> obj2
a   -5.3
b    7.2
d    4.5
c    3.6
e    NaN
dtype: float64
>>> obj.reindex(['a', 'b', 'd', 'c', 'e'], fill_value=0)  # specify the value used for labels that did not exist
a   -5.3
b    7.2
d    4.5
c    3.6
e    0.0
dtype: float64
>>> obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])  # a Series to reindex with a fill method
>>> obj3
0      blue
2    purple
4    yellow
dtype: object
>>> obj3.reindex(range(6), method = 'ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
>>> # Reindex a DataFrame.
>>> frame = DataFrame(np.arange(9).reshape(3, 3), index=['a', 'c', 'd'],
...                   columns=['Ohio', 'Texas', 'California'])
>>> frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
[3 rows x 3 columns]
>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
>>> frame2
   Ohio  Texas  California
a     0      1           2
b   NaN    NaN         NaN
c     3      4           5
d     6      7           8
[4 rows x 3 columns]
>>> states = ['Texas', 'Utah', 'California']  # reindex the columns
>>> frame.reindex(columns = states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
[3 rows x 3 columns]
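>>> # Reindex the rows with forward fill, then select the columns. Passing method='ffill'
>>> # together with columns=states can raise ValueError in recent pandas, because the fill
>>> # method is also applied to the column axis, whose labels are not sorted.
>>> frame.reindex(['a', 'b', 'c', 'd'], method='ffill').reindex(columns=states)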
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8
[4 rows x 3 columns]
>>> frame.reindex(index=['a', 'b', 'd', 'c'], columns=states)  # frame itself was not changed by the calls above
   Texas  Utah  California
a      1   NaN           2
b    NaN   NaN         NaN
d      7   NaN           8
c      4   NaN           5
[4 rows x 3 columns]
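
The text above mentions that filling during reindexing is mainly useful for ordered data such as time series. A minimal sketch of that case follows; the dates and values are invented for illustration.

```python
import pandas as pd

# Observations on three non-consecutive days (illustrative values).
ts = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-06"]),
)

# Reindex onto every calendar day, carrying the last observation forward.
daily = pd.date_range("2024-01-01", "2024-01-06", freq="D")
filled = ts.reindex(daily, method="ffill")

# Without a method, the newly introduced dates are simply NaN.
sparse = ts.reindex(daily)

print(filled)
print(sparse)
```

Forward fill requires the existing index to be sorted, which is naturally the case for a chronological series.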

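Dropping entries from an axis
Dropping one or more entries from an axis is easy if you have an index array or list. Because it requires some data munging and set logic, drop returns a new object with the specified values removed from the given axis; the original Series or DataFrame is not modified.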

>>> obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])  # drop elements from a Series by label
>>> new_obj = obj.drop('c')
>>> new_obj
a    0
b    1
d    3
e    4
dtype: float64
>>> obj.drop(['d', 'c'])
a    0
b    1
e    4
dtype: float64
>>> # Drop from a DataFrame by row label (the default) or by column with axis=1.
>>> data = DataFrame(np.arange(16).reshape((4, 4)),
...                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
...                  columns=['one', 'two', 'three', 'four'])
>>> data.drop(['Colorado', 'Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
[2 rows x 4 columns]
>>> data.drop('two', axis = 1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
[4 rows x 3 columns]
>>> data.drop(['two', 'four'], axis = 1)
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
[4 rows x 2 columns]
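
As a small sketch of the point that drop returns a new object, the example below uses the `index=` and `columns=` keywords, which are an alternative spelling of `axis=0` and `axis=1`, and then checks that the original frame is untouched.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# drop returns a new object; index= and columns= name the axis explicitly.
fewer_rows = data.drop(index=["Colorado", "Ohio"])
fewer_cols = data.drop(columns=["two", "four"])

# The original frame still has all four rows and all four columns.
print(data.shape)
print(fewer_rows)
print(fewer_cols)
```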

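Indexing, selection, and filtering
1. Series indexing (obj[...]) works much like NumPy array indexing, except that the index values do not have to be integers.
2. Slicing with labels differs from ordinary Python slicing: the end point is inclusive.
3. Indexing a DataFrame with [] retrieves one or more columns.
4. For label-based indexing on the rows of a DataFrame, older pandas introduced the special indexing field ix. It was removed in pandas 1.0; use loc for labels and iloc for integer positions.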
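
The difference between label-based and position-based selection can be sketched with `loc` and `iloc`. The data mirrors the frames used in this section; the variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

by_label = obj.loc["b":"c"]     # label slice: the end point is included
by_position = obj.iloc[1:2]     # positional slice: the end point is excluded

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

row_by_label = data.loc["Utah"]                # one row, selected by label
row_by_position = data.iloc[2]                 # the same row, selected by position
block = data.loc[:"Utah", ["two", "three"]]    # rows through Utah, two columns

print(by_label)
print(by_position)
print(block)
```

Keeping labels with `loc` and positions with `iloc` avoids the ambiguity that arises when an index itself contains integers.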

>>> obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])  # index a Series by label or by position
>>> obj['b']
1.0
>>> obj.iloc[3]  # by position; plain obj[3] on a labelled index is deprecated in recent pandas
3.0
>>> obj.iloc[[1, 3]]
b    1
d    3
dtype: float64
>>> obj[obj < 2]
a    0
b    1
dtype: float64
>>> obj['b':'c']  # label slice of a Series: both end points are included
b    1
c    2
dtype: float64
>>> obj['b':'c'] = 5
>>> obj
a    0
b    5
c    5
d    3
dtype: float64
>>> # Indexing a DataFrame.
>>> data = DataFrame(np.arange(16).reshape((4, 4)),
...                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
...                  columns=['one', 'two', 'three', 'four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[4 rows x 4 columns]
>>> data['two']  # select a column
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
>>> data[['three', 'one']]
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
[4 rows x 2 columns]
>>> data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
[2 rows x 4 columns]
>>> data.loc['Colorado', ['two', 'three']]  # select by row label and column labels
two      5
three    6
Name: Colorado, dtype: int32
>>> data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]  # rows by label, columns by position
          four  one  two
Colorado     7    4    5
Utah        11    8    9
[2 rows x 3 columns]
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[4 rows x 4 columns]
>>> data.iloc[2]  # the row at position 2 (counting from 0)
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32
>>> data.loc[:'Utah', 'two']  # rows from the start through Utah (inclusive), column 'two'
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32
>>> data[data.three > 5]  # select rows by a condition
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[3 rows x 4 columns]
>>> data < 5  # element-wise comparison gives a boolean DataFrame
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
[4 rows x 4 columns]
>>> data[data < 5] = 0
>>> data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[4 rows x 4 columns]
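
Assigning through a boolean mask modifies the frame in place. When the original should be kept, one option is `DataFrame.where`, sketched below on the same data; the name `clipped` is just a label chosen for this example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# where keeps values where the condition holds and substitutes 0 elsewhere,
# returning a new frame instead of modifying data in place.
clipped = data.where(data >= 5, 0)

print(clipped)
print(data.loc["Ohio", "two"])   # the original value is still there
```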