Data Analysis (2): pandas

Latest recommended article published 2020-11-29 04:10:22

License: CC 4.0 BY-SA

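This article describes how to use the two main data structures in the pandas library, Series and DataFrame, covering creation, indexing, slicing, and data alignment, and shows how to use them for efficient data processing.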

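Introduction

Data structures with automatic or explicit data alignment along an axis.
Integrated time-series functionality; the same data structures handle both time-series and non-time-series data.
Arithmetic operations and reductions (such as summing along an axis) can be performed according to the axis metadata (axis labels), with flexible handling of missing data.
Merges and other relational operations found in common databases (SQL-based ones, for example).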
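
The point about reductions and missing data can be illustrated with a minimal sketch. The small frame below is made-up data, not taken from this article; it assumes only the standard `DataFrame.sum` parameters `axis` and `skipna`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows labelled 'a' and 'b', with one missing value.
df = pd.DataFrame(
    {"x": [1.0, np.nan], "y": [2.0, 4.0]},
    index=["a", "b"],
)

col_sums = df.sum(axis=0)              # sum down each column; NaN is skipped
row_sums = df.sum(axis=1)              # sum across each row; NaN is skipped
strict = df.sum(axis=1, skipna=False)  # NaN propagates when skipna=False

print(col_sums)
print(row_sums)
print(strict)
```

By default the missing value is skipped, so row 'b' sums to 4.0; with `skipna=False` the same row yields NaN.
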
Usage

import numpy as np
from numpy import nan as NA
from pandas import Series, DataFrame, MultiIndex

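Data Structure: Series

A Series is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels, called its index.
The string representation of a Series shows the index on the left and the values on the right.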

>>> from pandas import Series
>>> obj = Series([4, 7, -5, 3])  # create a Series from a list
>>> obj
0    4
1    7
2   -5
3    3
dtype: int64
>>> obj.values
array([ 4,  7, -5,  3], dtype=int64)
>>> obj.index
RangeIndex(start=0, stop=4, step=1)
>>> obj2 = Series([4, 7, -5, 3], index = ['d', 'b', 'a', 'c'])  # specify the index of the Series
>>> obj2
d    4
b    7
a   -5
c    3
dtype: int64
>>> obj2['a']
-5
>>> obj2['d'] = 6
>>> obj2[['c', 'a', 'd']]
c    3
a   -5
d    6
dtype: int64
>>> obj2[obj2 > 0]  # select the elements greater than 0
d    6
b    7
c    3
dtype: int64
>>> 'b' in obj2  # check whether an index label exists
True
>>> 'e' in obj2
False
>>> sdata = {'Ohio':45000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}  # create a Series from a dict
>>> obj3 = Series(sdata)
>>> obj3
Ohio      45000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
>>> states = ['California', 'Ohio', 'Oregon', 'Texas']  # create a Series from a dict with an explicit index; labels without a match get NaN
>>> obj4 = Series(sdata, index = states)
>>> obj4
California      NaN
Ohio          45000
Oregon        16000
Texas         71000
dtype: float64
>>> obj3 + obj4  # adding Series aligns on the index; a label missing from either operand gives NaN
California       NaN
Ohio           90000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64
>>> obj4.name = 'population'  # name the Series and its index
>>> obj4.index.name = 'state'
>>> obj4
state
California      NaN
Ohio          45000
Oregon        16000
Texas         71000
Name: population, dtype: float64
>>> obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']  # replace the index in place
>>> obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
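
As a complement to the `obj3 + obj4` example above, here is a hedged sketch of how the NaN produced by a one-sided label can be avoided. It rebuilds the same two Series and uses `Series.add` with `fill_value`; treating a missing value as 0 is a choice made for this illustration, not something the article prescribes.

```python
import pandas as pd

sdata = {"Ohio": 45000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

plain = obj3 + obj4                    # a label missing on either side gives NaN
filled = obj3.add(obj4, fill_value=0)  # a value missing on only one side counts as 0

print(plain)
print(filled)
```

'Utah' exists only in `obj3`, so it survives in `filled` as 5000. 'California' is missing from `obj3` and is NaN in `obj4`, so it stays NaN: `fill_value` does not help when both sides are missing.
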

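Data Structure: DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, Boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.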

>>> data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],'year':[2000, 2001, 2002, 2001, 2002],'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}  # build a DataFrame from a dict; the keys become the column names
>>> DataFrame(data)
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
[5 rows x 3 columns]
>>> DataFrame(data, columns = ['year', 'state', 'pop'])  # specify the column order
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
[5 rows x 3 columns]
>>> frame2 = DataFrame(data,columns = ['year', 'state', 'pop', 'debt'],index = ['one', 'two', 'three', 'four', 'five'])  # specify the index; a column name not present in the data is filled with NaN
>>> frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
[5 rows x 4 columns]
>>> frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
>>> frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
>>> frame2.loc['three']  # select a row by label
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
>>> frame2['debt'] = 16.5  # assign a scalar to a whole column
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
[5 rows x 4 columns]
>>> frame2.debt = np.arange(5)  # assign a NumPy array to the column
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
[5 rows x 4 columns]
>>> val = Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])  # assigning a Series aligns on the index; labels it does not list get NaN
>>> frame2['debt'] = val
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
[5 rows x 4 columns]
>>> frame2['eastern'] = (frame2.state == 'Ohio')  # assign to a new column: True where state equals 'Ohio'
>>> frame2
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False
[5 rows x 5 columns]
>>> frame2.columns
Index(['year', 'state', 'pop', 'debt', 'eastern'], dtype='object')
>>> pop = {'Nevada':{2001:2.4, 2002:2.9},'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}  # nested dict: outer keys become columns, inner keys become the row index; transposed below
>>> frame3 = DataFrame(pop)
>>> frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
[3 rows x 2 columns]
>>> frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
[2 rows x 3 columns]
>>> DataFrame(pop, index = [2001, 2002, 2003])  # specify the index explicitly; the next example initializes the data from slices
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
[3 rows x 2 columns]
>>> pdata = {'Ohio':frame3['Ohio'][:-1], 'Nevada':frame3['Nevada'][:2]}
>>> DataFrame(pdata)
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
[2 rows x 2 columns]
>>> frame3.index.name = 'year'  # name the index and the columns
>>> frame3.columns.name = 'state'
>>> frame3
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
[3 rows x 2 columns]
>>> frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
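
The description of a DataFrame as a dict of Series sharing one index can be made concrete with a short sketch. It reuses the population figures from the `pop` example above; the variable names `ohio` and `nevada` are introduced here only for illustration.

```python
import pandas as pd

# Two Series with different, overlapping indexes.
ohio = pd.Series({2000: 1.5, 2001: 1.7, 2002: 3.6})
nevada = pd.Series({2001: 2.4, 2002: 2.9})

# Building a DataFrame from a dict of Series aligns them on the union
# of their indexes; positions with no data are filled with NaN.
frame = pd.DataFrame({"Nevada": nevada, "Ohio": ohio})

print(frame)
print(type(frame["Ohio"]))                        # each column is a Series
print(frame.index.equals(frame["Nevada"].index))  # True: columns share the frame's index
```

Nevada has no value for 2000, so that cell is NaN, while every column pulled back out of the frame carries the same row index.
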

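Basic Functionality

Reindexing
Reindexing creates a new object that conforms to a new index. Calling reindex on a Series rearranges the data according to the new index; if an index value is not currently present, a missing value is introduced. For ordered data such as time series, reindexing may need to interpolate or fill values, which the method option provides.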

>>> obj = Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])  # a Series to be reindexed with new labels and a new order
>>> obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> obj2 = obj.reindex(['a', 'b', 'd', 'c', 'e'])
>>> obj2
a   -5.3
b    7.2
d    4.5
c    3.6
e    NaN
dtype: float64
>>> obj.reindex(['a', 'b', 'd', 'c', 'e'], fill_value = 0)  # specify the value used for labels that do not exist
a   -5.3
b    7.2
d    4.5
c    3.6
e    0.0
dtype: float64
>>> obj3 = Series(['blue', 'purple', 'yellow'], index = [0, 2, 4])  # reindex with a fill method
>>> obj3
0      blue
2    purple
4    yellow
dtype: object
>>> obj3.reindex(range(6), method = 'ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
>>> frame = DataFrame(np.arange(9).reshape(3, 3),index = ['a', 'c', 'd'],columns = ['Ohio', 'Texas', 'California'])  # reindexing a DataFrame
>>> frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
[3 rows x 3 columns]
>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
>>> frame2
   Ohio  Texas  California
a     0      1           2
b   NaN    NaN         NaN
c     3      4           5
d     6      7           8
[4 rows x 3 columns]
>>> states = ['Texas', 'Utah', 'California']  # reindex the columns
>>> frame.reindex(columns = states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
[3 rows x 3 columns]
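>>> frame.reindex(index = ['a', 'b', 'c', 'd'], method = 'ffill').reindex(columns = states)  # forward-fill the new rows, then reindex the columns (the fill method needs a sorted axis, so the two steps are separate)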
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8
[4 rows x 3 columns]
>>> frame.reindex(index = ['a', 'b', 'd', 'c'], columns = states)  # frame itself is unchanged; labels it lacks become NaN
   Texas  Utah  California
a      1   NaN           2
b    NaN   NaN         NaN
d      7   NaN           8
c      4   NaN           5
[4 rows x 3 columns]
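
The forward-fill behaviour of `reindex` can be bounded with the `limit` parameter. The sketch below reuses the colour Series from the example above; extending the target range to 8 positions is an assumption made here to show where filling stops.

```python
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Forward fill: each new label takes the value of the closest earlier label.
filled = obj3.reindex(range(6), method="ffill")

# limit=1 fills at most one position past each existing label,
# so positions 6 and 7 (two or more steps after label 4) stay missing.
capped = obj3.reindex(range(8), method="ffill", limit=1)

print(filled)
print(capped)
print(obj3)  # the original Series is unchanged
```

`reindex` always returns a new object, so `obj3` still has only its three original labels afterwards.
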

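Dropping Entries from an Axis
Dropping one or more entries from an axis is simple as long as you have an index array or list; the original DataFrame or Series is left unchanged. Because it involves some data munging and set logic, the drop method returns a new object with the specified values removed from the given axis.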

>>> obj = Series(np.arange(5.), index = ['a', 'b', 'c', 'd', 'e'])  # drop elements from a Series by index label
>>> new_obj = obj.drop('c')
>>> new_obj
a    0
b    1
d    3
e    4
dtype: float64
>>> obj.drop(['d', 'c'])
a    0
b    1
e    4
dtype: float64
>>> data = DataFrame(np.arange(16).reshape((4, 4)),index = ['Ohio', 'Colorado', 'Utah', 'New York'],columns = ['one', 'two', 'three', 'four'])  # drop from a DataFrame by row label or by column
>>> data.drop(['Colorado', 'Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
[2 rows x 4 columns]
>>> data.drop('two', axis = 1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
[4 rows x 3 columns]
>>> data.drop(['two', 'four'], axis = 1)
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
[4 rows x 2 columns]
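
The same drops can also be written with the `index=` and `columns=` keywords instead of `axis`. The following sketch rebuilds the frame from the example above; the label 'missing' is a deliberately nonexistent column used only to show `errors="ignore"`.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

no_rows = data.drop(index=["Colorado", "Ohio"])  # equivalent to data.drop([...])
no_cols = data.drop(columns=["two", "four"])     # equivalent to axis=1

# By default an unknown label raises KeyError; errors="ignore" skips it.
lenient = data.drop(columns=["two", "missing"], errors="ignore")

print(no_rows)
print(no_cols)
print(lenient)
print(data.shape)  # still (4, 4): drop returned new objects
```
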

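Indexing, Selection, and Filtering
1, Series indexing (obj[…]) works like NumPy array indexing, except that the index values of a Series need not be integers.
2, Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive.
3, Indexing a DataFrame retrieves one or more columns.
4, For label-based indexing on the rows of a DataFrame, use the special indexer loc (and iloc for integer positions); these replace the older ix indexer, which was removed in pandas 1.0.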

>>> obj = Series(np.arange(4.), index = ['a', 'b', 'c', 'd'])  # Series indexing by label or by integer position
>>> obj['b']
1.0
>>> obj.iloc[3]  # positional access
3.0
>>> obj.iloc[[1, 3]]
b    1
d    3
dtype: float64
>>> obj[obj < 2]
a    0
b    1
dtype: float64
>>> obj['b':'c']  # label slicing on a Series; the end label is included
b    1
c    2
dtype: float64
>>> obj['b':'c'] = 5
>>> obj
a    0
b    5
c    5
d    3
dtype: float64
>>> data = DataFrame(np.arange(16).reshape((4, 4)),index = ['Ohio', 'Colorado', 'Utah', 'New York'],columns = ['one', 'two', 'three', 'four'])  # DataFrame indexing
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[4 rows x 4 columns]
>>> data['two']  # select one column
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
>>> data[['three', 'one']]
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
[4 rows x 2 columns]
>>> data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
[2 rows x 4 columns]
>>> data.loc['Colorado', ['two', 'three']]  # select by row label and column labels
two      5
three    6
Name: Colorado, dtype: int32
>>> data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]  # row labels combined with column positions
          four  one  two
Colorado     7    4    5
Utah        11    8    9
[2 rows x 3 columns]
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[4 rows x 4 columns]
>>> data.iloc[2]  # the row at position 2 (counting from 0)
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32
>>> data.loc[:'Utah', 'two']  # rows from the start through 'Utah', column 'two'
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32
>>> data[data.three > 5]  # select rows by a condition
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[3 rows x 4 columns]
>>> data < 5  # elementwise comparison returns a Boolean DataFrame
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
[4 rows x 4 columns]
>>> data[data < 5] = 0
>>> data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[4 rows x 4 columns]
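
The difference between inclusive label slices and exclusive positional slices is easy to miss, so here is a small sketch contrasting the two on the same frame. It assumes only the standard `loc` and `iloc` indexers; the variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

by_label = data.loc[:"Utah", "two"]  # label slice: 'Utah' itself is included
by_pos = data.iloc[:2, 1]            # positional slice: stops before position 2

# A Boolean mask on the rows can be combined with a column selection in loc.
masked = data.loc[data["three"] > 5, ["one", "four"]]

print(by_label)
print(by_pos)
print(masked)
```

`by_label` has three rows (Ohio, Colorado, Utah), while `by_pos` has two (Ohio, Colorado), even though both slices look like they end "at the third row".
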