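Introduction
- Data structures with labeled axes that support automatic or explicit data alignment.
- Integrated time series functionality: the same data structures handle both time series and non-time series data.
- Arithmetic operations and reductions (such as summing along an axis) that are carried out according to the metadata (axis labels), with flexible handling of missing data.
- Merges and other relational operations found in common databases (SQL-based ones, for example).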
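As a rough illustration of the features listed above, the following sketch shows label alignment, a reduction that skips missing values, and a SQL-style join. The variable names and sample values are made up for this example and are not taken from the notes.

```python
import numpy as np
import pandas as pd

# Two Series whose labels only partly overlap (sample data, for illustration only).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Automatic alignment: values are matched by label, not by position.
# "x" has no partner in b, so the result for "x" is NaN.
total = a + b

# Reductions skip missing values by default.
total_sum = total.sum()

# A relational operation: inner join on a shared key column.
left = pd.DataFrame({"key": ["k1", "k2"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["k2", "k3"], "rval": [3, 4]})
joined = pd.merge(left, right, on="key", how="inner")
```

Only the key "k2" appears in both frames, so the inner join keeps a single row.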
Usage
import numpy as np
from numpy import nan as NA
from pandas import Series, DataFrame, MultiIndex
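Data structure: Series
A Series is a one-dimensional array-like object consisting of an array of data (of any NumPy data type) and an associated array of data labels, called its index.
The string representation of a Series shows the index on the left and the values on the right.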
>>> from pandas import Series
>>> obj = Series([4, 7, -5, 3])  # build a Series from a list
>>> obj
0 4
1 7
2 -5
3 3
dtype: int64
>>> obj.values
array([ 4, 7, -5, 3], dtype=int64)
>>> obj.index
Int64Index([0, 1, 2, 3], dtype='int64')
>>> obj2 = Series([4, 7, -5, 3], index = ['d', 'b', 'a', 'c'])  # specify the index of the Series
>>> obj2
d 4
b 7
a -5
c 3
dtype: int64
>>> obj2['a']
-5
>>> obj2['d'] = 6
>>> obj2[['c', 'a', 'd']]
c 3
a -5
d 6
dtype: int64
>>> obj2[obj2 > 0]  # keep the elements greater than 0
d 6
b 7
c 3
dtype: int64
>>> 'b' in obj2  # check whether a label exists in the index
True
>>> 'e' in obj2
False
>>> sdata = {'Ohio':45000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}  # build a Series from a dict
>>> obj3 = Series(sdata)
>>> obj3
Ohio 45000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
>>> states = ['California', 'Ohio', 'Oregon', 'Texas']  # build a Series from a dict with an explicit index; labels with no match get NaN
>>> obj4 = Series(sdata, index = states)
>>> obj4
California NaN
Ohio 45000
Oregon 16000
Texas 71000
dtype: float64
>>> obj3 + obj4  # adding Series: values with the same label are added; labels found in only one operand give NaN
California NaN
Ohio 90000
Oregon 32000
Texas 142000
Utah NaN
dtype: float64
>>> obj4.name = 'population'  # name the Series and its index
>>> obj4.index.name = 'state'
>>> obj4
state
California NaN
Ohio 45000
Oregon 16000
Texas 71000
Name: population, dtype: float64
>>> obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']  # replace the index in place
>>> obj
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
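Data structure: DataFrame
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.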
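The "dict of Series sharing one index" view can be made concrete with a small sketch. The two Series below are invented for illustration; the point is only that their labels are combined into a single row index and that each column is itself a Series on that index.

```python
import pandas as pd

# Two Series with different (partly overlapping) labels; sample data only.
pop = pd.Series([1.5, 1.7], index=["one", "two"])
state = pd.Series(["Ohio", "Nevada"], index=["two", "three"])

# Building a DataFrame from a dict of Series aligns them on the union of labels.
frame = pd.DataFrame({"pop": pop, "state": state})

# Every column is a Series that shares the frame's row index.
pop_column = frame["pop"]
```

Rows that exist in only one of the input Series receive a missing value in the other column.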
>>> data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],'year':[2000, 2001, 2002, 2001, 2002],'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}  # build a DataFrame from a dict; the keys become the column names
>>> DataFrame(data)
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
[5 rows x 3 columns]
>>> DataFrame(data, columns = ['year', 'state', 'pop'])  # specify the column order
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
[5 rows x 3 columns]
>>> frame2 = DataFrame(data,columns = ['year', 'state', 'pop', 'debt'],index = ['one', 'two', 'three', 'four', 'five'])  # specify the index; a requested column that is not in the data is filled with NaN
>>> frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
[5 rows x 4 columns]
>>> frame2['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
>>> frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
>>> frame2.ix['three']  # select a row by label (.ix was removed in pandas 1.0; use .loc there)
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
>>> frame2['debt'] = 16.5  # assign a scalar to an entire column
>>> frame2
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
[5 rows x 4 columns]
>>> frame2.debt = np.arange(5)  # assign a NumPy array to the column
>>> frame2
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
[5 rows x 4 columns]
>>> val = Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])  # assign a Series: values are matched by index label; rows without a match get NaN
>>> frame2['debt'] = val
>>> frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
[5 rows x 4 columns]
>>> frame2['eastern'] = (frame2.state == 'Ohio')  # assign to a new column: True where state equals 'Ohio'
>>> frame2
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
[5 rows x 5 columns]
>>> frame2.columns
Index([u'year', u'state', u'pop', u'debt', u'eastern'], dtype='object')
>>> pop = {'Nevada':{2001:2.4, 2002:2.9},'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}  # nested dict: outer keys become columns, inner keys become row labels (transposed below with .T)
>>> frame3 = DataFrame(pop)
>>> frame3
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
[3 rows x 2 columns]
>>> frame3.T
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
[2 rows x 3 columns]
>>> DataFrame(pop, index = [2001, 2002, 2003])  # specify the row index explicitly (pdata below is built from sliced Series)
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
[3 rows x 2 columns]
>>> pdata = {'Ohio':frame3['Ohio'][:-1], 'Nevada':frame3['Nevada'][:2]}
>>> DataFrame(pdata)
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
[2 rows x 2 columns]
>>> frame3.index.name = 'year'  # name the index and the columns
>>> frame3.columns.name = 'state'
>>> frame3
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
[3 rows x 2 columns]
>>> frame3.values
array([[ nan, 1.5],
[ 2.4, 1.7],
[ 2.9, 3.6]])
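Basic functionality
Reindexing
Reindexing creates a new object conformed to a new index. Calling reindex on a Series rearranges the data according to the new index; if an index value is not currently present, a missing value is introduced. For ordered data such as time series, you may want to fill or interpolate values when reindexing; the method option does this.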
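For the time series case mentioned above, a minimal sketch might look like the following. The dates and values are invented for illustration; `method="ffill"` carries the last known observation forward onto the new dates.

```python
import pandas as pd

# Two observations three days apart (sample data, for illustration only).
ts = pd.Series([1.0, 2.0],
               index=pd.to_datetime(["2000-01-01", "2000-01-04"]))

# Target index: every calendar day from Jan 1 through Jan 5.
daily = pd.date_range("2000-01-01", "2000-01-05", freq="D")

# Without a fill method, dates that were not observed become NaN.
plain = ts.reindex(daily)

# With forward fill, each new date takes the most recent earlier value.
filled = ts.reindex(daily, method="ffill")
```

Forward filling requires the original index to be sorted, which is normally the case for time series.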
>>> obj = Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])  # reindexing changes the index labels and their order
>>> obj
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
>>> obj2 = obj.reindex(['a', 'b', 'd', 'c', 'e'])
>>> obj2
a -5.3
b 7.2
d 4.5
c 3.6
e NaN
dtype: float64
>>> obj.reindex(['a', 'b', 'd', 'c', 'e'], fill_value = 0)  # specify the value used for labels that do not exist
a -5.3
b 7.2
d 4.5
c 3.6
e 0.0
dtype: float64
>>> obj3 = Series(['blue', 'purple', 'yellow'], index = [0, 2, 4])  # reindex with a fill method
>>> obj3
0 blue
2 purple
4 yellow
dtype: object
>>> obj3.reindex(range(6), method = 'ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
>>> frame = DataFrame(np.arange(9).reshape(3, 3),index = ['a', 'c', 'd'],columns = ['Ohio', 'Texas', 'California'])  # reindexing a DataFrame
>>> frame
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
[3 rows x 3 columns]
>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
>>> frame2
Ohio Texas California
a 0 1 2
b NaN NaN NaN
c 3 4 5
d 6 7 8
[4 rows x 3 columns]
>>> states = ['Texas', 'Utah', 'California']  # reindex the columns
>>> frame.reindex(columns = states)
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
[3 rows x 3 columns]
>>> frame.reindex(index = ['a', 'b', 'c', 'd'],method = 'ffill',columns = states)  # reindex rows and columns of a DataFrame together, with a fill method
Texas Utah California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8
[4 rows x 3 columns]
>>> frame.ix[['a', 'b', 'd', 'c'], states]  # frame itself was not changed by the calls above
Texas Utah California
a 1 NaN 2
b NaN NaN NaN
d 7 NaN 8
c 4 NaN 5
[4 rows x 3 columns]
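The sessions above rely on `.ix`, which was removed in pandas 1.0. A hedged sketch of how the same selections could be written with `reindex` alone is shown below; it rebuilds the same `frame`, and splitting the forward fill and the column selection into two steps is a choice made here for clarity, not something taken from the notes.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
states = ["Texas", "Utah", "California"]

# Labels that do not exist ("b", "Utah") are introduced as NaN.
picked = frame.reindex(index=["a", "b", "d", "c"], columns=states)

# Forward-fill the new row first, then select and reorder the columns.
filled = (frame.reindex(["a", "b", "c", "d"], method="ffill")
               .reindex(columns=states))
```

Both calls return new objects, so `frame` keeps its original three rows and three columns.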
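Dropping entries from an axis
Dropping one or more entries from an axis is easy as long as you have an index array or list (the original DataFrame or Series is not modified). Because it requires some data munging and set logic, the drop method returns a new object with the specified values removed from the given axis.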
>>> obj = Series(np.arange(5.), index = ['a', 'b', 'c', 'd', 'e'])  # drop Series elements by index label
>>> new_obj = obj.drop('c')
>>> new_obj
a 0
b 1
d 3
e 4
dtype: float64
>>> obj.drop(['d', 'c'])
a 0
b 1
e 4
dtype: float64
>>> data = DataFrame(np.arange(16).reshape((4, 4)),index = ['Ohio', 'Colorado', 'Utah', 'New York'],columns = ['one', 'two', 'three', 'four'])  # drop from a DataFrame, by row label or by column
>>> data.drop(['Colorado', 'Ohio'])
one two three four
Utah 8 9 10 11
New York 12 13 14 15
[2 rows x 4 columns]
>>> data.drop('two', axis = 1)
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
[4 rows x 3 columns]
>>> data.drop(['two', 'four'], axis = 1)
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
[4 rows x 2 columns]
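To make the "returns a new object" behaviour explicit, here is a small sketch on the same `data` frame. The `columns=` keyword and `errors="ignore"` are available in current pandas releases but do not appear in the sessions above, so treat them as additions for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Drop rows by label; the result is a new DataFrame.
rows_dropped = data.drop(["Colorado", "Ohio"])

# Drop columns by name; equivalent to passing axis=1.
cols_dropped = data.drop(columns=["two", "four"])

# A label that is absent normally raises KeyError; errors="ignore" skips it.
unchanged_copy = data.drop(["Texas"], errors="ignore")
```

After all three calls `data` still has its original four rows and four columns.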
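Indexing, selection, and filtering
1. Series indexing (obj[…]) works like NumPy array indexing, except that the index values need not be integers.
2. Slicing with labels differs from ordinary Python slicing: the endpoint is inclusive.
3. Indexing into a DataFrame retrieves one or more columns.
4. To index the rows of a DataFrame by label, the special indexing field ix was introduced. Note that ix was deprecated in pandas 0.20 and removed in 1.0; loc (label-based) and iloc (position-based) replace it.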
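As a sketch of how the row-selection rule above translates to pandas versions without `ix`, the snippet below uses `loc` for labels and `iloc` for positions. The frame mirrors the one used in the sessions that follow; converting column positions to labels through `data.columns[...]` is one possible approach chosen here, not the only one.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Row label plus a list of column labels.
by_label = data.loc["Colorado", ["two", "three"]]

# Row by integer position (counting from 0).
by_position = data.iloc[2]

# Row labels combined with column positions: turn positions into labels first.
mixed = data.loc[["Colorado", "Utah"], data.columns[[3, 0, 1]]]

# Label slices include the endpoint, so "Utah" is part of the result.
up_to_utah = data.loc[:"Utah", "two"]
```

Keeping label-based and position-based access in separate accessors avoids the ambiguity that `ix` had with integer labels.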
>>> obj = Series(np.arange(4.), index = ['a', 'b', 'c', 'd'])  # indexing a Series; integer positions also work here
>>> obj['b']
1.0
>>> obj[3]
3.0
>>> obj[[1, 3]]
b 1
d 3
dtype: float64
>>> obj[obj < 2]
a 0
b 1
dtype: float64
>>> obj['b':'c']  # label-based slice of a Series: a closed interval, both ends included
b 1
c 2
dtype: float64
>>> obj['b':'c'] = 5
>>> obj
a 0
b 5
c 5
d 3
dtype: float64
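The difference between label slices and positional slices can be checked directly. This is a small sketch on the same kind of Series; using `.loc` and `.iloc` explicitly is a choice made here so that the example does not depend on how plain `obj[...]` resolves integer keys.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

# Label slice: both endpoints are included.
by_label = obj.loc["b":"c"]

# Positional slice: ordinary Python semantics, the stop position is excluded.
by_position = obj.iloc[1:2]
```

Here `by_label` holds the entries for "b" and "c", while `by_position` holds only the entry for "b".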
>>> data = DataFrame(np.arange(16).reshape((4, 4)),index = ['Ohio', 'Colorado', 'Utah', 'New York'],columns = ['one', 'two', 'three', 'four'])  # indexing a DataFrame
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
[4 rows x 4 columns]
>>> data['two']  # select a single column
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
>>> data[['three', 'one']]
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
[4 rows x 2 columns]
>>> data[:2]
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
[2 rows x 4 columns]
>>> data.ix['Colorado', ['two', 'three']]  # select by row label and column labels
two 5
three 6
Name: Colorado, dtype: int32
>>> data.ix[['Colorado', 'Utah'], [3, 0, 1]]
four one two
Colorado 7 4 5
Utah 11 8 9
[2 rows x 3 columns]
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
[4 rows x 4 columns]
>>> data.ix[2]  # the row at position 2 (counting from 0)
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
>>> data.ix[:'Utah', 'two']  # rows from the start through Utah (inclusive), column 'two'
Ohio 1
Colorado 5
Utah 9
Name: two, dtype: int32
>>> data[data.three > 5]  # select rows by a boolean condition
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
[3 rows x 4 columns]
>>> data < 5  # element-wise comparison returns a DataFrame of True/False
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
[4 rows x 4 columns]
>>> data[data < 5] = 0
>>> data
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
[4 rows x 4 columns]
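The assignment `data[data < 5] = 0` modifies `data` in place. If the original values need to be kept, one possible alternative is `DataFrame.mask`, which returns a new object; this is offered as a sketch and is not part of the sessions above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Replace every value below 5 with 0 in a new DataFrame.
masked = data.mask(data < 5, 0)
```

`masked` matches the final table shown above, while `data` still holds the values 0 through 15.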