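The two main pandas data structures: Series and DataFrame
A Series is a one-dimensional, array-like object made up of a sequence of values and an associated sequence of data labels (its index).
Ways to create a Series:
① Pass a list of values directly; the Series gets a default integer index from 0 to N-1, where N is the length of the list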
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: pd.Series([1,3,5,np.nan,6,8])
Out[3]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
② Create a Series with explicit data labels
In [8]: pd.Series([1,3,5,np.nan,6,8],index=['A','B','C','D','E','F'])
Out[8]:
A 1.0
B 3.0
C 5.0
D NaN
E 6.0
F 8.0
dtype: float64
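③ Create a Series from a dict; the dict keys become the index (sorted in the older pandas version shown here; pandas 0.23+ on Python 3.6+ keeps the dict's insertion order instead, which would give A, B, D, C)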
In [13]: data = {'A':2,'B':4,'D':8,'C':16}
In [14]: s2 = pd.Series(data)
In [15]: s2
Out[15]:
A 2
B 4
C 16
D 8
dtype: int64
④ Create a Series from a dict plus an explicit index: labels that match a dict key take that key's value, and labels with no match get NaN
In [16]: idx = ['A','B','D','G']
In [17]: s3 = pd.Series(data,index=idx)
In [18]: s3
Out[18]:
A 2.0
B 4.0
D 8.0
G NaN
dtype: float64
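The four construction methods above can be collected into one runnable script. This is only a sketch: the variable names are illustrative, and the order of the dict-based index depends on the pandas version.

```python
import numpy as np
import pandas as pd

values = [1, 3, 5, np.nan, 6, 8]
data = {'A': 2, 'B': 4, 'D': 8, 'C': 16}

s_default = pd.Series(values)                           # default index 0..5
s_labeled = pd.Series(values, index=list('ABCDEF'))     # explicit labels
s_dict = pd.Series(data)                                # dict keys become the index
s_match = pd.Series(data, index=['A', 'B', 'D', 'G'])   # 'G' has no key, so NaN

print(list(s_default.index))
print(list(s_dict.index))   # sorted or insertion order, depending on version
print(s_match)
```

The NaN in s_match is also why its dtype is float64 rather than an integer type.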
Series attributes:
① shape returns the shape of the Series
In [9]: s = pd.Series([1,3,5,np.nan,6,8])
In [10]: s.shape
Out[10]: (6,)
② dtype returns the data type of the values
In [12]: s = pd.Series([1,3,5,np.nan,6,8])
In [13]: s.dtype
Out[13]: dtype('float64')
③ values returns the contents as a NumPy array
In [4]: s = pd.Series([1,3,5,np.nan,6,8])
In [5]: s.values
Out[5]: array([ 1., 3., 5., nan, 6., 8.])
④ index returns the index of the Series
In [6]: s = pd.Series([1,3,5,np.nan,6,8])
In [7]: s.index
Out[7]: RangeIndex(start=0, stop=6, step=1)
⑤ The name attribute of the Series itself and of its index
In [23]: s3.name='data'
In [24]: s3.index.name='idx'
In [25]: s3
Out[25]:
idx
A 2.0
B 4.0
D 8.0
G NaN
Name: data, dtype: float64
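The attributes listed above can be inspected together. A minimal sketch; the summary dict is just an illustrative way to gather them in one place.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], name='data')
s.index.name = 'idx'

summary = {
    'shape': s.shape,                      # one-element tuple for a Series
    'dtype': str(s.dtype),                 # float64 because of the NaN
    'n_values': len(s.to_numpy()),         # to_numpy() returns the underlying array
    'index_type': type(s.index).__name__,  # default index is a RangeIndex
    'name': s.name,
    'index_name': s.index.name,
}
print(summary)
```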
Handling missing data in a Series
① Check for missing data
In [19]: pd.isnull(s3)
Out[19]:
A False
B False
D False
G True
dtype: bool
In [20]: pd.notnull(s3)
Out[20]:
A True
B True
D True
G False
dtype: bool
In [21]: s3.isnull()  # method form; the parentheses are required, otherwise IPython only echoes the bound method
Out[21]:
A False
B False
D False
G True
dtype: bool
In [22]: s3.notnull()
Out[22]:
A True
B True
D True
G False
dtype: bool
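The missing-data checks above can be condensed into a short script. This is only a minimal sketch: it rebuilds s3 from the same dict used earlier, and the names mask and n_missing are illustrative. isna() and notna() are aliases of isnull() and notnull().

```python
import pandas as pd

# Rebuild s3 as in the section above: 'G' has no matching key, so it is NaN.
data = {'A': 2, 'B': 4, 'D': 8, 'C': 16}
s3 = pd.Series(data, index=['A', 'B', 'D', 'G'])

# isna()/notna() are methods, so they must be called with parentheses.
mask = s3.isna()
n_missing = int(mask.sum())  # True counts as 1 when summed

print(mask.to_dict())
print(n_missing)
```

Summing the boolean mask is a common way to count missing entries without looping.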
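② Drop missing data (s3 here carries the index A, B, C, D assigned in In [26])
In [29]: s3.dropna()  # returns a new Series without NaN; the original s3 is unchanged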
Out[29]:
A 2.0
B 4.0
C 8.0
Name: data, dtype: float64
In [30]: s3
Out[30]:
A 2.0
B 4.0
C 8.0
D NaN
Name: data, dtype: float64
In [31]: s3[s3.notnull()]  # boolean indexing keeps only the non-NaN values
Out[31]:
A 2.0
B 4.0
C 8.0
Name: data, dtype: float64
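As a quick check of the two approaches above, the following sketch compares dropna() with boolean indexing. The Series is rebuilt by hand with the same values, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same values as s3 after its index was reassigned to A, B, C, D.
s3 = pd.Series([2.0, 4.0, 8.0, np.nan], index=['A', 'B', 'C', 'D'], name='data')

dropped = s3.dropna()        # new Series, NaN rows removed
masked = s3[s3.notna()]      # boolean indexing, same result

print(dropped.equals(masked))
print(len(s3), len(dropped))  # the original still has all four rows
```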
③ Fill missing data
In [32]: s3.fillna(0)  # fill missing values with 0
Out[32]:
A 2.0
B 4.0
C 8.0
D 0.0
Name: data, dtype: float64
In [40]: s3.fillna(s3.mean())  # fill missing values with the mean (run after In [36], so A is also NaN and the mean of 4.0 and 8.0 is 6.0)
Out[40]:
A 6.0
B 4.0
C 8.0
D 6.0
Name: data, dtype: float64
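In [33]: s3.fillna(method='ffill')  # forward fill: carry the last valid value forward (method= is deprecated since pandas 2.1; s3.ffill() is the equivalent)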
Out[33]:
A 2.0
B 4.0
C 8.0
D 8.0
Name: data, dtype: float64
In [36]: s3.A=None  # None is stored as NaN in a float Series
In [37]: s3
Out[37]:
A NaN
B 4.0
C 8.0
D NaN
Name: data, dtype: float64
In [39]: s3.fillna(method='bfill')  # backward fill: pull the next valid value backward; the trailing D has no later value and stays NaN
Out[39]:
A 4.0
B 4.0
C 8.0
D NaN
Name: data, dtype: float64
In [41]: s3.fillna(method='bfill',limit=1)  # backward fill, filling at most 1 consecutive NaN per gap
Out[41]:
A 4.0
B 4.0
C 8.0
D NaN
Name: data, dtype: float64
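With single-element gaps, limit=1 gives the same result as an unlimited fill, so the effect of limit is easier to see on a Series with longer gaps. The sketch below uses a made-up Series and the ffill()/bfill() method forms, which recent pandas versions recommend over fillna(method=...).

```python
import numpy as np
import pandas as pd

# Hypothetical data with two-element gaps so that limit has a visible effect.
s = pd.Series([np.nan, np.nan, 4.0, np.nan, np.nan, 8.0, np.nan],
              index=list('ABCDEFG'))

forward = s.ffill()             # A and B stay NaN: nothing precedes them
backward = s.bfill()            # G stays NaN: nothing follows it
backward_1 = s.bfill(limit=1)   # only the NaN nearest to each valid value is filled
mean_filled = s.fillna(s.mean())  # mean() skips NaN, so it is (4 + 8) / 2

print(backward_1)
```

Here backward_1 fills B and E but leaves A and D as NaN, because each gap is two elements long and limit allows only one fill per gap.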
Selecting data from a Series: first create a new Series s:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: s = pd.Series([1,3,5,np.nan,6,8],index=['A','B','C','D','E','F'])
In [4]: s
Out[4]:
A 1.0
B 3.0
C 5.0
D NaN
E 6.0
F 8.0
dtype: float64
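① Indexing by integer position
In [5]: s[1]  # single value by position (with a labeled index, recent pandas versions deprecate this fallback in favor of s.iloc[1])
Out[5]: 3.0
In [6]: s[[1,2,3]]  # multiple values by position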
Out[6]:
B 3.0
C 5.0
D NaN
dtype: float64
② Indexing by data label
In [7]: s['B']
Out[7]: 3.0
In [8]: s[['B','C','D']]
Out[8]:
B 3.0
C 5.0
D NaN
dtype: float64
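Because plain square brackets accept both positions and labels, it can be clearer to name the intent explicitly. A minimal sketch using the .iloc (position) and .loc (label) accessors on the same Series; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['A', 'B', 'C', 'D', 'E', 'F'])

by_position = s.iloc[1]               # second element, regardless of labels
by_positions = s.iloc[[1, 2, 3]]      # several positions at once
by_label = s.loc['B']                 # element labeled 'B'
by_labels = s.loc[['B', 'C', 'D']]    # several labels at once

print(by_position, by_label)
print(by_positions.equals(by_labels))
```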
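③ Access through the get methods
get(self, key, default=None)
get_value(self, label, takeable=False)  (deprecated in pandas 0.21, removed in 1.0)
get_values(self)  (deprecated in pandas 0.25, removed in 1.0)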
In [11]: s.get(1)  # single value
Out[11]: 3.0
In [12]: s.get('B')
Out[12]: 3.0
In [13]: s.get([1,3])  # multiple values
Out[13]:
B 3.0
D NaN
dtype: float64
In [14]: s.get(['B','D'])
Out[14]:
B 3.0
D NaN
dtype: float64
In [15]: s.get_value(1)  # single values only
Out[15]: 3.0
In [16]: s.get_value('B')
Out[16]: 3.0
In [17]: s.get_values()  # all values
Out[17]: array([ 1., 3., 5., nan, 6., 8.])
In [18]: s.get('G')  # label not found: returns None by default instead of raising an error
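On pandas 1.0 and later, get_value and get_values are no longer available. The sketch below shows one possible set of replacements, assuming the scalar accessors .at and .iat and the to_numpy() method are acceptable substitutes; it also shows the default argument of get.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['A', 'B', 'C', 'D', 'E', 'F'])

found = s.get('B')                  # 3.0
missing = s.get('G')                # None: no KeyError is raised
fallback = s.get('G', default=0.0)  # caller-chosen default instead of None

scalar_by_label = s.at['B']   # one possible stand-in for get_value('B')
scalar_by_pos = s.iat[1]      # one possible stand-in for positional get_value(1)
all_values = s.to_numpy()     # one possible stand-in for get_values()

print(found, missing, fallback)
```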
④ Slicing
In [3]: s = pd.Series([1,3,5,np.nan,6,8],index=['A','B','C','D','E','F'])
In [4]: s[:3]  # everything before position 3; position 3 itself is excluded
Out[4]:
A 1.0
B 3.0
C 5.0
dtype: float64
In [5]: s[3:]  # everything from position 3 onward; position 3 itself is included
Out[5]:
D NaN
E 6.0
F 8.0
dtype: float64
In [6]: s[:-1]  # everything except the last element
Out[6]:
A 1.0
B 3.0
C 5.0
D NaN
E 6.0
dtype: float64
In [7]: s[-1:]  # the last element only
Out[7]:
F 8.0
dtype: float64
In [8]: s[1:5:2]  # the elements at positions 1 and 3
Out[8]:
B 3.0
D NaN
dtype: float64
In [10]: s[-1::-1]  # reverse order
Out[10]:
F 8.0
E 6.0
D NaN
C 5.0
B 3.0
A 1.0
dtype: float64
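The slices above are all positional. Slicing by label behaves differently in one respect: the end label is included. A short sketch of the contrast, using .iloc and .loc to make the intent explicit; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['A', 'B', 'C', 'D', 'E', 'F'])

positional = s.iloc[1:3]     # positions 1 and 2; the stop position is excluded
by_label = s.loc['B':'D']    # labels B, C and D; the stop label is included
reversed_s = s.iloc[::-1]    # same as s[-1::-1]

print(list(positional.index))
print(list(by_label.index))
```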
① Direct assignment
In [1]: import pandas as pd
...: import numpy as np
...: s = pd.Series([1,3,5,np.nan,6,8],index=['A','B','C','D','E','F'])
...: s.D=4
...: s
...:
Out[1]:
A 1.0
B 3.0
C 5.0
D 4.0
E 6.0
F 8.0
dtype: float64
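Attribute-style assignment such as s.D=4 only works for labels that are valid Python identifiers and already exist in the index. A minimal sketch of the bracket and accessor forms, which are one way to avoid that restriction; the label 'G' and the assigned values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['A', 'B', 'C', 'D', 'E', 'F'])

s['D'] = 4        # same effect as s.D = 4
s.iloc[0] = 10    # assign by position
s.loc['G'] = 7    # assigning to a new label through .loc appends it

print(s)
```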
② Boolean operations
In [2]: s[s>3]
Out[2]:
C 5.0
D 4.0
E 6.0
F 8.0
dtype: float64
In [3]: s*2
Out[3]:
A 2.0
B 6.0
C 10.0
D 8.0
E 12.0
F 16.0
dtype: float64
In [4]: s1 = pd.Series([4,5,6,7],index=['B','C','D','G'])
In [6]: s1
Out[6]:
B 4
C 5
D 6
G 7
dtype: int64
In [7]: s+s1
Out[7]:
A NaN
B 7.0
C 10.0
D 10.0
E NaN
F NaN
G NaN
dtype: float64
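The NaN values in s+s1 come from index alignment: a label that appears in only one operand has no partner to add. The sketch below repeats the addition and then shows one way to treat missing partners as 0, using the fill_value parameter of Series.add. The Series are rebuilt with the same values as above.

```python
import pandas as pd

# s after D was set to 4, and s1 as defined above.
s = pd.Series([1.0, 3.0, 5.0, 4.0, 6.0, 8.0], index=['A', 'B', 'C', 'D', 'E', 'F'])
s1 = pd.Series([4, 5, 6, 7], index=['B', 'C', 'D', 'G'])

aligned = s + s1                  # labels present on one side only become NaN
filled = s.add(s1, fill_value=0)  # a missing partner is treated as 0

print(aligned)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; for some analyses the NaN result is the more honest answer.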
⑤ Membership tests
In [8]: 'B' in s
Out[8]: True
In [9]: 'G' not in s
Out[9]: True
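Note that the in operator on a Series tests the index labels, not the stored values. A small sketch of the difference, with isin shown as one way to test values element by element; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 5.0, 4.0, 6.0, 8.0], index=['A', 'B', 'C', 'D', 'E', 'F'])

label_present = 'B' in s             # True: 'B' is an index label
value_as_label = 3.0 in s            # False: 3.0 is a value, not a label
value_present = 3.0 in s.to_numpy()  # True: search the values explicitly
value_mask = s.isin([3.0, 8.0])      # element-wise membership test

print(label_present, value_as_label, value_present)
print(value_mask)
```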
⑥ Assigning to the index of a Series modifies it in place
In [26]: s3.index=['A','B','C','D']
In [27]: s3
Out[27]:
A 2.0
B 4.0
C 8.0
D NaN
Name: data, dtype: float64
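Assigning a list to s3.index replaces every label at once and changes s3 itself. When only some labels should change, or the original should stay untouched, Series.rename with a dict is an alternative. A minimal sketch with s3 rebuilt by hand; the mapping is chosen to reproduce the result above.

```python
import numpy as np
import pandas as pd

s3 = pd.Series([2.0, 4.0, 8.0, np.nan], index=['A', 'B', 'D', 'G'], name='data')

# rename maps old labels to new ones and returns a new Series.
renamed = s3.rename({'D': 'C', 'G': 'D'})
original_labels = list(s3.index)   # still A, B, D, G

# Index assignment works in place and must supply one label per element.
s3.index = ['A', 'B', 'C', 'D']
try:
    s3.index = ['A', 'B']
    length_checked = False
except ValueError:
    length_checked = True          # a wrong-length list is rejected

print(list(renamed.index), original_labels, length_checked)
```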