Best viewed in Jupyter Notebook!
Dropping entries from an axis
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(5.),index = ['a', 'b','c', 'd', 'e'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
new_obj = obj.drop('c')
new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
obj.drop(['d','e'])
a 0.0
b 1.0
c 2.0
dtype: float64
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
|  | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data.drop(['Colorado','Ohio'])
|  | one | two | three | four |
|---|---|---|---|---|
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |

- Passing axis=1 or axis='columns' drops values from the columns instead:
data.drop('two',axis=1)
|  | one | three | four |
|---|---|---|---|
| Ohio | 0 | 2 | 3 |
| Colorado | 4 | 6 | 7 |
| Utah | 8 | 10 | 11 |
| New York | 12 | 14 | 15 |
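The axis argument can be spelled in more than one way. The sketch below rebuilds the same example frame and checks that three spellings drop the same column; the variable names are illustrative only, and the `columns=` keyword is assumed to be available (it exists in reasonably recent pandas releases).

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Three equivalent ways to drop the column 'two'.
by_axis_number = data.drop('two', axis=1)
by_axis_name = data.drop('two', axis='columns')
by_keyword = data.drop(columns='two')

print(by_axis_number.equals(by_axis_name))  # True
print(by_axis_name.equals(by_keyword))      # True
print(list(by_keyword.columns))             # ['one', 'three', 'four']
```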
- The operations above return a new object and leave the original data unchanged; passing inplace=True modifies the object in place instead.
data.drop('one',axis=1,inplace=True)
data
|  | two | three | four |
|---|---|---|---|
| Ohio | 1 | 2 | 3 |
| Colorado | 5 | 6 | 7 |
| Utah | 9 | 10 | 11 |
| New York | 13 | 14 | 15 |
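To make the difference between the two modes concrete, here is a small sketch on a copy of the example frame. The names `frame_copy`, `returned_new` and `returned_inplace` are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
frame_copy = data.copy()

# Default behaviour: a new DataFrame comes back, the source keeps 'one'.
returned_new = frame_copy.drop('one', axis=1)
print('one' in frame_copy.columns)    # True
print('one' in returned_new.columns)  # False

# inplace=True: the source itself is changed and the call returns None.
returned_inplace = frame_copy.drop('one', axis=1, inplace=True)
print(returned_inplace)               # None
print(list(frame_copy.columns))       # ['two', 'three', 'four']
```

Because the in-place call returns None, assigning its result back to the same name would discard the data, so the two styles should not be mixed.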
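Indexing, selection, and filtering
- Series indexing works much like NumPy array indexing, except that the index values of a Series do not have to be integers.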
obj = pd.Series(np.arange(4), index = ['a','b','c','d'])
obj
a 0
b 1
c 2
d 3
dtype: int32
obj['b']
1
obj[2:4]
c 2
d 3
dtype: int32
obj[['a','c','d']]
a 0
c 2
d 3
dtype: int32
- Slicing with labels behaves differently from ordinary Python slicing: the endpoint is inclusive.
obj['a':'c']
a 0
b 1
c 2
dtype: int32
obj['a':'c'] = 5
obj
a 5
b 5
c 5
d 3
dtype: int32
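The contrast between positional and label slices can be checked side by side. This is a minimal sketch that uses iloc and loc to make the intent explicit; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

# Positional slice: the end position is excluded, as in plain Python.
positional = obj.iloc[0:2]
# Label slice: the end label 'c' is included.
by_label = obj.loc['a':'c']

print(list(positional.index))  # ['a', 'b']
print(list(by_label.index))    # ['a', 'b', 'c']
```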
DataFrame
- Indexing a DataFrame with a single value or a sequence retrieves one or more columns.
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
|  | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data['two']
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
data[['three','two']]
|  | three | two |
|---|---|---|
| Ohio | 2 | 1 |
| Colorado | 6 | 5 |
| Utah | 10 | 9 |
| New York | 14 | 13 |
data[:2]
|  | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
data[data['three']>5]
|  | one | two | three | four |
|---|---|---|---|---|
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data<5
|  | one | two | three | four |
|---|---|---|---|---|
| Ohio | True | True | True | True |
| Colorado | True | False | False | False |
| Utah | False | False | False | False |
| New York | False | False | False | False |
data[data<5]
|  | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0.0 | 1.0 | 2.0 | 3.0 |
| Colorado | 4.0 | NaN | NaN | NaN |
| Utah | NaN | NaN | NaN | NaN |
| New York | NaN | NaN | NaN | NaN |
data[data<6]=0
data
|  | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 0 | 0 | 0 |
| Colorado | 0 | 0 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
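The cells above use two kinds of boolean mask: a boolean Series, which filters rows, and a boolean DataFrame, which filters individual cells. The sketch below spells out each shorthand with an explicit equivalent. It starts from a fresh copy of the frame, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# A boolean Series selects rows; .loc with the same mask is the explicit form.
row_mask = data['three'] > 5
rows_implicit = data[row_mask]
rows_explicit = data.loc[row_mask]

# A boolean DataFrame keeps matching cells and replaces the rest with NaN,
# which matches what DataFrame.where returns for the same condition.
cells_implicit = data[data < 5]
cells_explicit = data.where(data < 5)

print(rows_implicit.equals(rows_explicit))    # True
print(cells_implicit.equals(cells_explicit))  # True
```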
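Selection with loc and iloc
- For label-based indexing of DataFrame rows, this section introduces the special indexing operators loc and iloc. They let you select a subset of the rows and columns of a DataFrame with NumPy-like notation, using either axis labels (loc) or integer positions (iloc).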
data.loc['Colorado',['two','three']]
two 0
three 6
Name: Colorado, dtype: int32
data.iloc[2,[3,0,1]]
four 11
one 8
two 9
Name: Utah, dtype: int32
data.loc[:'Utah', 'two']
Ohio 0
Colorado 0
Utah 9
Name: two, dtype: int32
data.iloc[:,:3][data.three>5]
|  | one | two | three |
|---|---|---|---|
| Colorado | 0 | 0 | 6 |
| Utah | 8 | 9 | 10 |
| New York | 12 | 13 | 14 |
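Every loc selection has an iloc counterpart that names the same cells by position. The following sketch pairs them on a fresh copy of the example frame (without the earlier zeroing step); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# One row, two columns: by label and by integer position.
by_label = data.loc['Colorado', ['two', 'three']]
by_position = data.iloc[1, [1, 2]]
print(by_label.equals(by_position))  # True

# Label slices include the end label; position slices exclude the end position.
label_slice = data.loc[:'Utah', 'two']
position_slice = data.iloc[:3, 1]
print(label_slice.equals(position_slice))  # True
```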

Integer indexes
ser = pd.Series(np.arange(3.0),index=['a', 'b', 'c'])
ser
a 0.0
b 1.0
c 2.0
dtype: float64
ser[-1]
2.0
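- For consistency, if an axis index contains integers, data selection is always label-oriented. To be unambiguous, prefer loc (label-based) and iloc (integer position-based).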
ser2 = pd.Series(np.arange(3.0))
ser2
0 0.0
1 1.0
2 2.0
dtype: float64
ser2.iloc[:1]
0 0.0
dtype: float64
ser2.loc[:1]
0 0.0
1 1.0
dtype: float64
ser2[:1]
0 0.0
dtype: float64
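The ambiguity is easiest to see when the same slice bound is passed to loc and to iloc. The sketch below does that, and also uses iloc for the "last element" lookup. As a hedged side note, newer pandas releases are understood to warn about, or reject, positional fallbacks such as `ser[-1]` on a non-integer index, so the explicit iloc form is the safer spelling.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0), index=['a', 'b', 'c'])
ser2 = pd.Series(np.arange(3.0))

# Explicit positional access to the last element.
last_value = ser.iloc[-1]
print(last_value)  # 2.0

# The same bound means different things for labels and for positions.
label_part = ser2.loc[:1]      # labels 0 and 1 (end label included)
position_part = ser2.iloc[:1]  # position 0 only (end position excluded)
print(len(label_part), len(position_part))  # 2 1
```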
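Arithmetic and data alignment
- An important pandas feature is arithmetic between objects with different indexes. When objects are added, if any index pairs differ, the index of the result is the union of the index pairs.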
s1 = pd.Series([7.3,-2.5,3.4,1.5], index=['a','b','c','d'])
s2 = pd.Series([-2.1,3.6,-1.5,4,3.1],index=['a','c','e','f','g'])
s1
a 7.3
b -2.5
c 3.4
d 1.5
dtype: float64
s2
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
s1 + s2
a 5.2
b NaN
c 7.0
d NaN
e NaN
f NaN
g NaN
dtype: float64
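The union behaviour can be verified directly on the two Series. This is a small sketch with illustrative variable names; it only inspects the result of the addition shown above.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

total = s1 + s2

# The result index is the union of both indexes.
print(list(total.index))           # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
# Only labels present in both inputs carry a value; the rest are NaN.
print(list(total.dropna().index))  # ['a', 'c']
```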
- For DataFrame, alignment is performed on both the rows and the columns.
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
|  | b | c | d |
|---|---|---|---|
| Ohio | 0.0 | 1.0 | 2.0 |
| Texas | 3.0 | 4.0 | 5.0 |
| Colorado | 6.0 | 7.0 | 8.0 |
df2
|  | b | d | e |
|---|---|---|---|
| Utah | 0.0 | 1.0 | 2.0 |
| Ohio | 3.0 | 4.0 | 5.0 |
| Texas | 6.0 | 7.0 | 8.0 |
| Oregon | 9.0 | 10.0 | 11.0 |
df1 + df2
|  | b | c | d | e |
|---|---|---|---|---|
| Colorado | NaN | NaN | NaN | NaN |
| Ohio | 3.0 | NaN | 6.0 | NaN |
| Oregon | NaN | NaN | NaN | NaN |
| Texas | 9.0 | NaN | 12.0 | NaN |
| Utah | NaN | NaN | NaN | NaN |
df1 = pd.DataFrame({'A':[1,2]})
df2 = pd.DataFrame({'B':[3,4]})
df1
df2
df1+df2
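The last three cells are shown without output. As a sketch of what to expect: the two frames share row labels but no column labels, so the sum has both columns and every cell is NaN. The name `no_overlap` is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})

no_overlap = df1 + df2

# Columns are the union of both frames; no cell has a partner to add to.
print(list(no_overlap.columns))       # ['A', 'B']
print(no_overlap.isna().all().all())  # True
```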
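Arithmetic methods with fill values
- In arithmetic between differently indexed objects, you may want to fill in a special value (such as 0) when an axis label is found in one object but not in the other.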
df1 = pd.DataFrame(np.arange(12).reshape((3,4)),columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20).reshape(4,5),columns=list('abcde'))
df2.loc[1,'b'] = np.nan
df1
df2
|  | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 2 | 3 | 4 |
| 1 | 5 | NaN | 7 | 8 | 9 |
| 2 | 10 | 11.0 | 12 | 13 | 14 |
| 3 | 15 | 16.0 | 17 | 18 | 19 |
df1+df2
|  | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
| 1 | 9.0 | NaN | 13.0 | 15.0 | NaN |
| 2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |
df1.add(df2, fill_value=0)
|  | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
| 1 | 9.0 | 5.0 | 13.0 | 15.0 | 9.0 |
| 2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
| 3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
1/df1
|  | a | b | c | d |
|---|---|---|---|---|
| 0 | inf | 1.000000 | 0.500000 | 0.333333 |
| 1 | 0.250000 | 0.200000 | 0.166667 | 0.142857 |
| 2 | 0.125000 | 0.111111 | 0.100000 | 0.090909 |
df1.rdiv(1)
|  | a | b | c | d |
|---|---|---|---|---|
| 0 | inf | 1.000000 | 0.500000 | 0.333333 |
| 1 | 0.250000 | 0.200000 | 0.166667 | 0.142857 |
| 2 | 0.125000 | 0.111111 | 0.100000 | 0.090909 |
df1.reindex(columns=df2.columns,fill_value=0)
|  | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 0 |
| 1 | 4 | 5 | 6 | 7 | 0 |
| 2 | 8 | 9 | 10 | 11 | 0 |
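A compact way to see what fill_value changes is to count the missing cells with and without it. The sketch below uses float data (an assumption made here so that assigning NaN does not change a column's dtype); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan

plain = df1 + df2                    # a value missing on either side gives NaN
filled = df1.add(df2, fill_value=0)  # a value missing on one side counts as 0

print(int(plain.isna().sum().sum()))   # 9
print(int(filled.isna().sum().sum()))  # 0
print(filled.loc[1, 'b'])              # 5.0, since df2's NaN is treated as 0

# The r-prefixed methods flip the operands: 1 / df1 and df1.rdiv(1) agree.
print((1 / df1).equals(df1.rdiv(1)))   # True
```

A cell would still be NaN after filling only if it were missing from both inputs, which does not happen in this example.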
Operations between DataFrame and Series
arr = np.arange(12.).reshape((3,4))
arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
array([0., 1., 2., 3.])
np.array(arr[0])
array([0., 1., 2., 3.])
arr - arr[0]
array([[0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.]])
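- In the motivating example above, subtracting arr[0] from arr combines two arrays of different dimensions, and the subtraction is performed once for each row. This is called broadcasting. Operations between a DataFrame and a Series work in a similar way.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])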
series = frame.iloc[0]
frame
|  | b | d | e |
|---|---|---|---|
| Utah | 0.0 | 1.0 | 2.0 |
| Ohio | 3.0 | 4.0 | 5.0 |
| Texas | 6.0 | 7.0 | 8.0 |
| Oregon | 9.0 | 10.0 | 11.0 |
series
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame - series
|  | b | d | e |
|---|---|---|---|
| Utah | 0.0 | 0.0 | 0.0 |
| Ohio | 3.0 | 3.0 | 3.0 |
| Texas | 6.0 | 6.0 | 6.0 |
| Oregon | 9.0 | 9.0 | 9.0 |
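To connect the DataFrame result back to the NumPy example, the sketch below checks that subtracting the first row from the frame yields the same numbers as NumPy broadcasting on the underlying array. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

# pandas matches the Series index to the frame's columns,
# then broadcasts the subtraction down the rows.
pandas_diff = frame - series

# The same computation as plain NumPy broadcasting.
values = frame.to_numpy()
numpy_diff = values - values[0]

print(np.array_equal(pandas_diff.to_numpy(), numpy_diff))  # True
```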
- If an index value is not found in either the DataFrame's columns or the Series's index, the two objects are reindexed to form the union:
series2 = pd.Series(range(3),index=['b','e','f'])
series2
b 0
e 1
f 2
dtype: int64
frame + series2
|  | b | d | e | f |
|---|---|---|---|---|
| Utah | 0.0 | NaN | 3.0 | NaN |
| Ohio | 3.0 | NaN | 6.0 | NaN |
| Texas | 6.0 | NaN | 9.0 | NaN |
| Oregon | 9.0 | NaN | 12.0 | NaN |

- To match on the rows and broadcast across the columns instead, you have to use one of the arithmetic methods:
series3 = frame['d']
series3
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
frame.sub(series3,axis='index')
|  | b | d | e |
|---|---|---|---|
| Utah | -1.0 | 0.0 | 1.0 |
| Ohio | -1.0 | 0.0 | 1.0 |
| Texas | -1.0 | 0.0 | 1.0 |
| Oregon | -1.0 | 0.0 | 1.0 |
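The reason the method form is needed becomes clear when it is compared with the plain operator. In this sketch the `-` operator tries to match the Series's row labels against the frame's column labels, finds nothing in common, and returns only NaN; `sub` with axis='index' matches on the rows instead. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series3 = frame['d']

# Operator form: aligns series3's index with the frame's columns.
operator_result = frame - series3
print(operator_result.shape)                # (4, 7)
print(operator_result.isna().all().all())   # True

# Method form: aligns series3's index with the frame's rows.
method_result = frame.sub(series3, axis='index')
print(method_result['b'].tolist())  # [-1.0, -1.0, -1.0, -1.0]
```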