import pandas as pd
import numpy as np
Creating a Series:
1) A Series is a generalized NumPy array:
Example: data = pd.Series([0.25,0.5,0.75,1.0], index = ["a","b","c","d"])
2) A Series is a specialized dictionary:
Example: data_dict = {"population":123,
"text":456,
"flower":789}
data = pd.Series(data_dict)
data
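The two views of a Series described above can be sketched side by side. This is a minimal illustration; the variable name by_key is introduced here and is not part of the original notes.

```python
import pandas as pd

# Array-like view: values plus an explicit index.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

# Dictionary-like view: the dict keys become the index.
data_dict = {"population": 123, "text": 456, "flower": 789}
by_key = pd.Series(data_dict)

print(data.values)          # the underlying NumPy array
print(list(by_key.index))   # the dict keys, in insertion order
print("b" in data, by_key["text"])
```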
Indexers provided by pandas (loc, iloc, ix; ix has been deprecated since pandas 0.20.0 and was removed in 1.0):
f = pd.Series([1,2,3],index = ["a","b","c"])
print(f)
print("*"*20)
print(f.loc["c"]) # explicit (label-based) indexer
print("*"*20)
print(f.iloc[2]) # implicit (position-based) indexer
a 1
b 2
c 3
dtype: int64
********************
3
********************
3
These indexers work on both Series and DataFrame.
Slicing a Series:
Slicing by implicit (positional) index:
data = pd.Series([0.25,0.5,0.75,1.0], index = ["a","b","c","d"])
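The difference between loc and iloc is easiest to see when the labels are themselves integers. The sketch below uses made-up data (the names s and df and their values are illustrative only).

```python
import pandas as pd

s = pd.Series(["x", "y", "z"], index=[1, 3, 5])
print(s.loc[1])    # looks up the label 1
print(s.iloc[1])   # looks up the second position

df = pd.DataFrame({"v": [10, 20, 30]}, index=[1, 3, 5])
print(df.loc[3, "v"])   # row label 3, column label "v"
print(df.iloc[0, 0])    # first row, first column
```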
data[0:2]
a 0.25
b 0.5
Slicing by explicit (label) index:
data = pd.Series([0.25,0.5,0.75,1.0], index = ["a","b","c","d"])
data["a":"c"]
a 0.25
b 0.5
c 0.75
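Note: a slice on the explicit index includes the last label, while a slice on the implicit index excludes the last position.
Index alignment (Series and DataFrame):
Labels missing from either operand are filled with NaN;
Index alignment for Series: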
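The inclusive and exclusive endpoints can be checked directly. This sketch spells the two slices with iloc and loc to make the intent explicit; the names by_position and by_label are introduced here for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

by_position = data.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = data.loc["a":"c"]   # labels "a" through "c"; "c" is included

print(by_position)
print(by_label)
```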
A = pd.Series([2,4,6],index = [0,1,2])
B = pd.Series([1,3,5],index = [1,2,3])
print(A+B)
A.add(B,fill_value = 0)
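The notes do not show the output of this example, so here is a small sketch of what alignment does with it. The names total and filled are introduced for illustration.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

total = A + B                    # labels 0 and 3 exist on one side only, so they become NaN
filled = A.add(B, fill_value=0)  # the missing side is treated as 0 instead

print(total)
print(filled)
```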
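Index alignment for DataFrame:
rng = np.random.RandomState(42) # rng is not defined in these notes; a seeded NumPy generator is assumed
A = pd.DataFrame(rng.randint(0,20,(2,2)),columns = list("AB"))
B = pd.DataFrame(rng.randint(0,10,(3,3)),columns = list("BAC"))
print(A + B)
fill = A.stack().mean()
A.add(B,fill_value = fill) # call A's add method with B and a fill_value; fill_value is the value used in place of missing entries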
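Because the example above uses random numbers, the effect of fill_value is easier to follow with fixed data. The sketch below uses made-up values (not from the original notes) and the illustrative names plain and filled.

```python
import pandas as pd

A = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
B = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]], columns=list("BAC"))

plain = A + B              # aligned on both row labels and column labels
fill = A.stack().mean()    # mean of all values in A
filled = A.add(B, fill_value=fill)

print(plain)
print(filled)
```

In plain, column C and row 2 are NaN because A has no such labels; in filled, those cells use the mean of A in place of the missing A value.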
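Looking up rows and columns of a DataFrame:
A DataFrame can be viewed as an enhanced two-dimensional array when looking up rows and columns: a row can be retrieved with obj.values[i], and a column
can be retrieved with obj["j"] or obj.j;
Example: frame = pd.DataFrame(np.arange(12).reshape((4,3)),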
index = ["ohil","colorado","utah","new york"],
columns = ["one","two","three"])
series = frame.values[0]
print(series)
print(type(series))
series1 = frame["two"]
print(series1)
print(type(series1))
[0 1 2]
<class 'numpy.ndarray'>
ohil 1
colorado 4
utah 7
new york 10
Name: two, dtype: int32
<class 'pandas.core.series.Series'>
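Note that obj.values[i] returns a plain NumPy array and loses the column labels. As a complementary sketch (loc and iloc come from the indexer section above, not from this example), a row can also be retrieved as a labelled Series:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=["ohil", "colorado", "utah", "new york"],
                     columns=["one", "two", "three"])

row_by_label = frame.loc["utah"]   # Series indexed by the column labels
row_by_position = frame.iloc[2]    # the same row, selected by position
column = frame["two"]              # Series indexed by the row labels

print(row_by_label)
print(type(row_by_label))
```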
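Operations between DataFrame and Series:
When a DataFrame is combined with a Series, the Series is broadcast to the shape of the DataFrame before the operation is applied; (the point to watch is the choice of axis)
Example 1: frame = pd.DataFrame(np.arange(12).reshape((4,3)),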
index = ["ohil","colorado","utah","new york"],
columns = ["one","two","three"])
series = frame.values[0]
print(frame)
print(series)
print(frame.sub(series,axis = 1))
one two three
ohil 0 1 2
colorado 3 4 5
utah 6 7 8
new york 9 10 11
[0 1 2]
one two three
ohil 0 0 0
colorado 3 3 3
utah 6 6 6
new york 9 9 9
Example 2: A = np.random.randint(10,size = (3,4))
print(A)
df = pd.DataFrame(A,columns = list("QRST"))
print(df["R"])
print(df.sub(df["R"],axis = 0))
print(df.sub(df["R"],axis = 1))
[[7 3 8 4]
[2 0 0 9]
[4 8 4 4]]
0 3
1 0
2 8
Name: R, dtype: int32
Q R S T
0 4 0 5 1
1 2 0 0 9
2 -4 0 -4 -4
Q R S T 0 1 2
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
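The all-NaN result above comes from matching the Series index (0, 1, 2) against the column labels (Q, R, S, T). A minimal sketch of the axis rule, reusing the array printed in Example 2 (the names col_wise and row_wise are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[7, 3, 8, 4], [2, 0, 0, 9], [4, 8, 4, 4]]),
                  columns=list("QRST"))

# axis=0: align the Series index with the row labels,
# so column R is subtracted from every column.
col_wise = df.sub(df["R"], axis=0)

# axis=1 (the default): align the Series index with the column labels,
# so a row such as df.iloc[0] is subtracted from every row.
row_wise = df.sub(df.iloc[0], axis=1)

print(col_wise)
print(row_wise)
```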
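Handling missing values (Series and DataFrame):
1) Detecting missing values
Pandas data structures have two useful methods for detecting missing values: isnull and notnull
2) Dropping missing values
Use the dropna method to drop missing values
Example: data = pd.Series([1,np.nan,"hello",None])
data.dropna()
0 1
2 hello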
dtype: object
Note: a single value cannot be dropped from a DataFrame; only whole rows or whole columns can be dropped. dropna() on a DataFrame takes parameters to choose between them as needed;
Example: df = pd.DataFrame([[1, np.nan, 2],
[2, 3 , 5],
[np.nan, 4, 6]])
print(df.dropna())
0 1 2
1 2 3 5
print(df.dropna(axis = "columns")) # or: axis = 1
2
0 2
1 5
2 6
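Note: dropping missing values this way is heavy-handed and can discard valid values along with the missing ones;
To address this, set the how or thresh parameter; thresh sets the minimum number of non-missing values a row or column must have in order to be kept.
The default is how = "any", i.e. any row or column (chosen with axis) that contains a missing value is dropped (see the example above);
how = "all" drops only the rows or columns whose values are all missing (see the example below);
Example: df[3] = np.nan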
df
0 1 2 3
0 1 NaN 2 NaN
1 2 3 5 NaN
2 NaN 4 6 NaN
df.dropna(axis = "columns",how = "all")
0 1 2
0 1 NaN 2
1 2 3 5
2 NaN 4 6
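The notes mention thresh but give no example. A minimal sketch on the same data follows; the name kept is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
df[3] = np.nan

print(df.isnull().sum())   # number of missing values in each column

# Keep only the rows that have at least 3 non-missing values.
kept = df.dropna(axis=0, thresh=3)
print(kept)
```

Only row 1 survives, because rows 0 and 2 each have just two non-missing values.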
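3) Filling missing values
Filling works much the same way for Series and DataFrame, except that a DataFrame fill also takes an axis parameter;
Use fillna() to fill missing values; it accepts a single fill value, or the method parameter for forward fill (method = "ffill") and backward fill (method = "bfill"); in pandas 2.1 and later the method parameter of fillna is deprecated in favor of the ffill() and bfill() methods;
Filling missing values in a Series:
Example: data = pd.Series([1,np.nan,2,None,3],index = list("abcde"))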
data
a 1
b NaN
c 2
d NaN
e 3
dtype: float64
data.fillna(0)
a 1
b 0
c 2
d 0
e 3
dtype: float64
data.fillna(method = "ffill")
a 1
b 1
c 2
d 2
e 3
dtype: float64
data.fillna(method = "bfill")
a 1
b 2
c 2
d 3
e 3
dtype: float64
Filling missing values in a DataFrame:
Example: df = pd.DataFrame([[1, np.nan, 2, np.nan],
[2, 3 , 5, np.nan],
[np.nan, 4, 6, np.nan]])
df
0 1 2 3
0 1 NaN 2 NaN
1 2 3 5 NaN
2 NaN 4 6 NaN
df.fillna(method = "ffill",axis = 1)
0 1 2 3
0 1 1 2 2
1 2 3 5 5
2 NaN 4 6 6
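On recent pandas versions the same fills can be written with the dedicated ffill() and bfill() methods. This is a sketch of the equivalent calls on the data used above; the names forward, backward, and along_rows are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))
forward = data.ffill()    # carry the previous value forward
backward = data.bfill()   # carry the next value backward

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, 3, 5, np.nan],
                   [np.nan, 4, 6, np.nan]])
along_rows = df.ffill(axis=1)   # fill left to right within each row

print(forward)
print(backward)
print(along_rows)
```

The first cell of the last row stays NaN because there is no value to its left to carry forward.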
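For details on axis in pandas, see: https://www.jianshu.com/p/bf60078103f2;
Reindexing:
Reindexing means calling reindex on a Series to rearrange its data according to a new index. Any label in the new index that is absent from the original index is filled with NaN;
Example 1: Reindexing a Series: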
obj = pd.Series([4.5,7.2,-5.3,3.6],index = ["d","b","a","c"])
print(obj)
obj2 = obj.reindex(["a","b","c","d","e"])
print(obj2)
obj3 = pd.Series(["blue","purple","yellow"],index = [0,2,4])
obj4 = obj3.reindex(range(6),method = "ffill")
print(obj4)
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
Example 2: Reindexing a DataFrame:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ["a","c","d"],columns = ["ohil","texas","california"])
# reindex the rows (index)
print(frame)
print(frame.reindex(["a","b","c","d"]))
# reindex the columns
states = ["texas","see","california"]
print(frame.reindex(columns = states))
# reindex both index and columns at once
frame1 = frame.reindex(index = ["a","b","c","d"], columns = states)
print(frame1)
frame1.loc["b"] = 0
print(frame1)
ohil texas california
a 0 1 2
c 3 4 5
d 6 7 8
ohil texas california
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
texas see california
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
texas see california
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
texas see california
a 1.0 NaN 2.0
b 0.0 0.0 0.0
c 4.0 NaN 5.0
d 7.0 NaN 8.0
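Open question:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ["a","c","d"],columns = ["ohil","texas","california"])
frame1 = frame.reindex(index = ["a","b","c","d"],method = "ffill", columns = states)
print(frame1)
ValueError: index must be monotonic increasing or decreasing (this error differs from the result given for the textbook example). A likely cause: method = "ffill" is also applied when reindexing the columns, and the column labels ["ohil","texas","california"] are not sorted, which forward filling requires.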
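One possible workaround, assuming the cause is the unsorted column labels, is to forward-fill along the row index only and select the columns in a separate step. The name filled_rows is introduced here for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["ohil", "texas", "california"])
states = ["texas", "see", "california"]

# Step 1: forward-fill only along the row index, which is sorted.
filled_rows = frame.reindex(index=["a", "b", "c", "d"], method="ffill")

# Step 2: pick the columns without any fill method.
frame1 = filled_rows.reindex(columns=states)
print(frame1)
```

Row "b" takes the values of row "a", and the new column "see" stays NaN.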
Dropping entries from an axis:
Example: data = pd.DataFrame(np.arange(16).reshape((4,4)),
index = ["ohil","colorado","utah","new york"],
columns = ["one","two","three","four"])
print(data)
print(data.drop("two",axis = 1))
print(data.drop("ohil")) # axis = 0 is the default and can be omitted
one two three four
ohil 0 1 2 3
colorado 4 5 6 7
utah 8 9 10 11
new york 12 13 14 15
one three four
ohil 0 2 3
colorado 4 6 7
utah 8 10 11
new york 12 14 15
one two three four
colorado 4 5 6 7
utah 8 9 10 11
new york 12 13 14 15
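Ranking and sorting:
Sorting:
To sort by row or column index (lexicographically), use the sort_index method, which returns a new sorted object; for descending order, call sort_index(ascending = False)
To sort a Series by its values, or a DataFrame by the values of one or more columns, use sort_values(); the by parameter names the column or columns;
Example: # sorting by index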
obj = pd.Series(range(4),index = ["d","b","c","a"])
print(obj.sort_index())
print(obj.sort_index(ascending = False))
frame = pd.DataFrame(np.arange(8).reshape((2,4)),index = ["one","three"],
columns = ["d","a","b","c"])
print(frame.sort_index(axis = 1))
# sorting by values
print(frame.sort_values(by = "c",ascending = False))
obj1 = pd.Series([4,7,-2,2])
print(obj1.sort_values())
a 3
b 1
c 2
d 0
dtype: int64
d 0
c 2
b 1
a 3
dtype: int64
a b c d
one 1 2 3 0
three 5 6 7 4
d a b c
three 4 5 6 7
one 0 1 2 3
2 -2
3 2
0 4
1 7
dtype: int64
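The text mentions sorting by several columns, which the example does not show. A minimal sketch with made-up data (the frame and the name ordered are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"a": [0, 1, 0, 1], "b": [4, 7, -3, 2]})

# Sort by "a" ascending, then break ties by "b" descending.
ordered = frame.sort_values(by=["a", "b"], ascending=[True, False])
print(ordered)
```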
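Ranking:
The method option for breaking ties when ranking:
method      Description
"average"   Default: assign the average rank to each value in a tied group
"min"       Use the minimum rank for the whole group
"max"       Use the maximum rank for the whole group
"first"     Assign ranks in the order the values appear in the original data
Example: obj = pd.Series([7,-5,4,2,0,4])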
print(obj.rank())
print(obj.rank(method = "first"))
# descending order
print(obj.rank(ascending = False,method = "first"))
frame = pd.DataFrame({"b":[4.3,7,-3,2],"a":[0,1,0,1],
"c":[-2,5,8,-2.5]})
print(frame)
print(frame.rank(axis = 0))
0 6.0
1 1.0
2 4.5
3 3.0
4 2.0
5 4.5
dtype: float64
0 6.0
1 1.0
2 4.0
3 3.0
4 2.0
5 5.0
dtype: float64
0 1.0
1 6.0
2 2.0
3 4.0
4 5.0
5 3.0
dtype: float64
a b c
0 0 4.3 -2.0
1 1 7.0 5.0
2 0 -3.0 8.0
3 1 2.0 -2.5
a b c
0 1.5 3.0 2.0
1 3.5 4.0 3.0
2 1.5 1.0 4.0
3 3.5 2.0 1.0
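The "min" and "max" options from the table are not exercised above. A short sketch on the same Series (the names lowest and highest are illustrative):

```python
import pandas as pd

obj = pd.Series([7, -5, 4, 2, 0, 4])

lowest = obj.rank(method="min")    # both 4s get the lowest rank of their group
highest = obj.rank(method="max")   # both 4s get the highest rank of their group

print(lowest)
print(highest)
```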
Hierarchical indexing:
Use the unstack() and stack() methods to convert between a Series and a DataFrame
Example: index = pd.MultiIndex.from_product([["a","b"],[1,2]],names = ["sex","age"])
data = np.arange(4)
data1 = pd.Series(data,index = index)
print(data1)
data2 = data1.unstack()
print(data2)
data3 = data2.stack()
print(data3)
sex age
a 1 0
2 1
b 1 2
2 3
dtype: int32
age 1 2
sex
a 0 1
b 2 3
sex age
a 1 0
2 1
b 1 2
2 3
dtype: int32
Creating a multi-level index explicitly:
Common approach: build a MultiIndex from a Cartesian product
pd.MultiIndex.from_product([["a","b"],[1,2]])
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Creating a Series from a Cartesian-product index
Example: index = pd.MultiIndex.from_product([["a","b"],[1,2]],names = ["sex","age"])
data = np.arange(4)
data1 = pd.Series(data,index = index)
print(data1)
sex age
a 1 0
2 1
b 1 2
2 3
dtype: int32
Creating a DataFrame from a Cartesian-product index
Example: index = pd.MultiIndex.from_product([["a","b"],[1,2]], names = ["sex","age"])
columns = ["see","bob","lol"]
data = np.arange(12).reshape((4,3))
data1 = pd.DataFrame(data,index = index, columns = columns)
print(data1)
see bob lol
sex age
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
Indexing and slicing a multi-level index:
Multi-level indexing on a Series:
Example: index = pd.MultiIndex.from_product([["a","b"],[1,2]],names = ["single","age"])
data = np.arange(4)
data1 = pd.Series(data,index = index)
print(data1)
print("*"*10)
print(data1["a",1])
print("*"*10)
print(data1["a"])
print("*"*10)
print(data1[1])
print(id(data1[1]))
# partial slicing requires the MultiIndex to be sorted
print(data1.loc[("a",2)])
print(id(data1.loc[("a",2)]))
print("*"*10)
print(data1.loc["a",2])
print("*"*10)
print(id(data1.loc["a",2]))
single age
a 1 0
2 1
b 1 2
2 3
dtype: int32
**********
0
**********
age
1 0
2 1
dtype: int32
**********
1
2421847444384
1
2421847444288
**********
1
**********
2421847444288
Multi-level indexing on a DataFrame:
Example: index = pd.MultiIndex.from_product([[2013,2014],[1,2]],names = ["year","visit"])
columns = pd.MultiIndex.from_product([["bob","guido","sue"],["hr","temp"]],names = ["subject","type"])
data = np.round(np.random.randn(4,6),1)
data1 = pd.DataFrame(data,index = index,columns = columns)
print(data1)
print("*"*20)
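# plain [] indexing, as used on a Series, selects from the columns of a DataFrame
print(data1["guido","hr"])
print("*"*20)
# iloc selects by integer position, regardless of the index levels
print(data1.iloc[:2,:2])
print("*"*20)
# loc can select on both rows and columns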
print(data1.loc[:,("bob","hr")])
print("*"*20)
print(data1.loc[(2013,1),("bob","hr")])
subject bob guido sue
type hr temp hr temp hr temp
year visit
2013 1 1.0 -3.4 -0.6 -2.0 0.5 -0.2
2 0.5 2.4 -0.3 -2.1 -0.0 -0.1
2014 1 -0.2 -0.5 0.5 -1.2 -2.0 1.8
2 -0.1 0.6 -1.2 0.3 -0.3 -1.2
********************
year visit
2013 1 -0.6
2 -0.3
2014 1 0.5
2 -1.2
Name: (guido, hr), dtype: float64
********************
subject bob
type hr temp
year visit
2013 1 1.0 -3.4
2 0.5 2.4
********************
year visit
2013 1 1.0
2 0.5
2014 1 -0.2
2 -0.1
Name: (bob, hr), dtype: float64
********************
1.0
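Rearranging multi-indexed rows and columns:
Sorted and unsorted indices:
If a MultiIndex is not sorted, slicing will fail; use the sort_index() method to sort the index first;
Example: index = pd.MultiIndex.from_product([["a","c","b"],[1,2]],names = ["year","visit"])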
data = np.random.rand(6)
data1 = pd.Series(data,index = index)
print(data1)
try:
    data1["a":"b"]
except KeyError as e:
    print(e)
# use the sort_index() method
data1 = data1.sort_index()
print(data1["a":"b"])
year visit
a 1 0.316982
2 0.988440
c 1 0.553392
2 0.189816
b 1 0.577293
2 0.894663
dtype: float64
'Key length (1) was greater than MultiIndex lexsort depth (0)'
year visit
a 1 0.316982
2 0.988440
b 1 0.577293
2 0.894663
dtype: float64
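Whether a MultiIndex is sorted can be checked before slicing. The sketch below assumes the is_monotonic_increasing property is an adequate check for this purpose, and uses fixed values instead of random ones; the names sorted_data and subset are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]], names=["year", "visit"])
data1 = pd.Series(np.arange(6), index=index)

# "c" comes before "b", so the index is not in sorted order.
print(data1.index.is_monotonic_increasing)

sorted_data = data1.sort_index()
print(sorted_data.index.is_monotonic_increasing)

# Label slicing works once the index is sorted.
subset = sorted_data.loc["a":"b"]
print(subset)
```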
Stacking and unstacking index levels:
index = pd.MultiIndex.from_product([["a","b","c"],["d","e"],[1,2]],names = ["year","see","visit"])
data = np.random.rand(12)
data1 = pd.Series(data,index = index)
print(data1)
print("*"*20)
print(data1.unstack(level = 1))
print("*"*20)
print(data1.unstack(level = 0))
print("*"*20)
print(data1.unstack(level = 2))
year see visit
a d 1 0.092771
2 0.830678
e 1 0.584222
2 0.391323
b d 1 0.381634
2 0.554762
e 1 0.616809
2 0.226231
c d 1 0.389731
2 0.484459
e 1 0.526906
2 0.405733
dtype: float64
********************
see d e
year visit
a 1 0.092771 0.584222
2 0.830678 0.391323
b 1 0.381634 0.616809
2 0.554762 0.226231
c 1 0.389731 0.526906
2 0.484459 0.405733
********************
year a b c
see visit
d 1 0.092771 0.381634 0.389731
2 0.830678 0.554762 0.484459
e 1 0.584222 0.616809 0.526906
2 0.391323 0.226231 0.405733
********************
visit 1 2
year see
a d 0.092771 0.830678
e 0.584222 0.391323
b d 0.381634 0.554762
e 0.616809 0.226231
c d 0.389731 0.484459
e 0.526906 0.405733
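Setting and resetting the index:
Another way to rearrange hierarchical data is to turn the index labels into columns, which the reset_index method does;
Example: index = pd.MultiIndex.from_product([["a","b","c"],["d","e"],[1,2]],names = ["year","see","visit"])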
data = np.random.rand(12)
data1 = pd.Series(data,index = index)
print(data1)
print(data1.reset_index(name = "join"))
year see visit
a d 1 0.260028
2 0.431808
e 1 0.651972
2 0.486900
b d 1 0.024664
2 0.466295
e 1 0.325269
2 0.175738
c d 1 0.507806
2 0.772308
e 1 0.913312
2 0.079500
dtype: float64
year see visit join
0 a d 1 0.260028
1 a d 2 0.431808
2 a e 1 0.651972
3 a e 2 0.486900
4 b d 1 0.024664
5 b d 2 0.466295
6 b e 1 0.325269
7 b e 2 0.175738
8 c d 1 0.507806
9 c d 2 0.772308
10 c e 1 0.913312
11 c e 2 0.079500
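The heading also mentions setting the index, which the example does not show. A minimal sketch of the round trip with set_index, on a smaller index than the one above (the names flat and restored are illustrative):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["year", "visit"])
data1 = pd.Series(np.arange(4), index=index)

flat = data1.reset_index(name="join")          # index levels become ordinary columns
restored = flat.set_index(["year", "visit"])   # columns become index levels again

print(flat)
print(restored)
```

Note that restored is a DataFrame with a single column "join", not a Series.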
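Aggregating data on a multi-level index
For hierarchically indexed data, the level parameter lets you aggregate over subsets of the data; (the level argument of mean was deprecated in pandas 1.3 and removed in 2.0, so the calls below need an older pandas version)
Example: index = pd.MultiIndex.from_product([[2013,2014],[1,2]],names = ["year","visit"])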
columns = pd.MultiIndex.from_product([["bob","guido","sue"],["hr","temp"]],names = ["subject","type"])
data = np.round(np.random.randn(4,6),1)
data1 = pd.DataFrame(data,index = index,columns = columns)
print(data1)
print("*"*20)
# level selects the index level to group by; axis selects the dimension to aggregate along
print(data1.mean(level = 0))
print("*"*20)
print(data1.mean(axis = 1,level = 0))
print("*"*20)
print(data1.mean(axis = 0,level = 1))
print("*"*20)
print(data1.mean(axis = 1,level = 1))
subject bob guido sue
type hr temp hr temp hr temp
year visit
2013 1 -0.3 -2.5 -0.8 1.4 0.7 -0.7
2 -0.9 -0.8 0.2 -0.8 1.1 -2.3
2014 1 0.2 -0.9 0.7 1.4 0.5 0.2
2 -0.0 -0.4 1.6 1.2 -0.5 0.4
********************
subject bob guido sue
type hr temp hr temp hr temp
year
2013 -0.6 -1.65 -0.30 0.3 0.9 -1.5
2014 0.1 -0.65 1.15 1.3 0.0 0.3
********************
subject bob guido sue
year visit
2013 1 -1.40 0.30 0.00
2 -0.85 -0.30 -0.60
2014 1 -0.35 1.05 0.35
2 -0.20 1.40 -0.05
********************
subject bob guido sue
type hr temp hr temp hr temp
visit
1 -0.05 -1.7 -0.05 1.4 0.6 -0.25
2 -0.45 -0.6 0.90 0.2 0.3 -0.95
********************
type hr temp
year visit
2013 1 -0.133333 -0.600000
2 0.133333 -1.300000
2014 1 0.466667 0.233333
2 0.366667 0.400000
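On pandas versions where mean no longer accepts level, a comparable result can be obtained with groupby(level = ...). The sketch below uses a smaller frame with fixed values so the averages are easy to verify; the names by_year and by_type are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=["year", "visit"])
columns = pd.MultiIndex.from_product([["bob", "sue"], ["hr", "temp"]],
                                     names=["subject", "type"])
data1 = pd.DataFrame(np.arange(16.0).reshape(4, 4), index=index, columns=columns)

# Row-wise aggregation: average the visits within each year.
by_year = data1.groupby(level="year").mean()

# Column-wise aggregation: transpose, group on the column level, transpose back.
by_type = data1.T.groupby(level="type").mean().T

print(by_year)
print(by_type)
```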
Combining datasets: concat and append
Combining datasets: merge and join