pandas的基本函数
实验目的
熟练掌握pandas基本函数使用方法
实验原理
列转行方法
stack函数:pandas.DataFrame.stack(self, level=-1, dropna=True),对于普通的DataFrame而言,直接列索引转换到最内层行索引,生一个Series对象。
对于层次化索引的DataFrame而言,可以将指定的索引层转换到行上,默认是将最内层的列索引转换到最内层行。
unstack函数:pandas.DataFrame.unstack(self, level=-1, fill_value=None),对于普通的DataFrame而言,直接将列索引转换到行索引的最外层索引,生成一个Series对象,对于层次化索引的DataFrame而言,和stack函数类似,似乎把两层索引当作一个整体,当level为列表时报错。
melt函数:pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name=‘value’, col_level=None),id_vars可以理解为结果需要保留的原始列,value_vars可以理解为需要列转行的列名;
var_name把列转行的列变量重新命名,默认为variable;value_name列转行对应变量的值的名称。
行转列方法
unstack函数:pandas.DataFrame.unstack(self, level=-1, fill_value=None)
实验环境
Python 3.6.1
Jupyter
实验内容
练习pandas的主要的基本函数的使用。
代码部分
import pandas as pd
import numpy as np
1.创建一个DataFrame,名为df
df = pd.DataFrame({'A':np.random.randint(1, 100, 4),'B':pd.date_range(start='20130101', periods=4, freq='D'),'C':pd.Series([1, 2, 3, 4],index=['zhang', 'li', 'zhou', 'wang'],dtype='float32'),'D':np.array([3] * 4,dtype='int32'), 'E':pd.Categorical(["test","train","test","train"]),'F':'foo'})
df
| A | B | C | D | E | F |
---|
zhang | 12 | 2013-01-01 | 1.0 | 3 | test | foo |
---|
li | 40 | 2013-01-02 | 2.0 | 3 | train | foo |
---|
zhou | 26 | 2013-01-03 | 3.0 | 3 | test | foo |
---|
wang | 27 | 2013-01-04 | 4.0 | 3 | train | foo |
---|
2.二维数据查看
df.head()
| A | B | C | D | E | F |
---|
zhang | 12 | 2013-01-01 | 1.0 | 3 | test | foo |
---|
li | 40 | 2013-01-02 | 2.0 | 3 | train | foo |
---|
zhou | 26 | 2013-01-03 | 3.0 | 3 | test | foo |
---|
wang | 27 | 2013-01-04 | 4.0 | 3 | train | foo |
---|
df.head(3)
| A | B | C | D | E | F |
---|
zhang | 12 | 2013-01-01 | 1.0 | 3 | test | foo |
---|
li | 40 | 2013-01-02 | 2.0 | 3 | train | foo |
---|
zhou | 26 | 2013-01-03 | 3.0 | 3 | test | foo |
---|
df.tail(2)
| A | B | C | D | E | F |
---|
zhou | 26 | 2013-01-03 | 3.0 | 3 | test | foo |
---|
wang | 27 | 2013-01-04 | 4.0 | 3 | train | foo |
---|
3.查看二维数据的索引、列名和数据
df.index
Index(['zhang', 'li', 'zhou', 'wang'], dtype='object')
df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
df.values
array([[12, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
[40, Timestamp('2013-01-02 00:00:00'), 2.0, 3, 'train', 'foo'],
[26, Timestamp('2013-01-03 00:00:00'), 3.0, 3, 'test', 'foo'],
[27, Timestamp('2013-01-04 00:00:00'), 4.0, 3, 'train', 'foo']],
dtype=object)
4.查看数据的统计信息
df.describe()
| A | C | D |
---|
count | 4.000000 | 4.000000 | 4.0 |
---|
mean | 26.250000 | 2.500000 | 3.0 |
---|
std | 11.441882 | 1.290994 | 0.0 |
---|
min | 12.000000 | 1.000000 | 3.0 |
---|
25% | 22.500000 | 1.750000 | 3.0 |
---|
50% | 26.500000 | 2.500000 | 3.0 |
---|
75% | 30.250000 | 3.250000 | 3.0 |
---|
max | 40.000000 | 4.000000 | 3.0 |
---|
5.二维数据转置
df.T
| zhang | li | zhou | wang |
---|
A | 12 | 40 | 26 | 27 |
---|
B | 2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 |
---|
C | 1 | 2 | 3 | 4 |
---|
D | 3 | 3 | 3 | 3 |
---|
E | test | train | test | train |
---|
F | foo | foo | foo | foo |
---|
6.排序
df.sort_index(axis=0, ascending=False)
| A | B | C | D | E | F |
---|
zhou | 26 | 2013-01-03 | 3.0 | 3 | test | foo |
---|
zhang | 12 | 2013-01-01 | 1.0 | 3 | test | foo |
---|
wang | 27 | 2013-01-04 | 4.0 | 3 | train | foo |
---|
li | 40 | 2013-01-02 | 2.0 | 3 | train | foo |
---|
df.sort_index(axis=0, ascending=True)
| A | B | C | D | E | F |
---|
li | 40 | 2013-01-02 | 2.0 | 3 | train | foo |
---|
wang | 27 | 2013-01-04 | 4.0 | 3 | train | foo |
---|
zhang | 12 | 2013-01-01 | 1.0 | 3 | test | foo |
---|
zhou | 26 | 2013-01-03 | 3.0 | 3 | test | foo |
---|
df.sort_index(axis=1, ascending=False)
| F | E | D | C | B | A |
---|
zhang | foo | test | 3 | 1.0 | 2013-01-01 | 12 |
---|
li | foo | train | 3 | 2.0 | 2013-01-02 | 40 |
---|
zhou | foo | test | 3 | 3.0 | 2013-01-03 | 26 |
---|
wang | foo | train | 3 | 4.0 | 2013-01-04 | 27 |
---|
df.sort_values(by='A')
| A | B | C | D | E | F |
---|
zhang | 12 | 2013-01-01 | 1.0 | 3 | test | foo |
---|
zhou | 26 | 2013-01-03 | 3.0 | 3 | test | foo |
---|
wang | 27 | 2013-01-04 | 4.0 | 3 | train | foo |
---|
li | 40 | 2013-01-02 | 2.0 | 3 | train | foo |
---|
7.重复值处理
data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4, 'k2':[1, 1, 2, 3, 3, 4, 4]})
data
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
1)检测重复行
data.duplicatedlicated()
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
2) 返回新数组,删除重复行
data.drop_duplicates()
3)删除k1列的重复数据,保留首行重复数据。
data.drop_duplicates(['k1'],keep='last')
8.映射
1)使用函数进行映射,将data中k1列的值转换为大写
data['k1']=data['k1'].map(str.upper)
data
| k1 | k2 |
---|
0 | ONE | 1 |
---|
1 | ONE | 1 |
---|
2 | ONE | 2 |
---|
3 | TWO | 3 |
---|
4 | TWO | 3 |
---|
5 | TWO | 4 |
---|
6 | TWO | 4 |
---|
2)使用字典表示映射关系,将data中k1列的值转换为小写。
data['k1']=data['k1'].map({'ONE':'one','TWO':'two'})
data
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
3) 使用lambda表达式表示映射关系,将data中k2列的值加5。
data['k2'] = data['k2'].map(lambda x:x+5)
data
| k1 | k2 |
---|
0 | one | 6 |
---|
1 | one | 6 |
---|
2 | one | 7 |
---|
3 | two | 8 |
---|
4 | two | 8 |
---|
5 | two | 9 |
---|
6 | two | 9 |
---|
4) 使用lambda表达式表示映射关系,将data中索引的值加5。
data.index = data.index.map(lambda x:x+5)
data
| k1 | k2 |
---|
5 | one | 6 |
---|
6 | one | 6 |
---|
7 | one | 7 |
---|
8 | two | 8 |
---|
9 | two | 8 |
---|
10 | two | 9 |
---|
11 | two | 9 |
---|
5) 使用lambda表达式表示映射关系,将data中列名转换为大写
data.columns=data.columns.map(str.upper)
data
| K1 | K2 |
---|
5 | one | 6 |
---|
6 | one | 6 |
---|
7 | one | 7 |
---|
8 | two | 8 |
---|
9 | two | 8 |
---|
10 | two | 9 |
---|
11 | two | 9 |
---|
9.数据离散化
from random import randrange
data=[randrange(100) for _ in range(10)]
category=[0,25,50,100]
pd.cut(data,category)
[(50, 100], (50, 100], (0, 25], (50, 100], (50, 100], (50, 100], (0, 25], (50, 100], (50, 100], (50, 100]]
Categories (3, interval[int64]): [(0, 25] < (25, 50] < (50, 100]]
1) 按category对data数据进行切分,使得参数right=False形成左闭右开区间
pd.cut(data,category,right=False)
[[50, 100), [50, 100), [0, 25), [50, 100), [50, 100), [50, 100), [0, 25), [50, 100), [50, 100), [50, 100)]
Categories (3, interval[int64]): [[0, 25) < [25, 50) < [50, 100)]
2 )按category对data数据进行切分,使得参数right=False形成左闭右开区间,并对每个区间打标签
labels = ['low', 'middle', 'high']
pd.cut(data,category,right=False,labels=labels)
['high', 'high', 'low', 'high', 'high', 'high', 'low', 'high', 'high', 'high']
Categories (3, object): ['low' < 'middle' < 'high']
3)对data数据按4分位进行切分。
data
pd.cut(data,4)
[(45.5, 66.75], (66.75, 88.0], (2.915, 24.25], (66.75, 88.0], (45.5, 66.75], (66.75, 88.0], (2.915, 24.25], (45.5, 66.75], (66.75, 88.0], (45.5, 66.75]]
Categories (4, interval[float64]): [(2.915, 24.25] < (24.25, 45.5] < (45.5, 66.75] < (66.75, 88.0]]
10.频次统计与位移
1)将df数据通过copy方法赋值为df1,然后对df1数据使用shift方法下移一行(负数表示上移)。
df1=df.copy()
df1.shift(1)
| A | B | C | D | E | F |
---|
zhang | NaN | NaT | NaN | NaN | NaN | NaN |
---|
li | 12.0 | 2013-01-01 | 1.0 | 3.0 | test | foo |
---|
zhou | 40.0 | 2013-01-02 | 2.0 | 3.0 | train | foo |
---|
wang | 26.0 | 2013-01-03 | 3.0 | 3.0 | test | foo |
---|
2)对df1中D列数据进行直方图统计
df1['D'].value_counts()
3 4
Name: D, dtype: int64
11.透视转换
df = pd.DataFrame({'a':[1,2,3,4],'b':[2,3,4,5],'c':[3,4,5,6], 'd':[3,3,3,3]})
df
1) 将df的a列值作为索引,b列值作为列名,c列值作为值,构建透视图。
df.pivot(index='a', columns='b', values='c')
b | 2 | 3 | 4 | 5 |
---|
a | | | | |
---|
1 | 3.0 | NaN | NaN | NaN |
---|
2 | NaN | 4.0 | NaN | NaN |
---|
3 | NaN | NaN | 5.0 | NaN |
---|
4 | NaN | NaN | NaN | 6.0 |
---|
2) 将df的a列值作为索引,b列值作为列名,d列值作为值,构建透视图
df.pivot(index='a', columns='b', values='d')
b | 2 | 3 | 4 | 5 |
---|
a | | | | |
---|
1 | 3.0 | NaN | NaN | NaN |
---|
2 | NaN | 3.0 | NaN | NaN |
---|
3 | NaN | NaN | 3.0 | NaN |
---|
4 | NaN | NaN | NaN | 3.0 |
---|
12.数据差分
1)新建数据帧名为df
df = pd.DataFrame({'a':np.random.randint(1, 100, 10),'b':np.random.randint(1, 100, 10)},index=map(str, range(10)))
df
| a | b |
---|
0 | 26 | 26 |
---|
1 | 16 | 6 |
---|
2 | 48 | 30 |
---|
3 | 2 | 52 |
---|
4 | 21 | 86 |
---|
5 | 81 | 63 |
---|
6 | 99 | 2 |
---|
7 | 9 | 73 |
---|
8 | 38 | 2 |
---|
9 | 51 | 59 |
---|
2)对df的行进行一阶差分
df.diff()
| a | b |
---|
0 | NaN | NaN |
---|
1 | 24.0 | 24.0 |
---|
2 | -86.0 | -3.0 |
---|
3 | 5.0 | 26.0 |
---|
4 | 56.0 | 15.0 |
---|
5 | 7.0 | -9.0 |
---|
6 | -40.0 | -21.0 |
---|
7 | -6.0 | -42.0 |
---|
8 | 22.0 | 61.0 |
---|
9 | 21.0 | -13.0 |
---|
3) 对df的列进行一阶差分
df.diff(axis=1)
| a | b |
---|
0 | NaN | -58.0 |
---|
1 | NaN | -58.0 |
---|
2 | NaN | 25.0 |
---|
3 | NaN | 46.0 |
---|
4 | NaN | 5.0 |
---|
5 | NaN | -11.0 |
---|
6 | NaN | 8.0 |
---|
7 | NaN | -28.0 |
---|
8 | NaN | 11.0 |
---|
9 | NaN | -23.0 |
---|
4) 对df的行进行二阶差分。
df.diff(periods=2)
| a | b |
---|
0 | NaN | NaN |
---|
1 | NaN | NaN |
---|
2 | -62.0 | 21.0 |
---|
3 | -81.0 | 23.0 |
---|
4 | 61.0 | 41.0 |
---|
5 | 63.0 | 6.0 |
---|
6 | -33.0 | -30.0 |
---|
7 | -46.0 | -63.0 |
---|
8 | 16.0 | 19.0 |
---|
9 | 43.0 | 48.0 |
---|
13.计算相关系数
1)新建一个dataframe名为df
df = pd.DataFrame({'A':np.random.randint(1, 100, 10),'B':np.random.randint(1, 100, 10),'C':np.random.randint(1, 100, 10)})
df
| A | B | C |
---|
0 | 5 | 69 | 16 |
---|
1 | 1 | 59 | 17 |
---|
2 | 90 | 92 | 62 |
---|
3 | 22 | 37 | 72 |
---|
4 | 23 | 49 | 80 |
---|
5 | 26 | 10 | 89 |
---|
6 | 10 | 13 | 7 |
---|
7 | 45 | 45 | 65 |
---|
8 | 17 | 43 | 56 |
---|
9 | 73 | 48 | 44 |
---|
2) 计算df的相关系数, pearson相关系数
df.corr()
| A | B | C |
---|
A | 1.000000 | 0.429682 | 0.357246 |
---|
B | 0.429682 | 1.000000 | -0.127607 |
---|
C | 0.357246 | -0.127607 | 1.000000 |
---|
3) 计算df的相关系数, Kendall相关系数
df.corr('kendall')
| A | B | C |
---|
A | 1.000000 | 0.066667 | 0.333333 |
---|
B | 0.066667 | 1.000000 | -0.244444 |
---|
C | 0.333333 | -0.244444 | 1.000000 |
---|
4) 计算df的相关系数, spearman秩相关
df.corr('spearman')
| A | B | C |
---|
A | 1.000000 | 0.042424 | 0.527273 |
---|
B | 0.042424 | 1.000000 | -0.272727 |
---|
C | 0.527273 | -0.272727 | 1.000000 |
---|
14.重塑Reshaping
1) 新进一个DataFrame,为名df,将df的前4行赋值给df2.
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
tuples
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['A', 'B'])
index
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['A', 'B'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
| | A | B |
---|
A | B | | |
---|
bar | one | 0.090815 | 0.776030 |
---|
two | 1.217861 | -0.726099 |
---|
baz | one | 0.801053 | -1.271264 |
---|
two | 0.785470 | 0.514176 |
---|
foo | one | -0.066081 | -0.724912 |
---|
two | -1.927954 | 0.247929 |
---|
qux | one | 0.030065 | 1.202135 |
---|
two | -0.169432 | -1.266427 |
---|
df2 = df[:4]
df2
| | A | B |
---|
A | B | | |
---|
bar | one | 0.090815 | 0.776030 |
---|
two | 1.217861 | -0.726099 |
---|
baz | one | 0.801053 | -1.271264 |
---|
two | 0.785470 | 0.514176 |
---|
2) 使用stack方法对df2进行列转行,将结果返回给stacked.
stacked = df2.stack()
stacked
A B
bar one A 0.090815
B 0.776030
two A 1.217861
B -0.726099
baz one A 0.801053
B -1.271264
two A 0.785470
B 0.514176
dtype: float64
3) 使用unstack方法对stacked进行行转列,默认level=2,解压最内层
stacked.unstack()
| | A | B |
---|
A | B | | |
---|
bar | one | 0.090815 | 0.776030 |
---|
two | 1.217861 | -0.726099 |
---|
baz | one | 0.801053 | -1.271264 |
---|
two | 0.785470 | 0.514176 |
---|
4) 使用unstack方法对stacked进行列转行,设置level=1,解压中间层
stacked.unstack(1)
| B | one | two |
---|
A | | | |
---|
bar | A | 0.090815 | 1.217861 |
---|
B | 0.776030 | -0.726099 |
---|
baz | A | 0.801053 | 0.785470 |
---|
B | -1.271264 | 0.514176 |
---|
5) 使用unstack方法对stacked进行列转行,默认level=0,解压最外层。
stacked.unstack(0)
| A | bar | baz |
---|
B | | | |
---|
one | A | 0.090815 | 0.801053 |
---|
B | 0.776030 | -1.271264 |
---|
two | A | 1.217861 | 0.785470 |
---|
B | -0.726099 | 0.514176 |
---|
6) 使用unstack方法对stacked进行列转行,默认level=‘A’,解压最外层
stacked.unstack('A')
| A | bar | baz |
---|
B | | | |
---|
one | A | 0.090815 | 0.801053 |
---|
B | 0.776030 | -1.271264 |
---|
two | A | 1.217861 | 0.785470 |
---|
B | -0.726099 | 0.514176 |
---|
15.melt函数:将DataFrame的列转行。
1) 新建一个DataFrame,名为df
df=pd.DataFrame(np.arange(8).reshape(2,4),index=['AA','BB'],columns=['A','B','C','D'])
df
2) 使用melt函数,将df进行列转行操作,保留A,C两个原始列,将B,D两列进行列转行,将列转行的列变量重新命名为B|D,列转行对应变量的值的名称命名为B|D_value 。
pd.melt(df,id_vars=['A','C'],value_vars=['B','D'],var_name='B|D',value_name='(B|D)_value')
| A | C | B|D | (B|D)_value |
---|
0 | 0 | 2 | B | 1 |
---|
1 | 4 | 6 | B | 5 |
---|
2 | 0 | 2 | D | 3 |
---|
3 | 4 | 6 | D | 7 |
---|
16.sub函数:截取DataFrame中的行或列。
1) 新建一个DataFrame,名为df
df=pd.DataFrame({ 'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df
| one | two | three |
---|
a | -0.553244 | 0.030308 | NaN |
---|
b | 0.827353 | 0.143289 | -0.986206 |
---|
c | -0.313427 | -0.875161 | 0.074300 |
---|
d | NaN | -0.164881 | 0.229687 |
---|
2)取df中索引为1的行,赋值给row,取列名为two的列赋值给column,使用sub方法将df的row行截取掉,axis='columns’或1。
row=df.iloc[1]
row
one 0.827353
two 0.143289
three -0.986206
Name: b, dtype: float64
column=df['two']
column
a 0.030308
b 0.143289
c -0.875161
d -0.164881
Name: two, dtype: float64
df.sub(row,axis='columns')
| one | two | three |
---|
a | -1.380597 | -0.112981 | NaN |
---|
b | 0.000000 | 0.000000 | 0.000000 |
---|
c | -1.140780 | -1.018450 | 1.060507 |
---|
d | NaN | -0.308170 | 1.215894 |
---|
df.sub(row,axis=1)
| one | two | three |
---|
a | -1.380597 | -0.112981 | NaN |
---|
b | 0.000000 | 0.000000 | 0.000000 |
---|
c | -1.140780 | -1.018450 | 1.060507 |
---|
d | NaN | -0.308170 | 1.215894 |
---|
3)使用sub方法将df的column列截取掉,axis='index’或0。
df.sub(column,axis='index')
| one | two | three |
---|
a | -0.583553 | 0.0 | NaN |
---|
b | 0.684063 | 0.0 | -1.129495 |
---|
c | 0.561734 | 0.0 | 0.949461 |
---|
d | NaN | 0.0 | 0.394568 |
---|
df.sub(column,axis=0)
| one | two | three |
---|
a | -0.583553 | 0.0 | NaN |
---|
b | 0.684063 | 0.0 | -1.129495 |
---|
c | 0.561734 | 0.0 | 0.949461 |
---|
d | NaN | 0.0 | 0.394568 |
---|
17.删除操作
1)删除指定行,返回一个删除后的DataFrame,对原始DataFrame不做改变。
data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,'k2':[1, 1, 2, 3, 3, 4, 4]})
data
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
data.drop(5,axis=0)
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
6 | two | 4 |
---|
2)删除指定行,对data本身进行删除操作。
data.drop(3,inplace=True)
data
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
3)删除指定列,返回一个删除后的DataFrame,对原始DataFrame不做改变。
data.drop('k1',axis=1)