Pandas Basics Tutorial (Code Notes Edition)
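I. Introduction
First, thanks again to Mofan (莫烦) on Bilibili for the lessons: packed with substance, and he is super gentle too. Following on from the previous post on NumPy, this chapter covers the basics of using Pandas; the matching videos are at mofanpy.com. Below are my notes on Mofan's material, including fixes for syntax that no longer works in current versions. If you need them, enjoy them alongside the videos~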
II. Tutorial
1. Installing the library
To use pandas you first have to install the library, like this:
pip install pandas
When we need the library in code, the usual convention is to import it like this:
import pandas as pd
Once the library is installed, we can start building data tables ovo
2. Introduction
The code for each part follows:
import pandas as pd
import numpy as np
s = pd.Series([1, 3, 6, np.nan, 44, 1])
print(s)
dates = pd.date_range('20160101', periods=6)
print(dates)
# label the rows and columns
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
print(df)
df1 = pd.DataFrame(np.arange(12).reshape(3, 4))
print(df1)
# build a DataFrame from a dictionary
df2 = pd.DataFrame({'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='int64'),
'D': np.array([3]*4, dtype='int32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'})
# try out various attributes and methods
print(df2)
print(df2.dtypes)
print(df2.index)
print(df2.columns)
print(df2.values)
print(df2.describe())
print(df2.T)
print(df2.sort_index(axis=0, ascending=False))
print(df2.sort_index(axis=1, ascending=False))
print(df2.sort_values(by='E'))
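PS: pay attention to these two calls:
print(df2.sort_index(axis=0, ascending=False))
print(df2.sort_index(axis=1, ascending=False))
Here axis=0 sorts the rows by their index labels, axis=1 sorts the columns by their column names, and ascending=False sorts in descending order.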
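To make the axis note concrete, here is a minimal sketch on a tiny frame. The frame `demo` and its labels are made up for illustration; only `sort_index` and `sort_values` come from the notes above.

```python
import pandas as pd

# Small hypothetical frame with unordered row and column labels
demo = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=[2, 0, 1],
    columns=["b", "c", "a"],
)

# axis=0: reorder the rows by their index labels, largest label first
by_rows = demo.sort_index(axis=0, ascending=False)
print(by_rows.index.tolist())    # [2, 1, 0]

# axis=1: reorder the columns by their column names, descending
by_cols = demo.sort_index(axis=1, ascending=False)
print(by_cols.columns.tolist())  # ['c', 'b', 'a']

# sort_values sorts by the data in a column, not by the labels
by_value = demo.sort_values(by="a", ascending=False)
print(by_value["a"].tolist())    # [9, 6, 3]
```

Note that sort_index never looks at the cell values; only sort_values does.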
Output:
D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_1_介绍.py
0 1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06'],
dtype='datetime64[ns]', freq='D')
a b c d
2016-01-01 -0.322726 -0.842901 0.202721 -0.591344
2016-01-02 -0.133020 0.372410 -0.529450 0.512845
2016-01-03 0.381274 -0.944750 0.673554 0.279403
2016-01-04 0.772122 0.998136 -0.659775 0.428906
2016-01-05 1.840408 -1.281372 -0.992370 0.562366
2016-01-06 1.282832 1.040648 -0.881359 -0.172878
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
A B C D E F
0 1.0 2013-01-02 1 3 test foo
1 1.0 2013-01-02 1 3 train foo
2 1.0 2013-01-02 1 3 test foo
3 1.0 2013-01-02 1 3 train foo
A float64
B datetime64[s]
C int64
D int32
E category
F object
dtype: object
Index([0, 1, 2, 3], dtype='int64')
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
[[1.0 Timestamp('2013-01-02 00:00:00') 1 3 'test' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1 3 'train' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1 3 'test' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1 3 'train' 'foo']]
A B C D
count 4.0 4 4.0 4.0
mean 1.0 2013-01-02 00:00:00 1.0 3.0
min 1.0 2013-01-02 00:00:00 1.0 3.0
25% 1.0 2013-01-02 00:00:00 1.0 3.0
50% 1.0 2013-01-02 00:00:00 1.0 3.0
75% 1.0 2013-01-02 00:00:00 1.0 3.0
max 1.0 2013-01-02 00:00:00 1.0 3.0
std 0.0 NaN 0.0 0.0
0 ... 3
A 1.0 ... 1.0
B 2013-01-02 00:00:00 ... 2013-01-02 00:00:00
C 1 ... 1
D 3 ... 3
E test ... train
F foo ... foo
[6 rows x 4 columns]
A B C D E F
3 1.0 2013-01-02 1 3 train foo
2 1.0 2013-01-02 1 3 test foo
1 1.0 2013-01-02 1 3 train foo
0 1.0 2013-01-02 1 3 test foo
F E D C B A
0 foo test 3 1 2013-01-02 1.0
1 foo train 3 1 2013-01-02 1.0
2 foo test 3 1 2013-01-02 1.0
3 foo train 3 1 2013-01-02 1.0
A B C D E F
0 1.0 2013-01-02 1 3 test foo
2 1.0 2013-01-02 1 3 test foo
1 1.0 2013-01-02 1 3 train foo
3 1.0 2013-01-02 1 3 train foo
3. Selecting data
Code:
import pandas as pd
import numpy as np
dates = pd.date_range('20240601', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A','B', 'C', 'D'])
print(df)
print(df['A'], df.A)
print(df[0:3], df['20240602':'20240604'])
# select by label: loc (location)
print(df.loc['20240601'])
print(df.loc['20240601', ['A', 'B']])
print(df.loc['20240601', :])
# select by integer position: iloc
print(df.iloc[[1, 3, 5], 1:3])
# # mixed selection: ix (removed in pandas 1.0; use loc or iloc instead)
# print(df.ix[:3, ['A', 'C']])
print("*"*30)
print(df)
print("*"*30)
print(df[(df.A > 8)])
print("*"*30)
print(df[(df.A > 8) & (df.B > 16)])
# the last table in the output below matches an element-wise mask like this one
print(df[df > 8])
Output:
D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_2_选择数据.py
A B C D
2024-06-01 0 1 2 3
2024-06-02 4 5 6 7
2024-06-03 8 9 10 11
2024-06-04 12 13 14 15
2024-06-05 16 17 18 19
2024-06-06 20 21 22 23
2024-06-01 0
2024-06-02 4
2024-06-03 8
2024-06-04 12
2024-06-05 16
2024-06-06 20
Freq: D, Name: A, dtype: int32 2024-06-01 0
2024-06-02 4
2024-06-03 8
2024-06-04 12
2024-06-05 16
2024-06-06 20
Freq: D, Name: A, dtype: int32
A B C D
2024-06-01 0 1 2 3
2024-06-02 4 5 6 7
2024-06-03 8 9 10 11
A B C D
2024-06-02 4 5 6 7
2024-06-03 8 9 10 11
2024-06-04 12 13 14 15
A 0
B 1
C 2
D 3
Name: 2024-06-01 00:00:00, dtype: int32
A 0
B 1
Name: 2024-06-01 00:00:00, dtype: int32
A 0
B 1
C 2
D 3
Name: 2024-06-01 00:00:00, dtype: int32
B C
2024-06-02 5 6
2024-06-04 13 14
2024-06-06 21 22
******************************
A B C D
2024-06-01 0 1 2 3
2024-06-02 4 5 6 7
2024-06-03 8 9 10 11
2024-06-04 12 13 14 15
2024-06-05 16 17 18 19
2024-06-06 20 21 22 23
******************************
A B C D
2024-06-04 12 13 14 15
2024-06-05 16 17 18 19
2024-06-06 20 21 22 23
******************************
A B C D
2024-06-05 16 17 18 19
2024-06-06 20 21 22 23
A B C D
2024-06-01 NaN NaN NaN NaN
2024-06-02 NaN NaN NaN NaN
2024-06-03 NaN 9.0 10.0 11.0
2024-06-04 12.0 13.0 14.0 15.0
2024-06-05 16.0 17.0 18.0 19.0
2024-06-06 20.0 21.0 22.0 23.0
Process finished with exit code 0
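One detail that is easy to miss in the output above: a label slice with loc includes both endpoints, while a position slice with iloc excludes the end, like ordinary Python slicing. A short sketch on the same kind of frame (the variable names `by_label`, `by_position` and `masked` are mine, not from the video):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20240601", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# loc is label-based: a label slice includes BOTH endpoints
by_label = df.loc["20240602":"20240604", ["A", "C"]]
print(by_label.shape)                # (3, 2)

# iloc is position-based: the end position is excluded
by_position = df.iloc[1:4, [0, 2]]
print(by_position.equals(by_label))  # True

# Boolean mask on the rows plus a column list, in a single loc call
masked = df.loc[df["A"] > 8, ["B", "D"]]
print(masked["B"].tolist())          # [13, 17, 21]
```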
4. Setting values
Code:
import pandas as pd
import numpy as np
dates = pd.date_range('20240601', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df.iloc[2, 2] = 1111
df.loc['20240601', 'B'] = 222
# replace values in one column where a condition holds
df.loc[df.A > 4, 'A'] = 0
# add a column of missing values
df['F'] = np.nan
df['E'] = pd.Series(np.arange(1, 7), index=pd.date_range('20240601', periods=6))
df['E'] = np.arange(1, 7)
print(df)
Output:
D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_3_设置值.py
A B C D F E
2024-06-01 0 222 2 3 NaN 1
2024-06-02 4 5 6 7 NaN 2
2024-06-03 0 9 1111 11 NaN 3
2024-06-04 0 13 14 15 NaN 4
2024-06-05 0 17 18 19 NaN 5
2024-06-06 0 21 22 23 NaN 6
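For conditional assignment, a single loc call with a row mask and a column label is usually the safer pattern, because it writes into the frame itself rather than into an intermediate object that may be a copy. A minimal sketch, assuming the same frame as above; the `partial` Series is a made-up example to show index alignment:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20240601", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Row mask + column label in one loc call modifies df directly
df.loc[df["A"] > 4, "A"] = 0
print(df["A"].tolist())      # [0, 4, 0, 0, 0, 0]

# The same pattern can overwrite whole rows that match a condition
df.loc[df["B"] > 16, :] = -1
print(df.iloc[-1].tolist())  # [-1, -1, -1, -1]

# A Series is aligned on the index when assigned as a new column;
# row labels missing from the Series end up as NaN
partial = pd.Series([10, 20], index=dates[:2])
df["E"] = partial
print(int(df["E"].isna().sum()))  # 4
```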
5. Handling missing data
Code:
import pandas as pd
import numpy as np
dates = pd.date_range('20240601', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
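df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan
print(df)
print('*' * 30)
# drop the two rows that contain NaN
print(df.dropna(axis=0, how='any'))  # how={'any', 'all'}
print('*' * 30)
print(df.fillna(value=0))
print('*' * 30)
print(df.isnull())
print('*' * 30)
# when the data is too large to spot NaN by eye, check whether any NaN exists at all
print(df.isnull().values.any())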
Output:
D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_4_处理丢失数据.py
A B C D
2024-06-01 0 NaN 2.0 3
2024-06-02 4 5.0 NaN 7
2024-06-03 8 9.0 10.0 11
2024-06-04 12 13.0 14.0 15
2024-06-05 16 17.0 18.0 19
2024-06-06 20 21.0 22.0 23
******************************
A B C D
2024-06-03 8 9.0 10.0 11
2024-06-04 12 13.0 14.0 15
2024-06-05 16 17.0 18.0 19
2024-06-06 20 21.0 22.0 23
******************************
A B C D
2024-06-01 0 0.0 2.0 3
2024-06-02 4 5.0 0.0 7
2024-06-03 8 9.0 10.0 11
2024-06-04 12 13.0 14.0 15
2024-06-05 16 17.0 18.0 19
2024-06-06 20 21.0 22.0 23
******************************
A B C D
2024-06-01 False True False False
2024-06-02 False False True False
2024-06-03 False False False False
2024-06-04 False False False False
2024-06-05 False False False False
2024-06-06 False False False False
******************************
True
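The how parameter of dropna only shows its effect when some row is entirely empty, which the example above does not have. Here is a small sketch on a made-up frame that does, together with a per-column NaN count, which can be easier to read than the full boolean table:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has one NaN, row 1 is all NaN, row 2 is complete
df = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0], "C": [7.0, np.nan, 9.0]}
)

# how='any' drops a row if at least one value is missing
rows_any = len(df.dropna(axis=0, how="any"))
print(rows_any)  # 1

# how='all' drops a row only if every value is missing
rows_all = len(df.dropna(axis=0, how="all"))
print(rows_all)  # 2

# Count the missing values per column
nan_counts = {col: int(n) for col, n in df.isnull().sum().items()}
print(nan_counts)  # {'A': 1, 'B': 2, 'C': 1}

# Single yes/no answer for the whole frame
has_nan = bool(df.isnull().values.any())
print(has_nan)  # True
```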
6. Import and export
Code:
import pandas as pd
data = pd.read_csv('data_12k_10c.csv')
print(data)
data.to_pickle('data.pickle')
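If you do not have the CSV file from the video at hand, the round trip can be tried on a throwaway frame. This is only a sketch: the file names `demo.csv` and `demo.pickle` and the temporary directory are my own choices, not part of the original lesson.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")        # hypothetical file names
    pickle_path = os.path.join(tmp, "demo.pickle")

    # index=False keeps the row labels out of the CSV file
    df.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    # pickle stores the DataFrame as a Python object, dtypes included
    df.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)

print(from_csv.values.tolist())  # [[0, 1], [2, 3], [4, 5]]
print(from_pickle.equals(df))    # True
```

Only unpickle files you created yourself or trust, since loading a pickle can execute arbitrary code.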
7. append (removed as of pandas 2.0; it only still works on older versions)
Code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
# append (removed in pandas 2.0)
# res1 = df1.append(df2, ignore_index=True)
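On current versions the usual replacement for append is pd.concat. A minimal sketch with the same two frames; the one-row frame `new_row` is a made-up example of adding a single row:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=["a", "b", "c", "d"], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=["b", "c", "d", "e"], index=[2, 3, 4])

# Stands in for the old df1.append(df2, ignore_index=True)
res = pd.concat([df1, df2], ignore_index=True)
print(res.shape)             # (6, 5)
print(res.columns.tolist())  # ['a', 'b', 'c', 'd', 'e']

# Adding a single row: wrap it in a one-row DataFrame first
new_row = pd.DataFrame([{"a": 9.0, "b": 9.0, "c": 9.0, "d": 9.0}])
res_row = pd.concat([df1, new_row], ignore_index=True)
print(res_row.index.tolist())  # [0, 1, 2, 3]
```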
7. join
Code:
import pandas as pd
import numpy as np
# join, ['inner', 'outer']
df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
print(df1)
print('*' * 30)
print(df2)
# join='outer' keeps the union of the columns; join='inner' keeps the intersection
res1 = pd.concat([df1, df2], join='inner', ignore_index=True)
print('*' * 30)
print(res1)
# # join_axes (removed in pandas 1.0)
# res2 = pd.concat([df1, df2], axis=1, join_axes=[df1.index])
# workaround suggested by a viewer in the bullet comments (that viewer is amazing!!!)
res2 = pd.concat([df1, df2.reindex(df1.index)], axis=1)
print('*' * 30)
print(res2)
Output:
D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_6_join.py
a b c d
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
******************************
b c d e
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
******************************
b c d
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
******************************
a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
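To see the union versus intersection behaviour side by side, here is a short sketch with the same two frames. The variable names `outer`, `inner` and `side` are mine:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=["a", "b", "c", "d"], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=["b", "c", "d", "e"], index=[2, 3, 4])

# join='outer' (the default): union of the columns, gaps filled with NaN
outer = pd.concat([df1, df2], join="outer", ignore_index=True)
print(outer.columns.tolist())        # ['a', 'b', 'c', 'd', 'e']
print(int(outer["a"].isna().sum()))  # 3

# join='inner': only the columns that both frames share
inner = pd.concat([df1, df2], join="inner", ignore_index=True)
print(inner.columns.tolist())        # ['b', 'c', 'd']

# With axis=1 the join applies to the row labels instead
side = pd.concat([df1, df2], axis=1, join="inner")
print(side.index.tolist())           # [2, 3]
```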
8. Concatenating DataFrames
Code:
import pandas as pd
import numpy as np
# concatenating
df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], dtype='int64')
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['a', 'b', 'c', 'd'], dtype='int64')
df3 = pd.DataFrame(np.ones((3, 4))*2, columns=['a', 'b', 'c', 'd'], dtype='int64')
print(df1)
print(df2)
print(df3)
res1 = pd.concat([df1, df2, df3], axis=0)
print('*'*30)
print(res1)
res2 = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
print('*'*30)
print(res2)
res3 = np.vstack((df1, df2, df3))
print('*'*30)
print(res3)
Output:
D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_6_合并DataFrame_concat.py
a b c d
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
a b c d
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
a b c d
0 2 2 2 2
1 2 2 2 2
2 2 2 2 2
******************************
a b c d
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
0 2 2 2 2
1 2 2 2 2
2 2 2 2 2
******************************
a b c d
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 1 1 1 1
4 1 1 1 1
5 1 1 1 1
6 2 2 2 2
7 2 2 2 2
8 2 2 2 2
******************************
[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
[1 1 1 1]
[1 1 1 1]
[1 1 1 1]
[2 2 2 2]
[2 2 2 2]
[2 2 2 2]]
Process finished with exit code 0
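The three results above differ in ways that matter later on: without ignore_index the row labels repeat, and np.vstack returns a plain NumPy array with no labels at all. A small sketch with two made-up frames showing the consequences:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4), dtype="int64"), columns=list("abcd"))
df2 = pd.DataFrame(np.ones((3, 4), dtype="int64"), columns=list("abcd"))

# Default: the original row labels are kept, so they repeat
kept = pd.concat([df1, df2], axis=0)
print(kept.index.tolist())   # [0, 1, 2, 0, 1, 2]
print(len(kept.loc[0]))      # 2  (label 0 now matches two rows)

# ignore_index=True builds a fresh 0..n-1 index
fresh = pd.concat([df1, df2], axis=0, ignore_index=True)
print(fresh.index.tolist())  # [0, 1, 2, 3, 4, 5]

# np.vstack drops the labels and returns an ndarray
stacked = np.vstack((df1, df2))
print(type(stacked).__name__, stacked.shape)  # ndarray (6, 4)
```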
9. merge
Code:
import pandas as pd
# #
# left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
# 'A': ['A0', 'A1', 'A2', 'A3'],
# 'B': ['B0', 'B1', 'B2', 'B3']})
#
# right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
# 'C': ['C0', 'C1', 'C2', 'C3'],
# 'D': ['D0', 'D1', 'D2', 'D3']})
#
# print(left)
# print('*' * 30)
# print(right)
# print('*' * 30)
# res = pd.merge(left, right, on='key')
# print(res)
# print('*' * 30)
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print('*' * 30)
print(right)
# how = ['left', 'right', 'outer', 'inner']
res1 = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print('*' * 30)
print(res1)
#
res2 = pd.merge(left, right, on=['key1', 'key2'], how='inner')
print('*' * 30)
print(res2)
res3 = pd.merge(left, right, on=['key1', 'key2'], how='right')
print('*' * 30)
print(res3)
df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})
print('*' * 30)
print(df1)
print('*' * 30)
print(df2)
res4 = pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')
print('*' * 30)
print(res4)
# merge by index
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
print('*' * 30)
print(left)
print('*' * 30)
print(right)
res5 = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(res5)
##
boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})
print(boys)
print('*' * 30)
print(girls)
res = pd.merge(boys, girls, on='k', suffixes=['_boy', '_girl'], how='inner')
print('*' * 30)
print(res)
Output:
D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_7_merge.py
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
******************************
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
******************************
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K0 NaN NaN C3 D3
5 K2 K1 A3 B3 NaN NaN
******************************
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
******************************
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
******************************
col1 col_left
0 0 a
1 1 b
******************************
col1 col_right
0 1 2
1 2 2
2 2 2
******************************
col1 col_left col_right indicator_column
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only
******************************
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
******************************
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
k age
0 K0 1
1 K1 2
2 K2 3
******************************
k age
0 K0 4
1 K0 5
2 K3 6
******************************
k age_boy age_girl
0 K0 1 4
1 K0 1 5
Process finished with exit code 0
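A quick way to get a feel for the four how options is to count the result rows for each one. This sketch uses two small made-up frames with a single key column:

```python
import pandas as pd

# Hypothetical frames: K0 and K2 appear on both sides, K1 only left, K3 only right
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K0", "K2", "K3"], "C": ["C0", "C2", "C3"]})

# Number of result rows for each join type
row_counts = {
    how: len(pd.merge(left, right, on="key", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# indicator=True adds a '_merge' column naming the source of each row
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
sources = sorted(outer["_merge"].astype(str))
print(sources)     # ['both', 'both', 'left_only', 'right_only']
```

Passing a string to indicator, as in the code above, just changes the name of that extra column.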
10. Plotting
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
matplotlib.use('TkAgg')
# # plot data
#
# # Series
# data = pd.Series(np.random.randn(1000), index=np.arange(1000))
# data = data.cumsum()
#
# plt.figure()
# data.plot()
# # plt.plot(x= , y=)
# plt.show()
# DataFrame
data = pd.DataFrame(np.random.randn(1000, 4),
index=np.arange(1000),
columns=list("ABCD"))
data = data.cumsum()
print(data.head(5))
# plot
# plt.figure()
ax1 = data.plot.scatter(x='A', y='B', color='DarkBlue', label='Class 1')
ax2 = data.plot.scatter(x='A', y='C', color='DarkGreen', label='Class 2', ax=ax1)
# use a third color so Class 3 can be told apart from Class 1
data.plot.scatter(x='A', y='D', color='DarkRed', label='Class 3', ax=ax2)
plt.show()
Output: [figure: scatter plots of B, C and D against A on shared axes]
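If no GUI window is available (for example on a server), the same kind of plot can be written to a file instead. This is a sketch under my own assumptions: it swaps the TkAgg backend for the non-interactive Agg backend, uses a fixed random seed, and saves to a made-up file name in the system temp directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the figure is reproducible
data = pd.DataFrame(rng.standard_normal((200, 3)), columns=list("ABC")).cumsum()

# The first call creates the Axes; later calls draw onto it through ax=
ax = data.plot.scatter(x="A", y="B", color="DarkBlue", label="Class 1")
data.plot.scatter(x="A", y="C", color="DarkGreen", label="Class 2", ax=ax)

# Hypothetical output path
out_path = os.path.join(tempfile.gettempdir(), "scatter_demo.png")
ax.figure.savefig(out_path)
plt.close(ax.figure)
print(out_path)
```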
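III. Summary
This post is a write-up of Mofan's pandas video lessons, summarizing eight commonly used DataFrame functions and how to use them. If you found it useful, remember to like, bookmark, and follow~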