Pandas Basics Tutorial (Code Notes Edition)

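I. Introduction

Thanks again to Mofan (莫烦) on Bilibili for his lectures: dense with useful material and delivered in a very friendly way. Following on from the previous NumPy post, this chapter covers the basic use of pandas; the corresponding videos are at mofanpy.com. The notes below organise Mofan's material and update the syntax that no longer works in current pandas versions, so they can be read alongside the videos.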

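II. Tutorial

1. Installing the library

To use pandas, first install the library:

pip install pandas

By convention, it is imported under the alias pd:

import pandas as pd

Once the library is installed, we can start building tables.

2. Introduction

The code for each part follows: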

import pandas as pd
import numpy as np

s = pd.Series([1, 3, 6, np.nan, 44, 1])
print(s)

dates = pd.date_range('20160101', periods=6)
print(dates)

# label the rows (index) and the columns
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
print(df)

df1 = pd.DataFrame(np.arange(12).reshape(3, 4))
print(df1)

# build a DataFrame from a dict: one key per column
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='int64'),
                    'D': np.array([3]*4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

# try out the common attributes and methods
print(df2)
print(df2.dtypes)
print(df2.index)
print(df2.columns)
print(df2.values)
print(df2.describe())
print(df2.T)
print(df2.sort_index(axis=0, ascending=False))
print(df2.sort_index(axis=1, ascending=False))
print(df2.sort_values(by='E'))

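Note these two lines:

print(df2.sort_index(axis=0, ascending=False))
print(df2.sort_index(axis=1, ascending=False))

Here axis=0 sorts the rows by their index labels, axis=1 sorts the columns by their column names, and ascending=False sorts in descending order.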
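Because every row of df2 is almost identical, the effect of the axis argument is easy to miss in the output. The sketch below uses a small made-up frame (the data and the names demo, by_rows, by_cols and by_vals are illustrative assumptions, not part of the tutorial) to show which labels each call reorders.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not from the tutorial).
demo = pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6]}, index=[10, 30, 20])

# axis=0 reorders the rows by their index labels, largest label first.
by_rows = demo.sort_index(axis=0, ascending=False)
print(by_rows.index.tolist())    # [30, 20, 10]

# axis=1 reorders the columns by their names; the rows stay where they are.
by_cols = demo.sort_index(axis=1, ascending=True)
print(by_cols.columns.tolist())  # ['a', 'b']

# sort_values sorts by the data in a column instead of by labels.
by_vals = demo.sort_values(by='a', ascending=False)
print(by_vals['a'].tolist())     # [6, 5, 4]
```

In short, sort_index only looks at labels, while sort_values looks at the contents of the column named in by.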

Output:

D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_1_介绍.py
0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float64
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')
                   a         b         c         d
2016-01-01 -0.322726 -0.842901  0.202721 -0.591344
2016-01-02 -0.133020  0.372410 -0.529450  0.512845
2016-01-03  0.381274 -0.944750  0.673554  0.279403
2016-01-04  0.772122  0.998136 -0.659775  0.428906
2016-01-05  1.840408 -1.281372 -0.992370  0.562366
2016-01-06  1.282832  1.040648 -0.881359 -0.172878
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
     A          B  C  D      E    F
0  1.0 2013-01-02  1  3   test  foo
1  1.0 2013-01-02  1  3  train  foo
2  1.0 2013-01-02  1  3   test  foo
3  1.0 2013-01-02  1  3  train  foo
A          float64
B    datetime64[s]
C            int64
D            int32
E         category
F           object
dtype: object
Index([0, 1, 2, 3], dtype='int64')
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
[[1.0 Timestamp('2013-01-02 00:00:00') 1 3 'test' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1 3 'train' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1 3 'test' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1 3 'train' 'foo']]
         A                    B    C    D
count  4.0                    4  4.0  4.0
mean   1.0  2013-01-02 00:00:00  1.0  3.0
min    1.0  2013-01-02 00:00:00  1.0  3.0
25%    1.0  2013-01-02 00:00:00  1.0  3.0
50%    1.0  2013-01-02 00:00:00  1.0  3.0
75%    1.0  2013-01-02 00:00:00  1.0  3.0
max    1.0  2013-01-02 00:00:00  1.0  3.0
std    0.0                  NaN  0.0  0.0
                     0  ...                    3
A                  1.0  ...                  1.0
B  2013-01-02 00:00:00  ...  2013-01-02 00:00:00
C                    1  ...                    1
D                    3  ...                    3
E                 test  ...                train
F                  foo  ...                  foo

[6 rows x 4 columns]
     A          B  C  D      E    F
3  1.0 2013-01-02  1  3  train  foo
2  1.0 2013-01-02  1  3   test  foo
1  1.0 2013-01-02  1  3  train  foo
0  1.0 2013-01-02  1  3   test  foo
     F      E  D  C          B    A
0  foo   test  3  1 2013-01-02  1.0
1  foo  train  3  1 2013-01-02  1.0
2  foo   test  3  1 2013-01-02  1.0
3  foo  train  3  1 2013-01-02  1.0
     A          B  C  D      E    F
0  1.0 2013-01-02  1  3   test  foo
2  1.0 2013-01-02  1  3   test  foo
1  1.0 2013-01-02  1  3  train  foo
3  1.0 2013-01-02  1  3  train  foo

3. Selecting data

Code:

import pandas as pd
import numpy as np

dates = pd.date_range('20240601', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A','B', 'C', 'D'])

print(df)
print(df['A'], df.A)
print(df[0:3], df['20240602':'20240604'])

# select by label: loc (location)
print(df.loc['20240601'])
print(df.loc['20240601', ['A', 'B']])
print(df.loc['20240601', :])

# select by integer position: iloc
print(df.iloc[[1, 3, 5], 1:3])

# mixed selection: ix (removed in pandas 1.0; use loc or iloc instead)
# print(df.ix[:3, ['A', 'C']])
print("*"*30)
print(df)
print("*"*30)
print(df[(df.A > 8)])
print("*"*30)
print(df[(df.A > 8) & (df.B > 16)])
# element-wise mask: values that fail the condition become NaN
print(df[df > 8])

Output:

D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_2_选择数据.py
             A   B   C   D
2024-06-01   0   1   2   3
2024-06-02   4   5   6   7
2024-06-03   8   9  10  11
2024-06-04  12  13  14  15
2024-06-05  16  17  18  19
2024-06-06  20  21  22  23
2024-06-01     0
2024-06-02     4
2024-06-03     8
2024-06-04    12
2024-06-05    16
2024-06-06    20
Freq: D, Name: A, dtype: int32 2024-06-01     0
2024-06-02     4
2024-06-03     8
2024-06-04    12
2024-06-05    16
2024-06-06    20
Freq: D, Name: A, dtype: int32
            A  B   C   D
2024-06-01  0  1   2   3
2024-06-02  4  5   6   7
2024-06-03  8  9  10  11
             A   B   C   D
2024-06-02   4   5   6   7
2024-06-03   8   9  10  11
2024-06-04  12  13  14  15
A    0
B    1
C    2
D    3
Name: 2024-06-01 00:00:00, dtype: int32
A    0
B    1
Name: 2024-06-01 00:00:00, dtype: int32
A    0
B    1
C    2
D    3
Name: 2024-06-01 00:00:00, dtype: int32
             B   C
2024-06-02   5   6
2024-06-04  13  14
2024-06-06  21  22
******************************
             A   B   C   D
2024-06-01   0   1   2   3
2024-06-02   4   5   6   7
2024-06-03   8   9  10  11
2024-06-04  12  13  14  15
2024-06-05  16  17  18  19
2024-06-06  20  21  22  23
******************************
             A   B   C   D
2024-06-04  12  13  14  15
2024-06-05  16  17  18  19
2024-06-06  20  21  22  23
******************************
             A   B   C   D
2024-06-05  16  17  18  19
2024-06-06  20  21  22  23
               A     B     C     D
2024-06-01   NaN   NaN   NaN   NaN
2024-06-02   NaN   NaN   NaN   NaN
2024-06-03   NaN   9.0  10.0  11.0
2024-06-04  12.0  13.0  14.0  15.0
2024-06-05  16.0  17.0  18.0  19.0
2024-06-06  20.0  21.0  22.0  23.0

Process finished with exit code 0
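Since .ix is gone, a selection that mixes row positions with column labels has to be expressed with either iloc or loc alone. The following is a minimal sketch of two ways to reproduce the commented-out df.ix[:3, ['A', 'C']]; the variable names cols, by_position and by_label are my own, and which option reads better is a matter of taste.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20240601', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# Option 1: translate the column labels into positions, then use iloc only.
cols = df.columns.get_indexer(['A', 'C'])
by_position = df.iloc[:3, cols]

# Option 2: translate the row positions into labels, then use loc only.
by_label = df.loc[df.index[:3], ['A', 'C']]

print(by_position)
print(by_label)
```

Both calls select the first three rows of columns A and C.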

4. Setting values

Code:

import pandas as pd
import numpy as np

dates = pd.date_range('20240601', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

df.iloc[2, 2] = 1111
df.loc['20240601', 'B'] = 222

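# replace values in one column: set A to 0 wherever A > 4
df.loc[df.A > 4, 'A'] = 0

# add a column of missing values
df['F'] = np.nan
# a Series is aligned on its index; a plain array is assigned by position
df['E'] = pd.Series(np.arange(1, 7), index=pd.date_range('20240601', periods=6))
df['E'] = np.arange(1, 7)
print(df)

Output:

D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_3_设置值.py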
            A    B     C   D   F  E
2024-06-01  0  222     2   3 NaN  1
2024-06-02  4    5     6   7 NaN  2
2024-06-03  0    9  1111  11 NaN  3
2024-06-04  0   13    14  15 NaN  4
2024-06-05  0   17    18  19 NaN  5
2024-06-06  0   21    22  23 NaN  6
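The two assignments to column E above give the same result only because the Series carries exactly the same dates as df. As a rough illustration of what happens otherwise, the sketch below assigns a Series whose index starts two days later (the shifted dates and the column name G are assumptions made up for this demo).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20240601', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# This Series starts two days later than df (hypothetical offset for the demo).
shifted = pd.Series(np.arange(1, 7), index=pd.date_range('20240603', periods=6))

# Assignment aligns on the index: dates missing from `shifted` become NaN,
# and dates in `shifted` that df does not have are silently dropped.
df['E'] = shifted

# A plain array has no index, so it is assigned purely by position.
df['G'] = np.arange(1, 7)
print(df)
```

Column E ends up with NaN on the first two dates and the values 1 to 4 on the rest, while G simply holds 1 to 6 in order.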

5. Handling missing data

Code:

import pandas as pd
import numpy as np

dates = pd.date_range('20240601', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
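df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan
print(df)

# drop every row that contains a NaN
print('*' * 30)
print(df.dropna(axis=0, how='any'))  # how: 'any' or 'all'
print('*' * 30)
print(df.fillna(value=0))
print('*' * 30)
print(df.isnull())

# when the table is too large to spot NaN by eye, check whether any exists
print('*' * 30)
print(df.isnull().values.any())

Output: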

D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_4_处理丢失数据.py
             A     B     C   D
2024-06-01   0   NaN   2.0   3
2024-06-02   4   5.0   NaN   7
2024-06-03   8   9.0  10.0  11
2024-06-04  12  13.0  14.0  15
2024-06-05  16  17.0  18.0  19
2024-06-06  20  21.0  22.0  23
******************************
             A     B     C   D
2024-06-03   8   9.0  10.0  11
2024-06-04  12  13.0  14.0  15
2024-06-05  16  17.0  18.0  19
2024-06-06  20  21.0  22.0  23
******************************
             A     B     C   D
2024-06-01   0   0.0   2.0   3
2024-06-02   4   5.0   0.0   7
2024-06-03   8   9.0  10.0  11
2024-06-04  12  13.0  14.0  15
2024-06-05  16  17.0  18.0  19
2024-06-06  20  21.0  22.0  23
******************************
                A      B      C      D
2024-06-01  False   True  False  False
2024-06-02  False  False   True  False
2024-06-03  False  False  False  False
2024-06-04  False  False  False  False
2024-06-05  False  False  False  False
2024-06-06  False  False  False  False
******************************
True
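A single True only says that some value is missing. To find out where, one possible follow-up is to count NaN per column and pull out the affected rows, as sketched below. The frame is built as float from the start, which is my own choice to avoid dtype upcasting when NaN is written in; isna is an alias of the isnull used above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20240601', periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

# Number of missing values in each column.
nan_per_column = df.isna().sum()
print(nan_per_column)

# Only the rows that contain at least one NaN.
rows_with_nan = df[df.isna().any(axis=1)]
print(rows_with_nan)

# axis=1 drops columns instead of rows.
cols_dropped = df.dropna(axis=1, how='any')
print(cols_dropped.columns.tolist())  # ['A', 'D']

# how='all' only drops rows in which every value is NaN (none here).
all_dropped = df.dropna(axis=0, how='all')
print(len(all_dropped))               # 6
```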

6. Import and export

Code:

import pandas as pd

data = pd.read_csv('data_12k_10c.csv')
print(data)


data.to_pickle('data.pickle')
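The snippet above needs data_12k_10c.csv on disk. For readers without that file, here is a self-contained sketch of the same round trip using a tiny made-up table and a temporary directory; the column names and file names are placeholders, not part of the tutorial.

```python
import os
import tempfile

import pandas as pd

# Small stand-in table; the tutorial's data_12k_10c.csv is not reproduced here.
frame = pd.DataFrame({'id': [1, 2, 3],
                      'name': ['a', 'b', 'c'],
                      'score': [90.5, 80.0, 70.25]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'demo.csv')
    pickle_path = os.path.join(tmp, 'demo.pickle')

    # index=False keeps the row index out of the file, so read_csv
    # does not bring it back as an extra unnamed column.
    frame.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    # Pickle stores the frame with its dtypes and index, but is Python-only.
    from_csv.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)

print(from_pickle)
```

CSV is the better choice for sharing data with other tools; pickle is convenient for caching intermediate results between Python runs, and should only be loaded from sources you trust.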

7. append (removed in pandas 2.0; it only works on older versions)

Code:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

# append (removed in pandas 2.0)
# res1 = df1.append(df2, ignore_index=True)
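On current pandas the usual substitute for DataFrame.append is pd.concat. The sketch below shows what I believe is the closest equivalent of the commented-out call, plus one way to add a single row; the variable row and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

# Equivalent of the old df1.append(df2, ignore_index=True):
# rows are stacked, and columns missing on one side are filled with NaN.
res1 = pd.concat([df1, df2], ignore_index=True)
print(res1)

# Adding a single row: turn the Series into a one-row DataFrame first.
row = pd.Series([9.0, 9.0, 9.0, 9.0], index=['a', 'b', 'c', 'd'])
res2 = pd.concat([df1, row.to_frame().T], ignore_index=True)
print(res2)
```

When many rows are added in a loop, collecting them in a list and calling pd.concat once at the end is generally cheaper than concatenating on every iteration.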

7. join

Code:

import pandas as pd
import numpy as np

# join: 'inner' or 'outer'

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

print(df1)
print('*' * 30)
print(df2)

# join='outer' keeps the union of the columns; join='inner' keeps only the intersection
res1 = pd.concat([df1, df2], join='inner', ignore_index=True)
print('*' * 30)
print(res1)

# join_axes (removed in pandas 1.0)
# res2 = pd.concat([df1, df2], axis=1, join_axes=[df1.index])

# replacement suggested by a viewer in the video's bullet comments
res2 = pd.concat([df1, df2.reindex(df1.index)], axis=1)
print('*' * 30)
print(res2)

Output:

D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_6_join.py
     a    b    c    d
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0
******************************
     b    c    d    e
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
******************************
     b    c    d
0  0.0  0.0  0.0
1  0.0  0.0  0.0
2  0.0  0.0  0.0
3  1.0  1.0  1.0
4  1.0  1.0  1.0
5  1.0  1.0  1.0
******************************
     a    b    c    d    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
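The join argument always applies to the axis that is not being concatenated. With axis=1 it therefore decides which rows survive, which makes a useful contrast with the reindex approach above that keeps exactly df1's rows. A short sketch (the names inner_rows and outer_rows are mine):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

# Side by side, keeping only the row labels present in both frames.
inner_rows = pd.concat([df1, df2], axis=1, join='inner')
print(inner_rows.index.tolist())  # [2, 3]

# Side by side, keeping every row label; gaps are filled with NaN.
outer_rows = pd.concat([df1, df2], axis=1, join='outer')
print(outer_rows.index.tolist())  # [1, 2, 3, 4]
```

Note that the shared column names b, c and d appear twice in both results, because concat along axis=1 does not merge columns with the same name.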

8. Concatenating DataFrames

Code:

import pandas as pd
import numpy as np

# concatenating
df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], dtype='int64')
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['a', 'b', 'c', 'd'], dtype='int64')
df3 = pd.DataFrame(np.ones((3, 4))*2, columns=['a', 'b', 'c', 'd'], dtype='int64')

print(df1)
print(df2)
print(df3)


res1 = pd.concat([df1, df2, df3], axis=0)
print('*'*30)
print(res1)

res2 = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
print('*'*30)
print(res2)

res3 = np.vstack((df1, df2, df3))
print('*'*30)
print(res3)

Output:

D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_6_合并DataFrame_concat.py
   a  b  c  d
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
   a  b  c  d
0  1  1  1  1
1  1  1  1  1
2  1  1  1  1
   a  b  c  d
0  2  2  2  2
1  2  2  2  2
2  2  2  2  2
******************************
   a  b  c  d
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
0  1  1  1  1
1  1  1  1  1
2  1  1  1  1
0  2  2  2  2
1  2  2  2  2
2  2  2  2  2
******************************
   a  b  c  d
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
3  1  1  1  1
4  1  1  1  1
5  1  1  1  1
6  2  2  2  2
7  2  2  2  2
8  2  2  2  2
******************************
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [1 1 1 1]
 [1 1 1 1]
 [1 1 1 1]
 [2 2 2 2]
 [2 2 2 2]
 [2 2 2 2]]

Process finished with exit code 0
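The last block of output is a plain NumPy array: np.vstack stacks the values but throws away the index and the column names. If the labels are needed again, they can be put back by hand, as in this sketch (the names stacked, rebuilt and concatenated are my own):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], dtype='int64')
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['a', 'b', 'c', 'd'], dtype='int64')
df3 = pd.DataFrame(np.ones((3, 4))*2, columns=['a', 'b', 'c', 'd'], dtype='int64')

# np.vstack returns an ndarray: values only, no labels.
stacked = np.vstack((df1, df2, df3))
print(type(stacked), stacked.shape)

# Re-attach the column names to get a DataFrame back.
rebuilt = pd.DataFrame(stacked, columns=df1.columns)

# pd.concat keeps the labels in a single step.
concatenated = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
print(rebuilt.values.tolist() == concatenated.values.tolist())
```

For labelled data pd.concat is usually the more direct route; np.vstack is mainly handy when the next step works on raw arrays anyway.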

9. merge

Code:

import pandas as pd

# #
# left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
#                      'A': ['A0', 'A1', 'A2', 'A3'],
#                      'B': ['B0', 'B1', 'B2', 'B3']})
#
# right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
#                      'C': ['C0', 'C1', 'C2', 'C3'],
#                       'D': ['D0', 'D1', 'D2', 'D3']})
#
# print(left)
# print('*' * 30)
# print(right)
# print('*' * 30)
# res = pd.merge(left, right, on='key')
# print(res)
# print('*' * 30)

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                     'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

print(left)
print('*' * 30)
print(right)

# how = ['left', 'right', 'outer', 'inner']
res1 = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print('*' * 30)
print(res1)
#
res2 = pd.merge(left, right, on=['key1', 'key2'], how='inner')
print('*' * 30)
print(res2)

res3 = pd.merge(left, right, on=['key1', 'key2'], how='right')
print('*' * 30)
print(res3)


df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})
print('*' * 30)
print(df1)

print('*' * 30)
print(df2)

res4 = pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')

print('*' * 30)
print(res4)


# merge by index
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

print('*' * 30)
print(left)
print('*' * 30)
print(right)

res5 = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(res5)
##

boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})

print(boys)
print('*' * 30)
print(girls)

res = pd.merge(boys, girls, on='k', suffixes=['_boy', '_girl'], how='inner')
print('*' * 30)
print(res)

Output:

D:\Software\Anacanda\envs\pythonProject4\python.exe D:/ALL_CODE/PYCHARM/pythonProject4/pandas/par_7_merge.py
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
******************************
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
******************************
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K0  NaN  NaN   C3   D3
5   K2   K1   A3   B3  NaN  NaN
******************************
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
******************************
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
******************************
   col1 col_left
0     0        a
1     1        b
******************************
   col1  col_right
0     1          2
1     2          2
2     2          2
******************************
   col1 col_left  col_right indicator_column
0     0        a        NaN        left_only
1     1        b        2.0             both
2     2      NaN        2.0       right_only
3     2      NaN        2.0       right_only
******************************
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
******************************
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
    k  age
0  K0    1
1  K1    2
2  K2    3
******************************
    k  age
0  K0    4
1  K0    5
2  K3    6
******************************
    k  age_boy  age_girl
0  K0        1         4
1  K0        1         5

Process finished with exit code 0
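The indicator column is useful for more than display. One common use, sketched here with the same df1 and df2 as above, is to filter on it to find the rows that exist on one side only. Passing indicator=True instead of a string names the column _merge; the names merged, left_only and counts are my own.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

# indicator=True adds a categorical column called "_merge".
merged = pd.merge(df1, df2, on='col1', how='outer', indicator=True)

# Rows of df1 that have no partner in df2.
left_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(left_only)

# How many rows came from each side.
counts = merged['_merge'].value_counts()
print(counts)
```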

10. Plotting

Code:

import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
matplotlib.use('TkAgg')

# # plot data
#
# # Series
# data = pd.Series(np.random.randn(1000), index=np.arange(1000))
# data = data.cumsum()
#
# plt.figure()
# data.plot()
# # plt.plot(x= , y=)
# plt.show()

# DataFrame
data = pd.DataFrame(np.random.randn(1000, 4),
                    index=np.arange(1000),
                    columns=list("ABCD"))
data = data.cumsum()
print(data.head(5))

# draw three scatter series on the same axes
# plt.figure()
ax1 = data.plot.scatter(x='A', y='B', color='DarkBlue', label='Class 1')
ax2 = data.plot.scatter(x='A', y='C', color='DarkGreen', label='Class 2', ax=ax1)
# use a third colour so that Class 3 can be told apart from Class 1
data.plot.scatter(x='A', y='D', color='DarkRed', label='Class 3', ax=ax2)
plt.show()

Output:
[Figure: scatter plot of columns B, C and D against column A]
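plt.show() needs a display and the TkAgg backend needs Tk. On a server, or in a script that should just produce an image, one option is to switch to the non-interactive Agg backend and save the figure to a file. The sketch below assumes matplotlib is installed; the fixed random seed and the output file name pandas_scatter_demo.png are arbitrary choices of mine.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend: renders to files, opens no window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same kind of data as above: cumulative sums of random noise.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((1000, 4)), columns=list('ABCD')).cumsum()

# Passing ax= draws the second series on the axes created by the first call.
ax = data.plot.scatter(x='A', y='B', color='DarkBlue', label='Class 1')
data.plot.scatter(x='A', y='C', color='DarkGreen', label='Class 2', ax=ax)

# Write the figure to disk instead of showing it.
out_path = os.path.join(tempfile.gettempdir(), 'pandas_scatter_demo.png')
ax.figure.savefig(out_path, dpi=100)
plt.close(ax.figure)
print(out_path)
```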

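III. Summary

This post organises Mofan's pandas video lessons and summarises eight commonly used groups of DataFrame functions and how to use them. If you found it useful, feel free to like, bookmark and follow.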
