pandas

Original post · last modified 2023-03-07 13:50:56 · first published 2023-03-04 12:05:16

License: CC 4.0 BY-SA

Tags: #pandas #python #programming-language

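1.Introduction to pandas

1.1.Overview

A library that Wes McKinney began developing in 2008
An open-source Python library for data analysis and data mining
Built on NumPy, so it benefits from NumPy's fast numerical computation
Plotting is built on matplotlib, which makes simple charts easy to draw
Provides its own data structures (Series and DataFrame)

1.2.Advantages

Makes tables and charts easier to read
Strong data-processing capabilities (for example, handling missing values)
Convenient file reading
Integrates with matplotlib and NumPy

2.Data structures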

2.1.Series

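A Series is made up of index/value pairs, much like a map (a Python dict) or a key-value document in MongoDB.

2.1.1.Creating a Series

First import the libraries

import pandas as pd
import numpy as np

The constructor is pd.Series(data=None, index=None, dtype=None)
data is the values, index is the labels (0..n-1 when omitted), and dtype is the element type

Given values, default index (the values are random, so your output will differ)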

pd.Series(np.random.randint(-10,10,5))

0    3
1    9
2    4
3   -6
4   -9
dtype: int32

Given values, explicit index

pd.Series(np.random.randint(-10,10,5),np.arange(5)+1)

1    1
2    8
3    0
4    7
5   -8
dtype: int32

Created from a dictionary

color_count = pd.Series({'red':100, 'blue':200, 'green': 500, 'yellow':1000})
color_count

red        100
blue       200
green      500
yellow    1000
dtype: int64
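
None of the examples above pass dtype. A minimal sketch of using all three constructor arguments together; the values and labels are made up for illustration:

```python
import pandas as pd

# Illustrative values and labels (not from the original examples)
s = pd.Series(data=[1, 2, 3], index=['a', 'b', 'c'], dtype='float64')

print(s.dtype)  # float64
print(s['b'])   # 2.0
```

The integers are converted to floats because dtype was given explicitly; without it pandas would infer an integer type.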

2.1.2.Series attributes

index

color_count.index

Index(['red', 'blue', 'green', 'yellow'], dtype='object')

values

color_count.values

array([ 100,  200,  500, 1000], dtype=int64)

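Access by position

# Use iloc for positional access; plain color_count[0] on a string-labelled
# Series is deprecated in recent pandas versions
color_count.iloc[0]

color_count.iloc[1]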
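
Label-based and position-based access also differ when slicing. A small sketch using the same color_count data: a label slice with loc includes its end label, while a position slice with iloc excludes its end position, as in ordinary Python.

```python
import pandas as pd

color_count = pd.Series({'red': 100, 'blue': 200, 'green': 500, 'yellow': 1000})

by_label = color_count.loc['red':'green']  # label slice: includes 'green'
by_position = color_count.iloc[0:2]        # position slice: excludes position 2

print(by_label.tolist())     # [100, 200, 500]
print(by_position.tolist())  # [100, 200]
```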

2.2.DataFrame

A DataFrame has two indexes, a row index and a column index, much like a spreadsheet in Excel

Row index: index, axis=0
Column index: columns, axis=1
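
The axis numbers are easiest to see with a reduction. A minimal sketch on a made-up 2x3 table: axis=0 works down the rows and gives one result per column, axis=1 works across the columns and gives one result per row.

```python
import pandas as pd

# Illustrative table, not from the original post
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=['r0', 'r1'], columns=['a', 'b', 'c'])

col_sums = df.sum(axis=0)  # one value per column
row_sums = df.sum(axis=1)  # one value per row

print(col_sums.tolist())  # [5, 7, 9]
print(row_sums.tolist())  # [6, 15]
```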

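2.2.1.Creating a DataFrame

Use the constructor pd.DataFrame(data=None, index=None, columns=None)

The following builds a DataFrame holding several students' scores in several subjects:

# Randomly generate scores in five subjects for ten students
score = np.random.randint(60,100,[10,5])
# Build a DataFrame from the 2-D array
score_df = pd.DataFrame(score)
score_df

Setting index and columns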

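# Column labels: the subjects (Chinese, math, English, politics, PE)
subjects = ["语文", "数学", "英语", "政治", "体育"]
# Row labels: one per student ("同学" means "student")
stu = ['同学' + str(i) for i in range(score_df.shape[0])]
# Rebuild the DataFrame with both row and column labels
data = pd.DataFrame(score, columns=subjects, index=stu)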
data

2.2.2.DataFrame attributes

shape

data.shape

(10, 5)

index

data.index

Index(['同学0', '同学1', '同学2', '同学3', '同学4', '同学5', '同学6', '同学7', '同学8', '同学9'], dtype='object')

columns

data.columns

Index(['语文', '数学', '英语', '政治', '体育'], dtype='object')

values

data.values

array([[78, 60, 94, 64, 85],
       [86, 60, 79, 72, 71],
       [97, 91, 67, 77, 90],
       [87, 92, 81, 60, 66],
       [74, 65, 96, 75, 68],
       [99, 69, 75, 84, 61],
       [85, 68, 93, 69, 85],
       [85, 80, 66, 80, 66],
       [81, 66, 94, 64, 70],
       [64, 69, 67, 86, 71]])

T (transpose)

data.T

head()

# Returns the first n rows; n defaults to 5
data.head(5)

tail()

# Returns the last n rows; n defaults to 5
data.tail(5)

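2.2.3.Setting the DataFrame index

2.2.3.1.Changing the index labels

# An Index cannot be changed element by element (data.index[0] = ... raises TypeError);
# assign a complete new list of labels instead ("学生" also means "student")
stu = ["学生_" + str(i) for i in range(score_df.shape[0])]
data.index = stu
data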

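2.2.3.2.Resetting the index

# reset_index returns a new DataFrame with the default 0..n-1 index;
# data itself is unchanged and keeps its 学生_i labels.
# drop defaults to False: the old index is kept as a new column.
# drop=True discards the old index instead.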
data2 = data.reset_index(drop=True)
data2
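
To make the effect of drop concrete, here is a small sketch on a made-up two-row table (the labels and scores are illustrative only):

```python
import pandas as pd

# Illustrative data, not the score table above
df = pd.DataFrame({'score': [90, 85]}, index=['stu_0', 'stu_1'])

kept = df.reset_index()              # drop=False: old index becomes a column named 'index'
dropped = df.reset_index(drop=True)  # drop=True: old index is discarded

print(kept.columns.tolist())     # ['index', 'score']
print(dropped.columns.tolist())  # ['score']
print(df.index.tolist())         # ['stu_0', 'stu_1'], the original is unchanged
```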

2.2.3.3.Using a column as the new index

df = pd.DataFrame({'month': [1, 4, 7, 10],
'year': [2012, 2013, 2012, 2014],
'sale':[55, 40, 84, 31]})
df

df.set_index("year")

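set_index returns a new DataFrame whose row index is the year column; df itself is unchanged. What this is useful for is left open here, to be expanded when it comes up later.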

2.3.MultiIndex
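
This section is empty in the original. As a hedged sketch only: one common way a MultiIndex arises is passing a list of columns to set_index, shown here on the sales table from the previous section.

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2013, 2012, 2014],
                   'sale': [55, 40, 84, 31]})

# One column as the index: rows can then be selected by year
by_year = df.set_index('year')
print(by_year.loc[2012, 'sale'].tolist())    # [55, 84]

# Two columns as the index: the result has a two-level MultiIndex
by_year_month = df.set_index(['year', 'month']).sort_index()
print(type(by_year_month.index).__name__)    # MultiIndex
print(list(by_year_month.index.names))       # ['year', 'month']
print(by_year_month.loc[(2012, 7), 'sale'])  # 84
```

A row of a MultiIndex frame is addressed with a tuple holding one label per level, here (year, month).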

3.Basic data operations

Files are read with pd.read_csv("path")
Reading the stock data

import pandas as pd
data = pd.read_csv("./data/stock_day.csv")
# Drop six columns (the moving-average columns)
data = data.drop(["ma5","ma10","ma20","v_ma5","v_ma10","v_ma20"],axis=1)
data.head()

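3.1.Index operations

3.1.1.Using row and column labels directly

# Index directly by label: column first, then row
data['open']['2018-02-27']
# 23.53
# Not supported:
# data['2018-02-27']['open']  # KeyError: the first key is looked up as a column name
# data[:1, :2]                # raises an error: no NumPy-style 2-D slicing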
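
The same rules can be checked without the CSV file. The sketch below uses a tiny hypothetical stand-in for the stock table (two dates, two columns) and also shows the single-step loc form that avoids chained indexing.

```python
import pandas as pd

# Hypothetical stand-in for the stock data: dates as the row index
data = pd.DataFrame({'open': [23.53, 22.80], 'close': [24.16, 23.53]},
                    index=['2018-02-27', '2018-02-26'])

chained = data['open']['2018-02-27']     # column first, then row
direct = data.loc['2018-02-27', 'open']  # row label and column label in one step

try:
    data['2018-02-27']['open']           # row first: treated as a column name
    row_first_failed = False
except KeyError:
    row_first_failed = True

print(chained, direct, row_first_failed)  # 23.53 23.53 True
```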

3.1.2.Indexing with loc or iloc

loc: select by row and column labels (a label slice includes both endpoints)

data.loc['2018-02-27':'2018-02-22', 'open']

iloc: select by row and column positions

# First 3 rows (days) and first 5 columns
data.iloc[:3, :5]
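
A short sketch of how the two accessors relate, on a hypothetical three-day stand-in for the stock table: the same region is selected once by labels and once by positions.

```python
import pandas as pd

# Hypothetical stand-in for the stock data, newest date first
data = pd.DataFrame(
    {'open': [23.53, 22.80, 22.88],
     'high': [25.88, 23.78, 23.37],
     'close': [24.16, 23.53, 22.82]},
    index=['2018-02-27', '2018-02-26', '2018-02-23'])

by_label = data.loc['2018-02-27':'2018-02-26', ['open', 'high']]
by_position = data.iloc[:2, :2]

print(by_label.equals(by_position))  # True
```

Note that the label slice names its last row explicitly, while the position slice stops before row 2.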

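3.2.Assignment

Indexing by column name, or the object.attribute form, can only overwrite a whole column

# Overwrite the column in place
data['close'] = 1
# or (the attribute form only works for a column that already exists)
data.close = 1

To change a particular row or region, use loc or iloc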
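
A minimal sketch of assigning through loc and iloc, on a hypothetical three-day stand-in for the stock table (the numbers are made up):

```python
import pandas as pd

# Hypothetical stand-in for the stock data
data = pd.DataFrame({'open': [23.53, 22.80, 22.88],
                     'close': [24.16, 23.53, 22.82]},
                    index=['2018-02-27', '2018-02-26', '2018-02-23'])

data.loc['2018-02-27', 'close'] = 0.0  # one cell, by labels
data.iloc[1, 0] = 1.0                  # one cell, by positions (row 1, column 0)
data.loc['2018-02-23'] = 9.0           # a whole row

print(data['open'].tolist())   # [23.53, 1.0, 9.0]
print(data['close'].tolist())  # [0.0, 23.53, 9.0]
```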

3.3.Sorting

3.3.1.Sorting a DataFrame

Sorting by one key

# Sort by open in ascending order
data.sort_values(by="open", ascending=True).head()

Sorting by several keys

# Sort by open in ascending order; ties are broken by high
data.sort_values(by=['open', 'high'])
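
When sorting by several keys, ascending can also be a list with one flag per key. A small sketch on made-up data, sorting open ascending and breaking ties by high descending:

```python
import pandas as pd

# Illustrative data with a tie in 'open'
df = pd.DataFrame({'open': [2.0, 1.0, 2.0], 'high': [3.0, 5.0, 4.0]},
                  index=['a', 'b', 'c'])

result = df.sort_values(by=['open', 'high'], ascending=[True, False])

print(result.index.tolist())  # ['b', 'c', 'a']
```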

Sorting by index

# sort_index sorts in ascending order by default (ascending=True);
# ascending=False here sorts the index in descending order
data.head().sort_index(ascending=False)

3.3.2.Sorting a Series

Sorting a Series works essentially the same way as sorting a DataFrame

Sort by values

data['p_change'].sort_values(ascending=True).head()

Sort by index

data['p_change'].sort_index().head()

4.DataFrame operations

4.1.Arithmetic operations

# Add 1 to every value in the open column (returns a new Series)
data['open'].add(1)
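
The method forms such as add and sub behave like the ordinary operators, and they also work between two columns. A minimal sketch on a hypothetical two-day stand-in for the stock table:

```python
import pandas as pd

# Hypothetical stand-in for the stock data
data = pd.DataFrame({'open': [23.53, 22.80], 'close': [24.16, 23.53]},
                    index=['2018-02-27', '2018-02-26'])

plus_one = data['open'].add(1)            # same as data['open'] + 1
change = data['close'].sub(data['open'])  # same as data['close'] - data['open']

print(plus_one.equals(data['open'] + 1))            # True
print(change.equals(data['close'] - data['open']))  # True
```

Neither call modifies data; assign the result to a column if it should be kept.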

4.2.Logical operations

(data["open"] > 23).head()
# 2018-02-27 True
# 2018-02-26 False
# 2018-02-23 False
# 2018-02-22 False
# 2018-02-14 False

The expression above only yields a boolean Series; that Series can then be used to filter rows:

data[data["open"] > 23].head()

Multiple conditions:

data[(data["open"] > 23) & (data["open"] < 24)].head()

The query method does the same thing and is easier to read:

data.query("open<24 & open>23").head()
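
A sketch checking that the boolean mask and query select the same rows, on a hypothetical stand-in for the stock table. The variables low and high are introduced here for illustration; inside a query string, Python variables are referenced with an @ prefix.

```python
import pandas as pd

# Hypothetical stand-in for the stock data
data = pd.DataFrame({'open': [23.53, 22.80, 24.10, 23.10]},
                    index=['2018-02-27', '2018-02-26', '2018-02-23', '2018-02-22'])

low, high = 23, 24  # illustrative bounds

by_mask = data[(data['open'] > low) & (data['open'] < high)]
by_query = data.query('open > @low & open < @high')

print(by_mask.index.tolist())    # ['2018-02-27', '2018-02-22']
print(by_mask.equals(by_query))  # True
```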

Filtering on specific values with isin (similar to IN in SQL):

data[data["open"].isin([23.53, 23.85])]
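
isin returns a boolean Series, so it can be negated with ~ to get the complementary rows. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical stand-in for the stock data
data = pd.DataFrame({'open': [23.53, 22.80, 23.85]},
                    index=['2018-02-27', '2018-02-26', '2018-02-23'])

wanted = [23.53, 23.85]
selected = data[data['open'].isin(wanted)]  # rows whose open is in the list
others = data[~data['open'].isin(wanted)]   # all remaining rows

print(selected.index.tolist())  # ['2018-02-27', '2018-02-23']
print(others.index.tolist())    # ['2018-02-26']
```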

4.3.Statistical operations

describe(): summary statistics for each column

data.describe()

max(), min(): the maximum and minimum of each column

data.max()
# open                34.99
# high                36.35
# close               35.21
# low                 34.01
# volume          501915.41
# price_change         3.03
# p_change            10.03
# turnover            12.56
# dtype: float64

var(), std(): variance and standard deviation

data.var()
# open            1.545255e+01
# high            1.662665e+01
# close           1.554572e+01
# low             1.437902e+01
# volume          5.458124e+09
# price_change    8.072595e-01
# p_change        1.664394e+01
# turnover        4.323800e+00
# dtype: float64

data.std()
# open                3.930973
# high                4.077578
# close               3.942806
# low                 3.791968
# volume          73879.119354
# price_change        0.898476
# p_change            4.079698
# turnover            2.079375
# dtype: float64

median(): median

data.median()
# open               21.44
# high               21.97
# close              21.45
# low                20.98
# volume          83175.93
# price_change        0.05
# p_change            0.26
# turnover            2.50
# dtype: float64

idxmax(), idxmin(): the index label of the row holding each column's maximum or minimum

data.idxmax()
# open            2015-06-15
# high            2015-06-10
# close           2015-06-12
# low             2015-06-12
# volume          2017-10-26
# price_change    2015-06-09
# p_change        2015-08-28
# turnover        2017-10-26
# dtype: object
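
These statistics are easier to verify by hand on a tiny table. A sketch with made-up numbers; note that var and std use the sample definition (dividing by n-1) by default.

```python
import pandas as pd

# Illustrative data, small enough to check by hand
df = pd.DataFrame({'open': [1.0, 2.0, 3.0], 'close': [2.0, 6.0, 4.0]},
                  index=['d1', 'd2', 'd3'])

print(df.max().tolist())     # [3.0, 6.0]
print(df.var().tolist())     # [1.0, 4.0]
print(df.std().tolist())     # [1.0, 2.0]
print(df.median().tolist())  # [2.0, 4.0]
print(df.idxmax().tolist())  # ['d3', 'd2']
```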

cumsum (cumulative sum)

# Sort by date in ascending order
data = data.sort_index()
stock_rise = data['p_change']
# cumsum returns a new Series; stock_rise itself is unchanged
stock_rise.cumsum()

# 2015-03-02      2.62
# 2015-03-03      4.06
# 2015-03-04      5.63
# 2015-03-05      7.65
# 2015-03-06     16.16
#                ...
# 2018-02-14    112.59
# 2018-02-22    114.23
# 2018-02-23    116.65
# 2018-02-26    119.67
# 2018-02-27    122.35
# Name: p_change, Length: 643, dtype: float64

Visualising the change with matplotlib

import matplotlib.pyplot as plt
# plot draws the chart
stock_rise.cumsum().plot()
# In a script, show must be called for the figure to appear
plt.show()

cummax (running maximum so far)
cummin (running minimum so far)
cumprod (cumulative product)
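
The four cumulative functions side by side on a short made-up Series:

```python
import pandas as pd

s = pd.Series([2, 1, 3, 2])  # illustrative values

print(s.cumsum().tolist())   # [2, 3, 6, 8]
print(s.cummax().tolist())   # [2, 2, 3, 3]
print(s.cummin().tolist())   # [2, 1, 1, 1]
print(s.cumprod().tolist())  # [2, 2, 6, 12]
```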

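4.4.Custom operations

API: df.apply(func, axis=0)
func is the function to apply; axis decides what is passed to it: axis=0 (the default) passes each column, axis=1 passes each row
Example:

# Range (maximum minus minimum) of the open and close columns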
data[['open', 'close']].apply(lambda x: x.max() - x.min())
# open     22.74
# close    22.85
# dtype: float64
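
The example above uses the default axis=0. A sketch of both directions on a hypothetical two-day stand-in for the stock table:

```python
import pandas as pd

# Hypothetical stand-in for the stock data
data = pd.DataFrame({'open': [23.0, 22.0], 'close': [24.5, 21.0]},
                    index=['2018-02-27', '2018-02-26'])

# axis=0 (default): the function receives each column, one result per column
col_range = data.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row, one result per row
row_change = data.apply(lambda row: row['close'] - row['open'], axis=1)

print(col_range.tolist())   # [1.0, 3.5]
print(row_change.tolist())  # [1.5, -1.0]
```

For simple arithmetic like the row-wise difference, data['close'] - data['open'] gives the same result and is usually faster; apply is meant for logic that has no built-in equivalent.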

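5.Plotting with pandas

The general entry point is DataFrame.plot(x=None, y=None, kind='line', **kwargs); each chart type also has its own method, for example DataFrame.plot.barh(x=None, y=None, **kwargs)
The official pandas documentation describes the usage in detail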
http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.html
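
A minimal sketch of a horizontal bar chart with DataFrame.plot.barh. The table is made up, and the Agg backend is selected only so the example runs without a display; in a notebook or desktop session that line can be dropped and plt.show() used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data, not from the original post
df = pd.DataFrame({'subject': ['math', 'english', 'sports'],
                   'score': [90, 75, 82]})

# x supplies the bar labels, y the bar lengths
ax = df.plot.barh(x='subject', y='score')
ax.set_xlabel('score')

fig = ax.get_figure()
plt.close(fig)  # release the figure once it is no longer needed
```

plot.barh returns a matplotlib Axes, so any further styling goes through the usual matplotlib calls on ax.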