04. pandas - CSDN blog

import pandas as pd
import numpy as np
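# Create a random 2-D array: 10 rows x 5 columns, drawn from a normal distribution (mean 0, std 5)
ndarray = np.random.normal(0,5,(10,5))
# Wrap the array in a DataFrame
date = pd.DataFrame(ndarray)
# Select values: date[:] returns every row; loc slices by label and includes both endpoints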
date[:]
date.loc[:6,:2]
#type(date)
#help(pd.DataFrame)

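3 Assignment

# Assign directly with df[column label] = value
# Note: if the DataFrame has no column with that label, a new column is appended at the end
date[0] = 1000      # label 0 exists, so the whole column is overwritten
date['0'] = 1400    # the string '0' is a different label, so a new column is appended
date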
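The difference between the integer label 0 and the string label '0' is easy to miss, so here is a minimal sketch on a small deterministic frame. The name `df` and its contents are made up for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame; pandas labels its columns with the integers 0 and 1.
df = pd.DataFrame(np.arange(6).reshape(3, 2))

df[0] = 1000     # label 0 exists: every value in that column is overwritten
df["0"] = 1400   # label "0" (a string) does not exist: a new last column is created

print(df.columns.tolist())   # [0, 1, '0']
print(df.shape)              # (3, 3)
```

Assigning a scalar broadcasts it to every row, which is why both columns end up holding a single repeated value.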

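4 Sorting

# There are two ways to sort: by index or by values

1. Use DataFrame.sort_values(by=, ascending=)
- Sorts by a single key or by several keys; ascending by default
- ascending=False: descending order
- ascending=True: ascending order

date.sort_values(by = [0,1,2],ascending=True).head()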

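2. Use Series.sort_index() to sort by index

# 'object' stands in for the label of a column in your own data; the random frame built above has no such column
date['object'].sort_values(ascending=True)
date['object'].sort_index()

date[3].sort_index(ascending=False).head()
date[3].sort_values(ascending=True).head()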
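Because the frame above is random, the effect of sorting is hard to check by eye. The following sketch uses a small hand-written frame (the column names `group` and `score` and the index labels are assumptions for illustration) to show sorting by several keys and sorting by index.

```python
import pandas as pd

# Hypothetical data with a deliberately unsorted index.
df = pd.DataFrame(
    {"group": [2, 1, 2, 1], "score": [0.5, 3.0, -1.0, 2.0]},
    index=["d", "a", "c", "b"],
)

# Sort by group ascending, then by score descending within each group.
by_values = df.sort_values(by=["group", "score"], ascending=[True, False])
print(by_values.index.tolist())   # ['a', 'b', 'd', 'c']

# Sort a single column (a Series) by its index labels.
by_index = df["score"].sort_index()
print(by_index.tolist())          # [3.0, 2.0, -1.0, 0.5]
```

Passing a list to `ascending` sets the direction per key; a single Boolean applies to all keys.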

5 DataFrame operations

1. Add a number to a column

add(other)

2. Subtract a number from a column

sub(other)

date[0].sub(5).head()
date[0].add(5).head()

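3. Logical operations (<, >, |, &)

Apply the comparison operator directly to the object; the result is a Boolean object of the same shape.
Note: | and & bind more tightly than the comparison operators, so each comparison must be wrapped in parentheses.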
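A minimal sketch of combining comparisons, assuming a made-up frame with columns `a` and `b` (these names are not from the original post):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 8, 12], "b": [10, 3, 7, 1]})

# Each comparison is parenthesised; without the parentheses Python would
# evaluate `2 & df["b"]` first, which is not what is intended.
mask = (df["a"] > 2) & (df["b"] < 5)
print(mask.tolist())            # [False, True, False, True]

selected = df[mask]             # keep only the rows where the mask is True
either = df[(df["a"] > 10) | (df["b"] > 9)]
print(either.index.tolist())    # [0, 3]
```

The Boolean Series can be used directly as a row filter, as in `df[mask]`.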

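4. Logical operation functions

date.query(expr)  # expr is a query string containing the Boolean expression
date['object'].isin([value1, value2, ...])  # tests each element for membership in the list and returns False/True per element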
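As a rough illustration of both functions, the sketch below reuses a small made-up frame; the column names `a` and `b` are assumptions, and `query` only accepts this form when the column labels are valid Python identifiers.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 8, 12], "b": [10, 3, 7, 1]})

# query: the condition is written as a string that refers to column names.
matched = df.query("a > 2 and b < 5")
print(matched.index.tolist())   # [1, 3]

# isin: element-wise membership test, returns a Boolean Series.
flags = df["a"].isin([5, 12])
print(flags.tolist())           # [False, True, False, True]
```

`df[flags]` then selects the matching rows, in the same way as any other Boolean mask.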

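5. Statistical operations

date.describe() # reports count, mean, standard deviation, minimum, quartiles and maximum for each column
Statistical functions

count     number of non-NA values
sum       sum
mean      mean
median    median
min       minimum
max       maximum
mode      mode (most frequent value)
abs       absolute value
prod      product
std       standard deviation
var       variance
idxmax    index label of the maximum value
idxmin    index label of the minimum value
cumsum    cumulative sum of the first 1, 2, 3, ..., n values
cummax    cumulative maximum of the first 1, 2, 3, ..., n values
cummin    cumulative minimum of the first 1, 2, 3, ..., n values
cumprod   cumulative product of the first 1, 2, 3, ..., n values
Note: when called on a DataFrame, each of these functions works down the columns by default (axis=0), so every column is processed separately; to work along each row instead, pass axis=1.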

date.describe()
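To make the axis convention concrete, here is a small sketch on hand-written numbers (the frame and its column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 60]})

col_sums = df.sum()            # axis=0 (default): one result per column
row_sums = df.sum(axis=1)      # axis=1: one result per row
print(col_sums.tolist())       # [6, 90]
print(row_sums.tolist())       # [11, 22, 63]

print(df["b"].idxmax())        # 2, the index label of the largest value in b
print(df["a"].cumsum().tolist())   # [1, 3, 6]
```

`cumsum` returns a result of the same length as its input, whereas `sum` and `idxmax` collapse each column to a single value.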

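6. Custom operations

apply(func, axis=0)
- func: the user-defined function
- axis=0: the default; func receives each column in turn. With axis=1, func receives each row in turn

date[[3]].apply(lambda x: x.max()-x.min(),axis=0)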
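The same max-minus-min function can be applied in both directions. This is only a sketch on a made-up frame; the helper name `value_range` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 60]})

def value_range(values):
    """Return the spread (max - min) of one column or one row."""
    return values.max() - values.min()

per_column = df.apply(value_range, axis=0)   # func sees each column
per_row = df.apply(value_range, axis=1)      # func sees each row
print(per_column.tolist())   # [2, 50]
print(per_row.tolist())      # [9, 18, 57]
```

With axis=0 the result is indexed by column label, and with axis=1 by row label.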

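6 Plotting with pandas

1. pandas.DataFrame.plot

DataFrame.plot(x=None,y=None,kind='line')
- x: values for the horizontal axis
- y: values for the vertical axis
- kind: type of plot (line (default), bar, barh, hist, pie, scatter)

2. pandas.Series.plot

Series.plot(x=None,y=None,kind='line')
Used the same way as DataFrame.plot, except that x and y are only used when the data is a DataFrame

date[1].cumsum().plot()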
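A short sketch of the x, y and kind parameters on a DataFrame. It assumes matplotlib is installed (pandas uses it as the default plotting backend) and selects the non-interactive Agg backend so that it runs without a display; the data and column names are invented.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: no GUI available, render off-screen
import pandas as pd

df = pd.DataFrame({"day": [1, 2, 3, 4], "sales": [3, 7, 2, 9]})

# Line plot (the default kind): day on the horizontal axis, sales on the vertical axis.
ax_line = df.plot(x="day", y="sales", kind="line")

# Same data as a bar chart; only kind changes.
ax_bar = df.plot(x="day", y="sales", kind="bar")

# plot() returns a matplotlib Axes, so the usual matplotlib calls apply.
ax_line.set_title("sales per day")
```

In a notebook the figure is displayed automatically; in a script, `ax_line.figure.savefig(...)` with a path of your choice writes it to disk.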

7 Reading and writing files

Common formats are listed below.

Format Type   Data Description        Reader           Writer
text          CSV                     read_csv         to_csv
text          JSON                    read_json        to_json
text          HTML                    read_html        to_html
text          Local clipboard         read_clipboard   to_clipboard
binary        MS Excel                read_excel       to_excel
binary        HDF5 Format             read_hdf         to_hdf
binary        Feather Format          read_feather     to_feather
binary        Parquet Format          read_parquet     to_parquet
binary        Msgpack                 read_msgpack     to_msgpack
binary        Stata                   read_stata       to_stata
binary        SAS                     read_sas
binary        Python Pickle Format    read_pickle      to_pickle
SQL           SQL                     read_sql         to_sql
SQL           Google Big Query        read_gbq         to_gbq

Note: read_msgpack and to_msgpack were removed in pandas 1.0.

7.1 CSV

7.1.1 read_csv

pandas.read_csv(filepath_or_buffer,usecols)
- filepath_or_buffer: path to the file
- usecols: the columns to read, given as a list of column names
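A minimal sketch of usecols. To stay self-contained it reads from an in-memory buffer instead of a file on disk (read_csv accepts both), and the CSV content is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = "day,open,close\n1,10.0,10.5\n2,10.5,10.2\n"

# Only the listed columns are parsed; "open" is skipped entirely.
df = pd.read_csv(io.StringIO(csv_text), usecols=["day", "close"])
print(df.columns.tolist())   # ['day', 'close']
print(df.shape)              # (2, 2)
```

Restricting the columns at read time avoids loading data that would be dropped immediately afterwards.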

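7.1.2 to_csv

DataFrame.to_csv(path_or_buf=None,sep=',',columns=None,header=True,index=True,mode='w',encoding=None)
- path_or_buf: path of the file to write
- sep: field separator
- columns: which columns to write, in the given order
- mode: 'w' overwrites the file, 'a' appends to it
- index: whether to write the row index
- header: whether to write the column names
- encoding: text encoding of the output file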
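The interaction between mode, header and index is easiest to see by writing twice to the same file. The sketch below writes to a temporary directory; the file name `demo.csv` and the data are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
path = os.path.join(tempfile.mkdtemp(), "demo.csv")   # hypothetical file name

# First write: default mode 'w' creates the file; index=False leaves out the row labels.
df.to_csv(path, index=False)

# Second write: append the same rows, without repeating the header line.
df.to_csv(path, mode="a", header=False, index=False)

with open(path) as fh:
    lines = fh.read().splitlines()
print(lines)   # ['a,b', '1,3', '2,4', '1,3', '2,4']
```

Leaving index=True (the default) would add an unnamed first column holding the row labels, which then reappears as an extra column when the file is read back.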

7.2 HDF5

7.2.1 read_hdf and to_hdf

Reading and writing an HDF5 file requires a key; the value stored under that key is the DataFrame.
pandas.read_hdf(path_or_buf,key=None,**kwargs) - reads data from an h5 file
- path_or_buf: path to the file
- key: the key to read
- return: the selected object
DataFrame.to_hdf(path_or_buf,key,**kwargs)
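A round trip through an HDF5 file might look like the sketch below. pandas relies on the optional PyTables package (`tables`) for HDF5 support, so the example falls back gracefully when it is not installed; the file name `demo.h5` and the key `"demo"` are arbitrary choices for illustration.

```python
import os
import tempfile
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
path = os.path.join(tempfile.mkdtemp(), "demo.h5")   # hypothetical file name

try:
    # Store the frame under a key, then read it back with the same key.
    df.to_hdf(path, key="demo", mode="w")
    restored = pd.read_hdf(path, key="demo")
except ImportError:
    # PyTables is not available in this environment.
    restored = None

print(restored)
```

One file can hold several DataFrames, each under its own key, which is why the key has to be given when reading.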

8 Handling missing values

8.1 Handling NaN (a float)

Check whether the data contains NaN
- pd.isnull(df)
- pd.notnull(df)
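Both functions return a Boolean frame of the same shape as the input, which is usually reduced to a single answer. A minimal sketch, assuming a made-up frame with one missing value; converting to a NumPy array before calling np.any and np.all is a choice made here to keep the reduction unambiguous.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

missing = pd.isnull(df)          # True where a value is NaN
present = pd.notnull(df)         # the element-wise opposite

has_missing = np.any(missing.to_numpy())    # True: at least one NaN exists
all_present = np.all(present.to_numpy())    # False: not every value is present

per_column = missing.sum()       # number of NaN values in each column
print(per_column.tolist())       # [1, 0]
```

The per-column count is often more useful than the single Boolean, because it shows where the gaps are.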
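How to handle them:
1. The missing values are NaN, stored as np.nan
- np.any()  # returns True if at least one value is True
- np.all()  # returns False if at least one value is False
- Drop the rows that contain missing values: obj.dropna(axis='rows')
  - Note: this does not modify the original data; the return value must be captured
- Replace missing values: fillna(value,inplace=True)
  - value: the replacement value
  - inplace: True modifies the original data; False leaves the original untouched and returns a new object
2. The missing values are not NaN but carry some other marker (for example '?')
- First replace '?' with np.nan, then handle the NaN values as in case 1
  - df.replace(to_replace=,value=)
  - to_replace: the value to be replaced
  - value: the value to replace it with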
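The steps above can be sketched end to end as follows. The frame, its column names and the choice of the column mean as the fill value are assumptions for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one gap is marked with '?', another is already NaN.
raw = pd.DataFrame({"age": ["31", "?", "45"], "score": [7.5, np.nan, 9.0]})

# Case 2: turn the '?' marker into a real NaN first.
cleaned = raw.replace(to_replace="?", value=np.nan)

# Case 1, option A: drop every row that contains a NaN (returns a new frame).
dropped = cleaned.dropna(axis="rows")
print(dropped.index.tolist())      # [0, 2]

# Case 1, option B: fill the NaN in "score" with the column mean (returns a new frame).
filled = cleaned.fillna({"score": cleaned["score"].mean()})
print(filled["score"].tolist())    # [7.5, 8.25, 9.0]
```

Neither call changes `cleaned` itself; with inplace=True, fillna would modify the frame directly and return None instead.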