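This post is my own record of working through and verifying the original post (link to the original post).
To create a Series, use the following syntax:
import pandas as pd
series = pd.Series([1,2,3,4],['beijing','shanghai','xian','shenzhen'])
# general form: series = pd.Series(data, index=index)
index is an optional parameter. If it is not set, the index defaults to the positions 0, 1, 2, ... of the data.
series = pd.Series(['beijing','shanghai','xian','shenzhen'])
A Series supports the arithmetic operators (+ - * /). Operations are aligned on the index: values with the same label are combined, and any label that appears in only one of the two Series yields NaN at that position. To control how many decimal places are shown, add pd.set_option('display.precision', 3), where 3 is the number of decimal places (the short key 'precision' may be rejected as ambiguous by recent pandas versions, so the full name is safer). This setting only rounds what is printed; the stored values are unchanged.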
series = pd.Series([1,2,3,4],['beijing','shanghai','xian','shenzhen'])
series2 = pd.Series([6,7,8,9],['beijing','wuhan','xian','shenzhen'])
print("--------subtraction------------","\n",series-series2)
print("--------multiplication------------","\n",series*series2)
print("--------addition------------","\n",series+series2)
print("--------division------------","\n",series/series2)
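To check the alignment rule for myself, here is a minimal sketch using the same two Series. The `add(..., fill_value=0)` call and `round(3)` are not in the original post; I include them only as one possible way to avoid NaN and to really round the values (as opposed to only changing how they print).

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['beijing', 'shanghai', 'xian', 'shenzhen'])
s2 = pd.Series([6, 7, 8, 9], index=['beijing', 'wuhan', 'xian', 'shenzhen'])

# Operator form: a label present in only one Series produces NaN.
total = s1 + s2

# Method form with fill_value: a missing label is treated as 0 instead.
total_filled = s1.add(s2, fill_value=0)

# round() changes the stored values; display.precision only changes printing.
ratio = (s1 / s2).round(3)

print(total)
print(total_filled)
print(ratio)
```

The result index is the union of both indexes, which is why 'shanghai' and 'wuhan' both show up in `total` even though each exists in only one input.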
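A dictionary can also be passed in directly, which is quick and convenient: its keys become the index and its values become the data.
dicts = {'A':1,"B":2,"C":3}
series_dict = pd.Series(dicts)
print(series_dict.index)  # the dict keys become the index
print(series_dict['A'])  # access an element by label
print(dict(series_dict))  # convert back to a dict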
DataFrames
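A pandas DataFrame is a 2-dimensional data structure: data is stored as a table, divided into rows and columns. With a DataFrame you can process data conveniently. Common operations include selecting and replacing row or column data, reshaping the table, changing the index, and filtering on multiple conditions.
The basic syntax for building a DataFrame object is: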
df = {
'name':pd.Series(['bob',"john","tom"],['A','B','C']),
'age':pd.Series(['15','16','13','20'],['A','B','C','D']),
'sex':pd.Series(['男','女',"男"],['A','B','C'])
}
df = pd.DataFrame(df)
print(df)
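One interesting point: in a standalone Series the index must match the data one-to-one, with nothing missing.
A DataFrame built from Series is less strict: the Series are aligned by label, and any empty position is shown as NaN.
When we build a DataFrame from a dictionary of plain lists, however, the requirement becomes strict again.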
df = {
'name':['bob',"john","tom",'Frank'],
'age':['15','16','13','20'],
'sex':['男','女',"男","男"]
}
df = pd.DataFrame(df,index=['A','B','C','D'])  # index is still an optional parameter
print(df)
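If data is missing at some position, it is not filled with NaN; an error is raised instead. Keep the difference between the two cases in mind.
Getting a column is simple too, and what comes back is actually a Series: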
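The three cases described above can be put side by side. This is only a sketch with made-up, shortened data; I catch the exceptions so the script keeps running, and I store the messages instead of asserting their exact wording, since that may differ between pandas versions.

```python
import pandas as pd

# A Series needs exactly one label per value.
try:
    pd.Series([1, 2, 3], index=['A', 'B'])
    series_error = None
except ValueError as exc:
    series_error = str(exc)

# A dict of plain lists must have lists of equal length.
try:
    pd.DataFrame({'name': ['bob', 'john', 'tom'], 'age': ['15', '16']})
    frame_error = None
except ValueError as exc:
    frame_error = str(exc)

# A dict of Series is aligned by label, so gaps become NaN.
aligned = pd.DataFrame({
    'name': pd.Series(['bob', 'john'], index=['A', 'B']),
    'age': pd.Series(['15', '16', '13'], index=['A', 'B', 'C']),
})

print(series_error)
print(frame_error)
print(aligned)
```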
print(df['name'])
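Adding a column to a DataFrame: note that when the new column is a Series, it must carry the same index labels as the DataFrame. Assignment aligns on labels, so a Series with the default integer index would leave the new column entirely NaN.
You can of course also build the new column from existing columns (that is, arithmetic on two Series yields a new Series).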
df = {
'name':['bob',"john","tom",'Frank'],
'age':['15','16','13','20'],
'sex':['男','女',"男","男"]
}
df = pd.DataFrame(df,index=['A','B','C','D'])
df['high'] = pd.Series(['178','175','168','170'],['A','B','C','D'])
print(df)
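Here is a small sketch of the pitfall just described, using a shortened table with numeric values. The column names `high_bad`, `weight` and `high_per_age` are my own, purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'john'], 'age': [15, 16]}, index=['A', 'B'])

# Default index 0, 1 shares no labels with 'A', 'B', so every value is NaN.
df['high_bad'] = pd.Series([178, 175])

# Matching labels: the values land on the intended rows.
df['high'] = pd.Series([178, 175], index=['A', 'B'])

# A plain list is assigned by position; only the length has to match.
df['weight'] = [60, 65]

# A new column derived from existing columns (Series arithmetic).
df['high_per_age'] = df['high'] / df['age']

print(df)
```

So the index is only required when the new column is a Series; a list or array of the right length goes in by position.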
Deleting rows/columns from a DataFrame:
To delete a row or a column, use the .drop() function. You need to specify the direction of the deletion: axis=0 refers to rows and axis=1 refers to columns.
df = {
'name':['bob',"john","tom",'Frank'],
'age':['15','16','13','20'],
'sex':['男','女',"男","男"]
}
df = pd.DataFrame(df,index=['A','B','C','D'])
df['high'] = pd.Series(['178','175','168','170'],['A','B','C','D'])
print(df)
df = df.drop('A',axis=0)
print("Row A has been dropped")
print(df)
df = {
'name':['bob',"john","tom",'Frank'],
'age':['15','16','13','20'],
'sex':['男','女',"男","男"]
}
df = pd.DataFrame(df,index=['A','B','C','D'])
df['high'] = pd.Series(['178','175','168','170'],['A','B','C','D'])
print(df)
df = df.drop('name',axis=1)
print("The name column has been dropped")
print(df)
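Keep in mind that unless you explicitly ask for it, .drop() does not permanently delete the row/column: it returns a new DataFrame and leaves the original untouched. This mainly protects against losing data by mistake.
You can print df to confirm that the data is intact. If you are sure you want the deletion to be permanent, add the inplace=True parameter. For comparison, the example below leaves it out: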
df = {
'name':['bob',"john","tom",'Frank'],
'age':['15','16','13','20'],
'sex':['男','女',"男","男"]
}
df = pd.DataFrame(df,index=['A','B','C','D'])
df['high'] = pd.Series(['178','175','168','170'],['A','B','C','D'])
print(df)
df2 = df.drop('name',axis=1)  # with inplace=True the original data would be changed instead
print("The name column is dropped in df2; the original data is unchanged")
print(df)
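To convince myself of the difference, a minimal sketch on a two-row table. The `index=` keyword form of drop is an alternative spelling of axis=0 that I added; it is not in the original post.

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'john'], 'age': ['15', '16']}, index=['A', 'B'])

# drop() returns a new DataFrame; df still has the 'name' column.
without_name = df.drop('name', axis=1)

# The index= / columns= keywords are an alternative to axis=0 / axis=1.
without_a = df.drop(index='A')

# inplace=True modifies df itself and returns None.
result = df.drop('age', axis=1, inplace=True)

print(without_name)
print(without_a)
print(result)
print(df)
```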
Getting one or more rows from a DataFrame
To get a row, use .loc[] to refer to it by its index label, or .iloc[] to refer to it by its position (row number) in the table.
df = {
'name':['bob',"john","tom",'Frank'],
'age':['15','16','13','20'],
'sex':['男','女',"男","男"]
}
df = pd.DataFrame(df,index=['A','B','C','D'])
df['high'] = pd.Series(['178','175','168','170'],['A','B','C','D'])
print(df)
a = df.loc['A']
print("This is row A")
print(a)
a0 = df.iloc[1:3]
print("Rows accessed by position")
print(a0)
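One detail that is easy to miss when switching between the two: a label slice with .loc includes both ends, while a position slice with .iloc excludes the end, like a Python list. A short sketch on a one-column version of the table:

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'john', 'tom', 'Frank']}, index=['A', 'B', 'C', 'D'])

by_label = df.loc['B':'C']   # label slice: both ends included
by_position = df.iloc[1:3]   # position slice: end excluded
single_row = df.loc['A']     # a single row comes back as a Series

print(by_label)
print(by_position)
print(type(single_row))
```

Both slices here select rows B and C, even though the numbers and labels look different.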
You can also give .loc[] both a row and a column specification to produce a sub-table, just as you would in NumPy. For example, to extract the value at row 'A', column 'age':
df = {
'name':['bob',"john","tom",'Frank'],
'age':['15','16','13','20'],
'sex':['男','女',"男","男"]
}
df = pd.DataFrame(df,index=['A','B','C','D'])
df['high'] = pd.Series(['178','175','168','170'],['A','B','C','D'])
print(df)
a_age = df.loc['A','age']
print("This is the age value of row A")
print(a_age)
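Besides selecting columns directly, square brackets [ ] also accept a condition, which filters for the rows that satisfy it. For example, to select the rows of the table below where high (height) > 175: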
df = {
'name':['bob',"john","tom",'Frank'],
'age':['15','16','13','20'],
'sex':['男','女',"男","男"]
}
df = pd.DataFrame(df,index=['A','B','C','D'])
df['high'] = pd.Series(['178','175','168','170'],['A','B','C','D'])
df['high'] = pd.to_numeric(df['high'])  # convert high to a numeric type first
need_df = df[df['high']>175]
print(need_df)
need_df = df[df['high']>175][['name','age']]  # name and age of everyone taller than 175
print(need_df)
# Break the expression above into steps
need_a = df['high']>175
print("Step 1: the boolean mask","\n",need_a)
need_b = df[need_a]
print("Step 2: the rows that pass the mask","\n",need_b)
need_c = need_b[['name','age']]
print("Step 3: only the name and age columns","\n",need_c)
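The filter-then-select steps can also be written as a single .loc call, with the row mask first and the column list second. This is just an alternative spelling I tried, not the original post's approach; it gives the same result in one indexing operation.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['bob', 'john', 'tom', 'Frank'],
     'age': ['15', '16', '13', '20'],
     'high': ['178', '175', '168', '170']},
    index=['A', 'B', 'C', 'D'],
)
df['high'] = pd.to_numeric(df['high'])  # strings cannot be compared with 175

mask = df['high'] > 175                 # boolean Series, one entry per row
tall = df.loc[mask, ['name', 'age']]    # rows where mask is True, two columns

print(mask)
print(tall)
```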
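You can chain several conditions with the logical operators & (and) and | (or) to apply multiple filters to the DataFrame at once. Each condition must be wrapped in parentheses. For example, the code below first selects the rows that satisfy 'high' > 175 or 'age' >= 15, and then the rows that satisfy both: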
df = {
'name':['bob',"john","tom",'Frank'],
'age':['15','16','13','20'],
'sex':['男','女',"男","男"]
}
df = pd.DataFrame(df,index=['A','B','C','D'])
df['high'] = pd.Series(['178','175','168','170'],['A','B','C','D'])
df['high'] = pd.to_numeric(df['high'])  # convert high to a numeric type first
df['age'] = pd.to_numeric(df['age'])  # convert age to a numeric type as well
need = df[(df['high']>175) | (df['age']>=15)]
print(need)
need = df[(df['high']>175) & (df['age']>=15)]
print(need)
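A quick sketch of why the parentheses and the & / | operators matter. The `~` (not) operator and the deliberately failing `and` are my additions; the data is the same table with the numbers already stored as integers.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['bob', 'john', 'tom', 'Frank'],
     'age': [15, 16, 13, 20],
     'high': [178, 175, 168, 170]},
    index=['A', 'B', 'C', 'D'],
)

either = df[(df['high'] > 175) | (df['age'] >= 15)]       # at least one condition
both = df[(df['high'] > 175) & (df['age'] >= 15)]         # both conditions
neither = df[~((df['high'] > 175) | (df['age'] >= 15))]   # ~ negates a mask

# Python's `and` asks for a single True/False, which a Series cannot give.
try:
    df[(df['high'] > 175) and (df['age'] >= 15)]
    and_error = None
except ValueError as exc:
    and_error = type(exc).__name__

print(either)
print(both)
print(neither)
print(and_error)
```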
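Resetting the index of a DataFrame
If you think the current index of a DataFrame is a problem, you can use .reset_index() to reset the index of the whole table. This method moves the DataFrame's index into a column named index and replaces the table's index with the default integers starting from zero, that is [0, ..., len(data) - 1]. For example: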
import pandas as pd
import numpy as np
df = {
'name':['bob',"john","tom",'Frank'],
'age':['15','16','13','20'],
'sex':['男','女',"男","男"]
}
df = pd.DataFrame(df,index=['A','B','C','D'])
df['high'] = pd.Series(['178','175','168','170'],['A','B','C','D'])
new_df=df.reset_index(inplace=True)
print(df)
print(new_df)
As you have probably guessed, new_df prints None because reset_index(inplace=True) modifies df itself and returns None.
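For comparison, here is a sketch of the three ways of calling it that I tried. `drop=True` is an extra option not used in the original post; it throws the old labels away instead of keeping them as a column.

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'john']}, index=['A', 'B'])

new_df = df.reset_index()               # new frame; old labels move to column 'index'
dropped = df.reset_index(drop=True)     # new frame; old labels are discarded
nothing = df.reset_index(inplace=True)  # modifies df, returns None

print(new_df)
print(dropped)
print(nothing)
print(df)
```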
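# Use map to build an id column and set it as the index. Unlike reset_index above, set_index replaces the existing index outright, so take note.
df.reset_index(inplace=True)  # in place; returns None, so there is nothing to assign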
print(df)
df['id'] = list(map(lambda x:str(x)*2,list(range(4))))
df = df.set_index('id')
print(df)
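I write things down as I learn them.
Here we go again: I fell into a big pit. Sigh. You really can't let what you've learned slip, so keep at it.
Straight to the code: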
#
import pandas as pd
import numpy as np
# create a DataFrame
dates = pd.date_range('20190301',periods=8)
df = pd.DataFrame(np.random.randn(8,5),index=dates,columns=list('ABCDE'))
print(df)
print('-'*90)
print(df.head(2))
print(df.tail(2))
print('-'*90)
print(df.index)
print('-'*90)
print(df.values)  # the underlying 2-D array, without index or column labels
print('-'*90)
print(df.T)  # transpose
print('-'*90)
print(df.sort_values('C'))  # sort rows by the values in column C
print('-'*90)
print(df.sort_index(axis=1,ascending=False))  # sort the column labels in descending order
print('-'*90)
print(df.describe())  # count, mean, std, min, quartiles (incl. median) and max of each column
print('-'*90)
print('-'*90)
print(type(df['A']))
print('-'*90)
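print(df[:3])  # slice the first 3 rows by position
print(df['20190303':'20190306'])  # slice rows by index label (both ends included)
print('-'*90)
print(df.loc[dates[0]])  # select one row by its index label
print('-'*90)
print(df.loc['20190304':'20190307',['B','D']])  # row labels first, then column labels
print(df.at[dates[0],'C'])  # a single value by row label and column label
print('-'*90)
print(df.iloc[1:3,0:7])  # row positions first, then column positions
print('-'*90)
print(df.iloc[1:2])  # only row positions given: all columns are returned
print('-'*90)
print(df.iat[1,4])  # equivalent to df.iloc[1,4]
print('-'*90)
print(df.iloc[1,4])
print('-'*90)
print(df[(df.B>0) & (df.A<0)])  # rows where B > 0 and A < 0
print(df[df>0])  # keeps the values that satisfy the condition; everything else shows as NaN
print('-'*90)
print(df[df['E'].isin([0,2])])  # keep rows whose E value is in the given list. Remember this one, it cost me dearly in an interview: df_chan['cmp_新项目编码'] = df_chan['新项目编码'].isin(df_po['项目编码'])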
print('-'*90)
print('-'*90)
print('-'*90)
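s1 = pd.Series(list(range(10,18)),index=pd.date_range("20190304",periods=8))  # labels of s1 that are not in df.index are dropped; rows of df with no match in s1 get NaN
df['F'] = s1  # add a column to the DataFrame
print(df)
print('-'*90)
df.at[dates[0],'A'] = 0  # set column A of the first row (label dates[0]) to 0
print(df)
print('-'*90)
df.iat[1,1] = 1  # positions are zero-based and do not count the index itself
print(len(df))
df.loc[:,"D"] = np.array([4]*len(df))  # overwrite the whole column D from an array
print(df)
print('-'*90)
df2 = df.copy()  # copy the DataFrame
print(df2)
df2[df2>0] = -df2  # turn every positive value in df2 negative
print(df2)
print('-'*90)
df1 = df.reindex(index=dates[:4],columns=list('ABCD')+['G'])
df1.loc[dates[0]:dates[1],'G'] = 1
print(df1)
print(df1.dropna())  # drop rows that contain any NaN
print(df1.fillna(value=2))  # fill NaN values with 2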
print('-'*90)
#step 3
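print(df.mean(axis=1))  # mean of each row (axis=1); the default axis=0 gives the mean of each column
print('-'*90)
print(df.var())  # variance of each column
print('-'*90)
s = pd.Series([1,2,2,np.nan,5,7,9,10],index=dates)
print(s)
print('-'*90)
print(s.shift(5))  # shift all values down by 5 positions, filling the top with NaN
print('-'*90)
print(s.diff())  # each value minus the previous one
print('-'*90)
print(s.value_counts())  # how many times each value occurs; handy for drawing a histogram
print('-'*90)
print(df.apply(np.cumsum,axis=1))  # cumulative sum across the columns of each row; axis controls the direction
print('-'*90)
print(df.apply(lambda x:x.max()-x.min()))  # custom function, applied column by column by default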
print('-'*90)
#step4
pieces = [df[:3],df[-3:]]
print(pieces)
print('-'*90)
print(pd.concat(pieces))  # concatenate the pieces row-wise
print('-'*90)
print(df[:3])
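Because the script above runs on random numbers, it is hard to check the results by eye. Here is a small sketch of a few of the same calls (shift, diff, value_counts, mean with axis=1) on fixed values of my own choosing, so each result can be verified by hand.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190301', periods=5)
s = pd.Series([1, 2, 2, np.nan, 5], index=dates)

shifted = s.shift(2)        # values move down 2 rows; the first 2 become NaN
diffs = s.diff()            # each value minus the one before it
counts = s.value_counts()   # NaN is not counted by default

frame = pd.DataFrame({'A': [1, 2], 'B': [3, 6]})
row_means = frame.mean(axis=1)  # one mean per row
col_means = frame.mean()        # default axis=0: one mean per column

print(shifted)
print(diffs)
print(counts)
print(row_means)
print(col_means)
```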
Next up: aggregation and pivot tables. I still haven't really got my head around pivot tables...
import pandas as pd
import numpy as np
import datetime
left = pd.DataFrame({'key':['x','y'],'value':[1,2]})
right = pd.DataFrame({'key':['x','z'],'value':[3,4]})
print('LEFT',left)
print('RIGHT',right)
print(pd.merge(left,right,on='key',how='outer'))  # how='inner' keeps the intersection of keys, how='outer' keeps the union
print('-'*90)
df3 = pd.DataFrame({'A':['a','b','c','b'],'B':list(range(4))})
print(df3.groupby('A').sum())  # group by A and aggregate with sum
print('-'*90)
#reshape
df4 = pd.DataFrame({'A':['one','one','two','three']*6,
'B':['a','b','c']*8,
'C':['foo','foo','foo','bar','bar','bar']*4,
'D':np.random.randn(24),
'E':np.random.randn(24),
'F':[datetime.datetime(2019,i,1) for i in range(1,13)]+
[datetime.datetime(2019,i,15) for i in range(1,13)],
})
print(pd.pivot_table(df4,values='D',index=['A','B'],columns=['C']))
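To get a better feel for merge and pivot_table, I tried them on data small enough to check by hand. The `sales` table and its column names (`city`, `kind`, `amount`) are invented for this sketch. As far as I can tell, a pivot table with the default settings is a groupby on the index and column keys, followed by a mean, reshaped so that one key runs down the rows and the other across the columns.

```python
import pandas as pd

left = pd.DataFrame({'key': ['x', 'y'], 'value': [1, 2]})
right = pd.DataFrame({'key': ['x', 'z'], 'value': [3, 4]})

inner = pd.merge(left, right, on='key', how='inner')  # keys in both frames: x
outer = pd.merge(left, right, on='key', how='outer')  # keys in either frame: x, y, z

sales = pd.DataFrame({
    'city': ['bj', 'bj', 'sh', 'sh', 'bj'],
    'kind': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'amount': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# One row per city, one column per kind; each cell is the mean of the matching amounts.
table = pd.pivot_table(sales, values='amount', index='city', columns='kind')

# The same numbers via groupby + unstack.
same = sales.groupby(['city', 'kind'])['amount'].mean().unstack()

print(inner)
print(outer)
print(table)
print(same)
```

The cell for ('bj', 'foo') is 3.0 because two rows match (1.0 and 5.0) and the default aggregation is the mean. Since both frames have a value column, merge renames them to value_x and value_y.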
Plotting
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
t_exam = pd.date_range('20190301',periods=10,freq='S')  # freq sets the time unit; 'S' means seconds (recent pandas versions prefer lowercase 's')
print(t_exam)
ts = pd.Series(np.random.randn(1000), index=pd.date_range('20190301',periods=1000))
ts = ts.cumsum()
ts.plot()
plt.show()