pandas
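pandas has two main data structures: Series and DataFrame.
- Series: a one-dimensional, array-like object made up of a set of values (of any NumPy data type) and an associated set of data labels (the index). A simple Series can also be created from the values alone, in which case a default integer index is generated. Note: index values in a Series may be duplicated.
- DataFrame: a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that share the same index.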
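The two points above can be sketched in a few lines. This is only an illustration: the names `scores`, `chinese`, `maths`, and `table` and all the values are made up. It shows a Series carrying a duplicated index label, and a DataFrame built from a dict of Series.

```python
import pandas as pd

# A Series whose index contains the label "Jim" twice
scores = pd.Series([84, 68, 96], index=["Jim", "HanMei", "Jim"])
jim_scores = scores["Jim"]      # a duplicated label returns a Series, not a scalar
print(jim_scores.tolist())

# A DataFrame built from a dict of Series: each key becomes a column
chinese = pd.Series([84, 68], index=["Jim", "HanMei"])
maths = pd.Series([90.5, 77.0], index=["Jim", "HanMei"])
table = pd.DataFrame({"chinese": chinese, "math": maths})
print(table.dtypes)             # each column keeps its own dtype
print(type(table["chinese"]))   # a single column is itself a Series
```

Selecting a label that occurs more than once returns every matching row, which is worth keeping in mind before relying on an index as a unique key.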
Series
Creating a Series from a one-dimensional array
code:
from pandas import Series,DataFrame
import pandas as pd
import numpy as np
a1 = np.array(["Python","C++","Java","PHP"])
ser1 = Series(a1)
print(ser1) # the output includes the default integer index
print(ser1.dtype)
print(ser1.index)
print(ser1.values)
out:
0 Python
1 C++
2 Java
3 PHP
dtype: object
object
RangeIndex(start=0, stop=4, step=1)
['Python' 'C++' 'Java' 'PHP']
code:
ser1.index = ["one","two","three","four"]
print(ser1)
out:
one Python
two C++
three Java
four PHP
dtype: object
code:
ser2 = Series(data = [78,90,65,92],dtype = np.float64,index = ["Jim","HanMei","LiLei","Havorld"])
print(ser2)
out:
Jim 78.0
HanMei 90.0
LiLei 65.0
Havorld 92.0
dtype: float64
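Creating a Series from a dict
code:
dict1= {"Jim":84,"HanMei":68,"Havorld":96}
ser2 = Series(dict1)
print(ser2) # the dict keys become the index of the Series and the dict values become its values
# (pandas 0.23+ keeps the insertion order of the dict; older versions sorted the keys)
out:
Jim 84
HanMei 68
Havorld 96
dtype: int64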
Accessing values in a Series
ser3 = Series(data = [78,90,65,92],dtype = np.float64,index = ["Jim","HanMei","LiLei","Havorld"])
print(ser3)
out:
Jim 78.0
HanMei 90.0
LiLei 65.0
Havorld 92.0
dtype: float64
print(ser3.iloc[1]) # select by integer position
print(ser3["Havorld"]) # select by index label
print(ser3.iloc[-2]) # negative positions count from the end
out:
90.0
92.0
65.0
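print(ser3.iloc[1:]) # positional slice: the end position is excluded
out:
HanMei 90.0
LiLei 65.0
Havorld 92.0
dtype: float64
print(ser3["HanMei":"LiLei"]) # label slice: both endpoints are included
out:
HanMei 90.0
LiLei 65.0
dtype: float64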
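Operations on a Series
- NumPy array operations carry over to Series and can be used directly; when an array operation is applied to a Series, the mapping between index labels and values is preserved.
- For the most part, a Series can be treated like a NumPy ndarray: the vast majority of ndarray operations can also be applied to a Series.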
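A minimal sketch of these two points, reusing the score values from the earlier examples (the variable names `scaled`, `high`, and `roots` are chosen here for illustration): element-wise arithmetic, boolean filtering, and a NumPy ufunc all return a Series that keeps the original labels.

```python
import numpy as np
import pandas as pd

ser = pd.Series([78.0, 90.0, 65.0, 92.0],
                index=["Jim", "HanMei", "LiLei", "Havorld"])

scaled = ser * 1.1      # element-wise arithmetic
high = ser[ser > 80]    # boolean filtering keeps only the matching labels
roots = np.sqrt(ser)    # a NumPy ufunc applied directly to a Series

print(scaled)
print(high)
print(roots)
```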
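Detecting missing values in a Series
ser4 = Series({"Jim":84,"HanMei":68,"Havorld":96})
print(ser4)
out:
Jim 84
HanMei 68
Havorld 96
dtype: int64
new_index=["Jim","Lucy","HanMei","Havorld"] # use a list: a set has no defined order
ser4 = Series(ser4,index=new_index) # labels missing from the original ("Lucy") get NaN
print(ser4)
out: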
Jim 84.0
Lucy NaN
HanMei 68.0
Havorld 96.0
dtype: float64
ser5 = pd.isnull(ser4) # True where the value is missing
print(ser5)
out:
Jim False
Lucy True
HanMei False
Havorld False
dtype: bool
ser6 = pd.notnull(ser4) # True where the value is not missing
print(ser6)
out:
Jim True
Lucy False
HanMei True
Havorld True
dtype: bool
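Operations between Series
When several Series objects are combined in an arithmetic operation, values are aligned by index label: elements that share a label are operated on together, and any label present in only one of the Series gets NaN in the result.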
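The alignment rule can be sketched as follows. The two Series and their values are invented for illustration; `add` with `fill_value` is shown as one way to avoid the NaN results when that is not wanted.

```python
import pandas as pd

chinese = pd.Series({"Jim": 84, "HanMei": 68, "Havorld": 96})
maths = pd.Series({"Jim": 90, "HanMei": 75, "Lucy": 88})

# Values are matched by label, not by position
total = chinese + maths
print(total)   # "Havorld" and "Lucy" each appear in only one Series, so they are NaN

# add() with fill_value treats the missing side as 0 instead of producing NaN
total_filled = chinese.add(maths, fill_value=0)
print(total_filled)
```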
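The name attribute of a Series and of its index
ser7 = Series({"Jim":84,"HanMei":68,"Havorld":96})
ser7.index.name = "成绩单" # name of the index ("report card")
ser7.name = "语文成绩" # name of the Series ("Chinese score")
print(ser7)
out:
成绩单
Jim 84
HanMei 68
Havorld 96
Name: 语文成绩, dtype: int64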
DataFrame
Creating a DataFrame from a two-dimensional array
arr = np.array([
["China","USA","English"],
[16,12,100]
])
df1 = DataFrame(arr)
print(df1)
out:
0 1 2 (column index: columns)
0 China USA English (data: values)
1 16 12 100 (data: values)
(the leftmost column is the row index: index)
Creating a DataFrame with explicit column and row labels
df2 = DataFrame(arr,columns = ["one","two","three"],index = ["一","二"])
print(df2)
out:
one two three
一 China USA English
二 16 12 100
print(df2.columns)
print(df2.index)
print(df2.values)
out:
Index(['one', 'two', 'three'], dtype='object')
Index(['一', '二'], dtype='object')
[['China' 'USA' 'English']
['16' '12' '100']]
Creating a DataFrame from a dict
dict2= {"day":[1,24,12,25],"month":[5,7,3,12],"year":[1990,2001,1997,2018]}
df3 = DataFrame(dict2)
print(df3)
out:
day month year
0 1 5 1990
1 24 7 2001
2 12 3 1997
3 25 12 2018
# Replace the default index
df3.index = ["one","two","three","four"]
print(df3)
out:
day month year
one 1 5 1990
two 24 7 2001
three 12 3 1997
four 25 12 2018
Accessing data in a DataFrame
dict2= {"day":[1,24,12,25],"month":[5,7,3,12],"year":[1990,2001,1997,2018]}
df3 = DataFrame(dict2)
df3.index = ["one","two","three","four"]
print(df3)
out:
day month year
one 1 5 1990
two 24 7 2001
three 12 3 1997
four 25 12 2018
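print(df3["year"]) # select a column by its column label
print(df3.loc["two"]) # select a row by its index label (the old .ix indexer was removed in pandas 1.0)
out: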
one 1990
two 2001
three 1997
four 2018
Name: year, dtype: int64
day 24
month 7
year 2001
Name: two, dtype: int64
df3["century"] = 21 # add a column
df3.loc["five"] = np.nan # add a row; the NaN values upcast every column to float
print(df3)
out:
day month year century
one 1.0 5.0 1990.0 21.0
two 24.0 7.0 2001.0 21.0
three 12.0 3.0 1997.0 21.0
four 25.0 12.0 2018.0 21.0
five NaN NaN NaN NaN
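Basic pandas functionality
- Reading data files / text data
- Indexing, selection, and filtering
- Arithmetic operations and data alignment
- Function application and mapping
- Reindexing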
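Function application, mapping, and reindexing are not demonstrated elsewhere in these notes, so here is a small sketch. It assumes the last item in the list refers to `reindex`; the frame `df` and the month-name mapping are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"day": [1, 24, 12], "month": [5, 7, 3]},
                  index=["one", "two", "three"])

# apply: run a function on each column (axis=0 is the default)
col_range = df.apply(lambda col: col.max() - col.min())

# map: transform every element of a Series, here through a lookup dict
month_names = df["month"].map({3: "Mar", 5: "May", 7: "Jul"})

# reindex: conform to a new index; labels that did not exist get NaN rows
reindexed = df.reindex(["three", "one", "four"])

print(col_range)
print(month_names)
print(reindexed)
```

Note that `reindex` returns a new object and leaves `df` unchanged.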
Reading local data with pandas
read1 = pd.read_csv("E:/Users/Havorld/Desktop/data.csv")
print(read1)
out:
name age source
0 gerry 18 98.5
1 tom 21 78.2
2 lili 24 98.5
3 john 20 89.2
# Read text data, using ";" as the field separator and treating the file as having no header row
read2 = pd.read_csv("data.txt",sep=";",header = None)
print(read2)
out:
0 1 2 3 4
0 gerry 18 98.5 89.5 88.5
1 tom 21 98.5 85.5 80.0
2 lili 20 85.6 86.2 NaN
3 john 18 70.0 85.0 60.0
4 joe 19 80.0 85.0 82.0
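Filtering and selecting data with pandas
# Set the column names with a list (a set has no defined order); 语文 = Chinese, 数学 = math, 英语 = English
read2.columns = ["name","age","语文","数学","英语"]
print(read2)
out:
name age 语文 数学 英语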
0 gerry 18 98.5 89.5 88.5
1 tom 21 98.5 85.5 80.0
2 lili 20 85.6 86.2 NaN
3 john 18 70.0 85.0 60.0
4 joe 19 80.0 85.0 82.0
read3 = read2[read2.columns[2:]] # keep only the columns from position 2 onward (the three score columns)
print(read3)
out:
语文 数学 英语
0 98.5 89.5 88.5
1 98.5 85.5 80.0
2 85.6 86.2 NaN
3 70.0 85.0 60.0
4 80.0 85.0 82.0
read4 = read3.dropna() # drop the rows that contain NaN
print(read4)
out:
语文 数学 英语
0 98.5 89.5 88.5
1 98.5 85.5 80.0
3 70.0 85.0 60.0
4 80.0 85.0 82.0
Selecting data: loc, iloc, ix
import numpy as np
import pandas as pd
# Generate sample data
df = pd.DataFrame(np.arange(0,60,2).reshape(10,3),columns=list('abc'))
print(df)
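# loc selects data by row label (row index) and column name
# value at row 0, column 'b'
print(df.loc[0, 'b'])
# rows 0 through 3 (inclusive), columns 'a' and 'b'
print(df.loc[0:3, ['a', 'b']])
# rows 1 and 5, columns 'b' and 'c'
print(df.loc[[1, 5], ['b', 'c']])
# iloc selects data by integer row position and integer column position (slice ends are excluded)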
print(df.iloc[0,1])
print(df.iloc[0:4, [0,1]])
print(df.iloc[[1, 5], 1:3])
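# ix used to accept both labels and integer positions, but it was deprecated in pandas 0.20 and removed in 1.0.
# Use loc for label-based selection and iloc for position-based selection instead:
print(df.loc[0,"b"])
print(df.iloc[0,1])
print(df.loc[0:3,["a","b"]])
print(df.iloc[0:4,[0,1]])
print(df.loc[[1,5],["b","c"]])
print(df.iloc[[1,5],[1,2]])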
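Handling missing values (NaN) in pandas
- dropna: filters (drops) axis labels according to whether their values contain missing data; the thresh parameter controls how much missing data is tolerated
- fillna: fills missing data with a specified value or by propagating neighboring values, e.g. ffill or bfill
- isnull: returns an object of boolean values indicating which values are missing (NA)
- notnull: the negation of isnull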
df5=DataFrame([
['Tom',np.nan,456.67,'M'],
['Merry',34,456.67,np.nan],
['Gerry',np.nan,np.nan,np.nan],
['John',23,np.nan,'M'],
['Joe',18,2300,'F']
],columns=['name','age','salary','Gender'])
print(df5)
out:
name age salary Gender
0 Tom NaN 456.67 M
1 Merry 34.0 456.67 NaN
2 Gerry NaN NaN NaN
3 John 23.0 NaN M
4 Joe 18.0 2300.00 F
df5.dropna() # drop every row that contains a NaN
df5.dropna(axis=1) # drop every column that contains a NaN (axis=0 means rows)
df5.dropna(how='all') # drop only the rows in which all values are NaN
df6 = DataFrame(np.random.randn(7,3))
print(df6)
out:
0 1 2
0 0.280872 -1.890914 -0.237311
1 0.721152 -0.300591 0.285356
2 -1.748477 0.991288 -0.349774
3 -1.678800 -0.608380 -0.002143
4 -1.273338 0.946480 -1.179870
5 -0.533472 0.669000 0.667644
6 1.339726 0.119211 -1.016756
df6.loc[:4,2] = np.nan # set column 2 of rows 0 through 4 to NaN
print(df6)
out:
0 1 2
0 0.280872 -1.890914 NaN
1 0.721152 -0.300591 NaN
2 -1.748477 0.991288 NaN
3 -1.678800 -0.608380 NaN
4 -1.273338 0.946480 NaN
5 -0.533472 0.669000 0.667644
6 1.339726 0.119211 -1.016756
df7 = df6.fillna(0) # replace every NaN with 0
print(df7)
out:
0 1 2
0 0.280872 -1.890914 0.000000
1 0.721152 -0.300591 0.000000
2 -1.748477 0.991288 0.000000
3 -1.678800 -0.608380 0.000000
4 -1.273338 0.946480 0.000000
5 -0.533472 0.669000 0.667644
6 1.339726 0.119211 -1.016756
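The example above fills with a constant only. As a complementary sketch of the threshold and ffill/bfill options named in the list, the following uses a small made-up frame; `DataFrame.ffill()` and `DataFrame.bfill()` are used here as the direct methods for forward and backward filling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                   "b": [np.nan, np.nan, 6.0, 7.0],
                   "c": [np.nan, np.nan, 9.0, 10.0]})

kept = df.dropna(thresh=2)   # keep only rows with at least 2 non-NaN values
forward = df.ffill()         # carry the last valid value downward
backward = df.bfill()        # pull the next valid value upward

print(kept)
print(forward)
print(backward)
```

Forward filling cannot fill a NaN at the top of a column, and backward filling cannot fill one at the bottom, because there is no neighboring value to copy from.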
Common statistical methods in pandas
df8 = read3
df8 = df8.dropna()
print(df8)
out:
语文 数学 英语
0 98.5 89.5 88.5
1 98.5 85.5 80.0
3 70.0 85.0 60.0
4 80.0 85.0 82.0
# Compute summary statistics for a Series or for each column of a DataFrame
print(df8.describe())
out:
语文 数学 英语
count 4.000000 4.000000 4.000000
mean 86.750000 86.250000 77.625000
std 14.168627 2.179449 12.297527
min 70.000000 85.000000 60.000000
25% 77.500000 85.000000 75.000000
50% 89.250000 85.250000 81.000000
75% 98.500000 86.500000 83.625000
max 98.500000 89.500000 88.500000
print(df8.count()) # number of non-NaN values in each column
print(df8.count(axis = 1)) # number of non-NaN values in each row
out:
语文 4
数学 4
英语 4
dtype: int64
0 3
1 3
3 3
4 3
dtype: int64
Correlation and covariance
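The notes give no example for this heading, so the following is only a sketch on made-up data: `corr` and `cov` can be called on a pair of Series or on a whole DataFrame, in which case they return a matrix over all column pairs.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.0, 4.0, 6.0, 8.0],   # y = 2 * x
                   "z": [4.0, 3.0, 2.0, 1.0]})  # z decreases as x increases

corr_xy = df["x"].corr(df["y"])   # Pearson correlation between two Series
cov_xy = df["x"].cov(df["y"])     # sample covariance between two Series
corr_matrix = df.corr()           # pairwise correlations of all columns
cov_matrix = df.cov()             # pairwise covariances of all columns

print(corr_xy, cov_xy)
print(corr_matrix)
print(cov_matrix)
```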
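Unique values, value counts, and membership
- unique: returns the distinct values of an array
- value_counts: counts how many times each value occurs in a Series
- isin: tests, element by element, whether each value of a Series or DataFrame is contained in a given collection of values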
s = Series(["a","b","b","d","c"])
print(s.value_counts())
print(s.isin(["a","b"]))
print(s.unique())
out:
b 2
d 1
c 1
a 1
dtype: int64
0 True
1 True
2 True
3 False
4 False
dtype: bool
['a' 'b' 'd' 'c']
Hierarchical indexing
data = Series([768,325,914,666],index=[
["2015","2015","2015","2016"],
["apple","banana","orange","apple"]
])
print(data)
2015 apple 768
banana 325
orange 914
2016 apple 666
dtype: int64
code:
df9 = DataFrame({
"year":[2001,2001,2002,2002,2003],
"fruit":["apple","banana","apple","banana","apple"],
"production":[121,122,123,124,125],
"profits":[22.1,22.2,22.3,22.4,22.5]
})
print(df9)
year fruit production profits
0 2001 apple 121 22.1
1 2001 banana 122 22.2
2 2002 apple 123 22.3
3 2002 banana 124 22.4
4 2003 apple 125 22.5
df9 = df9.set_index(["year","fruit"]) # use year and fruit together as a two-level index (convenient for looking up the fruit figures of a given year)
print(df9)
production profits
year fruit
2001 apple 121 22.1
banana 122 22.2
2002 apple 123 22.3
banana 124 22.4
2003 apple 125 22.5
print(df9.loc[(2002,"apple"), :]) # the apple figures for 2002
print(df9.loc[2002]) # all fruit figures for 2002
production 123.0
profits 22.3
Name: (2002, apple), dtype: float64
production profits
fruit
apple 123 22.3
banana 124 22.4
df9 = df9.groupby(level="year").sum() # sum production and profits per year (sum(level=...) was removed in pandas 2.0)
print(df9)
production profits
year
2001 243 44.3
2002 247 44.7
2003 125 22.5