Python (11): pandas

Latest recommended article published on 2022-11-26 20:05:08

chengxuw

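This article looks at the three core data structures of the pandas library, Series, DataFrame and Panel, and explains how to create, manipulate, index and merge them, with code examples to help readers apply pandas to data processing.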

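pandas has three data structures:

Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both resemble Python's built-in list; the difference is that list elements may have different types, whereas an array or a Series has a single dtype for all its elements (mixed values fall back to the generic object dtype), which allows more efficient memory use and faster computation.
DataFrame: a two-dimensional tabular data structure. A DataFrame is made up of several Series; note in particular that a DataFrame is stored column by column.
Panel: a three-dimensional array that can be thought of as a container of DataFrames. Panel was deprecated in pandas 0.20 and has been removed from later releases.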
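
As a minimal sketch of the first two structures (the variable names are illustrative and not part of the original article; Panel is left out because current pandas releases no longer provide it):

```python
import pandas as pd

# A Series has one dtype: mixing ints and a float upcasts everything to float64
s = pd.Series([1, 2, 3.5])
print(s.dtype)

# A DataFrame is a set of columns, and each column is itself a Series
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
col = df["x"]
print(type(col).__name__)
print(df.dtypes)  # each column keeps its own dtype
```

Selecting a single column returns a Series, which is why the column-oriented storage matters in practice.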

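Series(data, index=index): data can be an integer, a string, a dict, an ndarray, or a scalar constant. index holds the labels. If data is an ndarray, index must have the same length as data; if index is omitted, the labels default to 0, ..., len(data) - 1. If a dict is passed, its keys become the index and its values become the elements of the Series.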

# Create a Series from an ndarray (np.random.randn(n) returns n samples drawn from the standard normal distribution)
import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])
print(s1)
s2 =  pd.Series(np.random.randn(5))
print(s2)  

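# Create a Series from a dict
scores = {'jack': 89, 'rose': 95, 'john': 61, 'tom': 74}
names = ['rose', 'john', 'tracy', 'jack']
s3 = pd.Series(scores, index=names)
print(s3)
# The index decides the result: keys missing from the index (tom) are dropped, and index labels missing from scores (tracy) get NaN, which is a float
# Output: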
rose      95.0
john      61.0
tracy     NaN
jack      89.0
dtype: float64

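# Create a Series from a scalar: the value is repeated once per index label (with no index given, the result has a single element)
s4 = pd.Series(5)
print(s4)
# Output: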
0    5
dtype: int64
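
The repetition only becomes visible once an index is supplied. A small sketch, with an index chosen here purely for illustration:

```python
import pandas as pd

# The scalar 5 is repeated once for every label in the index
s_const = pd.Series(5, index=["a", "b", "c"])
print(s_const)
```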

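Since a Series is a one-dimensional array-like structure, it supports array-style operations: elements can be accessed by position and by slicing. Because its data can also come from a dict, it supports dict-style access by label as well.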

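# Assumed sample data: the original snippet does not define res at this point
res = pd.Series(['a', 'b', 'c', 'd'])

# View the index
print(res.index)

# View the values
print(res.values)

# Get a value, counting from zero
print(res[0])  # a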

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])
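# Access by position (newer pandas versions prefer s.iloc[0] for positional access)
print('positional access s[0] = %s' % s[0])
# Dict-style access by label
print('label access s[b] = %s' % s['b'])
# Slicing
print('slice s[2:]:\n%s' % s[2:])
# Membership tests check the index labels
print('a' in s)
print('k' in s)
# Output:

positional access s[0] = -0.799676067487
label access s[b] = -1.58170351838
slice s[2:]:
c   -1.240885
d    0.623757
e   -0.234417
dtype: float64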
True
False
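
Plain s[0] on a Series with string labels relies on a positional fallback that newer pandas versions discourage. A hedged sketch of the explicit accessors, using fixed sample values (an assumption made here so the results are reproducible):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

first = s.iloc[0]       # position-based: the first element
by_label = s.loc["b"]   # label-based: the element labelled 'b'
tail = s.iloc[2:]       # positional slice, end excluded
span = s.loc["b":"d"]   # label slice, both ends included

print(first, by_label)
print(tail)
print(span)
```

Using .iloc for positions and .loc for labels keeps the intent unambiguous even when the index itself contains integers.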

Modifying values: a Series can be modified in place much like a list.

# Assign directly by label
res=pd.Series(['a','b','c','d'],index=['a_index','b_index','c_index','d_index'])
res['a_index']='new_a'
print(res)
# Output:
a_index    new_a
b_index        b
c_index        c
d_index        d

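# Copy a slice, then modify the copy (sr1 itself is left unchanged)
sr1=pd.Series([12,13,14],index=['c','a','d'])
sr3=sr1[1:].copy()
print(sr3)
sr3[0]=1888  # positional assignment; newer pandas versions prefer sr3.iloc[0] = 1888
print(sr3)
# Output: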
a    13
d    14
dtype: int64

a    1888
d      14
dtype: int64

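Series arithmetic: arithmetic operations align on the index. If two Series have different indexes, the result is indexed by the union of the two, and labels that are not in both are filled with NaN. [A value is computed only where the label exists in both Series; a label present in a but missing from b shows up as NaN in the result.]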

sr1=pd.Series([12,13,14],index=['c','a','d'])
sr2=pd.Series([14,15,16],index=['d','c','a'])
print(sr1+sr2)
# Output:
a    29
c    27
d    28
dtype: int64

s1 = pd.Series(np.random.randn(2), index=['a','b'])
s2 = pd.Series(np.random.randn(2), index=['e','f'])
print(s1 + s2)
# Output:
a   NaN
b   NaN
e   NaN
f   NaN
dtype: float64

s3 = pd.Series(np.random.randn(3), index=['a','b','c'])
s4 = pd.Series(np.random.randn(3), index=['a','b','c'])
print(s3)
print(s4)
print(s3* s4)
print(s3 * 10)
# Output:
a   -0.149056
b    0.637856
c   -1.357440
dtype: float64
a   -0.443937
b   -0.695017
c    2.217806
dtype: float64

a    0.066171
b   -0.443321
c   -3.010538
dtype: float64

a    -1.490556
b     6.378556
c   -13.574397
dtype: float64
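
When the NaN produced by non-matching labels is not wanted, the method form of the operator accepts a fill value. A small sketch with made-up numbers (the data is an assumption for illustration):

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

plain = s1 + s2                     # 'a' and 'c' exist on one side only, so they become NaN
filled = s1.add(s2, fill_value=0)   # the missing side is treated as 0 instead

print(plain)
print(filled)
```

Series.add, sub, mul and div all take the same fill_value argument.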

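Question: how should NaN values be handled during data processing?

# First build some data with a missing value
sr1=pd.Series([12,13,14],index=['c','a','d'])
sr3=pd.Series([11,20,10,14], index=['d','c','a','b'])
# Adding the two produces a missing value for label b
sr4=sr1+sr3
print(sr4)
# Output: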
a    23.0
b     NaN
c    32.0
d    25.0
dtype: float64

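Step 1: pd.isnull(series) / pd.notnull(series) find and filter NaN values.

isnull returns a boolean array in which missing values are True; notnull returns a boolean array in which missing values are False.

res=pd.isnull(sr4)  # isnull: boolean array, True where the value is missing
print(res)
# Output:
a    False
b     True
c    False
d    False
dtype: bool

res=pd.notnull(sr4)   # notnull: boolean array, False where the value is missing
print(res)
# Output: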
a     True
b    False
c     True
d     True
dtype: bool

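Step 2: sr4.dropna(), or equivalently pd.Series.dropna(sr4), removes the entries that are NaN.

Note that dropna is a method of Series, not a top-level pandas function: pd.dropna() does not exist, so call it as sr4.dropna() or pd.Series.dropna(sr4).

dropna removes the rows that hold NaN (a Series only has rows).

# dropna: filter out the rows that hold NaN
res=pd.Series.dropna(sr4)  # equivalent to the more idiomatic sr4.dropna()
print(res)
# Output: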
a    23.0
c    32.0
d    25.0
dtype: float64

Step 3: series.fillna(value) replaces NaN with the given value.

fillna fills in the missing data.

# fillna: fill the NaN entries
res=sr4.fillna('fill value for NaN')
print(res)
# Output:
a                    23
b    fill value for NaN
c                    32
d                    25
dtype: object
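
Filling with a string turns the whole Series into object dtype, as the output above shows. In numeric work it is more common to fill with a number. A hedged sketch of the three steps on an equivalent Series built directly (the construction with np.nan is an assumption made to keep the example self-contained):

```python
import numpy as np
import pandas as pd

sr = pd.Series([23.0, np.nan, 32.0, 25.0], index=["a", "b", "c", "d"])

n_missing = int(sr.isna().sum())     # count the missing values
kept = sr[sr.notna()]                # boolean filtering, same result as sr.dropna()
filled_zero = sr.fillna(0)           # replace NaN with a constant
filled_mean = sr.fillna(sr.mean())   # mean() skips NaN: (23 + 32 + 25) / 3

print(n_missing)
print(kept)
print(filled_mean)
```

Both numeric fills keep the dtype as float64, so later arithmetic continues to work.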

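Creating a DataFrame

A DataFrame is a two-dimensional data structure, very close to a spreadsheet or a MySQL table: a tabular structure with an ordered set of columns.

pd.DataFrame(a, index=xx, columns=xxx)

pd.DataFrame({'column 1': xx1, 'column 2': xx2, ...})
A DataFrame can be viewed as a dict of Series that share one index.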

# Create a DataFrame: the simple way
data={'name':['google','baidu','yahoo'],'marks':[100,200,300],'price':[1,2,3]}
res=pd.DataFrame(data)
print(res)
# Output (the default index starts at 0; older pandas versions may list the columns alphabetically):
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3

# Create a DataFrame from Series
res=pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']), 'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
print(res)
# Output:
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4

# Its index is: Index(['a', 'b', 'c', 'd'], dtype='object')
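
The first constructor form, pd.DataFrame(a, index=..., columns=...), is not demonstrated by the examples. A minimal sketch, with an array and labels invented for illustration:

```python
import numpy as np
import pandas as pd

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# index labels the rows, columns labels the columns; both must match the shape of a
df = pd.DataFrame(a, index=["r1", "r2"], columns=["x", "y", "z"])
print(df)
```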

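Attributes and methods

index	Row index
T	Transpose
columns	Column index
dtypes	Data type of each column
values	The values as an array (for a single column: xx['column name'].values)
describe()	Descriptive statistics
sort_index(axis, ..., ascending)	Sort by row or column labels (axis=0 for rows, axis=1 for columns; ascending=True, the default, sorts in ascending order, False in descending order)
sort_values(by, axis, ascending)	Sort by values
head(n=5) / tail(n=5)	First / last n rows; n defaults to 5
shape[0] / shape[1]	Number of rows / number of columns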
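
Several entries in the table (shape, head, tail, T, dtypes) are not exercised by the examples in this article. A short sketch on the same name/marks/price data; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"name": ["google", "baidu", "yahoo"],
                   "marks": [100, 200, 300],
                   "price": [1, 2, 3]})

n_rows, n_cols = df.shape     # shape[0] is the row count, shape[1] the column count
top2 = df.head(2)             # first two rows
last1 = df.tail(1)            # last row
marks = df["marks"].values    # NumPy array holding one column
transposed = df.T             # rows and columns swapped

print(n_rows, n_cols)
print(df.dtypes)
print(transposed)
```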

Continuing with the name/marks/price DataFrame above:

# index: the row index
print(res.index)    #RangeIndex(start=0, stop=3, step=1)

# columns: the column index
print(res.columns)   #Index(['name', 'marks', 'price'], dtype='object')

# values: the array of values
print(res.values)
# Output:
[['google' 100 1]
 ['baidu' 200 2]
 ['yahoo' 300 3]]

# describe(): quick summary statistics
print(res.describe())
# Output:
       marks  price
count    3.0    3.0
mean   200.0    2.0
std    100.0    1.0
min    100.0    1.0
25%    150.0    1.5
50%    200.0    2.0
75%    250.0    2.5
max    300.0    3.0

# axis=0: sort by row labels
res=res.sort_index(axis=0)
print(res)
# Output:
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3

# axis=1: sort by column labels
res=res.sort_index(axis=1,ascending=True)
print(res)
# Output:
   marks    name  price
0    100  google      1
1    200   baidu      2
2    300   yahoo      3

# sort_values(by, axis, ascending): sort by values
data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res=pd.DataFrame(data)
res=res.sort_values(by=['name'],axis=0) # axis must be 0 here: rows are reordered by comparing the values in the name column
print(res)
# Output:
     name  marks  price
1   baidu    200      2
0  google    100      1
2   yahoo    300      3
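
sort_values also accepts several keys and a separate direction per key. A hedged sketch; the extra 'bing' row and the tied marks are invented here so that the second key has an effect:

```python
import pandas as pd

df = pd.DataFrame({"name": ["google", "baidu", "yahoo", "bing"],
                   "marks": [100, 200, 100, 200],
                   "price": [1, 2, 3, 4]})

# Single key, descending
by_marks_desc = df.sort_values(by="marks", ascending=False)

# Two keys: marks ascending, and within equal marks, price descending
multi = df.sort_values(by=["marks", "price"], ascending=[True, False])

print(by_marks_desc)
print(multi)
```

The original row labels travel with the rows, which is why the printed index is no longer in order after sorting.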

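Indexing with loc, iloc and ix

1. loc: select rows by row label.

1.1 loc[1] selects the row labelled 1 (integer index), and loc['d'] selects the row labelled 'd' (string index). loc can also return several rows.

1.2 Selecting a column the same way as in 1.1 raises an error. [KeyError: 'the label [a] is not in the [index]']

1.3 Extending loc: selecting a column, or a given row and column. [The most direct way to get a column is df[column label]; the loc[:, ...] form is useful when rows and columns are selected in the same expression.]

[Note that a label slice such as df.loc[1:3] includes the labels 1, 2 and 3, unlike ordinary Python slicing, which excludes the end.]

2. iloc: select rows by integer position.

2.1 Pass the position of the row you want; several rows can be selected by position as well.

2.2 Selecting by row label raises an error.

2.3 Selecting columns with iloc.

3. ix: mixed indexing that combines the two (by position or by label). ix was deprecated in pandas 0.20 and removed in pandas 1.0; use loc or iloc in current code.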

import pandas as pd
df1 = pd.DataFrame([[1,2,3],[4,5,6]], index=[100.1,888], columns=['a','b','c'])
print(df1.loc[100.1])     # loc with a numeric index
# Output:
a    1
b    2
c    3
Name: 100.1, dtype: int64

df2 = pd.DataFrame([[1,2,3],[4,5,6]], index=['w','x'], columns=['a','b','c'])
print(df2.loc['x'])     # loc with a string index
# Output:
a    4
b    5
c    6
Name: x, dtype: int64

print(df2.loc['a'])   # wrong: loc with a single label looks in the row index, not the columns
# Output:
KeyError: 'the label [a] is not in the [index]'

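print(df2.loc['w':])     # loc with a colon slice returns several rows
# Output:
   a  b  c
w  1  2  3
x  4  5  6

print(df2.loc[:,['c']])     # loc with a colon for the rows selects a column
# Output:
   c
w  3
x  6

print(df2.loc['w',['c','b']])     # loc selecting a given row and given columns
# Output:
c    3
b    2
Name: w, dtype: int64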

   
print(df2.iloc[0])     # iloc selects a row by position; pass 0 for the first row
# Output:
a    1
b    2
c    3
Name: w, dtype: int64

print(df2.iloc['w'])    # iloc with a row label raises an error
TypeError: cannot do positional indexing on <class 'pandas.indexes.base.Index'> with these indexers [w] of <class 'str'>

print(df2.iloc[0:])     # iloc can select several rows by position
# Output:
   a  b  c
w  1  2  3
x  4  5  6

print(df2.iloc[:,[1]])   # iloc selecting a column by position
# Output:
   b
w  2
x  5


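# Note: .ix only exists in pandas versions before 1.0
print(df2.ix[0])      # ix mixed indexing: by position (same as df2.iloc[0])
# Output:
a    1
b    2
c    3
Name: w, dtype: int64

print(df2.ix['w'])      # ix mixed indexing: by row label (same as df2.loc['w'])
# Output:
a    1
b    2
c    3
Name: w, dtype: int64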
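
Because ix is not available in current pandas, mixed position/label selection has to be expressed with loc or iloc. A sketch of one way to do that, together with the inclusive label slice noted in section 1; df_int and its values are invented for illustration:

```python
import pandas as pd

df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=["w", "x"], columns=["a", "b", "c"])

# Row by position, column by label: turn the position into a label first
v1 = df2.loc[df2.index[0], "b"]

# Row by label, column by position: turn the label into a position first
v2 = df2.iloc[df2.index.get_loc("x"), 2]

# Label slices include both endpoints; positional slices exclude the end
df_int = pd.DataFrame({"v": [10, 20, 30, 40, 50]})   # default index 0..4
by_label = df_int.loc[1:3]    # rows labelled 1, 2 and 3
by_pos = df_int.iloc[1:3]     # rows at positions 1 and 2

print(v1, v2)
print(by_label)
print(by_pos)
```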

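DataFrame arithmetic

xxxx.sum() sums each column by default; sum(1) sums each row.

[Note: sum also works on string elements, concatenating them. In the pandas version used here, a row that mixes strings and numbers sums only the numbers; recent versions may raise a TypeError unless numeric_only=True is passed.]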

dic1={'name':['小明','小红','狗蛋','铁柱'],'age':[17,20,5,40],'gender':['男','女','女','男']}
df3=pd.DataFrame(dic1)

df3.sum()
# Output:
name      小明小红狗蛋铁柱
age             82
gender        男女女男
dtype: object

df3.sum(1)  # on recent pandas versions use df3.sum(axis=1, numeric_only=True)
# Output:
0    17
1    20
2     5
3    40
dtype: int64
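
To make the "numbers only" behaviour explicit rather than version-dependent, sum accepts numeric_only. A hedged sketch on a small frame invented for this purpose:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [17, 20], "score": [1.5, 2.5]})

# Column sums over the numeric columns only; the string column is skipped
col_sums = df.sum(numeric_only=True)

# Row sums over the numeric columns only
row_sums = df.sum(axis=1, numeric_only=True)

print(col_sums)
print(row_sums)
```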

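xxxx.apply() applies a function to each column; here it is used for scalar multiplication. If the elements are strings, each string is repeated. [In fact, df3 * 2 and df4 * 2 give the same results directly.]

xxxx ** 2 raises each element to a power. It raises an error if any element is a string.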

df4=pd.DataFrame([[1,2,3,4],[2,3,4,5],
                  [3,4,5,6],[4,5,6,7]],
                 index=list('ABCD'),columns=list('ABCD'))
df4.apply(lambda x:x*2)
# Output:
   A   B   C   D
A  2   4   6   8
B  4   6   8  10
C  6   8  10  12
D  8  10  12  14

df3.apply(lambda x:x *2)
# Output:
   name  age gender
0  小明小明   34     男男
1  小红小红   40     女女
2  狗蛋狗蛋   10     女女
3  铁柱铁柱   80     男男

df4 **2
# Output:
    A   B   C   D
A   1   4   9  16
B   4   9  16  25
C   9  16  25  36
D  16  25  36  49
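
apply is more general than scalar multiplication: the function receives a whole column (or a whole row with axis=1) as a Series. A short sketch on the same df4 values; the lambda functions are illustrative choices:

```python
import pandas as pd

df4 = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]],
                   index=list("ABCD"), columns=list("ABCD"))

# Default axis=0: the function is called once per column
col_range = df4.apply(lambda col: col.max() - col.min())

# axis=1: the function is called once per row
row_total = df4.apply(lambda row: row.sum(), axis=1)

# Element-wise arithmetic needs no apply at all
doubled = df4 * 2

print(col_range)
print(row_total)
```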

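Adding to a DataFrame

A column can be added just as with a dict, by assigning a list to a column name; the list must have the same length as the index.

insert can also be used; it places the new column at a given position and shifts the other columns to the right.

df4['E']=['999','999','999','999']
df4
# Output (note: column E holds strings):
   A  B  C  D    E
A  1  2  3  4  999
B  2  3  4  5  999
C  3  4  5  6  999
D  4  5  6  7  999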


df4.insert(0,'F',[888,888,888,888])
df4
# Output:
     F  A  B  C  D    E
A  888  1  2  3  4  999
B  888  2  3  4  5  999
C  888  3  4  5  6  999
D  888  4  5  6  7  999
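
A hedged sketch of the same two techniques on a smaller frame, including what happens when the list length does not match the index; the frame and column names are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

df["C"] = df["A"] + df["B"]   # a computed column is appended at the end
df.insert(1, "Z", 0)          # a scalar is broadcast to every row, placed at position 1

# A list whose length differs from the index is rejected
length_error = False
try:
    df["bad"] = [1, 2, 3]
except ValueError:
    length_error = True

print(df)
print(length_error)
```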

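Merging DataFrames

join merges two DataFrames by matching their index labels, using the DataFrame it is called on as the base (a left join by default). In the example below, the new df7 keeps the row index of df4.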

df6=pd.DataFrame(['my','name','is','a'],index=list('ACDH'),columns=list('G'))
df7=df4.join(df6)
df7
# Output:
     F  A  B  C  D    E     G
A  888  1  2  3  4  999    my
B  888  2  3  4  5  999   NaN
C  888  3  4  5  6  999  name
D  888  4  5  6  7  999    is
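
Label H from df6 is dropped above because the default is a left join. The how parameter of join changes which labels are kept. A minimal sketch with two small frames invented for illustration:

```python
import pandas as pd

left = pd.DataFrame({"v": [1, 2, 3]}, index=list("ABC"))
right = pd.DataFrame({"g": ["my", "name", "a"]}, index=list("ACH"))

left_join = left.join(right)                 # default how='left': keeps A, B, C
inner_join = left.join(right, how="inner")   # only labels present in both: A, C
outer_join = left.join(right, how="outer")   # union of labels: A, B, C, H

print(left_join)
print(inner_join)
print(outer_join)
```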