Getting Started with Pandas (4)

Python之简

Article link: https://blog.youkuaiyun.com/qq_1290259791/article/details/83316169

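This article covers some of the more advanced features of the Pandas library, including grouping, aggregating, and filtering data, as well as techniques for merging data. It also introduces the Pandas IO tools, such as reading Excel files, custom indexes, and converters. It is aimed at data analysts and engineers who already have some Python background.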

Common Pandas Features

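Grouping in Pandas

In many cases we split the data into groups and apply a function to each subset. The operations covered below are aggregation and filtering.

Splitting data into groups

Group the data by a column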

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(18).reshape(6,3),
index = list('abcdef'),
columns = ['one','two','three'])
print(df.groupby('one'))

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10387dd68>

Viewing the groups

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(18).reshape(6,3),
index = list('abcdef'),
columns = ['one','two','three'])
print(df.groupby(['one','two']).groups)

{(0, 1): Index(['a'], dtype='object'), (3, 4): Index(['b'], dtype='object'), (6, 7): Index(['c'], dtype='object'), (9, 10): Index(['d'], dtype='object'), (12, 13): Index(['e'], dtype='object'), (15, 16): Index(['f'], dtype='object')}

Iterating over groups

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(18).reshape(6,3),
index = list('abcdef'),
columns = ['one','two','three'])
groupd = df.groupby('one')
for name, group in groupd:
	# print(name)
	print(group)

   one  two  three
a    0    1      2
   one  two  three
b    3    4      5
   one  two  three
c    6    7      8
   one  two  three
d    9   10     11
   one  two  three
e   12   13     14
   one  two  three
f   15   16     17

Selecting a group

Use the get_group() method to select a single group.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(18).reshape(6,3),
index = list('abcdef'),
columns = ['one','two','three'])
groupd = df.groupby('one')
print(groupd.get_group(3))

   one  two  three
b    3    4      5
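
In the examples above every value of one is unique, so each group holds exactly one row. The following is a minimal sketch on hypothetical data (the team and points columns are made up for illustration, not taken from the article) in which a key repeats, so the same three operations return groups with several rows.

```python
import pandas as pd

# Hypothetical sample data: 'team' repeats, so a group can hold several rows.
scores = pd.DataFrame({
    'team': ['A', 'B', 'A', 'B', 'C'],
    'points': [10, 20, 30, 40, 50],
})
by_team = scores.groupby('team')

# .groups maps each group key to the row labels that belong to it.
group_labels = {key: labels.tolist() for key, labels in by_team.groups.items()}
print(group_labels)

# Iterating yields (key, sub-DataFrame) pairs.
for name, group in by_team:
    print(name, len(group))

# get_group() returns the rows of a single group.
team_a = by_team.get_group('A')
print(team_a)
```

Grouping by a column with repeated values is the usual case; the single-row groups in the article's data are an artifact of using np.arange to fill the frame.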

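Aggregation

An aggregation function returns a single aggregated value for each group.
Once the GroupBy object has been created, several aggregation operations can be performed on the grouped data.
The usual way is the aggregate method or its equivalent alias, agg.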

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(18).reshape(6,3),
index = list('abcdef'),
columns = ['one','two','three'])
print(df)
grouped = df.groupby('one')
print(grouped['three'].agg(np.mean))
print(grouped['three'].agg(np.size))  # size of each group

   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8
d    9   10     11
e   12   13     14
f   15   16     17
one
0      2
3      5
6      8
9     11
12    14
15    17
Name: three, dtype: int64
one
0     1
3     1
6     1
9     1
12    1
15    1
Name: three, dtype: int64
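
Because each group above has a single row, the mean simply repeats the value and every size is 1. As a sketch on hypothetical data (team and points are made-up names), the same calls become more informative once keys repeat. This version passes the aggregation as a string name, which agg also accepts, and uses the GroupBy size() method for the group sizes.

```python
import pandas as pd

# Hypothetical sample data with repeated keys.
scores = pd.DataFrame({
    'team': ['A', 'B', 'A', 'B', 'C'],
    'points': [10, 20, 30, 40, 50],
})
grouped = scores.groupby('team')['points']

# One aggregated value per group, requested by name.
mean_points = grouped.agg('mean')
print(mean_points)

# Number of rows in each group.
group_sizes = grouped.size()
print(group_sizes)
```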

Applying multiple aggregation functions at once

Pass a list or a dict of functions to aggregate.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(18).reshape(6,3),
index = list('abcdef'),
columns = ['one','two','three'])
print(df)
grouped = df.groupby('one')
print(grouped['three'].agg([np.sum, np.mean, np.std]))

     sum  mean  std
one                
0      2     2  NaN
3      5     5  NaN
6      8     8  NaN
9     11    11  NaN
12    14    14  NaN
15    17    17  NaN
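
The example above shows the list form only. The dict form maps each column to its own aggregation. Here is a minimal sketch on hypothetical data (the team, points, and fouls columns are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical sample data with two value columns.
games = pd.DataFrame({
    'team': ['A', 'B', 'A', 'B', 'C'],
    'points': [10, 20, 30, 40, 50],
    'fouls': [1, 0, 2, 3, 1],
})

# Dict form: a different aggregation for each column.
summary = games.groupby('team').agg({'points': 'sum', 'fouls': 'max'})
print(summary)

# List form: several aggregations of one column, one result column each.
stats = games.groupby('team')['points'].agg(['sum', 'mean'])
print(stats)
```

The std column in the article's output is NaN because the sample standard deviation is undefined for a group with a single row.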

Filtering

filter() keeps only the groups for which the given function returns True.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(18).reshape(6,3),
index = list('abcdef'),
columns = ['one','two','three'])
print(df)
grouped = df.groupby('one')
print(grouped.filter(lambda x : len(x) >= 2 ))

   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8
d    9   10     11
e   12   13     14
f   15   16     17
Empty DataFrame
Columns: [one, two, three]
Index: []
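
The result above is empty because every value of one is unique: each group has one row, so no group satisfies len(x) >= 2. As a sketch on hypothetical data (team and points are made-up names), the same filter keeps rows once some keys repeat:

```python
import pandas as pd

# Hypothetical sample data: teams A and B have two rows each, C has one.
scores = pd.DataFrame({
    'team': ['A', 'B', 'A', 'B', 'C'],
    'points': [10, 20, 30, 40, 50],
})

# Keep only the rows that belong to groups with at least two members.
kept = scores.groupby('team').filter(lambda g: len(g) >= 2)
print(kept)
```

The returned frame contains the original rows of the surviving groups, not one row per group.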

Merging/Joining in Pandas

Pandas provides a single merge() function as the entry point for all standard database join operations between DataFrame objects.

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False)

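Parameter	Description
left	A DataFrame object.
right	Another DataFrame object.
on	Column name(s) to join on; must exist in both the left and the right DataFrame.
left_on	Column(s) of the left DataFrame to use as keys; either column names or arrays whose length equals the length of the DataFrame.
right_on	Column(s) of the right DataFrame to use as keys; either column names or arrays whose length equals the length of the DataFrame.
left_index	If True, use the index (row labels) of the left DataFrame as its join key(s).
right_index	Same as left_index, for the right DataFrame.
how	One of left, right, outer, inner. Defaults to inner.
sort	Sort the result DataFrame lexicographically by the join keys. Defaults to False in current pandas; leaving it off improves performance substantially in many cases.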
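
The examples that follow only use on. As a sketch of the other key parameters in the table, here is a join on differently named columns with left_on/right_on, and the same join on the index with left_index/right_index. The frames and column names (students, answers, student_id, sid) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
students = pd.DataFrame({'student_id': [1, 2, 3], 'name': ['hubo', 'vim', 'vi']})
answers = pd.DataFrame({'sid': [2, 3, 4], 'answer': ['sub2', 'sub3', 'sub4']})

# left_on / right_on: join on columns that are named differently on each side.
joined = pd.merge(students, answers, left_on='student_id', right_on='sid')
print(joined)

# left_index / right_index: join on the row labels instead of on columns.
by_index = pd.merge(
    students.set_index('student_id'),
    answers.set_index('sid'),
    left_index=True,
    right_index=True,
)
print(by_index)
```

Both calls use the default how='inner', so only the keys 2 and 3, which are present on both sides, appear in the result.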

Merging two DataFrames on a single key

import pandas as pd
import numpy as np
left = pd.DataFrame({
	'id':[1,2,3,4],
	'Name':['hubo','vim','vi','kaka'],
	'answer_id':['sub1','sub2','sub3','sub4']
	})
right = pd.DataFrame({
	'id':[1,2,3,4],
	'Name':['li','bo','wn','su'],
	'answer_id':['sub2','sub3','sub5','sub6']
	})
mid = pd.merge(left,right,on='id')
print(mid)

   id Name_x answer_id_x Name_y answer_id_y
0   1   hubo        sub1     li        sub2
1   2    vim        sub2     bo        sub3
2   3     vi        sub3     wn        sub5
3   4   kaka        sub4     su        sub6

Merging two DataFrames on multiple keys

import pandas as pd
import numpy as np
left = pd.DataFrame({
	'id':[1,2,3,4],
	'Name':['hubo','vim','vi','kaka'],
	'answer_id':['sub1','sub2','sub3','sub4']
	})
right = pd.DataFrame({
	'id':[1,2,3,4],
	'Name':['li','bo','wn','su'],
	'answer_id':['sub2','sub5','sub3','sub4']
	})
mid = pd.merge(left,right,on=['id','answer_id'])
print(mid)

   id Name_x answer_id Name_y
0   3     vi      sub3     wn
1   4   kaka      sub4     su

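The how parameter

The how argument specifies which keys are included in the resulting table. If a key combination does not appear in either the left or the right table, the corresponding values in the joined table are NA.

left: use the keys from the left object
right: use the keys from the right object
outer: use the union of the keys
inner: use the intersection of the keys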

import pandas as pd
import numpy as np
left = pd.DataFrame({
	'id':[1,2,3,4],
	'Name':['hubo','vim','vi','kaka'],
	'answer_id':['sub1','sub2','sub3','sub4']
	})
right = pd.DataFrame({
	'id':[1,2,3,4],
	'Name':['li','bo','wn','su'],
	'answer_id':['sub2','sub5','sub3','sub4']
	})
mid = pd.merge(left,right,on=['answer_id'], how='left')
print(mid)

   id_x Name_x answer_id  id_y Name_y
0     1   hubo      sub1   NaN    NaN
1     2    vim      sub2   1.0     li
2     3     vi      sub3   3.0     wn
3     4   kaka      sub4   4.0     su
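
To compare the four options side by side, the following sketch merges two small hypothetical frames (the key, left_val, and right_val columns are made up for illustration) with each value of how and counts the resulting rows.

```python
import pandas as pd

# Hypothetical frames sharing the keys 'sub2' and 'sub3' only.
left = pd.DataFrame({'key': ['sub1', 'sub2', 'sub3'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['sub2', 'sub3', 'sub5'], 'right_val': [20, 30, 50]})

# Number of result rows for each join type.
row_counts = {
    how: len(pd.merge(left, right, on='key', how=how))
    for how in ['left', 'right', 'outer', 'inner']
}
print(row_counts)

# The outer join keeps every key; values missing on one side become NaN.
outer = pd.merge(left, right, on='key', how='outer')
print(outer)
```

Since each key is unique within its frame here, the row counts equal the number of keys kept: all left keys, all right keys, their union, and their intersection.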

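Pandas IO Tools

The main functions for reading text files are read_csv() and read_table().
Both use the same parsing code to intelligently convert tabular data into a DataFrame object. The examples below read an Excel file with read_excel(), which accepts many of the same options.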

import pandas as pd
df = pd.read_excel('catering_sale.xls')
print(df.head())

          日期      销量
0 2015-03-01    51.0
1 2015-02-28  2618.2
2 2015-02-27  2608.4
3 2015-02-26  2651.9
4 2015-02-25  3442.1
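
The text mentions read_csv() but the example uses read_excel(). As a sketch that runs without the catering_sale.xls file, the snippet below parses CSV text from memory. The CSV content and the English column names date and sales are assumptions made for illustration; the values reuse the first rows of the output above.

```python
import io
import pandas as pd

# In-memory CSV text standing in for a file on disk (hypothetical content).
csv_text = (
    "date,sales\n"
    "2015-03-01,51.0\n"
    "2015-02-28,2618.2\n"
    "2015-02-27,2608.4\n"
)

# read_csv accepts a path or any file-like object; parse_dates converts the column to datetimes.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
print(df.head())
print(df.dtypes)
```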

Custom index

Use index_col to make one column of the file the index.

import pandas as pd
df = pd.read_excel('catering_sale.xls', index_col='日期')
print(df.head())

                销量
日期                
2015-03-01    51.0
2015-02-28  2618.2
2015-02-27  2608.4
2015-02-26  2651.9
2015-02-25  3442.1
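
index_col works the same way in read_csv(). A minimal sketch on in-memory CSV text follows; the content and the column names date and sales are hypothetical.

```python
import io
import pandas as pd

# In-memory CSV text standing in for a file on disk (hypothetical content).
csv_text = (
    "date,sales\n"
    "2015-03-01,51.0\n"
    "2015-02-28,2618.2\n"
)

# Use the 'date' column as the index; parse_dates=True asks pandas to parse the index as dates.
indexed = pd.read_csv(io.StringIO(csv_text), index_col='date', parse_dates=True)
print(indexed)

# Rows can now be looked up by date.
first_day_sales = indexed.loc[pd.Timestamp('2015-03-01'), 'sales']
print(first_day_sales)
```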

Converters

Column dtypes can be passed as a dict.
For example, to read an int column as float:

import pandas as pd
import numpy as np
# The dict maps a column name to its dtype; the column in this file is '销量'.
df = pd.read_excel('catering_sale.xls', dtype={'销量': np.float64})
print(df.dtypes)

日期    datetime64[ns]
销量           float64
dtype: object
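
The 销量 column already contains decimal values, so the output above would show float64 even without dtype. The sketch below uses hypothetical CSV text (the item and qty columns are made up) where the effect is visible, and also shows the converters argument, which applies a function to every cell of a column.

```python
import io
import pandas as pd

# Hypothetical CSV text with an integer column.
csv_text = "item,qty\napple,3\npear,5\n"

# Without dtype, qty is inferred as an integer column.
default = pd.read_csv(io.StringIO(csv_text))
print(default.dtypes)

# dtype: force qty to be read as float.
as_float = pd.read_csv(io.StringIO(csv_text), dtype={'qty': 'float64'})
print(as_float.dtypes)

# converters: apply a function to each raw cell value of a column.
converted = pd.read_csv(io.StringIO(csv_text), converters={'item': str.upper})
print(converted)
```

dtype only changes the type a column is stored as, while a converter can transform the values themselves.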

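Specifying header names

Replace the column names with names. With header=1, the second row of the file is treated as the header row, so the first data row is dropped and the output starts from the row after it.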

import pandas as pd
import numpy as np
df = pd.read_excel('catering_sale.xls', names=['a','b'], header=1)
print(df.head())

           a       b
0 2015-02-28  2618.2
1 2015-02-27  2608.4
2 2015-02-26  2651.9
3 2015-02-25  3442.1
4 2015-02-24  3393.1
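
To replace the column names without losing the first data row, header=0 can be combined with names. The sketch below shows both variants with read_csv() on hypothetical in-memory CSV text (the date and sales column names are assumptions), which handles these two options in the same way as the read_excel() call above.

```python
import io
import pandas as pd

# Hypothetical CSV text: one header line followed by two data rows.
csv_text = (
    "date,sales\n"
    "2015-03-01,51.0\n"
    "2015-02-28,2618.2\n"
)

# header=0: the first line is the header and gets replaced by names; both data rows are kept.
renamed = pd.read_csv(io.StringIO(csv_text), names=['a', 'b'], header=0)
print(renamed)

# header=1: the second line is consumed as the header, so only the last row remains as data.
skipped = pd.read_csv(io.StringIO(csv_text), names=['a', 'b'], header=1)
print(skipped)
```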