Python Data Analysis [3]: A Detailed pandas Tutorial


License: CC 4.0 BY-SA

Tags: #python #data-analysis

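This post is a detailed walkthrough of pandas, the Python data analysis library. It covers creating Series and DataFrame objects, reading external data (CSV, MySQL, MongoDB), data manipulation, statistical methods, merging and group-by aggregation, and time series analysis, with applications to real datasets: movie data, worldwide Starbucks store counts, 911 emergency call data, and air quality data.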

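1. Series and reading external data

1.1 Preface

numpy: numerical computation on arrays
matplotlib: data visualization
pandas: besides numeric data, it also handles other types such as strings and time series. Examples: data scraped by a crawler and stored in a database, or the earlier YouTube example, which included country, video category (tag), and title fields in addition to numbers.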

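1.2 Data types

Series: a one-dimensional labelled array
DataFrame: a two-dimensional container of Series

1.3 Creating a Series

A Series is essentially two arrays: one holds the keys (the index) and the other holds the values, with each key mapping to a value.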

import pandas as pd
t=pd.Series([1,2,31,12,3,4])
>>>0     1
1     2
2    31
3    12
4     3
5     4
dtype: int64

type(t)
>>>pandas.core.series.Series

t1=pd.Series([1,23,2,2,1],index=list('abcde'))
>>>a     1
b    23
c     2
d     2
e     1
dtype: int64

temp_dict={'name':'xiaohong','age':30,'tel':'10086'}
t2=pd.Series(temp_dict)
>>>name    xiaohong
age           30
tel        10086
dtype: object

1.4 Series indexing and slicing

temp_dict={'name':'xiaohong','age':30,'tel':'10086'}
t2=pd.Series(temp_dict)
>>>name    xiaohong
age           30
tel        10086
dtype: object

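# Indexing
t2['age']
>>>30

t2.iloc[0]  # positional access; plain t2[0] on a string-labelled Series is deprecated in pandas 2.x
>>>'xiaohong'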

t2[['age','tel']]
>>>age       30
tel    10086
dtype: object

# Slicing
t2[1:2]
>>>age    30
dtype: object

1.5 Series index and values

temp_dict={'name':'xiaohong','age':30,'tel':'10086'}
t2=pd.Series(temp_dict)
>>>name    xiaohong
age           30
tel        10086
dtype: object

# Index
t2.index
>>>Index(['name', 'age', 'tel'], dtype='object')

for i in t2.index:
	print(i) 
>>>name
age
tel

type(t2.index)  # type of the index
>>>pandas.core.indexes.base.Index
len(t2.index)  # length
>>>3
list(t2.index)  # convert to a list
>>>['name', 'age', 'tel']


# Values
t2.values
>>>array(['xiaohong', 30, '10086'], dtype=object)
type(t2.values)  # type of the values
>>>numpy.ndarray

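Many ndarray methods also work on a Series, for example argmax and clip.
Series has a where method too, but it behaves differently from np.where: Series.where(cond) keeps the values where cond is True and replaces the others with NaN (or with the `other` argument).
For details, see the official documentation (search for: pandas Series where).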
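The difference between the Series methods and their NumPy counterparts can be sketched as follows. This is a minimal illustration on a made-up Series (the values and the threshold are assumptions chosen for the example, not data from this post).

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 23, 2, 2, 1], index=list("abcde"))

# ndarray-style methods work directly on a Series
largest_pos = s.argmax()               # integer position of the largest value
clipped = s.clip(lower=2, upper=10)    # values limited to the range [2, 10]

# Series.where keeps values where the condition is True
# and replaces the rest (NaN by default, or the second argument)
kept = s.where(s > 1)        # 'a' and 'e' become NaN, dtype becomes float64
filled = s.where(s > 1, 0)   # 'a' and 'e' become 0

# np.where chooses between two alternatives and returns a plain ndarray,
# so the index labels are lost
picked = np.where(s > 1, s, -1)

print(kept)
print(filled)
print(picked)
```

Note that Series.where preserves the index, which is usually what you want when the result feeds back into other pandas operations.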

1.6 Reading external data

CSV data: pd.read_csv
MySQL data: pd.read_sql(sql_sentence, connection)
MongoDB: use a third-party library (from pymongo import MongoClient)
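As a rough sketch of the pd.read_sql call pattern, the example below uses an in-memory SQLite database from the standard library as a stand-in for a MySQL server, so that it runs without any setup. The table name `dogs`, its columns, and the rows are assumptions made up for illustration; with MySQL you would pass a connection created by your database driver instead.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a real MySQL server (assumption)
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE dogs (name TEXT, uses INTEGER)")
connection.executemany(
    "INSERT INTO dogs VALUES (?, ?)",
    [("BELLA", 1195), ("MAX", 1153), ("COCO", 852)],
)
connection.commit()

# The query result comes back as a DataFrame, one column per selected field
sql_sentence = "SELECT name, uses FROM dogs WHERE uses > 1000 ORDER BY uses DESC"
df_sql = pd.read_sql(sql_sentence, connection)
connection.close()

print(df_sql)
```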

import pandas as pd
# read a CSV file with pandas
df=pd.read_csv('dogNames2.csv')
# Dataset: https://www.kaggle.com/new-york-city/nyc-dog-names/data
print(df)
>>> Row_Labels  Count_AnimalName
0              1                 1
1              2                 2
2          40804                 1
3          90201                 1
4          90203                 1
...          ...               ...
16215      37916                 1
16216      38282                 1
16217      38583                 1
16218      38948                 1
16219      39743                 1

[16220 rows x 2 columns]

1.7 DataFrame

A DataFrame has both a row index and a column index.
- Row index: labels the rows; called index; axis 0 (axis=0)
- Column index: labels the columns; called columns; axis 1 (axis=1)
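The two axes are easiest to remember by seeing what an operation does along each one. The sketch below uses a small made-up frame (the name `df_axis` is an assumption for this example): reducing along axis=0 collapses the rows and gives one result per column, while axis=1 collapses the columns and gives one result per row.

```python
import numpy as np
import pandas as pd

df_axis = pd.DataFrame(
    np.arange(12).reshape(3, 4), index=list("abc"), columns=list("WXYZ")
)

col_totals = df_axis.sum(axis=0)  # collapse the rows: one total per column
row_totals = df_axis.sum(axis=1)  # collapse the columns: one total per row

without_row = df_axis.drop("a", axis=0)  # 'a' is looked up in the row index
without_col = df_axis.drop("W", axis=1)  # 'W' is looked up in the column index

print(col_totals)
print(row_totals)
```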

import pandas as pd
import numpy as np
pd.DataFrame(np.arange(12).reshape(3,4))
>>>
    0	1	2	3
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
# the row index runs down the left side; the column index runs across the top

pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('WXYZ'))
>>>
    W	X	Y	Z
a	0	1	2	3
b	4	5	6	7
c	8	9	10	11

d1={'name':['xiaohong','xiaoming'],'age':[20,32],'tel':[10086,10010]}
t1=pd.DataFrame(d1)
>>>
        name	age	tel
0	xiaohong	20	10086
1	xiaoming	32	10010
type(t1)
>>>pandas.core.frame.DataFrame

d2=[{'name':'xiaohong','age':32,'tel':'10010'},
    {'name':'xiaogang','tel':'10000'},
    {'name':'xiaowang','age':22}]
t2=pd.DataFrame(d2)
>>>
        name	age	tel
0	xiaohong	32.0	10010
1	xiaogang	NaN	10000
2	xiaowang	22.0	NaN

# Basic DataFrame attributes
t2.index  # row index
>>>RangeIndex(start=0, stop=3, step=1)
t2.columns  # column index
>>>Index(['name', 'age', 'tel'], dtype='object')
t2.values  # the values as a 2-D ndarray
>>>array([['xiaohong', 32.0, '10010'],
       ['xiaogang', nan, '10000'],
       ['xiaowang', 22.0, nan]], dtype=object)
t2.shape  # (number of rows, number of columns)
>>>(3, 3)
t2.dtypes  # dtype of each column
>>>name     object
age     float64
tel      object
dtype: object
t2.ndim  # number of dimensions
>>>2

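# Inspecting a DataFrame as a whole
t2.head(2)  # first n rows (default n=5)
t2.tail(2)  # last n rows (default n=5)
t2.info()  # overview: row count, column count, column index, non-null count per column, column dtypes, memory usage
t2.describe()  # quick summary statistics: count, mean, standard deviation, min, max, quartiles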

import pandas as pd
from pymongo import MongoClient

# connect to a local MongoDB instance and read the 'tv1' collection of the 'douban' database
client=MongoClient()
collection=client['douban']['tv1']
data=collection.find()
data_list=[]
for i in data:
    # flatten each document into a plain dict so that it becomes one DataFrame row
    temp={}
    temp['info']=i['info']
    temp['rating_count']=i['rating']['count']
    temp['rating_value']=i['rating']['value']
    temp['title']=i['title']
    temp['country']=i['tv_category']
    temp['directors']=i['directors']
    temp['actors']=i['actors']
    data_list.append(temp)

df=pd.DataFrame(data_list)
print(df)

1.8 Selecting rows and columns

import pandas as pd
# read a CSV file with pandas
df=pd.read_csv('dogNames2.csv')
# Dataset: https://www.kaggle.com/new-york-city/nyc-dog-names/data

print(df.head())
print(df.info())
>>>  Row_Labels  Count_AnimalName
0          1                 1
1          2                 2
2      40804                 1
3      90201                 1
4      90203                 1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16220 entries, 0 to 16219
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Row_Labels        16217 non-null  object
 1   Count_AnimalName  16220 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 253.6+ KB
None

# Question: which names are used most often?
# Sorting a DataFrame
df=df.sort_values(by='Count_AnimalName',ascending=False)
df.head(5)  # first 5 rows
# equivalent to df[:5]
>>>	Row_Labels	Count_AnimalName
1156	BELLA	1195
9140	MAX	1153
2660	CHARLIE	856
3251	COCO	852
12368	ROCKY	823

# take one column from the first 5 rows
df[:5]['Row_Labels']
>>>1156       BELLA
9140         MAX
2660     CHARLIE
3251        COCO
12368      ROCKY
Name: Row_Labels, dtype: object
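# Notes on selecting rows and columns in pandas
# - a slice inside [] selects rows
# - a string inside [] selects the column with that label

# Exercise: given 10 columns of data, sort by columns 1, 3 and 8 (see the IPython help for sort_values)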
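One possible approach to sorting by several columns is to pass a list to the `by` parameter of sort_values. The sketch below uses a small hypothetical three-column frame rather than the ten-column one from the exercise; the column names and values are made up. When only positions are known, the labels can be looked up from df.columns first.

```python
import pandas as pd

# Hypothetical data for illustration only
df_sort = pd.DataFrame({
    "city": ["B", "A", "B", "A"],
    "year": [2020, 2021, 2021, 2020],
    "sales": [5, 9, 3, 7],
})

# Sort by several columns by label; ascending can be set per column
by_labels = df_sort.sort_values(by=["city", "year"], ascending=[True, False])

# Sort by column positions: translate positions 0 and 2 into labels first
position_labels = list(df_sort.columns[[0, 2]])
by_position = df_sort.sort_values(by=position_labels)

print(by_labels)
print(by_position)
```

Earlier entries in the list take priority; later ones only break ties.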

1.9 loc

df.loc: select data by label
df.iloc: select data by integer position

t3=pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('WXYZ'))
t3
>>>
    W	X	Y	Z
a	0	1	2	3
b	4	5	6	7
c	8	9	10	11

t3.loc['a','Z']
>>>3
type(t3.loc['a','Z'])
>>>numpy.int32  # platform-dependent; numpy.int64 on many systems

# loc
# select a row
t3.loc['a']
>>>W    0
X    1
Y    2
Z    3
Name: a, dtype: int32

# select a column
t3.loc[:,'Y']
>>>a     2
b     6
c    10
Name: Y, dtype: int32

# select multiple rows
t3.loc[['a','c']]  # equivalent to t3.loc[['a','c'],:]
>>>	W	X	Y	Z
a	0	1	2	3
c	8	9	10	11

# select multiple columns
t3.loc[:,['W','Z']]
>>>	W	Z
a	0	3
b	4	7
c	8	11

# select non-adjacent rows and columns
t3.loc[['a','c'],['W','Z']]
>>>	W	Z
a	0	3
c	8	11
t3.loc['a':,['W','Z']]
>>>
    W	Z
a	0	3
b	4	7
c	8	11

# iloc
# select a row
t3.iloc[1]
>>>W    4
X    5
Y    6
Z    7
Name: b, dtype: int32

# select a column
t3.iloc[:,2]
>>>a     2
b     6
c    10
Name: Y, dtype: int32

# select multiple columns, in the order given
t3.iloc[:,[2,1]]
>>>
    Y	X
a	2	1
b	6	5
c	10	9

# select non-adjacent rows and columns
t3.iloc[[0,2],[2,1]]
>>>
    Y	X
a	2	1
c	10	9

t3.iloc[1:,:3]=10
>>>	W	X	Y	Z
a	0	1	2	3
b	10	10	10	7
c	10	10	10	11

t3.iloc[1:,:2]=np.nan
>>>	W	X	Y	Z
a	0.0	1.0	2	3
b	NaN	NaN	10	7
c	NaN	NaN	10	11
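One detail worth checking by hand is how slices behave with the two accessors: label slices in loc include both endpoints, while position slices in iloc exclude the end, as in ordinary Python. The following sketch rebuilds a fresh 3x4 frame (named `t` here, an assumption for the example) so that it does not depend on the assignments above.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(
    np.arange(12).reshape(3, 4), index=list("abc"), columns=list("WXYZ")
)

# loc slices by label and includes BOTH endpoints
by_label = t.loc["a":"b", "W":"Y"]   # rows a, b and columns W, X, Y

# iloc slices by position and excludes the end, like Python lists
by_position = t.iloc[0:1, 0:2]       # row a only and columns W, X

# loc also accepts a boolean mask for the rows
z_where_w_positive = t.loc[t["W"] > 0, "Z"]

print(by_label)
print(by_position)
print(z_where_w_positive)
```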

1.10 Boolean indexing

import pandas as pd
# read a CSV file with pandas
df=pd.read_csv('dogNames2.csv')
# Dataset: https://www.kaggle.com/new-york-city/nyc-dog-names/data

# find the dog names used more than 800 times
df[df['Count_AnimalName']>800]
>>>
   Row_Labels	Count_AnimalName
1156	BELLA	1195
9140	MAX	1153
2660	CHARLIE	856
3251	COCO	852
12368	ROCKY	823

# find the dog names used between 800 and 1000 times; & means and, | means or, and each condition needs its own parentheses
df[(800<df['Count_AnimalName']) & (df['Count_AnimalName']<1000)]
>>>	Row_Labels	Count_AnimalName
2660	CHARLIE	856
3251	COCO	852
12368	ROCKY	823
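The other boolean operators follow the same pattern. The sketch below is a hedged illustration on a small hand-typed frame that reuses a few name and count pairs shown in this post's outputs (the variable names `df_dogs` and `count` are assumptions); it shows | for or, ~ for not, and Series.between as an alternative to chaining two comparisons.

```python
import pandas as pd

# A few rows copied by hand from the outputs above, so no CSV file is needed
df_dogs = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "CHARLIE", "COCO", "ROCKY", "LUCKY"],
    "Count_AnimalName": [1195, 1153, 856, 852, 823, 723],
})
count = df_dogs["Count_AnimalName"]

either = df_dogs[(count > 1000) | (count < 800)]   # or: very common or less common names
negated = df_dogs[~(count > 800)]                  # not: names used at most 800 times
in_range = df_dogs[count.between(800, 1000, inclusive="neither")]  # 800 < count < 1000

print(either)
print(negated)
print(in_range)
```

The parentheses around each comparison matter because & and | bind more tightly than < and > in Python.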

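1.11 String methods

Commonly used: contains, len, lower/upper, replace, split
contains: returns a boolean for each string indicating whether it contains the given pattern

# find the dog names used more than 700 times whose name is longer than 4 characters
df[(700<df['Count_AnimalName']) & (df['Row_Labels'].str.len()>4)]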
>>>
Row_Labels	Count_AnimalName
1156	BELLA	1195
2660	CHARLIE	856
12368	ROCKY	823
8552	LUCKY	723
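The remaining string methods are all reached through the .str accessor in the same way as str.len(). Here is a minimal sketch on a made-up Series of names (the values, including the missing entry, are assumptions for illustration). Passing na=False to contains is one way to get a clean boolean mask when some names are missing, as is the case for a few rows of Row_Labels in the dog names file.

```python
import pandas as pd

names = pd.Series(["BELLA", "MAX", None, "Lucky Star"])

has_a = names.str.contains("A", na=False)          # missing values count as False
lengths = names.str.len()                          # NaN for the missing name
swapped = names.str.replace("A", "@", regex=False) # literal replacement, no regex
parts = names.str.split(" ")                       # each element becomes a list

print(has_a)
print(lengths)
print(swapped)
print(parts)
```

Because has_a contains no missing values, it can be used directly for boolean indexing, for example names[has_a].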

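1.12 Handling missing data

Missing data comes in two forms:
① Empty values, None, and so on, which pandas represents as NaN (the same as np.nan)
② No data was recorded, and the value is stored as 0

Checking for NaN: pd.isnull(df), pd.notnull(df)
Option 1: drop the rows or columns that contain NaN: dropna(axis=0, how='any', inplace=False)
Option 2: fill in values: t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)

Treating 0 as missing: t[t==0]=np.nan
Of course, not every 0 needs this treatment.
When computing the mean and similar statistics, NaN is excluded from the calculation but 0 is included.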
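The options above can be tried on a tiny frame. This is a sketch with made-up values (the name `t_missing` and its contents are assumptions), chosen so that the effect of each call, and the difference between NaN and 0 in a mean, is easy to verify by hand.

```python
import numpy as np
import pandas as pd

t_missing = pd.DataFrame(
    {"W": [0.0, np.nan, 3.0], "X": [1.0, 5.0, 4.0]}, index=list("abc")
)

# Option 1: drop rows that contain NaN
dropped_any = t_missing.dropna(axis=0, how="any")  # row 'b' has a NaN, so it is removed
dropped_all = t_missing.dropna(axis=0, how="all")  # no row is entirely NaN, nothing removed

# Option 2: fill NaN with each column's mean
filled = t_missing.fillna(t_missing.mean())

# NaN is skipped by mean(), but 0 is counted
mean_with_zero = t_missing["W"].mean()    # (0 + 3) / 2

# Treat 0 as missing as well, then recompute
no_zero = t_missing.copy()
no_zero[no_zero == 0] = np.nan
mean_without_zero = no_zero["W"].mean()   # only 3 remains

print(filled)
print(mean_with_zero, mean_without_zero)
```

Both dropna and fillna return new objects by default; the original frame is only modified when inplace=True is passed.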

pd.isnull(t3)
>>>	W	X	Y	Z
a	False	False	False	False
b	True	True	False	False
c	True	True	False	False
pd.notnull(t3)
>>>	W	X	Y	Z
a	True	True	True	True
b	False	False	True	True
c	False	False	True	True
t3[pd.notnull