Pandas Tutorial: File Reading, Data Structures, and Basic Operations

Article link: https://blog.youkuaiyun.com/lc960928/article/details/111411025

This post introduces the basics of the pandas library: reading files (csv, excel, txt) and writing them, the common data structures Series and DataFrame, and basic operations such as sorting, aggregation functions, and replacement functions. It also covers window objects such as rolling and expanding windows, with worked examples including the Pokémon dataset and exponentially weighted calculations.


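Pandas Basics

1 Reading and Writing Files
- 1.1 Reading Files
- 1.2 Writing Data (Saving Data to Files)
2 Basic Data Structures
- 2.1 Series
- 2.2 DataFrame
3 Common Basic Functions
4 Window Objects
- 4.1 Rolling Windows
- 4.2 Expanding Windows
5 Exercises
- 5.1 Pokémon Dataset
- 5.2 Exponential Weighting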

import numpy as np
import pandas as pd

pd.__version__

'1.0.5'

pip install --upgrade pandas  # Step 2: upgrade pandas

Requirement already satisfied: pandas in d:\anaconda3\lib\site-packages (1.1.5)
Requirement already satisfied: python-dateutil>=2.7.3 in d:\anaconda3\lib\site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in d:\anaconda3\lib\site-packages (from pandas) (2020.1)
Requirement already satisfied: numpy>=1.15.4 in d:\anaconda3\lib\site-packages (from pandas) (1.18.5)
Requirement already satisfied: six>=1.5 in d:\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Note: you may need to restart the kernel to use updated packages.

pip install --upgrade pip  # Step 1: upgrade pip

Collecting pip
  Downloading pip-20.3.3-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.1.1
    Uninstalling pip-20.1.1:
      Successfully uninstalled pip-20.1.1
Successfully installed pip-20.3.3
Note: you may need to restart the kernel to use updated packages.

Reading and Writing Files

Reading Files

pandas can read many file formats. This section focuses on reading csv, excel, and txt files.

df_csv=pd.read_csv('E:/datawhale/joyful-pandas-master/data/my_csv.csv')

df_csv

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

df_txt=pd.read_table('E:/datawhale/joyful-pandas-master/data/my_table.txt')

df_txt

	col1	col2	col3	col4
0	2	a	1.4	apple 2020/1/1
1	3	b	3.4	banana 2020/1/2
2	6	c	2.5	orange 2020/1/5
3	5	d	3.2	lemon 2020/1/7

df_excel=pd.read_excel('E:/datawhale/joyful-pandas-master/data/my_excel.xlsx')
df_excel

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

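Some commonly used shared parameters:

header=None means the first row is not used as column names,
index_col specifies one or more columns to use as the index (indexing is covered in detail in Chapter 3),
usecols specifies the set of columns to read (all columns are read by default),
parse_dates specifies the columns to convert to datetimes (time series are covered in Chapter 10),
nrows specifies the number of rows to read.

All of these parameters can be used in the three functions above.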

pd.read_table('E:/datawhale/joyful-pandas-master/data/my_table.txt', header=None)  # the first row is treated as data rather than column names

	0	1	2	3
0	col1	col2	col3	col4
1	2	a	1.4	apple 2020/1/1
2	3	b	3.4	banana 2020/1/2
3	6	c	2.5	orange 2020/1/5
4	5	d	3.2	lemon 2020/1/7

pd.read_csv('E:/datawhale/joyful-pandas-master/data/my_csv.csv', index_col=['col1', 'col2'])  # use col1 and col2 as the index

		col3	col4	col5
col1	col2
2	a	1.4	apple	2020/1/1
3	b	3.4	banana	2020/1/2
6	c	2.5	orange	2020/1/5
5	d	3.2	lemon	2020/1/7

pd.read_table('E:/datawhale/joyful-pandas-master/data/my_table.txt', usecols=['col1', 'col2'])  # read only the first two columns

	col1	col2
0	2	a
1	3	b
2	6	c
3	5	d

pd.read_csv('E:/datawhale/joyful-pandas-master/data/my_csv.csv', parse_dates=['col5'])  # convert the fifth column to datetimes; useful for time series

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020-01-01
1	3	b	3.4	banana	2020-01-02
2	6	c	2.5	orange	2020-01-05
3	5	d	3.2	lemon	2020-01-07

pd.read_excel('E:/datawhale/joyful-pandas-master/data/my_excel.xlsx', nrows=2)  # read only the first two rows

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
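The shared parameters can also be combined in a single call. The following is a minimal sketch rather than the author's code: it assumes an in-memory string (the hypothetical `csv_text`) that mirrors the rows of my_csv.csv shown above, so it runs without the local files.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for my_csv.csv (same rows as the output above)
csv_text = """col1,col2,col3,col4,col5
2,a,1.4,apple,2020/1/1
3,b,3.4,banana,2020/1/2
6,c,2.5,orange,2020/1/5
5,d,3.2,lemon,2020/1/7
"""

df_combined = pd.read_csv(
    io.StringIO(csv_text),
    index_col='col1',                  # use col1 as the index
    usecols=['col1', 'col3', 'col5'],  # read only these columns
    parse_dates=['col5'],              # convert col5 to datetimes
    nrows=3,                           # read only the first three data rows
)

print(df_combined)
print(df_combined.dtypes)
```

Note that a column named in index_col must also appear in usecols, otherwise it is never read.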

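When reading txt files, the delimiter is often something other than the default tab. read_table has a sep parameter that lets the user specify a custom delimiter for reading txt data. For example, the table read below is delimited by |||| :

pd.read_table('E:/datawhale/joyful-pandas-master/data/my_table_special_sep.txt')  # probably rarely needed; the result looks messy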

	col1 |||| col2
0	TS |||| This is an apple.
1	GQ |||| My name is Bob.
2	WT |||| Well done!
3	PT |||| May I help you?

The result above is clearly not what we want. In this case, pass sep, and also set the engine to python:

pd.read_table('E:/datawhale/joyful-pandas-master/data/my_table_special_sep.txt', sep=r' \|\|\|\| ', engine='python')

	col1	col2
0	TS	This is an apple.
1	GQ	My name is Bob.
2	WT	Well done!
3	PT	May I help you?

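sep is a regular-expression parameter

When using read_table, note that a sep longer than one character is interpreted as a regular expression, so | must be escaped as \| ; otherwise the data will not be read correctly. For the basics of regular expressions, see Chapter 8 or other related material.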
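The escaping rule can be checked without the original file. This is a small sketch under the assumption that the hypothetical string `raw_text` reproduces the first rows of my_table_special_sep.txt; a raw string keeps the backslashes intact for the regular expression.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the first rows of my_table_special_sep.txt
raw_text = (
    "col1 |||| col2\n"
    "TS |||| This is an apple.\n"
    "GQ |||| My name is Bob.\n"
)

# Each | is escaped because an unescaped | means "or" in a regular expression
df_sep = pd.read_table(io.StringIO(raw_text), sep=r' \|\|\|\| ', engine='python')

print(df_sep)
```

If the delimiter comes from a variable, `re.escape` from the standard library can build the escaped pattern instead of writing the backslashes by hand.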

Writing Data (Saving Data to Files)

When writing data, the most common setting is index=False. If the index carries no special meaning, this drops it from the saved file.

df_csv.to_csv('E:/datawhale/joyful-pandas-master/data/my_csv_saved.csv', index=False)  # save the csv data read earlier, without the index

pd.read_csv('E:/datawhale/joyful-pandas-master/data/my_csv_saved.csv')  # re-read the saved file: no extra index column was written

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7
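To see what index=False avoids, compare both settings on a small frame. This is an illustrative sketch with made-up data (the hypothetical `df_demo`), written to in-memory strings instead of files.

```python
import io

import pandas as pd

# Hypothetical two-row frame used only for illustration
df_demo = pd.DataFrame({'col1': [2, 3], 'col2': ['a', 'b']})

csv_with_index = df_demo.to_csv()                # the index is written as an unnamed first column
csv_without_index = df_demo.to_csv(index=False)  # the index is dropped

reread_with = pd.read_csv(io.StringIO(csv_with_index))
reread_without = pd.read_csv(io.StringIO(csv_without_index))

print(reread_with.columns.tolist())
print(reread_without.columns.tolist())
```

When the index is saved, reading the file back turns it into an extra column named "Unnamed: 0", which is why index=False is the usual choice for a default integer index.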

df_excel.to_excel('E:/datawhale/joyful-pandas-master/data/my_excel_saved.xlsx', index=False)  # likewise, save the Excel data without the index

pandas does not define a to_table function, but to_csv can save to a txt file and allows a custom delimiter; the tab character \t is commonly used:

df_txt.to_csv('E:/datawhale/joyful-pandas-master/data/my_txt_saved.txt', sep='\t', index=False)  # table data is stored as txt; to_csv can write it, using sep as the delimiter
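A tab-separated file written this way can be read back with read_table, whose default delimiter is the tab. The sketch below assumes a hypothetical frame `df_tab` and an in-memory buffer in place of my_txt_saved.txt.

```python
import io

import pandas as pd

# Hypothetical data; col4 contains a space, which a tab delimiter leaves untouched
df_tab = pd.DataFrame({
    'col1': [2, 3],
    'col4': ['apple 2020/1/1', 'banana 2020/1/2'],
})

buffer = io.StringIO()
df_tab.to_csv(buffer, sep='\t', index=False)  # write tab-separated text
buffer.seek(0)                                # rewind before reading

restored = pd.read_table(buffer)              # read_table splits on tabs by default
print(restored)
```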

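To quickly convert a table to Markdown or LaTeX, use the to_markdown and to_latex functions. to_markdown requires the tabulate package to be installed.

I usually write blog posts in Markdown. When a post needs mathematical formulas, I write them in LaTeX syntax inside the Markdown and publish to GitHub, so the final web page shows an article with formulas in it.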

conda install tabulate

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: D:\Anaconda3

  added / updated specs:
    - tabulate
Note: you may need to restart the kernel to use updated packages.



The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.9.2                |   py38haa95532_0         2.9 MB
    tabulate-0.8.7             |           py38_0          56 KB
    ------------------------------------------------------------
                                           Total:         2.9 MB

The following NEW packages will be INSTALLED:

  tabulate           pkgs/main/win-64::tabulate-0.8.7-py38_0

The following packages will be UPDATED:

  conda                                        4.9.0-py38_0 --> 4.9.2-py38haa95532_0



Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done

print(df_csv.to_markdown())

|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |

print(df_csv.to_latex())

\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}

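Basic Data Structures

pandas has two basic data storage structures: Series, which stores one-dimensional values, and DataFrame, which stores two-dimensional values. Many attributes and methods are defined on these two structures.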

Series

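A Series generally consists of four parts: the values of the sequence (data), the index (index), the storage type (dtype), and the name of the sequence (name). The index can also be given its own name, which is empty by default. A Series built from an array or a list is displayed as one column of index labels and one column of values; a Series built from a dictionary is displayed as one column of keys and one column of values.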
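The four parts can be inspected directly through attributes. This is a minimal sketch with hypothetical values (`s_list` and `s_dict` are illustrative names, not from the original post).

```python
import pandas as pd

# Built from a list, with a named index and a named Series
s_list = pd.Series(
    [1.5, 2.5, 3.5],
    index=pd.Index(['a', 'b', 'c'], name='my_idx'),
    name='my_name',
)

# Built from a dictionary: keys become the index, values become the data
s_dict = pd.Series({'x': 10, 'y': 20})

print(s_list.values)      # the underlying data
print(s_list.index)       # the index, including its name
print(s_list.dtype)       # the storage type
print(s_list.name)        # the name of the Series
print(s_dict)
```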

s = pd.Series(data = [100, 'a', {'dic1': 5}],  # the values
    index = pd.Index(['id1', 20, 'third'], name='my_idx'),  # the index labels and the index name
    dtype =