pandas Study Notes 01: Reading and Writing Data Files

xMathematics

Last modified 2025-04-25 17:19:19

License: CC 4.0 BY-SA

Category: Deep Learning. Tags: python, data mining, data analysis, pandas, machine learning

First published 2021-12-08 22:37:33

Article link: https://blog.youkuaiyun.com/GeekDongHuang/article/details/121801988

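This article shows how to read CSV and Excel files with the pandas library in Python and how to configure pandas so that a DataFrame is displayed in full. It also walks through the read_csv function and its parameters as given in the official pandas documentation, and lists the common file I/O APIs.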

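The following is a tutorial on reading and writing data files with pandas. It covers CSV, Excel, text files, databases, and other common formats, along with the key parameters for each.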

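I. Working with the Core File Formats

1. CSV files

Reading

import pandas as pd
# Basic read
df = pd.read_csv("data.csv")
# Example with more parameters: a tab-separated file without a header row
df = pd.read_csv("data.tsv", sep="\t", header=None, names=["date", "city"], skiprows=3, encoding="utf-8", nrows=1000)

Key parameters:
- sep: field delimiter (for example, "\t" for TSV files)
- header: row number that holds the column names (None means the file has no header row)
- names: column names to assign, typically used together with header=None
- skiprows: number of lines to skip at the start of the file
- nrows: maximum number of rows to read (useful for sampling large files)
- dtype: data type per column (for example, {"price": "float64"})
- parse_dates: columns to parse as dates (for example, parse_dates=["date"])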
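As a minimal sketch of how these parameters combine, the block below reads an in-memory string instead of a file. The sample text, the column names, and the three junk lines at the top are all made up for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated export: three lines of preamble, then data rows
# without a header row.
raw = (
    "report A\n"
    "generated 2024\n"
    "units: CNY\n"
    "2024-01-01\tBeijing\t12.5\n"
    "2024-01-02\tShanghai\t8\n"
    "2024-01-03\tShenzhen\t9.75\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                         # tab-separated fields
    header=None,                      # the data has no header row
    names=["date", "city", "price"],  # so column names are supplied here
    skiprows=3,                       # skip the three preamble lines
    nrows=2,                          # read only the first two data rows
    dtype={"price": "float64"},       # force a float column
    parse_dates=["date"],             # parse the date column as datetime
)

print(df)
print(df.dtypes)
```

Passing a real path in place of io.StringIO(raw) works the same way.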

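Writing

df.to_csv("output.csv", index=False, encoding="utf-8-sig", sep=";")

Core options:
- index=False: do not write the row index
- encoding="utf-8-sig": writes a UTF-8 byte order mark so that Excel displays Chinese text correctly instead of garbled characters
- sep=";": field delimiter for the output (the default is ",")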
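To see what utf-8-sig actually changes, the sketch below writes a small DataFrame to a temporary file and inspects the first bytes. The data and the temporary path are illustrative assumptions; only the to_csv arguments come from the example above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data with Chinese text in both the header and the values.
df = pd.DataFrame({"城市": ["北京", "上海"], "价格": [12.5, 8.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "output.csv"
    df.to_csv(path, index=False, encoding="utf-8-sig", sep=";")

    raw_bytes = path.read_bytes()
    # Read the file back with the same delimiter and encoding.
    df_back = pd.read_csv(path, sep=";", encoding="utf-8-sig")

# utf-8-sig prepends the three-byte UTF-8 byte order mark.
has_bom = raw_bytes.startswith(b"\xef\xbb\xbf")
print(has_bom)
print(df_back)
```

Reading the file back with encoding="utf-8-sig" strips the mark again, so the first column name comes back clean.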

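2. Excel files

Reading

# Read a single worksheet
df = pd.read_excel("data.xlsx", sheet_name="Sheet1", engine="openpyxl")
# Read every worksheet (returns a dict that maps each sheet name to a DataFrame)
all_sheets = pd.read_excel("data.xlsx", sheet_name=None)

Dependencies: openpyxl for .xlsx files, or xlrd for legacy .xls files
Parameters: skipfooter (number of rows to skip at the end), usecols (columns to read)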

Writing

with pd.ExcelWriter("output.xlsx", engine="openpyxl") as writer:
    df1.to_excel(writer, sheet_name="Result1", index=False)
    df2.to_excel(writer, sheet_name="Result2")

Multiple sheets: write several DataFrames to one workbook through ExcelWriter
Formatting: cell styles can be added with openpyxl

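3. Text files (TXT/TSV)

Reading

# Raw string, so the backslashes reach the regex engine unchanged
df = pd.read_table("log.txt", sep=r"\|\|\|", engine="python", comment="#")

Special handling: a multi-character delimiter is interpreted as a regular expression, so regex metacharacters such as | must be escaped, and the python engine is required
Skipping comments: comment="#" ignores everything from # to the end of the line

Writing

df.to_csv("output.txt", sep="|", index=False)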
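A small sketch of the regex delimiter and comment handling, run against an in-memory string. The log content and column names are invented for the example.

```python
import io

import pandas as pd

# Hypothetical log text: fields separated by "|||", with comment lines.
raw = (
    "# exported log\n"
    "time|||level|||message\n"
    "09:00|||INFO|||service started\n"
    "# maintenance window\n"
    "09:05|||WARN|||disk usage high\n"
)

df = pd.read_table(
    io.StringIO(raw),
    sep=r"\|\|\|",    # regex: three literal pipe characters
    engine="python",  # regex separators need the python engine
    comment="#",      # drop everything after "#", including whole comment lines
)

print(df)
```

The first non-comment line is taken as the header, so the two comment lines do not show up as rows.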

4. Other formats

JSON:

df = pd.read_json("data.json", orient="records")
df.to_json("output.json", orient="split")

SQL databases:

from sqlalchemy import create_engine
engine = create_engine("sqlite:///data.db")
# "table" is a reserved word in SQL, so the placeholder table is named my_table here
df = pd.read_sql("SELECT * FROM my_table", engine)
df.to_sql("new_table", engine, if_exists="replace")

HDF5: suited to large scientific data sets (pd.HDFStore)
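The orient argument decides the JSON layout. The sketch below serializes one made-up DataFrame both ways and reads each string back, which may make the difference easier to see.

```python
import io
import json

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"city": ["Beijing", "Shanghai"], "price": [12.5, 8.0]})

# orient="records": a list with one object per row.
records_json = df.to_json(orient="records")
# orient="split": separate "columns", "index" and "data" entries.
split_json = df.to_json(orient="split")

records = json.loads(records_json)
split_keys = sorted(json.loads(split_json))

# Read both layouts back; the orient must match the one used for writing.
df_records = pd.read_json(io.StringIO(records_json), orient="records")
df_split = pd.read_json(io.StringIO(split_json), orient="split")

print(records_json)
print(split_json)
```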
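For a quick experiment without SQLAlchemy, pandas also accepts a connection from the standard-library sqlite3 module. The sketch below swaps in that approach with an in-memory database; the table name, columns, and query are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"id": [1, 2, 3], "city": ["Beijing", "Shanghai", "Shenzhen"]}
)

# An in-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
try:
    # index=False keeps the DataFrame index out of the table.
    df.to_sql("cities", conn, if_exists="replace", index=False)

    # A parameterized query; "?" is the sqlite3 placeholder style.
    result = pd.read_sql(
        "SELECT id, city FROM cities WHERE id >= ?", conn, params=(2,)
    )
finally:
    conn.close()

print(result)
```

For databases other than SQLite, an SQLAlchemy engine as in the snippet above is the usual route.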

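II. General Tips and Caveats

Encoding
- Garbled Chinese in CSV files: read with encoding="gbk" when the file was saved in GBK (common for files exported on Chinese-language Windows); write with encoding="utf-8-sig"
- Excel files: no encoding needs to be specified, but the engine must match the file type
Performance
- Reading large files in chunks: chunksize=10000 returns an iterator that yields 10,000 rows at a time
- Reducing memory: dtype={"column_name": "category"} stores a column with few distinct values as a categorical
Data cleaning on load
- Missing values: na_values=["NA", "null"] lists strings to treat as missing
- Type conversion: converters={"phone": str} keeps values such as phone numbers as strings, so leading zeros are not lost
Index and column control
- Setting the index column: index_col="ID"
- Renaming columns: pd.read_csv(..., names=["new_name_1", "new_name_2"]); add header=0 if the file already has a header row that should be replaced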
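Several of these tips can be tried on a tiny in-memory CSV. The sketch below is only an illustration: the column names and values are invented, and chunksize=2 stands in for the much larger chunk size a real file would use.

```python
import io

import pandas as pd

# Hypothetical CSV content; note the leading zeros in the phone column.
raw = (
    "ID,city,phone,score\n"
    "1,Beijing,01012345678,90\n"
    "2,Shanghai,02187654321,NA\n"
    "3,Beijing,01000000000,null\n"
    "4,Shenzhen,075512345678,70\n"
)

# One-shot read with index, category dtype, string converter and NA markers.
df = pd.read_csv(
    io.StringIO(raw),
    index_col="ID",
    dtype={"city": "category"},
    converters={"phone": str},
    na_values=["NA", "null"],
)

# Chunked read: process two rows at a time and keep only running totals.
total = 0.0
n_rows = 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=2, na_values=["NA", "null"]):
    total += chunk["score"].sum()  # sum() skips missing values
    n_rows += len(chunk)

print(df)
print(total, n_rows)
```

Only one chunk is held in memory at a time, which is what makes this pattern useful for files larger than RAM.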

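III. Best-Practice Scenarios

Cross-platform data exchange
- Prefer CSV, which nearly every tool can read
- Use Excel when the data needs workbook features such as multiple sheets or formulas
Large data sets
- Chunked reading (chunksize) combined with parallel processing (the dask library)
- The Parquet format (pd.read_parquet) for faster I/O
Automated report generation
- Generate document-ready tables with to_markdown()
- Style Excel output with df.style (for example, highlighting outliers)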

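IV. Troubleshooting

ParserError: check that the delimiter matches the file (for example, a full-width Chinese comma must be given explicitly as sep="，")
Out of memory: load only the columns you need with pd.read_csv(..., usecols=["key_column"])
Date parsing errors: pass parse_dates=["date_column"], optionally with a custom date_parser function (deprecated since pandas 2.0 in favor of date_format)

The methods above cover most everyday read and write tasks. In practice, tune the parameters to the data at hand (for example, text containing special characters may need escaping). For more advanced operations, see the official pandas documentation or extension libraries (such as pandas-ta).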
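The delimiter, usecols, and parse_dates fixes can be checked together on a small sample. In the sketch below the data is made up, and engine="python" is passed explicitly on the assumption that the default C engine would otherwise fall back to it with a warning for a multi-byte delimiter.

```python
import io

import pandas as pd

# Hypothetical file content that uses the full-width comma "，" as delimiter.
raw = "日期，城市，价格\n2024/01/05，北京，12.5\n2024/02/06，上海，8\n"

# With the default sep=",", nothing is split: every line becomes one field.
df_wrong = pd.read_csv(io.StringIO(raw))

# Naming the delimiter explicitly, loading two columns, and parsing the dates.
df = pd.read_csv(
    io.StringIO(raw),
    sep="，",
    engine="python",
    usecols=["日期", "价格"],
    parse_dates=["日期"],
)

print(df_wrong.shape)
print(df.dtypes)
```

In this sample the wrong delimiter does not raise an error; it silently yields a single column, which is worth checking for when a frame looks narrower than expected.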

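V. Hands-On Example

'''
Commonly used functions for reading data
'''
import pandas as pd
'''
	./	the current directory; the prefix can be omitted, since a bare relative path is also resolved against the current directory
		for example, ./data/ and data/ both refer to the data folder inside the current directory
	../	the parent directory
	/	the root directory
		used on Linux systems
	~	the current user's home directory
		for example, for the Windows user Dongze this is 'C:\\Users\\Dongz'
'''
# Read a CSV file; returns a DataFrame
data = pd.read_csv("data_dir/xxx.csv")
# A URL can also be passed instead of a local path
pd.read_csv("http://localhost/xxx.csv")
# Read an Excel file; returns a DataFrame
data = pd.read_excel("data_dir/xxx.xlsx")

If the data set is large, pandas abbreviates the printed output and omits the middle rows and columns, as shown in the figure below:

We can change the display options so that the omitted data is shown.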


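'''
    Configure how DataFrames are displayed
'''
# Show all rows of a DataFrame
pd.set_option('display.max_rows', None)
# Show all columns of a DataFrame (None means no limit; a number can be given instead)
pd.set_option('display.max_columns', None)
# Maximum display width of each column's contents, in characters (default 50)
pd.set_option('display.max_colwidth', 200)
# Keep wide DataFrames on one line instead of wrapping them (False disables wrapping, True enables it)
pd.set_option('display.expand_frame_repr', False)

With these settings, all of the data is displayed.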
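pd.set_option changes the setting for the rest of the session. When the full output is needed only once, pd.option_context may be a better fit, because it restores the previous values on exit. A minimal sketch with a made-up 100-row DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame that is long enough to be truncated by default.
df = pd.DataFrame({"n": range(100)})

rows_before = pd.get_option("display.max_rows")

# Default display: the middle rows are elided and a size footer is added.
default_repr = repr(df)

# Temporarily lift the limits; they are restored when the block ends.
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full_repr = repr(df)

rows_after = pd.get_option("display.max_rows")

print(len(default_repr.splitlines()), len(full_repr.splitlines()))
print(rows_before == rows_after)
```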

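VI. Official API for Reading and Writing Files

Parameters of read_csv as listed in the official documentation. The signature below reflects the pandas 1.x release available when this post was first written; squeeze, prefix, mangle_dupe_cols, error_bad_lines, and warn_bad_lines were removed in pandas 2.0.
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

pandas.read_csv(
	# File path: the only required argument; all other parameters are optional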
	filepath_or_buffer, 
	sep=NoDefault.no_default, 
	delimiter=None, 
	header='infer', 
	names=NoDefault.no_default, 
	index_col=None, 
	usecols=None, 
	squeeze=False, 
	prefix=NoDefault.no_default, 
	mangle_dupe_cols=True, 
	dtype=None, 
	engine=None, 
	converters=None, 
	true_values=None, 
	false_values=None, 
	skipinitialspace=False, 
	skiprows=None, 
	skipfooter=0, 
	nrows=None, 
	na_values=None, 
	keep_default_na=True, 
	na_filter=True, 
	verbose=False, 
	skip_blank_lines=True, 
	parse_dates=False, 
	infer_datetime_format=False, 
	keep_date_col=False, 
	date_parser=None, 
	dayfirst=False, 
	cache_dates=True, 
	iterator=False, 
	chunksize=None, 
	compression='infer', 
	thousands=None, 
	decimal='.', 
	lineterminator=None, 
	quotechar='"', 
	quoting=0, 
	doublequote=True, 
	escapechar=None, 
	comment=None, 
	encoding=None, 
	encoding_errors='strict', 
	dialect=None, 
	error_bad_lines=None, 
	warn_bad_lines=None, 
	on_bad_lines=None, 
	delim_whitespace=False, 
	low_memory=True, 
	memory_map=False, 
	float_precision=None, 
	storage_options=None)
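Among these parameters, on_bad_lines controls what happens when a row has more fields than the header. The sketch below, on made-up data, contrasts the default behavior (an error) with on_bad_lines="skip"; it assumes pandas 1.3 or later, where this parameter is available.

```python
import io

import pandas as pd

# Hypothetical CSV whose second data row has one field too many.
raw = "a,b\n1,2\n3,4,5\n6,7\n"

# Default behavior: the malformed row raises ParserError.
try:
    pd.read_csv(io.StringIO(raw))
    raised = False
except pd.errors.ParserError:
    raised = True

# on_bad_lines="skip" drops the malformed row and keeps the rest.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

print(raised)
print(df)
```

Skipping rows hides data problems, so it is usually worth counting how many rows were dropped before relying on the result.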

#Input/output
#Pickling
#Read a pickle file
read_pickle(filepath_or_buffer[, ...])
#Load pickled pandas object (or any object) from file.
#Write a pickle file
DataFrame.to_pickle(path[, compression, ...])
#Pickle (serialize) object to file.

#Flat file
read_table(filepath_or_buffer[, sep, ...])
#Read general delimited file into DataFrame.
read_csv(filepath_or_buffer[, sep, ...])
#Read a comma-separated values (csv) file into DataFrame.
DataFrame.to_csv([path_or_buf, sep, na_rep, ...])
#Write object to a comma-separated values (csv) file.
read_fwf(filepath_or_buffer[, colspecs, ...])
#Read a table of fixed-width formatted lines into DataFrame.

#Clipboard
read_clipboard([sep])
#Read text from clipboard and pass to read_csv.
DataFrame.to_clipboard([excel, sep])
#Copy object to the system clipboard.

#Excel
read_excel(io[, sheet_name, header, names, ...])
#Read an Excel file into a pandas DataFrame.
DataFrame.to_excel(excel_writer[, ...])
#Write object to an Excel sheet.
ExcelFile.parse([sheet_name, header, names, ...])
#Parse specified sheet(s) into a DataFrame.
Styler.to_excel(excel_writer[, sheet_name, ...])
#Write Styler to an Excel sheet.
ExcelWriter(path[, engine, date_format, ...])
#Class for writing DataFrame objects into excel sheets.

#JSON
read_json([path_or_buf, orient, typ, dtype, ...])
#Convert a JSON string to pandas object.
to_json(path_or_buf, obj[, orient, ...])
build_table_schema(data[, index, ...])
#Create a Table schema from data.

#HTML
read_html(io[, match, flavor, header, ...])
#Read HTML tables into a list of DataFrame objects.
DataFrame.to_html([buf, columns, col_space, ...])
#Render a DataFrame as an HTML table.
Styler.to_html([buf, table_uuid, ...])
#Write Styler to a file, buffer or string in HTML-CSS format.

#XML
read_xml(path_or_buffer[, xpath, ...])
#Read XML document into a DataFrame object.
DataFrame.to_xml([path_or_buffer, index, ...])
#Render a DataFrame to an XML document.

#Latex
DataFrame.to_latex([buf, columns, ...])
#Render object to a LaTeX tabular, longtable, or nested table/tabular.
Styler.to_latex([buf, column_format, ...])
#Write Styler to a file, buffer or string in LaTeX format.

#HDFStore: PyTables (HDF5)
read_hdf(path_or_buf[, key, mode, errors, ...])
#Read from the store, close it if we opened it.
HDFStore.put(key, value[, format, index, ...])
#Store object in HDFStore.
HDFStore.append(key, value[, format, axes, ...])
#Append to Table in file.
HDFStore.get(key)
#Retrieve pandas object stored in file.
HDFStore.select(key[, where, start, stop, ...])
#Retrieve pandas object stored in file, optionally based on where criteria.
HDFStore.info()
#Print detailed information on the store.
HDFStore.keys([include])
#Return a list of keys corresponding to objects stored in HDFStore.
HDFStore.groups()
#Return a list of all the top-level nodes.
HDFStore.walk([where])
#Walk the pytables group hierarchy for pandas objects.