Python File I/O: Persisting LLM Corpora and Conversation History

Originally published 2025-12-30 10:01:46

License: CC 4.0 BY-SA

(Column: Python from True Zero to Plain-Text LLM Full-Stack Practice, Part 7 | Length: 10,000 characters | Beginner-friendly | Closely tied to LLM scenarios | Runnable code)

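Opening: The LLM "Memory" Pain Point

Have you ever run into any of these situations?

You write an LLM chat script, and once you close the terminal the entire conversation history is gone
You scrape 1,000 e-commerce reviews as an LLM training corpus, the program crashes, and all of the scraped data is lost
You call an LLM to generate product copy but never save it, so you have to generate it again whenever you need it

All of these problems come down to the same cause: the data was never persisted. Data a program holds at runtime lives only in memory and is lost when the program exits or crashes. Python file I/O is the core technique for persisting data in LLM development. It lets you:

Persist LLM conversation history: let the AI remember earlier parts of the chat
Manage LLM training corpora: read, write, and modify corpus files
Save LLM output: store generated copy, summaries, and similar results locally
Configure an LLM system: read API keys, model settings, and other configuration from files

Starting from LLM development scenarios, this article walks through the core techniques of Python file I/O and provides hands-on code for real needs such as conversation-history persistence and corpus management.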

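1. Core Concepts: File I/O Basics

1.1 What Is File I/O?

File I/O (Input/Output) is the exchange of data between a program and external files. It covers:

Input: reading data from an external file into program memory
Output: writing data from program memory to an external file

1.2 File Types

In LLM development we mainly work with two kinds of files:

Text files: data stored as plain text, such as .txt, .csv, .json, and .md
Binary files: data stored as raw bytes, such as .jpg, .png, .pdf, and .model

1.3 Opening and Closing Files

In Python, you open a file with the built-in open() function and close it by calling the close() method on the returned file object.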

# Open the file
file = open("test.txt", "r", encoding="utf-8")
# Read the file contents
content = file.read()
# Close the file
file.close()
print(content)

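Note: always close the files you open. An unclosed file keeps its file handle open, and data still sitting in the write buffer may never reach disk, which can leave you with a truncated or incomplete file.

1.4 The `with` Statement: Closing Files Automatically

So that you never forget to close a file, Python provides the with statement, which closes the file automatically when the block ends, even if an exception is raised inside it.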

# The with statement closes the file automatically
with open("test.txt", "r", encoding="utf-8") as file:
    content = file.read()
print(content)

Recommendation: in LLM development, always use the with statement for file I/O.
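To make the effect of with concrete, here is a minimal sketch that checks the file object's closed attribute after a normal block and after a block that raises. The file name demo.txt and the temporary directory are assumptions chosen only for illustration.

```python
import os
import tempfile

# Work in a throwaway directory so the sketch does not touch real files.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

with open(path, "w", encoding="utf-8") as file:
    file.write("hello")
closed_after_with = file.closed  # the with block has already closed the file

try:
    with open(path, "r", encoding="utf-8") as file:
        raise RuntimeError("simulated failure while reading")
except RuntimeError:
    pass
closed_after_error = file.closed  # closed even though the block raised

print(closed_after_with, closed_after_error)
```

Both flags come out as True, which is the guarantee that a manual close() call cannot give you when an exception skips over it.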

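2. Core Operations: Reading and Writing Text Files

In LLM development, the vast majority of file operations involve text files. The core read and write operations are covered below.

2.1 File Open Modes

The second argument to open() is the mode, which determines how the file can be used:

Mode	Meaning	Typical use
`r`	Read-only (default); the file must exist	Reading LLM corpora and configuration files
`w`	Write; creates the file or truncates existing content	Saving LLM output
`a`	Append; writes go to the end of the file	Saving LLM conversation history
`r+`	Read and write from the start of the file; the file must exist	Modifying an LLM configuration file
`w+`	Read and write; creates the file or truncates existing content	Creating a new file and then reading it back
`a+`	Read and write; writes always go to the end of the file	Reading and writing LLM log files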
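The difference between the modes is easiest to see by applying several of them to the same file in sequence. The sketch below does that in a temporary directory; the file name modes.txt is a made-up example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical file

with open(path, "w", encoding="utf-8") as f:   # "w" creates the file
    f.write("first\n")
with open(path, "w", encoding="utf-8") as f:   # "w" again: the old content is discarded
    f.write("second\n")
with open(path, "a", encoding="utf-8") as f:   # "a" keeps the content and writes at the end
    f.write("third\n")
with open(path, "r+", encoding="utf-8") as f:  # "r+" starts at the beginning without truncating
    f.write("SECOND")                           # overwrites the first six characters in place
with open(path, "r", encoding="utf-8") as f:
    final_text = f.read()

print(repr(final_text))
```

The final content is "SECOND\nthird\n": the first write was lost to the second "w", the append survived, and "r+" replaced characters in place rather than inserting them.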

2.2 Reading File Contents

2.2.1 `read()`: Read the Whole File

# Read an entire LLM corpus file
with open("llm_corpus.txt", "r", encoding="utf-8") as file:
    corpus = file.read()
print(f"Corpus content: {corpus}")
print(f"Corpus length (characters): {len(corpus)}")

2.2.2 `readline()`: Read One Line

# Read an LLM corpus file line by line (suits large files and keeps memory use low)
with open("llm_corpus.txt", "r", encoding="utf-8") as file:
    line = file.readline()
    while line:
        print(f"Current line: {line.strip()}")
        line = file.readline()

2.2.3 `readlines()`: Read All Lines into a List

# Read every line into a list (suits small and medium files)
with open("llm_corpus.txt", "r", encoding="utf-8") as file:
    lines = file.readlines()
# Drop blank lines and strip surrounding whitespace
clean_lines = [line.strip() for line in lines if line.strip()]
print(f"Cleaned corpus: {clean_lines}")

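2.3 Writing File Contents

2.3.1 `write()`: Write a String

# Write copy generated by the LLM
with open("llm_result.txt", "w", encoding="utf-8") as file:
    file.write("LLM-generated e-commerce copy: iPhone 15 case, 29.9 yuan, shockproof, durable, and great-looking!")
print("Copy saved!")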

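2.3.2 `writelines()`: Write a List of Strings

# Write a batch of LLM-generated copy
# writelines() does not add line breaks, so each item ends with "\n"
copywritings = [
    "Copy 1: iPhone 15 case, 29.9 yuan, shockproof and durable\n",
    "Copy 2: Huawei Mate60 case, 39.9 yuan, top-tier looks\n",
    "Copy 3: Xiaomi 14 case, 19.9 yuan, thin and breathable\n"
]
with open("llm_copywritings.txt", "w", encoding="utf-8") as file:
    file.writelines(copywritings)
print("Batch copy saved!")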

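2.3.3 `a` Mode: Appending Content

# Append a newly generated piece of copy
# The existing file already ends with "\n", so no leading line break is needed
new_copy = "Copy 4: OPPO Find X7 case, 29.9 yuan, supports wireless charging\n"
with open("llm_copywritings.txt", "a", encoding="utf-8") as file:
    file.write(new_copy)
print("New copy appended!")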
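Append mode pairs naturally with formats that store one record per line. The sketch below uses JSON Lines (one JSON object per line) for a conversation log. This is an alternative the article itself does not use, and the helper names append_message and load_messages as well as the file name chat_log.jsonl are assumptions for illustration.

```python
import json
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "chat_log.jsonl")  # hypothetical file name

def append_message(path, role, content):
    """Append one message as a single JSON line; earlier lines are never rewritten."""
    record = {"role": role, "content": content}
    with open(path, "a", encoding="utf-8") as file:
        file.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_messages(path):
    """Read the log back, one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]

append_message(log_path, "user", "What is an LLM?")
append_message(log_path, "assistant", "LLM stands for large language model.")
messages = load_messages(log_path)
print(messages)
```

Because each call only adds a line, the cost of saving a message stays constant as the history grows, whereas rewriting a single JSON array costs more with every message.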

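3. Advanced Operations: Working with Common File Formats

Besides .txt files, LLM development frequently involves structured formats such as JSON, CSV, and Markdown.

3.1 JSON Files: Configuration and Structured Data

JSON is a lightweight structured data format, well suited to storing LLM configuration, conversation history, API responses, and similar data.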

3.1.1 Reading a JSON File

import json

# Read the LLM configuration file (config.json)
# config.json contents: {"model": "gpt-3.5-turbo", "api_key": "sk-xxx", "temperature": 0.7}
with open("config.json", "r", encoding="utf-8") as file:
    config = json.load(file)

print(f"Model name: {config['model']}")
# Show only a prefix of the key so the full secret never ends up in logs
print(f"API key: {config['api_key'][:3]}...")
print(f"Temperature: {config['temperature']}")
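Indexing the loaded dictionary directly raises KeyError when a setting is missing, and json.load raises json.JSONDecodeError when the file is malformed. One possible way to handle both is sketched below. The load_config helper and the DEFAULTS values are assumptions for illustration, not part of the original project.

```python
import json
import os
import tempfile

DEFAULTS = {"model": "gpt-3.5-turbo", "temperature": 0.7}  # assumed default settings

def load_config(path):
    """Load a JSON config, fill in defaults, and fail with a clear message on bad JSON."""
    try:
        with open(path, "r", encoding="utf-8") as file:
            loaded = json.load(file)
    except FileNotFoundError:
        loaded = {}  # no config file yet: fall back to defaults only
    except json.JSONDecodeError as exc:
        raise ValueError(f"{path} is not valid JSON: {exc}") from exc
    return {**DEFAULTS, **loaded}  # values from the file override the defaults

config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w", encoding="utf-8") as file:
    json.dump({"temperature": 0.2}, file)

config = load_config(config_path)
missing = load_config(config_path + ".missing")
print(config, missing)
```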

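3.1.2 Writing a JSON File

import json

# Write LLM conversation history to a JSON file
chat_history = [
    {"role": "user", "content": "What is an LLM?"},
    {"role": "assistant", "content": "LLM stands for large language model"}
]
with open("chat_history.json", "w", encoding="utf-8") as file:
    # ensure_ascii=False keeps non-ASCII text (such as Chinese) readable in the file
    json.dump(chat_history, file, ensure_ascii=False, indent=2)
print("Conversation history saved!")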

3.1.3 LLM Scenario: Persisting Session History

import json

class SessionManager:
    def __init__(self, session_file):
        self.session_file = session_file
        # Load any existing session history
        try:
            with open(self.session_file, "r", encoding="utf-8") as file:
                self.history = json.load(file)
        except FileNotFoundError:
            # First run: there is no history file yet
            self.history = []

    def add_message(self, role, content):
        # Append the new message
        self.history.append({"role": role, "content": content})
        # Rewrite the whole file so it always holds one valid JSON array
        with open(self.session_file, "w", encoding="utf-8") as file:
            json.dump(self.history, file, ensure_ascii=False, indent=2)

    def get_history(self):
        return self.history

# Usage example
session_manager = SessionManager("chat_history.json")
session_manager.add_message("user", "What is an LLM?")
session_manager.add_message("assistant", "LLM stands for large language model")
print("Session history:", session_manager.get_history())
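Opening the history file in "w" mode truncates it before the new content is written, so a crash in the middle of a save can leave a half-written file. A common mitigation, sketched here under the assumption that the temporary file and the target sit on the same filesystem, is to write to a temporary file and then swap it in with os.replace. The function names are hypothetical.

```python
import json
import os
import tempfile

def save_history_atomically(path, history):
    """Write to a temporary file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as file:
            json.dump(history, file, ensure_ascii=False, indent=2)
        os.replace(tmp_path, path)  # replaces the old file in a single step
    except BaseException:
        # Clean up the temporary file if anything went wrong before the swap
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_history(path):
    """Return the saved history, or an empty list if the file is missing or corrupt."""
    try:
        with open(path, "r", encoding="utf-8") as file:
            return json.load(file)
    except (FileNotFoundError, json.JSONDecodeError):
        return []

history_path = os.path.join(tempfile.mkdtemp(), "chat_history.json")
save_history_atomically(history_path, [{"role": "user", "content": "What is an LLM?"}])
restored = load_history(history_path)
print(restored)
```

With this approach a reader sees either the complete old file or the complete new one. Treating a corrupt file as empty history is a design choice that trades data recovery for robustness, and it may not suit every application.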

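3.2 CSV Files: Managing Corpora in Bulk

CSV (comma-separated values) files are well suited to storing bulk LLM corpora, training data, and other tabular data.

3.2.1 Reading a CSV File

import csv

# Read an LLM corpus CSV file (corpus.csv)
# corpus.csv contents: id,content,label
# 1,The iPhone 15 case is very well made,positive
# 2,The Huawei Mate60 case is junk,negative
# newline="" lets the csv module handle line endings itself, as its documentation recommends
with open("corpus.csv", "r", encoding="utf-8", newline="") as file:
    reader = csv.DictReader(file)
    corpus = [row for row in reader]

print(f"Number of corpus entries: {len(corpus)}")
print(f"First entry: {corpus[0]}")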
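It is tempting to parse CSV with line.split(","), but review text often contains commas of its own. The short sketch below compares the two approaches on an in-memory sample (the sample rows are made up) to show why the csv module is the safer choice.

```python
import csv
import io

# In-memory stand-in for corpus.csv; the second review contains a comma inside quotes.
sample = 'id,content,label\n1,Great case,positive\n2,"Cheap, but it cracked",negative\n'

# Naive splitting cuts the quoted review into two pieces.
naive_fields = sample.splitlines()[2].split(",")

# csv.DictReader understands the quoting and keeps the review intact.
rows = list(csv.DictReader(io.StringIO(sample)))

print(len(naive_fields))
print(rows[1]["content"])
```

The naive split yields four fields for a three-column row, while DictReader returns the review as the single value "Cheap, but it cracked".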

3.2.2 Writing a CSV File

import csv

# Write a batch of LLM results to a CSV file
results = [
    {"prompt": "Write copy for an iPhone 15 case", "result": "29.9 yuan, shockproof and durable", "model": "gpt-3.5-turbo"},
    {"prompt": "Write copy for a Huawei Mate60 case", "result": "39.9 yuan, top-tier looks", "model": "wenxin-3.5"}
]

with open("llm_results.csv", "w", encoding="utf-8", newline="") as file:
    fieldnames = ["prompt", "result", "model"]
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()  # Write the header row
    writer.writerows(results)  # Write the data rows

print("Batch results saved!")

3.2.3 LLM Scenario: Cleaning and Saving a Corpus

import csv

# Clean scraped e-commerce review data
raw_corpus = [
    "★★★★★ The iPhone 15 case is very well made\n",
    "[Neutral] Shipping was a bit slow\n",
    "★★★ The material is average\n",
    "[Positive] Supports wireless charging\n"
]

# Clean the corpus
clean_corpus = []
for i, line in enumerate(raw_corpus):
    cleaned = line.replace("★", "").replace("[Positive]", "").replace("[Neutral]", "").strip()
    if cleaned:
        # rating = number of stars on the line (0 when the review carries a text tag instead)
        clean_corpus.append({"id": i+1, "content": cleaned, "rating": line.count("★")})

# Save to a CSV file
with open("clean_corpus.csv", "w", encoding="utf-8", newline="") as file:
    fieldnames = ["id", "content", "rating"]
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(clean_corpus)

print("Cleaned corpus saved!")

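3.3 Markdown Files: Documentation and Prompt Templates

Markdown is a lightweight markup language, well suited to storing LLM prompt templates, technical documentation, and similar text.

3.3.1 Reading a Markdown File

# Read the LLM prompt-template Markdown file (prompt_templates.md)
# Assumed layout: each template is a "## <name> template" heading followed by its prompt text
with open("prompt_templates.md", "r", encoding="utf-8") as file:
    content = file.read()

# Extract the prompt templates
prompt_templates = {}
for section in content.split("## "):
    # The first line of each section is the heading; the rest is the prompt body
    name, _, prompt = section.partition("\n")
    if "template" in name.lower() and prompt.strip():
        prompt_templates[name.strip()] = prompt.strip()

print(f"Prompt templates: {list(prompt_templates.keys())}")
# Assumes the file contains a section headed "Customer service template"
print(f"Customer service template: {prompt_templates['Customer service template']}")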
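Splitting on the string "## " also matches "### " subheadings and any "## " that appears inside a prompt body. A line-based parser that only treats lines starting with "## " as headings is a little more robust. The following is a minimal sketch; the function name and the sample templates are assumptions.

```python
def parse_prompt_templates(markdown_text):
    """Map each '## ' heading to the text that follows it, up to the next '## ' heading."""
    templates = {}
    current_name = None
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            current_name = line[3:].strip()
            templates[current_name] = []
        elif current_name is not None:
            templates[current_name].append(line)
    # Join the collected lines and trim surrounding blank lines
    return {name: "\n".join(body).strip() for name, body in templates.items()}

sample_md = """# Prompt templates

## Customer service template
You are an e-commerce customer service agent.
Answer questions about {product}.

## General template
You are a professional AI assistant.
"""

templates = parse_prompt_templates(sample_md)
print(list(templates))
```

Text before the first "## " heading, such as the document title, is ignored because no section is active yet.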

3.3.2 Writing a Markdown File

# Write LLM-generated technical documentation to a Markdown file
doc_content = """# LLM Development Guide
## 1. Environment Setup
Install Python 3.11 or later

## 2. Installing Dependencies
pip install openai transformers python-dotenv
"""

with open("llm_guide.md", "w", encoding="utf-8") as file:
    file.write(doc_content)
print("Documentation saved!")

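4. LLM Scenario in Practice: A Complete File I/O Application

4.1 Requirements

Build an LLM chat system with the following features:

Multi-turn conversation
Session history saved automatically to a JSON file
Prompt templates loaded from a JSON file
Conversation history exportable as a Markdown file

4.2 Project Structure

llm_chat_system/
├── config.json  # LLM configuration file
├── prompt_templates.json  # Prompt template file
├── session.py  # Session management module
└── main.py  # Main program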

4.3 Core Code

4.3.1 `config.json`

{
  "model": "gpt-3.5-turbo",
  "api_key": "sk-xxx",
  "temperature": 0.7
}

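4.3.2 `prompt_templates.json`

{
  "customer_service": "You are an e-commerce customer service agent. Answer this question about {product}: {question}",
  "general": "You are a professional AI assistant. Answer the following question: {question}"
}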
}
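The {product} and {question} placeholders in these templates are filled in with str.format. The sketch below, which defines its own small templates dictionary for illustration, shows a successful fill and what happens when a placeholder is left without a value.

```python
# Stand-in for the parsed contents of prompt_templates.json
templates = {
    "customer_service": "You are an e-commerce customer service agent. "
                        "Answer this question about {product}: {question}",
    "general": "You are a professional AI assistant. Answer the following question: {question}",
}

# Every placeholder named in the template must be supplied as a keyword argument.
filled = templates["customer_service"].format(product="phone case", question="Is it shockproof?")

# Leaving one out raises KeyError with the missing placeholder name.
try:
    templates["customer_service"].format(question="Is it shockproof?")
    missing_field = None
except KeyError as exc:
    missing_field = exc.args[0]

print(filled)
print(missing_field)
```

Catching the KeyError at the call site gives the caller a chance to report which field was forgotten instead of crashing mid-conversation.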

4.3.3 `session.py`

import json
import datetime

class LLMSession:
    def __init__(self, config_file, prompt_file, session_file="session_history.json"):
        self.config_file = config_file
        self.prompt_file = prompt_file
        self.session_file = session_file
        
        # Load configuration
        with open(self.config_file, "r", encoding="utf-8") as file:
            self.config = json.load(file)

        # Load prompt templates
        with open(self.prompt_file, "r", encoding="utf-8") as file:
            self.prompt_templates = json.load(file)

        # Load session history (start empty on the first run)
        try:
            with open(self.session_file, "r", encoding="utf-8") as file:
                self.history = json.load(file)
        except FileNotFoundError:
            self.history = []

    def generate_prompt(self, template_name, **kwargs):
        """Build a prompt from a named template."""
        if template_name not in self.prompt_templates:
            raise ValueError(f"Prompt template {template_name} does not exist")
        return self.prompt_templates[template_name].format(**kwargs)

    def add_message(self, role, content):
        """Add a message to the session history and save it."""
        message = {
            "role": role,
            "content": content,
            "timestamp": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        }
        self.history.append(message)

        # Save to file
        with open(self.session_file, "w", encoding="utf-8") as file:
            json.dump(self.history, file, ensure_ascii=False, indent=2)

    def export_to_markdown(self, export_file="chat_history.md"):
        """Export the session history to a Markdown file."""
        md_content = f"# LLM Conversation History\nGenerated at: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n"

        for message in self.history:
            if message["role"] == "user":
                md_content += f"## User ({message['timestamp']})\n{message['content']}\n\n"
            elif message["role"] == "assistant":
                md_content += f"## AI ({message['timestamp']})\n{message['content']}\n\n"

        with open(export_file, "w", encoding="utf-8") as file:
            file.write(md_content)
        
        return export_file
    
    def get_config(self):
        return self.config
    
    def get_history(self):
        return self.history

4.3.4 `main.py`

from session import LLMSession

# Initialize the session
session = LLMSession("config.json", "prompt_templates.json")

# Simulate a multi-turn conversation
while True:
    user_input = input("User: ")
    if user_input == "exit":
        break

    # Build the prompt (a real system would send this to the model)
    prompt = session.generate_prompt("general", question=user_input)

    # Simulate the LLM's answer
    ai_response = f"AI answer: {user_input} → this is a simulated LLM answer"

    # Save both turns to the session history
    session.add_message("user", user_input)
    session.add_message("assistant", ai_response)

    # Print the AI's answer
    print(ai_response)

# Export to a Markdown file
export_file = session.export_to_markdown()
print(f"\nConversation history exported to: {export_file}")

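4.4 Sample Run

User: What is an LLM?
AI answer: What is an LLM? → this is a simulated LLM answer
User: What are LLMs used for?
AI answer: What are LLMs used for? → this is a simulated LLM answer
User: exit

Conversation history exported to: chat_history.md

Contents of the generated session_history.json (first two messages shown):

[
  {
    "role": "user",
    "content": "What is an LLM?",
    "timestamp": "2024-12-15 10:00:00"
  },
  {
    "role": "assistant",
    "content": "AI answer: What is an LLM? → this is a simulated LLM answer",
    "timestamp": "2024-12-15 10:00:00"
  }
]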

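5. Performance: Handling Large Files

LLM development often involves corpus files that run to gigabytes, and reading one into memory in a single call can exhaust the available memory. The techniques below keep memory use bounded.

5.1 Reading Line by Line

Use readline() to read the file one line at a time instead of loading the whole file into memory.

def process_line(line):
    """Placeholder: replace with your own per-line processing."""
    pass

# Process a GB-scale LLM corpus file
with open("large_corpus.txt", "r", encoding="utf-8") as file:
    line = file.readline()
    while line:
        # Handle the current line
        process_line(line)
        line = file.readline()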

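5.2 Using a Generator

Define a generator function that yields the file's content one line at a time, so only the current line is held in memory.

def process_line(line):
    """Placeholder: replace with your own per-line processing."""
    pass

# Generator function: yield the file's content line by line
def read_large_file(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            yield line.strip()

# Process a large file with the generator
for line in read_large_file("large_corpus.txt"):
    process_line(line)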
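Many LLM preprocessing steps work on batches of lines rather than single lines. The generator idea extends to that case with itertools.islice, as in the following sketch. The function name read_in_batches, the batch size, and the sample file are assumptions for illustration.

```python
import os
import tempfile
from itertools import islice

def read_in_batches(file_path, batch_size):
    """Yield lists of up to batch_size stripped lines without loading the whole file."""
    with open(file_path, "r", encoding="utf-8") as file:
        while True:
            # islice pulls at most batch_size lines from the file iterator
            batch = [line.strip() for line in islice(file, batch_size)]
            if not batch:
                return
            yield batch

# Build a small sample corpus of five lines in a temporary directory
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as file:
    file.write("\n".join(f"review {i}" for i in range(5)) + "\n")

batch_sizes = [len(batch) for batch in read_in_batches(corpus_path, 2)]
print(batch_sizes)
```

With five lines and a batch size of two, the batches hold 2, 2, and 1 lines; at no point is more than one batch in memory.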

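5.3 Reading in Chunks

Use read(chunk_size) to read the file in fixed-size chunks. This suits binary files, and text files whose individual lines are too long to read whole.

def process_chunk(chunk):
    """Placeholder: replace with your own per-chunk processing."""
    pass

# Read a large file in chunks
chunk_size = 1024 * 1024  # 1 MB
with open("large_file.bin", "rb") as file:
    chunk = file.read(chunk_size)
    while chunk:
        # Handle the current chunk
        process_chunk(chunk)
        chunk = file.read(chunk_size)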
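A typical use of chunked reading is computing a checksum of a large corpus or model file, since hashlib digests can be updated incrementally. The sketch below hashes a small sample file in 4 KB chunks and compares the result with hashing the same bytes in one call. The function name and sample data are assumptions.

```python
import hashlib
import os
import tempfile

def sha256_of_file(file_path, chunk_size=1024 * 1024):
    """Compute a SHA-256 digest while holding only one chunk in memory at a time."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            digest.update(chunk)
    return digest.hexdigest()

# Write some sample bytes to a temporary file
data = b"llm corpus " * 1000
blob_path = os.path.join(tempfile.mkdtemp(), "large_file.bin")
with open(blob_path, "wb") as file:
    file.write(data)

chunked = sha256_of_file(blob_path, chunk_size=4096)
print(chunked == hashlib.sha256(data).hexdigest())
```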

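6. Pitfalls for Beginners

6.1 Encoding Errors

Problem: reading or writing a file raises UnicodeDecodeError or UnicodeEncodeError. Solution: pass the file's actual encoding to open(), for example encoding="utf-8" or encoding="gbk".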
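When the encoding of scraped files is not known in advance, one pragmatic approach is to try a short list of candidates in order. The sketch below is a heuristic under stated assumptions: the helper name and the candidate list are made up, and a permissive encoding such as gbk can decode some non-GBK bytes into wrong characters without raising, so UTF-8 is tried first.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "gbk")):
    """Try each encoding in order and return (text, encoding_that_worked)."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as file:
                return file.read(), encoding
        except UnicodeDecodeError:
            continue  # this encoding does not fit; try the next one
    raise ValueError(f"none of {encodings} could decode {path}")

# Create a GBK-encoded sample file, as often produced by older Windows tools
gbk_path = os.path.join(tempfile.mkdtemp(), "reviews_gbk.txt")
with open(gbk_path, "w", encoding="gbk") as file:
    file.write("质量很好")

text, used = read_text_with_fallback(gbk_path)
print(used)
```

Here the UTF-8 attempt fails with UnicodeDecodeError and the GBK attempt succeeds. When the true encoding is known, passing it explicitly remains the more reliable option.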

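6.2 File Path Errors

Problem: the program raises FileNotFoundError. Solution:

Check that the file path is correct
Use an absolute path instead of a relative path
Use os.path.abspath() to get the absolute path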
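Relative paths are resolved against the current working directory, which is a frequent cause of FileNotFoundError when a script is launched from another folder. The standard-library pathlib module offers an object-oriented alternative to os.path. Here is a minimal sketch that uses a temporary directory to stand in for the project folder.

```python
import tempfile
from pathlib import Path

base_dir = Path(tempfile.mkdtemp())      # stands in for the project directory
config_path = base_dir / "config.json"   # "/" joins path components portably

exists_before = config_path.exists()     # nothing has been written yet
config_path.write_text('{"model": "gpt-3.5-turbo"}', encoding="utf-8")
exists_after = config_path.exists()

# resolve() gives the absolute path, which is useful in error messages
print(config_path.resolve())
```

In a real script, a base directory such as Path(__file__).resolve().parent is a common way to locate files next to the script regardless of where it is launched from.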

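6.3 Forgetting to Close a File

Problem: file handles leak, and buffered writes may never reach disk, leaving an incomplete file. Solution: always use the with statement for file I/O.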

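6.4 Accidentally Overwriting a File

Problem: opening an existing file in w mode overwrites its contents. Solution:

Back up the file before modifying it
Use a mode to append instead
Use os.path.exists() to check whether the file already exists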
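In addition to the checks above, Python's open() supports an exclusive-creation mode, "x", which fails with FileExistsError instead of overwriting. The sketch below shows it protecting a first draft; the file name is a made-up example.

```python
import os
import tempfile

result_path = os.path.join(tempfile.mkdtemp(), "llm_result.txt")  # hypothetical file

# "x" creates the file and raises FileExistsError if it is already there
with open(result_path, "x", encoding="utf-8") as file:
    file.write("first draft")

try:
    with open(result_path, "x", encoding="utf-8") as file:
        file.write("this would have overwritten the first draft")
    overwritten = True
except FileExistsError:
    overwritten = False  # the existing file was left untouched

with open(result_path, "r", encoding="utf-8") as file:
    kept = file.read()

print(overwritten, kept)
```

Unlike a separate os.path.exists() check followed by open(..., "w"), the existence test and the creation happen in one step, so another process cannot create the file in between.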

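6.5 Running Out of Memory

Problem: the program runs out of memory while processing a large file. Solution: process large files with line-by-line reading, a generator, or chunked reading.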

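7. Summary: How File I/O Maps to LLM Development

File I/O operation	LLM development scenario
Reading text files	Loading LLM corpora, reading prompt templates
Writing text files	Saving LLM output, appending conversation history
Reading and writing JSON	LLM configuration management, session-history persistence
Reading and writing CSV	Bulk corpus management, saving training data
Reading and writing Markdown	Generating LLM documentation, storing prompt templates
Large-file handling	Cleaning and preprocessing GB-scale LLM corpora

Python file I/O is a foundational skill for LLM development. Mastering it addresses the "memory" pain point and lets you manage data persistently. In practice, keep the following in mind:

Always use the with statement for file I/O
Choose a file format that fits the kind of data you are storing
Read large files line by line or in chunks
Pay attention to file encodings and paths

The next article, "Python Regular Expressions: LLM Corpus Cleaning and Text Preprocessing", covers how to use regular expressions to clean and preprocess LLM corpora.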