A Detailed Guide to Memory in LlamaIndex

Originally published 2025-12-01 08:15:00 · 410 views

License: CC 4.0 BY-SA

Tags:

#java #langchain #python #llama #machine-learning #artificial-intelligence #Agent

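Memory is a core component of an agent: it stores and retrieves historical information. In LlamaIndex, you can typically customize memory either by using an existing BaseMemory class or by creating a custom class.

Call memory.put() to store information and memory.get() to retrieve it.

Memory is divided into short-term memory and long-term memory, as described below: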

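I. Short-term memory

By default, the Memory class stores the last X messages that fit within a token limit. You can customize this behavior by passing the token_limit and chat_history_token_ratio arguments to the Memory class.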

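1. token_limit

token_limit (default: 30000): the maximum number of short-term and long-term tokens to store.

Purpose: keeps the total number of tokens held in memory under this limit, so the context does not grow long enough to degrade performance or exceed what the model can handle.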

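2. chat_history_token_ratio

chat_history_token_ratio (default: 0.7): the ratio of tokens in the short-term chat history to the total token limit. If the chat history exceeds this ratio, the oldest messages are flushed into long-term memory (if long-term memory is enabled).

Calculation: short_term_limit = token_limit * chat_history_token_ratio.

For example, with token_limit=50 and chat_history_token_ratio=0.7, short-term memory holds at most 35 tokens, and the remainder is left for long-term memory.

Purpose: controls how many tokens are allocated to short-term memory (the most recent conversation). Once short-term memory exceeds its share, old messages are moved out of short-term memory and into long-term memory.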
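The budget split described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic; the helper name split_token_budget is hypothetical, and rounding to a whole token is an assumption of the sketch rather than documented library behavior.

```python
def split_token_budget(token_limit: int, chat_history_token_ratio: float) -> tuple[int, int]:
    """Return (short_term_limit, remainder) for a given total token budget.

    Rounding to a whole token is an assumption made for this sketch.
    """
    if not 0.0 < chat_history_token_ratio <= 1.0:
        raise ValueError("chat_history_token_ratio must be in (0, 1]")
    short_term_limit = round(token_limit * chat_history_token_ratio)
    return short_term_limit, token_limit - short_term_limit


print(split_token_budget(50, 0.7))      # the small example from the text
print(split_token_budget(40000, 0.7))   # a larger budget
```

With token_limit=50 this gives a short-term limit of 35 tokens and 15 tokens left over; with token_limit=40000 it gives 28000 and 12000.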

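3. token_flush_size

token_flush_size (default: 3000): the number of tokens flushed from short-term memory into long-term memory each time the short-term messages exceed short_term_limit (= token_limit * chat_history_token_ratio).

Purpose: moves old messages in batches of a roughly fixed size, which keeps migration regular and avoids oversized single transfers. If long-term memory is not enabled, flushed messages are archived and removed from short-term memory.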

Example code

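from llama_index.core.memory import Memory

# Short-term budget: 40000 * 0.7 = 28000 tokens; flushes move about 3000 tokens at a time.
memory = Memory.from_defaults(
    session_id="my_session",
    token_limit=40000,
    chat_history_token_ratio=0.7,
    token_flush_size=3000,
)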

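Code analysis

During the conversation:

New messages are appended to the short-term memory queue (FIFO), and the current short-term token count is computed.

If the count exceeds token_limit * chat_history_token_ratio (28000 tokens in this example), a flush is triggered.

When a flush is triggered:

The oldest messages (about token_flush_size tokens in total) are moved to long-term memory.

Once long-term memory has processed them, the same amount of token space is freed in short-term memory.

When Memory.get() is called:

Short-term and long-term content is fetched and merged into the final context, with the total kept within token_limit.

The different long-term memory blocks (FactExtraction, Vector, and so on) are managed by priority and truncated when necessary.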
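The flush steps above can be modeled with a toy FIFO queue. This is a simplified sketch, not LlamaIndex code: the class ToyShortTermMemory is hypothetical, token counts are passed in by hand instead of being computed by a tokenizer, and the flushed list merely stands in for long-term memory.

```python
from collections import deque


class ToyShortTermMemory:
    """Toy FIFO model of the flush behaviour described above (not LlamaIndex code)."""

    def __init__(self, token_limit: int, ratio: float, flush_size: int) -> None:
        self.short_term_limit = round(token_limit * ratio)
        self.flush_size = flush_size
        self.short_term = deque()  # items are (text, token_count) pairs
        self.flushed = []          # stand-in for long-term memory

    def short_term_tokens(self) -> int:
        return sum(tokens for _, tokens in self.short_term)

    def put(self, text: str, tokens: int) -> None:
        # Append the new message, then flush if the short-term budget is exceeded.
        self.short_term.append((text, tokens))
        if self.short_term_tokens() > self.short_term_limit:
            self._flush()

    def _flush(self) -> None:
        # Move the oldest messages out until about flush_size tokens have moved.
        moved = 0
        while self.short_term and moved < self.flush_size:
            text, tokens = self.short_term.popleft()
            self.flushed.append((text, tokens))
            moved += tokens


memory = ToyShortTermMemory(token_limit=100, ratio=0.7, flush_size=30)
for i in range(10):
    memory.put(f"message {i}", tokens=10)

print(memory.short_term_tokens(), [text for text, _ in memory.flushed])
```

With these numbers the short-term limit is 70 tokens. The eighth message raises the total to 80, so the three oldest messages (30 tokens) are flushed, and the remaining two messages bring short-term memory back up to exactly 70 without triggering another flush.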

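Parameter summary

Parameter	Meaning	Purpose
`token_limit`	Total token cap for Memory	Limits the combined capacity of short-term and long-term memory
`chat_history_token_ratio`	Share of the total limit given to short-term memory	Controls the token split between short-term and long-term memory
`token_flush_size`	Number of tokens moved to long-term memory per flush	Clears short-term memory in batches by migrating old messages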

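II. Long-term memory

1. Definition

In LlamaIndex's Memory system, short-term memory stores the most recent conversation (bounded by chat_history_token_ratio and token_limit), while long-term memory takes the old messages that are flushed out and stores them in structured memory blocks, so that useful information is retained over time.

Long-term memory is made up of multiple Memory Blocks, each responsible for processing a different kind of information. When memory is retrieved, short-term and long-term memory are merged.

There are currently three predefined memory blocks: StaticMemoryBlock, FactExtractionMemoryBlock, and VectorMemoryBlock.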

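2. The three Memory Block types

StaticMemoryBlock

Stores static, unchanging information such as a basic user profile or system settings. This information is always kept and has the highest priority (priority=0).

FactExtractionMemoryBlock

Uses an LLM to automatically extract key facts from flushed conversation (for example, "the user is 29 years old" or "the user likes cats") and keeps them as a structured list. It holds at most max_facts entries; when that cap is exceeded, the facts are automatically condensed.

VectorMemoryBlock

Stores batches of flushed messages in a vector database (such as Qdrant or Chroma). Later, history relevant to the current conversation can be retrieved by embedding similarity, which provides continuity of context.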
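To make the idea of embedding-based retrieval concrete, here is a minimal sketch of top-k cosine-similarity search in plain Python. It is not how VectorMemoryBlock is implemented; the three-dimensional vectors are hand-written stand-ins for real embeddings, and the names cosine_similarity, top_k, and stored_batches are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], batches: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k stored batches most similar to the query vector."""
    ranked = sorted(
        batches,
        key=lambda name: cosine_similarity(query, batches[name]),
        reverse=True,
    )
    return ranked[:k]


# Hand-written 3-d vectors stand in for real embeddings of flushed message batches.
stored_batches = {
    "pets": [0.9, 0.1, 0.0],
    "travel": [0.0, 0.9, 0.1],
    "work": [0.1, 0.0, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # hypothetical embedding of the current user message

print(top_k(query_vector, stored_batches, k=2))
```

The query vector points mostly along the same direction as the "pets" batch, so that batch ranks first. A real vector store performs the same kind of ranking over embeddings produced by an embedding model.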

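3. How long-term memory works

When the short-term token count exceeds the threshold (token_limit × chat_history_token_ratio), a flush is triggered automatically and the oldest messages, about token_flush_size tokens' worth, are pushed to the Memory Blocks for processing.
Each Memory Block handles these messages according to its own logic (extracting facts, storing vectors, or holding static content).
When memory.get() is called, the system combines the short-term messages with the long-term memory held in each block; if the total exceeds the token limit, the blocks are truncated in order of their priority.
An automatic flush pushes data into FactExtractionMemoryBlock and VectorMemoryBlock on its own, so you do not need to manage the internal details.
StaticMemoryBlock must be populated manually; flushed data is not written to it automatically.
Long-term memory is optional in Memory, and each of these three blocks can be configured or left out as needed.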
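The priority-based truncation step can be illustrated with a toy model. The sketch below assumes that blocks with larger priority numbers give way first and that priority 0 is never dropped; it also removes whole blocks, whereas real blocks may be shortened rather than removed. ToyBlock and fit_blocks are hypothetical names, and the token counts are made up.

```python
from dataclasses import dataclass


@dataclass
class ToyBlock:
    name: str
    priority: int  # 0 = always kept; larger numbers are dropped first (assumption)
    tokens: int


def fit_blocks(blocks: list[ToyBlock], short_term_tokens: int, token_limit: int) -> list[ToyBlock]:
    """Drop whole blocks, least important first, until the total fits the limit."""
    kept = sorted(blocks, key=lambda block: block.priority)
    total = short_term_tokens + sum(block.tokens for block in kept)
    # Walk from the largest priority number down; never drop a priority-0 block.
    while total > token_limit and kept and kept[-1].priority != 0:
        dropped = kept.pop()
        total -= dropped.tokens
    return kept


blocks = [
    ToyBlock("core_info", priority=0, tokens=40),
    ToyBlock("extracted_info", priority=1, tokens=120),
    ToyBlock("vector_memory", priority=2, tokens=200),
]
kept = fit_blocks(blocks, short_term_tokens=300, token_limit=500)
print([block.name for block in kept])
```

Here the combined content is 660 tokens against a limit of 500, so the priority-2 block is dropped first, which brings the total down to 460 and leaves the other two blocks untouched.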

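4. Why use long-term memory?

Persistence across multi-turn conversations: user information and context are retained, so the assistant does not forget what was said earlier.
Structured information: FactExtractionMemoryBlock outputs clear, usable facts that are easy to draw on during reasoning.
Efficient recall of history: VectorMemoryBlock supports semantic retrieval instead of a purely linear scan.
Well suited to customer-service bots, personalized conversation, and complex multi-turn reasoning.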

5. Example code

import asyncio

import chromadb
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import (
    FactExtractionMemoryBlock,
    Memory,
    StaticMemoryBlock,
    VectorMemoryBlock,
)
from llama_index.core.memory.memory import InsertMethod
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore

# The OpenAI LLM and embedding model read OPENAI_API_KEY from the environment.
llm = OpenAI(model="gpt-4.1-mini")
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# In-process Chroma instance; its data is discarded when the program exits.
client = chromadb.EphemeralClient()
vector_store = ChromaVectorStore(
    chroma_collection=client.get_or_create_collection("test_collection")
)

blocks = [
    StaticMemoryBlock(
        name="core_info",
        static_content="My name is Logan, and I live in Saskatoon. I work at LlamaIndex.",
        priority=0,
    ),
    FactExtractionMemoryBlock(
        name="extracted_info",
        llm=llm,
        max_facts=50,
        priority=1,
    ),
    VectorMemoryBlock(
        name="vector_memory",
        # required: pass in a vector store like qdrant, chroma, weaviate, milvus, etc.
        vector_store=vector_store,
        priority=2,
        embed_model=embed_model,
        # The top-k message batches to retrieve
        # similarity_top_k=2,
        # optional: How many previous messages to include in the retrieval query
        # retrieval_context_window=5
        # optional: pass optional node-postprocessors for things like similarity threshold, etc.
        # node_postprocessors=[...],
    ),
]


async def main():
    # A small token_limit makes the flush to long-term memory happen quickly.
    memory = Memory.from_defaults(
        session_id="my_session",
        token_limit=500,
        memory_blocks=blocks,
        insert_method=InsertMethod.SYSTEM,
    )
    await memory.aput_messages(
        [ChatMessage(role="system", content="You are a technical expert")]
    )

    # Simulate a long conversation
    for i in range(100):
        await memory.aput_messages(
            [
                ChatMessage(role="user", content="Hello, world!"),
                ChatMessage(role="assistant", content="Hello, world to you too!"),
                ChatMessage(role="user", content="What is the capital of France?"),
                ChatMessage(
                    role="assistant",
                    content="The capital of France is Paris. It is known for its art, fashion, and culture.",
                ),
            ]
        )

    # aget() returns the short-term history merged with long-term memory content.
    current_chat_history = await memory.aget()
    for msg in current_chat_history:
        print(msg)
    print(f"memory.aget().len={len(current_chat_history)}")
    print("=" * 40)

    # aget_all() returns every stored message, including flushed ones.
    all_messages = await memory.aget_all()
    print(f"memory.aget_all().len={len(all_messages)}")


if __name__ == "__main__":
    asyncio.run(main())

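6. How do you keep short-term memory from being lost?

By default, the Memory class uses an in-memory SQLite database. You can plug in any remote database by changing the database URI. You can also customize the table name, or pass in an async engine directly, which is useful for managing your own connection pool. The code below uses PostgreSQL as the storage backend, so short-term memory is not lost when the process exits.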

from llama_index.core.memory import Memory

# The postgresql+asyncpg scheme requires the asyncpg driver to be installed.
memory = Memory.from_defaults(
    session_id="my_session",
    token_limit=40000,
    async_database_uri="postgresql+asyncpg://postgres:mark90@localhost:5432/postgres",
    # Optional: specify a table name
    # table_name="memory_table",
    # Optional: pass in an async engine directly
    # this is useful for managing your own connection pool
    # async_engine=engine,
)
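The example above embeds the database password directly in the URI. One possible alternative is to assemble the URI from an environment variable and percent-encode the credentials, so that characters such as "@" or ":" in a password do not break URI parsing. This is a sketch using only the standard library; the helper build_async_pg_uri and the variable name MEMORY_DB_PASSWORD are hypothetical.

```python
import os
from urllib.parse import quote


def build_async_pg_uri(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a postgresql+asyncpg URI, percent-encoding the credentials."""
    return (
        f"postgresql+asyncpg://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# MEMORY_DB_PASSWORD is a hypothetical variable name chosen for this sketch.
uri = build_async_pg_uri(
    user="postgres",
    password=os.environ.get("MEMORY_DB_PASSWORD", "change-me"),
    host="localhost",
    port=5432,
    database="postgres",
)
print(uri)
```

The resulting string can then be passed as async_database_uri in place of the hard-coded value.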