At the beginning of the year I planned to build a RAG tool for reading papers, but then I started writing a paper of my own and put the project on hold. Looking at it again now, the APIs have all changed... so I might as well start over. This time I want to study every part of RAG thoroughly and hopefully end up with something that works remarkably well.
The framework this time is LangChain. The choice of framework doesn't matter much; if I get the chance, I'll also implement the same functionality with LlamaIndex, or even build it from scratch (a huge undertaking). Let's start with a simple RAG tool and refine it step by step.
Environment setup
Installing conda
I personally prefer Miniconda, but Anaconda works just as well. There are two ways to install conda:
- Download the latest installer from https://docs.anaconda.com/anaconda/install/ (Anaconda) or https://docs.anaconda.com/miniconda/miniconda-other-installer-links/ (Miniconda);
- Install from the command line. For example, on macOS with Homebrew installed, run brew install --cask anaconda (Anaconda) or brew install --cask miniconda (Miniconda).
Switching package sources
If downloads are slow, refer to the post 推荐一个好用的国内源合集 (a roundup of good mirrors in China) to find a suitable source.
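As a minimal sketch (the Tsinghua mirror here is just one common choice; any mirror from that roundup works the same way), pip and conda can be pointed at a mirror like this:
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/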
Creating a virtual environment
In the terminal, run
conda create -n langchain python=3.11 jupyterlab ipykernel -y
to create a Python 3.11 environment, then
conda activate langchain
to activate it, and finally
pip install langchain langchain-community langchain-huggingface
to install the necessary packages. Install only the strictly necessary packages at first; more can be added later as needed. As of the time of writing (November 11, 2024), the main package versions are:
transformers-4.46.2
langchain-0.3.7
langchain-core-0.3.15
langchain-text-splitters-0.3.2
langchain-community-0.3.5
langchain-huggingface-0.1.2
One small detail: package names use a hyphen (-) on the command line, but an underscore (_) when imported in Python scripts.
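For example (a trivial check; note that importlib.metadata looks up the hyphenated distribution name):
from importlib.metadata import version
import langchain_community  # underscore when importing
print(version('langchain-community'))  # hyphen in the distribution name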
A simple RAG demo
RAG consists of several main stages: ingestion, retrieval, and generation.
Ingestion
This stage covers reading documents, transforming them (chunking, etc.), vectorizing them, and loading them into a store. If persistence is needed (saving the data to physical media), the vector database is written to disk as files.
- Reading
As a demo, let's start by reading a simple TXT file. In the terminal, run
mkdir -p 'data/paul_graham/'
wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
to download a TXT file.
The langchain_community.document_loaders module contains many file loaders to pick from (a sketch of another loader appears further below); here we use TextLoader.
import os
from pathlib import Path
from langchain_community.document_loaders import TextLoader

# Resolve the project root relative to this script, then load the file.
root = Path(__file__).parent.parent.absolute()
loader = TextLoader(os.path.join(root, 'data/paul_graham/paul_graham_essay.txt'))
data = loader.load()
Let's take a quick look at what data contains.
print(len(data)) # 1
print(type(data[0])) # <class 'langchain_core.documents.base.Document'>
print(data[0].dict().keys()) # dict_keys(['id', 'metadata', 'page_content', 'type'])
print(len(data[0].page_content)) # 75014
The output above tells us:
- after loading, each file becomes a langchain_core.documents.base.Document object;
- this class has 'id', 'metadata', 'page_content', and 'type' attributes;
- before splitting, the document contains 75014 characters in total.
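The document_loaders module mentioned above ships loaders for many other formats. As a rough sketch (the PDF path is hypothetical, and PyPDFLoader additionally requires pip install pypdf):
from langchain_community.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader('data/example.pdf')  # hypothetical path
pdf_pages = pdf_loader.load()  # one Document per page, with the page number in metadata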
- Transformation
Loaded documents usually can't be used directly, because:
- they may contain noise that needs filtering out;
- they may be too long;
- they may contain other modalities, such as images, tables, or code.
So we first need to transform the loaded documents into a usable form. In this demo, we only do splitting.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(data)
RecursiveCharacterTextSplitter is a commonly used document splitter; its details will be covered later. Here it means: split the text into chunks of at most 500 characters, with a 50-character overlap between consecutive chunks.
print(len(docs)) # 217
print(docs[0])
# page_content='What I Worked On
#
# February 2021
#
# Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.' metadata={'source': 'data/paul_graham/paul_graham_essay.txt'}
This produces 217 chunks.
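A quick sanity check (not part of the original demo): chunk_size is an upper bound, so no chunk should exceed 500 characters, though most will be shorter because the splitter prefers to break at separators:
print(max(len(d.page_content) for d in docs))  # <= 500
print(min(len(d.page_content) for d in docs))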
- Vectorization and storage
The core of RAG is vector retrieval, so the next step is to turn the text into vectors; here we use BAAI/bge-small-en-v1.5. The first run takes a little while to download the model.
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
model_id = "BAAI/bge-small-en-v1.5"
embed = HuggingFaceEmbeddings(model_name=model_id)
embedding = embed.embed_documents([docs[0].page_content])  # embed_documents expects a list of strings
print(f'{len(embedding[0])}: {embedding[0][:5]} ...') # 384: [-0.04461495578289032, -0.04581078141927719, 0.044918518513441086, -0.01816270500421524, 0.007471249904483557] ...
BAAI/bge-small-en-v1.5 converts each text into a 384-dimensional vector.
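Queries are embedded with the same model via embed_query. A minimal sketch comparing the query to the first chunk, computing cosine similarity by hand (these vectors are not normalized by default):
import numpy as np

q_vec = np.array(embed.embed_query('What did the author do growing up?'))
d_vec = np.array(embedding[0])
print(q_vec @ d_vec / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))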
Whether we persist it or not, we need a dedicated database to hold these vectors. This demo uses FAISS (installed with pip install faiss-cpu; the FAISS vector store integration itself ships with langchain-community).
from langchain_community.vectorstores import FAISS
vector_store = FAISS.from_documents(docs, embed)
With that, a vector database is ready.
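For the persistence mentioned earlier, FAISS can be written to and reloaded from disk. A minimal sketch (the 'faiss_index' directory name is arbitrary):
vector_store.save_local('faiss_index')
restored = FAISS.load_local('faiss_index', embed, allow_dangerous_deserialization=True)  # flag required by recent versions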
Retrieval
The vector store can be turned directly into a retriever, whose invoke method then performs the search.
retriever = vector_store.as_retriever(search_kwargs={'k': 3})
query = 'What did the author do growing up?'
retrieval = retriever.invoke(input=query)
print(len(retrieval))
# 3
for i, doc in enumerate(retrieval):
print(i, '\n', doc.page_content, '\n')
# 0
# What I Worked On
#
# February 2021
#
# Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
#
# 1
# book. But there's nothing like writing a book about something to help you learn it. The book, On Lisp, wasn't published till 1993, but I wrote much of it in grad school.
#
# 2
# Over the next several years I wrote lots of essays about all kinds of different topics. O'Reilly reprinted a collection of them as a book, called Hackers & Painters after one of the essays in it. I also worked on spam filters, and did some more painting. I used to have dinners for a group of friends every thursday night, which taught me how to cook for groups. And I bought another building in Cambridge, a former candy factory (and later, twas said, porn studio), to use as an office.
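If you want to see how close each hit actually is, the vector store also exposes similarity_search_with_score; for FAISS the score is an L2 distance, so smaller means more similar:
for doc, score in vector_store.similarity_search_with_score(query, k=3):
    print(f'{score:.4f}', doc.page_content[:60], '...')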
Generation
This project uses Ollama, a backend that makes it quick and easy to run large language models. First download and install it from https://ollama.com/download, then run in the terminal
pip install langchain-ollama
ollama pull qwen2.5:0.5b
and after a short wait the download is done. Then load the model in a Python script:
from langchain_ollama.llms import OllamaLLM
llm = OllamaLLM(model="qwen2.5:0.5b")
RAG boils down to assembling the retrieval results and the question into one prompt, then letting the LLM predict the next token over and over until it stops. So here we extract the retrieved text (the .page_content attribute) and concatenate it with the question.
context = '\n'.join([doc.page_content for doc in retrieval])
prompt = f'''
You are a helpful assistant.
Answer the question based on the following context:
{context}
Question: {query}
'''
answer = llm.invoke(prompt)
print(answer)
# The author spent most of their time outside of school writing and programming before college. They didn't write essays but wrote short stories that were supposed to be deep, though they found them awful. While not explicitly stated in your context, it is implied that these early writings are based on what beginning writers would have written at the time, as evidenced by the characters having "strong feelings."
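The same flow can also be expressed as a single LCEL chain, which later posts can build on. A sketch reusing the retriever and model from above:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

prompt_template = ChatPromptTemplate.from_template(
    'You are a helpful assistant.\n'
    'Answer the question based on the following context:\n{context}\n'
    'Question: {question}'
)

def format_docs(docs):
    # Join the retrieved chunks into one context string.
    return '\n'.join(d.page_content for d in docs)

chain = (
    {'context': retriever | format_docs, 'question': RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)
print(chain.invoke(query))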
This completes a full RAG pipeline. In follow-up posts, every detail will be examined and optimized, with the goal of building a powerful, robust RAG product. The code is at https://github.com/vincent507cpu/RAG_zero_to_hero/blob/main/demo/LangChain.py. Stay tuned!