Introduction
Retrieval-augmented generation (RAG) systems have become a core pattern in modern AI applications: by combining retrieval with a generative model, they produce more accurate, grounded answers. Activeloop's Deep Memory is a set of tools that optimizes your vector store to improve retrieval precision in LLM applications. Building such systems still involves trade-offs around accuracy, cost, and latency. This article walks through using Activeloop Deep Memory to address these challenges and offers practical advice for improving RAG performance.
Main Content
1. Dataset Creation
First, we parse Activeloop's documentation to build the dataset for the RAG system. We will use BeautifulSoup together with LangChain's document tooling, such as Html2TextTransformer and AsyncHtmlLoader. Install the required libraries with the command below:
%pip install --upgrade --quiet tiktoken langchain-openai python-dotenv datasets langchain deeplake beautifulsoup4 html2text ragas
You will also need an Activeloop account and its API token. The setup code is as follows:
import getpass
import os

from langchain_community.vectorstores import DeepLake
from langchain_openai import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API token: ")
os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("Enter your ActiveLoop API token: ")

ORG_ID = "your_org_id"  # replace with your Activeloop organization id

# Initialize DeepLake with the managed tensor database enabled (required for Deep Memory)
db = DeepLake(
    dataset_path=f"hub://{ORG_ID}/deeplake-docs-deepmemory",
    embedding=OpenAIEmbeddings(),
    runtime={"tensor_db": True},
    token=os.getenv("ACTIVELOOP_TOKEN"),
    read_only=False,
)
2. Parsing and Storing the Data
Use BeautifulSoup to collect the documentation page links, then load the page contents with AsyncHtmlLoader:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from langchain_community.document_loaders.async_html import AsyncHtmlLoader


def get_all_links(url):
    """Collect every hyperlink found on the given page."""
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page: {url}")
        return []
    soup = BeautifulSoup(response.content, "html.parser")
    # Resolve relative links against the base url
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]


base_url = "https://docs.deeplake.ai/en/latest/"
all_links = get_all_links(base_url)

# Fetch all pages asynchronously
loader = AsyncHtmlLoader(all_links)
docs = loader.load()
Convert the HTML into readable text and split it into smaller chunks for processing:
from langchain_community.document_transformers import Html2TextTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Convert raw HTML into plain text
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)

# Split any document longer than chunk_size characters
chunk_size = 4096
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)

docs_new = []
for doc in docs_transformed:
    if len(doc.page_content) < chunk_size:
        docs_new.append(doc)
    else:
        docs_new.extend(text_splitter.create_documents([doc.page_content]))

# Add documents to DeepLake; add_documents returns the ids of the stored chunks,
# which we keep (together with the chunk texts) for Deep Memory training below
ids = db.add_documents(docs_new)
texts = [doc.page_content for doc in docs_new]
3. Training Deep Memory
Generate synthetic queries from the stored chunks and train Deep Memory. Each training example pairs a generated question with the id of the chunk it came from:
from langchain.chains.openai_functions import create_structured_output_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
import random
from tqdm import tqdm

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)


class Questions(BaseModel):
    """A question generated from a chunk of documentation."""

    question: str = Field(..., description="Questions about text")


# Chain that asks the LLM to write a question answerable by the given text
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Generate a question that can be answered using the provided text."),
        ("human", "Text: {input}"),
    ]
)
chain = create_structured_output_chain(Questions, llm, prompt)


# Generate synthetic (question, relevance) pairs from randomly sampled chunks
def generate_queries(docs, ids, n=100):
    questions, relevances = [], []
    for _ in tqdm(range(n)):
        r = random.randint(0, len(docs) - 1)
        text, label = docs[r], ids[r]
        generated_qs = [chain.run(input=text).question]
        questions.extend(generated_qs)
        relevances.extend([[(label, 1)] for _ in generated_qs])
    return questions, relevances


questions, relevances = generate_queries(texts, ids, n=200)
train_questions, train_relevances = questions[:100], relevances[:100]
test_questions, test_relevances = questions[100:], relevances[100:]

# Train Deep Memory on the synthetic query/relevance pairs
job_id = db.vectorstore.deep_memory.train(
    queries=train_questions,
    relevance=train_relevances,
)
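Training runs asynchronously as a job on Activeloop's side. Assuming the deep_memory.status helper described in the Deep Memory documentation is available, you can check progress with the returned job id:
# Check the status of the Deep Memory training job (assumes deep_memory.status
# is exposed on the underlying deeplake vector store)
db.vectorstore.deep_memory.status(job_id)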
4. Evaluating Performance
Use Deep Memory's built-in evaluation method to measure retrieval quality on the held-out test split:
# Compare recall on the held-out questions with and without Deep Memory
recall = db.vectorstore.deep_memory.evaluate(
    queries=test_questions,
    relevance=test_relevances,
)
The evaluation reports recall both with and without Deep Memory, so the improvement is easy to quantify on your own data.
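Once training has finished, Deep Memory is applied at query time. The sketch below is a minimal example assuming the deep_memory search flag supported by the DeepLake vector store; it wires the retriever into a standard RetrievalQA chain, so adjust it to your installed versions:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Enable Deep Memory when searching the vector store (assumes the deep_memory
# flag accepted by the DeepLake integration's search_kwargs)
retriever = db.as_retriever()
retriever.search_kwargs["deep_memory"] = True
retriever.search_kwargs["k"] = 10

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=retriever,
)
print(qa.run("How can I store a Deep Lake dataset?"))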
Common Issues and Solutions
- Potential issue: network restrictions may limit access to the APIs.
- Solution: consider routing requests through an API proxy service to improve connectivity; see the sketch below.
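For example, the ChatOpenAI and OpenAIEmbeddings classes in langchain_openai accept a base_url parameter, so OpenAI traffic can be routed through a mirror or proxy endpoint. This is a minimal sketch; the proxy URL below is a placeholder, not a real service:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Route OpenAI requests through a proxy endpoint.
# "https://api.your-proxy.example/v1" is a placeholder; substitute your own service.
proxy_base_url = "https://api.your-proxy.example/v1"
llm = ChatOpenAI(model="gpt-3.5-turbo", base_url=proxy_base_url)
embeddings = OpenAIEmbeddings(base_url=proxy_base_url)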
Summary and Further Resources
In this article, you learned how to use Activeloop Deep Memory to improve a RAG system's retrieval performance and work around common challenges. To go further, refer to the following resources:
- The official Deep Memory and Deep Lake documentation
- LangChain community resources
If this article helped you, please like and follow my blog. Your support keeps me writing!
—END—