Using ClickHouse as a High-Performance Vector Database for Vector Storage and Retrieval

bBADAS

License: CC 4.0 BY-SA

Tags: clickhouse, database, python

Article link: https://blog.youkuaiyun.com/bBADAS/article/details/145747124

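ClickHouse is an open-source database that is popular for real-time applications and analytics because of its high performance and resource efficiency. It supports full SQL and offers a wide range of functions that help users write analytical queries. ClickHouse has recently added data structures and distance functions (such as L2Distance), along with approximate nearest neighbor search indexes, so it can serve as a high-performance, scalable vector database that stores and searches vectors through SQL.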
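
As a rough illustration of what searching vectors through SQL can look like, the sketch below builds a brute-force nearest-neighbor query around `L2Distance`. The table name `docs`, the columns `id` and `embedding`, and the helper `build_l2_query` are hypothetical names chosen only for this example. The function just produces the SQL text and does not contact a server.

```python
from typing import Sequence


def build_l2_query(
    query_vector: Sequence[float],
    k: int,
    table: str = "docs",               # hypothetical table name
    vector_column: str = "embedding",  # hypothetical column name
) -> str:
    """Return SQL that ranks rows by Euclidean distance to query_vector."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Render the vector as a ClickHouse array literal, e.g. [0.1, 0.2].
    literal = "[" + ", ".join(repr(float(x)) for x in query_vector) + "]"
    # Smaller distance means more similar, so sort ascending.
    return (
        f"SELECT id, L2Distance({vector_column}, {literal}) AS dist "
        f"FROM {table} ORDER BY dist ASC LIMIT {int(k)}"
    )


sql = build_l2_query([0.1, 0.2, 0.3], k=2)
print(sql)
```

The identifiers are interpolated directly, so this sketch is only suitable for trusted table and column names.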

This article shows how to use ClickHouse as a vector store.

1. Technical background

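Vectorization and vector search are increasingly important in natural language processing (NLP), recommender systems, and machine learning. Many applications need to convert documents, queries, and other data into vectors and then retrieve similar items quickly and efficiently from a large collection of vectors. With its vector search support, ClickHouse can serve as the database for both storing and retrieving those vectors.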

2. Core principles

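ClickHouse supports efficient vector storage and retrieval through distance functions such as L2Distance and through approximate nearest neighbor search indexes. Working with a vector store typically involves the following steps: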

Vectorize text or other data.
Store the vectors in the database.
Run a similarity search against a query vector to retrieve similar items.
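
The three steps can be sketched in plain Python without any database. This is a minimal illustration, not how ClickHouse implements search: `embed` is a toy character-bucket function standing in for a real embedding model, `store` is an in-memory dictionary standing in for a table, and `search` does an exact brute-force scan using Euclidean (L2) distance. All of these names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def embed(text: str, dim: int = 8) -> List[float]:
    """Step 1: toy vectorization that counts characters into `dim` buckets."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Step 2: store each text alongside its vector.
store: Dict[str, List[float]] = {}
for text in ["red apple", "green apple", "fast car"]:
    store[text] = embed(text)


def search(query: str, k: int = 1) -> List[Tuple[str, float]]:
    """Step 3: rank stored items by distance to the query vector."""
    q = embed(query)
    ranked = sorted(
        ((text, l2_distance(q, vec)) for text, vec in store.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]


print(search("red apple", k=1))
```

A brute-force scan like this compares the query with every stored vector. Approximate nearest neighbor indexes exist to avoid that full scan on large collections.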

3. Code walkthrough

The following code examples show how to store and retrieve vectors in ClickHouse.

Environment setup

First, start a local ClickHouse server with Docker:

docker run -d -p 8123:8123 -p 9000:9000 --name langchain-clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server:23.4.2.11

Install the required Python packages:

pip install -qU langchain-community clickhouse-connect langchain-openai langchain-huggingface langchain-core

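Vector store setup

Import the required libraries, then configure OpenAI embeddings and the ClickHouse vector store. HuggingFaceEmbeddings and FakeEmbeddings are imported as alternative embedding backends but are not used in the code below: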

import os
import getpass
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.embeddings import FakeEmbeddings
from langchain_community.vectorstores import Clickhouse, ClickhouseSettings
from langchain_core.documents import Document
from uuid import uuid4

# Set the OpenAI API key (prompted interactively so it is not hard-coded)
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

# Use OpenAI embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Set up the ClickHouse vector store
settings = ClickhouseSettings(table="clickhouse_example")
vector_store = Clickhouse(embeddings, config=settings)

Adding documents to the vector store

Create some documents and add them to the vector store:

documents = [
    Document(page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.", metadata={"source": "tweet"}),
    Document(page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.", metadata={"source": "news"}),
    Document(page_content="Building an exciting new project with LangChain - come check it out!", metadata={"source": "tweet"}),
    Document(page_content="Robbers broke into the city bank and stole $1 million in cash.", metadata={"source": "news"}),
    Document(page_content="Wow! That was an amazing movie. I can't wait to see it again.", metadata={"source": "tweet"}),
    Document(page_content="Is the new iPhone worth the price? Read this review to find out.", metadata={"source": "website"}),
    Document(page_content="The top 10 soccer players in the world right now.", metadata={"source": "website"}),
    Document(page_content="LangGraph is the best framework for building stateful, agentic applications!", metadata={"source": "tweet"}),
    Document(page_content="The stock market is down 500 points today due to fears of a recession.", metadata={"source": "news"}),
    Document(page_content="I have a bad feeling I am going to get deleted :(", metadata={"source": "tweet"}),
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

Querying the vector store

Run a similarity search:

results = vector_store.similarity_search("LangChain provides abstractions to make working with LLMs easy", k=2)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

You can also run a search that returns scores:

results = vector_store.similarity_search_with_score("Will it be hot tomorrow?", k=1)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]")

Filter results with a SQL condition:

meta = vector_store.metadata_column
results = vector_store.similarity_search_with_relevance_scores(
    "What did I eat for breakfast?", k=4, where_str=f"{meta}.source = 'tweet'"
)
# Each result is a (document, score) pair, so unpack it before reading fields
for res, score in results:
    print(f"* {res.page_content} [{res.metadata}]")
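
When the filter value comes from user input rather than a literal such as 'tweet', the condition string needs to be quoted with care. The sketch below shows one way to build such a condition. The helper `metadata_equals` is hypothetical and not part of LangChain, and the backslash escaping assumes ClickHouse's handling of single-quoted string literals.

```python
def metadata_equals(column: str, key: str, value: str) -> str:
    """Build a condition such as: metadata.source = 'tweet'."""
    # Escape backslashes first, then single quotes, so the value
    # cannot terminate the SQL string literal early.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"{column}.{key} = '{escaped}'"


condition = metadata_equals("metadata", "source", "tweet")
print(condition)
```

The resulting string could then be passed as `where_str`, with the column name taken from `vector_store.metadata_column` as in the example above.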