AI Part 1: Using Embeddings and Vector Databases in Python

Original article, last modified 2025-02-24 17:14:59

License: CC 4.0 BY-SA

First published 2025-02-24 16:10:15

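In Python, embeddings and vector databases are commonly used for natural language processing (NLP) tasks such as text similarity and search. An embedding maps a piece of text to a fixed-size vector, and the similarity of two texts can then be estimated with a distance measure between their vectors, such as cosine similarity or Euclidean distance. Common techniques include Word2Vec, GloVe, and BERT.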
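To make "comparing vectors with a distance measure" concrete, here is a minimal sketch using only the standard library. The three vectors are made-up toy values for illustration, not the output of any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
vec_a = [0.9, 0.1, 0.3]
vec_b = [0.8, 0.2, 0.4]   # points in roughly the same direction as vec_a
vec_c = [0.1, 0.9, 0.2]   # points in a different direction

sim_close = cosine_similarity(vec_a, vec_b)
sim_far = cosine_similarity(vec_a, vec_c)
print("cosine(a, b):", sim_close)
print("cosine(a, c):", sim_far)
print("L2(a, b):", euclidean_distance(vec_a, vec_b))
```

Higher cosine similarity and lower Euclidean distance both indicate closer vectors; which measure works better depends on how the embedding model was trained.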

1. Embeddings

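Embeddings are usually generated by deep learning models, most often pretrained ones such as BERT or GPT. Passing text through such a model yields its vector representation.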

Example: generating text embeddings with the `transformers` library and BERT

from transformers import BertTokenizer, BertModel
import torch

# Load the pretrained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample text
text = "Hello, how are you?"

# Tokenize the text and compute the embedding
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # sentence-level embedding (mean pooling over tokens)

print(embeddings)
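The example above encodes a single sentence, so a plain mean over tokens is fine. When several sentences are batched with padding, the padded positions should be left out of the average. The sketch below shows that arithmetic in NumPy on made-up numbers; `hidden` and `mask` are stand-ins (assumptions for illustration) for `outputs.last_hidden_state` and `inputs['attention_mask']`, and the same steps carry over to torch tensors.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors per sentence, ignoring positions where mask == 0.

    hidden: array of shape (batch, seq_len, dim)
    mask:   array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)          # sum over real tokens only
    counts = np.maximum(mask.sum(axis=1), 1.0)    # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token positions, 2 dimensions.
# The last position of the first sentence is padding (its values are junk).
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

A plain `hidden.mean(axis=1)` would let the padding values distort the first sentence's embedding, which is what the mask prevents.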

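2. Vector Databases

Vector databases and similarity-search libraries (such as Weaviate and FAISS) store and search embedding vectors. They speed up similarity search with efficient index structures such as IVF and HNSW.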
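To give a feel for how an IVF (inverted file) index saves work, here is a toy NumPy sketch: vectors are bucketed by their nearest centroid, and a query scans only the `nprobe` closest buckets. This is an illustration under simplifying assumptions, not how FAISS implements it; in particular, the centroids are sampled from the data instead of being trained with k-means, and the function names are hypothetical.

```python
import numpy as np

def build_ivf(xb, nlist, seed=0):
    """Toy inverted-file index: choose centroids, bucket vectors by nearest centroid."""
    rng = np.random.default_rng(seed)
    # Simplification: sample centroids from the data instead of running k-means.
    centroids = xb[rng.choice(len(xb), size=nlist, replace=False)]
    # Squared L2 distance from every vector to every centroid: shape (n, nlist)
    d2 = ((xb[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    # lists[c] holds the ids of the vectors assigned to centroid c
    lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, lists

def search_ivf(xb, centroids, lists, q, k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    centroid_d2 = ((centroids - q) ** 2).sum(axis=1)
    probe = np.argsort(centroid_d2)[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    d2 = ((xb[candidates] - q) ** 2).sum(axis=1)
    order = np.argsort(d2)[:k]
    return candidates[order], d2[order]

rng = np.random.default_rng(0)
xb = rng.random((200, 16))   # 200 database vectors, 16 dimensions
q = rng.random(16)           # one query vector

centroids, lists = build_ivf(xb, nlist=8)
# Probing 2 of 8 buckets is faster but may miss true neighbors.
approx_ids, approx_d2 = search_ivf(xb, centroids, lists, q, k=5, nprobe=2)
# Probing all 8 buckets degenerates to an exact, exhaustive search.
exact_ids, exact_d2 = search_ivf(xb, centroids, lists, q, k=5, nprobe=8)
print("approximate:", approx_ids)
print("exact:      ", exact_ids)
```

The `nprobe` parameter is the speed/recall trade-off: fewer buckets means fewer distance computations but a higher chance of missing a true nearest neighbor.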

Example: vector similarity search with FAISS

FAISS (Facebook AI Similarity Search) is an efficient similarity-search library that supports large-scale vector retrieval.

Install FAISS:

pip install faiss-cpu

Then use FAISS to build a simple vector search system:

import faiss
import numpy as np

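# Create sample data: 1000 random database vectors and 5 query vectors, all 128-dimensional
d = 128  # vector dimensionality
nb = 1000  # number of database vectors
nq = 5  # number of query vectors

# Generate the database vectors (1000 vectors of 128 dimensions)
np.random.seed(1234)
xb = np.random.random((nb, d)).astype('float32')

# Generate the query vectors (5 vectors of 128 dimensions)
xq = np.random.random((nq, d)).astype('float32')

# Build a FAISS index
index = faiss.IndexFlatL2(d)  # exact (brute-force) search with L2 distance
index.add(xb)  # add the database vectors to the index

# Find the most similar vectors
D, I = index.search(xq, 5)  # 5 nearest neighbors per query; D holds squared L2 distances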

print("Distances:", D)
print("Indices:", I)
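`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. As a rough sketch of what that computation amounts to, the NumPy function below (a hypothetical helper, not part of FAISS) returns the same kind of distance and index arrays. To make the result easy to check, the queries here are the first three database vectors, so each query's nearest neighbor should be itself at distance 0.

```python
import numpy as np

def flat_l2_search(xb, xq, k):
    """Exhaustive k-nearest-neighbor search by squared L2 distance."""
    # Pairwise differences: shape (nq, nb, d)
    diff = xq[:, None, :] - xb[None, :, :]
    d2 = (diff ** 2).sum(axis=2)                  # squared distances, (nq, nb)
    idx = np.argsort(d2, axis=1)[:, :k]           # ids of the k closest vectors
    dist = np.take_along_axis(d2, idx, axis=1)    # their distances, ascending
    return dist, idx

np.random.seed(1234)
xb = np.random.random((1000, 128)).astype('float32')
xq = xb[:3]  # reuse database vectors as queries for a sanity check

D_ref, I_ref = flat_l2_search(xb, xq, k=5)
print("Nearest ids:", I_ref[:, 0])
print("Nearest distances:", D_ref[:, 0])
```

This brute-force approach compares every query with every stored vector, which is exact but scales linearly with the database size; approximate indexes trade some accuracy to avoid that cost.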

3. Using a Vector Database (Weaviate)

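Weaviate is an open-source vector database that supports semantic search over text and images and is commonly used to store and manage embedding vectors.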

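pip install "weaviate-client<4"

Use Weaviate to store data. The example below covers insertion only and is written against the v3 Python client API (`weaviate.Client`), hence the version pin above; the v4 client uses a different API: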

import weaviate

# Connect to the Weaviate server
client = weaviate.Client("http://localhost:8080")

# Create the schema (run once; creating a class that already exists raises an error)
client.schema.create({
    "classes": [{
        "class": "TextData",
        "properties": [{
            "name": "text",
            "dataType": ["text"]
        }]
    }]
})

# Data to insert
data = [
    {"text": "Hello, how are you?"},
    {"text": "This is an example sentence."}
]

# Insert the objects into Weaviate
for item in data:
    client.data_object.create(item, "TextData")