Part 2: Python Tutorial. External Libraries: pymilvus

Core APIs of the pymilvus Python Client in Detail

Original article, last modified 2025-06-22 01:53:36


License: CC 4.0 BY-SA

Article tags: #python #programming-languages

First published 2025-06-22 00:48:40


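pymilvus is the Python client for Milvus, an open-source vector database. It is used to store, index, and query large-scale vector data efficiently. Its core APIs are described below.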

1. Connection Management

Establishing a connection

from pymilvus import connections

# Connect to a local Milvus server
connections.connect(
    alias="default",
    host='localhost',
    port='19530'
)

# Disconnect
connections.disconnect("default")

Checking connection status

# List all connection aliases
print(connections.list_connections())

# Check whether a connection exists
if connections.has_connection("default"):
    print("Connection exists")

2. Collection Management

Creating a collection

from pymilvus import Collection, FieldSchema, CollectionSchema, DataType

# Define the fields
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),  # 768-dimensional vector
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=512)
]

# Create the collection
schema = CollectionSchema(fields=fields, description="Text embedding collection")
collection = Collection(
    name="text_embeddings",
    schema=schema,
    using="default",  # connection alias
    shards_num=2
)

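Inspecting collections

from pymilvus import utility

# List all collections (list_collections and has_collection live in utility, not in connections)
print(utility.list_collections())

# Check whether a collection exists
if utility.has_collection("text_embeddings"):
    print("Collection exists")

# Get collection information
collection = Collection("text_embeddings")
print(collection.schema)
print(collection.num_entities)  # number of entities in the collection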

Dropping a collection

collection.drop()  # drops the collection together with all of its data

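3. Index Management

Creating an index

# Define the index parameters (HNSW as an example)
index_params = {
    "metric_type": "L2",  # Euclidean distance
    "index_type": "HNSW",
    "params": {"M": 8, "efConstruction": 64}  # Milvus names this build parameter efConstruction
}

# Create the index
collection.create_index(
    field_name="embedding",
    index_params=index_params,
    index_name="hnsw_index"
)

# Load the collection into memory (required before searching or querying)
collection.load()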

Inspecting indexes

# Get index information
indexes = collection.indexes
print(indexes)

Dropping an index

collection.drop_index(index_name="hnsw_index")  # call collection.release() first if the collection is loaded

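4. Data Operations

Inserting data

import random

# Generate sample data: one list per field, in schema order.
# The id field is omitted because the schema sets auto_id=True;
# with auto_id=False the primary keys would be the first list.
entities = [
    [[random.random() for _ in range(768)] for _ in range(100)],  # vector field
    [f"text description {i}" for i in range(100)]  # text field
]

# Insert the data
insert_result = collection.insert(entities)
print(f"Inserted {insert_result.insert_count} rows")
print(insert_result.primary_keys[:3])  # auto-generated primary keys

# Flush the collection (seals and persists the inserted data)
collection.flush()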
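
Application data often arrives as one record per row, while the insert call above takes one list per field. The following is a minimal sketch of that conversion plus simple batching. The helper names rows_to_columns and chunked are hypothetical and not part of pymilvus, and the two-dimensional vectors are only there to keep the example short.

```python
def rows_to_columns(rows, field_names):
    """Turn a list of row dicts into one list per field, in the given field order."""
    return [[row[name] for row in rows] for name in field_names]


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Five toy rows; a real collection would use vectors of the schema's dimension.
rows = [
    {"embedding": [float(i), float(i) + 0.5], "text": f"text description {i}"}
    for i in range(5)
]

# Each batch has the column layout expected by collection.insert(batch).
batches = [rows_to_columns(chunk, ["embedding", "text"]) for chunk in chunked(rows, 2)]
print(len(batches))
print(batches[0][1])
```

Each element of batches could then be passed to collection.insert() in turn, which keeps individual requests small when the data set is large.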

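Querying data

# Fetch rows by primary key (with auto_id=True, take real keys from insert_result.primary_keys)
results = collection.query(
    expr="id in [1, 2, 3]",  # filter expression
    output_fields=["id", "text", "embedding"]  # fields to return
)
print(results)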

Deleting data

# Delete by primary key
delete_result = collection.delete(expr="id in [1, 2, 3]")
print(f"Deleted {delete_result.delete_count} rows")

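5. Vector Search

collection.search(): vector similarity search. It finds the most similar entities by computing a similarity measure between vectors, such as Euclidean distance or cosine similarity.

Parameter	Purpose
data	List of query vectors (e.g. [[0.1, 0.2, …], [0.3, 0.4, …]]); several vectors can be searched in one batch.
anns_field	Name of the vector field to search; it must match the vector field defined in the schema (e.g. "embedding").
param	Search parameters, which depend on the index type. For example: HNSW index: {"metric_type": "L2", "params": {"ef": 50}}; IVF index: {"metric_type": "IP", "params": {"nprobe": 16}}
limit	Number of results to return (top-K); for example, limit=10 returns the 10 most similar records.
output_fields	Fields to return in addition to the primary key (e.g. ["text", "category"]).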

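metric_type in detail: distance metrics

Value	Meaning
"L2"	Euclidean distance (suited to feature vectors from images, audio, and similar data)
"IP"	Inner product (suited to text embeddings such as BERT vectors; normalize the vectors first)
"COSINE"	Cosine similarity (equivalent to the inner product of normalized vectors)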
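
To make the relationship between the three metrics concrete, here is a small pure-Python sketch of the underlying math. It only illustrates the formulas; the values Milvus reports for a given metric may be scaled differently, so treat the numbers as an illustration rather than as the server's output. All function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))


a = [3.0, 4.0]
b = [4.0, 3.0]

print(l2_distance(a, b))
print(inner_product(a, b))
print(cosine_similarity(a, b))
# Cosine similarity equals the inner product of the normalized vectors.
print(inner_product(unit(a), unit(b)))
```

The last two printed values agree up to floating-point error, which is why normalizing vectors before using IP gives the same ranking as COSINE.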

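Index-specific parameters: each index type takes its own search parameters

Index type	Key parameter	Effect
HNSW	ef	Size of the candidate list explored during search; larger values raise recall but slow the search down (e.g. ef=50)
IVF	nprobe	Number of inverted lists (clusters) probed during search; larger values raise recall but slow the search down (e.g. nprobe=16)
GPU indexes (e.g. GPU_IVF_FLAT)	Same as the CPU counterpart	Parameters resemble those of the CPU index, but GPU memory limits must be taken into account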
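
The table above can be turned into a small helper that produces the param argument for collection.search(). This is a sketch under stated assumptions: the function name and its defaults are hypothetical, and raising ef to at least the requested top-K reflects the documented lower bound for HNSW as I understand it, so check it against the Milvus version in use.

```python
def build_search_params(index_type, metric_type, limit, ef=64, nprobe=16):
    """Return a search `param` dict for an HNSW or IVF-family index."""
    if index_type == "HNSW":
        # Assumption: ef should never be smaller than the number of requested results.
        return {"metric_type": metric_type, "params": {"ef": max(ef, limit)}}
    if index_type.startswith("IVF"):
        return {"metric_type": metric_type, "params": {"nprobe": nprobe}}
    raise ValueError(f"no search parameters known for index type {index_type!r}")


hnsw_params = build_search_params("HNSW", "L2", limit=100, ef=50)
ivf_params = build_search_params("IVF_FLAT", "IP", limit=10)
print(hnsw_params)
print(ivf_params)
```

Keeping the metric type as an explicit argument makes it harder to search with a metric that differs from the one the index was built with.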

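Handling search results: the result is a nested structure that holds one list of hits per query vector

for hits in results:  # one entry per query vector
    for hit in hits:  # one entry per hit
        print(f"ID: {hit.id}, distance: {hit.distance}, text: {hit.entity.get('text')}")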

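Key attributes

hit.id: primary key of the matching record.
hit.distance: distance or similarity score relative to the query vector. With L2 a smaller value means more similar; with IP or COSINE a larger value means more similar.
hit.entity: holds the values of the fields listed in output_fields.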

Example: semantic text search

from pymilvus import Collection
from sentence_transformers import SentenceTransformer

# Load a text encoder. all-mpnet-base-v2 outputs 768-dimensional vectors, matching dim=768 in the schema above
# (all-MiniLM-L6-v2 outputs 384 dimensions and would not fit that schema).
model = SentenceTransformer('all-mpnet-base-v2')

# Encode the query text
query_text = "machine learning"
query_vector = model.encode(query_text).tolist()

# Open the collection
collection = Collection("text_embeddings")
collection.load()  # load into memory

# Search parameters (HNSW index); metric_type must match the metric the index was built with ("L2" above)
search_params = {
    "metric_type": "L2",
    "params": {"ef": 50}
}

# Run the search
results = collection.search(
    data=[query_vector],  # a single query vector
    anns_field="embedding",
    param=search_params,
    limit=5,  # return the top 5
    output_fields=["text"]  # only fields defined in the schema can be returned
)

# Process the results
for hits in results:
    print(f"Texts most similar to '{query_text}':")
    for i, hit in enumerate(hits):
        print(f"Top-{i+1}: ID={hit.id}, distance={hit.distance:.4f}")
        print(f"  Text: {hit.entity.get('text')}")

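Performance tuning tips

Item	Details
Choose index parameters sensibly	HNSW index: larger M and efConstruction improve recall but increase index size and build time. IVF index: a larger nlist makes each cluster smaller and faster to scan, but nprobe usually has to grow with it to keep recall.
Batch search	Search several query vectors in one call: query_vectors = [model.encode("question 1").tolist(), model.encode("question 2").tolist()]; results = collection.search(data=query_vectors, …)
Filter conditions	Combine the vector search with a scalar filter through expr (assuming the schema has category and timestamp fields): results = collection.search(data=query_vectors, anns_field="embedding", param=search_params, limit=10, expr="category == 'tech' and timestamp > 1690000000")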

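Common problems

Problem	Fix
Vector dimension mismatch	The query vector must have the same dimension as the dim set when the collection was created
Collection not loaded	Call collection.load() before searching to load the collection into memory
Unexpected distance values	Check whether the vectors are normalized (this matters most with IP) and that metric_type matches the index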
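
Two of the problems above, dimension mismatch and missing normalization, can be caught on the client before a search request is sent. The following is a minimal sketch; prepare_query_vector is a hypothetical helper, not a pymilvus function, and the two-dimensional input is only for illustration.

```python
import math


def prepare_query_vector(vector, dim, normalize=False):
    """Validate the dimension of a query vector and optionally scale it to unit length."""
    if len(vector) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vector)}")
    if not normalize:
        return list(vector)
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


prepared = prepare_query_vector([3.0, 4.0], dim=2, normalize=True)
print(prepared)
```

Failing early with a clear message is usually easier to debug than a server-side error raised in the middle of a batch search.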

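collection.query(): scalar-field filtering. It retrieves exactly the data you need through boolean filter expressions, pagination, and a consistency level.

Parameter	Purpose
expr	Filter expression (e.g. "id > 100 and category == 'image'"). An empty string applies no filter and matches all records; in that case limit must be set.
output_fields	List of fields to return (e.g. ["id", "embedding", "text"]). Only the primary key is returned by default, so other fields must be listed explicitly.
limit	Maximum number of records to return (the page size), similar to LIMIT in SQL.
offset	Starting offset into the results (used for pagination), similar to OFFSET in SQL.
consistency_level	Data consistency level for the query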

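Consistency levels

Level	Characteristics
Strong	Strong consistency: the query waits until all preceding writes are visible, so it always returns the latest data. Suited to workloads that need the freshest data.
Bounded	Bounded staleness: data may lag behind by a limited time window. Suited to the common case that balances performance and consistency.
Eventually	Eventual consistency: a query may return stale data, but the system converges to a consistent state over time. Suited to large-scale queries with loose freshness requirements.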

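Query expression syntax: the expr parameter supports rich boolean expressions (category and timestamp below are assumed extra scalar fields)

# Example 1: primary-key range
expr = "id >= 100 and id <= 200"

# Example 2: exact string match
expr = "category == 'image' or category == 'text'"

# Example 3: string prefix match (vector fields and distances cannot be used in expr; use search() for those)
expr = "text like 'text description%'"

# Example 4: combined conditions (timestamp is assumed to be an integer Unix time)
expr = "id > 0 and (category == 'image' or timestamp > 1672531200)"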
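
Filter expressions are plain strings, so they are often assembled from Python values. The sketch below shows one way to do that safely. The helper names are hypothetical, and escaping a single quote with a backslash is an assumption about the expression grammar that should be checked against the Milvus version in use.

```python
def quote(value):
    """Render a Python value as a literal inside a filter expression."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        # Assumption: backslash escaping is accepted inside single-quoted strings.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    return str(value)


def in_expr(field, values):
    """Build a `field in [...]` clause."""
    return f"{field} in [{', '.join(quote(v) for v in values)}]"


def all_of(*clauses):
    """Join clauses with `and`, parenthesizing each one."""
    return " and ".join(f"({clause})" for clause in clauses)


id_filter = in_expr("id", [1, 2, 3])
combined = all_of("id > 0", in_expr("category", ["image", "text"]))
print(id_filter)
print(combined)
```

The resulting strings can be passed as expr to collection.query(), collection.delete(), or collection.search().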

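Pagination example: page through the results by adjusting offset and limit. Milvus caps offset + limit (16384 by default), so very deep scans are better served by collection.query_iterator() in recent pymilvus versions.

batch_size = 1000  # 1000 records per page
offset = 0

while True:
    results = collection.query(
        expr="",  # no filter: fetch all records (requires limit)
        output_fields=["id", "text"],
        limit=batch_size,
        offset=offset,
        consistency_level="Strong"
    )

    if not results:
        break  # no more data

    print(f"Fetched {len(results)} records, offset: {offset}")

    # Process the results...
    for row in results:
        print(f"ID: {row['id']}, text: {row['text']}")

    offset += batch_size  # move to the next page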
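
The loop above can be wrapped in a reusable generator. The sketch below is independent of pymilvus: fetch_page is a hypothetical callable that would wrap collection.query(limit=..., offset=...), and max_window models the assumed cap on offset + limit. A plain list stands in for the collection so the example runs on its own.

```python
def paginate(fetch_page, batch_size, max_window=16384):
    """Yield rows page by page until a short page or the offset window is reached."""
    offset = 0
    while offset + batch_size <= max_window:
        rows = fetch_page(offset, batch_size)
        yield from rows
        if len(rows) < batch_size:
            return  # a short (or empty) page means there is no more data
        offset += batch_size


# Stand-in for a collection: 25 rows served through offset and limit.
fake_table = [{"id": i} for i in range(25)]


def fake_fetch(offset, limit):
    return fake_table[offset:offset + limit]


ids = [row["id"] for row in paginate(fake_fetch, batch_size=10)]
print(len(ids))
```

Stopping on a short page saves one round trip compared with waiting for an empty result.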

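Comparison

Aspect	query()	search()
Query type	Boolean filtering on scalar fields	Approximate search by vector similarity
Core parameters	expr, output_fields	data, anns_field, param
Typical use	Exact filtering (e.g. by ID or by time)	Semantic search and recommendation (e.g. finding similar images)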

6. Partition Management

Creating a partition

# Create a partition
collection.create_partition(partition_name="partition_1")

# List all partitions
partitions = collection.partitions
print(partitions)

Inserting data into a partition

collection.insert(entities, partition_name="partition_1")
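
When rows are routed to partitions by some attribute, it helps to group them first so that each partition receives a single insert call. The following is a small sketch; the partition naming scheme (a prefix plus the attribute value) and the helper name are hypothetical choices, not pymilvus conventions.

```python
from collections import defaultdict


def group_by_partition(rows, key, prefix="partition_"):
    """Group row dicts by `row[key]` and map each group to a partition name."""
    groups = defaultdict(list)
    for row in rows:
        groups[f"{prefix}{row[key]}"].append(row)
    return dict(groups)


rows = [
    {"category": "image", "text": "a"},
    {"category": "text", "text": "b"},
    {"category": "image", "text": "c"},
]

grouped = group_by_partition(rows, key="category")
for partition_name, members in grouped.items():
    print(partition_name, len(members))
```

Each group could then be converted to the column layout and passed to collection.insert(..., partition_name=partition_name), provided the partition has been created beforehand.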

Searching within a partition

results = collection.search(
    data=query_vectors,
    anns_field="embedding",
    param=search_params,
    limit=10,
    partition_names=["partition_1"]
)

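7. Asynchronous Operations

# Asynchronous insert: pymilvus 2.x uses the _async keyword and returns a future
future = collection.insert(entities, _async=True)
print(future.done())  # whether the insert has finished; other work can run in the meantime

# Block until the result is available
result = future.result()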

8. Exception Handling

from pymilvus import MilvusException

try:
    collection = Collection("non_existent_collection")
except MilvusException as e:
    print(f"Error: {e.code}, {e.message}")

Common index types and parameters

Index type	Use case	Example parameters
HNSW	High-recall search	{"metric_type": "L2", "index_type": "HNSW", "params": {"M": 8, "efConstruction": 64}}
IVF_FLAT	Balance of speed and accuracy	{"metric_type": "L2", "index_type": "IVF_FLAT", "params": {"nlist": 1024}}
GPU_IVF_FLAT	GPU acceleration (requires a GPU-enabled Milvus deployment)	{"metric_type": "L2", "index_type": "GPU_IVF_FLAT", "params": {"nlist": 1024}}

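9. The utility Module

In pymilvus, the utility module provides helper functions for inspecting the state and metadata of a Milvus deployment. These functions do not operate on the data itself; they help you understand and monitor the system.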

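9.1 Core features and APIs

Connection check

from pymilvus import connections, utility

# utility has no connection-check function; connection state is queried through connections
print(connections.has_connection("default"))  # returns a bool

# List every registered alias together with its connection handle
print(connections.list_connections())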

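Service status

# Get the Milvus server version
print(utility.get_server_version())  # e.g. '2.3.0'

# utility has no dedicated health-check function; a successful
# get_server_version() round trip shows that the server is reachable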

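Collection and partition management

# List all collections
print(utility.list_collections())  # returns a list of collection names

# Check whether a collection exists
print(utility.has_collection("text_embeddings"))  # returns a bool

# Entity count: read it from the Collection object (utility has no count function)
print(Collection("text_embeddings").num_entities)  # returns an int

# List the partitions of a collection, or test for one by name
print(Collection("text_embeddings").partitions)
print(utility.has_partition("text_embeddings", "partition_1"))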

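Index management

# Index objects of the collection (utility has no get_index_info function)
print(Collection("text_embeddings").indexes)
# Build progress of the index, as a dict with total_rows and indexed_rows
print(utility.index_building_progress("text_embeddings"))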

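Statistics

# utility has no statistics function; the schema (vector dimension, field types)
# and the row count are available on the Collection object
stats_collection = Collection("text_embeddings")
print(stats_collection.schema, stats_collection.num_entities)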

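Background task status

# pymilvus exposes no generic task ID; index builds and loads are tracked per collection
print(utility.index_building_progress("text_embeddings"))  # dict with total_rows and indexed_rows
print(utility.loading_progress("text_embeddings"))         # dict with a loading_progress percentage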

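Load state

# Check whether the collection is loaded into memory (required for vector search)
print(utility.load_state("text_embeddings"))  # a LoadState value such as Loaded or NotLoad

# Get the loading progress of the collection
print(utility.loading_progress("text_embeddings"))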

9.2 Practical examples

Example 1: check the environment and the connection

if connections.has_connection("default"):
    print("Connected to Milvus")
    print(f"Server version: {utility.get_server_version()}")
else:
    print("Not connected to Milvus")

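Example 2: verify that a collection exists and read its details

from pymilvus import Collection, utility
from pymilvus.client.types import LoadState

collection_name = "text_embeddings"

if utility.has_collection(collection_name):
    entity_count = Collection(collection_name).num_entities
    print(f"Collection '{collection_name}' exists and holds {entity_count} records")

    if utility.load_state(collection_name) == LoadState.Loaded:
        print("Collection is loaded into memory and ready for search")
    else:
        print("Collection is not loaded; call collection.load() first")
else:
    print(f"Collection '{collection_name}' does not exist")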

Example 3: monitor an index build

import time

# create_index returns no task ID; build progress is tracked per collection
collection.create_index(
    field_name="embedding",
    index_params=index_params
)

# Poll the build progress
while True:
    progress = utility.index_building_progress("text_embeddings")
    print(f"Indexed {progress['indexed_rows']} of {progress['total_rows']} rows")

    if progress["indexed_rows"] >= progress["total_rows"]:
        print("Index build finished")
        break

    time.sleep(1)  # check once per second
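
A polling loop like the one above never returns if the build stalls, so a timeout is a sensible addition. The sketch below is a generic helper that does not depend on pymilvus: wait_until is a hypothetical name, and index_is_built is a stand-in for a check that would call utility.index_building_progress() in real code.

```python
import time


def wait_until(check, interval=1.0, timeout=60.0, sleep=time.sleep, clock=time.monotonic):
    """Call `check` repeatedly until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while not check():
        if clock() >= deadline:
            return False  # gave up: the condition never became true in time
        sleep(interval)
    return True


# Simulated progress values in place of real server responses.
progress_values = iter([0, 40, 100])


def index_is_built():
    return next(progress_values) >= 100


finished = wait_until(index_is_built, interval=0.01, timeout=5.0)
print(finished)
```

Passing sleep and clock as parameters keeps the helper easy to test without real delays.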