Common Usage of the Milvus Vector Database

luxinfeng666

Article link: https://blog.youkuaiyun.com/luxinfeng666/article/details/131506186

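This article shows how to use the Milvus Python SDK to connect and disconnect a client and to create and manage collections, including defining fields and creating, renaming, and dropping collections. It also covers inserting, deleting, and querying data, creating vector indexes and searching with them, and managing partitions. Together, these examples walk through the main data operations in Milvus.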

Creating and closing a client connection

from pymilvus import connections
# Create a connection
connections.connect(
  alias="default",
  user='username',
  password='password',
  host='localhost',
  port='19530'
)

# Close the connection
connections.disconnect("default")

Managing Collections

Creating a Collection

from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

# Define the fields of the Collection
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="random", dtype=DataType.DOUBLE),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=1024)
]
# Create the Collection
schema = CollectionSchema(fields, "hello_milvus is the simplest demo to introduce the APIs")
hello_milvus = Collection("hello_milvus", schema)

Main parameters:

Parameter	Description	Option
using (optional)	By specifying the server alias here, you can choose in which Milvus server you create a collection.	N/A
shards_num (optional)	Number of the shards for the collection to create.	[1,16]
num_partitions (optional)	Number of logical partitions for the collection to create.	[1,4096]
**kwargs: collection.ttl.seconds (optional)	Collection time to live (TTL) is the expiration time of a collection. Data in an expired collection will be cleaned up and will not be involved in searches or queries. Specify TTL in the unit of seconds.	The value should be 0 or greater. 0 means TTL is disabled.
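
The ranges in the table can be checked on the client side before a collection is created. The following is a minimal sketch: `validate_collection_options` is a hypothetical helper (not part of pymilvus), and its bounds are copied from the table rather than read from the server.

```python
def validate_collection_options(shards_num, num_partitions, ttl_seconds):
    """Hypothetical helper: check options against the ranges listed in the table."""
    if not 1 <= shards_num <= 16:
        raise ValueError(f"shards_num must be in [1, 16], got {shards_num}")
    if not 1 <= num_partitions <= 4096:
        raise ValueError(f"num_partitions must be in [1, 4096], got {num_partitions}")
    if ttl_seconds < 0:
        raise ValueError(f"collection.ttl.seconds must be 0 or greater, got {ttl_seconds}")
    # Return the validated values keyed by the parameter names used in the table.
    return {
        "shards_num": shards_num,
        "num_partitions": num_partitions,
        "collection.ttl.seconds": ttl_seconds,
    }


options = validate_collection_options(shards_num=2, num_partitions=16, ttl_seconds=1800)
```

A TTL of 0 passes the check because, per the table, it means that TTL is disabled.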

Renaming a Collection

from pymilvus import utility
utility.rename_collection("old_collection", "new_collection") # Output: True

Modifying Collection properties

collection.set_properties(properties={"collection.ttl.seconds": 1800})

Retrieving Collection properties

from pymilvus import Collection
collection = Collection("book")  # Get an existing collection.

collection.schema                # Return the schema.CollectionSchema of the collection.
collection.description           # Return the description of the collection.
collection.name                  # Return the name of the collection.
collection.is_empty              # Return the boolean value that indicates if the collection is empty.
collection.num_entities          # Return the number of entities in the collection.
collection.primary_field         # Return the schema.FieldSchema of the primary key field.
collection.partitions            # Return the list[Partition] object.
collection.indexes               # Return the list[Index] object.
collection.properties            # Return the expiration time of data in the collection.

Dropping a collection (all data in the collection is deleted)

from pymilvus import utility
utility.drop_collection("book")

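Managing partitions

Partitions let you organize and query data more efficiently: you can insert data into a specific partition and then load and search only that partition, which speeds up queries and reduces resource usage.

Creating a partition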

from pymilvus import Collection
collection = Collection("book")      # Get an existing collection.
collection.create_partition("novel")

Checking whether a partition exists

from pymilvus import Collection
collection = Collection("book")      # Get an existing collection.
collection.has_partition("novel")

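Dropping a partition (release it first, then drop it)

from pymilvus import Collection
collection = Collection("book")           # Get an existing collection.
collection.partition("novel").release()   # Release the partition before dropping it.
collection.drop_partition("novel")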

Loading a partition

from pymilvus import Collection
collection = Collection("book")      # Get an existing collection.
collection.load(["novel"], replica_number=2)

# Alternatively, load through the Partition object of the collection.
partition = collection.partition("novel")   # Get an existing partition.
partition.load(replica_number=2)

Releasing a partition

from pymilvus import Collection
partition = Collection("book").partition("novel")   # Get an existing partition.
partition.release()

Managing data

Inserting data

import random
from pymilvus import Collection

# Column-based data: each inner list holds one field's values for all 2000 entities.
data = [
  [i for i in range(2000)],
  [str(i) for i in range(2000)],
  [i for i in range(10000, 12000)],
  [[random.random() for _ in range(2)] for _ in range(2000)]
]

# Append one more column of string values.
data.append([str("dy"*i) for i in range(2000)])

collection = Collection("book")      # Get an existing collection.
mr = collection.insert(data)         # Returns a MutationResult.
collection.flush()
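
The `data` list above is column-based, so every inner list must contain the same number of values. A minimal sketch of a check to run before inserting follows. `check_columns` is a hypothetical helper (not a pymilvus function), and the three sample columns are made up for illustration.

```python
import random


def check_columns(columns):
    """Return the entity count if every column has the same length, else raise."""
    lengths = {len(col) for col in columns}
    if len(lengths) != 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return lengths.pop()


# Three illustrative columns of five entities each.
columns = [
    list(range(5)),
    [str(i) for i in range(5)],
    [[random.random() for _ in range(2)] for _ in range(5)],
]

count = check_columns(columns)
rows = list(zip(*columns))  # Row view: one tuple per entity, handy for spot checks.
```

The row view is only for inspection; the insert call in the snippet above takes the column-based list.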

Deleting data

expr = "book_id in [0,1]"
from pymilvus import Collection
collection = Collection("book")      # Get an existing collection.
collection.delete(expr)
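
When the IDs to delete come from application code, the expression string can be built rather than typed by hand. This is a small sketch: `build_in_expr` is a hypothetical helper, and it handles only integer IDs.

```python
def build_in_expr(field, ids):
    """Build an expression such as 'book_id in [0, 1]' from integer IDs."""
    ids = list(ids)
    if not ids:
        raise ValueError("at least one id is required")
    if not all(isinstance(i, int) and not isinstance(i, bool) for i in ids):
        raise TypeError("this sketch only supports integer ids")
    return f"{field} in [{', '.join(str(i) for i in ids)}]"


expr = build_in_expr("book_id", [0, 1])
```

The resulting string can be passed to `collection.delete(expr)` in the same way as the hand-written expression above.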

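Managing indexes

A vector index is an organizational unit of metadata used to accelerate vector similarity search. Without an index built on the vectors, Milvus performs a brute-force search by default.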
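
To show what a brute-force search means, the sketch below compares the query with every stored vector and keeps the closest ones. It is a pure-Python illustration using Euclidean distance, not Milvus's implementation; the function names and sample vectors are made up.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query, vectors, limit):
    """Score every vector against the query and return the closest (index, distance) pairs."""
    scored = sorted((l2_distance(query, vec), idx) for idx, vec in enumerate(vectors))
    return [(idx, dist) for dist, idx in scored[:limit]]


vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [5.0, 5.0]]
hits = brute_force_search([0.1, 0.2], vectors, limit=2)
```

The cost grows linearly with the number of stored vectors, which is what an index is meant to avoid.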

Creating a vector index

index_params = {
  "metric_type":"L2",
  "index_type":"IVF_FLAT",
  "params":{"nlist":1024}
}

from pymilvus import Collection, utility
collection = Collection("book")      
collection.create_index(
  field_name="book_intro", 
  index_params=index_params
)

utility.index_building_progress("book")
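
An IVF-style index groups vectors into `nlist` clusters and, at search time, scans only the `nprobe` clusters closest to the query. The toy sketch below illustrates that idea in pure Python. It is not how Milvus implements IVF_FLAT: the centroids are fixed by hand instead of being learned by clustering, and all names are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Put each vector id into the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(vid)
    return buckets


def ivf_search(query, vectors, centroids, buckets, nprobe, limit):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in buckets[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:limit]


vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.3]]
centroids = [[0.0, 0.0], [5.0, 5.0]]  # nlist = 2, chosen by hand for this sketch
buckets = build_ivf(vectors, centroids)
top = ivf_search([0.1, 0.2], vectors, centroids, buckets, nprobe=1, limit=2)
```

With `nprobe=1`, only the three vectors in the first bucket are compared with the query. Raising `nprobe` scans more buckets, trading speed for recall.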

Creating a scalar index

A scalar index needs no index type or index parameters; create it directly.

from pymilvus import Collection

collection = Collection("book")   
collection.create_index(
  field_name="book_name", 
  index_name="scalar_index",
)
collection.load()

Dropping an index

Dropping an index deletes all index files of the collection.

from pymilvus import Collection
collection = Collection("book")      # Get an existing collection.
collection.drop_index()

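Search and query

Vector similarity search

A vector similarity search in Milvus computes the distance between the query vector and the vectors in the collection using the specified similarity metric, and returns the most similar results.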

from pymilvus import Collection
collection = Collection("book")      # Get an existing collection.
collection.load()                    # Load the collection into memory before searching.

search_params = {"metric_type": "L2", "params": {"nprobe": 10}, "offset": 5}

results = collection.search(
	data=[[0.1, 0.2]], 
	anns_field="book_intro", 
	param=search_params,
	limit=10, 
	expr=None,
	# set the names of the fields you want to retrieve from the search result.
	output_fields=['title'],
	consistency_level="Strong"
)

results[0].ids

results[0].distances

hit = results[0][0]
hit.entity.get('title')

# After the search completes, release the collection loaded in Milvus to reduce memory consumption
collection.release()

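Search parameters

Parameter	Description
data	Vectors to search with.
anns_field	Name of the field to search on.
param	Search parameters specific to the index. See https://milvus.io/docs/index.md for details.
offset	Number of results to skip in the returned set. The sum of this value and "limit" should be less than 16384.
limit	Number of most similar results to return. The sum of this value and "offset" should be less than 16384.
expr	Boolean expression used to filter attributes. See https://milvus.io/docs/boolean.md for details.
partition_names (optional)	List of names of the partitions to search in.
output_fields (optional)	Names of the fields to return. Vector fields are not supported in the current version.
timeout (optional)	Duration, in seconds, allowed for the RPC. When set to None, the client waits until the server responds or an error occurs.
round_decimal (optional)	Number of decimal places of the returned distances.
consistency_level (optional)	Consistency level of the search.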
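
The `offset` and `limit` rows describe a paging window whose sum is bounded. A minimal sketch of turning a page number into such a window follows. `page_window` and `MAX_WINDOW` are hypothetical names, and the bound is the one quoted in the table.

```python
MAX_WINDOW = 16384  # Bound on offset + limit, as quoted in the table.


def page_window(page, page_size):
    """Return (offset, limit) for a zero-based page number."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    offset = page * page_size
    if offset + page_size >= MAX_WINDOW:
        raise ValueError(f"offset + limit must be less than {MAX_WINDOW}")
    return offset, page_size


first = page_window(0, 10)
third = page_window(2, 10)
```

Each returned pair supplies the `offset` and `limit` values for one page of results.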

Querying by scalar fields

from pymilvus import Collection
collection = Collection("book")      # Get an existing collection.
collection.load()

res = collection.query(
  expr = "book_id in [2,4,6,8]",
  offset = 0,
  limit = 10, 
  output_fields = ["book_id", "book_intro"],
  consistency_level="Strong"
)