Official website: https://www.oceanbase.ai/
I tried the embedded mode.
Installation
python3 par/pip.pyz install pyseekdb --break-system-packages -i https://pypi.tuna.tsinghua.edu.cn/simple
The install uninstalled numpy 2.3.4 and installed numpy 1.26.4; nothing else seemed to be affected.
Attempting uninstall: numpy
Found existing installation: numpy 2.3.4
Uninstalling numpy-2.3.4:
Successfully uninstalled numpy-2.3.4
Successfully installed anyio-4.11.0 certifi-2025.11.12 charset_normalizer-3.4.4 coloredlogs-15.0.1 filelock-3.20.0 flatbuffers-25.9.23 fsspec-2025.10.0 h11-0.16.0 hf-xet-1.2.0 httpcore-1.0.9 httpx-0.28.1 huggingface_hub-0.36.0 humanfriendly-10.0 idna-3.11 mpmath-1.3.0 numpy-1.26.4
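To double-check which numpy version ended up in the environment after the install, something like the following works. It is a small sketch using only the standard library; the helper name `installed_version` is my own and is not part of pip or pyseekdb.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages touched by the install above.
for name in ("numpy", "pyseekdb"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If other projects in the same environment need numpy 2.x, this is the quickest way to notice the downgrade before something else breaks.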
Then I copied the code from the documentation at https://www.oceanbase.ai/docs/zh-CN/using-seekdb-in-python-sdk into a file:
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction
# ==================== Step 1: Create the connection ====================
# You can use embedded mode, server mode, or OceanBase mode
# For this example, we'll use embedded mode (you can change to server or OceanBase mode)
# Server mode (connecting to remote seekdb server)
# client = pyseekdb.Client(
#     host="127.0.0.1",
#     port=2881,
#     database="test",
#     user="root",
#     password=""
# )
# Connect to embedded seekdb
client = pyseekdb.Client(
    path="./seekdb",
    database="test"
)
# Alternative: OceanBase mode
# client = pyseekdb.OBClient(
#     host="127.0.0.1",
#     port=11402,
#     tenant="mysql",
#     database="test",
#     user="root",
#     password=""
# )
# ==================== Step 2: Create a collection with an embedding function ====================
# A collection is like a table that stores documents with vector embeddings
collection_name = "my_simple_collection"
# Create the collection with the default embedding function
# The embedding function automatically converts documents into vectors
collection = client.create_collection(
    name=collection_name,
    embedding_function=DefaultEmbeddingFunction()  # Use the default embedding function (384 dimensions)
)
print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")
# ==================== Step 3: Add data to the collection ====================
# With an embedding function, you can add documents directly without providing vectors
# The embedding function automatically generates vectors from the documents
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text"
]
ids = ["id1", "id2", "id3", "id4", "id5"]
# Add documents only - vectors are generated automatically by the embedding function
collection.add(
    ids=ids,
    documents=documents,  # Vectors will be automatically generated
    metadatas=[
        {"category": "AI", "index": 0},
        {"category": "Programming", "index": 1},
        {"category": "Database", "index": 2},
        {"category": "AI", "index": 3},
        {"category": "NLP", "index": 4}
    ]
)
print(f"\nAdded {len(documents)} documents to collection")
print("Note: Vectors were automatically generated from documents using the embedding function")
# ==================== Step 4: Query the collection ====================
# With embedding function, you can query using text directly
# The embedding function automatically converts the query text into a query vector
query_text = "artificial intelligence and machine learning"
results = collection.query(
    query_texts=query_text,  # Query text - will be embedded automatically
    n_results=3  # Return the top 3 documents
)
print(f"\nQuery: '{query_text}'")
print(f"Query results: {len(results)} items found")
# ==================== Step 5: Print the query results ====================
for i, item in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(f" ID: {item._id}")
    print(f" Distance: {item.distance:.4f}")
    print(f" Document: {item.document}")
    print(f" Metadata: {item.metadata}")
# ==================== Step 6: Clean up the collection ====================
# Delete the collection
client.delete_collection(collection_name)
print(f"\nDeleted collection '{collection_name}'")
When the collection is added, the model is downloaded automatically:
model.onnx: 100%|█████████████████████████████████████████████████████████████████| 86.2M/86.2M [00:10<00:00, 8.56MiB/s]
tokenizer.json: 455kiB [00:00, 1.42MiB/s]
config.json: 612iB [00:00, 2.73MiB/s]
special_tokens_map.json: 112iB [00:00, 724kiB/s]
tokenizer_config.json: 350iB [00:00, 2.00MiB/s]
vocab.txt: 226kiB [00:00, 774kiB/s]
The code that prints the query results fails:
Result 1:
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
AttributeError: 'str' object has no attribute '_id'
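I changed it to print the whole result, as below, and got output. The result is a dict whose keys are ids, distances, documents and metadatas, so iterating over it yields key strings (hence the 'str' error), and the field is named ids, not _id.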
>>> print(results)
{'ids': [['id1', 'id4', 'id5']], 'distances': [[0.3007760478126502, 0.5982640429376553, 0.685555282270486]], 'documents': [['Machine learning is a subset of artificial intelligence', 'Neural networks are inspired by the human brain', 'Natural language processing helps computers understand text']], 'metadatas': [[{'index': 0, 'category': 'AI'}, {'index': 3, 'category': 'AI'}, {'index': 4, 'category': 'NLP'}]]}
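Since the result is a dict of parallel lists, with one inner list per query text, the per-result loop can be written with zip. A minimal sketch follows, using the dict printed above as literal data so that it runs without seekdb. The helper `iter_hits` is my own name, and I am assuming the layout is always the one shown in that output.

```python
# The query result printed above, pasted in as literal data.
results = {
    "ids": [["id1", "id4", "id5"]],
    "distances": [[0.3007760478126502, 0.5982640429376553, 0.685555282270486]],
    "documents": [[
        "Machine learning is a subset of artificial intelligence",
        "Neural networks are inspired by the human brain",
        "Natural language processing helps computers understand text",
    ]],
    "metadatas": [[
        {"index": 0, "category": "AI"},
        {"index": 3, "category": "AI"},
        {"index": 4, "category": "NLP"},
    ]],
}


def iter_hits(results, query_index=0):
    """Yield (id, distance, document, metadata) tuples for one query text."""
    yield from zip(
        results["ids"][query_index],
        results["distances"][query_index],
        results["documents"][query_index],
        results["metadatas"][query_index],
    )


# Replacement for the failing Step 5 loop.
for i, (doc_id, distance, document, metadata) in enumerate(iter_hits(results), 1):
    print(f"\nResult {i}:")
    print(f" ID: {doc_id}")
    print(f" Distance: {distance:.4f}")
    print(f" Document: {document}")
    print(f" Metadata: {metadata}")
```

Note that `len(results)` in Step 4 counts the dict keys (4 here), not the number of hits; `len(results["ids"][0])` gives the hit count.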
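If you do not delete the collection, the next create_collection call raises an error. You can open an existing collection with the following syntax: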
>>> collection = client.get_collection(
...     name=collection_name,
...     embedding_function=DefaultEmbeddingFunction()  # Use the default embedding function (384 dimensions)
... )
The home page also shows another syntax:
import pyseekdb
client = pyseekdb.Client()
# create a knowledge base
collection = client.get_or_create_collection("product_database")
# Add product documents
collection.upsert(
    documents=[
        "Laptop Pro with 16GB RAM, 512GB SSD, and high-speed processor",
        "Gaming Laptop with 32GB RAM, 1TB SSD, and high-performance graphics",
        "Business Ultrabook with 8GB RAM, 256GB SSD, and long battery life",
        "Tablet with 6GB RAM, 128GB storage, and 10-inch display"
    ],
    metadatas=[
        {"category": "laptop", "ram": 16, "storage": 512, "price": 12000, "type": "professional"},
        {"category": "laptop", "ram": 32, "storage": 1000, "price": 25000, "type": "gaming"},
        {"category": "laptop", "ram": 8, "storage": 256, "price": 9000, "type": "business"},
        {"category": "tablet", "ram": 6, "storage": 128, "price": 6000, "type": "consumer"}
    ],
    ids=["1", "2", "3", "4"]
)
# Hybrid search for high-performance laptops
results = collection.query(
    query_texts=["powerful computer for professional work"],  # Vector search
    where={  # Relational filter
        "category": "laptop",
        "ram": {"$gte": 16}
    },
    where_document={"$contains": "RAM"},  # Full-text search
    n_results=2
)
print("\nResults:")
for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
    print(f" {i+1}. {doc}")
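To make clear what the where and where_document conditions select, here is the same filter applied by hand in plain Python to the four sample records. This is only my reading of the operators ($gte as "greater than or equal", $contains as a case-sensitive substring match, all conditions combined with AND), not how seekdb evaluates them internally, and the helper `matches` is hypothetical.

```python
# The four sample records from the example above (only the fields the filter uses).
documents = [
    "Laptop Pro with 16GB RAM, 512GB SSD, and high-speed processor",
    "Gaming Laptop with 32GB RAM, 1TB SSD, and high-performance graphics",
    "Business Ultrabook with 8GB RAM, 256GB SSD, and long battery life",
    "Tablet with 6GB RAM, 128GB storage, and 10-inch display",
]
metadatas = [
    {"category": "laptop", "ram": 16},
    {"category": "laptop", "ram": 32},
    {"category": "laptop", "ram": 8},
    {"category": "tablet", "ram": 6},
]


def matches(doc: str, meta: dict) -> bool:
    """Hand-written equivalent of the where / where_document filter (assumed semantics)."""
    return (
        meta["category"] == "laptop"   # "category": "laptop"
        and meta["ram"] >= 16          # "ram": {"$gte": 16}
        and "RAM" in doc               # where_document={"$contains": "RAM"}
    )


candidates = [doc for doc, meta in zip(documents, metadatas) if matches(doc, meta)]
for doc in candidates:
    print(doc)
```

Under these assumptions only the Laptop Pro and the Gaming Laptop pass the filter, so with n_results=2 they are the only possible hits; the vector search just decides their order.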
It can also create a new collection:
>>> client = pyseekdb.Client()
>>> collection = client.get_or_create_collection("collection_name")
[seekdb] seekdb has opened
For client/server mode, only the first part needs to change; everything else is the same:
import pyseekdb
client = pyseekdb.Client(
    host="127.0.0.1",  # server host
    port=2881,  # server port (default: 2881)
)
# create a knowledge base
collection = client.get_or_create_collection("product_database")
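Actually running it raises an error. It needs a running server to connect to, and the client connects over the MySQL protocol (through pymysql):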
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")
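Errno 111 is "connection refused", meaning nothing was listening on 127.0.0.1:2881. Before creating the client, a quick reachability check with the standard library can tell whether a server is there. This is a sketch; `port_is_open` is my own helper name, and it only checks that the TCP port accepts connections, not that a seekdb server is behind it.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if port_is_open("127.0.0.1", 2881):
    print("Something is listening on 2881; server mode should be able to connect.")
else:
    print("Nothing is listening on 2881; start a server or use embedded mode.")
```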




