Trying out OceanBase seekdb

Latest recommended article published on 2025-11-28 08:52:23

License: CC 4.0 BY-SA

Tags: #oceanbase #database

Official website: https://www.oceanbase.ai/
I tried the embedded mode.

Installation:

python3 par/pip.pyz install pyseekdb --break-system-packages -i https://pypi.tuna.tsinghua.edu.cn/simple
This uninstalled numpy 2.3.4 and installed numpy-1.26.4; nothing else seemed to be affected.
  Attempting uninstall: numpy
    Found existing installation: numpy 2.3.4
    Uninstalling numpy-2.3.4:
      Successfully uninstalled numpy-2.3.4
Successfully installed anyio-4.11.0 certifi-2025.11.12 charset_normalizer-3.4.4 coloredlogs-15.0.1 filelock-3.20.0 flatbuffers-25.9.23 fsspec-2025.10.0 h11-0.16.0 hf-xet-1.2.0 httpcore-1.0.9 httpx-0.28.1 huggingface_hub-0.36.0 humanfriendly-10.0 idna-3.11 mpmath-1.3.0 numpy-1.26.4
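
Because the install swapped numpy versions, it can be worth confirming what actually ended up in the environment. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example, and the package names are the ones from the pip log above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the pip log above.
for name in ("numpy", "pyseekdb"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If other installed packages need numpy 2.x, running this before and after the install shows whether the downgrade happened.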

Then I copied the code from the documentation at https://www.oceanbase.ai/docs/zh-CN/using-seekdb-in-python-sdk into a file:

import pyseekdb
from pyseekdb import DefaultEmbeddingFunction

# ==================== Step 1: Create a connection ====================
# You can use embedded mode, server mode, or OceanBase mode
# For this example, we'll use embedded mode (you can change to server or OceanBase mode)

# Server mode (connecting to remote seekdb server)
# client = pyseekdb.Client(
#     host="127.0.0.1",
#     port=2881,
#     database="test",
#     user="root",
#     password=""
# )

# Connect to embedded seekdb
client = pyseekdb.Client(
    path="./seekdb",
    database="test"
)

# Alternative: OceanBase mode
# client = pyseekdb.OBClient(
#     host="127.0.0.1",
#     port=11402,
#     tenant="mysql",
#     database="test",
#     user="root",
#     password=""
# )

# ==================== Step 2: Create a collection with an embedding function ====================
# A collection is like a table that stores documents with vector embeddings
collection_name = "my_simple_collection"

# Create the collection with the default embedding function
# The embedding function automatically converts documents into vectors
collection = client.create_collection(
    name=collection_name,
    embedding_function=DefaultEmbeddingFunction()  # Use the default model (384 dimensions)
)

print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")

# ==================== Step 3: Add data to the collection ====================
# With an embedding function, you can add documents directly without providing vectors
# The embedding function automatically generates vectors from the documents

documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text"
]

ids = ["id1", "id2", "id3", "id4", "id5"]

# Add documents only; the embedding function generates the vectors automatically
collection.add(
    ids=ids,
    documents=documents,  # Vectors will be automatically generated
    metadatas=[
        {"category": "AI", "index": 0},
        {"category": "Programming", "index": 1},
        {"category": "Database", "index": 2},
        {"category": "AI", "index": 3},
        {"category": "NLP", "index": 4}
    ]
)

print(f"\nAdded {len(documents)} documents to collection")
print("Note: Vectors were automatically generated from documents using the embedding function")

# ==================== Step 4: Query the collection ====================
# With embedding function, you can query using text directly
# The embedding function automatically converts the query text into a query vector

# Query with text; the embedding function generates the query vector automatically
query_text = "artificial intelligence and machine learning"

results = collection.query(
    query_texts=query_text,  # Query text - will be embedded automatically
    n_results=3  # Return the top 3 documents
)

print(f"\nQuery: '{query_text}'")
print(f"Query results: {len(results)} items found")

# ==================== Step 5: Print the query results ====================
for i, item in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(f"  ID: {item._id}")
    print(f"  Distance: {item.distance:.4f}")
    print(f"  Document: {item.document}")
    print(f"  Metadata: {item.metadata}")

# ==================== Step 6: Clean up the collection ====================
# Delete the collection
client.delete_collection(collection_name)
print(f"\nDeleted collection '{collection_name}'")

When the collection is added, the embedding model is downloaded automatically:

model.onnx: 100%|█████████████████████████████████████████████████████████████████| 86.2M/86.2M [00:10<00:00, 8.56MiB/s]
tokenizer.json: 455kiB [00:00, 1.42MiB/s]
config.json: 612iB [00:00, 2.73MiB/s]
special_tokens_map.json: 112iB [00:00, 724kiB/s]
tokenizer_config.json: 350iB [00:00, 2.00MiB/s]
vocab.txt: 226kiB [00:00, 774kiB/s]

The code that prints the query results fails:

Result 1:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
AttributeError: 'str' object has no attribute '_id'

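I changed it to simply print the result, and that worked. The result turns out to be a dict keyed by ids, distances, documents, and metadatas, so there is no _id attribute; iterating over the dict yields its key strings, which is why the error mentions a 'str' object.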

>>> print(results)
{'ids': [['id1', 'id4', 'id5']], 'distances': [[0.3007760478126502, 0.5982640429376553, 0.685555282270486]], 'documents': [['Machine learning is a subset of artificial intelligence', 'Neural networks are inspired by the human brain', 'Natural language processing helps computers understand text']], 'metadatas': [[{'index': 0, 'category': 'AI'}, {'index': 3, 'category': 'AI'}, {'index': 4, 'category': 'NLP'}]]}
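
Given that shape, the per-result printing from Step 5 can be recovered by zipping the parallel lists. This is a minimal sketch that works on the dict printed above (distances rounded here for brevity), so it runs without seekdb; `iter_hits` is a hypothetical helper name, not part of pyseekdb.

```python
# Result dict copied from the printed output above (distances rounded).
results = {
    "ids": [["id1", "id4", "id5"]],
    "distances": [[0.3008, 0.5983, 0.6856]],
    "documents": [[
        "Machine learning is a subset of artificial intelligence",
        "Neural networks are inspired by the human brain",
        "Natural language processing helps computers understand text",
    ]],
    "metadatas": [[
        {"index": 0, "category": "AI"},
        {"index": 3, "category": "AI"},
        {"index": 4, "category": "NLP"},
    ]],
}


def iter_hits(results, query_index=0):
    """Yield (id, distance, document, metadata) tuples for one query."""
    return zip(
        results["ids"][query_index],
        results["distances"][query_index],
        results["documents"][query_index],
        results["metadatas"][query_index],
    )


hits = list(iter_hits(results))
for rank, (doc_id, distance, document, metadata) in enumerate(hits, 1):
    print(f"\nResult {rank}:")
    print(f"  ID: {doc_id}")
    print(f"  Distance: {distance:.4f}")
    print(f"  Document: {document}")
    print(f"  Metadata: {metadata}")
```

The outer list has one entry per query text, hence the `query_index`. Note also that `len(results)` on this dict counts its four keys, not the number of hits; `len(results["ids"][0])` gives the hit count.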

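If the collection is not deleted, the next create call raises an error. An existing collection can be opened with the following syntax: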

>>> collection = client.get_collection(
        name=collection_name,
        embedding_function=DefaultEmbeddingFunction()  # Use the default model (384 dimensions)
 )
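
If a script may run both before and after the collection exists, one option is to try creating it and fall back to opening it. This is only a sketch under assumptions: `open_or_create` is a hypothetical helper, the exception that `create_collection` raises for an existing name is not identified in this post, so the code catches `Exception` broadly, and `FakeClient` is an in-memory stand-in that lets the example run without seekdb.

```python
def open_or_create(client, name, **kwargs):
    """Create the collection, or open the existing one if creation fails."""
    try:
        return client.create_collection(name=name, **kwargs)
    except Exception:  # assumption: the real exception type is not known here
        return client.get_collection(name=name, **kwargs)


class FakeClient:
    """In-memory stand-in for a pyseekdb client, for illustration only."""

    def __init__(self):
        self._collections = {}

    def create_collection(self, name, **kwargs):
        if name in self._collections:
            raise ValueError(f"collection {name!r} already exists")
        self._collections[name] = {"name": name}
        return self._collections[name]

    def get_collection(self, name, **kwargs):
        return self._collections[name]


client = FakeClient()
first = open_or_create(client, "my_simple_collection")   # created
second = open_or_create(client, "my_simple_collection")  # reopened
print(first is second)
```

With a real client, narrowing the `except` clause to the specific exception would avoid hiding unrelated failures.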

The home page shows another syntax:

import pyseekdb

client = pyseekdb.Client()

# create a knowledge base
collection = client.get_or_create_collection("product_database")
# Add product documents
collection.upsert(
    documents=[
        "Laptop Pro with 16GB RAM, 512GB SSD, and high-speed processor",
        "Gaming Laptop with 32GB RAM, 1TB SSD, and high-performance graphics",
        "Business Ultrabook with 8GB RAM, 256GB SSD, and long battery life",
        "Tablet with 6GB RAM, 128GB storage, and 10-inch display"
    ],
    metadatas=[
        {"category": "laptop", "ram": 16, "storage": 512, "price": 12000, "type": "professional"},
        {"category": "laptop", "ram": 32, "storage": 1000, "price": 25000, "type": "gaming"},
        {"category": "laptop", "ram": 8, "storage": 256, "price": 9000, "type": "business"},
        {"category": "tablet", "ram": 6, "storage": 128, "price": 6000, "type": "consumer"}
    ],
    ids=["1", "2", "3", "4"]
)

# Hybrid search for high-performance laptops
results = collection.query(
    query_texts=["powerful computer for professional work"],  # Vector search
    where={                                                   # Relational filter
        "category": "laptop",
        "ram": {"$gte": 16}
    },
    where_document={"$contains": "RAM"},                      # Full-text search
    n_results=2
)

print("\nResults:")
for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
    print(f"  {i+1}. {doc}")
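
To make the two filters in this hybrid query concrete, here is a pure-Python sketch of which sample rows they would let through. It assumes that a plain value in `where` means equality, that `$gte` means greater than or equal, and that `$contains` is a case-sensitive substring test. That reading comes from the comments in the example, not from seekdb's implementation, and the helper names are made up. The vector ranking step is left out.

```python
ids = ["1", "2", "3", "4"]
documents = [
    "Laptop Pro with 16GB RAM, 512GB SSD, and high-speed processor",
    "Gaming Laptop with 32GB RAM, 1TB SSD, and high-performance graphics",
    "Business Ultrabook with 8GB RAM, 256GB SSD, and long battery life",
    "Tablet with 6GB RAM, 128GB storage, and 10-inch display",
]
metadatas = [
    {"category": "laptop", "ram": 16},
    {"category": "laptop", "ram": 32},
    {"category": "laptop", "ram": 8},
    {"category": "tablet", "ram": 6},
]


def matches_where(metadata, where):
    """Assumed semantics: plain value = equality, {"$gte": x} = value >= x."""
    for field, condition in where.items():
        value = metadata.get(field)
        if isinstance(condition, dict):
            if "$gte" in condition and (value is None or value < condition["$gte"]):
                return False
        elif value != condition:
            return False
    return True


def matches_document(document, where_document):
    """Assumed semantics: {"$contains": s} = s is a substring of the document."""
    needle = where_document.get("$contains")
    return needle is None or needle in document


where = {"category": "laptop", "ram": {"$gte": 16}}
where_document = {"$contains": "RAM"}

candidates = [
    doc_id
    for doc_id, doc, meta in zip(ids, documents, metadatas)
    if matches_where(meta, where) and matches_document(doc, where_document)
]
print(candidates)
```

Under these assumptions only the two laptops with at least 16 GB of RAM remain as candidates, and the vector search then ranks them against the query text.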

It can also create a new collection:

>>> client = pyseekdb.Client()
>>> collection = client.get_or_create_collection("collection_name")
[seekdb] seekdb has opened

For client/server mode, only the first part needs to change; everything else stays the same:

import pyseekdb

client = pyseekdb.Client(
        host = "127.0.0.1",          # server host
        port = 2881,                 # server port (default: 2881)
        )

# create a knowledge base
collection = client.get_or_create_collection("product_database")

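Running it without a server listening on that port fails. The client connects through pymysql, so it needs a MySQL-protocol server at 127.0.0.1:2881: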

pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")
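
A quick way to tell "no server running" apart from other problems is to probe the port before creating the client. This is a minimal sketch using only the standard library; `server_reachable` is a hypothetical helper, and the host and port are the ones from the example above.

```python
import socket


def server_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Host and port from the client/server example above.
if not server_reachable("127.0.0.1", 2881):
    print("Nothing is listening on 127.0.0.1:2881; start the server first.")
```

This only checks that something accepts TCP connections on the port; it does not verify that the listener is a seekdb server or that the credentials are valid.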