A Complete Guide to Working with OpenSearch Serverless Using AWS SDK for pandas

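aws-sdk-pandas (aws/aws-sdk-pandas) is an AWS SDK for pandas that makes it convenient to access AWS services from Python. It suits developers who work with AWS and pandas and need programmatic access to AWS services. Project URL: https://gitcode.com/gh_mirrors/aw/aws-sdk-pandas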

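Introduction

OpenSearch Serverless is an on-demand, serverless deployment option for Amazon OpenSearch Service. It removes the work of provisioning and managing infrastructure, so developers can focus on analytics and search. This article walks through using AWS SDK for pandas (formerly AWS Data Wrangler, still imported as awswrangler) to work with OpenSearch Serverless: creating collections, creating indexes, and indexing and searching documents.

Environment Setup

Before starting, install AWS SDK for pandas with its OpenSearch extra: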

pip install 'awswrangler[opensearch]'

Once it is installed, import the required libraries:

import awswrangler as wr
import pandas as pd

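OpenSearch Serverless Core Concepts

Collection

In OpenSearch Serverless, a collection is a logical grouping of one or more indexes that represents an analytics workload. Every collection must be configured with three policies:

Encryption policy: defines how the collection's data is encrypted
Network policy: controls network access to the collection, either public or restricted to specific VPC endpoints
Data access policy: grants principals permissions on collections and indexes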
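
To make the first two policy types more concrete, here is a rough sketch of what encryption and network policy documents look like as JSON, following the policy format described in the OpenSearch Serverless documentation. The helper names and the placeholder key ARN and endpoint ID are assumptions made for illustration, and the output is not necessarily identical to what the SDK generates internally.

```python
import json
from typing import Any, Optional, Sequence


def encryption_policy(collection: str, kms_key_arn: Optional[str] = None) -> dict:
    """Build an encryption policy document for one collection (hypothetical helper)."""
    policy: dict = {
        "Rules": [
            {"ResourceType": "collection", "Resource": [f"collection/{collection}"]}
        ]
    }
    if kms_key_arn is None:
        # No key supplied: fall back to a key owned by AWS.
        policy["AWSOwnedKey"] = True
    else:
        policy["AWSOwnedKey"] = False
        policy["KmsARN"] = kms_key_arn
    return policy


def network_policy(collection: str, vpc_endpoints: Optional[Sequence[str]] = None) -> list:
    """Build a network policy document for one collection (hypothetical helper)."""
    rule: dict = {
        "Rules": [
            {"ResourceType": "collection", "Resource": [f"collection/{collection}"]},
            {"ResourceType": "dashboard", "Resource": [f"collection/{collection}"]},
        ]
    }
    if vpc_endpoints:
        # Restrict access to the listed VPC endpoints.
        rule["AllowFromPublic"] = False
        rule["SourceVPCEs"] = list(vpc_endpoints)
    else:
        rule["AllowFromPublic"] = True
    return [rule]


# Public access with an AWS-owned key.
print(json.dumps(encryption_policy("my-collection"), indent=2))
print(json.dumps(network_policy("my-collection"), indent=2))

# Locked-down variant; the endpoint ID is a placeholder, not a real resource.
print(json.dumps(network_policy("my-collection", ["vpce-EXAMPLE"]), indent=2))
```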

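Creating a Collection

Basic Collection Creation

Start by defining a data access policy. The example below grants the current identity all permissions on the collection and its indexes: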

data_access_policy = [
    {
        "Rules": [
            {
                "ResourceType": "index",
                "Resource": ["index/my-collection/*"],
                "Permission": ["aoss:*"],
            },
            {
                "ResourceType": "collection",
                "Resource": ["collection/my-collection"],
                "Permission": ["aoss:*"],
            },
        ],
        "Principal": [wr.sts.get_current_identity_arn()],
    }
]
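
When the same policy shape is needed for several collections or principals, it can help to generate it from parameters. The sketch below is one way to do that; `build_data_access_policy` is a hypothetical helper, the role ARN is a placeholder, and the narrower permission names are taken from the OpenSearch Serverless data access documentation rather than from the original example.

```python
from typing import Sequence


def build_data_access_policy(
    collection: str,
    principals: Sequence[str],
    index_permissions: Sequence[str] = ("aoss:*",),
    collection_permissions: Sequence[str] = ("aoss:*",),
) -> list:
    """Return a data access policy with the same structure as the example above."""
    return [
        {
            "Rules": [
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection}/*"],
                    "Permission": list(index_permissions),
                },
                {
                    "ResourceType": "collection",
                    "Resource": [f"collection/{collection}"],
                    "Permission": list(collection_permissions),
                },
            ],
            "Principal": list(principals),
        }
    ]


# A read-only variant for a placeholder role (not a real ARN).
read_only_policy = build_data_access_policy(
    "my-collection",
    principals=["arn:aws:iam::123456789012:role/example-reader"],
    index_permissions=["aoss:DescribeIndex", "aoss:ReadDocument"],
    collection_permissions=["aoss:DescribeCollectionItems"],
)
print(read_only_policy)
```

The returned list can be passed as `data_policy` in the same way as the hand-written policy.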

Creating the collection then takes a single call:

collection = wr.opensearch.create_collection(
    name="my-collection",
    data_policy=data_access_policy,
)
collection_endpoint = collection["collectionEndpoint"]

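By default, the SDK creates:

A network policy that allows public access
An encryption policy that uses an AWS-managed KMS key

Advanced Collection Configuration

For a more secure setup, specify a KMS key and VPC endpoints: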

kms_key_arn = "arn:aws:kms:..."
vpc_endpoint = "vpce-..."

collection = wr.opensearch.create_collection(
    name="my-secure-collection",
    data_policy=data_access_policy,
    kms_key_arn=kms_key_arn,
    vpc_endpoints=[vpc_endpoint],
)

Connecting to OpenSearch Serverless

Once the collection has been created, connect to it through its endpoint:

client = wr.opensearch.connect(host=collection_endpoint)
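
Before connecting, it can be useful to sanity-check the endpoint string, for example when it comes from configuration rather than from `create_collection`. The sketch below assumes endpoints of the form `https://<collection-id>.<region>.aoss.amazonaws.com`; the helper name and the sample endpoint are made up for illustration, and endpoints in other AWS partitions would need a different check.

```python
from urllib.parse import urlparse


def parse_collection_endpoint(endpoint: str) -> dict:
    """Split a collection endpoint into collection ID, region, and host."""
    # urlparse only fills in hostname when a scheme is present,
    # so fall back to the raw string for bare host names.
    host = urlparse(endpoint).hostname or endpoint
    parts = host.split(".")
    if len(parts) != 5 or parts[2:] != ["aoss", "amazonaws", "com"]:
        raise ValueError(f"not an OpenSearch Serverless endpoint: {endpoint!r}")
    return {"collection_id": parts[0], "region": parts[1], "host": host}


# Sample value only; a real endpoint comes from the created collection.
info = parse_collection_endpoint("https://abc123example.us-east-1.aoss.amazonaws.com")
print(info["collection_id"], info["region"])
```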

Index Operations

Creating an Index

index = "my-index-1"
wr.opensearch.create_index(client=client, index=index)

Document Operations

Indexing Documents

A list of Python dictionaries can be indexed directly:

documents = [
    {"_id": "1", "name": "John"},
    {"_id": "2", "name": "George"}, 
    {"_id": "3", "name": "Julia"}
]

result = wr.opensearch.index_documents(
    client,
    documents=documents,
    index=index,
)
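
The returned `result` is worth inspecting rather than discarding. The sketch below assumes the result is a dictionary with a `success` count and an `errors` list, which is how the return value is understood here; `check_index_result` is a hypothetical helper, and the keys should be adjusted if the installed version returns a different shape.

```python
def check_index_result(result: dict) -> int:
    """Raise if any document failed to index; otherwise return the success count."""
    errors = result.get("errors") or []
    if errors:
        raise RuntimeError(
            f"{len(errors)} document(s) failed to index; first error: {errors[0]}"
        )
    return int(result.get("success", 0))


# Stand-in for the dictionary returned by the indexing call.
sample_result = {"success": 3, "errors": []}
print(check_index_result(sample_result))
```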

Indexing a pandas DataFrame

df = pd.DataFrame([
    {"_id": "1", "name": "John", "tags": ["foo", "bar"]},
    {"_id": "2", "name": "George", "tags": ["foo"]}
])

result = wr.opensearch.index_df(
    client,
    df=df,
    index="index-df",
)

Searching Documents

Query with the OpenSearch query DSL. The matching documents come back as a pandas DataFrame:

search_result = wr.opensearch.search(
    client, 
    index=index, 
    search_body={"query": {"match": {"name": "Julia"}}}
)
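
Search bodies are plain dictionaries, so they can be assembled with small helper functions instead of being written out by hand each time. The helpers below are hypothetical and only build the DSL. The `tags.keyword` field assumes the index uses default dynamic mapping, where string fields receive a `keyword` subfield.

```python
def match_query(field: str, value: str, size: int = 10) -> dict:
    """Full-text match on a single field."""
    return {"size": size, "query": {"match": {field: value}}}


def filtered_match_query(field: str, value: str, filters: dict, size: int = 10) -> dict:
    """Full-text match combined with exact-value term filters."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {field: value}}],
                "filter": [{"term": {name: val}} for name, val in filters.items()],
            }
        },
    }


# Same query as above, plus a variant that keeps only documents tagged "foo".
print(match_query("name", "Julia"))
print(filtered_match_query("name", "John", {"tags.keyword": "foo"}, size=5))
```

Either dictionary can be passed as `search_body`.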

Deleting an Index

wr.opensearch.delete_index(client=client, index=index)

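Best Practices

Least privilege: in real deployments, grant only the permissions each principal needs instead of a broad wildcard such as aoss:*
Bulk operations: for large volumes of documents, send them in batches through the bulk API rather than one request per document
Error handling: inspect the result returned by each operation for error information
Resource cleanup: delete collections you no longer use to avoid unnecessary charges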
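
For the cleanup step, one option is to call the OpenSearch Serverless API through boto3. The sketch below takes the client as a parameter so the logic can be exercised without AWS access; `delete_collection_by_name` is a hypothetical helper, and it relies on the boto3 `opensearchserverless` client's `batch_get_collection` and `delete_collection` operations.

```python
from typing import Any, Optional


def delete_collection_by_name(aoss_client: Any, name: str) -> Optional[str]:
    """Look up a collection by name and delete it.

    Returns the deleted collection's ID, or None if no such collection exists.
    """
    response = aoss_client.batch_get_collection(names=[name])
    details = response.get("collectionDetails", [])
    if not details:
        return None
    collection_id = details[0]["id"]
    aoss_client.delete_collection(id=collection_id)
    return collection_id


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   delete_collection_by_name(boto3.client("opensearchserverless"), "my-collection")
```

The encryption, network, and data access policies are separate resources, so they may need to be removed separately if nothing else uses them.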

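Summary

With AWS SDK for pandas, working with OpenSearch Serverless comes down to a handful of calls that cover:

Creating and managing collections
Creating and deleting indexes
Indexing and searching documents
Direct integration with pandas DataFrames

Because collections, indexes, and documents are all handled from the same Python environment, this integration shortens the path from a pandas-based analysis workflow to OpenSearch Serverless.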

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.