A Complete Guide to Working with AWS CloudSearch Using the boto Library

Originally published 2025-06-06 09:20:39

Licensed under CC 4.0 BY-SA

boto: Python interface to Amazon Web Services. For the latest version of boto, see https://github.com/boto/boto3. Project URL: https://gitcode.com/gh_mirrors/bo/boto

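Overview

AWS CloudSearch is a fully managed search service that lets developers add search to their applications without running their own search infrastructure. boto, the original Python SDK for AWS (since superseded by boto3), provides a convenient interface to CloudSearch through its boto.cloudsearch module. This article walks through creating, configuring, and managing a CloudSearch domain with boto.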

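Environment Setup

Before you begin, make sure boto is installed and your AWS credentials are configured. Credentials can be supplied in either of two ways:

Pass them directly in code: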

import boto.cloudsearch
conn = boto.cloudsearch.connect_to_region("us-west-2",
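            aws_access_key_id='YOUR_ACCESS_KEY_ID',
            aws_secret_access_key='YOUR_SECRET_ACCESS_KEY')

Set them through environment variables:

export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY

The connection call then reduces to: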

import boto.cloudsearch
conn = boto.cloudsearch.connect_to_region("us-west-2")
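
When relying on environment variables, it can help to fail early if one is missing rather than at the first API call. The following is a minimal sketch and not part of boto: load_aws_credentials is a hypothetical helper that reads the two variables named above from a mapping (os.environ by default).

```python
import os


def load_aws_credentials(env=None):
    """Return the two credential values, or raise if either is missing.

    Hypothetical helper for illustration only; boto reads these
    variables itself when no keys are passed to connect_to_region.
    """
    env = os.environ if env is None else env
    names = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

The returned dictionary uses the same keyword names as the explicit form, so it could be passed as boto.cloudsearch.connect_to_region("us-west-2", **load_aws_credentials()).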

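Creating a Search Domain

A domain is the core concept in CloudSearch: it holds the indexed data, the index configuration, and the associated metadata. Creating one takes two lines: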

from boto.cloudsearch.domain import Domain
domain = Domain(conn, conn.create_domain('demo'))

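Configuring Access Policies

Before indexing documents, configure the access policies that control who can reach the search and document services:

# Allow a specific IP address
our_ip = '192.168.1.0'
policy = domain.get_access_policies()
policy.allow_search_ip(our_ip)  # allow search requests
policy.allow_doc_ip(our_ip)     # allow document operations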
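
A mistyped address in an access policy is easy to miss, so it may be worth validating the value before handing it to the policy methods. The sketch below uses only the standard-library ipaddress module; to_cidr is a hypothetical helper, and it assumes the policy methods accept CIDR notation as well as a bare address.

```python
import ipaddress


def to_cidr(address):
    """Validate an IP address or CIDR block and return it in CIDR form.

    Raises ValueError if the input is not a valid address or network.
    """
    # strict=False tolerates host bits, e.g. '192.168.1.7/24' -> '192.168.1.0/24'
    network = ipaddress.ip_network(address, strict=False)
    return str(network)


print(to_cidr('192.168.1.0'))  # 192.168.1.0/32
```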

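Defining Index Fields

CloudSearch allows up to 20 index fields per domain. A field is either text (text) or an unsigned integer (uint):

# Create a text field
uname_field = domain.create_index_field('username', 'text')

# Create an integer field with a default value
time_field = domain.create_index_field('last_activity', 'uint', default=0)

# Create the integer field used by the sample documents and the rank expressions later on
follower_field = domain.create_index_field('follower_count', 'uint', default=0)

# Create a field that can be used for faceted search
loc_field = domain.create_index_field('location', 'text', facet=True)

# Create a snippet field that can be returned directly in search results
snippet_field = domain.create_index_field('snippet', 'text', result=True)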
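
As the number of fields grows, a declarative list can be easier to review than a sequence of individual calls. This is a sketch under the assumption that the domain object exposes create_index_field(name, type, **options) exactly as used above; FIELD_SPECS, MAX_INDEX_FIELDS, and create_fields are illustrative names, and the list covers only a few of the fields from the example.

```python
MAX_INDEX_FIELDS = 20  # per-domain limit stated in the text above

# (name, type, keyword options) for a few of the fields shown above
FIELD_SPECS = [
    ("username", "text", {}),
    ("last_activity", "uint", {"default": 0}),
    ("location", "text", {"facet": True}),
    ("snippet", "text", {"result": True}),
]


def create_fields(domain, specs=FIELD_SPECS):
    """Create every field in specs on domain and return the created fields."""
    if len(specs) > MAX_INDEX_FIELDS:
        raise ValueError(
            "a domain supports at most %d index fields" % MAX_INDEX_FIELDS)
    return [domain.create_index_field(name, field_type, **options)
            for name, field_type, options in specs]
```

Calling create_fields(domain) would then issue one create_index_field call per entry, in order.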

Indexing Documents

With the fields in place, you can start indexing documents:

# Get the document service
doc_service = domain.get_document_service()

# Prepare the document data
users = [
    {
        'id': 1,
        'username': 'dan',
        'last_activity': 1334252740,
        'follower_count': 20,
        'location': 'USA',
        'snippet': 'Dan likes watching sunsets and rock climbing',
    },
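    # more users...
]

# Add the documents to the batch: add(id, version, fields)
for user in users:
    doc_service.add(user['id'], user['last_activity'], user)

# Commit the batch
result = doc_service.commit()

Note: to keep using the same document service after a commit, call clear_sdf() to clear its internal cache.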
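
For larger data sets, the add, commit, and clear_sdf() steps can be wrapped in a loop that sends documents in batches. The function below is a sketch, not boto code: index_in_batches is a hypothetical name, it relies only on the three document-service methods used above, and the default batch_size of 500 is an arbitrary illustrative value.

```python
def index_in_batches(doc_service, docs, version, batch_size=500):
    """Add docs in batches, committing and clearing the buffer after each.

    doc_service is expected to offer add(), commit() and clear_sdf().
    Returns the list of commit results, one per batch.
    """
    results = []
    pending = 0
    for doc in docs:
        doc_service.add(doc["id"], version, doc)
        pending += 1
        if pending == batch_size:
            results.append(doc_service.commit())
            doc_service.clear_sdf()  # reset the buffer before the next batch
            pending = 0
    if pending:  # flush the final partial batch
        results.append(doc_service.commit())
        doc_service.clear_sdf()
    return results
```

Here a single version number is applied to the whole run; the earlier example instead uses each user's last_activity value as the version.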

Running Searches

Once indexing completes, you can query the domain:

# Get the search service
search_service = domain.get_search_service()

# Simple search
results = search_service.search(q="dan")
print(results.hits)  # number of matching documents
print([x['id'] for x in results])  # IDs of the matching documents

# Boolean query (wildcards supported)
results = search_service.search(bq="'dan*'")
print([x['id'] for x in results])

# Boolean query with operators (OR/AND/NOT); here | means OR
results = search_service.search(bq="'watched|moved'")
print([x['id'] for x in results])

# Pagination: return 2 results starting at offset 2
results = search_service.search(bq="'dan*'", size=2, start=2)
print([x['id'] for x in results])
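
The size and start parameters can be combined into a loop that walks through every page of a result set. This is a minimal sketch that uses only what appears above (search() with bq, size, and start, the hits attribute, and iteration over results); iter_all_ids is a hypothetical helper name.

```python
def iter_all_ids(search_service, bq, page_size=10):
    """Yield the id of every document matching bq, one page at a time."""
    start = 0
    while True:
        results = search_service.search(bq=bq, size=page_size, start=start)
        page = [doc["id"] for doc in results]
        for doc_id in page:
            yield doc_id
        start += len(page)
        # Stop on an empty page or once every reported hit has been seen.
        if not page or start >= results.hits:
            break
```

A call such as list(iter_all_ids(search_service, "'dan*'", page_size=2)) would issue one request per page until the offset reaches the reported hit count.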

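Sorting Results

Results can be sorted by a field or by a custom rank expression:

query = "'dan*'"  # example Boolean query; any valid bq string works

# Sort by follower count, descending
results = search_service.search(bq=query, rank=['-follower_count'])

# Create custom rank expressions
domain.create_rank_expression('recently_active', 'last_activity')
domain.create_rank_expression('activish',
    'text_relevance + ((follower_count/(time() - last_activity))*1000)')

# Sort by a custom rank expression, descending
results = search_service.search(bq=query, rank=['-recently_active'])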
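
To get a feel for how the activish expression weighs recency against follower count, it can be reproduced locally. The function below is only an approximation in ordinary Python 3 float arithmetic: CloudSearch evaluates the real expression server-side and its numeric semantics may differ, and the now parameter stands in for the service's time() function.

```python
def activish(text_relevance, follower_count, last_activity, now):
    """Local approximation of the 'activish' rank expression.

    All time values are epoch seconds; now must be later than last_activity.
    """
    elapsed = now - last_activity  # seconds since the user was last active
    return text_relevance + (follower_count / elapsed) * 1000


now = 1334252740 + 1000
recent = activish(100, 20, 1334252740, now)         # active 1000 s ago
stale = activish(100, 20, 1334252740 - 9000, now)   # active 10000 s ago
print(recent > stale)  # True
```

With equal text relevance and follower counts, the more recently active user gets the higher score, which is the behavior the expression is designed to produce.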

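Stemming Configuration

Stemming maps related words to a common root, which improves search recall:

# Get the current stemming configuration
stems = domain.get_stemming()

# Add stem mappings
stems['stems']['running'] = 'run'
stems['stems']['ran'] = 'run'

# Save the changes
stems.save()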
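
The effect of a stem mapping can be illustrated locally with a plain dictionary lookup. This is only a conceptual sketch; CloudSearch applies stemming on the service side, and apply_stems is a hypothetical function, not a boto call.

```python
def apply_stems(tokens, stems):
    """Replace each token by its mapped stem, leaving unmapped tokens as-is."""
    return [stems.get(token, token) for token in tokens]


stem_map = {"running": "run", "ran": "run"}  # same mapping as above
print(apply_stems(["ran", "running", "walking"], stem_map))
# ['run', 'run', 'walking']
```

Because "ran" and "running" both reduce to "run", a query for either form can match documents containing the other.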

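Stopword Configuration

Stopwords are words that should normally be ignored during both indexing and searching:

# Get the current stopword list
stopwords = domain.get_stopwords()

# Add stopwords
stopwords['stopwords'].append('foo')
stopwords['stopwords'].append('bar')

# Save the changes
stopwords.save()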
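
Appending blindly can leave duplicate entries in the list if the script runs more than once. The sketch below assumes the stopword configuration behaves like the dictionary shown above, with a list under the 'stopwords' key; add_stopwords is a hypothetical helper.

```python
def add_stopwords(config, words):
    """Append words to config['stopwords'], skipping ones already present.

    Returns the list of words that were actually added.
    """
    existing = set(config["stopwords"])
    added = []
    for word in words:
        if word not in existing:
            config["stopwords"].append(word)
            existing.add(word)
            added.append(word)
    return added
```

A caller could then save the configuration only when the returned list is non-empty.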

Synonym Configuration

Synonyms let different words match the same documents:

# Get the current synonym configuration
synonyms = domain.get_synonyms()

# Add synonyms
synonyms['synonyms']['cat'] = ['feline', 'kitten']
synonyms['synonyms']['dog'] = ['canine', 'puppy']

# Save the changes
synonyms.save()
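
Assigning a list to a key, as above, replaces whatever synonyms were already stored for that word. Where existing entries should be kept, a merge along the following lines may be preferable. It is a sketch that assumes the configuration behaves like a dictionary with a 'synonyms' mapping, and merge_synonyms is a hypothetical name.

```python
def merge_synonyms(config, new_synonyms):
    """Merge new_synonyms into config['synonyms'] without dropping old entries."""
    table = config["synonyms"]
    for word, alternatives in new_synonyms.items():
        current = table.setdefault(word, [])
        for alt in alternatives:
            if alt not in current:  # avoid duplicate alternatives
                current.append(alt)
    return table
```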

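Deleting Documents

Deleting a document works much like adding one: queue the delete with a document ID and a version number, then commit: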

import time
from datetime import datetime

doc_service = domain.get_document_service()

# Use the current time as the version number
doc_service.delete(4, int(time.mktime(datetime.utcnow().timetuple())))
doc_service.commit()
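
One detail of the snippet above: time.mktime interprets its argument as local time, so feeding it a UTC datetime shifts the result by the machine's UTC offset. That is likely harmless if version numbers only need to keep increasing, but where a true UTC epoch value is wanted, the standard-library calendar.timegm is an alternative. The helper name utc_epoch is hypothetical.

```python
import calendar
from datetime import datetime


def utc_epoch(dt):
    """Convert a naive datetime holding UTC time to whole epoch seconds."""
    # calendar.timegm treats the time tuple as UTC, whereas
    # time.mktime treats it as local time.
    return calendar.timegm(dt.timetuple())


print(utc_epoch(datetime(2012, 4, 12, 17, 45, 40)))  # 1334252740
```

The printed value equals the last_activity timestamp used in the sample document earlier, which corresponds to 2012-04-12 17:45:40 UTC.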