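Search Project in Practice: Quickly Building a Search Engine with Elasticsearch
Data Modeling and Import Workflow
1 ) Data source
The project uses an Airbnb-style listing dataset (CSV format) with the following fields:
- bathrooms (number of bathrooms)
- room_type (room type)
- bedrooms (number of bedrooms)
- date_from / date_to (available rental date range)
- availability (current availability status)
- images (image URLs)
- description (text description)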
2 ) Key ES index mapping settings
PUT /test_airbnb
{
"mappings": {
"dynamic": false, // disable dynamic mapping
"properties": {
"bedrooms": {
"type": "integer" // numeric, so range filters such as gte: 3 work
},
"date_from": {
"type": "date",
"format": "yyyy-MM-dd"
},
"images": {
"type": "keyword",
"index": false // not indexed, to save resources
},
"name": {
"type": "text",
"analyzer": "autosuggest_analyzer" // custom autocomplete analyzer
},
"location": { "type": "geo_point" }
}
},
"settings": {
"analysis": {
"analyzer": {
"autosuggest_analyzer": {
"tokenizer": "standard",
"filter": ["lowercase", "autosuggest_filter"]
}
},
"filter": {
"autosuggest_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}
}
}
}
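Technical details:
- The edge_ngram token filter enables as-you-type autocomplete (for example, typing "be" matches "bedroom")
- The geo_point location field supports distance sorting and map display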
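To make the autocomplete behavior concrete, the following TypeScript sketch imitates what the edge_ngram filter emits for each token. The function names edgeNgrams and autosuggestTerms are illustrative only and are not part of Elasticsearch, and the tokenization is a simple ASCII split rather than the standard tokenizer's Unicode segmentation.

```typescript
// Emits the prefixes an edge_ngram filter would produce for a single token.
function edgeNgrams(token: string, minGram: number = 1, maxGram: number = 20): string[] {
  const grams: string[] = [];
  const upper = Math.min(maxGram, token.length);
  for (let len = minGram; len <= upper; len++) {
    grams.push(token.slice(0, len));
  }
  return grams;
}

// Rough stand-in for autosuggest_analyzer: lowercase, split on
// non-alphanumeric characters (ASCII-only assumption), then expand each token.
function autosuggestTerms(text: string): string[] {
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
  const terms: string[] = [];
  for (const token of tokens) {
    for (const gram of edgeNgrams(token)) {
      terms.push(gram);
    }
  }
  return terms;
}
```

Because a field's analyzer is also applied to the query text by default, a common refinement is to set a plain search_analyzer (for example standard) on the name field so that only the indexed side is expanded into prefixes.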
Or, with a fuller set of fields:
PUT /test_airbnb
{
"mappings": {
"dynamic": false,
"properties": {
"bath_type": {
"type": "text",
"fields": { "keyword": { "type": "keyword" } }
},
"date_from": {
"type": "date",
"format": "yyyy-MM-dd"
},
"date_to": {
"type": "date",
"format": "yyyy-MM-dd"
},
"host_image": {
"type": "keyword",
"index": false
},
"name": {
"type": "text",
"analyzer": "autosuggest_analyzer"
},
"property_type": {
"type": "text",
"analyzer": "standard"
},
"location": {
"type": "geo_point"
}
}
},
"settings": {
"analysis": {
"analyzer": {
"autosuggest_analyzer": {
"tokenizer": "standard",
"filter": ["lowercase", "autosuggest_filter"]
}
},
"filter": {
"autosuggest_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}
}
}
}
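Key configuration notes:
- The edge_ngram token filter provides autocomplete
- Dynamic mapping is turned off (dynamic: false) to keep unexpected fields out of the mapping
- Non-searchable fields (such as image URLs) set index: false to reduce index size
- The location field uses geo_point to support spatial queries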
3 ) Logstash data import configuration
input { file { path => "/data/airbnb.csv" } }
filter {
  csv {
    columns => ["bathrooms", "room_type", ...] # all column names
  }
  mutate { lowercase => ["availability"] } # lowercase the boolean values
}
output { elasticsearch { hosts => ["localhost:9200"] index => "test_airbnb" } }
Run it with: bin/logstash -f airbnb.conf
Data Import and Exploration with Kibana
Logstash import configuration
input {
file {
path => "/data/airbnb.csv"
start_position => "beginning"
}
}
filter {
csv {
separator => ","
columns => ["id", "name", "bath_type", "date_from", ...]
}
mutate {
lowercase => ["availability"]
}
date {
match => ["date_from", "yyyy-MM-dd"]
target => "date_from"
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "test_airbnb"
}
}
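Data processing notes:
- Normalize boolean fields (such as availability) to lowercase
- Convert dates with an explicitly specified format
- CSV column names must correspond exactly to the index fields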
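For readers who want to see the same steps outside Logstash, here is a minimal TypeScript sketch of the row-to-document transformation described above. The function name csvRowToDoc and the sample column list are assumptions made for illustration, and the split is naive: it assumes no field contains a quoted comma.

```typescript
type ListingDoc = { [field: string]: string };

// Maps one CSV line onto the configured column names, lowercasing the
// fields listed in lowercaseFields (mirrors the mutate/lowercase step).
function csvRowToDoc(
  columns: string[],
  line: string,
  lowercaseFields: string[] = ["availability"]
): ListingDoc {
  const values = line.split(","); // assumption: no quoted commas in the data
  if (values.length !== columns.length) {
    throw new Error("expected " + columns.length + " columns, got " + values.length);
  }
  const doc: ListingDoc = {};
  columns.forEach((column, i) => {
    const raw = (values[i] ?? "").trim();
    doc[column] = lowercaseFields.indexOf(column) !== -1 ? raw.toLowerCase() : raw;
  });
  return doc;
}

// Checks that a value matches the yyyy-MM-dd shape declared in the mapping.
function isIsoDay(value: string): boolean {
  return /^\d{4}-\d{2}-\d{2}$/.test(value);
}
```

Rejecting rows whose column count differs from the configured list is one way to enforce the strict correspondence between CSV columns and index fields.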
Kibana visualization tips
- Field display formatting:
  - URL fields: set Format: URL → Type: Link
  - Image fields: set Format: URL → Type: Image
- Example filter query:
  {
    "query": {
      "bool": {
        "must": [
          { "range": { "bedrooms": { "gte": 3 } } },
          { "range": { "price": { "lte": 400 } } }
        ]
      }
    }
  }
- Data table columns:
  - Host name (host)
  - Listing image (image_url)
  - Property type (property_type)
  - Price (price)
  - Booking link (booking_url)
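The filter query in the list above can also be assembled programmatically. The sketch below is one way to do it in TypeScript; buildListingFilter is a hypothetical helper name, and it simply reproduces the same bool/must structure with the two thresholds as parameters.

```typescript
// Builds the bool query from the example: at least minBedrooms bedrooms
// and a price no higher than maxPrice.
function buildListingFilter(minBedrooms: number, maxPrice: number) {
  return {
    query: {
      bool: {
        must: [
          { range: { bedrooms: { gte: minBedrooms } } },
          { range: { price: { lte: maxPrice } } }
        ]
      }
    }
  };
}
```

Since neither clause needs to influence relevance scoring, the two range clauses could equally be placed under filter instead of must.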
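Front-End Search UI: Quick Implementation with ReactiveSearch
1 ) Technology choice and deployment
- Framework: ReactiveSearch (a React-based front-end component library for ES)
- Deployment steps:
  # Install the library
  npm install -g @appbaseio/reactivesearch
  git clone https://github.com/appbaseio/reactivesearch-airbnb-demo
  cd reactivesearch-airbnb-demo
  yarn install # install dependencies
  yarn start # start the dev server (port 3000)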
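2 ) Core components and feature mapping
| Component | Function | ES field |
|---|---|---|
| <DataSearch> | Top search box | name |
| <DateRange> | Date filter | date_from |
| <NumberBox> | Bedroom count selector | bedrooms |
| <RangeSlider> | Price range slider | price |
| <ResultCard> | Listing card display | Rendered from multiple fields |
3 ) Autocomplete and result rendering example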
// Search box component (with autocomplete)
<DataSearch
  componentId="searchbox"
  dataField="name"
  autosuggest={true}
  placeholder="Search listing names"
/>
// Result card component
<ResultCard
  componentId="results"
  dataField="name"
  react={{ and: ["searchbox", "daterange"] }} // update when the linked components change
  onData={(res) => ({
    image: res.images[0],
    title: res.name,
    description: `${res.bedrooms} bedrooms · ${res.bathrooms} bathrooms`,
    price: `¥${res.price}/night`
  })}
/>
Or, for reference, the core component configuration:
<ReactiveBase
app="test_airbnb"
url="http://localhost:9200"
>
<DataSearch
componentId="searchbox"
dataField="name"
placeholder="Search rentals..."
autosuggest={true}
highlight={true}
/>
<RangeSlider
componentId="priceslider"
dataField="price"
title="Price Range"
range={{ start: 10, end: 1000 }}
/>
<MultiList
componentId="propertyfilter"
dataField="property_type.keyword"
title="Property Type"
/>
<ResultCard
componentId="results"
dataField="name"
pagination={true}
react={{ and: ["searchbox", "priceslider"] }}
onData={(res) => ({
image: res.image_url,
title: res.name,
description: `${res.bedrooms} bedrooms · ${res.bath_type}`,
price: `$${res.price}/night`
})}
/>
</ReactiveBase>
Feature extension example (adding a tag cloud)
import { TagCloud } from "@appbaseio/reactivesearch";
<TagCloud
componentId="tagfilter"
dataField="amenities.keyword"
title="Popular Amenities"
size={50}
/>
Log Analysis Project: Processing Nginx Access Logs
1 ) Log format and field parsing
Sample raw log line:
192.168.1.1 [09/Nov/2023:10:12:33] "GET /video/123.ts HTTP/1.1" 304 "https://www.google.com" "Chrome/117.0"
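2 ) Logstash Grok parsing configuration
filter {
  grok {
    # HTTPDATE requires a timezone offset, which the sample line lacks, so the timestamp is matched piece by piece
    match => { "message" => "%{IP:client_ip} \[(?<timestamp>%{MONTHDAY}/%{MONTH}/%{YEAR}:%{TIME})\] \"%{WORD:method} %{URIPATH:path} %{DATA}\" %{NUMBER:status} \"%{DATA:referrer}\" \"%{DATA:user_agent}\"" }
  }
  date { match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss"] }
  geoip { source => "client_ip" } # resolve the IP to a geographic location
  useragent { source => "user_agent" } # parse device information
}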
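The field extraction performed by the Grok filter above can be mirrored with an ordinary regular expression, which is handy for checking a pattern against sample lines before deploying it. The sketch below is an illustration only: the interface AccessLogEntry, the constant ACCESS_LOG_PATTERN, and the function parseAccessLog are hypothetical names, and the expression is written for the sample line shown earlier rather than for every Nginx format.

```typescript
interface AccessLogEntry {
  clientIp: string;
  timestamp: string;
  method: string;
  path: string;
  status: number;
  referrer: string;
  userAgent: string;
}

// ip [timestamp] "METHOD path protocol" status "referrer" "user agent"
const ACCESS_LOG_PATTERN =
  /^(\S+) \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) "([^"]*)" "([^"]*)"$/;

// Returns the parsed entry, or null when the line does not match the format.
function parseAccessLog(line: string): AccessLogEntry | null {
  const m = ACCESS_LOG_PATTERN.exec(line);
  if (m === null) {
    return null;
  }
  return {
    clientIp: m[1] ?? "",
    timestamp: m[2] ?? "",
    method: m[3] ?? "",
    path: m[4] ?? "",
    status: Number(m[5]),
    referrer: m[6] ?? "",
    userAgent: m[7] ?? ""
  };
}
```

Returning null for non-matching lines plays the same role as Logstash tagging an event with _grokparsefailure: the caller can count or route unparsed lines instead of silently dropping them.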
3 ) Key ES index configuration
PUT /nginx_logs
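{
"mappings": {
"properties": {
"timestamp": { "type": "date" },
"client_ip": { "type": "ip" },
"geoip": { "properties": { "location": { "type": "geo_point" } } }, // assumes the legacy geoip layout, where coordinates are written to geoip.location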
"response_time_ms": { "type": "float" }
}
}
}
Log Analysis Project: ELK in Practice and Business Insights
Log data structure. Sample Nginx log line:
66.249.73.135 [08/Nov/2016:00:01:02 +0000] "GET /video/1234 HTTP/1.1" 304 154 "www.google.com" "Mozilla/5.0" 192.168.1.10 192.168.1.20 0.016 0.016
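Field breakdown:
- $remote_addr: client IP
- $time_local: access time
- $request: request line (method, path, and protocol)
- $status: response status code
- $http_referer: referring page
- $http_user_agent: client user agent
Logstash pipeline configuration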
filter {
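grok {
# The sample line has a bare host as referrer and two IPs before the timings; the first IP is captured under the assumed name proxy_ip
match => { "message" => "%{IP:clientip} \[%{HTTPDATE:timestamp}\] \"%{WORD:method} %{PATH:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:status} %{NUMBER:body_bytes} \"%{DATA:referrer}\" \"%{DATA:agent}\" %{IP:proxy_ip} %{IP:upstream_ip} %{NUMBER:upstream_time} %{NUMBER:request_time}" }
}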
date {
match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
}
mutate {
convert => {
"upstream_time" => "float"
"request_time" => "float"
}
}
urldecode {
field => "request"
}
useragent {
source => "agent"
target => "ua"
}
geoip {
source => "clientip"
}
}
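The date filter in this pipeline turns the Nginx time string into a real timestamp. As a rough illustration of what that conversion involves, the TypeScript sketch below parses the dd/MMM/yyyy:HH:mm:ss Z layout by hand. The names MONTH_INDEX and parseNginxTime are made up for this example, and only English month abbreviations are handled.

```typescript
const MONTH_INDEX: { [abbr: string]: number } = {
  Jan: 0, Feb: 1, Mar: 2, Apr: 3, May: 4, Jun: 5,
  Jul: 6, Aug: 7, Sep: 8, Oct: 9, Nov: 10, Dec: 11
};

// Parses e.g. "08/Nov/2016:00:01:02 +0000" into a Date, or returns null.
function parseNginxTime(value: string): Date | null {
  const m = /^(\d{2})\/([A-Za-z]{3})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})$/.exec(value);
  if (m === null) {
    return null;
  }
  const month = MONTH_INDEX[m[2] ?? ""];
  if (month === undefined) {
    return null;
  }
  // Interpret the wall-clock fields as if they were UTC...
  const asUtc = Date.UTC(Number(m[3]), month, Number(m[1]), Number(m[4]), Number(m[5]), Number(m[6]));
  // ...then subtract the zone offset, because local time = UTC + offset.
  const sign = m[7] === "-" ? -1 : 1;
  const offsetMinutes = sign * (Number(m[8]) * 60 + Number(m[9]));
  return new Date(asUtc - offsetMinutes * 60000);
}
```

Normalizing every event to UTC at ingest time is what lets Kibana later bucket requests from differently configured servers on a single time axis.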
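Key metrics for the Kibana dashboard
- Traffic analysis:
  - Request volume trend (15-minute granularity)
  - Geographic heat map (GeoIP)
  - Top 10 pages by traffic
- Performance monitoring:
  {
    "aggs": {
      "response_time": {
        "percentiles": {
          "field": "request_time",
          "percents": [50, 95, 99]
        }
      }
    }
  }
- Business insights:
  - Ranking of the most popular video content
  - Distribution of user activity by time of day
  - Traffic source (referrer) analysis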
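To clarify what the p50/p95/p99 figures in the performance-monitoring aggregation mean, here is a small TypeScript sketch that computes percentiles with the nearest-rank method. This is not how Elasticsearch does it: the percentiles aggregation returns approximate values (TDigest by default), whereas this helper, named percentile purely for illustration, is exact on a small in-memory array.

```typescript
// Nearest-rank percentile: the smallest value such that at least p percent
// of the samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("percentile of an empty sample is undefined");
  }
  if (p <= 0 || p > 100) {
    throw new Error("p must be in (0, 100]");
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}
```

With request times of 0.016, 0.020, 0.050, 0.120 and 0.900 seconds, the median is 0.050 while p95 and p99 are both 0.900, which shows why tail percentiles expose slow requests that an average would hide.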
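Engineering Example 1
1 ) Option 1: basic CRUD operations
// Install dependencies: npm install @nestjs/elasticsearch @elastic/elasticsearch
import { Injectable } from '@nestjs/common';
import { ElasticsearchService } from '@nestjs/elasticsearch';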
@Injectable()
export class SearchService {
constructor(private readonly esService: ElasticsearchService) {}
async createListing(listing: any) {
return this.esService.index({
index: 'test_airbnb',
body: listing
});
}
async searchListings(query: string) {
return this.esService.search({
index: 'test_airbnb',
body: {
query: { match: { name: query } }
}
});
}
}
2 ) Option 2: complex query (geo location + pagination)
async findNearbyLocations(lat: number, lon: number, radius: string) {
return this.esService.search({
index: 'test_airbnb',
body: {
query: {
bool: {
filter: {
geo_distance: {
distance: radius,
location: { lat, lon }
}
}
}
},
sort: [{ _geo_distance: { location: [lon, lat], order: "asc" } }],
from: 0,
size: 10
}
});
}
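The geo_distance filter and the _geo_distance sort both rest on the distance between two latitude/longitude points. Elasticsearch computes this internally; the haversine sketch below is only meant to give an intuition for the numbers, with the helper names haversineKm and withinRadiusKm and the mean Earth radius of 6371 km as stated assumptions.

```typescript
const EARTH_RADIUS_KM = 6371; // mean Earth radius, an approximation

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two points, in kilometres.
function haversineKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.pow(Math.sin(dLat / 2), 2) +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.pow(Math.sin(dLon / 2), 2);
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// True when the second point lies within radiusKm of the first.
function withinRadiusKm(lat1: number, lon1: number, lat2: number, lon2: number, radiusKm: number): boolean {
  return haversineKm(lat1, lon1, lat2, lon2) <= radiusKm;
}
```

One degree of latitude comes out at roughly 111 km, which is a quick sanity check when choosing a radius such as "5km" for the query above.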
3 ) Option 3: bulk writes and index management
async bulkImport(listings: any[]) {
const body = listings.flatMap(listing => [
{ index: { _index: 'test_airbnb' } },
listing
]);
return this.esService.bulk({ body });
}
async createAutosuggestIndex() {
return this.esService.indices.create({
index: 'autosuggest',
body: {
settings: { /* same analyzer configuration as shown earlier */ },
mappings: { properties: { name: { type: "text", analyzer: "autosuggest_analyzer" } } }
}
});
}
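For larger imports, the bulk body built above is usually sent in batches rather than as one huge request. The following sketch shows the two pure steps involved, independent of any client library. The helper names toBulkBody and chunk are hypothetical, and the batch size is a parameter to tune rather than a value taken from the source.

```typescript
type BulkDoc = { [field: string]: unknown };

// Interleaves an action line and a source line for every document,
// which is the shape the bulk API expects.
function toBulkBody(index: string, docs: BulkDoc[]): object[] {
  const body: object[] = [];
  for (const doc of docs) {
    body.push({ index: { _index: index } });
    body.push(doc);
  }
  return body;
}

// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) {
    throw new Error("size must be positive");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch from chunk(listings, n) can be passed through toBulkBody and then to the bulk call shown above. The bulk response carries an errors flag and per-item results, so it is worth checking them after every batch, since a bulk request can succeed overall while individual documents fail.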
Engineering Example 2
Module configuration (es.module.ts)
import { Module } from '@nestjs/common';
import { ElasticsearchModule } from '@nestjs/elasticsearch';
@Module({
imports: [
ElasticsearchModule.register({
node: 'http://localhost:9200',
maxRetries: 3,
requestTimeout: 30000
})
],
exports: [ElasticsearchModule]
})
export class EsModule {}
Service layer wrapper (search.service.ts)
import { Injectable } from '@nestjs/common';
import { ElasticsearchService } from '@nestjs/elasticsearch';
@Injectable()
export class SearchService {
constructor(private readonly esService: ElasticsearchService) {}
async searchHomes(params: SearchParamsDTO) {
const { query, minPrice, maxPrice, location, radius } = params;
return this.esService.search({
index: 'test_airbnb',
body: {
query: {
bool: {
must: [{
multi_match: {
query,
fields: ['name^3', 'description']
}
}],
filter: [
{ range: { price: { gte: minPrice, lte: maxPrice } } },
{
geo_distance: {
distance: `${radius}km`,
location
}
}
]
}
},
highlight: {
fields: { name: {}, description: {} }
}
}
});
}
async logBulkInsert(logs: any[]) {
const body = logs.flatMap(log => [
{ index: { _index: 'nginx_logs' } },
log
]);
return this.esService.bulk({
refresh: 'wait_for',
body
});
}
}
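The service above refers to a SearchParamsDTO type that the listing does not define. A possible shape, inferred only from how searchHomes destructures and uses the parameters, is sketched below together with a small validator. Both the field types and the function validateSearchParams are assumptions, not the author's definitions.

```typescript
// Assumed shape: radius is a number of kilometres because the service
// appends "km", and location is a lat/lon object accepted by geo_distance.
interface SearchParamsDTO {
  query: string;
  minPrice: number;
  maxPrice: number;
  location: { lat: number; lon: number };
  radius: number;
}

// Returns a list of human-readable problems; an empty list means valid.
function validateSearchParams(p: SearchParamsDTO): string[] {
  const errors: string[] = [];
  if (p.query.trim().length === 0) {
    errors.push("query must not be empty");
  }
  if (p.minPrice > p.maxPrice) {
    errors.push("minPrice must not exceed maxPrice");
  }
  if (!(p.radius > 0)) {
    errors.push("radius must be positive");
  }
  if (p.location.lat < -90 || p.location.lat > 90) {
    errors.push("lat must be between -90 and 90");
  }
  if (p.location.lon < -180 || p.location.lon > 180) {
    errors.push("lon must be between -180 and 180");
  }
  return errors;
}
```

Validating before the request is built keeps malformed ranges and coordinates from reaching Elasticsearch, where they would otherwise surface as less readable query errors.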
Controller usage (search.controller.ts)
import { Controller, Post, Body } from '@nestjs/common';
import { SearchService } from './search.service';
@Controller('search')
export class SearchController {
constructor(private readonly searchService: SearchService) {}
@Post('homes')
async searchHomes(@Body() params: any) {
return this.searchService.searchHomes(params);
}
@Post('logs')
async ingestLogs(@Body() logs: any[]) {
return this.searchService.logBulkInsert(logs);
}
}
Key configuration tuning
- Connection pool settings:
  ElasticsearchModule.register({
    nodes: ['http://node1:9200', 'http://node2:9200'],
    ConnectionPool: require('@elastic/elasticsearch').ConnectionPool,
    maxRetries: 5,
    sniffOnStart: true
  })
- Security and authentication:
  auth: { username: process.env.ES_USER, password: process.env.ES_PASS },
  ssl: {
    ca: readFileSync('./certs/ca.crt'),
    rejectUnauthorized: true // verify the server certificate against the CA above
  }
- Performance tuning parameters:
  agent: { maxSockets: 100 }, // maximum number of sockets per host
  compression: true, // enable request compression
  suggestCompression: true // ask the server to compress responses
Engineering Example 3
1 ) Option 1: basic data indexing service
import { Injectable } from '@nestjs/common';
import { ElasticsearchService } from '@nestjs/elasticsearch';
@Injectable()
export class ListingService {
constructor(private readonly esService: ElasticsearchService) {}
async indexListing(listing: any) {
return this.esService.index({
index: 'airbnb',
body: {
...listing,
timestamp: new Date()
}
});
}
async searchListings(query: string) {
return this.esService.search({
index: 'airbnb',
body: {
query: {
multi_match: {
query,
fields: ["name^3", "description"]
}
}
}
});
}
}
2 ) Option 2: asynchronous batch processing pipeline
import { Processor, Process } from '@nestjs/bull';
import { Job } from 'bull';
import { ElasticsearchService } from '@nestjs/elasticsearch';
@Processor('log-queue')
export class LogProcessor {
constructor(private readonly esService: ElasticsearchService) {}
@Process()
async handleLogs(job: Job) {
const bulkBody = job.data.flatMap(log => ([
{ index: { _index: 'nginx_logs' } },
log
]));
await this.esService.bulk({
refresh: 'wait_for',
body: bulkBody
});
}
}
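3 ) Option 3: security and performance tuning
Key elasticsearch.yml settings
thread_pool.write.queue_size: 1000 # enlarge the write queue
indices.query.bool.max_clause_count: 8192 # allow more clauses in complex bool queries
Security module integration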
xpack.security.enabled: true
xpack.security.authc.api_key.enabled: true
Log Analysis System in Practice
1 ) Nginx log modeling
Log format:
$remote_addr - [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"
Index mapping configuration:
PUT /nginx_logs
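{
"mappings": {
"properties": {
"geoip": { "properties": { "location": { "type": "geo_point" } } }, // assumes the legacy geoip layout, where coordinates are written to geoip.location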
"response_time": { "type": "float" },
"user_agent": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }
}
}
}
2 ) Log collection architecture
3 ) Grok parsing configuration
Example Logstash filter:
filter {
grok {
match => { "message" => "%{IP:client} %{GREEDYDATA:ident} ..." }
}
date { match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ] }
geoip { source => "client" target => "geoip" }
useragent { source => "agent" target => "os" }
}
Production Best Practices
Elasticsearch cluster configuration
# elasticsearch.yml
cluster.name: production-cluster
node.name: ${HOSTNAME}
network.host: 0.0.0.0
discovery.seed_hosts: ["es-node1", "es-node2"]
cluster.initial_master_nodes: ["es-node1"]
# Resource tuning
bootstrap.memory_lock: true
indices.queries.cache.size: 10%
thread_pool.write.queue_size: 1000
Monitoring and alerting configuration
- Kibana monitoring dashboard:
  - JVM heap usage
  - Indexing rate / query rate
  - Shard allocation status
- Example alert rule:
  PUT _watcher/watch/cluster_health_watch
  {
    "trigger": { "schedule": { "interval": "10s" } },
    "input": {
      "http": {
        "request": { "host": "localhost", "port": 9200, "path": "/_cluster/health" }
      }
    },
    "condition": { "compare": { "ctx.payload.status": { "eq": "red" } } },
    "actions": {
      "email_alert": {
        "email": {
          "to": "admin@example.com",
          "subject": "ES Cluster Status RED",
          "body": "Cluster health status is RED at {{ctx.execution_time}}"
        }
      }
    }
  }
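Performance optimization strategies
- Index design:
  - Tier hot and cold data (ILM policies)
  - Use time-based indices for time-series data
  PUT _ilm/policy/logs_policy
  {
    "policy": {
      "phases": {
        "hot": { "actions": { "rollover": { "max_size": "50GB" } } },
        "delete": { "min_age": "30d", "actions": { "delete": {} } }
      }
    }
  }
- Query optimization:
  - Avoid deep pagination (use search_after instead)
  - Use runtime_mappings in place of per-request scripts
  - Set pre_filter_shard_size for aggregation queries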
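The search_after recommendation in the list above works by feeding the sort values of the last hit on one page into the request for the next page. A minimal sketch of that step follows. The helper nextPageBody is a hypothetical name, and the sort on timestamp plus a unique tiebreaker field called event_id in the usage note is an assumption for illustration.

```typescript
type SearchBody = { [key: string]: unknown };

// Each hit carries a `sort` array when the request specifies a sort.
interface SortedHit {
  sort?: unknown[];
}

// Builds the request body for the next page, or returns null when there is
// no last hit to continue from (that is, the previous page was empty).
function nextPageBody(baseBody: SearchBody, lastHit: SortedHit | undefined): SearchBody | null {
  if (lastHit === undefined || lastHit.sort === undefined) {
    return null;
  }
  return { ...baseBody, search_after: lastHit.sort };
}
```

For example, with a base body of { size: 10, sort: [{ timestamp: "desc" }, { event_id: "asc" }] }, the caller keeps requesting nextPageBody(base, lastHitOfPreviousPage) until it returns null. Unlike from/size, the cost of each page stays roughly constant no matter how deep the pagination goes.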
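Key takeaways:
- Data modeling is the foundation of Elasticsearch performance; field types and usage scenarios need to be defined up front
- Autocomplete is implemented with the edge_ngram token filter, which requires balancing index size against query efficiency
- On the front end, ReactiveSearch makes it quick to build a production-grade search UI
- Log analysis combines Grok parsing with GeoIP enrichment for deeper insight
- The NestJS integration uses modular encapsulation to keep the back-end service maintainable
- Cluster operations should focus on shard strategy, ILM management, and real-time monitoring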
Best Practices for Supporting ES Configuration
1 ) Cross-origin access (CORS) configuration (elasticsearch.yml)
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS, HEAD, GET, POST, PUT, DELETE
2 ) Security hardening
- Use the readonlyrest plugin to set permissions
- Use an API key instead of basic authentication:
  curl -XPOST "localhost:9200/_security/api_key" -H 'Content-Type: application/json' -d'
  { "name": "airbnb_app_key" }'
3 ) Performance tuning suggestions
PUT /test_airbnb/_settings
{
"index.refresh_interval": "30s", // lower the refresh frequency
"index.number_of_replicas": 1 // 2 or more recommended for production
}
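Tips for beginners:
- edge_ngram: splits a word into its prefixes (e.g. "bedroom" → ["b", "be", "bed", …]) to enable as-you-type autocomplete
- geo_distance: computes the distance between two latitude/longitude points (supports km/mi units)
- Full-text search vs. exact matching: the text type is analyzed into tokens, while the keyword type matches the exact value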