Elasticsearch Overview: A Guide to Building a Home-Search Platform and a Log Analysis System

Search project in practice: quickly building a search engine on Elasticsearch


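Data modeling and import workflow

1 ) Data source
An Airbnb-style listings dataset (CSV format) with the following fields:

  • bathrooms (number of bathrooms)
  • room_type (room type)
  • bedrooms (number of bedrooms)
  • date_from/date_to (available date range)
  • availability (current availability status)
  • images (image URLs)
  • description (text description)

2 ) Key index mapping configuration in Elasticsearch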

PUT /test_airbnb
{
  "mappings": {
    "dynamic": false,  // disable dynamic mapping
    "properties": {
      "bedrooms": {
        "type": "integer"  // numeric, so range filters such as gte work
      },
      "date_from": { 
        "type": "date", 
        "format": "yyyy-MM-dd" 
      },
      "images": { 
        "type": "keyword", 
        "index": false  // not indexed: displayed only, never searched
      },
      "name": {
        "type": "text",
        "analyzer": "autosuggest_analyzer"  // custom autocomplete analyzer
      },
      "location": { "type": "geo_point" }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "autosuggest_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "autosuggest_filter"]
        }
      },
      "filter": {
        "autosuggest_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      }
    }
  }
}

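Technical details:

  • The edge_ngram token filter provides completion as the user types (for example, typing "be" matches "bedroom")
  • The geo_point type on the location field supports distance sorting and map display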
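
To make the edge_ngram behaviour concrete, here is a minimal TypeScript sketch of the terms such a filter would emit. It is an illustration only: edgeNgrams and autosuggestTerms are hypothetical helper names, and the regular-expression split is a rough stand-in for the standard tokenizer, not the real Elasticsearch analysis chain.

```typescript
// Sketch of what an edge_ngram token filter emits for a single token.
// Assumption: only the min_gram / max_gram semantics are mirrored here.
function edgeNgrams(token: string, minGram: number, maxGram: number): string[] {
  const grams: string[] = [];
  const upper = Math.min(maxGram, token.length);
  for (let len = minGram; len <= upper; len++) {
    grams.push(token.slice(0, len)); // every prefix from minGram up to maxGram
  }
  return grams;
}

// Rough approximation of autosuggest_analyzer: lowercase the text, split on
// non-alphanumeric characters, then expand each token into its prefixes.
function autosuggestTerms(text: string, minGram = 1, maxGram = 20): string[] {
  const terms: string[] = [];
  const tokens = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
  for (const token of tokens) {
    for (const gram of edgeNgrams(token, minGram, maxGram)) {
      terms.push(gram);
    }
  }
  return terms;
}

// Prefixes of "bedroom" with min_gram 1 and max_gram 20, as in the mapping.
const bedroomGrams = edgeNgrams("bedroom", 1, 20);
```

Because "be" is one of the indexed prefixes of "bedroom", a query for "be" can match the document without any wildcard scan. The cost is index size, since every token is stored once per prefix.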

PUT /test_airbnb
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "bath_type": { 
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "date_from": { 
        "type": "date",
        "format": "yyyy-MM-dd"
      },
      "date_to": {
        "type": "date",
        "format": "yyyy-MM-dd"
      },
      "host_image": {
        "type": "keyword",
        "index": false
      },
      "name": {
        "type": "text",
        "analyzer": "autosuggest_analyzer"
      },
      "property_type": {
        "type": "text",
        "analyzer": "standard"
      },
      "location": {
        "type": "geo_point"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "autosuggest_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "autosuggest_filter"]
        }
      },
      "filter": {
        "autosuggest_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20 
        }
      }
    }
  }
}

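Key configuration notes:

  • The edge_ngram token filter implements autocomplete
  • Dynamic mapping is turned off (dynamic: false) so that unexpected fields do not pollute the mapping
  • Fields that are never searched (such as image URLs) set index: false to keep the index smaller
  • The location field uses geo_point to support spatial queries

3 ) Logstash import configuration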

input { file { path => "/data/airbnb.csv" } }
filter {
  csv {
    columns => ["bathrooms", "room_type", ...]  # all column names
  }
  mutate { lowercase => ["availability"] }  # lowercase the boolean values
}
output { elasticsearch { hosts => ["localhost:9200"] index => "test_airbnb" } }

Run it with: bin/logstash -f airbnb.conf

Data import and exploration in Kibana


Logstash import configuration

input {
  file {
    path => "/data/airbnb.csv"
    start_position => "beginning"
  }
}
 
filter {
  csv {
    separator => ","
    columns => ["id", "name", "bath_type", "date_from", ...]
  }
  mutate {
    lowercase => ["availability"]
  }
  date {
    match => ["date_from", "yyyy-MM-dd"]
    target => "date_from"
  }
}
 
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "test_airbnb"
  }
}

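Data processing notes:

  • Lowercase boolean fields (such as availability) so their values are consistent
  • Specify the date format explicitly when converting dates
  • Keep the CSV column names strictly aligned with the index fields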
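
The three points above can also be sketched outside Logstash, for instance when validating rows in application code before indexing. This is a minimal illustration under stated assumptions: normalizeRow is a hypothetical helper (not a Logstash or Elasticsearch API), and the column names simply follow the CSV fields listed earlier.

```typescript
// Hypothetical helper: mirrors the three processing rules for one CSV row.
type Row = Record<string, string>;

const ISO_DAY = /^\d{4}-\d{2}-\d{2}$/; // the yyyy-MM-dd format used by the mapping
const DATE_FIELDS = ["date_from", "date_to"];

function normalizeRow(columns: string[], values: string[]): Row {
  // Column names must line up one-to-one with the parsed values.
  if (columns.length !== values.length) {
    throw new Error(`expected ${columns.length} values, got ${values.length}`);
  }
  const row: Row = {};
  for (let i = 0; i < columns.length; i++) {
    row[columns[i]] = String(values[i]).trim();
  }
  // Lowercase the boolean-like field, as mutate { lowercase } does.
  const availability = row["availability"];
  if (availability !== undefined) {
    row["availability"] = availability.toLowerCase();
  }
  // Reject dates that do not match the explicit format.
  for (const field of DATE_FIELDS) {
    const value = row[field];
    if (value !== undefined && !ISO_DAY.test(value)) {
      throw new Error(`${field} is not in yyyy-MM-dd format: ${value}`);
    }
  }
  return row;
}
```

Failing fast on a column-count mismatch or a malformed date keeps bad rows out of the index instead of leaving them to surface later as mapping errors.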

Kibana visualization tips

  1. Field display formatting:
    • For URL fields, set Format: URL, Type: Link
    • For image fields, set Format: URL, Type: Image
  2. Example filter query:
    {
      "query": {
        "bool": {
          "must": [
            { "range": { "bedrooms": { "gte": 3 } } },
            { "range": { "price": { "lte": 400 } } }
          ]
        }
      }
    }
    
  3. Fields configured for the data table:
    • Host name (host)
    • Listing image (image_url)
    • Property type (property_type)
    • Price (price)
    • Booking link (booking_url)

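Front-end search UI: a quick implementation with Reactivesearch


1 ) Technology choice and deployment

  • Framework: Reactivesearch (a React-based UI component library for Elasticsearch)
  • Deployment steps:
    # Add the library to your own project (a local install rather than -g)
    npm install @appbaseio/reactivesearch
    
    git clone https://github.com/appbaseio/reactivesearch-airbnb-demo
    cd reactivesearch-airbnb-demo
    yarn install  # install dependencies
    yarn start    # start the dev server (port 3000)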
    

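2 ) Core components and field mapping

Component        Function                  ES field
<DataSearch>     Top search box            name
<DateRange>      Date filter               date_from
<NumberBox>      Bedroom count selector    bedrooms
<RangeSlider>    Price range slider        price
<ResultCard>     Listing card display      rendered from multiple fields

3 ) Autocomplete and result rendering example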

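// Search box component (with autocomplete)
<DataSearch
  componentId="searchbox"
  dataField="name"
  autosuggest={true}
  placeholder="Search listing names"
/>
 
// Result card component
<ResultCard
  componentId="results"
  dataField="name"
  react={{ and: ["searchbox", "daterange"] }} // re-query when these components change
  onData={(res) => ({
    image: res.images[0],
    title: res.name,
    description: `${res.bedrooms} bedrooms · ${res.bathrooms} bathrooms`,
    price: `¥${res.price}/night`
  })}
/>

Alternatively, use the core component configuration below as a reference: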

<ReactiveBase
  app="test_airbnb"
  url="http://localhost:9200"
>
  <DataSearch
    componentId="searchbox"
    dataField="name"
    placeholder="Search rentals..."
    autosuggest={true}
    highlight={true}
  />
  <RangeSlider
    componentId="priceslider"
    dataField="price"
    title="Price Range"
    range={{ start: 10, end: 1000 }}
  />
  <MultiList
    componentId="propertyfilter"
    dataField="property_type.keyword"
    title="Property Type"
  />
  <ResultCard
    componentId="results"
    dataField="name"
    pagination={true}
    react={{ and: ["searchbox", "priceslider"] }}
    onData={(res) => ({
      image: res.image_url,
      title: res.name,
      description: `${res.bedrooms} bedrooms · ${res.bath_type}`,
      price: `$${res.price}/night`
    })}
  />
</ReactiveBase>

Feature extension example (adding a tag cloud)

import { TagCloud } from "@appbaseio/reactivesearch";
 
<TagCloud
  componentId="tagfilter"
  dataField="amenities.keyword"
  title="Popular Amenities"
  size={50}
/>

Log analysis project: processing Nginx access logs


1 ) Log format and field parsing

Sample raw log line:
192.168.1.1 [09/Nov/2023:10:12:33 +0000] "GET /video/123.ts HTTP/1.1" 304 "https://www.google.com" "Chrome/117.0"

2 ) Logstash Grok parsing configuration

filter {
  grok {
    match => { "message" => "%{IP:client_ip} \[%{HTTPDATE:timestamp}\] \"%{WORD:method} %{URIPATH:path} %{DATA}\" %{NUMBER:status} \"%{DATA:referrer}\" \"%{DATA:user_agent}\"" }
  }
  # HTTPDATE captures the timezone offset, so the date pattern ends with Z
  date { match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"] }
  geoip { source => "client_ip" }  # map the IP to a geographic location
  useragent { source => "user_agent" } # parse device information
}

3 ) Key Elasticsearch index configuration

PUT /nginx_logs
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "client_ip": { "type": "ip" },
      "geoip": { "properties": { "location": { "type": "geo_point" } } },
      "response_time_ms": { "type": "float" }
    }
  }
}

Log analysis project: ELK in practice and business insight


Log data structure, with a sample Nginx log line:

66.249.73.135 [08/Nov/2016:00:01:02 +0000] "GET /video/1234 HTTP/1.1" 304 154 "www.google.com" "Mozilla/5.0" 192.168.1.10 192.168.1.20 0.016 0.016

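Field breakdown:

  • $remote_addr: client IP
  • $time_local: access time
  • $request: request line (method, path and protocol)
  • $status: response status code
  • $http_referer: referring page
  • $http_user_agent: client user agent

Logstash pipeline configuration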

filter {
  grok {
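    # The sample line has no URL scheme in the referrer and carries two IPs after
    # the user agent, so DATA is used for the referrer; upstream_ip2 is an assumed name.
    match => { "message" => "%{IP:clientip} \[%{HTTPDATE:timestamp}\] \"%{WORD:method} %{PATH:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:status} %{NUMBER:body_bytes} \"%{DATA:referrer}\" \"%{DATA:agent}\" %{IP:upstream_ip} %{IP:upstream_ip2} %{NUMBER:upstream_time} %{NUMBER:request_time}" }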
  }
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
  mutate {
    convert => { 
      "upstream_time" => "float"
      "request_time" => "float"
    }
  }
  urldecode { 
    field => "request"
  }
  useragent {
    source => "agent"
    target => "ua"
  }
  geoip {
    source => "clientip"
  }
}
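
For readers who want to check the parsing logic without running Logstash, the sketch below applies an equivalent regular expression to the sample line in TypeScript. It is an approximation, not the grok engine: parseAccessLog, parseHttpDate and the extraIps field are hypothetical names, and the two trailing IPs are kept as an unlabeled list because the sample does not say what each one is.

```typescript
// Hypothetical shape; field names loosely follow the grok captures above.
interface AccessLogEntry {
  clientIp: string;
  timestamp: Date;
  method: string;
  request: string;
  httpVersion: string;
  status: number;
  bodyBytes: number;
  referrer: string;
  agent: string;
  extraIps: string[]; // assumption: meaning of the two trailing IPs is unknown
  upstreamTime: number;
  requestTime: number;
}

const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Parses "08/Nov/2016:00:01:02 +0000" (the $time_local layout) into a Date.
function parseHttpDate(text: string): Date {
  const m = /^(\d{2})\/([A-Za-z]{3})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})$/.exec(text);
  if (m === null) throw new Error(`unrecognized timestamp: ${text}`);
  const month = MONTHS.indexOf(m[2]);
  if (month < 0) throw new Error(`unknown month: ${m[2]}`);
  const asUtc = Date.UTC(Number(m[3]), month, Number(m[1]), Number(m[4]), Number(m[5]), Number(m[6]));
  const offsetMinutes = (m[7] === "-" ? -1 : 1) * (Number(m[8]) * 60 + Number(m[9]));
  return new Date(asUtc - offsetMinutes * 60000); // local time minus offset gives UTC
}

const LINE_RE =
  /^(\S+) \[([^\]]+)\] "(\S+) (\S+) HTTP\/([\d.]+)" (\d{3}) (\d+) "([^"]*)" "([^"]*)" (\S+) (\S+) ([\d.]+) ([\d.]+)$/;

// Returns null when the line does not match, similar to a grok parse failure.
function parseAccessLog(line: string): AccessLogEntry | null {
  const m = LINE_RE.exec(line.trim());
  if (m === null) return null;
  return {
    clientIp: m[1],
    timestamp: parseHttpDate(m[2]),
    method: m[3],
    request: m[4],
    httpVersion: m[5],
    status: Number(m[6]),
    bodyBytes: Number(m[7]),
    referrer: m[8],
    agent: m[9],
    extraIps: [m[10], m[11]],
    upstreamTime: Number(m[12]), // numeric conversion, like mutate { convert }
    requestTime: Number(m[13]),
  };
}
```

Converting the two timing fields to numbers at parse time matters for the same reason as the mutate convert step: percentile and average aggregations only work on numeric values.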

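Key metrics for the Kibana dashboard

  1. Traffic analysis:

    • Request volume trend (15-minute buckets)
    • Geographic distribution heat map (GeoIP)
    • Top 10 pages by traffic
  2. Performance monitoring: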

    {
      "aggs": {
        "response_time": {
          "percentiles": {
            "field": "request_time",
            "percents": [50, 95, 99]
          }
        }
      }
    }
    
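  3. Business insight:

    • Ranking of the most popular video content
    • Distribution of user activity by time of day
    • Analysis of traffic sources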
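
To show what the 50th, 95th and 99th percentiles in the performance-monitoring aggregation mean, here is a small nearest-rank calculation in TypeScript. It is a sketch for intuition only: Elasticsearch computes percentiles approximately across shards, so its results can differ slightly from this exact method, and percentile is a hypothetical helper name.

```typescript
// Exact nearest-rank percentile over an in-memory sample.
// Assumption: values are request times in seconds; p is in the range (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("values must not be empty");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = values.slice().sort((a, b) => a - b);
  // Smallest rank whose share of the sample is at least p percent.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Five request times with one slow outlier.
const requestTimes = [0.016, 0.02, 0.15, 0.018, 0.9];
const latencySummary = {
  p50: percentile(requestTimes, 50),
  p95: percentile(requestTimes, 95),
  p99: percentile(requestTimes, 99),
};
```

The median stays low while the upper percentiles jump to the outlier, which is why dashboards usually track p95 and p99 rather than the average alone.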

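Engineering example 1


1 ) Option 1: basic CRUD operations

// Install dependencies: npm install @nestjs/elasticsearch @elastic/elasticsearch
import { Injectable } from '@nestjs/common';
import { ElasticsearchService } from '@nestjs/elasticsearch';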
 
@Injectable()
export class SearchService {
  constructor(private readonly esService: ElasticsearchService) {}
 
  async createListing(listing: any) {
    return this.esService.index({
      index: 'test_airbnb',
      body: listing 
    });
  }
 
  async searchListings(query: string) {
    return this.esService.search({
      index: 'test_airbnb',
      body: {
        query: { match: { name: query } }
      }
    });
  }
}

2 ) Option 2: complex query (geolocation + pagination)

async findNearbyLocations(lat: number, lon: number, radius: string) {
  return this.esService.search({
    index: 'test_airbnb',
    body: {
      query: {
        bool: {
          filter: {
            geo_distance: {
              distance: radius,
              location: { lat, lon }
            }
          }
        }
      },
      sort: [{ _geo_distance: { location: [lon, lat], order: "asc" } }],
      from: 0,
      size: 10
    }
  });
}
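
The geo_distance filter and the _geo_distance sort can be pictured with a plain great-circle calculation. The sketch below uses the haversine formula on an in-memory list. It is an illustration under assumptions: haversineKm and nearby are hypothetical helpers, the Earth is treated as a sphere of radius 6371 km, and Elasticsearch's own distance computation may differ in small details.

```typescript
interface GeoPoint {
  lat: number;
  lon: number;
}

const EARTH_RADIUS_KM = 6371; // mean radius; a spherical approximation

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two points using the haversine formula.
function haversineKm(a: GeoPoint, b: GeoPoint): number {
  const dLat = toRadians(b.lat - a.lat);
  const dLon = toRadians(b.lon - a.lon);
  const sinLat = Math.sin(dLat / 2);
  const sinLon = Math.sin(dLon / 2);
  const h = sinLat * sinLat + Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * sinLon * sinLon;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}

// Keeps items within radiusKm of center, nearest first: the in-memory
// counterpart of a geo_distance filter plus an ascending _geo_distance sort.
function nearby<T extends { location: GeoPoint }>(items: T[], center: GeoPoint, radiusKm: number): T[] {
  return items
    .map((item) => ({ item, km: haversineKm(center, item.location) }))
    .filter((entry) => entry.km <= radiusKm)
    .sort((x, y) => x.km - y.km)
    .map((entry) => entry.item);
}
```

Note the coordinate order in the original query: the object form is { lat, lon }, while the array form used in the sort is [lon, lat]. Mixing the two up is a common source of empty results.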

3 ) Option 3: bulk writes and index management

async bulkImport(listings: any[]) {
  const body = listings.flatMap(listing => [
    { index: { _index: 'test_airbnb' } },
    listing
  ]);
 
  return this.esService.bulk({ body });
}
 
async createAutosuggestIndex() {
  return this.esService.indices.create({
    index: 'autosuggest',
    body: {
      settings: { /* same analyzer configuration as shown earlier */ },
      mappings: { properties: { name: { type: "text", analyzer: "autosuggest_analyzer" } } }
    }
  });
}
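
The bulk body built in bulkImport is an alternating list of action lines and documents. The following sketch isolates that shape as pure functions so it can be inspected or unit-tested without a cluster. The helper names (toBulkOperations, chunk, toNdjson) are hypothetical, and the batch size is left to the caller as an assumption to tune.

```typescript
type BulkAction = { index: { _index: string } };

// Interleaves one action line before each document, as the bulk API expects.
function toBulkOperations<T extends object>(index: string, docs: T[]): Array<BulkAction | T> {
  const operations: Array<BulkAction | T> = [];
  for (const doc of docs) {
    operations.push({ index: { _index: index } });
    operations.push(doc);
  }
  return operations;
}

// Splits a large import into fixed-size batches so no single request grows unbounded.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Serializes operations as newline-delimited JSON, the wire format of the
// bulk REST endpoint; every line, including the last, ends with a newline.
function toNdjson(operations: object[]): string {
  if (operations.length === 0) return "";
  return operations.map((op) => JSON.stringify(op)).join("\n") + "\n";
}
```

A caller could then loop over chunk(listings, batchSize) and pass toBulkOperations('test_airbnb', batch) as the body of each bulk call, checking the errors flag in each response.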

Engineering example 2


Module configuration (es.module.ts)

import { Module } from '@nestjs/common';
import { ElasticsearchModule } from '@nestjs/elasticsearch';
 
@Module({
  imports: [
    ElasticsearchModule.register({
      node: 'http://localhost:9200',
      maxRetries: 3,
      requestTimeout: 30000
    })
  ],
  exports: [ElasticsearchModule]
})
export class EsModule {}

Service layer wrapper (search.service.ts)

import { Injectable } from '@nestjs/common';
import { ElasticsearchService } from '@nestjs/elasticsearch';
 
@Injectable()
export class SearchService {
  constructor(private readonly esService: ElasticsearchService) {}
 
  async searchHomes(params: SearchParamsDTO) {
    const { query, minPrice, maxPrice, location, radius } = params;
    
    return this.esService.search({
      index: 'test_airbnb',
      body: {
        query: {
          bool: {
            must: [{
              multi_match: {
                query,
                fields: ['name^3', 'description']
              }
            }],
            filter: [
              { range: { price: { gte: minPrice, lte: maxPrice } } },
              { 
                geo_distance: {
                  distance: `${radius}km`,
                  location
                }
              }
            ]
          }
        },
        highlight: {
          fields: { name: {}, description: {} }
        }
      }
    });
  }
 
  async logBulkInsert(logs: any[]) {
    const body = logs.flatMap(log => [
      { index: { _index: 'nginx_logs' } },
      log
    ]);
 
    return this.esService.bulk({
      refresh: 'wait_for',
      body
    });
  }
}
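
One practical wrinkle in searchHomes is that optional parameters such as minPrice or location may be missing from the request. A possible way to handle this is to build the request body in a separate pure function that adds a filter only when its inputs are present. The sketch below assumes a parameter shape (SearchParams) that the original does not define, so treat every name here as hypothetical.

```typescript
interface GeoPoint {
  lat: number;
  lon: number;
}

// Assumed shape of the search parameters; the original DTO is not shown.
interface SearchParams {
  query: string;
  minPrice?: number;
  maxPrice?: number;
  location?: GeoPoint;
  radius?: number; // kilometres
}

type Clause = Record<string, unknown>;

interface SearchBody {
  query: { bool: { must: Clause[]; filter: Clause[] } };
  highlight: { fields: Record<string, object> };
}

function buildSearchHomesBody(params: SearchParams): SearchBody {
  const filter: Clause[] = [];

  // Add the price range only for the bounds that were actually supplied.
  const price: { gte?: number; lte?: number } = {};
  if (params.minPrice !== undefined) price.gte = params.minPrice;
  if (params.maxPrice !== undefined) price.lte = params.maxPrice;
  if (price.gte !== undefined || price.lte !== undefined) {
    filter.push({ range: { price } });
  }

  // The geo filter needs both a centre point and a radius.
  if (params.location !== undefined && params.radius !== undefined) {
    filter.push({
      geo_distance: { distance: `${params.radius}km`, location: params.location },
    });
  }

  return {
    query: {
      bool: {
        must: [{ multi_match: { query: params.query, fields: ["name^3", "description"] } }],
        filter,
      },
    },
    highlight: { fields: { name: {}, description: {} } },
  };
}
```

With this split, the service method can reduce to a single search call that passes buildSearchHomesBody(params) as the body, and the query logic can be unit-tested without a running cluster.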

Controller (search.controller.ts)

import { Controller, Post, Body } from '@nestjs/common';
import { SearchService } from './search.service';
 
@Controller('search')
export class SearchController {
  constructor(private readonly searchService: SearchService) {}
 
  @Post('homes')
  async searchHomes(@Body() params: any) {
    return this.searchService.searchHomes(params);
  }
 
  @Post('logs')
  async ingestLogs(@Body() logs: any[]) {
    return this.searchService.logBulkInsert(logs);
  }
}

Key configuration tuning

  1. Connection pool configuration:

    ElasticsearchModule.register({
      nodes: [
        'http://node1:9200',
        'http://node2:9200'
      ],
      ConnectionPool: require('@elastic/elasticsearch').ConnectionPool,
      maxRetries: 5,
      sniffOnStart: true
    })
    
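  2. Security and authentication:

    auth: {
      username: process.env.ES_USER,
      password: process.env.ES_PASS
    },
    ssl: { 
      ca: readFileSync('./certs/ca.crt'),
      rejectUnauthorized: true // keep certificate verification on; false would bypass the CA check
    }
    
  3. Performance tuning parameters:

    agent: { maxSockets: 100 }, // maximum sockets (an HTTP agent option)
    compression: true, // enable compression
    suggestCompression: true // ask the server to compress responses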
    

Engineering example 3


1 ) Option 1: basic document indexing service

import { Injectable } from '@nestjs/common';
import { ElasticsearchService } from '@nestjs/elasticsearch';
 
@Injectable()
export class ListingService {
  constructor(private readonly esService: ElasticsearchService) {}
 
  async indexListing(listing: any) {
    return this.esService.index({
      index: 'airbnb',
      body: {
        ...listing,
        timestamp: new Date()
      }
    });
  }
 
  async searchListings(query: string) {
    return this.esService.search({
      index: 'airbnb',
      body: {
        query: { 
          multi_match: { 
            query, 
            fields: ["name^3", "description"] 
          } 
        }
      }
    });
  }
}

2 ) Option 2: asynchronous batch processing pipeline

import { Processor, Process } from '@nestjs/bull';
import { Job } from 'bull';
import { ElasticsearchService } from '@nestjs/elasticsearch';
 
@Processor('log-queue')
export class LogProcessor {
  constructor(private readonly esService: ElasticsearchService) {}
 
  @Process()
  async handleLogs(job: Job) {
    const bulkBody = job.data.flatMap(log => ([
      { index: { _index: 'nginx_logs' } },
      log
    ]));
    
    await this.esService.bulk({ 
      refresh: 'wait_for',
      body: bulkBody
    });
  }
}

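3 ) Option 3: security and performance tuning

# Key elasticsearch.yml settings
thread_pool.write.queue_size: 1000  # enlarge the write queue
indices.query.bool.max_clause_count: 8192  # allow more clauses in complex queries
 
# Security module integration
xpack.security.enabled: true
xpack.security.authc.api_key.enabled: true 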

Log analysis system in practice


1 ) Modeling Nginx logs

Log format:

$remote_addr - [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"

Index mapping configuration:

PUT /nginx_logs
{
  "mappings": {
    "properties": {
      "geoip": { "properties": { "location": { "type": "geo_point" } } },
      "response_time": { "type": "float" },
      "user_agent": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }
    }
  }
}

2 ) Log collection architecture

Nginx servers → Filebeat → Kafka cluster → Logstash pipeline → Elasticsearch → Kibana dashboards

3 ) Grok parsing configuration
Example Logstash filter:

filter {
  grok {
    match => { "message" => "%{IP:client} %{GREEDYDATA:ident} ..." }
  }
  date { match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ] }
  geoip { source => "client" target => "geoip" }
  useragent { source => "agent" target => "os" }
}

Production best practices

Elasticsearch cluster configuration

# elasticsearch.yml
cluster.name: production-cluster
node.name: ${HOSTNAME}
network.host: 0.0.0.0
discovery.seed_hosts: ["es-node1", "es-node2"]
cluster.initial_master_nodes: ["es-node1"]
 
# Hardware and resource tuning
bootstrap.memory_lock: true
indices.queries.cache.size: 10%
thread_pool.write.queue_size: 1000

Monitoring and alerting configuration

  1. Kibana monitoring dashboard:

    • JVM heap usage
    • Indexing rate / query rate
    • Shard allocation status
  2. Example alert rule:

    PUT _watcher/watch/cluster_health_watch
    {
      "trigger": { "schedule": { "interval": "10s" } },
      "input": { "http": { "request": { "host": "localhost", "port": 9200, "path": "/_cluster/health" } } },
      "condition": {
        "compare": { "ctx.payload.status": { "eq": "red" } }
      },
      "actions": {
        "email_alert": {
          "email": {
            "to": "admin@example.com",
            "subject": "ES Cluster Status RED",
            "body": "Cluster health status is RED at {{ctx.execution_time}}"
          }
        }
      }
    }
    

Performance optimization strategies

  1. Index design:

    • Hot/cold data tiering (ILM policy)
    • Use time-based indices for time-series data
    PUT _ilm/policy/logs_policy 
    {
      "policy": {
        "phases": {
          "hot": { "actions": { "rollover": { "max_size": "50GB" } } },
          "delete": { "min_age": "30d", "actions": { "delete": {} } }
        }
      }
    }
    
  2. Query optimization:

    • Avoid deep pagination (use search_after instead)
    • Use runtime_mappings instead of scripts
    • Set pre_filter_shard_size on aggregation queries
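
The search_after advice can be sketched as a pair of pure functions that build each page's request from the previous page's last hit. This is a minimal illustration under assumptions: the helper names are hypothetical, and the sort in the usage note relies on a unique tiebreaker field (request_id) that does not appear in the original mappings.

```typescript
type SortClause = Record<string, "asc" | "desc">;
type SortValue = string | number | null;

interface Hit {
  _id: string;
  sort?: SortValue[]; // present when the request specified a sort
}

interface PageRequest {
  size: number;
  sort: SortClause[];
  search_after?: SortValue[];
}

// The first page carries only size and sort; no from offset is used.
function firstPage(size: number, sort: SortClause[]): PageRequest {
  return { size, sort };
}

// Builds the next request from the last hit of the current page.
// Returns null when the current page was not full, meaning no more results.
function nextPage(previous: PageRequest, hits: Hit[]): PageRequest | null {
  if (hits.length < previous.size) return null;
  const last = hits[hits.length - 1];
  if (last.sort === undefined) {
    throw new Error("hits must include sort values to continue paging");
  }
  return { size: previous.size, sort: previous.sort, search_after: last.sort };
}
```

For example, firstPage(100, [{ timestamp: "desc" }, { request_id: "asc" }]) starts the scan, and each response's hits feed nextPage until it returns null. Unlike from/size, the cost of each page stays roughly constant however deep the scan goes, because the cluster never has to collect and discard the skipped results.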

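Key takeaways:

  1. Data modeling is the foundation of Elasticsearch performance; decide each field's type and how it will be used up front
  2. Autocomplete is implemented with the edge_ngram token filter and requires balancing storage cost against query efficiency
  3. On the front end, Reactivesearch makes it quick to build a production-grade search UI
  4. Log analysis combines Grok parsing with GeoIP enrichment for deeper insight
  5. Wrapping the NestJS integration in modules makes the back-end service easier to maintain
  6. Cluster operations need attention to shard strategy, ILM management, and real-time monitoring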

Best practices for configuration around Elasticsearch


1 ) Cross-origin access configuration (elasticsearch.yml)

http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS, HEAD, GET, POST, PUT, DELETE

2 ) Security hardening

  • Use the readonlyrest plugin to set up permissions
  • Use an API key in place of basic authentication:
    curl -XPOST "localhost:9200/_security/api_key" -H 'Content-Type: application/json' -d'
    { "name": "airbnb_app_key" }'
    

3 ) Performance tuning suggestions

PUT /test_airbnb/_settings
{
  "index.refresh_interval": "30s",  // refresh less often
  "index.number_of_replicas": 1     // 2 or more is recommended in production
}

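Tips for beginners:

  • edge_ngram: splits a word into its prefixes (for example, "bedroom" → ["b", "be", "bed", …]) to provide autocomplete as the user types
  • geo_distance: filters documents by their distance from a latitude/longitude point (units such as km and mi are supported)
  • Full-text search versus exact matching: text fields are analyzed into tokens, while keyword fields are matched exactly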