Elastic Stack Walkthrough: Beijing Air Quality Data Analysis in Practice, from Data Modeling to Visualization Insights and NestJS Integration

Latest recommended article published 2025-12-11 15:30:25

CC 4.0 BY-SA license


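Project Background and Data Preparation

Data source: the Beijing air quality CSV dataset published by the US Embassy (2008–2017), with hourly PM2.5 concentration readings (fields: city, parameter, date, year, month, day, hour, value).
Core questions:

Has Beijing's air quality improved over the years?
Why did the public perceive worsening smog in late 2016 when the official data showed otherwise?

Data characteristics:

Time granularity: one record per hour (must be aggregated to day/month/year for analysis)
Data quality: missing values are marked as -99 and must be filtered out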
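
The two data characteristics above (hourly granularity and the -99 marker) can be illustrated without a cluster. The following is a minimal sketch, assuming each reading is a `(date, value)` pair with dates formatted as `YYYY-MM-DD HH:MM:SS`; the function name `daily_stats` and the sample rows are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

MISSING = -99  # missing-value marker described above


def daily_stats(rows):
    """Aggregate hourly (date, value) pairs into per-day avg/max/min, skipping missing values."""
    by_day = defaultdict(list)
    for date_str, value in rows:
        if value == MISSING:
            continue  # drop invalid readings
        by_day[date_str[:10]].append(value)  # keep only the "YYYY-MM-DD" part
    return {
        day: {"avg": mean(vals), "max": max(vals), "min": min(vals)}
        for day, vals in sorted(by_day.items())
    }


# Illustrative sample, not real measurements
sample = [
    ("2016-12-01 00:00:00", 120.0),
    ("2016-12-01 01:00:00", -99),
    ("2016-12-01 02:00:00", 80.0),
    ("2016-12-02 00:00:00", 30.0),
]
result = daily_stats(sample)
print(result)
```

The same filtering and grouping is what the Elasticsearch aggregation later in the article does on the server side.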

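1 ) Data modeling

Index design:
- Index names: airquality (raw hourly data) and airquality_days (daily aggregates); some examples below use the equivalent names air_quality and air_quality_daily
- Field types: value (float), date (date); string fields use keyword so they are not analyzed (exact-value data needs no text analysis)
Index structure: the fields are city, parameter (measurement type, e.g. PM2.5), date, and value (the reading)
Design principles: disable analysis by mapping strings as keyword (note that "index": false would instead make a field unsearchable), use float for numeric fields, and map the date field as date

Index definition example

// Elasticsearch mapping example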
PUT /air_quality
{
  "mappings": {
    "properties": {
      "date": { "type": "date" },
      "value": { "type": "float" },
      "city": { "type": "keyword" },
      "parameter": { "type": "keyword" }
    }
  }
}

2 ) Data import

Tools: Filebeat + ingest pipeline (Ingest Node)
Key steps:
- Exclude invalid lines (e.g. lines starting with a, d, or c)
- Parse the CSV fields with grok:
```
%{DATA:city},%{DATA:parameter},%{DATA:date},%{INT:year},%{INT:month},%{INT:day},%{INT:hour},%{NUMBER:value}
```
- Generate a unique ID from city + date to avoid duplicate imports
- Convert the timestamp (parse the date field into ISO format)
- Remove redundant fields (e.g. year/month/day, which can be derived from date, and duration)

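Alternatively, use Logstash:

input {
  file {
    path => "/data/airquality.csv"
    start_position => "beginning"
  }
}
filter {
  # Drop comment lines here; the file input's `exclude` option matches file names, not lines
  if [message] =~ /^#/ { drop { } }
  csv {
    separator => ","
    columns => ["city", "parameter", "date", "year", "month", "day", "hour", "value"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "airquality"
    pipeline => "airquality_pipeline"
    document_id => "%{city}_%{date}"  # prevents duplicate imports
  }
}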

3 ) Elasticsearch ingest pipeline configuration:

PUT _ingest/pipeline/airquality_pipeline  
{  
  "description": "Process Beijing air quality CSV",  
  "processors": [  
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{DATA:city},%{DATA:parameter},%{DATA:date},%{INT:year},%{INT:month},%{INT:day},%{INT:hour},%{NUMBER:value}"
        ]
      }
    },
    {  
      "set": {  
        "field": "_id",  
        "value": "{{city}}_{{date}}"  
      }  
    },  
    {  
      "date": {  
        "field": "date",  
        "formats": ["yyyy-MM-dd HH:mm:ss"],  
        "target_field": "@timestamp"  
      }  
    },  
    {
      "remove": {
        "field": ["message", "duration", "error"],
        "ignore_missing": true
      }
    },
    {  
      "convert": {  
        "field": "value",  
        "type": "float"  
      }  
    }  
  ]  
}
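
To make the pipeline steps concrete, here is a rough Python equivalent of what the processors do to one CSV line: parse the fields, build the ID from city and date, normalize the timestamp, convert the value, and drop derived fields. It is only a sketch for reasoning about the transformation; the function name `process_line` and the sample line are assumptions, and the real work happens inside Elasticsearch.

```python
import csv
from datetime import datetime

FIELDS = ["city", "parameter", "date", "year", "month", "day", "hour", "value"]


def process_line(line):
    """Mimic the pipeline: parse CSV, build an ID, normalize the timestamp, drop derived fields."""
    values = next(csv.reader([line]))
    doc = dict(zip(FIELDS, values))
    doc_id = f"{doc['city']}_{doc['date']}"  # same idea as the set processor on _id
    parsed = datetime.strptime(doc["date"], "%Y-%m-%d %H:%M:%S")
    doc["@timestamp"] = parsed.isoformat()  # date processor target field
    doc["value"] = float(doc["value"])  # convert processor
    for derived in ("year", "month", "day"):
        doc.pop(derived, None)  # remove fields that can be derived from date
    return doc_id, doc


# Illustrative input line, not taken from the dataset
doc_id, doc = process_line("Beijing,PM2.5,2016-12-01 08:00:00,2016,12,1,8,153")
print(doc_id, doc)
```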

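Data Aggregation and Analysis Strategy

Goal: aggregate the hourly data to daily granularity (coarser data makes trends easier to analyze).
Core logic of the Python aggregation script:

from datetime import datetime

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query = {
  "size": 0,  # only the aggregation buckets are needed
  "query": {"range": {"value": {"gte": 1}}},  # drop invalid readings such as -99
  "aggs": {
    "daily": {
      # "format" makes key_as_string a plain yyyy-MM-dd string
      "date_histogram": {"field": "date", "calendar_interval": "1d", "format": "yyyy-MM-dd"},
      "aggs": {"avg_pm25": {"avg": {"field": "value"}}}
    }
  }
}
res = es.search(index="air_quality", body=query)

# Write the daily averages to the new index air_quality_daily
for bucket in res['aggregations']['daily']['buckets']:
    if bucket['doc_count'] == 0:
        continue  # skip days without valid readings
    day = datetime.strptime(bucket['key_as_string'], "%Y-%m-%d")
    doc = {
        "date": bucket['key_as_string'],
        "avg_pm25": bucket['avg_pm25']['value'],
        "year": day.year,
        "month": day.month
    }
    es.index(index="air_quality_daily", document=doc)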

Or use the following variant.

Goal: aggregate the hourly data to daily granularity (daily average/maximum/minimum) and store it in the new index airquality_days.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Aggregation query: group by day and compute PM2.5 statistics
query = {
  "size": 0,
  "query": { "range": { "value": { "gte": 1 } } },  # drop invalid readings
  "aggs": {
    "days": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "1d"
      },
      "aggs": {
        "avg_value": { "avg": { "field": "value" } },
        "max_value": { "max": { "field": "value" } },
        "min_value": { "min": { "field": "value" } }
      }
    }
  }
}

response = es.search(index="airquality", body=query)

# Write to the new index
for day in response["aggregations"]["days"]["buckets"]:
    if day["doc_count"] == 0:
        continue  # skip days without valid readings
    doc = {
        "date": day["key_as_string"],
        "year": day["key_as_string"][:4],
        "month": day["key_as_string"][5:7],
        "avg_value": day["avg_value"]["value"],
        "max_value": day["max_value"]["value"],
        "min_value": day["min_value"]["value"]
    }
    es.index(index="airquality_days", document=doc)
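
The bucket-to-document step can be checked offline against a hand-written response. The sketch below assumes the response shape of the `days` aggregation shown above and the default `key_as_string` format of a date field (an ISO timestamp); `buckets_to_docs` and the sample response are hypothetical. A date_histogram also returns empty buckets for days without matching readings, whose metrics are null, so the sketch skips them.

```python
def buckets_to_docs(response):
    """Turn date_histogram buckets into flat daily documents, skipping empty days."""
    docs = []
    for bucket in response["aggregations"]["days"]["buckets"]:
        if bucket["doc_count"] == 0:
            continue  # empty bucket: avg/max/min would be None
        day = bucket["key_as_string"][:10]  # "2016-12-01T00:00:00.000Z" -> "2016-12-01"
        docs.append({
            "date": day,
            "year": int(day[:4]),
            "month": int(day[5:7]),
            "avg_value": bucket["avg_value"]["value"],
            "max_value": bucket["max_value"]["value"],
            "min_value": bucket["min_value"]["value"],
        })
    return docs


# Hand-written response with made-up numbers, shaped like an Elasticsearch reply
fake_response = {"aggregations": {"days": {"buckets": [
    {"key_as_string": "2016-12-01T00:00:00.000Z", "doc_count": 22,
     "avg_value": {"value": 95.5}, "max_value": {"value": 180.0}, "min_value": {"value": 12.0}},
    {"key_as_string": "2016-12-02T00:00:00.000Z", "doc_count": 0,
     "avg_value": {"value": None}, "max_value": {"value": None}, "min_value": {"value": None}},
]}}}
docs = buckets_to_docs(fake_response)
print(docs)
```

Storing year and month as integers here is a design choice of the sketch; it keeps numeric filters straightforward.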

Data Analysis and Visualization

Answering the key questions, and data aggregation:

Multi-year trend analysis:

Aggregate the hourly data to daily granularity (daily PM2.5 max, min, and avg) with a Python script:

from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
# Aggregation query: group by day and compute PM2.5 statistics
query = {
    "size": 0,
    "query": {"range": {"value": {"gte": 1}}},  # drop invalid readings such as -99
    "aggs": {
        "days": {
            "date_histogram": {"field": "date", "calendar_interval": "1d"},
            "aggs": {"pm25_stats": {"stats": {"field": "value"}}}
        }
    }
}
response = es.search(index="airquality", body=query)
# Write to the new index airquality_days
for bucket in response['aggregations']['days']['buckets']:
    if bucket['doc_count'] == 0:
        continue  # skip days without valid readings
    doc = {
        "date": bucket['key_as_string'],
        "year": bucket['key_as_string'][:4],
        "month": bucket['key_as_string'][5:7],
        "day": bucket['key_as_string'][8:10],
        "pm25_max": bucket['pm25_stats']['max'],
        "pm25_min": bucket['pm25_stats']['min'],
        "pm25_avg": bucket['pm25_stats']['avg']
    }
    es.index(index="airquality_days", document=doc)

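Kibana visualization conclusions:
- Trend chart: a stacked bar chart shows the yearly share of each air quality level (classified by AQI).
  - AQI levels:
    - Good (0–50): green
    - Moderate (51–100): yellow
    - Unhealthy (101–200): orange
    - Very Unhealthy (>200): red
  - Conclusion: from 2008 to 2017 the share of blue-sky days (AQI<150) rose from 38% to 47%, an overall improvement.
- Explaining the late-2016 contradiction:
  - In winter (October 2016 to February 2017) there were 60 smog days (AQI>200), against only 45 in 2015, and the mean PM2.5 was higher, which worsened public perception.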

Advanced Kibana charts:

Computed field (scripted field), using the four levels listed above:

def pm25 = doc['pm25_max'].value;
if (pm25 <= 50) return "1-Good";
else if (pm25 <= 100) return "2-Moderate";
else if (pm25 <= 200) return "3-Unhealthy";
else return "4-VeryUnhealthy";

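Time comparison (offset):

# Compare winter 2016 with winter 2015. The avg aggregation has no offset parameter, so each winter gets its own date filter
GET /airquality_days/_search
{
  "size": 0,
  "aggs": {
    "winter_2016": { "filter": { "range": { "date": { "gte": "2016-10-01", "lt": "2017-03-01" } } },
                     "aggs": { "avg_pm25": { "avg": { "field": "pm25_avg" } } } },
    "winter_2015": { "filter": { "range": { "date": { "gte": "2015-10-01", "lt": "2016-03-01" } } },
                     "aggs": { "avg_pm25": { "avg": { "field": "pm25_avg" } } } }
  }
}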

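Kibana Visualization Analysis

1 ) Core question 1: long-term air quality trend

Conclusion:

From 2008 to 2017 the share of blue-sky days in Beijing (AQI≤100) rose from 32% to 47% and the proportion of polluted days fell, a sustained overall improvement.

Visualization:

Stacked bar chart (AQI level distribution per year)

Key settings:
- Y-axis: count
- X-axis: date_histogram (yearly interval)
- Split series: scripted_field (AQI level)

Level calculation script (Kibana scripted field):

if (doc['max_value'].value <= 50) return "1_good";
else if (doc['max_value'].value <= 100) return "2_moderate";
else if (doc['max_value'].value <= 150) return "3_unhealthy_sensitive";
// ... remaining levels

Percentage area chart (simplified air quality trend)
- Three groups:
  - Good (AQI≤50)
  - Unhealthy (AQI>100)
  - Very Unhealthy (AQI>200)
- Key setting: enable the "stacked as percentage" mode.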
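
What the percentage-stacked charts compute can be sketched in a few lines of Python. This assumes daily records given as `(date, max_value)` pairs and uses the simplified four-level scale from the conclusions earlier in the article (0–50, 51–100, 101–200, >200); the names `aqi_level` and `yearly_share` and the sample values are made up.

```python
from collections import Counter, defaultdict


def aqi_level(value):
    """Map a daily maximum to one of the four simplified levels."""
    if value <= 50:
        return "1_good"
    if value <= 100:
        return "2_moderate"
    if value <= 200:
        return "3_unhealthy"
    return "4_very_unhealthy"


def yearly_share(days):
    """Return {year: {level: percent of days}} from (date, max_value) pairs."""
    counts = defaultdict(Counter)
    for date_str, max_value in days:
        counts[date_str[:4]][aqi_level(max_value)] += 1
    return {
        year: {lvl: round(100 * n / sum(c.values()), 1) for lvl, n in sorted(c.items())}
        for year, c in sorted(counts.items())
    }


# Illustrative values only
days = [
    ("2015-01-01", 40), ("2015-01-02", 120), ("2015-01-03", 250), ("2015-01-04", 90),
    ("2016-01-01", 30), ("2016-01-02", 45), ("2016-01-03", 180),
]
shares = yearly_share(days)
print(shares)
```

The level prefixes (`1_`, `2_`, ...) keep the series in severity order when they are sorted alphabetically, which is the same trick the scripted field uses.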

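2 ) Core question 2: the late-2016 smog perception contradiction

Conclusion:

Full-year data: the share of blue-sky days in 2016 (47%) was higher than in 2015 (43%).
Winter data: mean PM2.5 in winter 2016 was 18% higher than in 2015, and smog days (AQI>200) made up 40% of the season (27% in 2015).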
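
The winter comparison boils down to two numbers per season: the mean PM2.5 and the share of smog days. A minimal sketch follows, assuming daily records with `date`, `avg_value`, and `max_value` keys and taking winter as October 1 through the end of February, as defined earlier in the article; `winter_summary` and the sample records are hypothetical.

```python
from datetime import date
from statistics import mean


def winter_summary(days, start_year, haze_threshold=200):
    """Summarize Oct 1 of start_year up to (not including) Mar 1 of the next year."""
    start, end = date(start_year, 10, 1), date(start_year + 1, 3, 1)
    window = [d for d in days if start <= d["date"] < end]
    haze = [d for d in window if d["max_value"] > haze_threshold]
    return {
        "days": len(window),
        "avg_pm25": mean(d["avg_value"] for d in window),
        "haze_share_pct": 100 * len(haze) / len(window),
    }


# Made-up records, only to exercise the function
records = [
    {"date": date(2015, 11, 1), "avg_value": 100.0, "max_value": 150.0},
    {"date": date(2016, 1, 15), "avg_value": 200.0, "max_value": 300.0},
    {"date": date(2016, 6, 1), "avg_value": 50.0, "max_value": 60.0},
    {"date": date(2016, 11, 1), "avg_value": 180.0, "max_value": 250.0},
    {"date": date(2017, 2, 1), "avg_value": 220.0, "max_value": 320.0},
]
w2015 = winter_summary(records, 2015)
w2016 = winter_summary(records, 2016)
print(w2015, w2016)
```

The function raises an error for a season with no records, which is acceptable for a sketch but would need handling in real use.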

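Visual comparison:

Time series comparison chart (winter 2015 vs winter 2016)

Timelion expression for the share of days with max_value above 150 (TSVB panels are configured in the UI, where the equivalent is the series option "Offset series time by"):

.es(index=airquality_days, timefield=date, q='max_value:>150').divide(.es(index=airquality_days, timefield=date)).multiply(100).label('2016'),
.es(index=airquality_days, timefield=date, q='max_value:>150', offset=-1y).divide(.es(index=airquality_days, timefield=date, offset=-1y)).multiply(100).label('2015')

Daily AQI heatmap
- X-axis: date (daily interval)
- Y-axis: AQI level (rate_level)
- Color: PM2.5 concentration (max_value)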

Engineering example 1

1 ) Option 1: basic data import and query service

Install dependencies:

npm install @nestjs/elasticsearch @elastic/elasticsearch csv-parser

NestJS module configuration (elastic.module.ts):

import { Module } from '@nestjs/common';  
import { ElasticsearchModule } from '@nestjs/elasticsearch';  
 
@Module({  
  imports: [  
    ElasticsearchModule.register({  
      node: 'http://localhost:9200',  
      auth: { username: 'elastic', password: 'your_password' }  
    }),  
  ],  
  exports: [ElasticsearchModule],  
})  
export class ElasticModule {}

Data import service (air-import.service.ts):

import { Injectable } from '@nestjs/common';  
import { ElasticsearchService } from '@nestjs/elasticsearch';  
import * as fs from 'fs';  
import * as csv from 'csv-parser';  
 
@Injectable()  
export class AirImportService {  
  constructor(private readonly esService: ElasticsearchService) {}  
 
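  importCSV(filePath: string): Promise<void> {
    const bulkActions: object[] = [];
    const pad = (n: string) => n.padStart(2, '0'); // zero-pad month/day/hour

    // Wrap the stream in a Promise so callers can await the whole import
    return new Promise<void>((resolve, reject) => {
      fs.createReadStream(filePath)
        .pipe(csv())
        .on('data', (row) => {
          const value = parseFloat(row['Value']);
          if (Number.isNaN(value) || value === -99) return; // skip missing readings
          bulkActions.push({ index: { _index: 'airquality' } });
          bulkActions.push({
            city: row['Site'],
            date: `${row['Year']}-${pad(row['Month'])}-${pad(row['Day'])} ${pad(row['Hour'])}:00:00`,
            value,
            parameter: 'PM2.5',
          });
        })
        .on('error', reject)
        .on('end', async () => {
          try {
            await this.esService.bulk({ body: bulkActions });
            console.log(`Imported ${bulkActions.length / 2} records`);
            resolve();
          } catch (err) {
            reject(err);
          }
        });
    });
  }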
}

2 ) Option 2: aggregation analysis and API exposure
Aggregation query service (air-analysis.service.ts):

import { Injectable } from '@nestjs/common';  
import { ElasticsearchService } from '@nestjs/elasticsearch';  
 
@Injectable()  
export class AirAnalysisService {  
  constructor(private readonly esService: ElasticsearchService) {}  
 
  async getYearlyAQI(year: number): Promise<any> {  
    const response = await this.esService.search({  
      index: 'airquality_days',  
      body: {  
        size: 0,  
        query: { term: { year } },  
        aggs: {  
          aqi_levels: {  
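            terms: { field: 'rate_level' }  // rate_level must exist in Elasticsearch (e.g. as a runtime field); Kibana scripted fields are not visible to the search API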
          }  
        }  
      }  
    });  
    return response.aggregations.aqi_levels.buckets;  
  }  
}

Controller (air.controller.ts):

import { Controller, Get, Param, ParseIntPipe } from '@nestjs/common';
import { AirAnalysisService } from './air-analysis.service';

@Controller('air')
export class AirController {
  constructor(private readonly analysisService: AirAnalysisService) {}

  // Route params arrive as strings; ParseIntPipe converts :year to a number
  @Get('yearly/:year')
  async getYearlyData(@Param('year', ParseIntPipe) year: number) {
    return this.analysisService.getYearlyAQI(year);
  }
}

3 ) Option 3: real-time monitoring and alerting
Elasticsearch Watcher configuration:

# Alert when any reading has PM2.5 >= 200; without a condition the action would fire on every run
PUT _watcher/watch/pm25_alert
{
  "trigger": { "schedule": { "interval": "1h" } },
  "input": {
    "search": {
      "request": {
        "indices": ["airquality"],
        "body": {
          "query": {
            "range": { "value": { "gte": 200 } }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "email_alert": {
      "email": {
        "to": "admin@example.com",
        "subject": "High PM2.5 Alert",
        "body": "PM2.5 reached 200 or more in {{ctx.payload.hits.total}} readings."
      }
    }
  }
}

NestJS alert subscription service:

import { Injectable } from '@nestjs/common';  
import { ElasticsearchService } from '@nestjs/elasticsearch';  
 
@Injectable()  
export class AlertService {  
  constructor(private readonly esService: ElasticsearchService) {}  
 
  async subscribeAlerts(): Promise<void> {  
    // Simulates the Watcher callback by polling (use a webhook in production)
    setInterval(async () => {  
      const result = await this.esService.search({  
        index: 'airquality',  
        body: { query: { range: { value: { gte: 200 } } } }  
      });  
      if (result.hits.total.value > 0) {  
        this.sendAlert(result.hits.total.value);  
      }  
    }, 3600000); // check every hour
  }  
 
  private sendAlert(count: number): void {  
    console.log(`ALERT: ${count} readings with PM2.5 >= 200!`);
    // Integrate an SMS/email service here (e.g. Nodemailer)
  }  
}

Engineering example 2

1 ) Option 1: basic client calls

import { Injectable } from '@nestjs/common';
import { Client } from '@elastic/elasticsearch';
 
@Injectable()
export class ElasticService {
  private readonly client: Client;
 
  constructor() {
    this.client = new Client({ node: 'http://localhost:9200' });
  }
 
  async searchAirQuality(query: any) {
    return this.client.search({
      index: 'air_quality_daily',
      body: query 
    });
  }
}

2 ) Option 2: modular encapsulation (Repository pattern)

// elastic.module.ts
import { Module } from '@nestjs/common';
import { ElasticsearchModule } from '@nestjs/elasticsearch';
 
@Module({
  imports: [ElasticsearchModule.register({ node: 'http://localhost:9200' })],
  exports: [ElasticsearchModule],
})
export class ElasticModule {}
 
// air-quality.repository.ts
import { Injectable } from '@nestjs/common';
import { ElasticsearchService } from '@nestjs/elasticsearch';
 
@Injectable()
export class AirQualityRepository {
  constructor(private readonly esService: ElasticsearchService) {}
 
  async getDailySummary(year: number) {
    return this.esService.search({
      index: 'air_quality_daily',
      body: { query: { match: { year } } }
    });
  }
}

3 ) Option 3: advanced configuration (dynamic index + pipeline)

import { Injectable } from '@nestjs/common';
import { Client, ClientOptions } from '@elastic/elasticsearch';

@Injectable()
export class ConfigurableElasticService {
  private client: Client;
 
  init(options: ClientOptions) {
    this.client = new Client(options);
  }
 
  async createPipeline(id: string, pipelineConfig: any) {
    return this.client.ingest.putPipeline({ id, body: pipelineConfig });
  }
}
 
// Usage example
const elasticService = new ConfigurableElasticService();
elasticService.init({ node: 'http://prod-es:9200' });
elasticService.createPipeline('air_quality_pipeline', { ... });

Engineering example 3

1 ) Option 1: basic data import service

import { Controller, Post } from '@nestjs/common';  
import { ElasticsearchService } from '@nestjs/elasticsearch';  
 
@Controller('data')  
export class DataImportController {  
  constructor(private readonly esService: ElasticsearchService) {}  
 
  @Post('import')  
  async importData() {  
    const csvData = await this.readCSV('airquality.csv');  
    const body = csvData.flatMap(doc => [  
      { index: { _index: 'airquality', _id: `${doc.site}_${doc.date}` } },  
      doc  
    ]);  
 
    await this.esService.bulk({ body });  
  }  
 
  private async readCSV(path: string): Promise<any[]> {
    // Implemented with csv-parser (omitted)
    throw new Error('readCSV is omitted in this excerpt');
  }
}

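2 ) Option 2: dynamic aggregation query API

// Controller methods (the enclosing class injects ElasticsearchService as esService)
import { Get, Query } from '@nestjs/common';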
 
@Get('aggregate')  
async aggregate(@Query() params: { level: string }) {  
  const query = {  
    aggs: {  
      yearly_stats: {  
        date_histogram: { field: "date", calendar_interval: "1y" },  
        aggs: { level_count: { filter: { range: { max_value: this.getRange(params.level) } } } }  
      }  
    }  
  };  
  return this.esService.search({ index: 'airquality_days', body: query });  
}  
 
private getRange(level: string) {  
  const ranges = {  
    good: { lte: 50 },  
    unhealthy: { gt: 100, lte: 200 }  
    // ...  
  };  
  return ranges[level];  
}

3 ) Option 3: script field computed at query time (runtime field)

// Add rate_level as a runtime field from NestJS; runtime scripts emit values instead of returning them
async setScriptedField() {
  await this.esService.indices.putMapping({
    index: 'airquality_days',
    body: {
      runtime: {
        rate_level: {
          type: "keyword",
          script: {
            source: `
              def val = doc['max_value'].value;
              if (val <= 50) emit("1_good");
              else if (val <= 100) emit("2_moderate");
              // ...
            `
          }
        }
      }
    }
  });
}

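Technical Details and Best Practices

Performance optimization:

Time-based indices: split the data by time range (e.g. an index named yearly-2016) to speed up queries
Hot-cold architecture: move historical data to cold nodes (using an ILM policy)
When aggregating, prefer filter over query to narrow the dataset.

For time-series data, index sorting (index.sort) speeds up range queries. It is a static setting, so it has to be defined when the index is created:

PUT /airquality_days
{
  "settings": { "index": { "sort.field": "date", "sort.order": "desc" } },
  "mappings": { "properties": { "date": { "type": "date" } } }
}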

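Data accuracy:
- Use pipeline error handling (on_failure handlers) to record documents that fail to import
- Data validation: add Joi validation in the NestJS service layer
Data consistency:
- Use document_id to prevent duplicate imports (e.g. a site_date combination).
- Remove unused fields in the pipeline to reduce storage overhead.
Scalability:
- Microservice integration: stream data into Elasticsearch through Kafka
- Cluster deployment: configure a 3-node ES cluster (1 master + 2 data nodes)
Advanced Kibana techniques:
- Dynamic comparison: use offset=-1y to compare against the same period of the previous year automatically
- Percentage mode: enable "stack as percentage" in area charts to show how the shares change
- Heatmap configuration: use terms buckets (AQI level) on the Y-axis and date_histogram (daily granularity) on the X-axis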
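
The document_id point above is what makes re-imports idempotent: the same reading always maps to the same ID, so indexing it again overwrites rather than duplicates. A minimal sketch of building `_bulk` request lines with such IDs, assuming documents carry `city` and `date` keys; `bulk_lines` and the sample document are hypothetical.

```python
import json


def bulk_lines(docs, index="airquality"):
    """Yield NDJSON lines for the _bulk API: one action line plus one source line per document."""
    for doc in docs:
        doc_id = f"{doc['city']}_{doc['date']}"  # deterministic ID from city + date
        yield json.dumps({"index": {"_index": index, "_id": doc_id}})
        yield json.dumps(doc)


# Illustrative reading, submitted twice to mimic a repeated import
reading = {"city": "Beijing", "date": "2016-12-01 08:00:00", "value": 153.0}
lines = list(bulk_lines([reading, reading]))
print("\n".join(lines))
```

Both action lines carry the same `_id`, so Elasticsearch would keep a single document for this reading.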

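Key Techniques for the Kibana Dashboard

Stacked bar chart (yearly trend)
- Configuration:
  - Metrics: split by AQI level (Filters aggregation).
  - Bucket: aggregate by year (Date Histogram, 1-year interval).
  - Stacking mode: Percentage (shows how the shares change).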
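
The chart configuration above corresponds roughly to a yearly date_histogram with a filters sub-aggregation. The sketch below only builds the request body as a Python dict, assuming the `airquality_days` fields `date` and `max_value` and the simplified four-level scale used earlier; `LEVELS` and `yearly_level_body` are made-up names, and the body would still need to be sent with a client.

```python
LEVELS = {
    "1_good": {"lte": 50},
    "2_moderate": {"gt": 50, "lte": 100},
    "3_unhealthy": {"gt": 100, "lte": 200},
    "4_very_unhealthy": {"gt": 200},
}


def yearly_level_body(value_field="max_value", date_field="date"):
    """Build a search body: one bucket per year, split into named AQI-level filters."""
    level_filters = {
        name: {"range": {value_field: bounds}} for name, bounds in LEVELS.items()
    }
    return {
        "size": 0,  # only aggregation results are needed
        "aggs": {
            "by_year": {
                "date_histogram": {"field": date_field, "calendar_interval": "1y"},
                "aggs": {"levels": {"filters": {"filters": level_filters}}},
            }
        },
    }


body = yearly_level_body()
print(body)
```

Named filters give each level a stable bucket key, so a chart or API consumer does not depend on a scripted field existing in Kibana.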

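Time comparison (YoY analysis)

Offset: compare directly with the same period of the previous year. In Kibana this is a per-series time offset; in a raw Elasticsearch request the avg aggregation has no offset parameter, so each year gets its own filter:

"aggs": {
  "current_year": { "filter": { "range": { "date": { "gte": "2016-01-01", "lt": "2017-01-01" } } },
                    "aggs": { "avg_value": { "avg": { "field": "value" } } } },
  "previous_year": { "filter": { "range": { "date": { "gte": "2015-01-01", "lt": "2016-01-01" } } },
                     "aggs": { "avg_value": { "avg": { "field": "value" } } } }
}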

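Environment Configuration and Optimization Tips

Elasticsearch configuration:
- Enable index lifecycle management (ILM) and roll indices over by time (e.g. monthly).
- Set refresh_interval: 30s to improve write performance.
NestJS best practices:
- Use an interceptor to handle ES request errors in one place.
- Manage ES connection parameters with environment variables (through ConfigModule).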

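Learning Resources and Further Practice

Resource type	Recommended content
Official docs	Elasticsearch Aggregations
Dataset	Kaggle air quality datasets
Hands-on case	Build a product sales dashboard with the `ecommerce` sample data
Community	Elastic Chinese community (search for @rockbean to ask questions)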