Time Series Pattern Recognition
· 1. Introduction· 2. Exploratory Data Analysis ∘ 2.1 Pattern Changes ∘ 2.2 Correlation Between Features· 3. Anomaly Detection and Pattern Recognition ∘ 3.1 Point Anomaly Detection (System Fault) ∘ 3.2 Collective Anomaly Detection (External Event) ∘ 3.3 Clustering and Pattern Recognition (External Event)· 4. Conclusion·
Note: The detailed project report and the datasets used in this post can be found in my GitHub Page.
1. Introduction
This project was assigned to me by a client. No non-disclosure agreement is required and the project does not contain any sensitive information, so I decided to make it public as part of my personal data science portfolio, with the client’s information anonymized.
The project provides two data sets, each consisting of one week of sensor readings, to accomplish the following four tasks:
1. Find anomalies in the data set to automatically flag events
2. Categorize anomalies as “System fault” or “external event”
3. Provide any other useful conclusions from the pattern in the data set
4. Visualize inter-dependencies of the features in the dataset
In this report, I briefly walk through the steps I used for data analysis, the visualization of feature correlations, the machine learning techniques used to automatically flag “system faults” and “external events”, and my findings from the data.
2. Exploratory Data Analysis
My code and results in this section can be found here.
The dataset comes with two CSV files, both of which can be accessed from my GitHub Page. I first import them and concatenate them into one Pandas dataframe in Python. The columns are then rearranged, and all of them are dropped except the 11 features we are interested in:
- Ozone
- Hydrogen Sulfide
- Total VOCs
- Carbon Dioxide
- PM 1
- PM 2.5
- PM 10
- Temperature (Internal & External)
- Humidity (Internal & External)
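The import-and-concatenate step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names in `FEATURES`, the `Timestamp` column, the `Device ID` column, and the helper names `load_readings` and `_toy_csv` are all assumptions, and two small in-memory CSVs stand in for the real files.

```python
import io

import pandas as pd

# Hypothetical column names; the headers in the real CSV files may differ.
FEATURES = [
    "Ozone", "Hydrogen Sulfide", "Total VOCs", "Carbon Dioxide",
    "PM 1", "PM 2.5", "PM 10",
    "Temperature (Internal)", "Temperature (External)",
    "Humidity (Internal)", "Humidity (External)",
]


def load_readings(sources, time_col="Timestamp"):
    """Read each CSV, stack them, and keep only the 11 feature columns."""
    frames = [pd.read_csv(src, parse_dates=[time_col]) for src in sources]
    df = pd.concat(frames, ignore_index=True)
    # Index by time so later steps can work on an ordered series.
    df = df.sort_values(time_col).set_index(time_col)
    return df[FEATURES]


def _toy_csv(start, rows=3):
    """Build a small in-memory CSV that stands in for one week of readings."""
    idx = pd.date_range(start, periods=rows, freq="min")
    toy = pd.DataFrame(1.0, index=idx, columns=FEATURES)
    toy["Device ID"] = "sensor-A"  # an extra column that should be dropped
    toy.index.name = "Timestamp"
    return io.StringIO(toy.to_csv())


df = load_readings([_toy_csv("2020-05-26 00:00"), _toy_csv("2020-06-02 00:00")])
print(df.shape)
```

With real files, the `sources` argument would simply be the two CSV paths instead of the in-memory buffers.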
The timestamps span May 26 to June 9, 2020 (14 whole days in total) in the EDT (GMT-4) time zone. Subtracting consecutive timestamps shows that the interval between readings varies, ranging from 7 seconds to 3552 seconds. The five most frequent intervals are listed in Table 1. Most of them are close to 59 or 60 seconds, so it can be concluded that the sensor takes a reading every minute. However, unless it results from deliberate interference, the inconsistency in reading intervals might be worth looking into, since it could cause trouble in later time series analysis.
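The interval check described above can be sketched as follows. This is only an illustration under assumptions: the function name `interval_summary` and the toy timestamps are made up for the example, and the real analysis runs on the full two weeks of readings.

```python
import pandas as pd


def interval_summary(timestamps, top=5):
    """Return the most frequent gaps between consecutive readings, in seconds,
    together with the shortest and longest gap."""
    ts = pd.Series(pd.to_datetime(timestamps)).sort_values()
    # Difference between each reading and the previous one, in seconds.
    gaps = ts.diff().dropna().dt.total_seconds()
    return gaps.value_counts().head(top), gaps.min(), gaps.max()


# Toy timestamps (not the project data): mostly one-minute spacing,
# with one 59-second gap and one 7-second gap mixed in.
toy = [
    "2020-05-26 00:00:00",
    "2020-05-26 00:01:00",
    "2020-05-26 00:02:00",
    "2020-05-26 00:02:59",
    "2020-05-26 00:03:06",
]
counts, shortest, longest = interval_summary(toy)
print(counts)
print(shortest, longest)
```

A frequency table like `counts` is what makes the dominant sampling period visible, while the minimum and maximum expose the irregular gaps that may need resampling or interpolation before time series modeling.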

The time series for the individual features are on different scales, so they are normalized for better visualization and machine learning efficiency. Then they are plotted and visually inspected t