Chapter 5: Data Preparation
So far, we have covered the fundamentals of time series and Apache Spark, as well as the full lifecycle of a time series analysis project. In this chapter, we will dive into the key steps of organizing, cleaning, and transforming time series data for effective analysis. This includes handling missing values, dealing with outliers, and structuring the data to fit Spark's distributed computing model. This information is invaluable in helping you ensure data quality and compatibility with Spark, laying a solid foundation for accurate and efficient time series analysis. Proper data preparation enhances the reliability of the subsequent analytical process, making this chapter an essential prerequisite for extracting meaningful insights from time-related datasets with Spark.
In this chapter, we will cover the following main topics:
* Data ingestion and persistence
* Data quality checks and cleaning
* Transformations
Technical requirements
This chapter is mostly hands-on coding, covering the data preparation steps commonly found in time series analysis projects. The code for this chapter can be found in the `ch5` folder of the book's GitHub repository, at the following URL:
github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch5
Note
The code for this chapter is meant to be used with Databricks Community Edition, following the approach explained in *Chapter 1* and in this chapter.
Data ingestion and persistence
In this section, we will cover how to get time series data from data sources and how to persist the resulting datasets to storage.
Ingestion
Ingestion is the process of retrieving data from source systems for further processing and analysis. It can be performed in batch, ingesting a large amount of data in one go, or run automatically on a schedule, such as nightly. Alternatively, if the data is continuously produced by the source system and needs to be captured in real time, Structured Streaming can be used as another ingestion method.
Note
Technically, we can code the data ingestion process as Structured Streaming and configure it to run at triggered intervals. This provides the flexibility to adapt to changing business requirements on data freshness without having to redevelop the ingestion process.
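As a minimal sketch of this idea (the source directory, schema, checkpoint, and output paths below are illustrative and are not taken from the book's code):

```python
# Hypothetical locations; adjust to your environment
source_path = "/data/incoming/power/"
checkpoint_path = "/data/_checkpoints/power_ingest"
output_path = "/data/bronze/power"

stream_df = (
    spark.readStream
    .format("csv")
    .option("header", True)
    .option("sep", ";")
    .schema("ts TIMESTAMP, Global_active_power DOUBLE")  # streaming reads need an explicit schema
    .load(source_path)
)

query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="1 hour")  # or .trigger(availableNow=True) for a one-off catch-up run
    .start(output_path)
)
```

Changing only the trigger setting is then enough to move between near-real-time and scheduled batch-like behavior.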
This chapter will focus on batch ingestion, which is by far the most common approach. We will also briefly discuss Structured Streaming, which is rapidly gaining adoption and, in some organizations, has even overtaken batch ingestion.
Batch ingestion
Batch ingestion is typically done from file storage or from a database.
From file storage
As we have seen in the hands-on sections of the previous chapters, reading files is a common batch ingestion method. With Apache Spark, this can be done with `spark.read()`:
df = spark.read.csv("file_path", header=True, sep=";", inferSchema=True)
In this example, we read a CSV-format file from the `file_path` storage location. The first row of the file contains the header, and the columns are separated by the `;` character. We want Spark to automatically infer the columns present in the file and their data types with `inferSchema`.
This example is based on the code in `ts-spark_ch5_1.dbc`, which we can import into Databricks Community Edition from the GitHub location for Chapter 5 mentioned in the *Technical requirements* section, following the method explained in *Chapter 1*.
The URL for the code is github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch5/ts-spark_ch5_1.dbc.
The ingested data can be further processed and analyzed as per the code examples provided in this chapter, as shown in *Figure 5.1*.
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_01.jpg
Figure 5.1: Viewing the ingested data
When reading files, it is also possible to read multiple files from a storage folder by providing the folder location instead of a specific file location. This is a common pattern for file ingestion. Another frequently used feature is to provide a filter (`pathGlobFilter`) to include only the file names matching a pattern.
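For instance, a folder read with a glob filter might look like the following sketch (the folder path and pattern here are illustrative):

```python
df = (
    spark.read
    .option("pathGlobFilter", "*.csv")      # only include file names matching this pattern
    .option("recursiveFileLookup", True)    # optionally also scan subfolders
    .csv("folder_path", header=True, sep=";", inferSchema=True)
)
```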
The `spark.read` command has many other options, depending on the data source being read. The following Apache Spark documentation on data sources details these options:
spark.apache.org/docs/latest/sql-data-sources.html
From a database
Another commonly used type of data source is a relational database. The following is an example of reading from PostgreSQL:
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.load()
This is further detailed in the following documentation: spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Data from specialized time series databases, such as QuestDB, can be ingested in a similar way, as follows:
df = spark.read.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:8812/questdb") \
.option("driver", "org.postgresql.Driver") \
.option("user", "admin") \
.option("password", "quest") \
.option("dbtable", "timeseries_table") \
.load()
This is further detailed in the following documentation:
questdb.io/blog/integrate-apache-spark-questdb-time-series-analytics/
Note
You will need to include the JDBC driver for the specific database on the Spark classpath. The previously referenced documentation explains this.
Structured Streaming
For event-driven or near-real-time processing with Apache Spark, time series data can be ingested from streaming sources such as Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub, and Azure Event Hubs. This typically involves setting up Spark Structured Streaming with the corresponding data source connector.
The following example shows how to ingest data from Apache Kafka with Spark:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribe", "topic1") \
.load()
The Apache Spark documentation provides more details on reading from streaming sources:
spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
Once the data has been ingested, the next step is to persist it to storage for further processing, as we will see next.
Persistence
Data is commonly persisted to files on disk or to a database. For files, Apache Spark offers a mature solution in Delta Lake, an open source storage protocol.
Note
Apache Iceberg is another common open source storage protocol.
Delta provides ACID transactions for Apache Spark and big data workloads, effectively combining the advantages of file storage and database storage, a combination known as the lakehouse (a merger of the data lake and the data warehouse). Delta is based on the Parquet file format and provides features such as schema enforcement, data versioning, and time travel.
The following is an example of how to persist time series data in the Delta storage format using Python.
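A minimal sketch of such a write, assuming `delta_table_path` and `table_name` are defined earlier in the notebook (the notebook's actual options may differ):

```python
(
    df.write
    .format("delta")
    .mode("overwrite")                 # replace any existing data at this location
    .option("path", delta_table_path)  # storage location of the Delta files
    .saveAsTable(table_name)           # register the data as a named table
)
```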
In this example, we write the DataFrame `df` to the `delta_table_path` storage location. The overwrite mode means that existing data at this location will be overwritten. With Delta format, the data is written as a table that is given the name specified in `table_name`.
This example is based on the code in `ts-spark_ch5_1.dbc`, which we imported in the earlier section on batch ingestion.
There are many other options for the `spark.write` command, depending on the destination being written to. The following Apache Spark documentation on saving details these options:
[`spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables`](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables)
When the data is persisted in Delta format, in addition to the data, metadata is also stored together to disk. This can be retrieved with the following code:
# Load the Delta table as a DeltaTable object
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, delta_table_path)
# Details of the Delta table
print("Delta table details:")
delta_table.detail().display()
Note
In the code example, we did not have to install Delta as it is already installed when using the Databricks Community Edition. You will need to install the Delta packages if you are using another Apache Spark environment where Delta is not pre-installed. You can find the instructions here: [`docs.delta.io/latest/quick-start.html`](https://docs.delta.io/latest/quick-start.html).
*Figure 5**.2* shows some of the metadata such as location and creation date.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_02.jpg>
Figure 5.2: Metadata for the Delta table
Once data has been persisted, it can be read from storage as needed at a later stage for querying and analysis. The `spark.read` command can be used here as well, as per the following example:
spark.read.load(delta_table_path).display()
The Delta table storage location, `delta_table_path`, is passed to the `load` command, which retrieves the stored table from the disk storage.
As mentioned earlier, Spark can also write to a database, among other destinations. The following example shows how to write to a PostgreSQL database.
jdbcDF.write \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.save()
This is further detailed in the following documentation: [`spark.apache.org/docs/latest/sql-data-sources-jdbc.html`](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html)
Note
You will need to include the JDBC driver for the particular database on the Spark classpath. The previously referenced documentation explains this.
As seen in this section, persistence allows longer-term storage and retrieval. Delta also stores different versions of the data whenever it changes, which we will investigate next.
Versioning
Data versioning is one of the key features provided by Delta Lake, allowing you to keep track of changes made to your data over time. This storage of different versions is done in an optimal way to minimize storage footprint.
With a record of different versions, Delta enables a functionality called **time travel**. With this, you can query data at specific versions or timestamps, revert to previous versions, and perform time travel queries. This is also useful from a reproducibility point of view, whereby we can go back to the specific version of data used previously, even if it has since changed, to audit, review, and redo an analysis.
The code provided in this chapter has an example of using versioning and time travel. The following extract shows how to read a specific version of the Delta table. `version_as_of` is an integer representing the version number:
df_ = spark.read.format("delta").option("versionAsOf", version_as_of).load(delta_table_path)
The table can also be read as of a given point in time, where `timestamp_as_of` represents the timestamp of the version of interest:
df_ = spark.read.format("delta").option("timestampAsOf", timestamp_as_of).load(delta_table_path)
The change history of the Delta table can be retrieved with the `history` command, as follows:
print(f"Delta table history - after modification:")
delta_table.history().display()
An example of output from the history is shown in *Figure 5**.3*.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_03.jpg>
Figure 5.3: Delta table versions
Finally, it is possible to restore the Delta table back to a previous version with the `restoreToVersion` command, overwriting the latest version, as per the following:
delta_table.restoreToVersion(latest_version)
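The `latest_version` value used above is not defined in this extract; as a sketch, it could be looked up from the table history, for example:

```python
from pyspark.sql import functions as F

# Highest version number recorded in the Delta table history
latest_version = (
    delta_table.history()
    .agg(F.max("version"))
    .collect()[0][0]
)
print(f"Latest version: {latest_version}")
```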
You can also find more information on time travel here:
[`delta.io/blog/2023-02-01-delta-lake-time-travel/`](https://delta.io/blog/2023-02-01-delta-lake-time-travel/)
This concludes the section on ingestion and persistence. We will now move on to verify and clean the data.
Data quality checks, cleaning, and transformation
Once the data has been ingested from source systems to a storage location from which we can access it, we will need to ensure that it is of usable quality and, if not, do the necessary cleaning and transformation.
Data quality checks
The outcome of any analysis done with the data can be only as good as the data, making data quality checks an important next step.
Consistency, accuracy, and completeness
Data quality checks for consistency, accuracy, and completeness are essential to ensure the reliability of your data. With its powerful tools for data processing and analysis, Apache Spark is suitable for implementing these checks. The following are examples of how you can perform data quality checks for consistency, accuracy, and completeness using Apache Spark in Python.
Consistency check
In the following consistency test example, we are counting the number of records for each date:
# Example consistency check: check whether the values in a column are consistent
consistency_check_result = df.groupBy("Date").count().orderBy("count")
print(f"Data consistency results:")
consistency_check_result.display()
As per *Figure 5**.4*, this simple check shows that some dates do not consistently have the same number of records, which can indicate missing values for some dates.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_04.jpg>
Figure 5.4: Consistency check results
Accuracy check
In the accuracy test example, we want to verify the accuracy of `Global_active_power`, as follows:
# Example accuracy check:
# check whether the values in a column meet a specific condition
accuracy_check_expression = "Global_active_power < 0 OR Global_active_power > 10"
# Check
accuracy_check_result = df.filter(accuracy_check_expression)
accuracy_check_result_count = accuracy_check_result.count()
if accuracy_check_result_count == 0:
    print(f"Data passed the accuracy check - !({accuracy_check_expression}).")
else:
    print(f"Data failed the accuracy check - {accuracy_check_expression} - count {accuracy_check_result_count}")
    accuracy_check_result.display()
As per *Figure 5**.5*, this check shows that in two cases, `Global_active_power` is outside of the accuracy criteria that we have defined for this check. This indicates that either these values are wrong or that they are correct but are now going beyond the previously known ranges that we have used to define the criteria. We must update the criteria in this latter case.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_05.jpg>
Figure 5.5: Accuracy check results
Completeness check
In the completeness test example, we want to verify whether `Global_active_power` has null values:
# Example completeness check: check whether a column contains null values
completeness_check_expression = "Global_active_power is NULL"
# Check
completeness_check_result = df.filter(
    completeness_check_expression)
completeness_check_result_count = completeness_check_result.count()
if completeness_check_result_count == 0:
    print(f"Data passed the completeness check - !({completeness_check_expression})")
else:
    print(f"Data failed the completeness check - {completeness_check_expression} - count {completeness_check_result_count}")
    completeness_check_result.display()
Note
The consistency check example presented earlier can also be used for completeness.
These examples show basic data quality checks for consistency, accuracy, and completeness using Apache Spark. These checks can be extended and integrated into your data pipelines for more comprehensive data quality assurance.
Data quality framework
To better manage the suite of tests, it is recommended that a framework such as *Great Expectations* be used for data quality checks. You can find more information here: [`github.com/great-expectations/great_expectations`](https://github.com/great-expectations/great_expectations)
We will cover another framework approach with the integration of data quality in the Delta Live Tables pipeline, and monitoring and alerting in *Chapter 10*.
Once the data quality has been tested, the next step is to clean the data.
Data cleaning
The previous step of data quality checks indicates the issues with the data that need to be corrected, which we will now address.
Missing values
Apache Spark offers various methods to handle missing values in time series data. The following examples show how you can clean time series data for missing values using Apache Spark in Python.
Forward filling
The forward filling method to handle missing values replaces the missing values with the previous known value, with the values sorted in chronological order based on their timestamp. In the following code example, missing values for `Global_active_power` are replaced in this way. The `Window.rowsBetween` function in the following case goes from the first record to the current one. The `last` function then finds the last non-null value within that window. As the window slides over all the records, all the missing values are replaced with the last known value:
from pyspark.sql import functions as F
from pyspark.sql import Window
# Example: handling missing values by forward filling
# The "timestamp" column is in chronological order
df = spark.sql(
    f"SELECT timestamp, Global_active_power FROM {table_name} ORDER BY timestamp"
)
window = Window.rowsBetween(float('-inf'), 0)
filled_df = df.withColumn(
    "filled_Global_active_power",
    F.last(df['Global_active_power'],
    ignorenulls=True).over(window))
# Display the updated values
filled_df.filter(
    "timestamp BETWEEN '2008-11-10 17:58:00' AND '2008-11-10 18:17:00'"
).display()
The result of forward filling can be seen in *Figure 5**.6*, where the filled values are shown in the `filled_Global_active_power` column.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_06.jpg>
Figure 5.6: Forward filling
Forward filling works well when the last known value is a good indication of the next value, such as for a slow-changing value. It is not a good method when the value can change suddenly or when there is seasonality.
Backward filling
The backward filling method to handle missing values replaces the missing values with the next known value, with the values sorted in chronological order based on their timestamp. In the following code example, missing values for `Global_active_power` are replaced in this way. The `Window.rowsBetween` function in the following case goes from the current one to the last record. The `first` function then finds the next non-null value within that window. As the window slides over all the records, all the missing values are replaced with the next known value:
from pyspark.sql import functions as F
from pyspark.sql import Window
# Example: handling missing values by backward filling
# The "timestamp" column is in chronological order
df = spark.sql(
    f"SELECT timestamp, Global_active_power FROM {table_name} ORDER BY timestamp"
)
window = Window.rowsBetween(0, float('inf'))
filled_df = df.withColumn(
    "filled_Global_active_power",
    F.first(df['Global_active_power'],
    ignorenulls=True).over(window))
# Display the updated values
filled_df.filter(
    "timestamp BETWEEN '2008-11-10 17:58:00' AND '2008-11-10 18:17:00'"
).display()
The result of backward filling can be seen in *Figure 5**.7*, where the filled values are shown in the `filled_Global_active_power` column.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_07.jpg>
Figure 5.7: Backward filling
Backward filling works well when the next known value can reasonably indicate the previous value, such as with slow-changing data or when collecting data retrospectively with gaps in the past. However, it is not suitable for analyzing causality or leading indicators.
Interpolation
The interpolation method to handle missing values replaces the missing values with a combination, such as the average, of the previous and next non-missing values, with the values sorted in chronological order based on their timestamp.
Note
There are several different interpolation calculation methods, including linear, polynomial, and spline interpolation. The average method used here is a simple form of linear interpolation.
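As an aside, when the data fits on a single node, pandas implements several of these interpolation methods directly; a small self-contained sketch (the toy series below is illustrative, not from the chapter's dataset):

```python
import numpy as np
import pandas as pd

# Toy series with a two-point gap
s = pd.Series(
    [1.0, np.nan, np.nan, 4.0],
    index=pd.date_range("2008-11-10 18:00", periods=4, freq="min"),
)

print(s.interpolate(method="linear"))               # fills 2.0 and 3.0
print(s.interpolate(method="polynomial", order=2))  # requires SciPy
```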
In the following code example, missing values for `Global_active_power` are replaced in this way. The `Window.rowsBetween` function, used twice, in the following case, goes from the first record to the current one for `windowF`, and from the current one to the last record for `windowB`. The `last` function then finds the previous non-null value within `windowF`, while the `first` function finds the next non-null value within `windowB`. These two non-null values are averaged. As the window slides over all the records, all the missing values are replaced by the averaged value:
from pyspark.sql import functions as F
from pyspark.sql import Window
# Example: handling missing values by interpolation
# The "timestamp" column is in chronological order
df = spark.sql(
    f"SELECT timestamp, Global_active_power FROM {table_name} ORDER BY timestamp"
)
windowF = Window.rowsBetween(float('-inf'), 0)
windowB = Window.rowsBetween(0, float('inf'))
filled_df = df.withColumn(
    "filled_Global_active_power", (F.last(
        df['Global_active_power'], ignorenulls=True
    ).over(windowF) + F.first(
        df['Global_active_power'], ignorenulls=True
    ).over(windowB))/2)
# Display the updated values
filled_df.filter(
    "timestamp BETWEEN '2008-11-10 17:58:00' AND '2008-11-10 18:17:00'"
).display()
The result of interpolation can be seen in *Figure 5**.8*, where the filled values are shown in the `filled_Global_active_power` column.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_08.jpg>
Figure 5.8: Interpolation
Interpolation works well for slow-changing values, when there is a predictable cyclical pattern, or when there is a small gap in data. It is not a good method when the value can change suddenly, is discrete, or when there is a large gap in data.
Of the three methods shown for handling missing values, the appropriate method to use is based on the characteristics of your time series data and the requirements of your analysis.
Data leakage
Note that the backward filling and interpolation methods can leak future data across the boundaries of training, validation, and test data splits. Use these methods within the splits, and not across, or use forward filling if this is going to be an issue.
Duplicates
The presence of duplicate values in time series data can skew analysis and lead to incorrect results. Apache Spark has functions to efficiently remove duplicate values. In the following example, we clean time series data for duplicate values using Apache Spark in Python.
The `dropDuplicates` function removes duplicates by comparing all columns by default and only considers a row to be a duplicate if all the columns match those of one or more other rows. This will not work if we have multiple rows with, say, the same `timestamp` column value but different values in one or more other columns. In this case, we can pass a subset of one or more columns as a parameter to be used to identify the duplicates, as opposed to using all the columns.
In the most common cases, we want to have one and only one row of values for each timestamp and consider the other rows with the same timestamp to be duplicates. Passing the timestamp as the subset parameter to `dropDuplicates` will remove all the other rows having the same timestamp value, as we will see in the following code example:
# Example: removing duplicate rows based on all columns
print(f"With duplicate rows - count: {df.count()}")
cleaned_df = df.dropDuplicates()
print(f"Without duplicate rows - count: {cleaned_df.count()}")
# Example: removing duplicate rows based on selected columns
# Assuming "timestamp" is the column used to identify duplicates
cleaned_df = df.dropDuplicates(["timestamp"])
print(f"Without duplicate timestamps - count: {cleaned_df.count()}")
Depending on your dataset and use case, you can choose the appropriate method based on the columns that uniquely identify duplicates in your time series data.
Outliers
The detection and handling of outliers in time series data is crucial to ensure the accuracy of analysis and modeling. Apache Spark provides various functions to detect and handle outliers efficiently. The following example shows how to clean time series data for outliers using Apache Spark in Python.
The z-score method used is based on how far the data point is from the `mean` relative to the standard deviation, `stddev`. The parametrizable threshold value, `z_score_threshold`, then specifies beyond which z-score value the data point is considered an outlier. A high threshold will allow more data points in, while a low threshold will flag more outliers:
from pyspark.sql import functions as F
# Example: detecting outliers using the z-score
# Calculate the z-score for each value in the column
mean_value = df.select(F.mean(
    "Global_active_power")).collect()[0][0]
stddev_value = df.select(F.stddev(
    "Global_active_power")).collect()[0][0]
z_score_threshold = 5  # Adjust the threshold as needed
df_with_z_score = df.withColumn("z_score", (F.col(
    "Global_active_power") - mean_value) / stddev_value)
# Filter out rows with a z-score beyond the threshold
outliers = df_with_z_score.filter(~F.col("z_score").between(
    -z_score_threshold, z_score_threshold))
cleaned_df = df_with_z_score.filter(F.col("z_score").between(
    -z_score_threshold, z_score_threshold))
# Flag the outliers
df_with_outlier = df_with_z_score.withColumn(
    "_outlier",
    F.when(
        (F.col("z_score") < -z_score_threshold) |
        (F.col("z_score") > z_score_threshold), 1
    ).otherwise(0))
print(f"With outliers - count: {df.count()}")
print(f"Global_active_power - mean: {mean_value}, stddev: {stddev_value}, z-score threshold: {z_score_threshold}")
print(f"Without outliers - count: {cleaned_df.count()}")
print(f"Outliers - count: {outliers.count()}")
print("Outliers:")
outliers.display()
*Figure 5**.9* shows the outcome of the outlier detection based on the z-score chosen.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_09.jpg>
Figure 5.9: Outlier detection
Beyond this example, the choice of z-score threshold and outlier detection techniques is based on the data characteristics and requirements of the use case.
Note
Outliers can be indicative of one or more anomalies in the source system that generated the measurement, or data processing or transmission issues post source system. The identification of outliers flags the requirement to further investigate the source system and the data transmission chain to find the root cause.
After cleaning the data based on the issues identified with the data quality checks, other transformations, which we will look at next, are required to get the data into the right shape for the analytics algorithm to work.
Transformations
In this section, we will look at examples of normalizing and standardizing, and touch briefly on stationary transformation.
Normalizing
Normalizing time series data ensures that features are on a similar scale, which can improve the performance of machine learning algorithms while facilitating analysis. Apache Spark provides various functions for normalization. The following example shows how to normalize time series data using Apache Spark in Python.
The min-max normalization technique is used to scale the data points relative to the min-max range. The `min` and `max` values are calculated first. This brings the value to the range of `0` for the minimum value and `1` for the maximum value:
from pyspark.sql import functions as F
# Define the columns to normalize (for example, the "value" column)
columns_to_normalize = ["Global_active_power"]
# Calculate the min and max values of each column for normalization
min_max_values = df.select(
    [F.min(F.col(column)).alias(f"min_{column}")
    for column in columns_to_normalize] +
    [F.max(F.col(column)).alias(f"max_{column}")
    for column in columns_to_normalize]
).collect()[0]
# Normalize the data using min-max normalization
for column in columns_to_normalize:
    min_value = min_max_values[f"min_{column}"]
    max_value = min_max_values[f"max_{column}"]
    df = df.withColumn(
        f"normalized_{column}",
        (F.col(column) - min_value) / (max_value - min_value))
print(f"Normalized - {columns_to_normalize}")
df.display()
*Figure 5**.10* shows the outcome of normalizing the example time series data.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_10.jpg>
Figure 5.10: Normalizing
Depending on the specific requirements and data characteristics, the normalization method can be adjusted with the use of other techniques such as z-score normalization and decimal scaling, in addition to the min-max technique used in the example.
Standardizing
Standardizing time series data ensures that features are on a similar scale, which can improve the performance of machine learning algorithms while facilitating analysis. This method transforms the data such that it has a mean of `0` and a standard deviation of `1`. Apache Spark provides various functions for standardization. The following example shows how to standardize time series data using Apache Spark in Python.
This example uses `log` values to account for the skewness of the data. First, `mean` and `stddev` are calculated. These values are then used in the formula to standardize:
from pyspark.sql import functions as F
# Define the columns to standardize (for example, the "value" column)
columns_to_standardize = ["Global_active_power"]
# Calculate the mean and standard deviation of each column for
# standardization
mean_stddev_values = df.select(
    [F.mean(F.log(F.col(column))).alias(f"mean_{column}")
    for column in columns_to_standardize] +
    [F.stddev(F.log(F.col(column))).alias(f"stddev_{column}")
    for column in columns_to_standardize]
).collect()[0]
# Standardize the data using z-score standardization
for column in columns_to_standardize:
    mean_value = mean_stddev_values[f"mean_{column}"]
    stddev_value = mean_stddev_values[f"stddev_{column}"]
    df = df.withColumn(
        f"standardized_{column}",
        (F.log(F.col(column)) - mean_value) / stddev_value
    )
print(f"Standardized - {columns_to_standardize}")
df.display()
*Figure 5**.11* shows the outcome of standardizing the example time series data.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_11.jpg>
Figure 5.11: Standardizing
The standardization method can be adjusted depending on the specific requirements and data characteristics.
Stationary
In *Chapter 1*, we discussed the requirement of stationary time series for some analysis methods. Making time series data stationary involves removing trends and seasonality, which we will cover in the following chapter.
This concludes the section on testing for data quality and then cleaning and transforming time series data. We will cover the scalability considerations in data preparation when we discuss feature engineering in *Chapter 8*.
Summary
In conclusion, this chapter focused on the critical steps of organizing, cleaning, and transforming time series data for effective analysis. We have covered data preparation techniques using Apache Spark for ingestion, persistence, data quality checks, cleaning, and transformations. We looked at code examples for, among others, handling missing values and duplicates, addressing outliers, and normalizing data. This has set the stage for an accurate and efficient analytical process using Apache Spark. Proper data preparation significantly enhances the reliability of subsequent analytical processes, which is what we will progress toward in the next chapter.
Join our community on Discord
Join our community’s Discord space for discussions with the authors and other readers:
[`packt.link/ds`](https://packt.link/ds)
Chapter 6: Exploratory Data Analysis
After loading and preparing the data, which was covered in the previous chapter, we will now perform exploratory data analysis to uncover patterns and insights in the time series data. We will use statistical analysis techniques, including those specific to temporal patterns. The outcomes of these steps are critical for identifying trends and seasonality, and they inform the subsequent modeling decisions. Robust exploratory data analysis with Apache Spark ensures a thorough grasp of the dataset's characteristics, improving the accuracy and relevance of the subsequent time series models and analysis.
In this chapter, we will cover the following main topics:
* Statistical analysis
* Resampling, decomposition, and stationarity
* Correlation analysis
Technical requirements
This chapter is mostly hands-on, covering the data exploration techniques commonly used in time series analysis projects. The code for this chapter can be found in the `ch6` folder of the book's GitHub repository, at the following URL: github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch6.
Note
We will use Spark DataFrames in the code examples and convert them to DataFrames for pandas-enabled libraries. This shows how the two can be used interchangeably. We will mention when pandas is being used.
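A minimal sketch of the two conversions this refers to:

```python
# Spark DataFrame -> pandas DataFrame (collects the data to the driver, so keep it small)
pdf = df.toPandas()

# pandas DataFrame -> Spark DataFrame
df_from_pandas = spark.createDataFrame(pdf)
```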
Statistical analysis
This section starts with the statistical analysis of time series data, covering data profiling to gather these statistics, distribution analysis, and visualization.
The examples in this chapter are based on the code in `ts-spark_ch6_1.dbc`, which we can import from GitHub into Databricks Community Edition, following the method explained in *Chapter 1*, as mentioned in the *Technical requirements* section.
The URL for the code is as follows: github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch6/ts-spark_ch6_1.dbc
We will start the hands-on examples with the household energy consumption dataset, which was also used in *Chapter 2* and *Chapter 5*. After loading the dataset with `spark.read`, as in the following code extract, we cache the DataFrame in memory with `df.cache()` to speed up subsequent processing. Owing to lazy evaluation, the caching takes place at the next action rather than immediately; to force it, we add a `df.count()` action. We then create a `timestamp` column combining the `Date` and `Time` columns. As the numeric columns were loaded as strings, we must convert them to the numeric `double` data type to be able to do calculations on them. Note that, for readability, we have split the operations on the `df` DataFrame over multiple lines of code; these operations could also have been chained in a single line:
…
# Code in cell 5
df = spark.read.csv(
"file:///" + SparkFiles.get(DATASET_FILE),
header=True, sep=";", inferSchema=True)
df.cache()
df.count()
…
# Code in cell 7
df = df.withColumn('Time', F.date_format('Time', 'HH:mm:ss'))
# Create timestamp column
df = df.withColumn('timestamp', F.concat(df.Date, F.lit(" "), df.Time))
df = df.withColumn(
'timestamp',
F.to_timestamp(df.timestamp, 'yyyy-MM-dd HH:mm:ss'))
# Fix data types
df = df \
.withColumn('Global_active_power',
df.Global_active_power.cast('double')) \
…
print("Schema:")
df.printSchema()
The data types were initially inferred from the file by the spark.read option inferSchema. The data types before conversion, displayed with printSchema(), are shown in *Figure 6**.1*.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_01.jpg>
Figure 6.1: Inferred schema with data types
The updated schema is as per *Figure 6**.2*, showing the converted data types.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_02.jpg>
Figure 6.2: Updated schema with converted data types
We are now ready to profile the data.
Data profiling
Data profiling involves analyzing the dataset’s structure, quality, and statistical properties. This helps to identify anomalies, missing values, and outliers, ensuring data integrity. This process can also be comprehensive, including the analysis of trends, seasonal patterns, and correlations, guiding more accurate forecasting and modeling.
Note
Data profiling can also guide preprocessing steps such as normalization and transformation, covered in *Chapter 5*.
Apache Spark provides the convenient `summary()` function, as per the following code, for summary statistics:
# Summary statistics
# Code in cell 10
df.summary().display()
This generates the following outcome:
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_03.jpg>
Figure 6.3: Summary statistics
While these summary statistics are useful, they are usually not sufficient. A data profiling tool such as YData Profiling, which we will look at next, provides more extensive analysis and reporting.
The following code extract shows how to launch a Profile Report with YData. Notable here is the use of a Pandas DataFrame, `pdf`, and of the time series mode (`tsmode` parameter), with the `sortby` parameter to sort by timestamp. We also want correlations to be included in the report. After the report is generated, it is converted to HTML for display with the `to_html()` function.
# Code in cell 12
…
profile = ProfileReport(
    pdf,
    title='Time Series Data Profiling',
    tsmode=True,
    sortby='timestamp',
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={
        "auto": {"calculate": False},
        "pearson": {"calculate": True},
        "spearman": {"calculate": True}})
# Save the profiling report as an HTML file
profile.to_file("time_series_data_profiling_report.html")
# Display the profiling report in the notebook
report_html = profile.to_html()
displayHTML(report_html)
The generated report contains an **Overview** section, as per *Figure 6**.4*, with an indication, among other things, of the number of variables (columns), observations (rows), and missing values and duplicate counts.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_04.jpg>
Figure 6.4: Data profile report – Overview
Scrolling down from **Overview**, we can see column-specific statistics, as shown in *Figure 6**.5*, such as the minimum, maximum, mean, number of zeros, and number of distinct values.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_05.jpg>
Figure 6.5: Data profile report – Details
This section has further sub-sections, such as **Histogram**, showing the distribution of values, and **Gap analysis**, as per *Figure 6**.6*, with indications of data gaps for the column.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_06.jpg>
Figure 6.6: Data profile report – Gap analysis
With the time series mode specified earlier, we also get a basic **Time Series** part of the report, shown in *Figure 6**.7*
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_07.jpg>
Figure 6.7: Data profile report – Time Series
Other sections of the report cover **Alerts**, shown in *Figure 6**.8*, with outcomes of tests run on the dataset, including time-series-specific ones, and a **Reproduction** section with details on the profiling run.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_08.jpg>
Figure 6.8: Data profile report – Time Series
This section provided an example of how to perform data profiling on time series data using YData Profiling and Apache Spark. Further information on YData Profiling can be found here: [`github.com/ydataai/ydata-profiling`](https://github.com/ydataai/ydata-profiling).
We will now drill down further in our understanding of the data, by analyzing the gaps in the dataset.
Gap analysis
In the previous section, we mentioned gap analysis for gaps in value for a specific column. Another consideration for time series data is gaps in the timeline itself, as in the following example with the household energy consumption dataset, where we are expecting values every minute.
In this case, we first calculate the time difference between consecutive timestamps using `diff()`, as in the following code, with a pandas DataFrame, `pdf`. If this is greater than `1 minute`, we can flag the timestamp as having a prior gap:
测试间隙
第 15 号单元格的代码
测试间隙
pdf[‘gap_val’] = pdf[‘timestamp’].sort_values().diff()
pdf[‘gap’] = pdf[‘gap_val’] > ps.to_timedelta(‘1 minute’)
pdf[pdf.gap]
As *Figure 6**.9* shows, we found 3 gaps of 2 minutes each in this example.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_09.jpg>
Figure 6.9: Gap analysis
Depending on the size of the gap and the nature of the dataset, we can adopt one of the following approaches:
* Ignore the gap
* Aggregate, for example, use the mean value at a higher interval
* Use one of the missing-value handling techniques we saw in *Chapter 5*, such as forward filling
Regular or irregular time series
The gap analysis presented here assumes a regular time series. The approach is slightly different in detecting gaps in the timeline of irregular time series. The previous example of checking for the absence of values at every minute interval is not applicable for an irregular time series. We will have to look at the distribution of the count of values over the timeline of the irregular time series and make reasonable assumptions about how regularly we expect values in the irregular time series. For instance, if we are considering the energy consumption of a household, the time series may be irregular at minute intervals, but based on historical data, we expect energy use every hour or daily. In this case, not having a data point on a given hour or day can be indicative of a gap. Once we have identified a gap, we can use the same approaches as discussed for regular time series, that is, forward filling or similar imputation, aggregation at higher intervals, or just ignoring the gap.
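One way to express this kind of check in Spark, as a sketch (the one-hour expectation is an assumption chosen for illustration):

```python
from pyspark.sql import functions as F

# Hours that actually contain at least one observation
observed_hours = df.select(
    F.date_trunc("hour", "timestamp").alias("hour")
).distinct()

# The full range of hours we expect, built from the min and max timestamps
bounds = df.agg(
    F.date_trunc("hour", F.min("timestamp")).alias("start"),
    F.date_trunc("hour", F.max("timestamp")).alias("end"),
)
expected_hours = bounds.select(
    F.explode(F.sequence("start", "end", F.expr("INTERVAL 1 HOUR"))).alias("hour")
)

# Expected hours with no observations at all are gaps in the timeline
gaps = expected_hours.join(observed_hours, on="hour", how="left_anti")
gaps.display()
```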
We discussed here the specific problem of gaps in the time series. We mentioned that, to identify gaps, we can look at the distribution of the data, which will be covered next.
Distribution analysis
Distribution analysis of time series provides an understanding of the underlying patterns and characteristics of the data, such as skewness, kurtosis, and outliers. This helps detect deviations from normal distribution, trends, and seasonal patterns, and visualize the variability of the time series. This understanding then feeds into choosing the appropriate statistical models and forecasting methods. This is required as models are built on assumptions of the distribution of the time series. Done correctly, distribution analysis ensures that model assumptions are met. This also improves the accuracy and reliability of the predictions.
In this section, we will examine a few examples of distribution analysis, starting with the profiling output of *Figure 6**.5*, which shows a kurtosis of 2.98 and a skewness of 1.46. Let’s explain what this means by first defining these terms.
**Kurtosis** indicates how peaked or flat a distribution is compared to a normal distribution. A value greater than 2, as in our example in *Figure 6**.5*, indicates the distribution is too peaked. Less than -2 means too flat.
**Skewness** indicates how centered and symmetric the distribution is compared to a normal distribution. A value between -1 and 1 is considered near normal, between -2 and 2, as in the example in *Figure 6**.5*, is acceptable, and below -2 or above 2 is not normal.
When both kurtosis and skewness are zero, we have a perfectly normal distribution, which is quite unlikely to be seen with real data.
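Both statistics can also be computed directly in Spark; a quick sketch:

```python
from pyspark.sql import functions as F

# Spark's kurtosis is the excess kurtosis (0 for a normal distribution);
# the exact estimator may differ slightly from the profiling report's figures
df.select(
    F.skewness("Global_active_power").alias("skewness"),
    F.kurtosis("Global_active_power").alias("kurtosis"),
).display()
```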
Let’s now do some further distribution analysis with the following code extract. We want to understand the frequency distribution of `Global_active_power`, the distribution by day of the week, `dayOfWeek`, and the hour of the day. We will use the Seaborn (`sns`) visualization library for the plots, with the pandas DataFrame, `pdf`, passed as a parameter:
# Distribution analysis
# Code in cell 17
…
# Extract the day of week and hour
df = df.withColumn("dayOfWeek", F.dayofweek(F.col("timestamp")))
df = df.withColumn("hour", F.hour(F.col("timestamp")))
…
# Distribution analysis with Seaborn and Matplotlib
…
sns.histplot(pdf['Global_active_power'], kde=True, bins=30)
plt.title(
    'Distribution of Global_active_power in the time series data'
)
…
# Visualize the distribution by day of the week with a box plot
…
sns.boxplot(x='dayOfWeek', y='Global_active_power', data=pdf)
plt.title(
    'Distribution of Global_active_power by day of the week'
)
…
# Visualize the distribution by hour with a box plot
…
sns.boxplot(x='hour', y='Global_active_power', data=pdf)
plt.title(
    'Distribution of Global_active_power by hour of the day'
)
…
We can see the frequency of occurrence of the different values of `Global_active_power` in *Figure 6**.10*, with the skewness to the left.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_10.jpg>
Figure 6.10: Distribution by frequency
If we look at the distribution by day of the week, as in *Figure 6**.11*, power consumption during the weekends is higher, as can be expected for a household, with 1 on the *x* axis representing Sundays and 7 Saturdays. The distribution is also over a broader range of values these days.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_11.jpg>
Figure 6.11: Distribution by day of the week
The distribution by hour of the day, as in *Figure 6**.12*, shows higher power consumption during the morning and evening, again as can be expected for a household.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_12.jpg>
Figure 6.12: Distribution by hour of the day
You will also notice in the distribution plots the values that are flagged as outliers, lying beyond the whiskers. These are at a 1.5 **inter-quartile range** (**IQR**) above the third quartile. You can use other thresholds for outliers, as in *Chapter 5*, where we used a cutoff on the z-score value.
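A sketch of how the same 1.5 x IQR rule could be applied explicitly in Spark:

```python
from pyspark.sql import functions as F

# Approximate quartiles (the last argument is the allowed relative error)
q1, q3 = df.approxQuantile("Global_active_power", [0.25, 0.75], 0.01)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

iqr_outliers = df.filter(
    (F.col("Global_active_power") < lower) |
    (F.col("Global_active_power") > upper)
)
print(f"IQR outliers - count: {iqr_outliers.count()}")
```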
Visualizations
As we have seen so far in this book and, more specifically, this chapter, visualizations play an important role in time series analysis. By providing us with an intuitive and immediate understanding of the data’s underlying patterns, they help to identify seasonal variations, trends, and anomalies that might not otherwise be seen from raw data alone. Furthermore, visualizations facilitate the detection of correlations, cycles, and structural changes over time, contributing to better forecasting and decision-making.
Fundamentally, (and this is not only true for time series analysis) visualizations aid in communicating complex insights to stakeholders and, in doing so, improve their ability to understand and act accordingly.
Building on the techniques for statistical analysis seen in this chapter, we will now move on to other important techniques to consider while analyzing time series—resampling, decomposition, and stationarity.
Resampling, decomposition, and stationarity
This section details additional techniques used in time series analysis, introduced in *Chapter 1*. We will see code examples of how to implement these techniques.
Resampling and aggregation
Resampling and aggregation are used in time series analysis to transform and analyze data at different time scales. **Resampling** is changing the frequency of the time series, such as converting hourly data to daily data, which can reveal trends and patterns at different time frequencies. **Aggregation**, on the other hand, is the summarizing of data over specified intervals and is used in conjunction with resampling to calculate the resampled value. This can reduce noise, handle missing values, and convert an irregular time series to a regular series.
The following code extract shows the resampling at different intervals, together with the aggregation. The original dataset has data every minute. With `resample('h').mean()` applied to the pandas DataFrame, `pdf`, we resample this value to the mean over the hour:
# Resampling and aggregation
# Code in cell 22
…
# Resample the data to hourly, daily, and weekly frequency and
# aggregate by the mean
hourly_resampled = pdf.resample('h').mean()
hourly_resampled_s = pdf.resample('h').std()
daily_resampled = pdf.resample('d').mean()
daily_resampled_s = pdf.resample('d').std()
weekly_resampled = pdf.resample('w').mean()
weekly_resampled_s = pdf.resample('w').std()
…
*Figure 6**.13* shows the outcome of the hourly resampling.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_13.jpg>
Figure 6.13: Resampled hourly
*Figure 6**.14* shows the outcome of the daily resampling.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_14.jpg>
Figure 6.14: Resampled daily
*Figure 6**.15* shows the outcome of the weekly resampling.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_15.jpg>
Figure 6.15: Resampled weekly
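The preceding extract resamples with pandas; a Spark-native equivalent for the hourly case could look like this sketch:

```python
from pyspark.sql import functions as F

# Hourly resampling with mean and standard deviation, computed in Spark
hourly_resampled_spark = (
    df.groupBy(F.window("timestamp", "1 hour").alias("hour_window"))
    .agg(
        F.mean("Global_active_power").alias("mean_Global_active_power"),
        F.stddev("Global_active_power").alias("stddev_Global_active_power"),
    )
    .orderBy("hour_window")
)
hourly_resampled_spark.display()
```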
With these examples, we have resampled and aggregated time series data using Apache Spark. We will next expand on the time series decomposition of the resampled time series.
Decomposition
As introduced in *Chapter 1*, decomposition breaks down the time series into its fundamental components: trend, seasonality, and residuals. This separation helps uncover underlying patterns within the data more clearly. The trend shows long-term movement, while seasonal components show repeating patterns. Residuals highlight any deviation from the trend and seasonal components. This decomposition allows for each component to be analyzed and addressed individually.
The following code extract shows the decomposition of time series using `seasonal_decompose` from the `statsmodels` library. In *Chapter 1*, we used a different library, `Prophet`.
# Code in cell 30
…
from statsmodels.tsa.seasonal import seasonal_decompose
# Perform seasonal decomposition
hourly_result = seasonal_decompose(
    hourly_resampled['Global_active_power'])
daily_result = seasonal_decompose(
    daily_resampled['Global_active_power'])
…
*Figure 6**.16* shows the components of the hourly resampled time series. The seasonal component shows a pattern, with each repeating pattern corresponding to a day, and the ups in power consumption every morning and evening are visible.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_16.jpg>
Figure 6.16: Decomposition of hourly data
*Figure 6**.17* shows the components of the daily resampled time series. The seasonal component shows a pattern, with each repeating pattern corresponding to a week, and the ups in power consumption every weekend are visible.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_17.jpg>
Figure 6.17: Decomposition of daily data
Now that we have performed time series decomposition using Apache Spark and `statsmodels` for time series at different resampling intervals, let's discuss the next technique.
Stationarity
Another key concept related to time series data, introduced in *Chapter 1*, stationarity concerns the statistical properties of the series, such as mean, variance, and autocorrelation remaining constant over time. This is an assumption on which time series models, such as **AutoRegressive Integrated Moving Average** (**ARIMA**) are built. A series must be identified and converted to stationary before using such models. In general, stationary time series facilitate analysis and improve model accuracy.
The first step in handling non-stationarity is to check the time series, which we will look at next.
Check
The **Augmented Dickey-Fuller** (**ADF**) test and the **Kwiatkowski-Phillips-Schmidt-Shin** (**KPSS**) test are commonly used statistical tests to check for stationarity. Without going into the details of these tests, we can say they calculate a value, which is called the p-value. A value of p < 0.05 for ADF means that the series is stationary. Additionally, we can check for stationarity by visual inspection of the time series plot and **autocorrelation function** (**ACF**) plots, and by comparing summary statistics over different time periods. Mean, variance, and autocorrelation remaining constant across time suggest stationarity. Significant changes indicate non-stationarity.
The following example code checks for stationarity using the ADF test, `adfuller`, from the `statsmodels` library. We will use the hourly resampled data in this example.
# Stationarity
# Code in cell 33
…
from statsmodels.tsa.stattools import adfuller
# Perform the Augmented Dickey-Fuller test
result = adfuller(hourly_resampled)
# If the test statistic < critical value and p-value < 0.05:
# reject the null hypothesis; the time series has no unit root
# the series is stationary
…
In this case, the p-value, as shown in *Figure 6**.18*, is less than 0.05, and we can conclude the time series is stationary from the ADF test.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_18.jpg>
Figure 6.18: ADF test results – Power consumption dataset
Running the ADF test on the dataset for the annual mean temperature of Mauritius, used in *Chapter 1*, gives a p-value greater than 0.05, as shown in *Figure 6**.19*. In this case, we can conclude that the time series is non-stationary.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_19.jpg>
Figure 6.19: ADF test results – Annual mean temperature dataset
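For completeness, the KPSS test mentioned earlier is also available in `statsmodels`; a sketch on the hourly resampled power series (note that the KPSS null hypothesis is reversed: a p-value below 0.05 suggests non-stationarity):

```python
from statsmodels.tsa.stattools import kpss

kpss_stat, p_value, lags, crit = kpss(
    hourly_resampled["Global_active_power"], regression="c", nlags="auto"
)
print(f"KPSS statistic: {kpss_stat}, p-value: {p_value}")
```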
As we now have a non-stationary series, we will next consider converting it to a stationary series using differencing.
Differencing
The following code extract shows the conversion of a non-stationary time series to a stationary one. We’ll use differencing, a common method to remove trends and seasonality, which can make the time series stationary. By using a combination of the `Window` function and `lag` of 1, we can find the difference between an annual mean and the previous year’s value.
# Differencing
# Code in cell 41
…
from pyspark.sql.window import Window
# Calculate the difference (differencing)
window = Window.orderBy("year")
df2_ = df2.withColumn(
    "annual_mean_diff",
    F.col("annual_mean") - F.lag(
        F.col("annual_mean"), 1
    ).over(window))
…
We can see the original time series compared to the differenced time series in *Figure 6**.20*. The removal of the trend is visible.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_20.jpg>
Figure 6.20: Differencing – Annual mean temperature dataset
Running the ADF test after differencing, gives a p-value less than 0.05, as shown in *Figure 6**.21*. We can conclude that the difference in time series is stationary.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_21.jpg>
Figure 6.21: ADF test results – Differenced annual mean temperature dataset
Building on our understanding of techniques for exploratory analysis learned in this section, we will now move on to the last section of this chapter, which is about correlation of time series data.
Correlation analysis
Correlation measures the relationship between two variables. This relationship can be causal, whether one is the result of the other. This section will explore the different types of correlation applicable to time series.
Autocorrelation
The **AutoCorrelation Function** (**ACF**) measures the relationship between a time series and its past values. High autocorrelation indicates that past values have a strong influence on future values. This information can then be used to build predictive models, for instance, in selecting the right parameters for models such as ARIMA, thereby enhancing the robustness of the analysis. Understanding autocorrelation also helps in identifying seasonal effects and cycles.
The **Partial AutoCorrelation Function** (**PACF**) similarly measures the relationship between a variable and its past values, but contrary to the ACF, with the PACF we discount the effect of values of the time series at all shorter lags.
Check
The following code shows how you can check for autocorrelation and partial autocorrelation using Apache Spark and `plot_acf` and `plot_pacf` from the `statsmodels` library.
# Autocorrelation
# Code in cell 45
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Plot the autocorrelation function (ACF)
plt.figure(figsize=(12, 6))
plot_acf(hourly_resampled['Global_active_power'], lags=3*24)
plt.title('Autocorrelation Function (ACF)')
plt.show()
# Plot the partial autocorrelation function (PACF)
plt.figure(figsize=(12, 6))
plot_pacf(hourly_resampled['Global_active_power'], lags=3*24)
plt.title('Partial Autocorrelation Function (PACF)')
plt.show()
…
The resulting ACF and PACF plots are shown in *Figure 6**.22*.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_22.jpg>
Figure 6.22: ACF and PACF plots
The outcomes of ACF and PACF indicate the nature of the time series and guide the selection of the appropriate models and parameters for forecasting. Let’s now make sense of these plots and how we can use their outcome.
Interpretation of ACF
We will consider the peaks and the decay from the ACF plot to interpret the outcome, using the upper graph in *Figure 6**.22* as an example.
Peaks in the autocorrelation plot outside the confidence interval indicate notable autocorrelations. Regular intervals point to seasonality. From the example, we can see autocorrelation at lags 1, 2, and 3 and seasonality at lags 12 and 24, which correspond to a 12- and 24-hour interval.
A slow decay in the autocorrelation plot suggests that the series is non-stationary with a trend. In this case, we can convert the series to stationary by differencing it, as discussed in the previous section on *Differencing*. This, however, is not the case in our example in *Figure 6**.22*, as there is no slow decay.
The outcome of the ACF can be used to define the MA parameter `q` of an ARIMA model. The major peaks at lags 1, 2, and 3 in our example mean that q=1, q=2, and q=3 are candidate values.
Interpretation of PACF
We will consider the peaks and the cut-off from the PACF plot to interpret the outcome, using the lower graph in *Figure 6**.22* as an example.
Peaks in the partial autocorrelation plot outside the confidence interval indicate notable partial autocorrelations. In the example, this is seen at lags 1, 12, and 24.
An immediate cut-off after some lags indicates an **autoregressive** (**AR**) component. In the example, this is after lag 1.
The outcome of the PACF can be used to define the AR parameter `p` of an ARIMA model. The major peak at lag 1 in our example means p=1.
Model parameters
Based on the interpretation of the ACF and PACF plots in *Figure 6**.22*, we can consider the following candidate ARIMA(p, d, q) models, where p is the PACF cut-off point, d is the order of differencing, and q is the ACF autocorrelation lag (a fitting sketch follows the list):
* ARIMA(1, 0, 1)
* ARIMA(1, 0, 2)
* ARIMA(1, 0, 3)
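A minimal sketch of fitting one of these candidate models with `statsmodels`, purely as an illustration:

```python
from statsmodels.tsa.arima.model import ARIMA

# Fit ARIMA(1, 0, 1) on the hourly resampled series
model = ARIMA(hourly_resampled["Global_active_power"], order=(1, 0, 1))
fitted = model.fit()
print(fitted.summary())
```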
We will discuss model selection and parameters in detail in the next chapter. The depth of our discussion here is just enough to conclude the discussion on ACF and PACF. Let’s move on to other lag analysis methods.
Lag analysis
In addition to ACF and PACF plots seen previously, we will explore another lag analysis method in this section.
We’ll start by calculating the different lag values of interest, as per the following code extract, using the `Window` and `lag` functions we have seen previously.
# Lag analysis
# Code in cell 49
…
window = Window.orderBy("timestamp")
# Create lagged features
hourly_df = hourly_df.withColumn(
    "lag1", F.lag(F.col("Global_active_power"), 1).over(window))
hourly_df = hourly_df.withColumn(
    "lag2", F.lag(F.col("Global_active_power"), 2).over(window))
hourly_df = hourly_df.withColumn(
    "lag12", F.lag(F.col("Global_active_power"), 12).over(window))
hourly_df = hourly_df.withColumn(
    "lag24", F.lag(F.col("Global_active_power"), 24).over(window))
…
This creates the lag columns, as shown in *Figure 6**.23*.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_23.jpg>
Figure 6.23: Lag values
We then calculate the correlation of the current values with their lag values, as in the following code, using the `stat.corr()` function.
# Code in cell 50
…
# Calculate autocorrelation at lag 1
df_lag1 = hourly_df.dropna(subset=["lag1"])
autocorr_lag1 = df_lag1.stat.corr("Global_active_power", "lag1")
…
# Calculate autocorrelation at lag 24
df_lag24 = hourly_df.dropna(subset=["lag24"])
autocorr_lag24 = df_lag24.stat.corr("Global_active_power", "lag24")
…
*Figure 6**.24* shows the autocorrelation values, significant at lag 1, 2, and 24, as we saw on the ACF plot previously.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_24.jpg>
Figure 6.24: Autocorrelation at different lag values
Finally, by plotting the current and lag values together, we can see in *Figure 6**.25* how they compare to each other. We can visually confirm here the greater correlation at lag 1, 2, and 24.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_25.jpg>
Figure 6.25: Comparison of current and lag values
This concludes the section on autocorrelation, where we looked at ACF and PACF, and how to calculate lagged features and their correlation using Apache Spark. While the lag analysis methods in this section have been used for autocorrelation, they can also be used for cross-correlation, which we will cover next, as another type of correlation, this time between different time series.
Cross-correlation
Cross-correlation measures the relationship between two different time series. One series may influence or predict the other over different time lags, in what is called a **lead-lag relationship**. Cross-correlation is used for multivariate time series modeling and causality analysis.
Going back to the profiling report we saw earlier, we can see a graph of the correlation of the different columns of the example dataset included in the report, as in *Figure 6**.26*.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_26.jpg>
Figure 6.26: Cross-correlation heatmap
We can calculate the cross-correlation directly with the following code.
# Cross-correlation
# Code in cell 53
…
# Calculate the cross-correlation between value1 and value2
cross_corr = hourly_df.stat.corr("Global_active_power", "Voltage")
…
The cross-correlation calculation yields the value shown in *Figure 6**.27*. As this correlation is at the same lag, it does not have predictive value, in the sense that we are not using the past to predict the future. However, this pair of attributes is still worth further analysis at different lags, due to the significant cross-correlation.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_27.jpg>
Figure 6.27: Cross-correlation value
Note
We know that P=IV, where P is electrical power, I is current, and V is voltage, indicates how power and voltage are related. Hence, these two time series are not independent of each other. Even if there is no further insight into the P and V relationship, we will continue this analysis as an example of cross-correlation analysis.
As cross-correlation at the same lag does not help much for prediction, we will now look at using different lag values with the following code. This uses the cross-correlation `ccf()` function, which calculates the cross-correlation at different lag values.
# Code in cell 54
…
from statsmodels.tsa.stattools import ccf
hourly_ = hourly_resampled.iloc[:36]
# Calculate the cross-correlation function
ccf_values = ccf(hourly_['Global_active_power'], hourly_['Voltage'])
# Plot the cross-correlation function
plt.figure(figsize=(12, 6))
plt.stem(range(len(ccf_values)),
    ccf_values, use_line_collection=True, markerfmt="-")
plt.title('Cross-Correlation Function (CCF)')
…
This generates the plot in *Figure 6**.28*, which shows the correlation of the two attributes at different lags.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_28.jpg>
Figure 6.28: Cross-correlation function
To conclude, this section showed how to perform cross-correlation analysis by creating lagged features, and calculating and visualizing cross-correlation.
Summary
In this chapter, we used exploratory data analysis to uncover patterns and insights in time series data. Starting with statistical analysis techniques, where we profiled the data and analyzed its distribution, we then resampled and decomposed the series into its components. To understand the nature of the time series, we also checked for stationarity, autocorrelation, and cross-correlation. By this point, we have gathered enough information on time series to guide us into the next step of building predictive models for time series.
In the next chapter, we will dive into the core topic of this book, which is developing and testing models for time series analysis.
Join our community on Discord
Join our community’s Discord space for discussions with the authors and other readers:
[`packt.link/ds`](https://packt.link/ds)
第七章:构建与测试模型
在覆盖了时间序列分析的数据准备和探索性数据分析阶段后,我们现在将重点转向为时间序列数据构建预测模型。我们将涵盖多种类型的模型以及如何决定选择哪个模型。我们还将学习如何训练、调整和评估模型。
本章所涉及的概念将作为模型开发的实用指南,为有效的时间序列模型提供基本构建块,并促进准确的预测和深入的分析。我们将考虑在实际项目中常见的执行约束,并最终对不同模型解决预测问题的结果进行比较。
我们将涵盖以下主要主题:
-
模型选择
-
开发与测试
-
模型比较
技术要求
本章的代码将在开发与测试部分中讲解,可以在本书的 GitHub 仓库的ch7
文件夹中找到,网址如下:
github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch7
。
模型选择
在开发时间序列分析模型之前的第一步是选择使用哪个模型。正如在第一章中讨论的那样,时间序列分析的一个关键挑战是选择合适的模型。这个选择会影响分析的准确性、可靠性、效率和可扩展性等多个方面。反过来,这确保了分析能够得出更有根据的决策和更有效的结果,同时具有科学的严谨性和实际的实用性。
不同类型的模型各有其特点。
模型类型
时间序列分析模型可以分为统计模型、经典机器学习(ML)模型和深度学习(DL)模型:
-
统计模型用于时间序列分析,它们基于统计理论并假设时间序列的特征,如线性和平稳性。经典模型的例子包括自回归滑动平均模型(ARIMA)、季节性自回归积分滑动平均外生模型(SARIMAX)、指数平滑法(ETS)、广义自回归条件异方差模型(GARCH)和状态空间模型。
-
经典机器学习模型用于时间序列分析,采用无需显式编程的算法从数据中学习。这些模型能够处理非线性关系。然而,它们通常需要比经典模型更多的数据来进行训练。机器学习模型的例子包括线性回归、支持向量机(SVMs)、k 近邻(kNN)、随机森林和梯度提升机。
-
深度学习模型使用具有多个层次的神经网络来学习时间序列数据中的复杂模式。这些模型能够处理非线性关系和长期依赖性。然而,它们需要大量的训练数据和显著的计算资源。深度学习模型的例子包括长短期记忆网络(LSTM)、卷积神经网络(CNN)、时间卷积网络(TCN)、变换器(transformers)和自编码器(autoencoders)。
机器学习与深度学习
深度学习是机器学习的一个子集,它使用深度神经网络。按照常规做法,我们在这里使用“经典机器学习”这一术语,指的是那些不是基于神经网络的方法和模型。深度学习一词则用于描述使用神经网络的方法和模型。
之前提到的每个类别和模型都有独特的特点和方法,这决定了它们的适用性,我们接下来将进一步探讨这些内容。
选择标准
选择使用哪种模型是基于几个标准的。在第一章中关于选择正确模型的部分以及在第四章中初步讨论模型选择时,我们简要提到了这一点。模型在解决时间序列分析问题中的适用性取决于分析目标、数据特征、计算能力和可用时间等因素。
我们现在将深入探讨模型选择中的这些及其他重要因素。
使用场景类型
时间序列分析大体上可分为预测、分类和异常检测等使用场景,如在第二章中讨论的那样。我们将在这里简要回顾这些使用场景,并突出常用的模型。接下来的章节将深入讨论这些内容。
-
预测的目标是基于模型从过去的值中学习到的模式来预测未来的值。如在第二章中所述,预测可以是单步或多步,基于单一(单变量)或多个(多变量)时间序列。常用的模型如 ARIMA、SARIMA 和指数平滑法(ETS)因其简单性而被选择,并在预测任务中表现出强劲的性能。LSTM 和在本书前面章节中介绍的 Prophet,适用于更复杂的预测需求,并在这些场景下能更有效地工作。
-
模式识别和分类用于识别和理解模式,并相应地对时间序列进行分类。常用的模型基于分解方法,例如使用 LOESS 的季节-趋势分解(STL)和多重 STL(MSTL),以及傅里叶分析。我们在第一章和第六章中花了一些时间讨论分解方法。我们在第二章中简要讨论了傅里叶分析,此外还讨论了基于距离的方法、形状分析、集成方法和深度学习。
-
异常检测旨在识别时间序列中的异常值或异常。正如在第二章中所展示的,这种检测可以基于单变量或多变量序列,以及点、集体或上下文分析。最初标记为异常的点可能最终被认为是新颖的,即一种非问题的全新模式。常用的模型基于残差分析的能力,例如 ARIMA。机器学习模型也经常被使用,例如孤立森林,或者在异常值比例较高时,使用专门的方法,如季节性混合极端学生化偏差(SH-ESD)。我们在第二章中看到过使用孤立森林进行异常检测的代码示例,并讨论了监督、无监督、半监督和混合方法。
另一个模型选择标准是我们接下来将讨论的时间序列的统计性质。
时间序列的性质
时间序列的性质,即其统计特性,影响模型的选择。模型的研究和开发通常是基于对时间序列特性的特定假设,从而决定它们的适用性。本节将专注于适用性,并跳过定义,假设到目前为止,你已经熟悉我们将在本节中使用的术语,这些术语基于第一章的介绍和第六章中的代码示例:
-
平稳时间序列可以使用 ARIMA 建模,ARIMA 假设序列是平稳的。平稳时间序列的一个例子是某股票在三年期间的日收益百分比。假设市场没有发生显著的结构性变化,股票收益通常围绕稳定的均值波动,并具有一致的方差。
非平稳时间序列可以通过差分转化为平稳序列,正如在第六章中看到的那样(本列表之后给出了一个简短的差分代码示意)。差分后的序列可以与这些模型一起使用。或者,使用 Prophet 或机器学习模型来处理非平稳序列。非平稳时间序列的一个例子是月度失业率,它可能具有趋势,以及与经济状况相关的周期性模式。
-
季节性时间序列需要能够处理季节性的模型,如 SARIMA、ETS、Prophet 或机器学习模型。我们在第二章中的编码示例中已经看到,使用 Prophet 预测温度的应用。
-
时间序列中的趋势可能会影响某些模型的表现,比如 ARIMA。在这种情况下,类似于平稳性,我们可以通过差分去除趋势成分,如第六章中的代码示例所示。然后可以使用 ARIMA 模型。或者,使用能够处理趋势的模型,如趋势模型、ETS、Prophet 或机器学习。
-
波动性时间序列可以通过模型处理,如广义自回归条件异方差性(GARCH)、随机波动性 GARCH(SV-GARCH)或机器学习。这些模型常用于高度波动的金融市场以及其他领域的预测和风险管理。
-
数据关系的线性意味着线性模型(如 ARIMA)适用。一个线性时间序列的例子是每日气温,其中今天的气温可以通过前两天气温的线性组合加上一些随机误差来预测。
在非线性模式的情况下,具有神经网络的机器学习模型更为合适。一个非线性时间序列的例子是,当股票价格低于某个阈值(比如 100)时遵循一种关系,而当其高于该阈值时则遵循另一种关系。
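下面给出一个简短的差分代码示意(使用 pandas;其中的序列名 data 与差分周期仅为演示假设):
# 一阶差分:去除趋势,使非平稳序列更接近平稳
diff_1 = data.diff().dropna()
# 季节性差分:按季节周期差分(例如小时数据的周期为 24)
diff_24 = data.diff(24).dropna()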
数据分析的量和频率(接下来会讨论)是影响模型选择的时间序列的另一个特性。
数据的量和频率
数据的量和频率会影响所需的计算能力和分析所需的时间。它们的组合决定了选择何种模型进行分析。我们将在这里讨论数据的量和频率,而另外两个因素将在下一个部分进行讨论:
-
小数据集可以使用统计模型进行分析,例如 ARIMA 和 ETS。这些是适用于较小数据集的简单模型。一个小数据集的例子是过去几年某商店的每日销售数据。
-
大数据集非常适合机器学习模型,如梯度提升和 LSTM。这两者之间是相辅相成的:一方面是机器学习模型在处理大数据集时的计算能力和可扩展性,另一方面是需要大量数据进行模型训练以避免过拟合。机器学习模型可以学习大数据集中的复杂模式,但需要更多的计算资源。大数据集的例子包括分钟级股票价格或过去五年的传感器数据。
正如我们在第八章中将看到的,我们可以通过利用 Apache Spark 的分布式计算能力,将模型扩展到大数据集:
-
低频率时间序列,如日度、周度、月度、季度或年度,通常规模较小。如前所述关于小数据集,ARIMA 和 ETS 通常是这类数据集的好选择。
-
高频率时间序列可能具有快速变化、噪声、波动性和异方差性,可以使用诸如 GARCH 这样的模型来处理,这通常用于金融时间序列。
如果需要在数据到达速率以下的较低频率上进行分析,则可以通过重新采样和聚合将高频序列转换为低频序列,如第六章中所讨论的。重新采样会减少数据集的大小,同时平滑噪声和波动性。这打开了使用适合低频时间序列模型的可能性,正如之前讨论的那样。
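例如,使用 pandas 可以按如下方式将高频序列重采样为更低频率(这里的序列名 data 与重采样频率仅为演示假设):
# 将分钟级数据重采样为小时均值,再进一步聚合为日均值
hourly = data.resample('h').mean()
daily = hourly.resample('D').mean()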
高频率数据的减值
我们在这里讨论频率,指的是时间序列中连续数据点之间的时间间隔,也称为粒度。高频数据的另一个考虑因素是分析也需要高频率进行。这是由于高频数据随时间的快速减值。考虑到实时股票 tick 变化在发生时的关键性,但几小时后变得不那么相关。在这种情况下,模型必须能够进行极快的计算,潜在地实时进行。
更高的数据量和频率需要更多的计算资源,这将在接下来进行讨论。
计算约束
像任何其他项目一样,时间序列分析也是在预算内进行的。这意味着可用的资源量,包括执行分析过程的计算能力,是受限制的。同时,我们知道更高的数据量和频率需要更多的计算资源。我们还必须考虑分析需要多快才能完成,以使结果有用。在考虑这些约束条件的同时,让我们来探讨模型的选择:
-
有限的计算资源意味着我们可能需要考虑通过重新采样来减少数据集的大小,并使用 ARIMA 或 ETS 等简单模型的组合。机器学习模型虽然能更好地检测复杂模式和更大的数据集,但通常需要更多的计算资源。
-
快速分析需要使用更快的模型进行训练和预测。像 ARIMA 或 ETS 这样的模型,对于较小的数据集再次是很好的选择。
如果需要对大型数据集进行快速分析,则可以选择以下选项:
-
在下一章中,我们将介绍使用 Apache Spark 集群的分布式处理来扩展大型数据集。
-
重新采样以将数据集大小转换为更小,并使用 ARIMA 或 ETS 等简单模型。
-
使用机器学习模型时的注意事项:对于较大的数据集,训练和调优阶段会变慢。通过使用更多的计算资源可以提高预测速度,但这当然会带来更高的成本。值得注意的是,训练、调优和预测速度也可以通过使用 Apache Spark 的分布式处理来提高,正如我们将在下一章看到的那样。
-
-
计算资源成本是另一个可能限制使用计算密集型模型的重要因素。虽然较简单的统计模型可以在较便宜的标准资源上运行,但深度学习模型可能需要在高性能硬件上使用更昂贵的 GPU。
在考虑了计算需求如何影响模型选择后,我们将进一步考虑模型准确性、复杂性和可解释性如何决定使用哪个模型。
模型准确性、复杂性和可解释性
在模型选择时需要考虑的其他因素包括模型准确性、复杂性和可解释性:
-
模型准确性在许多情况下被错误地视为模型选择的决定性因素。准确性被故意列在选择标准的末尾,目的是强调同时考虑其他因素的重要性。最好的模型不一定是最准确的模型,而是为特定应用场景带来最大投资回报的模型。
当需要高准确性时,特别是在预测中,可能需要更复杂的模型,如 SARIMAX 或深度学习。超参数调优作为开发过程的一部分,用于进一步提高准确性,但这会增加计算开销。
-
复杂性和可解释性通常是相互冲突的。对更高准确性的需求会导致使用更复杂的模型,而这些模型通常更难以解释,常被称为黑箱模型。
如果可解释性至关重要,可以选择较简单的模型,如 ARIMA 或 ETS,这些模型还具有计算资源需求较低的额外优势。基于树的模型,如 GBM 或时间序列树形管道(TSPi),在准确性和计算需求之间提供了良好的平衡,而较简单的树形模型则提供了可解释性。
如果数据表现出复杂的模式且高准确性至关重要,可能没有太多选择,我们可能需要使用复杂的模型,这会在计算资源和可解释性上做出权衡。
模型选择概述
关于模型选择,有几个要点值得注意:
-
统计模型如 ARIMA 基于对时间序列的假设,需进行统计检验,并可能需要额外的预处理,以便在使用模型之前转换序列。
-
Prophet 和机器学习模型更广泛适用,但具有额外的复杂性和计算要求。
-
本节中提到的模型作为示例,适用于讨论的标准。其他模型,来自不断增长的公开可用模型和方法列表,可以并且应该被测试。找到最佳模型是一个实验和迭代的过程,取决于具体的应用场景。
正如我们在选择标准部分所看到的,多个因素会影响模型的选择,并决定在哪些方面投入更多的精力。哪些因素最为重要取决于项目的背景和使用场景。最佳的模型选择是能带来最高投资回报率的模型,这需要在这里讨论的不同因素之间进行权衡。
在此时,选定了模型后,我们已准备好进入下一开发步骤,即在我们的时间序列数据上训练模型。
开发与测试
在本节中,我们将比较不同类别模型的预测性能:统计模型、经典机器学习模型和深度学习模型。我们将使用六种不同的模型:SARIMA、LightGBM、LSTM、NBEATS、NHITS 和 NeuralProphet。这些模型因其广泛且经过验证的应用以及易于访问和使用而被选中。
我们将继续执行以下限制条件:
-
尽可能使用默认模型超参数进行比较,并将调整限制在少数几个案例中,具体内容将在后文说明。
-
完整的执行过程,从数据加载到模型训练、测试和预测,将限制在 15 分钟以内。
-
所使用的计算资源将受限于 Databricks 社区版计算资源,如图 7.1所示,具有 15.3 GB 的内存和 2 个 CPU 核心。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_01.jpg
图 7.1:Databricks 社区版计算资源
在我们的实际项目中,我们常常面临时间和资源的限制。本节还旨在为您提供在这些限制条件下工作的工具。
单线程、多线程和集群
在本章的代码示例中,我们将使用Pandas
和NumPy
。Pandas
在使用 CPU 核心时是单线程的,NumPy
默认是多线程的,因此它会并行使用多个 CPU 核心。两者都绑定到单一机器上,无法利用多机器的 Spark 集群能力。我们将在第八章中讨论如何解决这一限制,该章节涉及扩展。在很多现有的代码示例中,你会发现使用了Pandas
和NumPy
,因此从这些库开始作为基础非常重要。然后,在第八章中,我们将讨论如何将单机代码转换为利用 Spark 集群能力的代码。
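作为说明,下面是一个在 pandas 与 Spark DataFrame 之间相互转换的最小示意(假设环境中已存在 SparkSession 对象 spark,示例数据仅为演示):
import pandas as pd
# pandas DataFrame(单机内存中的数据)
pdf = pd.DataFrame({"Global_active_power": [1.2, 3.4, 2.1]})
# 转换为 Spark DataFrame,以便利用集群的并行处理能力
sdf = spark.createDataFrame(pdf)
# 需要使用单机库(如 statsmodels)时,再转换回 pandas
pdf2 = sdf.toPandas()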
本节使用的时间序列数据是第二章中用于家庭能量消耗的扩展版本。我们将在本章余下的所有模型中使用相同的时间序列。数据集位于ch7
文件夹中的ts-spark_ch7_ds1_25mb.csv
。由于这是一个新数据集,我们将在下一节中通过探索数据的步骤进行介绍。
数据探索
在本节中,我们要检查数据集中的平稳性、季节性和自相关。这是理解时间序列特性的重要步骤。
本节的代码位于ts-spark_ch7_1e_sarima_comm.dbc
。我们按照第一章中“实践操作:加载和可视化时间序列”部分的说明,将代码导入 Databricks 社区版。
代码的 URL 如下:
代码的第一部分加载并准备数据。我们在这里不详细讲解这部分内容,因为我们已经在第五章中涵盖了数据准备的内容,你可以参考笔记本中的代码。然而,数据探索部分与本章相关,因此让我们接下来进一步探索,从平稳性检查开始。
平稳性
我们可以通过运行增强型迪基-富勒(ADF)测试,使用以下代码来检查能量消耗时间序列是否平稳:
from statsmodels.tsa.stattools import adfuller
# Perform Augmented Dickey-Fuller test
result = adfuller(data_hr[-300:]['Global_active_power'])
# if Test statistic < Critical Value and p-value < 0.05
# reject the Null hypothesis, time series does not have a unit root
# series is stationary
# Extract and print the ADF test results
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:')
for key, value in result[4].items():
print(f' {key}: {value}')
这给出了以下 ADF 统计量:
ADF Statistic: -6.615237252003429
p-value: 6.231223531550648e-09
Critical Values:
1%: -3.4524113009049935
5%: -2.8712554127251764
10%: -2.571946570731871
由于 ADF 统计量小于临界值,且 p 值小于 0.05,我们可以得出结论,时间序列是平稳的。
季节性
我们可以通过以下代码检查季节性:
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the time series data into seasonal, trend, and residual
# components
results = seasonal_decompose(data_hr)
# Plot the last 300 data points of the seasonal component
results.seasonal[-300:].plot(figsize = (12,8));
这给出了图 7.2中的季节性分解。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_02.jpg
图 7.2:季节性分解
由于模式每 24 小时重复一次,我们可以得出结论,时间序列具有日常季节性。
自相关
我们可以通过以下代码检查自相关和偏自相关:
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Plot ACF to identify autocorrelation in 'data_hr' DataFrame
plot_acf(data_hr['Global_active_power'])
# Plot PACF to identify partial autocorrelation in 'data_hr' DataFrame
plot_pacf(data_hr['Global_active_power'])
# Display the ACF and PACF plots
plt.show()
这给出了图 7**.3中的自相关图。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_03.jpg
图 7.3:不同滞后(x 轴)下的自相关(y 轴)
我们可以看到在较低的滞后值(包括滞后 1)和滞后 12 时有较高的自相关性,以及在滞后 24 时季节性的影响。考虑到家庭中典型的能量消耗模式,这一点是合理的:
-
比如做饭、洗衣或看电视等活跃能量使用的时段,很可能会超过一个小时(滞后 1)
-
早晨和晚上(滞后 12)通常是活动的高峰期
-
日常例行活动意味着我们每 24 小时会有相似的活动周期(滞后 24)
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_04.jpg
图 7.4:偏自相关
PACF 图显示在滞后 1 时有较高的偏自相关,并在滞后 10 和滞后 23 附近有明显的偏自相关。这与我们提到的家庭能源消费的典型模式一致。
统计模型 – SARIMA
我们将讨论的第一个模型是 SARIMA,它通过加入季节性组件扩展了 ARIMA 模型。虽然 ARIMA 模型解决了自相关、差分平稳性和移动平均等问题,SARIMA 在此基础上还考虑了数据中的季节性模式。
本节的代码位于 ts-spark_ch7_1e_sarima_comm.dbc
文件中。我们按照 第一章 中 动手实践:加载和可视化时间序列 部分的说明,将代码导入 Databricks 社区版。
代码 URL 如下:
开发与调优
在模型开发过程中,我们使用以下代码将数据集的最后 48 小时与训练数据分开。这将用于后续的测试,其他部分将用于训练:
# Split the data into training and testing sets
# The last 48 observations are used for testing,
# the rest for training
train = data_hr[:-48]
test = data_hr[-48:]
我们将讨论两种结合训练和调优的方法,用于训练模型并找到最佳参数:auto_arima
和 ParameterGrid
。
Auto ARIMA
使用 auto ARIMA 方法时,我们希望自动找到能够最小化 AIC 的模型参数。我们将使用 pmdarima 库来演示 auto ARIMA 方法。由于这是一个计算密集型操作,我们希望保持之前解释的时间(15 分钟)和资源(Databricks 社区版)限制,因此我们将数据集限制为最后 300
个数据点。
使用 pmdarima
的代码如下:
import pmdarima as pm
# Create auto_arima model to automatically select the best ARIMA parameters
model = pm.auto_arima(
# Use the last 300 observations of the series for modeling:
train[-300:]["Global_active_power"],
# Enable seasonal differencing:
seasonal=True,
# Set the seasonal period to 24
# (e.g., 24 hours for daily data):
m=24,
# Set the degree of non-seasonal differencing to 0
# (assumes data is already stationary):
d=0,
# Set the degree of seasonal differencing to 1:
D=1,
# Set the maximum value of AR (p) terms to consider:
max_p=3,
# Set the maximum value of MA (q) terms to consider:
max_q=3,
# Set the maximum value of seasonal AR (P) terms to consider:
max_P=3,
# Set the maximum value of seasonal MA (Q) terms to consider:
max_Q=3,
# Use AIC (Akaike Information Criterion) to select the best model:
information_criterion='aic',
# Print fit information to see the progression of
# the model fitting:
trace=True,
# Ignore models that fail to converge:
error_action='ignore',
# Use stepwise algorithm for efficient search of the model space:
stepwise=True,
# Suppress convergence warnings:
suppress_warnings=True
)
# Print the summary of the fitted model
print(model.summary())
以下代码输出展示了逐步搜索最小化 AIC 的参数集。这将是用于 ARIMA 模型的最佳参数集,用于预测家庭的能源消费:
Performing stepwise search to minimize aic
…
ARIMA(1,0,1)(2,1,0)[24] intercept : AIC=688.757, Time=9.37 sec
…
ARIMA(2,0,2)(2,1,0)[24] : AIC=681.750, Time=6.83 sec
…
ARIMA(1,0,1)(2,1,0)[24] : AIC=686.763, Time=6.02 sec
Best model: ARIMA(2,0,2)(2,1,0)[24]
请注意,虽然这是一组最佳的模型参数,但考虑到时间和资源的限制,我们可能会发现,通过更长时间运行算法,我们能够找到更好的模型。
ParameterGrid
使用 ParameterGrid
方法时,我们将逐一遍历参数组合列表,以找到最小化 AIC 的模型参数。
使用 ParameterGrid
的代码如下:
# Define parameter grid for SARIMAX model configuration
param_grid = {
'order': [(0, 0, 0), (1, 0, 1), (2, 0, 0)],
# Non-seasonal ARIMA orders
'seasonal_order': [
(0, 0, 0, 24),
(2, 0, 1, 24),
(2, 1, 1, 24)
], # Seasonal ARIMA orders with period of 24
}
# Initialize variables to store the best AIC and
# corresponding parameters
best_aic = float("inf")
best_params = ["",""]
# Iterate over all combinations of parameters in the grid
for params in ParameterGrid(param_grid):
print(
f"order: {params['order']}, seasonal_order: {params['seasonal_order']}"
)
try:
# Initialize and fit SARIMAX model with current parameters
model = SARIMAX(
train['Global_active_power'],
order=params['order'],
seasonal_order=params['seasonal_order'])
model_fit = model.fit(disp=False)
print(f"aic: {model_fit.aic}")
# Update best parameters if current model has lower AIC
if model_fit.aic < best_aic:
best_aic = model_fit.aic
best_params = params
except Exception as error:
print("An error occurred:", error)
continue
尽管 auto ARIMA 和 ParameterGrid
在最小化 AIC 方面相似,但 auto ARIMA 使用起来要简单得多,仅需一行代码。
在 SARIMA 模型训练完成后,我们将接下来进行模型预测测试。
测试与预测
我们使用模型通过 predict
函数预测测试数据集,每次预测一个周期,每次预测后更新模型的实际值。这种迭代方法将 forecast_step
中的单步预测转化为多步预测:
def forecast_step():
# Predicts the next period with confidence intervals
forecast, conf_int = model.predict(
n_periods=1, return_conf_int=True)
…
# Iterate over each observation in the test dataset
for obs in test['Global_active_power']:
forecast, conf_int = forecast_step() # Forecast next step
forecasts.append(forecast) # Append forecast to list
…
# Update the model with the new observation
model.update(obs)
然后,我们可以在图 7.5和图 7.6中绘制预测值与实际值的对比图。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_05.jpg
图 7.5:SARIMA 预测与实际值(训练与测试)
我们在图 7.6中放大了测试期,以便直观比较预测值与实际值。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_06.jpg
图 7.6:SARIMA 预测与实际值(缩放至测试数据)
虽然可视化图表可以帮助我们了解模型的预测能力,但我们仍然需要定量的指标来评估模型的好坏。这些指标还将帮助我们与其他模型进行预测准确度的比较。
时间序列预测有多种可用的评估指标。本章将展示以下三种指标的使用,突出它们如何服务于不同的目标:
- 均方误差(MSE)度量了预测值(F)与实际值(A)之间差值的平方平均值。当我们希望惩罚较大误差时,它效果很好。然而,由于平方误差会赋予较大差异更大的权重,因此它对异常值敏感。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/2.png
- 对称平均绝对百分比误差(SMAPE)是预测值(F)与实际值(A)之间绝对差值的平均值。它以百分比的形式表示,基于实际值和预测值的绝对值之和的一半。SMAPE 可调整数据的尺度,使其适用于不同数据集之间的比较。由于其对称缩放,它对极端值的敏感度较低。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/3.png
- 加权绝对百分比误差(WAPE)是一个归一化的误差度量,通过实际值加权绝对误差。当处理具有不同大小的数据时,它表现良好,但对大值误差敏感。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/4.png
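作为参考,这三种指标的通用公式如下(其中 $A_t$ 为实际值,$F_t$ 为预测值,$n$ 为观测数;此处为标准定义,用以补充上面图片中的公式):
$$\text{MSE} = \frac{1}{n}\sum_{t=1}^{n}\left(F_t - A_t\right)^2$$
$$\text{SMAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\frac{\left|F_t - A_t\right|}{\left(\left|A_t\right| + \left|F_t\right|\right)/2}$$
$$\text{WAPE} = \frac{\sum_{t=1}^{n}\left|A_t - F_t\right|}{\sum_{t=1}^{n}\left|A_t\right|}$$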
我们将看到两种不同的度量计算方法:模型库中包含的度量计算函数,以及一个独立的专门度量计算库。
模型库中的度量函数
在这种方法中,我们希望使用模型库中已经包含的度量计算函数。我们将使用sklearn
和pmdarima
库进行度量计算,并在以下代码中演示:
from sklearn.metrics import mean_squared_error
from pmdarima.metrics import smape
# Calculate and print the mean squared error of the forecasts
print(f"Mean squared error: {mean_squared_error(test['Global_active_power'], forecasts)}")
# Calculate and print the Symmetric Mean Absolute Percentage Error
# (SMAPE)
print(f"SMAPE: {smape(test['Global_active_power'], forecasts)}")
这给出了以下结果:
Mean squared error: 0.6131968222566936
SMAPE: 43.775868579535334
单独的度量库
在这种第二种度量计算方法中,我们使用SeqMetrics
库,如以下代码所示:
from SeqMetrics import RegressionMetrics, plot_metrics
# Initialize the RegressionMetrics object with actual and
# predicted values
er = RegressionMetrics(
test['Global_active_power'], forecasts)
# Calculate all available regression metrics
metrics = er.calculate_all()
# Plot the calculated metrics using a color scheme
plot_metrics(metrics, color="Blues")
# Display the Symmetric Mean Absolute Percentage Error (SMAPE)
print(f"Test SMAPE: {metrics['smape']}")
# Display the Weighted Absolute Percentage Error (WAPE)
print(f"Test WAPE: {metrics['wape']}")
这给出了以下结果:
Test SMAPE: 43.775868579535334
Test WAPE: 0.4202224470299464
该库还提供了所有计算的度量的可视化,如图 7.7和7.8所示。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_07.jpg
图 7.7:WAPE 的 SeqMetrics 显示
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_08.jpg
图 7.8:SMAPE 的 SeqMetrics 显示
在训练和测试完我们的第一个模型后,我们可以进入下一个模型,这是一个经典的机器学习模型。
经典机器学习模型 – LightGBM
我们将介绍的第二个模型是Light Gradient Boosting Machine(LightGBM),这是一个免费的开源梯度提升模型。它基于树学习算法,旨在高效且分布式。
本节的代码在ts-spark_ch7_1e_lgbm_comm.dbc
中。我们将代码导入到 Databricks 社区版中,按照第一章中解释的方法进行操作。
代码的 URL 如下:
开发与调优
对于模型开发,我们使用以下代码将数据集的最后 48 小时从训练数据中分离出来,用于后续测试。其余部分用于训练:
# Split the data into training and testing sets
# The last 48 observations are used for testing, the rest for training
train = data_hr[:-48]
test = data_hr[-48:]
我们将使用 GridSearchCV
方法为 LGBMRegressor
模型寻找最佳参数。TimeSeriesSplit
用于根据时间序列特性将训练数据集进行交叉验证划分:
# Define the parameter grid for LightGBM
param_grid = {
'num_leaves': [30, 50, 100],
'learning_rate': [0.1, 0.01, 0.001],
'n_estimators': [50, 100, 200]
}
# Initialize LightGBM regressor
lgbm = lgb.LGBMRegressor()
# Setup TimeSeriesSplit for cross-validation
tscv = TimeSeriesSplit(n_splits=10)
# Configure and run GridSearchCV
gsearch = GridSearchCV(
estimator=lgbm,
param_grid=param_grid,
cv=tscv
)
gsearch.fit(X_train, y_train)
# Output the best parameters from Grid Search
print(f"Best Parameters: {gsearch.best_params_}")
我们找到了以下最佳参数:
Best Parameters: {'learning_rate': 0.1, 'n_estimators': 50, 'num_leaves': 30}
基于训练数据集,这将是与 LightGBM 模型预测该家庭能耗时使用的最佳参数集。然后,我们可以用这些参数训练最终模型:
final_model = lgb.LGBMRegressor(**best_params)
final_model.fit(X_train, y_train)
在训练好 LightGBM 模型后,我们将进行模型预测测试。
测试与预测
我们使用模型通过 predict
函数对测试数据集进行预测。请注意,在此情况下,我们并没有使用迭代式多步预测代码,而是使用了滞后值作为模型的输入特征:
# Predict on the test set
y_pred = final_model.predict(X_test)
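作为参考,下面给出一个构造滞后特征的最小示意(假设 data_hr 为按时间升序排列的 pandas DataFrame;列名与后文可解释性部分使用的滞后特征一致,其中 lag24x7 对应 168 小时):
# 最小示意:用 pandas 的 shift 构造滞后特征
for l in [1, 2, 3, 4, 5, 12, 24, 24 * 7]:
    suffix = '24x7' if l == 24 * 7 else str(l)
    data_hr['Global_active_power_lag' + suffix] = (
        data_hr['Global_active_power'].shift(l))
# 丢弃因移位而产生缺失值的行
data_hr = data_hr.dropna()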
然后,我们可以在图 7.9和图 7.10中将预测值与实际值进行对比。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_09.jpg
图 7.9:LightGBM 预测与实际值对比(训练与测试)
我们在图 7.10中放大测试期,以便直观比较预测值与实际值。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_10.jpg
图 7.10:LightGBM 预测与实际值对比(测试数据放大)
根据预测值与实际值,我们可以计算 SMAPE 和 WAPE,得到以下值:
Test SMAPE: 41.457989848314384
Test WAPE: 0.38978585281926825
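如有需要,也可以像前面 SARIMA 部分那样使用 SeqMetrics 库计算这些指标。下面是一个最小示意(假设 y_test 与 y_pred 来自上文训练好的 LightGBM 模型):
from SeqMetrics import RegressionMetrics
# 使用实际值与预测值初始化指标对象
er = RegressionMetrics(y_test, y_pred)
metrics = er.calculate_all()
print(f"Test SMAPE: {metrics['smape']}")
print(f"Test WAPE: {metrics['wape']}")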
现在我们已经训练并测试了统计学和经典机器学习模型,可以进入第三种模型类型——深度学习模型。
深度学习模型 - NeuralProphet
我们将介绍的第三个模型是 NeuralProphet,它是一个免费的开源深度学习模型,灵感来自于我们在前几章使用过的 Prophet 和 AR-Net。NeuralProphet 基于 PyTorch 构建。
本节代码位于 ts-spark_ch7_1e_nprophet_comm.dbc
文件中。我们按照 第一章 中的方式将代码导入 Databricks Community Edition。
注意
请注意,此示例的笔记本需要 Databricks 计算 DBR 13.3 LTS ML。
开发
我们实例化了一个 NeuralProphet
模型,并通过 n_lag
参数指定我们希望使用过去 24 小时的数据进行预测。然后,我们在训练数据集上训练(fit
方法)该模型:
# Initialize and fit the Prophet model
# model = NeuralProphet()
model = NeuralProphet(n_lags=24, quantiles=[0.05, 0.95])
metrics = model.fit(train_df)
只需这两行代码即可训练模型,接下来我们将进行模型预测测试。
测试与预测
在使用模型对测试数据集进行预测之前,我们需要为 NeuralProphet 准备数据,类似于之前为 Prophet 所做的准备。所需的格式是有一个ds
列用于日期/时间,另一个y
列用于预测目标。然后,我们可以使用predict
方法。请注意,在此情况下,我们没有使用迭代的多步预测代码。在前一段代码中指定了滞后 24 作为参数,NeuralProphet 使用过去 24 个值的滑动窗口来预测下一个值:
# Convert the DataFrame index to datetime,
# removing timezone information
test_df['ds'] = test_df.index.to_pydatetime()
test_df['ds'] = test_df['ds'].apply(
lambda x: x.replace(tzinfo=None))
# Rename the target variable for Prophet compatibility
test_df = test_df.rename(columns={'Global_active_power': 'y'})
# Use the trained model to make predictions on the test set
predictions_48h = model.predict(test_df)
我们在图 7.11和图 7.12中将预测值与实际值进行对比。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_11.jpg
图 7.11:NeuralProphet 预测与实际值对比(训练与测试)
我们在图 7.12中放大测试期,以便进行预测与实际值的视觉对比。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_12.jpg
图 7.12:NeuralProphet 预测与实际值对比(放大测试数据)
基于预测和实际值,我们可以计算 SMAPE 和 WAPE,并获得以下值来衡量模型的准确性:
Test SMAPE: 41.193985580947896
Test WAPE: 0.35355667972102317
我们将在后续的模型 比较部分使用这些指标来比较本章中使用的不同模型。
到目前为止,我们已经训练并测试了每种类型的模型:统计模型、经典机器学习模型和深度学习模型。书中 GitHub 仓库中提供了其他一些常用的时间序列模型示例:
-
Prophet:
ts-spark_ch7_1e_prophet_comm.dbc
-
LSTM:
ts-spark_ch7_1e_lstm_comm1-cpu.dbc
-
NBEATS 和 NHITS:
ts-spark_ch7_1e_nbeats-nhits_comm.dbc
我们鼓励你进一步探索这些内容。
拥有一个有效的模型很重要,但还不够。我们还需要能够解释我们使用的模型。接下来我们将介绍这一部分内容。
可解释性
可解释性在许多情况下都是一个关键要求,例如金融和受监管行业。我们将通过一种广泛使用的方法——Shapley 加法解释(SHAP)来解释数据集的不同特征如何影响预测结果。
我们将使用shap
库中的TreeExplainer
函数,应用于经典机器学习模型 – LightGBM部分的最终模型,计算 SHAP 值,从而了解每个特征对模型输出的影响。
import shap
# Initialize a SHAP TreeExplainer with the trained model
explainer = shap.TreeExplainer(final_model)
# Select features for SHAP analysis
X = data_hr[[
'Global_active_power_lag1', 'Global_active_power_lag2',
'Global_active_power_lag3', 'Global_active_power_lag4',
'Global_active_power_lag5', 'Global_active_power_lag12',
'Global_active_power_lag24', 'Global_active_power_lag24x7'
]]
# Compute SHAP values for the selected features
shap_values = explainer(X)
# Generate and display a summary plot of the SHAP values
shap.summary_plot(shap_values, X)
然后,我们可以在图 7.13中绘制特征重要性。正如我们在前一部分的数据探索中所预期的那样,滞后 1 和滞后 24 是对预测贡献最大的特征。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_13.jpg
图 7.13:SHAP – 特征重要性
我们可以通过以下代码进一步分析,聚焦于某个特定的预测值,在此我们要解释第一个预测值的情况:
# Plot a SHAP waterfall plot for the first observation's SHAP values # to visualize the contribution of each feature
shap.plots.waterfall(shap_values[0])
我们可以在图 7.14中看到特征的相对贡献,再次呈现滞后 1 和滞后 24 的主导地位,滞后 12 的贡献相对较小。这与我们在数据探索部分中的分析一致,在该部分中我们确认了这些滞后项在预测家庭能源消耗中的重要性。
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_07_14.jpg
图 7.14:SHAP—特征重要性(首次观测)
模型比较
在结束本章之前,我们将根据我们所测量的指标和代码执行时间对所有测试过的模型进行比较。结果显示在表 7.1中。
模型 | 类型 | SMAPE | WAPE | 训练 | 调优 | 测试 | 总计(包括数据预处理) |
---|---|---|---|---|---|---|---|
NeuralProphet | 深度学习/混合 | 41.19 | 0.35 | 60 秒 | - | 1 秒 | 90 秒 |
LightGBM | 经典机器学习 | 41.46 | 0.39 | 60 秒 | 包含 | 包含 | 137 秒 |
SARIMA | 统计模型 | 43.78 | 0.42 | 包含 | 420 秒 | 180 秒 | 662 秒 |
Prophet | 统计/混合 | 47.60 | 0.41 | 2 秒 | - | 1 秒 | 70 秒 |
NHITS | 深度学习 | 54.43 | 0.47 | 35 秒 | - | 包含 | 433 秒 |
NBEATS | 深度学习 | 54.91 | 0.48 | 35 秒 | - | 包含 | 433 秒 |
LSTM | 深度学习 | 55.08 | 0.48 | 722 秒 | - | 4 秒 | 794 秒 |
表 7.1:模型结果比较
以下是一些关于模型准确性的观察:
-
NeuralProphet 和 LightGBM 在 SMAPE 和 WAPE 指标下提供了最佳的预测准确性。SARIMA 的表现也不算差。
-
深度学习模型 NBEATS、NHITS 和 LSTM 作为单输入模型时预测准确性较差。我们建议进一步探索如何通过多输入来提升它们的表现。
以下内容涉及执行时间:
-
在所有情况下,我们都保持在 900 秒(15 分钟)的总执行时间限制内,使用 2 个 CPU 核心在单节点的 Databricks 社区版集群上运行。这对于 25MB 的数据集来说是可行的。我们将在第八章中看到如何为更大的数据集进行扩展。
-
Prophet、NBEATS 和 NHITS 的执行时间最佳,NeuralProphet 和 LightGBM 紧随其后,训练、调优和测试时间仍在 1 分钟以内。
-
即使我们将数据集限制为最后 300 个观测值,SARIMA 的执行时间仍然相对较高。这是由于 Auto ARIMA 算法在搜索最佳超参数时以及多步迭代预测代码的执行。
-
LSTM 的执行时间最长,这可以通过使用 CPU 而非 GPU 来解释,GPU 对于深度学习来说要快得多。
从这次模型比较的整体结论来看,NeuralProphet 和 LightGBM 是我们使用的数据集的最佳选择,几乎不需要调优,并且符合我们设定的计算和执行时间限制。
总结
在本章中,我们重点讨论了本书的核心主题,即时间序列分析模型的开发,特别是预测模型。从回顾不同类型的模型开始,然后介绍了选择合适模型的关键标准。在本章的第二部分,我们实践了多个模型的开发和测试,并根据准确性和执行时间进行了比较。
在下一章中,我们将扩展一个 Apache Spark 的优势领域:将时间序列分析扩展到大数据。
加入我们社区的 Discord
加入我们社区的 Discord 空间,与作者和其他读者进行讨论:
第三部分:扩展到生产环境及更远发展
在最后一部分,我们将讨论将第二部分中涉及的解决方案扩展和投入生产时需要考虑的因素和实际案例。随后,我们将以使用 Databricks 和生成式 AI 作为解决方案的一部分,来结束本书,并介绍如何进一步推进 Apache Spark 和时间序列分析的应用。
本部分包含以下章节:
-
第八章,大规模处理
-
第九章,投入生产
-
第十章,进一步使用 Apache Spark
-
第十一章,时间序列分析的最新发展
第八章:扩展计算能力
在上一章构建并测试模型后,我们将讨论在大型分布式计算环境中扩展时间序列分析的需求和注意事项。我们将讨论 Apache Spark 如何扩展第七章中的示例,内容从特征工程开始,接着是超参数调优,以及单模型和多模型训练。这些信息对于我们在时间紧迫的情况下分析大量时间序列数据至关重要。
在本章中,我们将涵盖以下主要话题:
-
为什么我们需要扩展时间序列分析?
-
扩展特征工程
-
扩展模型训练
技术要求
在进入主要话题之前,我们将先介绍本章的技术要求,具体如下:
-
书籍 GitHub 仓库中的
ch8
文件夹,网址为:github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch8
-
合成数据:我们将使用合成数据库工具(Synthetic Data Vault),这是一个用于生成合成表格数据的 Python 库。你可以在这里找到更多关于合成数据库的信息:
docs.sdv.dev/sdv
。 -
Databricks 平台:虽然 Databricks 社区版是免费的,但其资源有限。类似地,在个人计算机或笔记本电脑上使用时,资源也可能受到限制。鉴于本章需要演示计算能力的扩展,我们将使用 Databricks 的非社区版。如第一章中所讨论的,你可以注册 Databricks 的 14 天免费试用版,但前提是你需要先拥有一个云服务提供商的账户。一些云服务提供商在开始时提供免费积分,这将为你提供比社区版更多的资源,且仅限时使用。请注意,试用期结束后,费用将转到你注册时提供的信用卡。
使用的 Databricks 计算配置如图 8.1所示。这里展示的工作节点和驱动节点类型基于 AWS,与 Azure 和 GCP 上的配置不同。请注意,UI 界面可能会发生变化,在这种情况下,请参考最新的 Databricks 文档:
https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_1.jpg
图 8.1:Databricks 计算配置
为什么我们需要扩展时间序列分析?
扩展时间序列分析的需求通常源于需要更快地进行分析或处理更大的数据集。在本章中,我们将探讨如何在将数据集大小扩大五倍的同时,减少在第七章中实现的处理时间。这将得益于 Apache Spark 所提供的处理能力。
扩展后的数据集
为了测试 Spark 的可扩展性,我们需要一个比以往更广泛的数据集。虽然你可能已经拥有这样的数据集,但为了本章的目的,我们将扩展在第七章和之前章节中使用的家庭能源消耗数据集。扩展后的数据集将使用 Synthetic Data Vault 工具生成,如技术要求部分所述。
本节的代码位于ts-spark_ch8_1.dbc
中。我们将代码导入到 Databricks 中,方法与第一章中逐步操作:加载和可视化时间序列部分对于社区版的讲解类似。
在这段代码中,我们希望使用一个家庭的数据生成四个其他家庭的能源消耗数据,以将数据规模扩大五倍。
我们首先捕获pdf_main
的元数据,这是较小的参考数据集。元数据作为输入,用于创建一个名为synthesizer
的GaussianCopulaSynthesizer
对象,它表示数据的统计模型。然后,该合成器通过fit
方法在参考数据集(pdf_main
)上进行训练。最终,这个模型将用于通过sample
方法生成合成数据。
较小的参考数据集(pdf_main
)与客户标识符(cust_id
)1
相关联,合成数据集则与标识符2
、3
、4
和5
相关联:
# Initialize metadata object for the dataset
metadata = SingleTableMetadata()
# Automatically detect and set the metadata from the Pandas DataFrame
metadata.detect_from_dataframe(pdf_main)
# Initialize the Gaussian Copula Synthesizer with the dataset metadata
synthesizer = GaussianCopulaSynthesizer(metadata)
# Fit the synthesizer model to the Pandas DataFrame
synthesizer.fit(pdf_main)
…
# Define the number of customer datasets to generate:
num_customers = 5
# Count the number of rows in the original dataset:
sample_size = df_main.count()
i = 1
df_all = df_main.withColumn(
'cust_id', F.lit(i)
) # Add a 'cust_id' column to the original dataset with a constant
# value of 1
…
synthetic_data = spark.createDataFrame(
synthesizer.sample(num_rows=sample_size)
) # Generate synthetic data matching the original dataset's size
…
在这个新的、更大的数据集上运行第七章中的代码将无法达到高效的性能。我们可以对更大的数据集进行纵向扩展或横向扩展时间序列分析。接下来,我们将解释这两种扩展方式。
扩展
扩展是一种更简单的扩展方式,不需要我们修改在第七章中编写的代码。通过增加更多的内存(RAM)和使用更强大的 CPU 或甚至 GPU,我们可以提高性能。这样做在一定程度上是有效的,但会在达到扩展极限、成本过高或回报递减时遇到瓶颈。实际上,由于系统瓶颈和开销,扩展并不会导致性能线性提升。
要进一步扩展,我们需要进行横向扩展。
横向扩展
扩展不仅仅是让我们的单一机器变得更强大,它还涉及增加更多的机器并行处理。这需要一种机制,让代码能够被分发并并行执行,而 Apache Spark 正是提供了这种机制。
在接下来的章节中,我们将介绍 Apache Spark 可用于扩展时间序列分析的几种不同方法:
-
特征工程
-
模型训练
特征工程
Apache Spark 可以利用其分布式计算框架扩展特征工程的处理。这使得特征工程任务能够进行并行处理,我们将在本节中展示如何操作。
我们将在第五章继续讨论数据准备,并改进第七章中的特征工程。我们将在本节中以第七章中开发与测试部分的基于 pandas 的代码示例为基础进行讨论。接下来的示例将展示如何将非 Spark 代码重写为 Spark 兼容代码,从而利用其可扩展性的优势。
虽然 Spark 可以用于多种特征工程方式,但我们将重点讨论以下三种与改进第七章代码相关的方式:
-
列转换
-
重采样
-
滞后值计算
接下来,我们将开始进行列转换的讨论。
列转换
在第一个代码示例中,我们将重写ts-spark_ch7_1e_lgbm_comm.dbc
中已有的列转换代码,该代码用于第七章中的开发与测试部分。我们将通过使用pyspark.sql.functions
库将代码修改为支持 Spark 的版本。为此,我们需要执行以下操作:
-
使用
concat_ws
函数,将现有的Date
和Time
列合并,替换Date
列。 -
将
Date
列转换为时间戳格式(to_timestamp
函数)。 -
有选择性地(使用
when
和otherwise
条件)将Global_active_power
中的错误值?
替换为None
。 -
使用
regexp_replace
函数将Global_active_power
中的,
替换为.
,以确保符合float
值的正确格式。
以下代码示例演示了前述步骤:
from pyspark.sql import functions as F
# Combine 'Date' and 'Time' into a single 'Date' column of timestamp
# type
df_all = df_all.withColumn(
'Date',
F.to_timestamp(
F.concat_ws(' ', F.col('Date'), F.col('Time')),
'd/M/yyyy HH:mm:ss')
)...
# Select only the 'cust_id', 'Date' and 'Global_active_power' columns
df_all = df_all.select(
'cust_id', 'Date', 'Global_active_power'
)
# Replace '?' with None and convert 'Global_active_power' to float
df_all = df_all.withColumn(
'Global_active_power',
F.when(F.col('Global_active_power') == '?', None)
.otherwise(F.regexp_replace(
'Global_active_power', ',', '.').cast('float')
)
)
# Sort the DataFrame based on 'cust_id' and 'Date'
df_all = df_all.orderBy('cust_id', 'Date')
在使用 Spark 并行化列转换后,我们接下来要讲解的代码优化是对时间序列数据进行重采样。
重采样
在第二个代码转换示例中,我们将重写ts-spark_ch7_1e_lgbm_comm.dbc
中每小时重采样的代码,该代码用于第七章中的开发与测试部分。我们希望计算每个客户的Global_active_power
的每小时均值。为此,我们需要执行以下操作:
-
使用
date_format
函数将Date
列转换为日期和小时组件。 -
对每个客户(
groupBy
函数),将Global_active_power
的重采样平均值转换为每小时的均值(使用agg
和mean
函数)。
以下代码展示了前述步骤:
from pyspark.sql import functions as F
# Convert the 'Date' column to a string representing the
# start of the hour for each timestamp
data_hr = df_all.withColumn(
'Date',
F.date_format('Date', 'yyyy-MM-dd HH:00:00'))
# Group the data by 'cust_id' and the hourly 'Date',
# then calculate the mean 'Global_active_power' for each group
data_hr = data_hr.groupBy(
'cust_id', 'Date').agg(
F.mean('Global_active_power').alias('Global_active_power')
)
现在我们已经使用 Spark 对重采样过程进行了并行化,接下来我们要介绍的代码优化是计算时间序列数据的滞后值。
计算滞后值
在第三个例子中,使用 Apache Spark 进行特征工程的扩展时,我们将重写存在于 ts-spark_ch7_1e_lgbm_comm.dbc
中的滞后计算代码,该代码用于第七章中的开发和测试部分。我们希望为每个客户计算不同的滞后值。为此,我们需要执行以下操作:
-
定义一个滑动日期窗口来计算每个客户的滞后值(
partitionBy
函数)。我们已经为每个客户按日期排序(orderBy
函数)。 -
计算滑动窗口上的不同滞后值(
lag
和over
函数)。 -
请注意,由于滞后计算基于先前的值,数据集开头的某些滞后值可能没有足够的先前值进行计算,因此会为空。我们使用
dropna
函数删除这些空滞后值的行。
以下代码演示了上述步骤:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
# Define a window specification partitioned by -
# 'cust_id' and ordered by the 'Date' column
windowSpec = Window.partitionBy("cust_id").orderBy("Date")
# Add lagged features to the DataFrame to incorporate
# past values as features for forecasting
# Apply the lag function to create the lagged column,
# separately for each 'cust_id'
# Lag by 1, 2, 3, 4, 5, 12, 24, 168 hours (24 hours * 7 days)
lags = [1, 2, 3, 4, 5, 12, 24, 24*7]
for l in lags:
data_hr = data_hr.withColumn(
'Global_active_power_lag' + str(l),
F.lag(F.col('Global_active_power'), l).over(windowSpec))
# Remove rows with NaN values that were introduced by
# shifting (lagging) operations
data_hr = data_hr.dropna()
通过使用 Spark 函数而不是 pandas,我们将使 Spark 能够并行化处理大数据集的滞后计算。
现在我们已经介绍了不同的方法来利用 Apache Spark 提升第七章中特征工程部分的代码,接下来我们将深入探讨模型训练的扩展。
模型训练
在本节中,我们将涵盖以下几种不同的方式,即 Apache Spark 如何用于规模化模型训练:
-
超参数调优
-
单模型并行训练
-
多模型并行训练
这些方法使得在拥有大数据集或需要训练多个模型时能够高效地进行模型训练。
当使用不同的超参数重复训练同一模型时,超参数调优可能是昂贵的计算。我们希望能够利用 Spark 高效地找到最佳超参数。
同样地,对于大数据集,训练单一模型可能需要很长时间。在其他情况下,我们可能需要训练许多模型以适应不同的时间序列数据集。我们希望通过在 Spark 集群上并行化训练来加速这些过程。
我们将在下一节详细介绍这些方法,从超参数调优开始。
超参数调优
如第四章中所述,机器学习中的超参数调优是为机器学习算法找到最佳配置集的过程。这种寻找最佳超参数的过程可以使用 GridSearchCV、Hyperopt 和 Optuna 等库来并行化,这些库与 Apache Spark 结合提供后端处理并行性的框架。
我们在第三章中讨论了 Spark 的处理并行性。这里我们将更专注于使用 Optuna 与 Apache Spark 结合进行超参数调整。
如果你还记得,在第七章中,我们在单个节点上使用 GridSearchCV 调整了 LightGBM 模型的超参数。在这一节的代码示例中,我们将通过并行化过程来改进这一点。我们将使用 Optuna 与 Spark 一起,找到我们在第七章中探索的 LightGBM 模型的最佳超参数。
Optuna 是一个开源的超参数优化框架,用于自动化超参数搜索。您可以在这里找到有关 Optuna 的更多信息:optuna.org/
。
我们将通过定义一个 objective
函数开始调优过程(稍后我们将使用 Optuna 优化该函数)。此 objective
函数执行以下操作:
-
在
params
中定义超参数值的搜索空间。 -
初始化 LightGBM
LGBMRegressor
模型,使用特定于试验的参数。 -
在训练数据集上训练(
fit
)模型。 -
使用模型对验证数据集进行预测。
-
计算模型的评估指标(
mean_absolute_percentage_error
)。 -
返回评估指标。
以下代码展示了前面的步骤:
import lightgbm as lgb
from sklearn.metrics import mean_absolute_percentage_error
import optuna
def objective(trial):
# Define the hyperparameter configuration space
params = {
# Specify the learning task and
# the corresponding learning objective:
"objective": "regression",
# Evaluation metric for the model performance:
"metric": "rmse",
# Number of boosted trees to fit:
"n_estimators": trial.suggest_int("n_estimators", 50, 200),
# Learning rate for gradient descent:
"learning_rate": trial.suggest_float(
"learning_rate", 0.001, 0.1, log=True),
# Maximum tree leaves for base learners:
"num_leaves": trial.suggest_int("num_leaves", 30, 100),
}
# Initialize the LightGBM model with the trial's parameters:
model = lgb.LGBMRegressor(**params)
# Train the model with the training dataset:
model.fit(X_train, y_train)
# Generate predictions for the validation dataset:
y_pred = model.predict(X_test)
# Calculate the Mean Absolute Percentage Error (MAPE)
# for model evaluation:
mape = mean_absolute_percentage_error(y_test, y_pred)
# Return the MAPE as the objective to minimize
return mape
一旦定义了目标函数,接下来的步骤如下:
-
注册 Spark(
register_spark
函数)作为后端。 -
创建一个研究(
create_study
函数),它是一个包含试验的集合,用于最小化评估指标。 -
在 Spark
parallel_backend
上运行研究,以优化objective
函数在n_trials
上的表现。
以下代码展示了前面的步骤:
from joblibspark import register_spark
# This line registers Apache Spark as the backend for
# parallel computing with Joblib, enabling distributed
# computing capabilities for Joblib-based parallel tasks.
register_spark()
…
# Create a new study object with the goal of minimizing the objective # function
study2 = optuna.create_study(direction='minimize')
# Set Apache Spark as the backend for parallel execution of –
# trials with unlimited jobs
with joblib.parallel_backend("spark", n_jobs=-1):
    # Optimize the study by evaluating the
    # objective function over 10 trials:
    study2.optimize(objective, n_trials=10)
We can then retrieve the objective value (trial.value) and parameters (trial.params) for best_trial:
# 从优化研究中获取最佳试验
trial = study2.best_trial
# 打印最佳试验的目标函数值,通常是准确性或损失
print(f"最佳试验准确率:{trial.value}")
print("最佳试验参数:")
# 遍历最佳试验的超参数并打印它们
for key, value in trial.params.items():
    print(f" {key}: {value}")
The outcome of the hyperparameter tuning, shown in *Figure 8.2*, is the best hyperparameters found within the search space specified, as well as the related model accuracy.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_2.jpg>
Figure 8.2: Hyperparameter tuning – best trials
In addition to the scaling of the hyperparameter tuning stage, which we have seen in this section, Spark clusters can also be used to parallelize the next step, that is, fitting the model to the training data. We will cover this next.
Single model in parallel
Ensemble methods such as Random Forest and gradient boosting machines can benefit from task parallelism during the model training stage. Each tree in a Random Forest can be trained independently, making it possible to parallelize across multiple processors. Similarly, in the case of gradient boosting models such as LightGBM and XGBoost, the construction of each tree can be parallelized, even though the boosting itself is sequential.
In *Chapter 7*’s example in the *Classical machine learning model* section, we used LightGBM. This model was not Spark enabled. Here, as we want to demonstrate training parallelism with a Spark-enabled Gradient Boosting model, we will use `SparkXGBRegressor` instead.
As a first step, we will build a vector of the features using `VectorAssember`, as shown in the following code:
from pyspark.ml.feature import VectorAssembler
# 定义一个列表来保存滞后特征列的名称
inputCols = []
# 遍历滞后间隔列表以创建特征列名称
for l in lags:
    inputCols.append('Global_active_power_lag' + str(l))
# 初始化 VectorAssembler,使用创建的特征列名称并指定输出列名称
assembler = VectorAssembler(
    inputCols=inputCols, outputCol="features")
We then create the `SparkXGBRegressor` model object, setting `num_workers` to all available workers, and specifying the target column with `label_col`:
from xgboost.spark import SparkXGBRegressor
# 初始化 SparkXGBRegressor 用于回归任务。
# num_workers 设置为 Spark 上下文的默认并行度级别,
# 以利用所有可用核心。
# label_col 指定用于预测的目标变量列名。
# missing 设置为 0.0,以处理数据集中的缺失值。
xgb_model = SparkXGBRegressor(
    num_workers=sc.defaultParallelism,
    label_col="Global_active_power", missing=0.0
)
As we have seen so far, hyperparameter tuning is an important step in finding the best model. In the following code example, we will use `ParamGridBuilder` to specify the range of parameters that are specific to the model and that we want to evaluate.
We then pass the parameters to `CrossValidator` together with `RegressionEvaluator`. We will use the root mean square error (`rmse`) as the evaluation metric. This is the default metric for `RegressionEvaluator`, making it suitable for our example here:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
# 初始化超参数调整的参数网格
# - max_depth: 指定模型中树的最大深度
# - n_estimators: 定义模型中树的数量
paramGrid = ParamGridBuilder()\
    .addGrid(xgb_model.max_depth, [5, 10])\
    .addGrid(xgb_model.n_estimators, [30, 100])\
    .build()
# 初始化回归评估器用于模型评估
# - metricName: 指定用于评估的指标,这里是 RMSE(均方根误差)
# - labelCol: 标签列的名称
# - predictionCol: 预测列的名称
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol=xgb_model.getLabelCol(),
    predictionCol=xgb_model.getPredictionCol()
)
# 初始化 CrossValidator 进行超参数调整
# - estimator: 需要调优的模型
# - evaluator: 用于模型评估的评估器
# - estimatorParamMaps: 用于调优的参数网格
cv = CrossValidator(
    estimator=xgb_model, evaluator=evaluator,
    estimatorParamMaps=paramGrid)
At this point, we are ready to build a pipeline (`Pipeline`) to train (`fit`) the model. We will do this by combining in sequence the `VectorAssembler` (`assembler`) and `CrossValidator` (`cv`) stages:
from pyspark.ml import Pipeline
# 初始化一个包含两个阶段的管道对象:
# 一个特征组装器和一个交叉验证器用于模型调优
pipeline = Pipeline(stages=[assembler, cv])
接下来,我们将训练数据筛选(filter 函数)为 cust_id 为 1 的记录。接着,我们使用除最后 48 小时之外的所有记录(head 函数)进行训练,因为最后 48 小时的数据将用于测试。最终得到的 train_hr DataFrame 包含了每小时的训练数据:
# Filter the dataset for customer with cust_id equal to 1
train_hr = data_hr.filter('cust_id == 1')
# Create a Spark DataFrame excluding the last 48 records for training
train_hr = spark.createDataFrame(
train_hr.head(train_hr.count() - 48)
)
# Fit the pipeline model to the training data
pipelineModel = pipeline.fit(train_hr)
同样,对于测试,我们将筛选出 cust_id
1
,并在这种情况下使用最后的 48 小时数据。然后,我们可以将模型(pipelineModel
)应用于测试数据(test_hr
),以获取这 48 小时的能耗预测:
# Filter the dataset for customer with cust_id equal to 1 for testing
test_hr = data_hr.filter('cust_id == 1')
# Create a Spark DataFrame including the last 48 records for testing
test_hr = spark.createDataFrame(test_hr.tail(48))
…
# Apply the trained pipeline model to the test data to generate
# predictions
predictions = pipelineModel.transform(test_hr)
We then use RegressionEvaluator (the evaluator object) to calculate (the evaluate function) the RMSE:
# 使用评估器评估模型的性能
# 均方根误差(RMSE)指标
rmse = evaluator.evaluate(predictions)
For comparison, we also calculate the **Symmetric Mean Absolute Percentage Error** (**SMAPE**) and **Weighted Absolute Percentage Error** (**WAPE**) similarly to how we have done in the *Classical machine learning model* section of *Chapter 7*. The results are shown in *Figure 8.3*.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_3.jpg>
Figure 8.3: XGBoost evaluation metrics
We plot the forecast against the actual values in *Figures 8.4* and *8.5*.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_4.jpg>
Figure 8.4: XGBoost forecast versus actuals (training and testing)
We zoom in on the testing period in *Figure 8.5* for a visual comparison of the forecast and actuals.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_5.jpg>
Figure 8.5: XGBoost forecast versus actuals (zoom on test data)
In this section, we have seen parallelism in single-model training. This requires the use of a library, such as XGBoost used here, which supports a multi-node processing backend such as Apache Spark. In addition to ensemble methods, other models, such as deep learning, can benefit from training parallelism.
Multiple models can also be trained in parallel, which we will explore next.
Multiple models in parallel
Earlier in this chapter, we scaled the dataset to represent the household energy consumption of multiple customers. In this section, we will train a different machine learning model for each customer in parallel. This is required if we want to predict the energy consumption of individual customers based on their own historical consumption. There are several other use cases where such multi-model training is required, for example, in the retail industry when doing sales forecasting for individual products or stores.
Coming back to our energy consumption example, the `train_model` function does the following for each customer:
1. Get the customer ID (`cust_id`) from the pandas DataFrame passed as input.
2. Choose the features (`X`) and target (`y`) variables.
3. Split (`train_test_split`) the dataset into training and testing, specifying `shuffle` as `False` to preserve the time order. As discussed in *Chapter 1*, this is an important consideration for time-series datasets.
4. Perform hyperparameter tuning with `GridSearchCV` using `LGBMRegressor` as the model and `TimeSeriesSplit` for the dataset splits.
5. Train (`fit`) the final model with the best hyperparameters (`best_params`) on the full training dataset.
6. Test the final model on the test dataset and calculate the evaluation metrics (`rmse` and `mape`).
7. Return the result of `train_model` in a DataFrame with `cust_id`, `best_params`, `rmse`, and `mape`.
The following code shows the function definition with the preceding steps:
def train_model(df_pandas: pd.DataFrame) -> pd.DataFrame:
    # 提取用于训练模型的客户 ID
    cust_id = df_pandas["cust_id"].iloc[0]
    # 从 DataFrame 中选择特征和目标变量
    X = df_pandas[[
        'Global_active_power_lag1', 'Global_active_power_lag2',
        'Global_active_power_lag3', 'Global_active_power_lag4',
        'Global_active_power_lag5', 'Global_active_power_lag12',
        'Global_active_power_lag24', 'Global_active_power_lag168'
    ]]
    y = df_pandas['Global_active_power']
    # 将数据集分为训练集和测试集,并保持时间顺序
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False, random_state=12
    )
    # 定义 LightGBM 模型调优的超参数空间
    param_grid = {
        'num_leaves': [30, 50, 100],
        'learning_rate': [0.1, 0.01, 0.001],
        'n_estimators': [50, 100, 200]
    }
    # 初始化 LightGBM 回归模型
    lgbm = lgb.LGBMRegressor()
    # 初始化 TimeSeriesSplit 进行交叉验证,尊重时间序列数据结构
    tscv = TimeSeriesSplit(n_splits=10)
    # 使用交叉验证进行网格搜索
    gsearch = GridSearchCV(
        estimator=lgbm, param_grid=param_grid, cv=tscv)
    gsearch.fit(X_train, y_train)
    # 提取最佳超参数
    best_params = gsearch.best_params_
    # 使用最佳参数训练最终模型
    final_model = lgb.LGBMRegressor(**best_params)
    final_model.fit(X_train, y_train)
    # 对测试集进行预测
    y_pred = final_model.predict(X_test)
    # 计算 RMSE 和 MAPE 指标
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mape = mean_absolute_percentage_error(y_test, y_pred)
    # 准备返回的结果 DataFrame
    return_df = pd.DataFrame(
        [[cust_id, str(best_params), rmse, mape]],
        columns=["cust_id", "best_params", "rmse", "mape"]
    )
    return return_df
Now that the model training function is defined, we can launch it in parallel for each customer (the `groupBy` function), passing a pandas DataFrame of all the rows for this specific customer to the `applyInPandas` function.
pandas UDFs, mapInPandas, and applyInPandas
Using Spark-enabled libraries, as we did in the previous section with single-model parallel training, is usually faster for large datasets than single-machine libraries. There are, however, cases when we have to use a library that isn’t implemented natively for Spark’s parallel processing. In these situations, we can use pandas UDFs, `mapInPandas`, or `applyInPandas`. These methods allow you to call pandas operations in a distributed way from Spark. The common use cases are as follows:
- **pandas UDF**: One input row for one output row
- **mapInPandas**: One input row for multiple output rows
- **applyInPandas**: Multiple input rows for one output row
Note that these are general guidance and that there is great flexibility in how these methods can be used.
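As an illustration, here is a minimal, hypothetical sketch of the first two approaches applied to the hourly `data_hr` DataFrame (the unit conversion and the filter threshold are assumptions for demonstration only; the `applyInPandas` case is shown in the code that follows):
import pandas as pd
from pyspark.sql.functions import pandas_udf
# pandas UDF: one output value per input row (applied in vectorized batches)
@pandas_udf("double")
def kw_to_watts(kw: pd.Series) -> pd.Series:
    return kw * 1000.0
df_watts = data_hr.withColumn(
    "Global_active_power_w", kw_to_watts("Global_active_power"))
# mapInPandas: each input batch can produce a different number of output rows
def keep_high_consumption(batches):
    for pdf in batches:
        yield pdf[pdf["Global_active_power"] > 1.0]
df_high = data_hr.mapInPandas(keep_high_consumption, schema=data_hr.schema)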
In the example in this section, we use `applyInPandas` as we want to execute a pandas-enabled function for all the rows in the dataset corresponding to a specific customer for model training. We want the function to output one row with the result of model training for the specific customer.
Note how, in the following code extract, we specified the `train_model_result_schema` schema of the function’s return value. This is a requirement for serializing the result that is added to the `train``_model_result_df` pandas DataFrame:
from pyspark.sql.functions import lit
# 按客户 ID 对数据进行分组,并使用 Pandas UDF
# 将 train_model 函数应用于每个分组
# 结果 DataFrame 的模式由 train_model_result_schema 定义
# 缓存结果 DataFrame 以优化后续操作的性能
train_model_result_df = (
data_hr
.groupby(“cust_id”)
.applyInPandas(train_model, schema=train_model_result_schema)
.cache()
)
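The schema itself is not shown in this extract; a definition along the following lines could be assumed, mirroring the cust_id, best_params, rmse, and mape columns returned by `train_model` (the exact field types are an assumption):
from pyspark.sql.types import (
    DoubleType, IntegerType, StringType, StructField, StructType)
# Hypothetical definition of train_model_result_schema
train_model_result_schema = StructType([
    StructField("cust_id", IntegerType()),
    StructField("best_params", StringType()),
    StructField("rmse", DoubleType()),
    StructField("mape", DoubleType()),
])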
*Figure 8.6* shows the outcome of the multi-model training. It shows the best hyperparameters (the `best_params` column) and evaluation metrics (the `rmse` and `mape` columns) for each customer.
<https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_6.jpg>
Figure 8.6: Multi-model training – best hyperparameters and evaluation metrics
With this example, we have trained five different models representing different customers. We have found the best hyperparameters to use for each model, which we are then able to use to do individual energy consumption forecasting.
With this, we conclude the different ways in which we can leverage Apache Spark to scale time-series analysis. Next, we will discuss some of the ways that the training process can be optimized.
Training optimization
When training machine learning models at a large scale, several inefficiencies and overheads can impact resource utilization and performance. These include the following:
* Idle time waiting for resources such as GPU, network, and storage accesses, which can delay the training process.
* Frequent checkpointing, which saves the model during training to avoid restarting in case of failure. This results in additional storage and time during model training.
* Hardware or software failures during the training result in restarts, which waste resources and delay the training.
The following mitigation techniques can be used, depending on the model being trained and the library in use:
* Eliminate the cause of idle wait times by provisioning sufficient compute, network, and storage resources
* Avoid too frequent checkpointing
* Rearrange features based on correlation with the target variable or their importance to facilitate convergence during model training
* Reduce the dimensionality of the dataset, choosing the most informative features
While the implementation details of these techniques are beyond our scope here, we recommend researching and addressing these points when operating at a large scale due to the potentially high impact on cost, efficiency, and scalability.
Summary
In this chapter, we saw the need to scale the processing capacity for bigger datasets. We examined different ways of using Apache Spark to this end. Building on and extending the code examples from *Chapter 7*, we focused on scaling the feature engineering and model training stages. We looked at leveraging Spark to scale transformations, aggregations, lag values calculation, hyperparameter tuning, and single- and multi-model training in parallel.
In the next chapter, we will cover the considerations for going to production with time-series analysis, using and extending what we have learned so far.
Join our community on Discord
Join our community’s Discord space for discussions with the authors and other readers:
[`packt.link/ds`](https://packt.link/ds)