Happy Learning Starts with Translation, Part 3: Time Series Forecasting with the Long Short-Term Memory Network in Python


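This post shows how to preprocess time series data into a supervised learning problem suitable for an LSTM model, including how to transform the data with Pandas.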


LSTM Data Preparation

Before we can fit an LSTM model to the dataset, we must transform the data.

This section is broken down into three steps:

Transform the time series into a supervised learning problem.
Transform the time series data so that it is stationary.
Transform the observations to have a specific scale.

Transform Time Series to Supervised Learning

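The LSTM model in Keras assumes that your data is divided into input (X) and output (y) components.

(Translator's note: calling y the "output" is misleading. It is the target value, also called the validation value; the output proper is the model's prediction. I wonder how many readers are led astray by this. Learning material on AI seems to be equally messy at home and abroad, and beginners face far too many pitfalls. In the rest of this translation, wherever the author's "output" is not a prediction, read it as the target value.)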

For a time series problem, we can achieve this by using the observation from the last time step (t-1) as the input and the observation at the current time step (t) as the target.

We can achieve this using the shift() function in Pandas, which pushes all values in a series down by a specified number of places. We require a shift of 1 place; the shifted series becomes the input variable. The time series as it stands (that is, unshifted) becomes the target variable.
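As a minimal sketch of what a one-place shift does, the snippet below applies shift(1) to a tiny made-up series. The values are illustrative only and do not come from the Shampoo Sales dataset.

```python
import pandas as pd

# A tiny made-up series standing in for monthly observations.
series = pd.Series([10.0, 20.0, 30.0, 40.0])

# shift(1) pushes every value down one row; the first row has no
# predecessor, so pandas puts NaN there.
shifted = series.shift(1)

print(shifted.tolist())  # [nan, 10.0, 20.0, 30.0]
```

Row i of the shifted series now holds the observation from step i-1, which is exactly the input that pairs with the unshifted value at row i.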

We can then concatenate these two series together to create a DataFrame ready for supervised learning. The pushed-down series will have a new position at the top with no value, so a NaN (not a number) value will be used in this position. We will replace these NaN values with 0 values, which the LSTM model will have to learn as "the start of the series" or "I have no data here," as a month with zero sales has not been observed in this dataset.

(Translator's note: there is another trap here. Who exactly has not observed a month with zero sales, the model or a person? I read it as a person: looking at this sales dataset, we see no month with zero sales, which is why NaN can safely be replaced with 0, while the model must learn that a 0 marks the start of the series or missing data.)
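The concatenation and NaN replacement can be sketched on three illustrative values. The column labels "input_t-1" and "target_t" are hypothetical names added here for readability; the tutorial's own helper leaves the default labels in place.

```python
import pandas as pd

sales = pd.Series([266.0, 145.9, 183.1])  # illustrative values

# Put the shifted (input) series next to the unshifted (target) series.
frame = pd.concat([sales.shift(1), sales], axis=1)
frame.columns = ["input_t-1", "target_t"]  # hypothetical labels

# Only the first row lacks a previous observation.
has_gap = frame["input_t-1"].isna().tolist()

# Replace the gap with 0; this is safe only because no real sale is 0.
assert (sales > 0).all()
frame = frame.fillna(0)
print(frame)
```

The assertion spells out the assumption behind the choice of 0: if the data could legitimately contain zeros, the filler would be indistinguishable from a real observation.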

The code below defines a helper function to do this called timeseries_to_supervised(). It takes a NumPy array of the raw time series data and a lag, that is, the number of shifted series to create and use as inputs.

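from pandas import DataFrame
from pandas import concat

# frame a sequence as a supervised learning problem
def timeseries_to_supervised(data, lag=1):
	df = DataFrame(data)
	# one shifted copy per lag; these become the input columns
	columns = [df.shift(i) for i in range(1, lag+1)]
	# the unshifted series is the target column
	columns.append(df)
	df = concat(columns, axis=1)
	# rows without enough history get 0 instead of NaN
	df.fillna(0, inplace=True)
	return df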
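To see how the lag argument changes the result, here is a small sketch that repeats the helper and calls it with lag=2 on a made-up four-value list. The input values are illustrative only.

```python
from pandas import DataFrame, concat

def timeseries_to_supervised(data, lag=1):
    df = DataFrame(data)
    columns = [df.shift(i) for i in range(1, lag + 1)]
    columns.append(df)
    df = concat(columns, axis=1)
    df.fillna(0, inplace=True)
    return df

raw = [1.0, 2.0, 3.0, 4.0]  # made-up observations
supervised = timeseries_to_supervised(raw, lag=2)

# Columns are ordered (t-1, t-2, t); missing history is filled with 0.
rows = supervised.values.tolist()
print(rows)
# [[0.0, 0.0, 1.0], [1.0, 0.0, 2.0], [2.0, 1.0, 3.0], [3.0, 2.0, 4.0]]
```

Note that the first `lag` rows contain filler zeros, so with larger lags more of the early rows carry placeholder inputs rather than real history.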

pandas.DataFrame

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Parameters:


data : numpy ndarray (structured or homogeneous), dict, or DataFrame

Dict can contain Series, arrays, constants, or list-like objects

Changed in version 0.23.0: If data is a dict, argument order is maintained for Python 3.6 and later.

index : Index or array-like

Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided

columns : Index or array-like

Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided

dtype : dtype, default None

Data type to force. Only a single dtype is allowed. If None, infer

copy : boolean, default False

Copy data from inputs. Only affects DataFrame / 2d ndarray input

Examples for DataFrame()

Constructing DataFrame from a dictionary.

>>> import numpy as np
>>> import pandas as pd
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

 

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

 

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

 

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
...                    columns=['a', 'b', 'c', 'd', 'e'])
>>> df2
    a   b   c   d   e
0   2   8   8   3   4
1   4   2   9   0   9
2   1   0   7   8   0
3   5   1   7   1   3
4   6   0   2   4   2
 

pandas.DataFrame.shift

DataFrame.shift(periods=1, freq=None, axis=0)

Shift index by desired number of periods with an optional time freq

Parameters:


periods : int

Number of periods to move, can be positive or negative

freq : DateOffset, timedelta, or time rule string, optional

Increment to use from the tseries module or time rule (e.g. ‘EOM’). See Notes.

axis : {0 or ‘index’, 1 or ‘columns’}

Returns:

shifted : DataFrame

Notes

If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
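The difference between shifting the data and shifting the index can be sketched as follows. The dates and values are made up for illustration; only `Series.shift` with its `periods` and `freq` arguments is used.

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=3, freq="D")
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Without freq: values move down one row, the index stays put,
# and the first row becomes NaN.
moved_data = s.shift(1)

# With freq: the index moves forward one day, the values stay
# attached to their rows, and nothing becomes NaN.
moved_index = s.shift(1, freq="D")

print(moved_data)
print(moved_index)
```

For framing a supervised learning problem, the first form is the one needed, because the goal is to realign the values against the same index.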

pandas.DataFrame.fillna

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Fill NA/NaN values using the specified method

Parameters:


value : scalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series:
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap

axis : {0 or ‘index’, 1 or ‘columns’}

inplace : boolean, default False

If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcast : dict, default is None

a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)

Returns:

filled : DataFrame

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                    columns=list('ABCD'))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

   

Replace all NaN elements with 0s.

>>> df.fillna(0)
    A   B   C   D
0   0.0 2.0 0.0 0
1   3.0 4.0 0.0 1
2   0.0 0.0 0.0 5
3   0.0 3.0 0.0 4

   

We can also propagate non-null values forward or backward.

>>> df.fillna(method='ffill')
    A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   3.0 4.0 NaN 5
3   3.0 3.0 NaN 4

   

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 2.0 1
2   0.0 1.0 2.0 5
3   0.0 3.0 2.0 4

   

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 NaN 1
2   NaN 1.0 NaN 5
3   NaN 3.0 NaN 4
   

We can test this function with our loaded Shampoo Sales dataset and convert it into a supervised learning problem.

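# pandas.datetime is gone from newer pandas releases; use the standard library
from datetime import datetime
from pandas import read_csv
from pandas import DataFrame
from pandas import concat

# frame a sequence as a supervised learning problem
def timeseries_to_supervised(data, lag=1):
	df = DataFrame(data)
	# one shifted copy per lag; these become the input columns
	columns = [df.shift(i) for i in range(1, lag+1)]
	# the unshifted series is the target column
	columns.append(df)
	df = concat(columns, axis=1)
	# rows without enough history get 0 instead of NaN
	df.fillna(0, inplace=True)
	return df

# load dataset
def parser(x):
	return datetime.strptime('190'+x, '%Y-%m')
# read_csv's squeeze=True and date_parser arguments are removed or deprecated
# in pandas 2.x, so squeeze and parse the index explicitly instead
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = [parser(x) for x in series.index]
# transform to supervised learning
X = series.values
supervised = timeseries_to_supervised(X, 1)
print(supervised.head())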

Running the example prints the first 5 rows of the new supervised learning problem.

            0           0
0    0.000000  266.000000
1  266.000000  145.899994
2  145.899994  183.100006
3  183.100006  119.300003
4  119.300003  180.300003
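Since Keras expects separate input (X) and target (y) arrays, a framed table like the one printed here still has to be split by column. The sketch below assumes the last column is the target and every column before it is an input; the array holds only the first three rows of that output, rounded, for illustration.

```python
import numpy as np

# First three rows of the framed table (input at t-1, target at t).
data = np.array([
    [0.0, 266.0],
    [266.0, 145.9],
    [145.9, 183.1],
])

# Every column but the last is an input; the last column is the target.
X, y = data[:, :-1], data[:, -1]

print(X.shape, y.shape)  # (3, 1) (3,)
```

Slicing with `:-1` rather than a fixed column index keeps the split valid when the table is framed with a larger lag.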

For more information on transforming a time series problem into a supervised learning problem, see the post:

Time Series Forecasting as Supervised Learning

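To sum up, a so-called supervised learning model is simply a model that has target values to learn against. Whether the inputs are the previous few time steps, and whether the targets are the current step or several steps ahead, gives rise to many variants; in object-oriented terms, they are just different instances of the same class, haha.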
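One such variant can be sketched as follows: several past steps as inputs and several future steps as targets. The function name `frame_supervised` and its parameters `n_in` and `n_out` are hypothetical and not part of the original tutorial. Unlike the tutorial's helper, this sketch drops incomplete rows instead of filling them with 0.

```python
import pandas as pd

def frame_supervised(values, n_in=1, n_out=1):
    """Frame a sequence with n_in lagged inputs and n_out targets."""
    s = pd.Series(values)
    cols = {}
    # Inputs: t-n_in, ..., t-1 (positive shifts look backwards).
    for i in range(n_in, 0, -1):
        cols[f"t-{i}"] = s.shift(i)
    # Targets: t, t+1, ... (negative shifts look forwards).
    for j in range(n_out):
        cols["t" if j == 0 else f"t+{j}"] = s.shift(-j)
    # Rows missing any input or target are discarded.
    return pd.DataFrame(cols).dropna()

framed = frame_supervised([1.0, 2.0, 3.0, 4.0, 5.0], n_in=2, n_out=2)
print(framed)
```

With five observations, two inputs and two targets, only two complete rows remain, which shows how quickly wider windows eat into a short series.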