LSTM Data Preparation
Before we can fit an LSTM model to the dataset, we must transform the data.
This section is broken down into three steps:
- Transform the time series into a supervised learning problem.
- Transform the time series data so that it is stationary.
- Transform the observations to have a specific scale.
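As a rough preview of the three steps, the sketch below applies each transform to a toy series. It is only an illustration under assumptions: the values are made up, the column labels are hypothetical, and the target range of [-1, 1] is an assumed choice, since the list above only asks for "a specific scale".

```python
import pandas as pd

# Toy series (illustrative values, not the Shampoo Sales data).
series = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])

# 1. Supervised framing: previous observation as input, current one as output.
supervised = pd.concat([series.shift(1), series], axis=1)
supervised.columns = ["input_t-1", "output_t"]  # hypothetical labels

# 2. Stationarity: replace each value with its change from the previous step.
differenced = series.diff().dropna()

# 3. Scaling: min-max rescale to [-1, 1] (the range is an assumption).
low, high = differenced.min(), differenced.max()
scaled = 2 * (differenced - low) / (high - low) - 1

print(supervised)
print(differenced.tolist())
print(scaled.tolist())
```

The three transforms are shown independently here; in a full pipeline they would be chained, and each one would need to be inverted when turning forecasts back into the original units.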
Transform Time Series to Supervised Learning
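The LSTM model in Keras assumes that your data is divided into input (X) and output (y) components.
(Translator's note: "output" is a misleading word here. In this context y is the target value that the model is trained and validated against, not the value the model predicts. This wording trips up many beginners. In the rest of this translation, wherever the author's "output" does not refer to a predicted value, read it as the target value.)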
For a time series problem, we can achieve this by using the observation from the last time step (t-1) as the input and the observation at the current time step (t) as the output.
We can do this with the shift() function in Pandas, which pushes all values in a series down by a specified number of places. We require a shift of 1 place, and the shifted series becomes the input variable. The time series as it stands (that is, unshifted) becomes the output variable.
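To make the effect of shift() concrete, here is a small sketch on a few values. The numbers are the first observations from the dataset, rounded for readability, and the variable names are only illustrative.

```python
import pandas as pd

# A few observations, rounded for illustration.
series = pd.Series([266.0, 145.9, 183.1, 119.3])

# shift(1) pushes every value down one row; the first row becomes NaN.
shifted_one = series.shift(1)

# shift(2) pushes values down two rows; the first two rows become NaN.
shifted_two = series.shift(2)

print(shifted_one.tolist())
print(shifted_two.tolist())
```

Each additional place of shift drops one more observation from the bottom and leaves one more empty slot at the top.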
We can then concatenate these two series to create a DataFrame ready for supervised learning. The pushed-down series gains a new position at the top that has no value, and a NaN (not a number) value is placed there. We will replace these NaN values with 0, which the LSTM model will have to learn to read as "the start of the series" or "I have no data here," since a month with zero sales has not been observed in this dataset.
(Translator's note: the last clause is ambiguous about who has not observed a zero-sales month, the model or the reader. I read it as the reader: the sales data contains no month with zero sales, which is why it is safe to replace NaN with 0, and the model must then learn that 0 marks the start of the series or missing data.)
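The concatenate-then-fill step described above might look like the following minimal sketch. The values are rounded observations and the column labels are hypothetical names added for readability.

```python
import pandas as pd

# A few observations, rounded for illustration.
series = pd.Series([266.0, 145.9, 183.1])

# Put the shifted series (input) next to the unshifted series (output).
frame = pd.concat([series.shift(1), series], axis=1)
frame.columns = ["X(t-1)", "y(t)"]  # hypothetical labels

# Count the gaps left by the shift, then fill them with 0.
nan_before = int(frame.isna().sum().sum())
frame = frame.fillna(0)

print(nan_before)
print(frame)
```

Filling with 0 is only unambiguous because 0 never occurs as a real observation here; on data where 0 is a valid value, dropping the first rows would be a safer choice.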
The code below defines a helper function called timeseries_to_supervised() to do this. It takes a NumPy array of the raw time series data and a lag, which is the number of shifted series to create and use as inputs.
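    from pandas import DataFrame
    from pandas import concat

    # frame a sequence as a supervised learning problem
    def timeseries_to_supervised(data, lag=1):
        df = DataFrame(data)
        # one shifted copy per lag step: t-1, t-2, ..., t-lag
        columns = [df.shift(i) for i in range(1, lag+1)]
        # the unshifted series (time t) goes last as the output column
        columns.append(df)
        df = concat(columns, axis=1)
        # rows at the top have no earlier observation; fill them with 0
        df.fillna(0, inplace=True)
        return df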
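To see what the lag argument does, the helper can be tried on a short list with lag=2. This is a sketch only: the input values are made up, the column labels are hypothetical, and the function is repeated here (returning the filled frame rather than filling in place) so that the block runs on its own.

```python
from pandas import DataFrame, concat

def timeseries_to_supervised(data, lag=1):
    df = DataFrame(data)
    # shifted copies for t-1 ... t-lag, followed by the unshifted series
    columns = [df.shift(i) for i in range(1, lag + 1)]
    columns.append(df)
    df = concat(columns, axis=1)
    return df.fillna(0)

# Toy input; with lag=2 the frame has two input columns and one output column.
supervised = timeseries_to_supervised([1, 2, 3, 4, 5], lag=2)
supervised.columns = ["t-1", "t-2", "t"]  # hypothetical labels
print(supervised)
```

Note the column order: the most recent lag comes first and the unshifted series comes last, so the output is always the final column.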
pandas.DataFrame
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
Parameters:
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects.
    Changed in version 0.23.0: If data is a dict, argument order is maintained for Python 3.6 and later.
index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if no indexing information is part of the input data and no index is provided.
columns : Index or array-like
    Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, ..., n) if no column labels are provided.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input.
Examples for DataFrame()
Constructing DataFrame from a dictionary.
    >>> d = {'col1': [1, 2], 'col2': [3, 4]}
    >>> df = pd.DataFrame(data=d)
    >>> df
       col1  col2
    0     1     3
    1     2     4
Notice that the inferred dtype is int64.
    >>> df.dtypes
    col1    int64
    col2    int64
    dtype: object
To enforce a single dtype:
    >>> df = pd.DataFrame(data=d, dtype=np.int8)
    >>> df.dtypes
    col1    int8
    col2    int8
    dtype: object
Constructing DataFrame from numpy ndarray:
    >>> df2 = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
    ...                    columns=['a', 'b', 'c', 'd', 'e'])
    >>> df2
       a  b  c  d  e
    0  2  8  8  3  4
    1  4  2  9  0  9
    2  1  0  7  8  0
    3  5  1  7  1  3
    4  6  0  2  4  2
pandas.DataFrame.shift
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq.
Parameters:
periods : int
    Number of periods to move, can be positive or negative.
freq : DateOffset, timedelta, or time rule string, optional
    Increment to use from the tseries module or time rule (e.g. 'EOM'). See Notes.
axis : {0 or 'index', 1 or 'columns'}
Returns:
shifted : DataFrame
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
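The note on freq in the shift() reference above is easier to follow with a small comparison. The sketch below uses a made-up daily index and a hypothetical column name "v"; it contrasts shifting the data against a fixed index with shifting the index itself.

```python
import pandas as pd

# Toy frame with a daily DatetimeIndex (illustrative values).
idx = pd.date_range("2020-01-01", periods=3, freq="D")
df = pd.DataFrame({"v": [1, 2, 3]}, index=idx)

# Without freq: the data moves down one row, the index stays put,
# and the first row becomes NaN.
by_position = df.shift(1)

# With freq: the index moves forward one day and the data is preserved.
by_freq = df.shift(1, freq="D")

print(by_position)
print(by_freq)
```

The helper function in this tutorial relies on the first behaviour, since it needs the shifted values to line up row by row with the unshifted series.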
pandas.DataFrame.fillna
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Fill NA/NaN values using the specified method.
Parameters:
value : scalar, dict, Series, or DataFrame
    Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use next valid observation to fill gap.
axis : {0 or 'index', 1 or 'columns'}
inplace : boolean, default False
    If True, fill in place. Note: this will modify any other views on this object (e.g. a no-copy slice for a column in a DataFrame).
limit : int, default None
    If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast : dict, default is None
    A dict of item->dtype of what to downcast if possible, or the string 'infer' which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
Returns:
filled : DataFrame
Examples
    >>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
    ...                    [3, 4, np.nan, 1],
    ...                    [np.nan, np.nan, np.nan, 5],
    ...                    [np.nan, 3, np.nan, 4]],
    ...                   columns=list('ABCD'))
    >>> df
         A    B   C  D
    0  NaN  2.0 NaN  0
    1  3.0  4.0 NaN  1
    2  NaN  NaN NaN  5
    3  NaN  3.0 NaN  4
Replace all NaN elements with 0s.
    >>> df.fillna(0)
         A    B    C  D
    0  0.0  2.0  0.0  0
    1  3.0  4.0  0.0  1
    2  0.0  0.0  0.0  5
    3  0.0  3.0  0.0  4
We can also propagate non-null values forward or backward.
    >>> df.fillna(method='ffill')
         A    B   C  D
    0  NaN  2.0 NaN  0
    1  3.0  4.0 NaN  1
    2  3.0  4.0 NaN  5
    3  3.0  3.0 NaN  4
Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1, 2, and 3 respectively.
    >>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
    >>> df.fillna(value=values)
         A    B    C  D
    0  0.0  2.0  2.0  0
    1  3.0  4.0  2.0  1
    2  0.0  1.0  2.0  5
    3  0.0  3.0  2.0  4
Only replace the first NaN element.
    >>> df.fillna(value=values, limit=1)
         A    B    C  D
    0  0.0  2.0  2.0  0
    1  3.0  4.0  NaN  1
    2  NaN  1.0  NaN  5
    3  NaN  3.0  NaN  4
We can test this function with our loaded Shampoo Sales dataset and convert it into a supervised learning problem.
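    from datetime import datetime
    from pandas import read_csv
    from pandas import DataFrame
    from pandas import concat

    # frame a sequence as a supervised learning problem
    def timeseries_to_supervised(data, lag=1):
        df = DataFrame(data)
        # one shifted copy per lag step: t-1, t-2, ..., t-lag
        columns = [df.shift(i) for i in range(1, lag+1)]
        # the unshifted series (time t) goes last as the output column
        columns.append(df)
        df = concat(columns, axis=1)
        # rows at the top have no earlier observation; fill them with 0
        df.fillna(0, inplace=True)
        return df

    # prefixing '190' turns the stored date string into a '%Y-%m' value
    def parser(x):
        return datetime.strptime('190'+x, '%Y-%m')

    # load dataset as a Series indexed by date
    # (pandas.datetime and read_csv's squeeze argument no longer exist in
    # recent pandas, so the dates are parsed after loading instead)
    series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
    series.index = series.index.map(parser)
    # transform to supervised learning
    X = series.values
    supervised = timeseries_to_supervised(X, 1)
    print(supervised.head())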
Running the example prints the first 5 rows of the new supervised learning problem.
                0           0
    0    0.000000  266.000000
    1  266.000000  145.899994
    2  145.899994  183.100006
    3  183.100006  119.300003
    4  119.300003  180.300003
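Once the data is in this two-column form, splitting it into the input (X) and target (y) components is a matter of slicing columns. The sketch below does that with NumPy on the first rows printed above. The final reshape is an assumption about what comes next rather than part of this section: Keras recurrent layers take three-dimensional input of shape [samples, time steps, features], and the variable names here are only illustrative.

```python
import numpy as np

# First rows of the supervised frame printed above.
supervised_values = np.array([
    [0.0, 266.0],
    [266.0, 145.899994],
    [145.899994, 183.100006],
])

# Every column except the last is input; the last column is the target.
X, y = supervised_values[:, 0:-1], supervised_values[:, -1]

# Assumed next step: reshape the inputs to [samples, time steps, features].
X_3d = X.reshape(X.shape[0], 1, X.shape[1])

print(X.shape, y.shape, X_3d.shape)
```

Slicing with 0:-1 rather than a fixed column index keeps the split correct when the frame was built with a lag greater than 1.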
For more information on transforming a time series problem into a supervised learning problem, see the post: