Paper link: [2310.01728] Time-LLM: Time Series Forecasting by Reprogramming Large Language Models
The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments. This post is closer to personal notes, so read with that in mind.
1. Thoughts
(1) Woke up to find I have already powered through two papers. My New Year holiday doesn't even seem to be over yet, so I might as well get up and read another one.
(2) So ICLR accepts papers with this level of depth too? Isn't it supposed to be a grand math showcase? I can't handle that kind of math at all.
(3) Some conferences have genuinely cute section-heading styles.
You can tell the style is a large initial capital followed by small capitals; even though in LaTeX one surely just types \section{Main Results}, this styling forced all the words into uppercase when I wrote my own paper, to the point that I could barely recognize them.
(4) Having gone through greed, gluttony, envy, sloth, pride, and wrath, I am now completely at peace. fine
2. Close Reading of the Paper
2.1. Abstract
①Large language models (LLMs) are able to handle multiple tasks
②Challenge for LLMs: how to align time series data with natural language
2.2. Introduction
①Advantage of pre-trained models: excellent few-shot and zero-shot transfer performance
②⭐Other models are generally based on statistics, while LLM has reasoning and pattern recognition capabilities
③Challenge: discrete LLM tokens and continuous time series data are hard to align, and LLMs do not natively possess reasoning ability over time series
desiderata n. things that are needed or strongly desired; essentials
2.3. Related Work
①Task-specific learning: design a small model end-to-end for a specific task, as in (a)
②In-modality adaptation: adapt a pre-trained model within the same modality, limited by data scale, as in (b)
③Cross-modality adaptation: the alignment and combination of two modalities can better suit more tasks
2.4. Methodology
①Model architecture:
②Given a sequence of historical observations $\mathbf{X} \in \mathbb{R}^{N \times T}$, where $N$ is the number of series and $T$ is the number of time steps
③Task: find a mapping to forecast the future $H$ time steps $\hat{\mathbf{Y}} \in \mathbb{R}^{N \times H}$
④Aim: minimize the regression loss, i.e. $\frac{1}{H}\sum_{h=1}^{H}\left\| \hat{\mathbf{Y}}_h - \mathbf{Y}_h \right\|_F^2$
⑤Process: apply normalization, patching, and embedding to each series before feeding it to the frozen LLM for prediction
2.4.1. Model Structure
(1)Input Embedding
①Mitigate time series distribution shift by normalizing each input series individually to zero mean and unit standard deviation via reversible instance normalization (RevIN)
②Divide each series into consecutive overlapping or non-overlapping patches of length $L_p$ (a sliding window~). The number of patches is $P = \left\lfloor \frac{T - L_p}{S} \right\rfloor + 2$ and the patched input is $\mathbf{X}_P^{(i)} \in \mathbb{R}^{P \times L_p}$, where $S$ is the horizontal sliding stride
③Embed the patches with a simple linear layer to obtain $\hat{\mathbf{X}}_P^{(i)} \in \mathbb{R}^{P \times d_m}$
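A minimal sketch of this input-embedding step (per-series instance normalization, sliding-window patching with patch length $L_p$ and stride $S$, then a linear patch embedding). The class name `InputEmbedding` and the default sizes are my own illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Sketch: RevIN-style instance normalization + patching + linear patch embedding."""
    def __init__(self, patch_len: int = 16, stride: int = 8, d_model: int = 32):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.embed = nn.Linear(patch_len, d_model)   # per-patch linear embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N series, T time steps)
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True) + 1e-5
        x = (x - mean) / std                          # zero mean, unit std per series
        # pad the end with the last value so the final window is complete
        x = torch.cat([x, x[..., -1:].repeat(1, 1, self.stride)], dim=-1)
        patches = x.unfold(dimension=-1, size=self.patch_len, step=self.stride)
        # patches: (batch, N, P, patch_len) -> embed each patch
        return self.embed(patches)                    # (batch, N, P, d_model)

# usage: 4 series of 96 steps -> P = floor((96 - 16) / 8) + 2 = 12 patches per series
emb = InputEmbedding()(torch.randn(2, 4, 96))
print(emb.shape)  # torch.Size([2, 4, 12, 32])
```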
(2)Patch Reprogramming
①A common way to adapt a pre-trained model to a new modality is to learn a "noise", i.e. a learnable transformation applied to the input
What form does this "noise" take?
The "noise" here is not random interference; it is a learnable input transformation that converts the target data (e.g. time series) into a format that a pre-trained model (an LLM or a vision model) can understand. Such a transformation lets the model handle data from a new modality without updating its parameters.
Examples:
Cross-domain use of a vision model (Misra et al., 2023):
Suppose a pre-trained vision model was built for natural images (e.g. cat/dog classification) and we now want to apply it to medical images (e.g. X-rays). Because medical images and natural images follow different distributions, feeding them in directly degrades performance.
The fix is to learn a "noise" transformation that makes the medical images resemble natural images, e.g. by applying specific filters or color mappings to the X-rays. The transformation is learnable and does not require updating the pre-trained model's parameters. An acoustic model applied to time series (Yang et al., 2021):
Suppose a pre-trained acoustic model was built for speech signals and we now want to apply it to time series data (e.g. stock prices).
Because speech signals and generic time series are structured differently, feeding the series in directly leaves the model unable to interpret them.
The fix is again to learn a "noise" transformation that converts the time series into a speech-like format, e.g. via resampling or a spectral transform so that the series resembles speech in the frequency domain; this transformation is also learnable. In both examples, the "noise" is an input-level adapter that bridges the gap between the source data (what the pre-trained model expects) and the target data (the new task).
②However, time series can neither be directly edited nor described losslessly in natural language, so this kind of transformation does not apply to them directly
What does it mean that time series are "not editable"?
The "non-editability" of time series refers to their inherent continuity and complexity, which make it impossible to convert them losslessly into another modality (such as natural language or images) through a simple transformation. Specifically:
Continuity of time series:
A time series is recorded in temporal order, and every value depends on the preceding ones. This continuity means that editing the series (inserting, deleting, or modifying points) can break its overall structure and semantics. Complexity of time series:
A time series usually mixes multiple patterns (trend, seasonality, noise) whose interactions are complex and hard to capture with simple rules or language. Example:
Take a time series of stock prices:
Converting it directly into a natural-language description (e.g. "the price first rose, then fell") loses a great deal of detail, such as the exact magnitude and frequency of the fluctuations.
Converting it into an image (e.g. a line chart) preserves some information, but an image cannot fully capture the dynamics of the series, such as temporal dependencies.
Therefore, saying that time series are "not editable" means:
They cannot be losslessly converted into another modality with a simple transformation.
They cannot be adapted to a pre-trained model by directly editing the input; a more sophisticated adaptation (such as a learned "noise" transformation) is needed.
③Thus, they reprogram the patches using the pre-trained word embeddings $\mathbf{E} \in \mathbb{R}^{V \times D}$, where $V$ is the vocabulary size. Using the full vocabulary $V$ wastes computational resources, so they map $\mathbf{E}$ to a small set of text prototypes $\mathbf{E}' \in \mathbb{R}^{V' \times D}$ by linear probing, where $V' \ll V$
④Then they obtain the reprogrammed patches by combining the patch embeddings with the text prototypes through a multi-head cross-attention layer: for each head $k$, $Q_k^{(i)} = \hat{\mathbf{X}}_P^{(i)}\mathbf{W}_k^Q$, $K_k^{(i)} = \mathbf{E}'\mathbf{W}_k^K$, $V_k^{(i)} = \mathbf{E}'\mathbf{W}_k^V$, and $\mathbf{Z}_k^{(i)} = \mathrm{SOFTMAX}\!\left(\frac{Q_k^{(i)} {K_k^{(i)}}^{\top}}{\sqrt{d_k}}\right)V_k^{(i)}$; the heads are aggregated into $\mathbf{Z}^{(i)} \in \mathbb{R}^{P \times d_m}$ and linearly projected to $\mathbf{O}^{(i)} \in \mathbb{R}^{P \times D}$ to align with the LLM's embedding space
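A sketch of this reprogramming idea: patch embeddings act as queries, a small set of text prototypes (a linear map over the frozen word-embedding matrix $\mathbf{E}$) acts as keys/values, and cross-attention produces patch representations in the LLM's embedding space. It is slightly simplified (attention is done directly in the LLM dimension $D$ rather than in $d_m$ with a projection afterwards), and the class/argument names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PatchReprogramming(nn.Module):
    """Sketch: reprogram patch embeddings with text prototypes via cross-attention."""
    def __init__(self, word_emb: torch.Tensor, n_proto: int = 100,
                 d_model: int = 32, n_heads: int = 4):
        super().__init__()
        V, D = word_emb.shape                      # frozen vocabulary embeddings E (V x D)
        self.register_buffer("E", word_emb)
        self.proto = nn.Linear(V, n_proto)         # "linear probing": E' has V' << V rows
        self.attn = nn.MultiheadAttention(embed_dim=D, num_heads=n_heads, batch_first=True)
        self.to_llm = nn.Linear(d_model, D)        # lift patch embeddings to the query dim D

    def forward(self, patch_emb: torch.Tensor) -> torch.Tensor:
        # patch_emb: (batch, P, d_model)
        E_prime = self.proto(self.E.t()).t()       # (n_proto, D) text prototypes
        E_prime = E_prime.unsqueeze(0).expand(patch_emb.size(0), -1, -1)
        q = self.to_llm(patch_emb)                 # queries from patches
        out, _ = self.attn(q, E_prime, E_prime)    # cross-attention over the prototypes
        return out                                 # (batch, P, D): aligned with the LLM space

# usage with a toy "vocabulary" of 1000 tokens in a 64-dim embedding space
rep = PatchReprogramming(torch.randn(1000, 64))
print(rep(torch.randn(2, 12, 32)).shape)          # torch.Size([2, 12, 64])
```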
(3)Prompt-as-Prefix
①In Patch-as-Prefix (verbalizing the forecast in natural language), language models have low sensitivity to high-precision numerals, and different language models tokenize and store numerical values differently, which requires extra post-processing:
②Prompt-as-Prefix overcomes these limitations by prepending (1) dataset context, (2) task instruction, and (3) input statistics to the reprogrammed patches:
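A plain-string sketch of how such a prefix prompt could be assembled from the three ingredients above; the exact wording, the `<|start_prompt|>` markers, and the choice of statistics here are illustrative, not the paper's exact template:

```python
import numpy as np

def build_prompt(series: np.ndarray, dataset_desc: str, horizon: int) -> str:
    """Assemble a Prompt-as-Prefix string: dataset context + task instruction + input statistics."""
    trend = "upward" if series[-1] > series[0] else "downward"
    acf = np.correlate(series, series, mode="full")[len(series):]   # autocorrelation, lags 1..T-1
    lags = (np.argsort(-acf)[:5] + 1).tolist()                      # top-5 lags
    return (
        f"<|start_prompt|>Dataset description: {dataset_desc} "
        f"Task description: forecast the next {horizon} steps given the previous "
        f"{len(series)} steps. "
        f"Input statistics: min value {series.min():.3f}, max value {series.max():.3f}, "
        f"median {np.median(series):.3f}, the overall trend is {trend}, "
        f"top-5 autocorrelation lags are {lags}.<|end_prompt|>"
    )

print(build_prompt(np.sin(np.linspace(0, 6, 96)), "Electricity transformer temperature (ETT).", 24))
```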
(4)Output Projection
③Discard the prefixal part of the LLM output, then flatten and linearly project the patch representations to obtain the forecast
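A short sketch of this step, assuming the frozen LLM returns one $D$-dimensional vector per input position: the prompt-prefix positions are dropped, the per-patch outputs are flattened, and a linear head maps them to the $H$-step forecast (which would then be de-normalized with the RevIN statistics). Names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class OutputProjection(nn.Module):
    """Sketch: flatten per-patch LLM outputs and project to the forecast horizon."""
    def __init__(self, n_patches: int, d_llm: int, horizon: int):
        super().__init__()
        self.head = nn.Linear(n_patches * d_llm, horizon)

    def forward(self, llm_out: torch.Tensor, n_prefix: int) -> torch.Tensor:
        # llm_out: (batch, n_prefix + P, D); drop the prompt-prefix positions
        patches = llm_out[:, n_prefix:, :]
        return self.head(patches.flatten(start_dim=1))   # (batch, H)

proj = OutputProjection(n_patches=12, d_llm=64, horizon=24)
print(proj(torch.randn(2, 20 + 12, 64), n_prefix=20).shape)  # torch.Size([2, 24])
```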
2.5. Main Results
①Fair configuration: Llama-7B is used as the default backbone unless otherwise stated
②Baselines: Transformer-based models: PatchTST, ETSformer, Non-Stationary Transformer, FEDformer, Autoformer, Informer, Reformer; other competitive models: GPT4TS, LLMTime, DLinear, TimesNet, LightTS, and, for short-term forecasting, N-HiTS and N-BEATS
2.5.1. Long-term Forecasting
①Datasets: ETTh1, ETTh2, ETTm1, ETTm2, Weather, Electricity (ECL), Traffic, and ILI
②Input time series length: $T = 512$
③Forecasting horizons: $H \in \{96, 192, 336, 720\}$
④Metrics: mean square error (MSE) and mean absolute error (MAE)
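For reference, the two metrics as they are usually computed over an $H$-step forecast (a simple sketch, not code from the paper):

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean((y_true - y_pred) ** 2))    # mean square error

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))   # mean absolute error

y, y_hat = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.8, 3.3])
print(mse(y, y_hat), mae(y, y_hat))  # ~0.0467, 0.2
```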
⑤Performance table:
where only ILI uses forecasting horizons $H \in \{24, 36, 48, 60\}$
2.5.2. Short-term Forecasting
①Performance on M4:
2.5.3. Few-shot Forecasting
①10% few-shot performance:
②5% few-shot performance:
2.5.4. Zero-shot Forecasting
①Performance:
2.5.5. Model Analysis
①Backbone ablation:
②Patch reprogramming:
③Efficiency analysis on different backbones:
2.6. Conclusion and Future Work
~