ECE 498/598 Fall 2024, Homeworks 3 and 4
Remarks:
1. HW3&4: You can reduce the context length to 32 if you are having trouble with the
training time.
2. HW3&4: During test evaluation, note that positional encodings for unseen/longer
contexts are not trained. You should evaluate the model as is; it is OK if it does not
work well.
3. HW3&4: Comments are an important component of the HW grade. You are expected
to explain the experimental findings. If you don’t provide technically meaningful
comments, you might receive a lower score even if your code and experiments are
accurate.
4. The deadline for HW3 is November 11th at 11:59 PM, and the deadline for HW4 is
November 18th at 11:59 PM. For each assignment, please submit both your code and a
PDF report that includes your results (figures) for each question. You can generate the
PDF report from a Jupyter Notebook (.ipynb file) by adding comments in markdown
cells.
The objective of this assignment is to compare the transformer architecture and SSM-type
architectures (specifically Mamba [1]) on the associative recall problem. We provide
example code, recall.ipynb, which contains an example implementation using a 2-layer
transformer. You will adapt this code to incorporate different positional encodings, use
Mamba layers, or modify the dataset generation.
Background: As you recall from class, associative recall (AR) assesses two abilities
of the model: the ability to locate relevant information and the ability to retrieve the
context around that information. The AR task can be understood via the following question:
given the input prompt X = [a 1 b 2 c 3 b], we wish the model to locate where the last token
b occurs earlier and output the associated value Y = 2. This is crucial for memory-related
tasks and bigram retrieval (e.g. ‘Baggins’ should follow ‘Bilbo’).
To proceed, let us formally define the associative recall task we will study in the HW.
Definition 1 (Associative Recall Problem) Let Q be the set of target queries with cardinality
|Q| = k. Consider a discrete input sequence X of the form X = [. . . q v . . . q], where the
query q appears exactly twice in the sequence and the value v follows the first appearance
of q. We say the model f solves AR(k) if f(X) = v for all sequences X with q ∈ Q.
The induction head is a special case of the definition above where the query q is fixed (i.e. Q
is a singleton). The induction head is visualized in Figure 1. At the other extreme, we can ask
the model to solve AR for all queries in the vocabulary.
Problem Setting
Vocabulary: Let [K] = {1, . . . , K} be the token vocabulary. Obtain the embedding of
the vocabulary by randomly generating a K × d matrix V with IID N(0, 1) entries and then
normalizing its rows to unit length. Here d is the embedding dimension. The embedding of
the i-th token is V[i]. Use numpy.random.seed(0) to ensure reproducibility.
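For concreteness, here is a minimal sketch of this vocabulary construction in NumPy (K = 16 and d = 8 are just the Problem 1 defaults; adjust them per experiment):

    import numpy as np

    np.random.seed(0)                  # reproducibility, as required above
    K, d = 16, 8                       # Problem 1 defaults; change per experiment
    V = np.random.randn(K, d)          # K x d matrix with IID N(0, 1) entries
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # normalize each row to unit length
    # The embedding of token i is V[i].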
Experimental variables: For the AR task, Q will simply be the first M elements
of the vocabulary. During experiments, K, d, and M are under our control. Besides these, we
will also vary two other variables:
• Context length: We will train these models up to context length L. However, we
will evaluate with contexts of up to 3L. This tests the generalization of the model to unseen
lengths.
• Delay: In the basic AR problem, the value v immediately follows q. Instead, we will
introduce a delay variable τ such that v appears τ tokens after q; τ = 1 is the standard setting.
Models: The motivation behind this HW is reproducing the results in the Mamba paper.
However, we will also go beyond their evaluations and identify weaknesses of both the
transformer and Mamba architectures. Specifically, we will consider the following models in our
evaluations:
Figure 1: We will work on the associative recall (AR) problem. The AR problem requires the
model to retrieve the value associated with any query, whereas the induction head requires
the same for only a specific query. Thus, the latter is an easier problem. The figure is
taken directly from the Mamba paper [1]. The yellow-shaded regions highlight the focus of
this homework.
• Transformer: We will use the transformer architecture with 2 attention layers (no
MLP). We will try the following positional encodings: (i) learned PE (provided code),
(ii) Rotary PE (RoPE), and (iii) NoPE (no positional encoding).
• Mamba: We will use the Mamba architecture with 2 layers.
• Hybrid Model: We will use an initial Mamba layer followed by an attention layer.
No positional encoding is used.
Hybrid architectures are inspired by the Mamba paper as well as [2], which observes the
benefit of starting the model with a Mamba layer. You may use public GitHub repos to
find implementations (e.g. of the RoPE encoding or the Mamba layer). As a suggestion, you can use
this GitHub Repo for the Mamba model.
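Since the provided recall.ipynb only implements the learned PE, below is a minimal sketch of one common RoPE implementation (the rotate-half variant). The (batch, seq_len, n_heads, head_dim) tensor layout is an assumption; adapt it to the attention code you are using, and apply the rotation to the query and key tensors (not the values) before computing attention scores.

    import torch

    def apply_rope(x, base=10000.0):
        # x: (batch, seq_len, n_heads, head_dim); returns the rotated tensor
        _, seq_len, _, head_dim = x.shape
        half = head_dim // 2
        freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
        pos = torch.arange(seq_len, dtype=torch.float32, device=x.device)
        angles = pos[:, None] * freqs[None, :]          # (seq_len, half)
        cos = torch.cos(angles)[None, :, None, :]       # broadcast over batch and heads
        sin = torch.sin(angles)[None, :, None, :]
        x1, x2 = x[..., :half], x[..., half:]
        # rotate-half form of the rotary positional embedding
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

In a standard attention layer you would call apply_rope on the query and key tensors right before the scaled dot product; NoPE simply skips this step.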
Generating the training dataset: During training, train with minibatch SGD (e.g. with
batch size 64) until satisfactory convergence. Given (K, d, M, L, τ), you can generate the
training sequences for AR as follows (a short sketch of this procedure follows the list):
1. Training sequence length is equal to L.
2. Sample a query q ∈ Q and a value v ∈ [K] uniformly at random, independently. Recall
that the size of Q is |Q| = M.
3. Place q at the end of the sequence and place another q at an index i chosen uniformly
at random from 1 to L − τ.
4. Place the value token at index i + τ.
5. Sample the other tokens IID from [K] \ {q}, i.e. the remaining tokens are drawn uniformly
at random but are never equal to q.
6. Set label token Y = v.
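A minimal NumPy sketch of steps 1–6 (token indices are 0-based here, so Q corresponds to tokens 0, . . . , M − 1; the position of the first q is drawn so that the value at i + τ does not collide with the final q):

    import numpy as np

    def make_ar_sequence(K, M, L, tau, rng):
        q = int(rng.integers(0, M))             # step 2: query from Q (first M tokens)
        v = int(rng.integers(0, K))             # step 2: value from the full vocabulary
        i = int(rng.integers(0, L - tau - 1))   # step 3: index of the first q
        seq = rng.integers(0, K - 1, size=L)    # step 5: fillers from [K] \ {q} ...
        seq[seq >= q] += 1                      # ... remapped so q never appears among them
        seq[i] = q                              # step 3: first occurrence of q
        seq[i + tau] = v                        # step 4: value tau positions after q
        seq[L - 1] = q                          # step 3: repeated q at the end
        return seq, v                           # step 6: label Y = v

    rng = np.random.default_rng(0)
    X, Y = make_ar_sequence(K=16, M=1, L=32, tau=1, rng=rng)

Note that, as written in step 2, v may coincide with q; if you want q to appear exactly twice as in Definition 1, resample v whenever v = q.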
Test evaluation: The test dataset is generated in the same way as above. However, we will evaluate
on all sequence lengths from τ + 1 up to 3L; note that τ + 2 is the shortest possible sequence
length, since both copies of q and the value must fit.
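A possible evaluation loop, reusing the make_ar_sequence sketch above. It assumes the model takes embedded sequences of shape (batch, length, d) and returns logits over [K] for the last position; this interface may differ from the one in recall.ipynb, so adapt accordingly.

    import numpy as np
    import torch

    @torch.no_grad()
    def accuracy_vs_length(model, V, K, M, tau, L, batch_size=64, device="cpu"):
        rng = np.random.default_rng(1)                 # separate seed for test data
        accs = {}
        for length in range(tau + 2, 3 * L + 1):       # tau + 2 is the shortest valid length
            batch = [make_ar_sequence(K, M, length, tau, rng) for _ in range(batch_size)]
            seqs = np.stack([s for s, _ in batch])
            labels = np.array([y for _, y in batch])
            x = torch.tensor(V[seqs], dtype=torch.float32, device=device)  # embed token ids
            logits = model(x)                          # assumed shape: (batch, K)
            preds = logits.argmax(dim=-1).cpu().numpy()
            accs[length] = float((preds == labels).mean())
        return accs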
Empirical evidence from the Mamba paper: Table 2 of [1] demonstrates that Mamba does
a good job on the induction head problem, i.e. AR with a single query. Additionally, Mamba
is the only model that exhibits length generalization; that is, even if you train it up to context
length L, it can still solve AR for context lengths beyond L. On the other hand, since Mamba
is inherently a recurrent model, it may not solve the AR problem in its full generality. This
motivates the question: what are the tradeoffs between Mamba and transformers, and can
hybrid models help improve performance over both?
Your assignments are as follows. For each problem, make sure to return the associated
code. The code can be organized as separate, clearly commented cells in a single Jupyter/Python file.
Grading structure:
• Problem 1 will count as your HW3 grade. This only involves Induction Head
experiments (i.e. M = 1).
• Problems 2 and 3 will count as your HW4 grade.
• You will make a single submission.
Problem 1 (50=25+15+10pts). Set K = 16, d = 8, L = 32 or L = 64.
• Train all models on the induction heads problem (M = 1, τ = 1). After training,
evaluate the test performance and plot the accuracy of all models as a function of
the context length (similar to Table 2 of [1]). In total, you will be plotting 5 curves
(3 Transformers, 1 Mamba, 1 Hybrid). Comment on the findings and compare the
performance of the models including length generalization ability.
• Repeat the experiment above with delay τ = 5. Comment on the impact of delay.
• Which models converge faster during training? Provide a plot of the convergence rate
where the x-axis is the number of iterations and the y-axis is the AR accuracy over a
test batch (a plotting sketch is given below). Make sure to specify the batch size you are using (ideally 32 or 64).
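One way to produce the convergence plot is to record the test-batch accuracy every few SGD iterations while training each model and then plot the curves together. The histories dictionary below is a hypothetical name for whatever logging structure you use.

    import matplotlib.pyplot as plt

    def plot_convergence(histories):
        # histories: {model name: list of (iteration, test-batch accuracy) pairs}
        for name, history in histories.items():
            iters, accs = zip(*history)
            plt.plot(iters, accs, label=name)
        plt.xlabel("training iteration")
        plt.ylabel("AR accuracy on a test batch")
        plt.legend()
        plt.title("Convergence comparison")
        plt.show()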
Problem 2 (30pts). Set K = 16, d = 8, L = 32 or L = 64. We will train Mamba, Transformer
with RoPE, and Hybrid. Set τ = 1 (standard AR).
• Train Mamba models for M = 4, 8, 16. Note that M = 16 is the full AR (retrieve any
query). Comment on the results.
• Train Transformer models for M = 4, 8, 16. Comment on the results and compare
them against Mamba’s behavior.
• Train the Hybrid model for M = 4, 8, 16. Comment and compare.
Problem 3 (20=15+5pts). Set K = 16, d = 64, L = 32 or L = 64. We will only train
Mamba models.
• Set τ = 1 (standard AR). Train Mamba models for M = 4, 8, 16. Compare against the
corresponding results of Problem 2. How does the embedding dimension d impact the results?
• Train a Mamba model with M = 16 and τ = 10. Comment on any differences.
 
