Build Your Own Recurrent Neural Network from Scratch
Preface
In the earlier post 深度学习笔记 | 漫游RNN(循环神经网络), we sketched how to implement an RNN class and which methods it needs. This post walks through the actual code and explains the logic behind the key parts.
Constructor
import random

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

class my_RNN(nn.Module):
    def __init__(self, input_size=1, hidden_size=1, output_size=1, data_size=110, loss=nn.MSELoss()):
        super(my_RNN, self).__init__()
        self.hidden_size = hidden_size
        # Initialize parameters
        self.W_xh = nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.W_hh = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.W_hy = nn.Parameter(torch.randn(hidden_size, output_size) * 0.1)
        self.b_h = nn.Parameter(torch.zeros(hidden_size))
        self.b_y = nn.Parameter(torch.zeros(output_size))
        # nn.Parameter already sets requires_grad=True; this loop just makes it explicit
        for para in self.parameters():
            para.requires_grad_(True)
        self.loss_fn = loss
        self.optimizer = optim.SGD(self.parameters(), lr=0.01)
        # generate data_size samples for training plus an extra 10% held out for testing
        self.sin = self._init_sin(data_size + int(data_size * 0.1))
        self.data_train = self.sin[:data_size]
        self.data_test = self.sin[data_size:]
In the constructor we define the basic parameters of the RNN: the input-to-hidden weight $W_{xh}$, the hidden-to-hidden weight $W_{hh}$, the hidden bias $b_h$, the hidden-to-output weight $W_{hy}$, and the output bias $b_y$.
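To make the shapes concrete, here is a quick check (assuming the imports and class definition above; hidden_size=4 is an arbitrary value chosen for illustration, not the post's default of 1):

model = my_RNN(input_size=1, hidden_size=4, output_size=1)
for name, p in model.named_parameters():
    print(name, tuple(p.shape))
# W_xh (1, 4), W_hh (4, 4), W_hy (4, 1), b_h (4,), b_y (1,)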
The Parameter class is a subclass of Tensor, so it has all of a tensor's properties; in addition, parameters assigned as attributes of a Module are registered automatically and returned by parameters(), which makes it convenient to access all of the model's parameters.
Parameters are Tensor subclasses, that have a very special property when used with Modules - when they're assigned as Module attributes they are automatically added to the list of its parameters, and will appear e.g. in parameters() iterator. Assigning a Tensor doesn't have such effect. This is because one might want to cache some temporary state, like last hidden state of the RNN, in the model. If there was no such class as Parameter, these temporaries would get registered too.
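A small sketch of this behavior (the Demo class below is purely illustrative and not part of the post's code; it reuses the torch / nn imports from the constructor section):

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(2, 2))  # registered as a parameter
        self.cache = torch.zeros(2, 2)            # plain tensor: NOT registered

demo = Demo()
print([name for name, _ in demo.named_parameters()])  # ['w'] -- only the Parameter shows up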
The constructor also calls self._init_sin to generate the required amount of data.
def _init_sin()
def _init_sin(self, n, freq=1.0, amplitude=1.0, noise_std=0.01):
    '''
    Generate sine wave data with optional noise
    '''
    x = np.linspace(0, 2 * np.pi * freq, n)
    data = amplitude * np.sin(x) + np.random.normal(0, noise_std, n)
    return torch.tensor(data, dtype=torch.float32)
It first generates the required number of evenly spaced sample points over one period $x \in [0, 2\pi \cdot f]$, then computes the corresponding sine values and adds white (Gaussian) noise.
TIPS: the original plan was to train the model on the Fibonacci sequence so that it would learn its second-order Markov process. But the Fibonacci sequence grows too fast and easily causes exploding gradients, and normalizing it instead leads to vanishing gradients, so the dataset was switched to samples drawn from a sine wave. This also allows a comparison with the multi-step prediction done with an MLP in Dive into Deep Learning (《动手学深度学习》).
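For reference, a quick sanity check of the generated data with the constructor defaults shown above (data_size=110 plus 10% held out for testing):

model = my_RNN(data_size=110)
print(model.data_train.shape, model.data_test.shape)  # torch.Size([110]) torch.Size([11])
print(model.data_train[:3])  # first few noisy samples, close to sin(x) near x = 0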
def _dataloader()
This method preprocesses the data generated in the constructor:
- it uses a random offset to sample the training sequences
- it slices the data as needed to match the input dimensions the model expects
def _dataloader(self, mode="train", batch_size=1, seq_length=5):
    '''
    generate train data in shape [seq_length, batch_size]
    '''
    # random sampling: start from a random offset each time
    offset = random.randint(0, seq_length)
    if mode == "train":  # only the training split is handled here
        data = self.data_train
    num_tokens = ((len(data) - offset - 1) // batch_size) * batch_size
    # data is already a tensor, so slice and clone instead of re-wrapping with torch.tensor()
    Xs = data[offset: offset + num_tokens].clone()
    Ys = data[offset + 1: offset + num_tokens + 1].clone()
    # reshape to a 2-dim array: [batch_size, num_tokens // batch_size]
    Xs = Xs.reshape(batch_size, -1)
    Ys = Ys.reshape(batch_size, -1)
    num_batches = Xs.shape[1] // seq_length
    for i in range(0, seq_length * num_batches, seq_length):
        x_seq = Xs[:, i: i + seq_length]
        y_seq = Ys[:, i: i + seq_length]
        # transpose so each yielded pair has shape [seq_length, batch_size]
        x_seq = x_seq.T
        y_seq = y_seq.T
        yield x_seq, y_seq
The line num_tokens = ((len(data) - offset - 1) // batch_size) * batch_size is the trickiest part; step by step (a worked numeric example follows this list):
- len(data) - offset - 1: shift the start of the sequence by offset; the "- 1" ensures that every input still has a corresponding target, because the last element is held back as a label only (e.g. when 16 is the input, its expected output 17 is still available)
- the first slice keeps num_tokens elements, so the data can be reshaped into a rectangular matrix
- the reshape puts batch_size on dimension 0
- the second slice, inside the loop, makes every pair yielded by the iterator have shape [seq_length, batch_size]
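As a sanity check, here is a worked example with hypothetical numbers (offset is random at runtime; suppose it happens to be 3):

# len(data) = 110, offset = 3, batch_size = 2, seq_length = 5
# num_tokens = ((110 - 3 - 1) // 2) * 2 = 106
# Xs = data[3:109] -> reshaped to [2, 53]; Ys = data[4:110] -> reshaped to [2, 53]
# num_batches = 53 // 5 = 10, so the loop yields 10 pairs,
# each of shape [seq_length, batch_size] = [5, 2] after the transpose
for X, Y in model._dataloader(mode="train", batch_size=2, seq_length=5):
    print(X.shape, Y.shape)  # torch.Size([5, 2]) torch.Size([5, 2])
    break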
def train()
def train(self, epochs=3, lr=1e-1, batch_size=1, seq_length=5):
    '''
    Train the RNN model
    '''
    # note: the lr argument is not used; the optimizer built in the constructor (lr=0.01) is reused
    optimizer = self.optimizer
    loss_list = []
    for epoch in range(epochs):
        if epoch == 0:
            # initialize the hidden state once, before the first epoch
            h = self._init_state(batch_size)
        total_loss = 0
        # X.shape = [seq_length, batch_size]
        # Y.shape = [seq_length, batch_size]
        for X, Y in self._dataloader(mode="train", batch_size=batch_size, seq_length=seq_length):
            loss = 0
            for t in range(seq_length):
                x_t = X[t].unsqueeze(1)
                y_t = Y[t].unsqueeze(1)
                # use h_next to avoid modifying h in place
                y_hat, h_next = self._forward(x_t, h)
                loss += self.loss_fn(y_hat, y_t)
                h = h_next.detach()  # detach from the graph so gradients do not accumulate across steps
            optimizer.zero_grad()
            loss.backward(retain_graph=True)  # keep the graph to avoid a runtime error
            torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=1.0)
            optimizer.step()
            total_loss += loss.item()
        loss_list.append(total_loss / seq_length)
        print(f'Epoch {epoch + 1}/{epochs}, Loss: {total_loss / seq_length}')
    plt.plot(range(1, epochs + 1, 1), loss_list, label='Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title('Training Loss Over Epochs')
    plt.legend()
    plt.show()
    return h.detach()
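One caveat: the post never shows _init_state, which train() calls before the first epoch. Based on how h is used elsewhere (matmul(h, self.W_hh) in _forward, h_last[0] in test()), a plausible minimal implementation is a zero state of shape [batch_size, hidden_size]:

def _init_state(self, batch_size=1):
    # hypothetical helper, not shown in the original post
    return torch.zeros(batch_size, self.hidden_size)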
- Because the batches within an epoch are taken in order from the (randomly offset) series, the hidden state is reset only once, at the very start of training; it is then carried forward through the sequence so the model can build up long-range dependencies.
- Because backpropagation through time is used, retain_graph must be set to True in backward().
- torch.nn.utils.clip_grad_norm_() is the gradient clipping utility provided by PyTorch; for most RNNs, 1.0 is a good empirical value for max_norm.
torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=2.0, error_if_nonfinite=False, foreach=None)
Clip the gradient norm of an iterable of parameters.
The norm is computed over the norms of the individual gradients of all parameters, as if the norms of the individual gradients were concatenated into a single vector. Gradients are modified in-place.
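A tiny standalone illustration of what clip_grad_norm_ does (the tensor values are made up; the call returns the total norm computed before clipping):

p = nn.Parameter(torch.ones(3))
p.grad = torch.full((3,), 2.0)  # gradient norm = sqrt(3 * 2^2) ≈ 3.46
total = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
print(total)          # tensor(3.4641) -- norm before clipping
print(p.grad.norm())  # tensor(1.0000) -- gradient rescaled down to max_norm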
Judging from the loss curve produced during training, the model learns reasonably well.
def _forward()
def _forward(self, x, h):
    # h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h)
    h_next = torch.tanh(torch.matmul(x, self.W_xh) + torch.matmul(h, self.W_hh) + self.b_h)
    # y_hat = h_t W_hy + b_y; use the updated state h_next so the code matches the formula below
    y_hat = torch.matmul(h_next, self.W_hy) + self.b_y
    return y_hat, h_next
This is a one-to-one translation of the formulas $h_t = f_W\bigl(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h\bigr)$ and $\hat{y} = g\bigl(W_{hy}\,h_t + b_y\bigr)$. The only thing to watch is the order of the input x and the weight W_xh in each matmul: it is dictated by the shape chosen for W_xh in the constructor ([input_size, hidden_size]), so the operands cannot be swapped.
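A quick shape check of why the operand order matters (the sizes are made up for illustration; with the post's defaults of 1 either order would happen to work):

x_t = torch.randn(2, 3)   # [batch_size=2, input_size=3]
W_xh = torch.randn(3, 4)  # [input_size=3, hidden_size=4]
h = torch.randn(2, 4)     # [batch_size=2, hidden_size=4]
W_hh = torch.randn(4, 4)
print((x_t @ W_xh + h @ W_hh).shape)  # torch.Size([2, 4]) -- a valid hidden state
# torch.matmul(W_xh, x_t) would raise a shape-mismatch error: (3, 4) @ (2, 3)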
def test()
def test(self, h_last, seq_length=5):
    '''
    Test the RNN model
    Predict next seq_length numbers
    '''
    with torch.no_grad():
        h = h_last[0].unsqueeze(0)
        Xs = self.data_test[:seq_length].unsqueeze(1)
        Ys = self.data_test[1:seq_length + 1].unsqueeze(1)
        for t in range(seq_length):
            if t == 0:
                # warm up with the first real test point
                x_t = Xs[t]
                y_hat, h = self._forward(x_t, h)
            else:
                # afterwards, feed the previous prediction back in as the next input
                y_hat, h = self._forward(y_hat, h)
            print(f'Target y_t: {Ys[t]} Prediction y_hat: {y_hat}')
This method takes the hidden state h_last returned after the last training epoch and keeps predicting one sequence length into the future. Since the class also stores the test data, the method can be extended to further probe the model's robustness. Note that gradients must be disabled during testing (torch.no_grad()) so that no backpropagation happens in these computations.
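Putting it all together, a minimal end-to-end usage sketch (the hyperparameters here are illustrative, not necessarily the ones used to produce the results below):

model = my_RNN(data_size=110)
h_last = model.train(epochs=100, batch_size=1, seq_length=5)  # also plots the loss curve
model.test(h_last, seq_length=5)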
The final predictions are shown below. On average the predictions are fairly close to the targets, but the variance is rather large. When running this yourself, try changing the parameters and see whether you can get better results.
Target y_t: tensor([-0.5215]) Prediction y_hat: tensor([[0.8490]])
Target y_t: tensor([-0.5417]) Prediction y_hat: tensor([[0.2312]])
Target y_t: tensor([-0.5423]) Prediction y_hat: tensor([[0.5907]])
Target y_t: tensor([-0.5397]) Prediction y_hat: tensor([[0.4353]])
Target y_t: tensor([-0.5430]) Prediction y_hat: tensor([[0.6064]])
Thanks for reading!