These are study notes based on WildML's Recurrent Neural Networks Tutorial.

The example in the original tutorial

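The original tutorial sets out to implement a recurrent neural network that discovers the patterns in which words appear in natural-language sentences, so that it can eventually generate plausible sentences.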

  • Data source
    The original tutorial downloads a large number of (English) sentences from the web.

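  • Data preprocessing
    First, all words (including punctuation marks) are counted.
    The 7997 most common words are kept and numbered, so each word has a token.
    Three special tokens are added, giving a vocabulary of 8000 tokens:
    UNKNOWN_TOKEN: stands for any word that is not in the 8000-entry vocabulary.
    SENTENCE_START: marks the start of a sentence.
    SENTENCE_END: marks the end of a sentence.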
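The preprocessing above can be sketched roughly as follows. This is a minimal illustration, not the tutorial's code: the toy corpus, the small `VOCAB_SIZE`, and the index chosen for `UNKNOWN_TOKEN` are assumptions made for the example (only SENTENCE_START = 0 and SENTENCE_END = 1 can be read off the sample sentence in these notes).

```python
from collections import Counter

# Hypothetical toy corpus, already split into words and punctuation.
sentences = [
    ["what", "are", "you", "doing", "?"],
    ["what", "are", "n't", "you", "understanding", "?"],
]

VOCAB_SIZE = 8  # stands in for 8000 in the notes
# Assumed ordering: START = 0 and END = 1 match the sample; UNKNOWN = 2 is a guess.
SPECIAL = ["SENTENCE_START", "SENTENCE_END", "UNKNOWN_TOKEN"]

# Count every word, then keep the most frequent ones,
# leaving room for the three special tokens.
counts = Counter(word for sentence in sentences for word in sentence)
common = [word for word, _ in counts.most_common(VOCAB_SIZE - len(SPECIAL))]

index_to_word = SPECIAL + common
word_to_index = {word: i for i, word in enumerate(index_to_word)}


def encode(sentence):
    """Wrap a sentence in start/end markers and map each word to its token."""
    unknown = word_to_index["UNKNOWN_TOKEN"]
    words = ["SENTENCE_START"] + sentence + ["SENTENCE_END"]
    return [word_to_index.get(word, unknown) for word in words]


# "zebras" is not in the vocabulary, so it maps to UNKNOWN_TOKEN.
ids = encode(["what", "are", "zebras", "?"])
```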

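  • Input and output
    Each input \(x_t\) has dimension 8000. This is the vocabulary size, not a limit on sentence length: a sentence is a sequence of such vectors, one per word.
    Each output also has dimension 8000, and there is one output for every input position.
    Below is an actual example of a sentence after construction: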

x:
SENTENCE_START what are n't you understanding about this ? !
[0, 51, 27, 16, 10, 856, 53, 25, 34, 69]
y:
what are n't you understanding about this ? ! SENTENCE_END
[51, 27, 16, 10, 856, 53, 25, 34, 69, 1]

Interpretation: the n-th element of y is the expected output after the network has read the first n elements of x.

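Each input \(x_t\), although it has 8000 dimensions, has exactly one non-zero entry, which is 1 at the index given by the token of the \(t\)-th word.
For example, the token of "what" is 51, so \(x_t\) has a 1 at index 51 and 0 everywhere else.
This is called a one-hot vector.
Output: the probability of each token.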
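As a small sketch of the two ideas above (y is x shifted by one position, and each token becomes a one-hot vector), using the token ids from the sample sentence. The helper name `one_hot` is made up for this illustration.

```python
import numpy as np

VOCAB_SIZE = 8000

# Token ids of the sample sentence, including SENTENCE_START (0) and SENTENCE_END (1).
tokens = [0, 51, 27, 16, 10, 856, 53, 25, 34, 69, 1]

x = tokens[:-1]  # everything except SENTENCE_END
y = tokens[1:]   # the same sequence shifted left by one: the expected next word


def one_hot(token_id, size=VOCAB_SIZE):
    """Return a vector of zeros with a single 1 at position token_id."""
    vector = np.zeros(size)
    vector[token_id] = 1.0
    return vector


x_1 = one_hot(x[1])  # the one-hot vector for "what" (token 51)
```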

The state has dimension 100.

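  • Prediction formulas and dimensions
\[s_t = \tanh(x_tU + s_{t-1}W) \\
o_t = softmax(s_tV) \\
\text{where} \\
x_t.dimension = 8000 \\
o_t.dimension = 8000 \\
s_t.dimension = 100 \\
U.dimension = 8000 * 100 : x_tU \text{ is a 100 dimension vector} \\
W.dimension = 100 * 100 : s_{t-1}W \text{ is a 100 dimension vector} \\
V.dimension = 100 * 8000 : s_tV \text{ is an 8000 dimension vector}
\]
These shapes follow the row-vector convention \(x_tU\) used in these notes. The original tutorial writes \(Ux_t\) with column vectors, so its \(U\) and \(V\) have the transposed shapes (100 * 8000 and 8000 * 100).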
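A single prediction step can be sketched in NumPy as below. This assumes the row-vector convention of the formulas (so U is stored as 8000 x 100 and V as 100 x 8000); the function name `step` and the small random weights are placeholders for illustration only.

```python
import numpy as np


def softmax(z):
    # Subtracting the maximum avoids overflow and does not change the result.
    e = np.exp(z - z.max())
    return e / e.sum()


def step(token_id, s_prev, U, V, W):
    """Compute s_t and o_t for one word, given the previous state."""
    # x_t is one-hot, so the product x_t U is just row `token_id` of U.
    s_t = np.tanh(U[token_id] + s_prev @ W)
    o_t = softmax(s_t @ V)
    return s_t, o_t


rng = np.random.default_rng(0)
vocab, hidden = 8000, 100
U = rng.uniform(-0.01, 0.01, (vocab, hidden))   # placeholder weights
W = rng.uniform(-0.1, 0.1, (hidden, hidden))
V = rng.uniform(-0.1, 0.1, (hidden, vocab))

s_0, o_0 = step(0, np.zeros(hidden), U, V, W)   # feed SENTENCE_START
s_1, o_1 = step(51, s_0, U, V, W)               # then "what"
```

Each `o_t` is a probability distribution over the 8000 tokens, so its entries sum to 1.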

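  • Initializing U, V, W
    Initialization matters, and the appropriate scheme depends on the activation function (tanh here).
    Each element of U, V, W is a random number in the interval \(\left [ -\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}} \right ]\), where \(n\) is the number of incoming connections from the previous layer.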
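A minimal sketch of this initialization, assuming the interval \([-1/\sqrt{n}, 1/\sqrt{n}]\) recommended in the original tutorial and the row-vector shapes used in these notes. The function name `init_params` and the seed are illustrative choices.

```python
import numpy as np


def init_params(vocab_size=8000, hidden_size=100, seed=0):
    """Draw U, V, W uniformly from [-1/sqrt(n), 1/sqrt(n)]."""
    rng = np.random.default_rng(seed)

    def uniform(n_in, shape):
        # n_in is the number of incoming connections for this weight matrix.
        bound = 1.0 / np.sqrt(n_in)
        return rng.uniform(-bound, bound, shape)

    U = uniform(vocab_size, (vocab_size, hidden_size))    # input  -> state
    W = uniform(hidden_size, (hidden_size, hidden_size))  # state  -> state
    V = uniform(hidden_size, (hidden_size, vocab_size))   # state  -> output
    return U, V, W


U, V, W = init_params()
```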

Recurrent neural network training process

[Figure: Recurrent Neural Network - Training Process. Prepare Data → Initialize Model {U, V, W} → Forward Propagation (input x) → Calculate Loss (from y') → Back Propagation Through Time (from L, the cross-entropy loss) → Gradient Descent (from {ΔL/ΔU, ΔL/ΔV, ΔL/ΔW}) → iterate with the updated {U, V, W} back to Forward Propagation, or finish with Result: {U, V, W}.]

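Back Propagation Through Time (BPTT)

The training process:

  1. Forward propagation - using the prediction formulas and the initial \(V,U,W\), compute the result \(\hat{y}\)
  2. Compute the loss - from the computed result \(\hat{y}\) and the expected result \(y\), the cross entropy loss gives the loss \(L\)
  3. Back propagation - from \(L\) and the other known values, compute the partial derivatives \({\partial{L} \over \partial{U}}, {\partial{L} \over \partial{V}}, {\partial{L} \over \partial{W}}\)
  4. Gradient descent - from the partial derivatives, Stochastic Gradient Descent produces updated \(V,U,W\)

As the steps above show, the back propagation algorithm is the key to training (the computations in the other steps are already known).
The purpose of back propagation is to compute the partial derivatives of the loss with respect to the weights of the prediction formulas.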
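The four steps above can be put together in a compact NumPy sketch. This is an illustration under assumptions, not the tutorial's implementation: it uses the row-vector convention of these notes, a tiny vocabulary, a single made-up training sentence, and a fixed learning rate. The `bptt` function accumulates the gradients of the total loss over all time steps.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def forward(x, U, V, W):
    # One extra row so that s[-1] is the all-zero initial state.
    s = np.zeros((len(x) + 1, W.shape[0]))
    y_hat = np.zeros((len(x), V.shape[1]))
    for t in range(len(x)):
        s[t] = np.tanh(U[x[t]] + s[t - 1] @ W)  # x_t is one-hot: x_t U = U[x[t]]
        y_hat[t] = softmax(s[t] @ V)
    return s, y_hat


def total_loss(x, y, U, V, W):
    _, y_hat = forward(x, U, V, W)
    return -np.sum(np.log(y_hat[np.arange(len(y)), y]))


def bptt(x, y, U, V, W):
    s, y_hat = forward(x, U, V, W)
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    dz = y_hat.copy()
    dz[np.arange(len(y)), y] -= 1.0       # dL_t/dz_t = y_hat_t - y_t
    ds_next = np.zeros(W.shape[0])        # gradient arriving from step t + 1
    for t in reversed(range(len(x))):
        dV += np.outer(s[t], dz[t])
        ds = dz[t] @ V.T + ds_next        # total gradient reaching s_t
        da = ds * (1.0 - s[t] ** 2)       # back through tanh
        dU[x[t]] += da
        dW += np.outer(s[t - 1], da)
        ds_next = da @ W.T                # pass the gradient on to s_{t-1}
    return dU, dV, dW


def train(x, y, U, V, W, learning_rate=0.1, epochs=50):
    for _ in range(epochs):
        dU, dV, dW = bptt(x, y, U, V, W)
        U -= learning_rate * dU           # in-place gradient descent updates
        V -= learning_rate * dV
        W -= learning_rate * dW


rng = np.random.default_rng(0)
vocab, hidden = 10, 8                     # toy sizes instead of 8000 and 100
U = rng.uniform(-1, 1, (vocab, hidden)) / np.sqrt(vocab)
W = rng.uniform(-1, 1, (hidden, hidden)) / np.sqrt(hidden)
V = rng.uniform(-1, 1, (hidden, vocab)) / np.sqrt(hidden)

x, y = [0, 3, 4, 5], [3, 4, 5, 1]         # hypothetical token ids
loss_before = total_loss(x, y, U, V, W)
train(x, y, U, V, W)
loss_after = total_loss(x, y, U, V, W)
```

On this toy sentence the loss should drop as the weights are updated; the rest of these notes derive the gradient formulas that `bptt` relies on.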

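Derivatives of the activation functions

For proofs of the derivatives of the activation functions and the loss function, see:
Neural Network Study Notes - Activation Functions: Purpose, Definition, and Derivative Proofs
Neural Network Study Notes - Loss Functions: Definition and Derivative Proofs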

  • sigmoid

The sigmoid function and its derivative

\[\sigma(x) = \frac{1}{1 + e^{-x}} \\
\sigma'(x) = (1 - \sigma(x))\sigma(x)
\]

  • tanh

The tanh function and its derivative

\[\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1} \\
\tanh'(x) = 1 - \tanh^2(x)
\]
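As a quick sanity check of the sigmoid and tanh derivative formulas, they can be compared with a central finite difference. The helper `numeric_grad` and the sample points are choices made for this sketch.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return (1.0 - s) * s


def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x = np.linspace(-3.0, 3.0, 13)
sigmoid_error = np.max(np.abs(numeric_grad(sigmoid, x) - sigmoid_grad(x)))
tanh_error = np.max(np.abs(numeric_grad(np.tanh, x) - tanh_grad(x)))
```

Both errors should be tiny, limited only by floating-point rounding.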

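  • softmax
    The softmax activation function is used together with the loss function.
    The activation function takes a vector of scores (one per class) and computes a probability in (0, 1) for each class; the probabilities sum to 1.
    The loss function takes the softmax result \(\hat{y}\) and the expected result \(y\), and the cross entropy loss gives the loss \(L\)

The softmax function and its derivative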

\[\text{softmax:} \\
\hat{y_{t_i}} = softmax(z_t)_i = \frac{e^{z_{t_i}}}{\sum_{k}e^{z_{t_k}}} \\
\hat{y_t} = softmax(z_t) = \begin{bmatrix}
\cdots &
\frac{e^{z_{t_i}}}{\sum_{k}e^{z_{t_k}}} &
\cdots
\end{bmatrix} \\
\\
softmax'(z_t) = \frac{\partial{\hat{y_{t_i}}}}{\partial{z_{t_j}}} =
\begin{cases}
\hat{y_{t_i}}(1 - \hat{y_{t_i}}), & \text{if } i = j \\
-\hat{y_{t_i}} \hat{y_{t_j}}, & \text{if } i \ne j
\end{cases} \\
\text{where } z_t = s_tV \text{ is the input to softmax}
\]
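The two cases of the softmax derivative can be written as one Jacobian matrix, `diag(y) - outer(y, y)`. The sketch below builds it and compares it with finite differences; the input vector is arbitrary.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def softmax_jacobian(z):
    """J[i, j] = d y_i / d z_j = y_i * (1 - y_i) if i == j else -y_i * y_j."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)


z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)

# Column j of the numerical Jacobian is the change in y when z_j is nudged.
eps = 1e-6
J_numeric = np.column_stack([
    (softmax(z + eps * direction) - softmax(z - eps * direction)) / (2 * eps)
    for direction in np.eye(len(z))
])
```

Because the outputs always sum to 1, every column of the Jacobian sums to 0.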

  • Loss function (cross entropy loss)

The cross entropy loss function

\[L_t(y_t, \hat{y_t}) = - y_t \log \hat{y_t} \\
L(y, \hat{y}) = - \sum_{t} y_t \log \hat{y_t} \\
\frac{ \partial L_t } { \partial z_t } = \hat{y_t} - y_t \\
\text{where} \\
z_t = s_tV \\
\hat{y_t} = softmax(z_t) \\
y_t \text{ : the expected result at time t for training input x, taken from the training data}
\]
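The result \(\partial L_t / \partial z_t = \hat{y_t} - y_t\) is what makes softmax and cross entropy convenient to use together. A small numerical check, with an arbitrary score vector and target class chosen for illustration:

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def cross_entropy(z, target):
    """L_t = -y_t . log(softmax(z_t)) for a one-hot y_t with a 1 at `target`."""
    return -np.log(softmax(z)[target])


z = np.array([0.2, -1.0, 1.5, 0.3])
target = 2
y = np.zeros_like(z)
y[target] = 1.0

analytic = softmax(z) - y  # y_hat_t - y_t

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * direction, target)
     - cross_entropy(z - eps * direction, target)) / (2 * eps)
    for direction in np.eye(len(z))
])
```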

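Formulas used during training

Prediction formulas

The prediction formulas are the same as before. To make the back propagation calculations easier, we write them as: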

\[s_t = \tanh(x_tU + s_{t-1}W) \\
z_t = s_tV \\
\hat{y_t} = softmax(z_t) \\
where \\
s_{-1} = [0 \cdots 0]
\]

Loss function

\[L_t(y_t, \hat{y_t}) = - y_t \log \hat{y_t} \\
L(y, \hat{y}) = - \sum_{t} y_t \log \hat{y_t} \\
\text{where} \\
y_t \text{ : the expected result at time t for training input x, taken from the training data}
\]

Stochastic Gradient Descent update

\[W_{new} = W - s * dW \\
\text{where} \\
s \text{ : step size, learning rate, a value between } (0, 1) \\
dW = \frac{\partial L}{\partial W} \text{ : the gradient of the loss with respect to W}
\]

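Note: the update rule is the same for \(U,V,W\).

The learning rate is sometimes adjusted according to how the loss changes. For example, if the loss has increased, the previous learning rate was too large, so it can be reduced to one tenth of its previous value.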
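The update rule and the learning-rate adjustment can be sketched on a toy problem. Here the "loss" is simply \(w^2\) (gradient \(2w\)), standing in for the RNN loss; the starting value, the deliberately too-large learning rate, and the function names are assumptions for the example.

```python
def sgd_update(w, dw, learning_rate):
    """W_new = W - s * dW."""
    return w - learning_rate * dw


def adjust_learning_rate(learning_rate, previous_loss, loss, factor=0.1):
    """Cut the learning rate to one tenth whenever the loss went up."""
    return learning_rate * factor if loss > previous_loss else learning_rate


w = 3.0
learning_rate = 1.2           # too large on purpose: the first step overshoots
initial_loss = w ** 2
previous_loss = initial_loss

for _ in range(20):
    w = sgd_update(w, 2 * w, learning_rate)
    loss = w ** 2
    learning_rate = adjust_learning_rate(learning_rate, previous_loss, loss)
    previous_loss = loss

final_loss = previous_loss
```

The first step increases the loss, which triggers the reduction; after that the smaller learning rate converges steadily.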

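Computing the partial derivative with respect to V

All that remains is to compute the partial derivatives with respect to \(U,V,W\).

Formula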

\[\frac{\partial L_t}{\partial V} = (\hat{y_t} - y_t) \otimes s_t
\]

Proof

\[\begin{align}
\frac{\partial L_t}{\partial V}
& = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial V} \\
& = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} \frac{\partial z_t}{\partial V} \\
& = \frac{\partial L_t}{\partial z_t} \frac{\partial z_t}{\partial V} \\
& \because \frac{\partial L_t}{\partial z_t} = (\hat{y_t} - y_t) \text{ : see cross entropy loss differential.} \\
& \because \frac{\partial z_t}{\partial V} = \frac{\partial (s_tV)}{\partial V} = s_t \\
& = (\hat{y_t} - y_t) \otimes s_t
\end{align}
\]
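The gradient with respect to V can be checked numerically for a single time step. One assumption to note: with the row-vector convention \(z_t = s_tV\), the outer product is arranged as `np.outer(s_t, y_hat - y)` so that it has the same shape as V (state size by vocabulary size); with column vectors the two factors swap. The sizes and random values are illustrative.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def loss_t(s_t, V, target):
    return -np.log(softmax(s_t @ V)[target])


def grad_V(s_t, V, target):
    """dL_t/dV as the outer product of s_t and (y_hat_t - y_t)."""
    dz = softmax(s_t @ V)
    dz[target] -= 1.0            # y_hat_t - y_t for a one-hot y_t
    return np.outer(s_t, dz)


rng = np.random.default_rng(0)
hidden, vocab, target = 3, 5, 2
s_t = np.tanh(rng.normal(size=hidden))
V = rng.normal(size=(hidden, vocab))

analytic = grad_V(s_t, V, target)

# Finite-difference gradient, one entry of V at a time.
eps = 1e-6
numeric = np.zeros_like(V)
for i in range(hidden):
    for j in range(vocab):
        nudge = np.zeros_like(V)
        nudge[i, j] = eps
        numeric[i, j] = (loss_t(s_t, V + nudge, target)
                         - loss_t(s_t, V - nudge, target)) / (2 * eps)
```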

Computing the partial derivative with respect to W

Formula

\[\frac{\partial L_t}{\partial W}
= (\hat{y_t} - y_t) V (1 - s_t^2) \left ( s_{t-1} + W \frac{\partial (s_{t-1})}{\partial W} \right ) \\
\frac{\partial s_t}{\partial W}
= (1 - s_t^2) \left ( s_{t-1} + W \frac{\partial (s_{t-1})}{\partial W} \right )
\]

Proof
Before computing the partial derivative of \(L_t\) with respect to \(W\), we need some auxiliary calculations.

\[\begin{align}
\frac{\partial s_t}{\partial W}
& = \frac{\partial (tanh(x_tU + s_{t-1}W))}{\partial W} \\
& \because \text{tanh differentiation formula and the chain rule of differentiation} \\
& = (1 - s_t^2) \frac{\partial (x_tU + s_{t-1}W)}{\partial W} \\
& \because \text{sum rule of differentiation} \\
& = (1 - s_t^2) \frac{\partial (s_{t-1}W)}{\partial W} \\
& \because \text{product rule of differentiation} \\
& = (1 - s_t^2) \left ( \frac{\partial (s_{t-1})}{\partial W}W + s_{t-1}\frac{\partial W}{\partial W} \right ) \\
& = (1 - s_t^2) \left ( s_{t-1} + W \frac{\partial (s_{t-1})}{\partial W} \right ) \\
\end{align} \\
\because s_{t-1} \text{ is itself a function of W, so the product rule leaves a recursive term that must be followed back through every earlier step.}
\]

\[\begin{align}
\frac{\partial z_t}{\partial s_t}
& = \frac{\partial (s_tV )}{\partial s_t} \\
& = V
\end{align}
\]

\[\begin{align}
\frac{\partial L_t}{\partial W}
& = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial W} \\
& = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} \frac{\partial z_t}{\partial W} \\
& = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} \frac{\partial z_t}{\partial s_t} \frac{\partial s_t}{\partial W} \\
& = \frac{\partial L_t}{\partial z_t} \frac{\partial z_t}{\partial s_t} \frac{\partial s_t}{\partial W} \\
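& = (\hat{y_t} - y_t) V \frac{\partial s_t}{\partial W} \\
& \because \text{unrolling the recursion for } \frac{\partial s_t}{\partial W} \text{ back to } s_{-1} = 0 \\
& = (\hat{y_t} - y_t) V \sum_{k=0}^{t} \left ( \prod_{j=k+1}^{t} (1 - s_j^2) W \right ) (1 - s_k^2) s_{k-1} \\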
\end{align}
\]
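The recursion for \(\partial s_t / \partial W\) is easiest to verify in the scalar case, where the state, the input weight, and the recurrent weight are all single numbers and the formula holds literally. The sketch below uses made-up inputs and weights (lowercase `u`, `w` denote the scalar stand-ins for U and W) and compares the recursion with a finite difference.

```python
import numpy as np


def states(xs, u, w):
    """Scalar RNN: s_t = tanh(x_t * u + s_{t-1} * w), starting from s_{-1} = 0."""
    s_prev, result = 0.0, []
    for x in xs:
        s_prev = np.tanh(x * u + s_prev * w)
        result.append(s_prev)
    return result


def ds_dw(xs, u, w):
    """ds_t/dw = (1 - s_t^2) * (s_{t-1} + w * ds_{t-1}/dw), for every t."""
    s_prev, d_prev, result = 0.0, 0.0, []
    for x in xs:
        s = np.tanh(x * u + s_prev * w)
        d = (1.0 - s ** 2) * (s_prev + w * d_prev)
        result.append(d)
        s_prev, d_prev = s, d
    return result


xs = [1.0, -0.5, 0.3, 0.8]    # hypothetical scalar inputs
u, w = 0.7, -0.4              # hypothetical scalar weights

analytic = np.array(ds_dw(xs, u, w))

eps = 1e-6
numeric = (np.array(states(xs, u, w + eps))
           - np.array(states(xs, u, w - eps))) / (2 * eps)
```

The first entry is 0 because \(s_0\) depends only on the constant initial state, not on the recurrent weight.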

Computing the partial derivative with respect to U

Formula

\[\frac{\partial L_t}{\partial U}
= (\hat{y_t} - y_t) V (1 - s_t^2) \left( x_t + W \frac{\partial s_{t-1}}{\partial U} \right ) \\
\frac{\partial s_t}{\partial U}
= (1 - s_t^2) \left( x_t + W \frac{\partial s_{t-1}}{\partial U} \right )
\]

Proof

\[\begin{align}
\frac{\partial s_t}{\partial U}
& = \frac{\partial (tanh(x_tU + s_{t-1}W))}{\partial U} \\
& = (1 - s_t^2) (x_t + \frac{\partial (s_{t-1}W)}{\partial U}) \\
& = (1 - s_t^2) (x_t + W \frac{\partial s_{t-1}}{\partial U}) \\
\end{align} \\
\because s_{t-1} \text{ is itself a function of U, so the chain rule must be followed back through every earlier step.}
\]

\[\begin{align}
\frac{\partial L_t}{\partial U}
& = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial U} \\
& = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} \frac{\partial z_t}{\partial U} \\
& = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} \frac{\partial z_t}{\partial s_t} \frac{\partial s_t}{\partial U} \\
& = \frac{\partial L_t}{\partial z_t} \frac{\partial z_t}{\partial s_t} \frac{\partial s_t}{\partial U} \\
& = (\hat{y_t} - y_t) V \frac{\partial s_t}{\partial U} \\
\end{align}
\]
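To see the U formula at work with one-hot inputs, the sketch below uses a deliberately simplified network with a state of size 1, so that \(U\) and \(V\) are plain vectors, the recurrent weight `w` is a scalar, and the formula applies without any matrix bookkeeping. All sizes, values, and function names are assumptions for illustration; only the loss at the last time step is considered.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def last_step_loss(x, y_last, U, V, w):
    """Run the size-1 RNN over the tokens x and return L_t at the last step."""
    s = 0.0
    for token in x:
        s = np.tanh(U[token] + s * w)      # x_t U picks entry `token` of U
    return -np.log(softmax(s * V)[y_last])


def grad_U(x, y_last, U, V, w):
    """dL_t/dU = ((y_hat_t - y_t) . V) * ds_t/dU, with ds_t/dU built recursively."""
    s, ds_dU = 0.0, np.zeros_like(U)
    for token in x:
        s = np.tanh(U[token] + s * w)
        x_t = np.zeros_like(U)
        x_t[token] = 1.0                   # one-hot input
        ds_dU = (1.0 - s ** 2) * (x_t + w * ds_dU)
    dz = softmax(s * V)
    dz[y_last] -= 1.0                      # y_hat_t - y_t
    return (dz @ V) * ds_dU


rng = np.random.default_rng(1)
vocab = 6
U = rng.normal(size=vocab)
V = rng.normal(size=vocab)
w = 0.5
x, y_last = [0, 3, 2, 3], 1                # hypothetical token ids

analytic = grad_U(x, y_last, U, V, w)

eps = 1e-6
numeric = np.zeros_like(U)
for i in range(vocab):
    nudge = np.zeros_like(U)
    nudge[i] = eps
    numeric[i] = (last_step_loss(x, y_last, U + nudge, V, w)
                  - last_step_loss(x, y_last, U - nudge, V, w)) / (2 * eps)
```

Entries of U for tokens that never appear in `x` get a gradient of exactly 0, since those entries never enter the computation.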

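Vanishing Gradients Problem

Suddenly it feels as if all this effort has come to nothing.
RNNs have a Vanishing Gradients Problem, which I have not studied closely. The main cause is the tanh activation function: when a state saturates, \((1 - s_t^2)\) approaches 0, and because back propagation multiplies one such factor for every time step, the gradient coming from distant steps shrinks toward zero, so long-range dependencies are hard to learn.
Using the ReLU activation function can mitigate this problem.
LSTM and GRU provide another solution.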
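The shrinking effect can be observed directly by multiplying the per-step Jacobians \(\partial s_t / \partial s_{t-1} = diag(1 - s_t^2) W^T\) (row-vector convention) and watching the norm of the product. This is only a rough sketch: the random stand-in for \(x_tU\), the state size, and the number of steps are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps = 100, 30
bound = 1.0 / np.sqrt(hidden)
W = rng.uniform(-bound, bound, (hidden, hidden))

# Forward pass with random stand-ins for the input term x_t U.
s = np.zeros(hidden)
states = []
for _ in range(steps):
    s = np.tanh(rng.normal(size=hidden) + s @ W)
    states.append(s)

# Walk backwards from the last state, accumulating d s_T / d s_{T-k}.
jacobian = np.eye(hidden)
norms = []
for s_t in reversed(states):
    step_jacobian = np.diag(1.0 - s_t ** 2) @ W.T   # d s_t / d s_{t-1}
    jacobian = jacobian @ step_jacobian
    norms.append(np.linalg.norm(jacobian))
```

`norms[k - 1]` measures how strongly the final state still depends on the state k steps earlier; with these settings it falls by several orders of magnitude over the 30 steps.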

Next post

Neural Network Study Notes 04 - Recurrent Neural Network Algorithm Explained

References