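Deep Learning, "Improving Deep Neural Networks", Week 2: Optimization Algorithms (Andrew Ng)

April 11, 2023, 2:44 AM • Deep Learning

Quite a few people recommend the article "An overview of gradient descent optimization algorithms", so it is listed here at the top; it is worth reading when you have the time.

This post is based mainly on Andrew Ng's Coursera course "Optimizing Deep Neural Networks". It also includes some comparisons between optimization algorithms; their sources are not cited individually, and the full list of references is given at the end of the post.

The optimization algorithms covered in this post are: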

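Mini-batch gradient descent
Exponentially weighted averages
Gradient descent with momentum
RMSprop
Adam

All of these are refinements of gradient descent. For each one, the post covers the basic principle, how to use it in TensorFlow, and a summary of its pros and cons; the final sections cover learning rate decay and the problem of local optima.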

1. Mini-batch gradient descent
2. Exponentially weighted averages
3. Gradient descent with momentum
4. RMSprop
- 4.1 Pseudocode
- 4.2 RMSprop in TensorFlow
  - 4.2.1 Building the optimizer
  - 4.2.2 tf.train.RMSPropOptimizer()
5. Adam optimization algorithm
- 5.1 The Adam algorithm: pseudocode
- 5.2 Adam in TensorFlow
  - 5.2.1 Building the optimizer
  - 5.2.2 tf.train.AdamOptimizer
6. Pros and cons of the optimization algorithms
7. Learning rate decay
- 7.1 Ways to decay the learning rate
- 7.2 Setting the learning rate in TensorFlow
8. The problem of local optima
Reference

1. Mini-batch gradient descent

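When the training set is not too large, batch gradient descent is the usual choice: the entire training set is fed through the network for each gradient descent step. In practice, though, training sets are often very large, for example millions of samples, and batch gradient descent then becomes very slow.

Running each gradient descent step on only a subset of the training set instead gives mini-batch gradient descent.

1.1 How the algorithm works

The full training set is \(X=[x^1, x^2, \cdots, x^m] \in R^{n \times m}\), with labels \(Y=[y^1, y^2, \cdots, y^m] \in R^{1 \times m}\).

Assume:

\(m=5000000\) and each mini-batch contains 1000 samples, so \(X^{\{t\}} \in R^{n \times 1000},Y^{\{t\}} \in R^{1 \times 1000}, t=1, 2, \cdots, 5000\).

Here \(x^i\) is the \(i\)-th sample, \(Z^{[l]}\) is the linear output of layer \(l\) of the network, and \(X^{\{t\}}, Y^{\{t\}}\) is the \(t\)-th mini-batch.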
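A minimal sketch of how a training set might be cut into mini-batches. The helper name `make_mini_batches` and the shuffle step are assumptions made for illustration (the text only specifies the partition itself), and samples are stored as a list of (x, y) pairs rather than as columns of \(X\) and \(Y\).

```python
import random

def make_mini_batches(samples, batch_size, seed=0):
    """Shuffle the samples and cut them into consecutive mini-batches.

    `samples` is a list of (x, y) pairs; the last mini-batch is smaller
    when len(samples) is not a multiple of batch_size.
    """
    shuffled = list(samples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    return [shuffled[k:k + batch_size]
            for k in range(0, len(shuffled), batch_size)]

# Scaled-down version of the example: m = 5000 samples, 1000 per mini-batch.
samples = [(float(i), 2.0 * i) for i in range(5000)]
batches = make_mini_batches(samples, batch_size=1000)
T = len(batches)                          # number of mini-batches per epoch
print(T, len(batches[0]))
```

With \(m=5000000\) and a batch size of 1000, the same call would return \(T=5000\) mini-batches.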

Gradient descent is then run on each mini-batch in turn. In pseudocode:

# one epoch
for t = 1, ..., T{
    Forward Propagation
    Compute Cost Function
    Backward Propagation
}

Each step in detail:

(1) Forward Propagation

Output of the first layer (linear step, then activation):

\[Z^{[1]} = W^{[1]}X^{\{t\}} + b^{[1]}
\]

\[A^{[1]} = g^{[1]}(Z^{[1]})
\]

Output of layer \(l\) after the activation:

\[A^{[l]} = g^{[l]}(Z^{[l]})
\]

(2) Compute Cost Function

The cost on the current mini-batch of 1000 samples:

\[J = \dfrac{1}{1000} \sum_{i=1}^{1000}Loss(\hat{y}^i, y^i) + \dfrac{\lambda}{2 \times 1000} \sum_{l}||W^{[l]}||_F^2
\]

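(3) Backward Propagation

Compute the gradients \(dW^{[l]}, db^{[l]}\), then update the weights and biases:

\[W^{[l]} := W^{[l]} - \alpha dW^{[l]}
\]

\[b^{[l]} := b^{[l]} - \alpha db^{[l]}
\]

After the for loop has run T times, the whole training set has been passed through once, which is one epoch; training usually runs for multiple epochs.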
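The loop can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the course's implementation: the model is a one-dimensional linear fit \(y \approx wx + b\) with a squared-error cost and no regularization, and the function name, data, learning rate and epoch count are all made up for the example.

```python
import random

def train_mini_batch_gd(samples, batch_size, alpha, epochs, seed=0):
    """Fit y ~ w * x + b with mini-batch gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        data = list(samples)
        rng.shuffle(data)                            # new partition each epoch
        for k in range(0, len(data), batch_size):    # t = 1, ..., T
            batch = data[k:k + batch_size]
            n = len(batch)
            # Forward propagation: prediction error on this mini-batch only.
            errors = [(w * x + b) - y for x, y in batch]
            # Backward propagation: gradients of (1 / (2n)) * sum(error^2).
            dw = sum(e * x for e, (x, _) in zip(errors, batch)) / n
            db = sum(errors) / n
            # Parameter update.
            w -= alpha * dw
            b -= alpha * db
    return w, b

# Noise-free data generated from y = 3x - 1, with x in [0, 1).
samples = [(i / 200, 3.0 * (i / 200) - 1.0) for i in range(200)]
w, b = train_mini_batch_gd(samples, batch_size=20, alpha=0.5, epochs=200)
print(round(w, 3), round(b, 3))
```

Each pass of the outer loop is one epoch and performs T = 10 parameter updates here, one per mini-batch.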

1.2 Understanding mini-batch gradient descent

With batch gradient descent, one epoch performs a single gradient descent step; with mini-batch gradient descent, one epoch performs T steps.
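As a quick check of these counts, the number of updates per epoch is just the number of mini-batches. The helper name `updates_per_epoch` is hypothetical, and rounding up (to allow a smaller final mini-batch) is an assumption beyond what the text states.

```python
import math

def updates_per_epoch(m, batch_size):
    """Number of gradient descent steps in one pass over m samples."""
    return math.ceil(m / batch_size)

m = 5_000_000
for name, batch_size in [("batch", m), ("mini-batch", 1000), ("stochastic", 1)]:
    print(name, updates_per_epoch(m, batch_size))
```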

1.2.1 Cost function

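[Figure: cost versus iterations. Left: batch gradient descent. Right: mini-batch gradient descent.]

(1) The left plot shows batch gradient descent in a typical neural network: the cost decreases steadily as the number of passes over the full training set increases.

(2) The right plot shows mini-batch gradient descent: as training moves through the mini-batches, the overall trend of the cost is downward, but noise makes it oscillate.

(3) The cost oscillates under mini-batch gradient descent because the mini-batches differ from one another: some are "easy" subsets while others contain noisy samples, so the cost computed on each one goes up and down.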

1.2.2 Choosing the batch size

There are three options: (1) batch_size = m; (2) batch_size = 1; (3) batch_size between 1 and m.

(1) Batch Gradient Descent (batch_size = m)

With batch_size = m this is batch gradient descent: there is a single subset, the entire training set, i.e. \((X^{\{1\}}, Y^{\{1\}})=(X,Y)\).

(2) Stochastic Gradient Descent (batch_size = 1)

With batch_size = 1 this is stochastic gradient descent: there are m subsets, each consisting of a single sample, i.e. \((X^{\{i\}}, Y^{\{i\}})=(x^i,y^i)\).

(3) Mini-batch gradient descent (batch_size between 1 and m)

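[Figure: optimization paths toward the minimum for the three methods.]

The figure compares the paths taken by the three methods:

a. Blue is batch gradient descent. It approaches the global minimum smoothly, but because every step uses the entire training set, each step is slow to compute.

b. Purple is stochastic gradient descent. Each step is fast to compute, but because it uses a single sample the path is very noisy; it never settles at the minimum and keeps wandering around it.

c. Green is mini-batch gradient descent. Each step is reasonably fast and the oscillation is small, so it gets close to the minimum; if it keeps wandering around the minimum, reduce the learning rate.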

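Algorithm	Stochastic Gradient Descent	Mini-batch gradient descent	Batch Gradient Descent
Pros	Works on one sample at a time	(1) Learns quickly; (2) keeps the speed-up from vectorization; (3) makes progress before a full pass over the training set is finished	
Cons	(1) Loses the speed-up from vectorization; (2) inefficient		Each iteration takes too long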

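How should the batch size be chosen for mini-batch gradient descent?

Typical values are between 64 and 512; powers of 2 tend to run faster.
Each mini-batch \(X^{\{t\}}, Y^{\{t\}}\) must fit in CPU/GPU memory.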

1.3 Gradient descent in TensorFlow

1.3.1 Building the optimizer

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train = optimizer.minimize(loss)

1.3.2 tf.train.GradientDescentOptimizer()

tf.train.GradientDescentOptimizer.__init__(self, 
                                           learning_rate, 
                                           use_locking=False, 
                                           name="GradientDescent"):
Args:
	learning_rate: A Tensor or a floating point value.  The learning rate to use.
	use_locking: If True use locks for update operations.
	name: Optional name prefix for the operations created when applying gradients. Defaults to "GradientDescent".

1.3.3 Usage in TensorFlow

# coding=utf-8
# Written for the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
y_pred = W * x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(y_pred - y))  # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)  # initialize W and b to their starting values
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})

# evaluate the fitted parameters and the final loss
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s" % (curr_W, curr_b, curr_loss))
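The training data above lie exactly on the line \(y = -x + 1\), so the training loop should drive W toward -1, b toward 1, and the loss toward 0. A stdlib-only sketch that computes those target values in closed form; `fit_line` is a hypothetical helper written for this check, not part of TensorFlow.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = W * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    W = sxy / sxx
    b = mean_y - W * mean_x
    return W, b

W, b = fit_line([1, 2, 3, 4], [0, -1, -2, -3])
print(W, b)
```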

2. Exponentially weighted averages

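Exponentially weighted averages are a key building block of the optimization algorithms that go beyond plain gradient descent, so the concept is introduced first.

2.1 London temperatures

How the exponentially weighted average is motivated is not repeated here; see Week 2 of Andrew Ng's "Optimizing Deep Neural Networks" on NetEase Cloud Classroom (网易云课堂), or the course notes by 红色石头Will.

Assume \(V_0 = 0\):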

\[V_t = \beta V_{t-1} + (1 - \beta) \theta_t
\]

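Here \(\theta_t\) is the temperature on day \(t\), and \(V_t\) is the smoothed daily temperature produced by this moving average.
The value of \(\beta\) determines roughly how many days are averaged, about \(\dfrac{1}{1-\beta}\). The larger \(\beta\) is, the more days are averaged, the smoother the resulting curve, and the further it shifts to the right (it lags behind the data).

For example, with \(\beta=0.9\), \(\dfrac{1}{1-\beta}=10\), so \(V_t\) is approximately an exponentially weighted average over the last 10 days.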
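One way to see where \(\dfrac{1}{1-\beta}\) comes from: the weight of a sample that is \(\dfrac{1}{1-\beta}\) days old has decayed to roughly \(1/e \approx 0.37\) of the newest sample's weight, and older samples contribute less and less. A small sketch of that arithmetic; the helper names are made up for the example.

```python
import math

def effective_window(beta):
    """Approximate number of days averaged, 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)

def weight_after(beta, days):
    """Weight of a sample `days` old, relative to the newest sample."""
    return beta ** days

for beta in (0.9, 0.98):
    window = round(effective_window(beta))
    print(beta, window, round(weight_after(beta, window), 3), round(1 / math.e, 3))
```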

2.2 Understanding exponentially weighted averages

2.2.1 The general form of the exponentially weighted average

\[V_t = \beta V_{t-1} + (1-\beta)\theta_{t}
\]

\[V_t = (1-\beta) \cdot \theta_{t} + (1-\beta) \cdot \beta \cdot \theta_{t-1} + (1-\beta) \cdot \beta^2 \cdot \theta_{t-2} + \cdots + (1-\beta)\cdot \beta^{t-1}\cdot \theta_1 + \beta^t\cdot V_0
\]

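Here \(\theta_t, \theta_{t-1}, \cdots , \theta_1\) are the raw data, shown in the first plot of the figure below.

The coefficients \((1-\beta), (1-\beta)\cdot \beta, \cdots, (1-\beta)\cdot \beta^{t-1}\) follow an exponential curve, shown in the second plot; they decay exponentially from right to left.

\(V_t\) is the dot product of the two: each raw value is multiplied by its decaying weight and the results are summed. The more recent a value, the larger its influence; the older it is, the more it is attenuated.

[Figure: first plot, the raw data \(\theta_t\); second plot, the exponentially decaying weights.]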
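The equivalence between the recursion and the expanded weighted sum can be checked numerically. A minimal sketch with made-up temperatures; the function names are assumptions for the example.

```python
def ewa_recursive(thetas, beta):
    """V_t from the recursion V_t = beta * V_{t-1} + (1 - beta) * theta_t, V_0 = 0."""
    v = 0.0
    for theta in thetas:
        v = beta * v + (1.0 - beta) * theta
    return v

def ewa_expanded(thetas, beta):
    """V_t as the dot product of the data with the weights (1 - beta) * beta^k."""
    t = len(thetas)
    # thetas[t - 1 - k] is the sample that is k days old.
    return sum((1.0 - beta) * beta ** k * thetas[t - 1 - k] for k in range(t))

temps = [4.0, 9.0, 7.0, 12.0, 15.0, 11.0]   # made-up temperatures
print(ewa_recursive(temps, 0.9), ewa_expanded(temps, 0.9))
```

Both functions return the same value up to floating-point rounding; the \(\beta^t V_0\) term vanishes because \(V_0 = 0\).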

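2.2.2 Computing the exponentially weighted average in practice

In practice, to save memory, the average is computed with a single running variable:

\(V_{\theta}=0\)

Repeat{

\[\text{Get next } \theta_t
\]

\[V_{\theta} := \beta V_{\theta} + (1-\beta)\theta_t
\]

}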
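In Python the same loop can be written as a generator that keeps only one number in memory, however long the stream is. A sketch under the assumption that the data arrive one value at a time; `running_ewa` is a name chosen for the example.

```python
def running_ewa(stream, beta=0.9):
    """Yield V after each new theta, keeping a single running value."""
    v = 0.0                                  # V_theta = 0
    for theta in stream:                     # get next theta_t
        v = beta * v + (1.0 - beta) * theta  # V_theta := beta * V_theta + (1 - beta) * theta_t
        yield v

temps = [10.0, 10.0, 10.0]                   # made-up constant temperatures
smoothed = list(running_ewa(temps))
print([round(v, 2) for v in smoothed])
```

Note that the first outputs sit far below the true value of 10 because the running value starts at 0.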

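2.3 Bias correction

Because \(V_0=0\), the first few values computed with \(V_t = \beta V_{t-1} + (1-\beta)\theta_t\) are pulled strongly toward zero and come out smaller than they should; the effect fades as more data is processed and the estimate returns to normal.

Bias correction fixes this by rescaling \(V_t\) as follows:

\[\dfrac{V_t}{1-\beta^t}
\]

In machine learning, bias correction is optional; the uncorrected estimate is only off during the first few iterations.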
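The effect is easiest to see on a constant signal, where the corrected average should equal the signal from the first step. A small sketch; the function name and the constant input are assumptions for illustration.

```python
def ewa_with_bias_correction(thetas, beta=0.9):
    """Return (raw, corrected) lists: V_t and V_t / (1 - beta^t)."""
    v = 0.0
    raw, corrected = [], []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1.0 - beta) * theta
        raw.append(v)
        corrected.append(v / (1.0 - beta ** t))
    return raw, corrected

raw, corrected = ewa_with_bias_correction([10.0] * 5)
print([round(v, 2) for v in raw])
print([round(v, 2) for v in corrected])
```

For a constant input \(\theta\), \(V_t = \theta(1-\beta^t)\), so dividing by \(1-\beta^t\) recovers \(\theta\) exactly.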

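3. Gradient descent with momentum

Gradient descent with momentum usually converges faster than standard gradient descent.

The idea: at every iteration, compute an exponentially weighted average of the gradients, and use that average instead of the raw gradient to update the weights and biases.

3.1 Gradient descent

[Figure: paths toward the minimum for standard gradient descent (blue) and gradient descent with momentum (red).]

The blue zigzag line in the figure is standard gradient descent. It oscillates on its way down because each step depends only on the gradient at the current point, which produces the zigzag.

The red line is gradient descent with momentum. It does not oscillate as violently because each step depends on the previous gradients as well as the current one; the swings along the vertical axis shrink while progress along the horizontal axis speeds up.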

3.2 Pseudocode

On iteration t{

Compute dW, db on the current mini-batch

\(V_{dW} = \beta V_{dW} + (1-\beta)dW\)

\(V_{db} = \beta V_{db} + (1-\beta)db\)

Update the weights and biases

\(W := W - \alpha V_{dW}, b := b - \alpha V_{db}\)

}

where \(V_{dW}\) and \(V_{db}\) are initialized to 0, and \(\beta=0.9\) is a common choice.
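A minimal sketch of this update on a toy problem. The elongated quadratic cost, the starting point, the learning rate and the step count are all assumptions chosen for the example; only the two update lines come from the pseudocode.

```python
def grad(w):
    """Gradient of the elongated bowl J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def momentum_descent(w, alpha=0.05, beta=0.9, steps=300):
    v = [0.0, 0.0]                                   # V_dW = 0
    for _ in range(steps):
        g = grad(w)
        # V_dW = beta * V_dW + (1 - beta) * dW
        v = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(v, g)]
        # W := W - alpha * V_dW
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
    return w

w_final = momentum_descent([10.0, 1.0])
print(w_final)
```

Both coordinates end up very close to the minimum at the origin.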

3.3 Gradient descent with momentum in TensorFlow

3.3.1 Building the optimizer

# optimizer
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)  # momentum plays the role of \beta
train = optimizer.minimize(loss)

3.3.2 tf.train.MomentumOptimizer()

tf.train.MomentumOptimizer.__init__(self, learning_rate, momentum,
               use_locking=False, name="Momentum", use_nesterov=False):
    
Args:
	learning_rate: A `Tensor` or a floating point value.  The learning rate.
	momentum: A `Tensor` or a floating point value.  The momentum. # corresponds to \beta in the exponentially weighted average, e.g. 0.9
	use_locking: If `True` use locks for update operations. 
	name: Optional name prefix for the operations created when applying gradients.  Defaults to "Momentum".
	use_nesterov: If `True` use Nesterov Momentum. # a variant of momentum that often works better; see http://jmlr.org/proceedings/papers/v28/sutskever13.pdf

Return:
    optimizer

4. RMSprop

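RMSprop (root mean square prop) is another refinement of gradient descent. Like gradient descent with momentum, it damps the swings along the vertical axis and speeds up progress along the horizontal axis.

4.1 Pseudocode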

On iteration t{

Compute dW, db on the current mini-batch

\(S_{dW} = \beta S_{dW} + (1-\beta)(dW)^2\)

\(S_{db} = \beta S_{db} + (1-\beta)(db)^2\)

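Update the weights and biases

\(W := W - \alpha \dfrac{dW}{\sqrt{S_{dW}}+\epsilon}, b := b - \alpha \dfrac{db}{\sqrt{S_{db}}+\epsilon}\)

}

where the squares are elementwise and \(\epsilon\), typically \(10^{-8}\), keeps the denominator from getting close to 0.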
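A minimal sketch of the RMSprop update on a toy problem. The quadratic cost, starting point, learning rate and step count are assumptions made for the example; the two update lines follow the pseudocode.

```python
import math

def grad(w):
    """Gradient of the elongated bowl J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def rmsprop_descent(w, alpha=0.01, beta=0.9, eps=1e-8, steps=3000):
    s = [0.0, 0.0]                                   # S_dW = 0
    for _ in range(steps):
        g = grad(w)
        # S_dW = beta * S_dW + (1 - beta) * dW^2 (elementwise square)
        s = [beta * si + (1.0 - beta) * gi * gi for si, gi in zip(s, g)]
        # W := W - alpha * dW / (sqrt(S_dW) + eps)
        w = [wi - alpha * gi / (math.sqrt(si) + eps)
             for wi, gi, si in zip(w, g, s)]
    return w

w_final = rmsprop_descent([5.0, 1.0])
print(w_final)
```

Because \(|dW| / \sqrt{S_{dW}}\) never exceeds \(1/\sqrt{1-\beta}\), every step is at most \(\alpha/\sqrt{1-\beta}\) long and has about the same size in both coordinates despite the 25x difference in curvature. With a constant \(\alpha\) the iterate therefore ends up near the minimum rather than exactly on it.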

4.2 RMSprop in TensorFlow

4.2.1 Building the optimizer

# optimizer
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.01, decay=0.9, momentum=0.0)  # decay is the \beta that discounts the running average of squared gradients
train = optimizer.minimize(loss)

4.2.2 tf.train.RMSPropOptimizer()

tf.train.RMSPropOptimizer.__init__(self,
                                  learning_rate,
                                  decay=0.9,
                                  momentum=0.0,
                                  epsilon=1e-10,
                                  use_locking=False,
                                  centered=False,
                                  name="RMSProp")
Args:
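	learning_rate: A Tensor or a floating point value.  The learning rate.
	decay: Discounting factor for the history/coming gradient  # \beta in the pseudocode of section 4.1
	momentum: A scalar tensor. # optional momentum term; not part of the pseudocode in section 4.1 (0.0 disables it)
	epsilon: Small value to avoid zero denominator.  # \epsilon in the pseudocode of section 4.1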
	use_locking: If True use locks for update operation.
	centered: If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
	name: Optional name prefix for the operations created when applying gradients. Defaults to "RMSProp".

5. Adam optimization algorithm

Adam combines gradient descent with momentum and RMSprop, and has been shown to work well across a wide range of neural network architectures.

5.1 The Adam algorithm: pseudocode

Initialize: \(V_{dW}=0, S_{dW}=0, V_{db}=0, S_{db}=0\).

On iteration t {

Compute \(dW, db\) on the current mini-batch

\(V_{dW} = \beta_1 V_{dW} + (1-\beta_1)dW\)

\(V_{db} = \beta_1 V_{db} + (1-\beta_1)db\)

\(S_{dW} = \beta_2 S_{dW} + (1-\beta_2)(dW)^2\)

\(S_{db} = \beta_2 S_{db} + (1-\beta_2)(db)^2\)

\(V_{dW}^{corrected}= \dfrac{V_{dW}}{1-\beta_1^t}, V_{db}^{corrected}= \dfrac{V_{db}}{1-\beta_1^t}\)

\(S_{dW}^{corrected}= \dfrac{S_{dW}}{1-\beta_2^t}, S_{db}^{corrected}= \dfrac{S_{db}}{1-\beta_2^t}\)

\(W := W - \alpha \dfrac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}, \quad b := b - \alpha \dfrac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}\)

}

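Adam applies the bias correction shown above.

Hyperparameters: \(\beta_1 = 0.9, \beta_2=0.999, \epsilon = 10^{-8}\) are the usual defaults; in most cases only the learning rate \(\alpha\) needs tuning.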
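The pseudocode translates almost line by line into Python. A minimal sketch on a toy problem: the quadratic cost, starting point, learning rate and step count are assumptions chosen for the example, while \(\beta_1\), \(\beta_2\) and \(\epsilon\) use the defaults listed above.

```python
import math

def grad(w):
    """Gradient of the toy cost J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def cost(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def adam(w, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    v = [0.0] * len(w)                       # V_dW = 0
    s = [0.0] * len(w)                       # S_dW = 0
    for t in range(1, steps + 1):
        g = grad(w)
        v = [beta1 * vi + (1.0 - beta1) * gi for vi, gi in zip(v, g)]
        s = [beta2 * si + (1.0 - beta2) * gi * gi for si, gi in zip(s, g)]
        v_hat = [vi / (1.0 - beta1 ** t) for vi in v]    # bias correction
        s_hat = [si / (1.0 - beta2 ** t) for si in s]
        w = [wi - alpha * vh / (math.sqrt(sh) + eps)
             for wi, vh, sh in zip(w, v_hat, s_hat)]
    return w

w0 = [10.0, 1.0]
first = adam(w0, steps=1)     # one step only, to inspect the step size
final = adam(w0)
print(first, cost(w0), cost(final))
```

Thanks to the bias correction, the very first step moves each coordinate by almost exactly \(\alpha\), regardless of the size of its gradient.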

5.2 Adam in TensorFlow

5.2.1 Building the optimizer

optimizer = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon)
train = optimizer.minimize(loss)

5.2.2 tf.train.AdamOptimizer

tf.train.AdamOptimizer.__init__(self,
                               learning_rate=0.001,
                               beta1=0.9,
                               beta2=0.999,
                               epsilon=1e-8,
                               use_locking=False,
                               name="Adam"):
Args:
	learning_rate: A Tensor or a floating point value.  The learning rate.
	beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates. # \beta_1
	beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates. # \beta_2
	epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
	use_locking: If True use locks for update operations.
	name: Optional name for the operations created when applying gradients. Defaults to "Adam".

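6. Pros and cons of the optimization algorithms

6.1 Batch Gradient Descent

Idea: run gradient descent on the entire training set to update the weights.

Pros:

Each step uses the loss over the whole training set, so the descent direction is exact and the path is stable; on a convex cost it converges to the global minimum (on a non-convex cost it can still end in a local one).

Cons:

Each iteration is computationally expensive and memory-hungry.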

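6.2 Stochastic Gradient Descent

Idea: pick one sample at random from the training set, compute its gradient, and update the parameters.

Pros:

The gradient is computed on the loss of a single sample, so each update is cheap;
The noise it introduces has a regularizing effect.

Cons:

Each update considers only one sample, so the path is noisy and can get stuck in a local optimum;
Training takes a long time on large training sets;
Choosing a suitable learning rate is difficult;
It is sensitive to parameter initialization.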

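6.3 Mini Batch Gradient Descent

Idea: pick batch_size samples from the training set, compute the gradient of their loss, and update the weights.

Pros:

Converges reasonably quickly even on very large training sets.

Cons:

The update direction depends on the samples in the current batch, so it is not stable;
It may wander around the minimum without converging, so the learning rate needs to be reduced gradually.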

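6.4 Gradient Descent with Momentum

Idea: update using the direction of the previous gradients together with the gradient of the current batch.

Pros:

Damps the swings in the vertical direction, which suppresses oscillation to some extent;
Speeds up movement in the horizontal direction, so it approaches the optimum quickly and converges faster.

6.5 RMSprop

Idea: like gradient descent with momentum, it keeps an exponentially weighted average, here of the squared gradients, and divides each update by its square root.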

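6.6 AdaGrad

Idea: accumulate the sum of squared gradients for each parameter and divide the learning rate by its square root, so parameters with large accumulated gradients take smaller steps. (Combining gradient descent with momentum and RMSprop gives Adam, covered in section 5.)

Pros:

Large updates early in training;
Small updates late in training;
Well suited to sparse gradients.

Cons:

Late in training the effective learning rate becomes very small, so updates are very slow;
A global learning rate still has to be set by hand.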

7. Learning rate decay

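During training, gradually reducing the learning rate can help the network converge faster. This is called learning rate decay: the learning rate \(\alpha\) shrinks as the number of iterations grows.

7.1 Ways to decay the learning rate

(1) First:

\[\alpha = \dfrac{1}{1+ decay\_rate \times epoch\_num}\cdot \alpha_0
\]

where \(decay\_rate\) is the decay hyperparameter, \(epoch\_num\) is the number of epochs completed so far, and \(\alpha_0\) is the initial learning rate.

(2) Second:

\[\alpha = 0.95^{epoch\_num} \cdot \alpha_0
\]

(3) Third:

\[\alpha = \dfrac{k}{\sqrt{epoch\_num}}\cdot \alpha_0 \quad \text{or} \quad \alpha = \dfrac{k}{\sqrt{t}}\cdot \alpha_0
\]

where \(k\) is a constant and \(t\) is the mini-batch (iteration) number.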

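(4) Fourth:

Make \(\alpha\) a discrete function of \(t\), so that \(\alpha\) decreases in a staircase as \(t\) grows.

(5) Fifth:

Watch the training logs and adjust the learning rate by hand.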
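The first four schedules can be sketched as plain functions. The function names are made up for the example, and the staircase parameters `drop` and `every` are assumptions, since the text does not fix how large or how frequent the steps are.

```python
import math

def inverse_decay(alpha0, decay_rate, epoch_num):
    """(1) alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1.0 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    """(2) alpha = base^epoch_num * alpha0."""
    return (base ** epoch_num) * alpha0

def sqrt_decay(alpha0, k, epoch_num):
    """(3) alpha = k / sqrt(epoch_num) * alpha0; epoch_num must be >= 1."""
    return k / math.sqrt(epoch_num) * alpha0

def staircase_decay(alpha0, epoch_num, drop=0.5, every=10):
    """(4) multiply alpha by `drop` once every `every` epochs (assumed values)."""
    return alpha0 * drop ** (epoch_num // every)

for epoch in (1, 2, 3, 4):
    print(epoch, round(inverse_decay(0.2, 1.0, epoch), 4))
```

With \(\alpha_0 = 0.2\) and \(decay\_rate = 1\), the first schedule halves the learning rate after the first epoch and keeps shrinking it more slowly afterwards.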

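7.2 Setting the learning rate in TensorFlow

TensorFlow offers quite a few ways to set the learning rate. Since this post is about optimization algorithms for gradient descent, covering them here would take a lot of space and bloat the post, so they are summarized in a separate post:

"Ways to set the learning rate in TensorFlow"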

8. The problem of local optima

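When gradient descent is used to reduce the cost function, it may end up at a local optimum rather than the global optimum.

The intuitive picture of a local optimum is the one on the left of the figure below. In neural networks, however, the picture is different: most points where the gradient is zero are not such bowls but saddle-shaped points like the one on the right, called saddle points.

[Figure: left, a local optimum; right, a saddle point.]

Plateaus slow down learning. A plateau is a flat region where the gradient is close to zero, as in the figure below: the gradient is small, progress is slow, and reaching the saddle point takes a long time. Once at the saddle point, random perturbations let gradient descent continue downhill, but a lot of time has been spent on the plateau.

[Figure: a plateau leading to a saddle point.]

Momentum, RMSprop, and Adam help move across plateaus faster and speed up training.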
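A toy illustration of a saddle point, using the assumed function \(f(x, y) = x^2 - y^2\): its gradient is zero at the origin, which is a minimum along \(x\) but a maximum along \(y\). Plain gradient descent started exactly on the \(x\)-axis slides into the saddle and stays there, while a tiny perturbation in \(y\) is enough to escape. The function is unbounded below, so "escaping" here simply means that \(y\) keeps growing.

```python
def grad(p):
    """Gradient of f(x, y) = x**2 - y**2, which has a saddle point at (0, 0)."""
    x, y = p
    return (2.0 * x, -2.0 * y)

def descend(p, alpha=0.1, steps=100):
    for _ in range(steps):
        gx, gy = grad(p)
        p = (p[0] - alpha * gx, p[1] - alpha * gy)
    return p

stuck = descend((1.0, 0.0))        # starts exactly on the x-axis: ends at the saddle
escaped = descend((1.0, 1e-6))     # a tiny perturbation in y grows and escapes
print(stuck, escaped)
```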

The end!

Author's personal website: https://chenzhen.online

Reference

Unless otherwise noted, articles on this site are original. If you repost, please credit the source: 《深度学习-改善深层神经网络》-第二周-优化算法-Andrew Ng - Python技术站
