Optimization Methods

Until now, you have always used gradient descent to update the parameters and minimize the cost. In this assignment you will learn more advanced optimization methods that speed up learning and may even get you to a better final value of the cost function. A good optimization algorithm can be the difference between waiting days and waiting just a few hours for a good result.

You can picture gradient descent as walking downhill on the cost function J, looking for its lowest point.

To get started with this assignment, first load the packages you will need.
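The exact import list isn't reproduced here; a typical setup for this kind of assignment (an assumption, not the original header) is:

import numpy as np                 # numerical arrays and math
import matplotlib.pyplot as plt    # plotting the cost curves
import sklearn.datasets            # used later to generate the "moons" dataset

np.random.seed(1)                  # keep the runs reproducible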

1 – Gradient Descent

The underlying algorithm has been covered several times already, so it is not restated here; we go straight to the code.
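The original implementation isn't reproduced here; what follows is a minimal sketch of one gradient-descent step, assuming parameters stores W1, b1, ..., WL, bL and grads the matching dW1, db1, ... as numpy arrays (the name update_parameters_with_gd is an assumption):

def update_parameters_with_gd(parameters, grads, learning_rate):
    # One plain gradient-descent step: theta = theta - learning_rate * d(theta)
    L = len(parameters) // 2                 # number of layers in the network
    for l in range(1, L + 1):
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return parameters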

Let's test it:

The output:

W1 = [[ 1.63535156 -0.62320365 -0.53718766]
 [-1.07799357  0.85639907 -2.29470142]]
b1 = [[ 1.74604067]
 [-0.75184921]]
W2 = [[ 0.32171798 -0.25467393  1.46902454]
 [-2.05617317 -0.31554548 -0.3756023 ]
 [ 1.1404819  -1.09976462 -0.1612551 ]]
b2 = [[-0.88020257]
 [ 0.02561572]
 [ 0.57539477]]

A variant of gradient descent is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch contains just 1 example. The update rule you just implemented does not change. What changes is that you compute the gradients on a single training example at a time, rather than on the whole training set. The code examples below illustrate the difference between stochastic gradient descent and (batch) gradient descent.

(Batch) Gradient Descent:

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost
    cost = compute_cost(a, Y)
    # Backward propagation
    grads = backward_propagation(a, caches, parameters)
    # Update parameters
    parameters = update_parameters(parameters, grads)

Stochastic Gradient Descent:

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:,j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters
        parameters = update_parameters(parameters, grads)

In SGD, you use only 1 training example before each update. When the training set is large, SGD can be much faster, but the parameters will "oscillate" toward the minimum rather than converge smoothly. Here is an example:


Figure 1: SGD vs GD
"+" denotes a minimum of the cost. SGD leads to many oscillations to reach convergence. But each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).

2 – Mini-Batch Gradient Descent

Let's learn how to build mini-batches from the training set (X, Y).

There are two steps:

  • Shuffle: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y, so that after the shuffling the i-th column of X is the example corresponding to the i-th label in Y. The shuffling step ensures that examples will be split randomly into different mini-batches.

[Figure: the columns of X and Y shuffled synchronously]

  • Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size (here 64). Note that the number of training examples is not always divisible by mini_batch_size. The last mini-batch might be smaller, but you don't need to worry about this. When the final mini-batch is smaller than the full mini_batch_size, it will look like this:

[Figure: partition into mini-batches of size 64, with a smaller final mini-batch]

Exercise: Implement random_mini_batches. We coded the shuffling part for you. To help you with the partitioning step, we give you the following code that selects the indices for the 1st and 2nd mini-batches:

first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]

second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]
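A possible implementation of random_mini_batches, assuming X has shape (number of features, m) and Y has shape (1, m), as suggested by the shapes printed below:

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    np.random.seed(seed)
    m = X.shape[1]                           # number of training examples
    mini_batches = []

    # Step 1: shuffle the columns of X and Y synchronously
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: partition into batches of mini_batch_size
    num_complete = m // mini_batch_size
    for k in range(num_complete):
        mini_batch_X = shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batches.append((mini_batch_X, mini_batch_Y))

    # The last mini-batch may be smaller if m is not a multiple of mini_batch_size
    if m % mini_batch_size != 0:
        mini_batches.append((shuffled_X[:, num_complete * mini_batch_size:],
                             shuffled_Y[:, num_complete * mini_batch_size:]))

    return mini_batches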

Output:

shape of the 1st mini_batch_X: (12288, 64)

shape of the 2nd mini_batch_X: (12288, 64)

shape of the 3rd mini_batch_X: (12288, 20)

shape of the 1st mini_batch_Y: (1, 64)

shape of the 2nd mini_batch_Y: (1, 64)

shape of the 3rd mini_batch_Y: (1, 20)

mini batch sanity check: [ 0.90085595 -0.7612069   0.2344157 ]

3 – Gradient Descent with Momentum

Exercise: Initialize the velocity v. The velocity v is a Python dictionary initialized with arrays of zeros; its keys are the same as those of the grads dictionary.
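A sketch of the initialization, using the same dictionary layout as before (np.zeros_like gives a zero array with the right shape):

def initialize_velocity(parameters):
    # One zero array per gradient, with the same shape as the matching parameter
    L = len(parameters) // 2
    v = {}
    for l in range(1, L + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
    return v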

Let's test it:

Exercise: Now, implement the parameters update with momentum. The momentum update rule is, for l = 1, ..., L:

$$v_{dW^{[l]}} = \beta\, v_{dW^{[l]}} + (1 - \beta)\, dW^{[l]}$$
$$W^{[l]} = W^{[l]} - \alpha\, v_{dW^{[l]}}$$
$$v_{db^{[l]}} = \beta\, v_{db^{[l]}} + (1 - \beta)\, db^{[l]}$$
$$b^{[l]} = b^{[l]} - \alpha\, v_{db^{[l]}}$$

where L is the number of layers, β is the momentum and α is the learning rate.
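A direct translation of the rule above into code (a sketch, with the same dictionary layout as before):

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    L = len(parameters) // 2
    for l in range(1, L + 1):
        # Exponentially weighted average of the past gradients
        v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
        v["db" + str(l)] = beta * v["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
        # Move the parameters in the direction of the velocity
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * v["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * v["db" + str(l)]
    return parameters, v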

Let's test it:

4 – Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.
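Adam keeps two moving averages per parameter: v, the momentum-style average of the gradients, and s, the RMSProp-style average of the squared gradients. A sketch of how they could be initialized (analogous to initialize_velocity above; the name initialize_adam is an assumption):

def initialize_adam(parameters):
    # v: first-moment estimates, s: second-moment estimates, both start at zero
    L = len(parameters) // 2
    v, s = {}, {}
    for l in range(1, L + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
        s["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        s["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
    return v, s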

 

Let's test it:

Exercise: Now, implement the parameters update with Adam.
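Below is a sketch of the standard Adam update with bias correction; t is the step counter (starting at 1), and beta1, beta2, epsilon default to the usual 0.9, 0.999 and 1e-8:

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    L = len(parameters) // 2
    v_corrected, s_corrected = {}, {}
    for l in range(1, L + 1):
        for key in ("dW" + str(l), "db" + str(l)):
            # Moving average of the gradients (momentum part)
            v[key] = beta1 * v[key] + (1 - beta1) * grads[key]
            # Moving average of the squared gradients (RMSProp part)
            s[key] = beta2 * s[key] + (1 - beta2) * np.square(grads[key])
            # Bias-corrected estimates
            v_corrected[key] = v[key] / (1 - beta1 ** t)
            s_corrected[key] = s[key] / (1 - beta2 ** t)
        # Combined update: momentum direction scaled by the RMSProp denominator
        parameters["W" + str(l)] -= learning_rate * v_corrected["dW" + str(l)] / (np.sqrt(s_corrected["dW" + str(l)]) + epsilon)
        parameters["b" + str(l)] -= learning_rate * v_corrected["db" + str(l)] / (np.sqrt(s_corrected["db" + str(l)]) + epsilon)
    return parameters, v, s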

5 – Model with Different Optimization Algorithms

Let's use the "moons" dataset to test the different optimization methods.

We will now train a 3-layer neural network with each of the 3 optimization methods.
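Neither the network nor its training loop is reproduced here. Below is a rough sketch of how a model() function could dispatch between the three optimizers; initialize_parameters, forward_propagation, compute_cost and backward_propagation are assumed to exist with the signatures used in the section-1 pseudocode, and the defaults (learning rate 0.0007, mini-batch size 64, 10000 epochs, the "gd"/"momentum"/"adam" strings) are assumptions chosen to match the printouts that follow.

def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64,
          beta=0.9, beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
    parameters = initialize_parameters(layers_dims)
    t = 0            # Adam step counter
    seed = 10        # changed every epoch so the mini-batches are reshuffled differently

    # Optimizer state
    if optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)

    for i in range(num_epochs):
        seed += 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)
        for minibatch_X, minibatch_Y in minibatches:
            a, caches = forward_propagation(minibatch_X, parameters)
            cost = compute_cost(a, minibatch_Y)
            grads = backward_propagation(a, caches, parameters)
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t += 1
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t,
                                                               learning_rate, beta1, beta2, epsilon)
        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost))

    return parameters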

5.1 – Mini-batch Gradient Descent

Run the following code to see how the model does with mini-batch gradient descent:
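The training code isn't included here; a hedged example of what such a call could look like (make_moons stands in for the "moons" data, and layers_dims is an assumed 3-layer architecture):

import sklearn.datasets

# Generate the "moons" dataset (sample count and noise level are assumptions)
train_X, train_Y = sklearn.datasets.make_moons(n_samples=300, noise=0.2, random_state=3)
train_X = train_X.T                          # shape (2, m): one column per example
train_Y = train_Y.reshape((1, -1))           # shape (1, m)

layers_dims = [train_X.shape[0], 5, 2, 1]    # assumed 3-layer architecture
parameters = model(train_X, train_Y, layers_dims, optimizer="gd")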

Output:

Cost after epoch 0: 0.690736

Cost after epoch 1000: 0.685273

Cost after epoch 2000: 0.647072

Cost after epoch 3000: 0.619525

Cost after epoch 4000: 0.576584

Cost after epoch 5000: 0.607243

Cost after epoch 6000: 0.529403

Cost after epoch 7000: 0.460768

Cost after epoch 8000: 0.465586

Cost after epoch 9000: 0.464518

[Figures: training cost and decision boundary for mini-batch gradient descent]

5.2 – Mini-batch Gradient Descent with Momentum

Run the following code to see how the model does with momentum. Because this example is relatively simple, the gains from using momentum are small; but for more complex problems you might see bigger gains.
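For example, reusing train_X, train_Y and layers_dims from 5.1:

# Same model, now with momentum (beta is the momentum hyperparameter)
parameters = model(train_X, train_Y, layers_dims, optimizer="momentum", beta=0.9)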

Output:

Cost after epoch 0: 0.690741

Cost after epoch 1000: 0.685341

Cost after epoch 2000: 0.647145

Cost after epoch 3000: 0.619594

Cost after epoch 4000: 0.576665

Cost after epoch 5000: 0.607324

Cost after epoch 6000: 0.529476

Cost after epoch 7000: 0.460936

Cost after epoch 8000: 0.465780

Cost after epoch 9000: 0.464740

[Figure: training cost for mini-batch gradient descent with momentum]

5.3 – Mini-batch Gradient Descent with Adam

Run the following code to see how the model does with Adam:
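For example, again reusing the setup from 5.1:

# Same model, now with the Adam update
parameters = model(train_X, train_Y, layers_dims, optimizer="adam")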

Output:

Cost after epoch 0: 0.690552

Cost after epoch 1000: 0.185567

Cost after epoch 2000: 0.150852

Cost after epoch 3000: 0.074454

Cost after epoch 4000: 0.125936

Cost after epoch 5000: 0.104235

Cost after epoch 6000: 0.100552

Cost after epoch 7000: 0.031601

Cost after epoch 8000: 0.111709

Cost after epoch 9000: 0.197648

[Figure: training cost for mini-batch gradient descent with Adam]

5.4 – Summary

optimization method    accuracy    cost shape
Gradient descent       79.7%       oscillations
Momentum               79.7%       oscillations
Adam                   94%         smoother

Momentum usually helps, but given the small learning rate and the simple dataset, its impact here is almost negligible. Also, the large oscillations you see in the cost come from the fact that some mini-batches are more difficult than others for the optimization algorithm.

Adam, on the other hand, clearly outperforms mini-batch gradient descent and momentum. If you run the model for more epochs on this simple dataset, all three methods will eventually give very good results; however, you can see that Adam converges a lot faster.

Adam has the following advantages:

1) Relatively low memory requirements (though higher than plain gradient descent and gradient descent with momentum).

2) It usually works well even with little tuning of the hyperparameters (except the learning rate α).