2018-02-11

Andrew Ng's DeepLearning.ai Deep Learning Course: Programming Assignment (2-2)

1 – Gradient Descent

W1 = [[ 1.63535156 -0.62320365 -0.53718766]
 [-1.07799357  0.85639907 -2.29470142]]
b1 = [[ 1.74604067]
 [-0.75184921]]
W2 = [[ 0.32171798 -0.25467393  1.46902454]
 [-2.05617317 -0.31554548 -0.3756023 ]
 [ 1.1404819  -1.09976462 -0.1612551 ]]
b2 = [[-0.88020257]
 [ 0.02561572]
 [ 0.57539477]]
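The printed values above are the result of one plain gradient-descent step. A minimal sketch of that update, assuming parameters and grads are dictionaries keyed "W1", "b1", ..., "dW1", "db1", ... as in the rest of the assignment:

```python
import numpy as np

def update_parameters_with_gd(parameters, grads, learning_rate):
    """One gradient-descent step: theta = theta - learning_rate * d(theta)."""
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return parameters
```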

(Batch) Gradient Descent:

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)

Stochastic Gradient Descent:

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:, j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:, j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

Figure 1: SGD vs GD
"+" denotes a minimum of the cost. SGD leads to many oscillations to reach convergence. But each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).

2 – Mini-Batch Gradient Descent

• Shuffle: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y, so that after the shuffling the i-th column of X is the example corresponding to the i-th label in Y. The shuffling step ensures that examples will be split randomly into different mini-batches.

• Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size (here 64). Note that the number of training examples is not always divisible by mini_batch_size, so the final mini-batch might be smaller than the full mini_batch_size, but you don't need to worry about this.

Exercise: Implement random_mini_batches. We coded the shuffling part for you. To help you with the partitioning step, we give you the following code that selects the indices for the 1st and 2nd mini-batches:

first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]

second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]

...
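Putting the shuffle and the partition together, a sketch of random_mini_batches. The column-wise data layout and the synchronized shuffle follow the description above; the seed argument is an assumption added here for reproducibility:

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle (X, Y) in sync, then slice the columns into mini-batches."""
    np.random.seed(seed)
    m = X.shape[1]  # number of training examples

    # Shuffle: one permutation applied to the columns of both X and Y
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation]

    # Partition: full-size batches first, then the smaller remainder (if any)
    mini_batches = []
    num_complete = m // mini_batch_size
    for k in range(num_complete):
        mini_batch_X = shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batches.append((mini_batch_X, mini_batch_Y))
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete * mini_batch_size:]
        mini_batches.append((mini_batch_X, mini_batch_Y))
    return mini_batches
```

With 148 examples and mini_batch_size = 64 this yields two full batches and one batch of 20 columns, matching the shapes printed below.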

shape of the 1st mini_batch_X: (12288, 64)

shape of the 2nd mini_batch_X: (12288, 64)

shape of the 3rd mini_batch_X: (12288, 20)

shape of the 1st mini_batch_Y: (1, 64)

shape of the 2nd mini_batch_Y: (1, 64)

shape of the 3rd mini_batch_Y: (1, 20)

mini batch sanity check: [ 0.90085595 -0.7612069   0.2344157 ]

3 – Gradient Descent with Momentum

Exercise: Initialize the velocity v. The velocity v is a Python dictionary initialized with arrays of zeros. Its keys are the same as those in the grads dictionary.
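A minimal sketch of that initialization, assuming parameters follows the "W1"/"b1" key convention used earlier:

```python
import numpy as np

def initialize_velocity(parameters):
    """Build v with the same keys/shapes as the gradients, filled with zeros."""
    L = len(parameters) // 2  # number of layers
    v = {}
    for l in range(1, L + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
    return v
```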

Exercise: Now, implement the parameters update with momentum. The momentum update rule is, for l = 1, ..., L:

v_dW[l] = β · v_dW[l] + (1 − β) · dW[l]
W[l] = W[l] − α · v_dW[l]
v_db[l] = β · v_db[l] + (1 − β) · db[l]
b[l] = b[l] − α · v_db[l]

where α is the learning rate and β is the momentum hyperparameter.
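A sketch of the momentum step under the same dictionary conventions (note that beta = 0 recovers plain gradient descent):

```python
import numpy as np

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """Momentum: move along an exponentially weighted average of past gradients."""
    L = len(parameters) // 2
    for l in range(1, L + 1):
        # Update the velocities
        v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
        v["db" + str(l)] = beta * v["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
        # Step in the direction of the velocity
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * v["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * v["db" + str(l)]
    return parameters, v
```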

4 – Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.

Exercise: Now, implement the parameters update with Adam.
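A sketch of the Adam step, assuming v and s are zero-initialized moving averages of the gradients and of the squared gradients, and t is the step counter used for bias correction:

```python
import numpy as np

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Adam: momentum on the gradient plus RMSProp-style per-parameter scaling."""
    L = len(parameters) // 2
    for l in range(1, L + 1):
        for p, d in (("W", "dW"), ("b", "db")):
            key = d + str(l)
            # Moving averages of the gradient and the squared gradient
            v[key] = beta1 * v[key] + (1 - beta1) * grads[key]
            s[key] = beta2 * s[key] + (1 - beta2) * grads[key] ** 2
            # Bias-corrected estimates (important during the first steps)
            v_corrected = v[key] / (1 - beta1 ** t)
            s_corrected = s[key] / (1 - beta2 ** t)
            parameters[p + str(l)] = parameters[p + str(l)] - \
                learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return parameters, v, s
```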

5 – Model with Different Optimization Algorithms

Run the following code to see how the model does with mini-batch gradient descent.

Cost after epoch 0: 0.690736

Cost after epoch 1000: 0.685273

Cost after epoch 2000: 0.647072

Cost after epoch 3000: 0.619525

Cost after epoch 4000: 0.576584

Cost after epoch 5000: 0.607243

Cost after epoch 6000: 0.529403

Cost after epoch 7000: 0.460768

Cost after epoch 8000: 0.465586

Cost after epoch 9000: 0.464518

5.2 - Mini-batch gradient descent with momentum

Run the following code to see how the model does with momentum. Because this example is relatively simple, the gains from using momentum are small; but for more complex problems you might see bigger gains.

Cost after epoch 0: 0.690741

Cost after epoch 1000: 0.685341

Cost after epoch 2000: 0.647145

Cost after epoch 3000: 0.619594

Cost after epoch 4000: 0.576665

Cost after epoch 5000: 0.607324

Cost after epoch 6000: 0.529476

Cost after epoch 7000: 0.460936

Cost after epoch 8000: 0.465780

Cost after epoch 9000: 0.464740

5.3 - Mini-batch gradient descent with Adam

Run the following code to see how the model does with Adam.

Cost after epoch 0: 0.690552

Cost after epoch 1000: 0.185567

Cost after epoch 2000: 0.150852

Cost after epoch 3000: 0.074454

Cost after epoch 4000: 0.125936

Cost after epoch 5000: 0.104235

Cost after epoch 6000: 0.100552

Cost after epoch 7000: 0.031601

Cost after epoch 8000: 0.111709

Cost after epoch 9000: 0.197648

5.4 – Summary

optimization method    accuracy    cost shape
Gradient descent       79.7%       oscillations
Momentum               79.7%       oscillations
Adam                   94%         smoother

Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some mini-batches are more difficult than others for the optimization algorithm.