Andrew Ng's Deep Learning Course (DeepLearning.ai) Programming Assignment (2-1), Part 2

February 7, 2018

Regularization

Let's first import the packages you are going to use.

# import packages
import numpy as np
import matplotlib.pyplot as plt
from reg_utils import sigmoid, relu, plot_decision_boundary
from reg_utils import initialize_parameters, load_2D_dataset, predict_dec
from reg_utils import compute_cost, predict, forward_propagation
from reg_utils import backward_propagation, update_parameters
import sklearn
import sklearn.datasets
import scipy.io
from testCases import *
%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

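Problem Statement: You have just been hired as an AI expert by the French Football Corporation. They would like you to recommend positions where France's goalkeeper should kick the ball so that the French team's players can then hit it with their head.

The goalkeeper kicks the ball into the air, and the players of each team fight to hit the ball with their head. They give you the following 2D dataset from France's past 10 games.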

Run:

train_X, train_Y, test_X, test_Y = load_2D_dataset()
# If this raises: "c of shape (1, 211) not acceptable
# as a color sequence for x with size 211, y with size 211",
# open the imported reg_utils.py and find the call
# plt.scatter(train_X[0, :], train_X[1, :],
#             c=train_Y, s=40, cmap=plt.cm.Spectral)
# then change c=train_Y to c=np.squeeze(train_Y).
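The fix described in the comments above can be checked in isolation. The following is a minimal sketch, assuming a label array of shape (1, 211) like the one named in the error message; the array name `labels` and its contents are made up for illustration and are not the real dataset.

```python
import numpy as np

# Hypothetical stand-in for train_Y: labels stored as a row vector of shape (1, 211).
labels = np.zeros((1, 211), dtype=int)
labels[0, ::2] = 1  # arbitrary 0/1 pattern, only for illustration

# np.squeeze drops the length-1 axis and returns a 1-D array,
# which is the form the fix above passes to plt.scatter as c.
flat_labels = np.squeeze(labels)

print(labels.shape)       # (1, 211)
print(flat_labels.shape)  # (211,)
```

The values are unchanged; only the shape differs, so the colors assigned to the points stay the same.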

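The resulting plot:

Each dot corresponds to a position on the football field where a player hit the ball with his/her head after the French goalkeeper kicked it from the left side of the field.

If the dot is blue, the French player managed to hit the ball with his/her head.

If the dot is red, a player from the other team hit the ball with his/her head.

Your goal: use a deep learning model to find the positions on the field where the goalkeeper should kick the ball.

Analysis of the dataset: the dataset is a little noisy, but it looks like a diagonal line separating the upper-left half (blue) from the lower-right half (red) would work well.

You will first try a non-regularized model. Then you will learn how to regularize it and decide which model to choose for the French Football Corporation's problem.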

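1 – Non-regularized model

You will use the following neural network (already implemented for you below). This model can be used:

in regularization mode -- by setting the lambd input to a non-zero value. We use "lambd" instead of "lambda" because "lambda" is a reserved keyword in Python.
in dropout mode -- by setting keep_prob to a value less than 1

You will first try the model without any regularization. Then, you will implement: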

L2 regularization -- functions: "compute_cost_with_regularization()" and "backward_propagation_with_regularization()"
Dropout -- functions: "forward_propagation_with_dropout()" and "backward_propagation_with_dropout()"

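In each part, you will run this model with the correct inputs so that it calls the functions you have implemented. Take a look at the code below to familiarize yourself with the model.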

def model(X, Y, learning_rate = 0.3, num_iterations = 30000,
          print_cost = True, lambd = 0, keep_prob = 1):
    """
    Implements a three-layer neural network:
    LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot),
         of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob -- probability of keeping a neuron active during drop-out, scalar.

    Returns:
    parameters -- parameters learned by the model. They can then be
                  used to predict.
    """
       
    grads = {}
    costs = []      # to keep track of the cost
    m = X.shape[1]    # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]
   
    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)
 
    # Loop (gradient descent)
 
    for i in range(0, num_iterations):
 
        # Forward propagation: 
        #LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(
                X, parameters, keep_prob)
       
        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)
           
        # Backward propagation.
        assert(lambd==0 or keep_prob==1)
        #it is possible to use both L2 regularization and dropout,
        # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads=backward_propagation_with_regularization(X,Y,cache,lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)
       
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
       
        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)
   
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()
   
    return parameters

Let's train the model without any regularization and observe the accuracy on the training and test sets.

parameters = model(train_X, train_Y)
print ("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Output:

Cost after iteration 0: 0.6557412523481002

Cost after iteration 10000: 0.1632998752572419

Cost after iteration 20000: 0.13851642423239133

On the training set:

Accuracy: 0.947867298578

On the test set:

Accuracy: 0.915

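The training accuracy is 94.8% and the test accuracy is 91.5%. This is the baseline model (you will observe the impact of regularization on it). Run the following code to plot the model's decision boundary.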

plt.title("Model without regularization")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x:predict_dec(parameters,x.T),train_X,train_Y)

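The model is clearly overfitting. Next, we try two regularization techniques.

2 - L2 Regularization

Let's make the changes.

Exercise: Implement compute_cost_with_regularization(), which computes the cost given by formula (2).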
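Formula (2) is not reproduced in this post. Judging from the implementation that follows, it presumably has the form below, where $J_{\text{cross-entropy}}$ stands for the value returned by compute_cost, $m$ is the number of examples, and the sums run over every entry of W1, W2 and W3. The notation is an assumption chosen to match the code, not a copy of the original formula.

```latex
J_{\text{regularized}} = J_{\text{cross-entropy}}
  + \frac{\lambda}{2m} \sum_{l=1}^{3} \sum_{k} \sum_{j} \left( W^{[l]}_{k,j} \right)^{2}
```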

# GRADED FUNCTION: compute_cost_with_regularization
 
def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation,
          of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    lambd -- regularization hyperparameter, scalar

    Returns:
    cost -- value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
   
    cross_entropy_cost = compute_cost(A3, Y) 
    # This gives you the cross-entropy part of the cost
   
    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = (np.sum(np.square(W1)) + np.sum(np.square(W2))
                              + np.sum(np.square(W3))) * lambd / (2 * m)
    ### END CODE HERE ###

    cost = cross_entropy_cost + L2_regularization_cost

    return cost

Test it:

A3, Y_assess, parameters = compute_cost_with_regularization_test_case()
 
print("cost = "
      + str(compute_cost_with_regularization(A3, Y_assess, parameters, lambd=0.1)))

Result:

cost = 1.78648594516

Of course, because you changed the cost, you have to change backward propagation as well. All the gradients have to be computed with respect to this new cost.

Exercise: Implement the changes needed in backward propagation to take regularization into account. The changes only concern dW1, dW2 and dW3. For each, you have to add the regularization term's gradient.
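The gradient of the regularization term can be worked out directly. This is a short sketch, assuming the penalty for a single weight matrix $W$ is $\frac{\lambda}{2m}$ times the sum of its squared entries, as in the cost code above:

```latex
\frac{\partial}{\partial W} \left( \frac{\lambda}{2m} \sum_{k,j} W_{k,j}^{2} \right) = \frac{\lambda}{m} W
```

This is the `lambd/m*W` term that the implementation adds to each of dW1, dW2 and dW3.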

# GRADED FUNCTION: backward_propagation_with_regularization

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to
    which we added an L2 regularization.
   
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
   
    Returns:
    gradients -- A dictionary with the gradients with respect to 
    each parameter, activation and pre-activation variables
    """
   
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
   
    dZ3 = A3 - Y
   
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + lambd/m*W3
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
   
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + lambd/m*W2
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
   
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + lambd/m*W1
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
   
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
   
    return gradients

Test it:

X_assess, Y_assess, cache = backward_propagation_with_regularization_test_case()
 
grads = backward_propagation_with_regularization(
    X_assess, Y_assess, cache, lambd = 0.7)
print ("dW1 = "+ str(grads["dW1"]))
print ("dW2 = "+ str(grads["dW2"]))
print ("dW3 = "+ str(grads["dW3"]))

Result:

dW1	[[-0.25604646 0.12298827 -0.28297129] [-0.17706303 0.34536094 -0.4410571 ]]
dW2	[[ 0.79276486 0.85133918] [-0.0957219 -0.01720463] [-0.13100772 -0.03750433]]
dW3	[[-1.77691347 -0.11832879 -0.09397446]]

Let's now run the model with L2 regularization (λ = 0.7). The model() function will call:

compute_cost_with_regularization instead of compute_cost

backward_propagation_with_regularization instead of backward_propagation

Code:

parameters = model(train_X, train_Y, lambd = 0.7)
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Result:

Cost after iteration 0: 0.6974484493131264

Cost after iteration 10000: 0.2684918873282239

Cost after iteration 20000: 0.2680916337127301

On the train set:

Accuracy: 0.938388625592

On the test set:

Accuracy: 0.93

Test accuracy has improved to 93%. Let's look at the decision boundary to judge whether the model is still overfitting.

Code:

plt.title("Model with L2-regularization by yushuai.me")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

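The plot:

As you can see, overfitting has been greatly reduced.

What is L2 regularization actually doing?

L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights.

Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values.

Large weights become too costly, which leads to a smoother model whose output changes more slowly as the input changes.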

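The effects of L2 regularization on:

The cost computation:

A regularization term is added to the cost.

The backpropagation function:

There are extra terms in the gradients with respect to the weight matrices.

Weights end up smaller ("weight decay"):

Weights are pushed to smaller values.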
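The "weight decay" name can be made concrete with a small numerical check. The sketch below assumes a plain gradient descent step like the one performed by update_parameters; the shapes, the random values, and the names `dW_data` and `decay` are hypothetical and only serve the illustration. It shows that a step with the L2-regularized gradient equals first shrinking the weights by a factor slightly below 1 and then taking the usual step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, chosen only for illustration.
m = 200               # number of training examples
lambd = 0.7           # L2 regularization strength
learning_rate = 0.3

W = rng.standard_normal((3, 20))         # a weight matrix
dW_data = rng.standard_normal((3, 20))   # gradient of the cross-entropy part only

# Step with the regularized gradient: data gradient plus (lambd / m) * W.
dW = dW_data + lambd / m * W
W_regularized = W - learning_rate * dW

# The same step written as "shrink the weights, then take the usual step".
decay = 1 - learning_rate * lambd / m
W_decayed = decay * W - learning_rate * dW_data

print(decay)                                  # a factor slightly below 1
print(np.allclose(W_regularized, W_decayed))  # True
```

With these values the shrink factor per step is very close to 1, but it is applied at every one of the 30000 iterations, which is why the weights end up noticeably smaller.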

3 - Dropout

Finally, dropout is a widely used regularization technique, especially in deep learning. It randomly shuts down some neurons in each iteration.

Code:

# GRADED FUNCTION: forward_propagation_with_dropout
def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    """
    Implements the forward propagation:
    LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters
                  "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob -- probability of keeping a neuron active during drop-out, scalar

    Returns:
    A3 -- last activation value, output of the forward propagation,
          of shape (1, number of examples)
    cache -- tuple, information stored for computing the backward propagation
    """
    
    np.random.seed(1)
    
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)
    # Steps 1-4 below correspond to the Steps 1-4 described above.
    # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = np.random.rand(A1.shape[0], A1.shape[1])
    # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    D1 = D1 < keep_prob
    # Step 3: shut down some neurons of A1
    A1 = A1 * D1
    # Step 4: scale the value of neurons that haven't been shut down
    A1 = A1 / keep_prob
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])
    # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    D2 = D2 < keep_prob
    # Step 3: shut down some neurons of A2
    A2 = A2 * D2
    # Step 4: scale the value of neurons that haven't been shut down
    A2 = A2 / keep_prob
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    
    return A3, cache

Test it:

X_assess, parameters = forward_propagation_with_dropout_test_case()
A3, cache = forward_propagation_with_dropout(
    X_assess, parameters, keep_prob = 0.7)
print ("A3 = " + str(A3))

Output:

A3 =[[ 0.36974721 0.00305176 0.04565099 0.49683389 0.36974721]]

3.1 – Backward propagation with dropout

Exercise: Implement the backward propagation with dropout. As before, you are training a 3 layer network. Add dropout to the first and second hidden layers, using the masks D[1] and D[2] stored in the cache.

Instruction: Backpropagation with dropout is actually quite easy. You will have to carry out 2 Steps:

You had previously shut down some neurons during forward propagation, by applying a mask D[1] to A1. In backpropagation, you will have to shut down the same neurons, by reapplying the same mask D[1] to dA1.
During forward propagation, you had divided A1 by keep_prob. In backpropagation, you'll therefore have to divide dA1 by keep_prob again (the calculus interpretation is that if A[1] is scaled by keep_prob, then its derivative dA[1] is also scaled by the same keep_prob).
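The two steps above can be checked numerically. The sketch below is not part of the assignment: it assumes a single dropout layer applied to a hypothetical activation matrix `A`, with a toy linear loss, and compares the "reapply the mask, divide by keep_prob" rule against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
keep_prob = 0.8

A = rng.standard_normal((2, 5))          # hypothetical activations
D = rng.random((2, 5)) < keep_prob       # dropout mask, fixed for both passes
upstream = rng.standard_normal((2, 5))   # hypothetical gradient from the next layer

def dropout_forward(a):
    # Inverted dropout: apply the mask, then rescale by keep_prob.
    return a * D / keep_prob

def scalar_loss(a):
    # Toy loss whose gradient with respect to the dropout output is `upstream`.
    return np.sum(upstream * dropout_forward(a))

# Backward rule from the two steps: reapply the mask, divide by keep_prob.
dA = upstream * D / keep_prob

# Central finite differences as an independent check.
eps = 1e-6
dA_numeric = np.zeros_like(A)
for idx in np.ndindex(A.shape):
    bump = np.zeros_like(A)
    bump[idx] = eps
    dA_numeric[idx] = (scalar_loss(A + bump) - scalar_loss(A - bump)) / (2 * eps)

print(np.allclose(dA, dA_numeric, atol=1e-6))  # True
```

Entries where the mask is False receive a zero gradient, which matches the intuition that a neuron shut down in the forward pass cannot influence the loss.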

Code:

# GRADED FUNCTION: backward_propagation_with_dropout
def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of
    our baseline model to which we added dropout.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob -- probability of keeping a neuron active during drop-out, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to
                 each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    # Step 1: Apply mask D2 to shut down the same neurons as
    # during the forward propagation
    dA2 = dA2 * D2
    # Step 2: Scale the value of neurons that haven't been shut down
    dA2 = dA2 / keep_prob
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    # Step 1: Apply mask D1 to shut down the same neurons as
    # during the forward propagation
    dA1 = dA1 * D1
    # Step 2: Scale the value of neurons that haven't been shut down
    dA1 = dA1 / keep_prob
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Test it:

X_assess, Y_assess,cache=backward_propagation_with_dropout_test_case()
gradients = backward_propagation_with_dropout(
    X_assess, Y_assess, cache, keep_prob = 0.8)
print ("dA1 = " + str(gradients["dA1"]))
print ("dA2 = " + str(gradients["dA2"]))

Result:

dA1 = [[ 0.36544439 0. -0.00188233 0. -0.17408748]

[ 0.65515713 0. -0.00337459 0. -0. ]]

dA2 = [[ 0.58180856 0. -0.00299679 0. -0.27715731]

[ 0. 0.53159854 -0. 0.53159854 -0.34089673]

[ 0. 0. -0.00292733 0. -0. ]]

Now run the model with dropout (keep_prob = 0.86) and look at its cost:

parameters=model(train_X,train_Y,keep_prob=0.86,
learning_rate=0.3)
print ("On the train set:")
predictions_train = predict(train_X, train_Y,parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y,parameters)

Result:

Cost after iteration 0: 0.6543912405149825

Cost after iteration 10000: 0.061016986574905605

Cost after iteration 20000: 0.060582435798513114

On the train set:

Accuracy: 0.928909952607

On the test set:

Accuracy: 0.95

Test accuracy has now reached 95%. Let's look at the decision boundary:

plt.title("Model with dropout")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

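Note:

1. Use dropout only during training.

2. Deep learning frameworks such as tensorflow, PaddlePaddle, keras or caffe come with a dropout layer implementation. Don't stress - you will soon learn some of these frameworks.

What you should know about dropout:

1. Dropout is a regularization technique.

2. Use dropout only during training. Do not use dropout at test time.

3. Apply dropout during both forward and backward propagation.

4. During training, divide the output of each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then on average half of the nodes are shut down, so the output is scaled by 0.5 because only the remaining half contribute to the solution. Dividing by 0.5 is equivalent to multiplying by 2, so the output has the same expected value again. You can check that this also works for values of keep_prob other than 0.5.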
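The expected-value argument in the last point is easy to verify empirically. The following is a minimal sketch, assuming a hypothetical layer of 100,000 activations that are all equal to 1.0 so that the mean is easy to read; the names `activations` and `results` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations: 100,000 units all equal to 1.0.
activations = np.ones(100_000)

results = {}
for keep_prob in (0.5, 0.86):
    mask = rng.random(activations.shape) < keep_prob
    dropped = activations * mask       # mean is roughly keep_prob
    rescaled = dropped / keep_prob     # mean is roughly 1.0 again
    results[keep_prob] = (dropped.mean(), rescaled.mean())
    print(keep_prob, dropped.mean(), rescaled.mean())
```

For both values of keep_prob the mean after masking drops to about keep_prob, and dividing by keep_prob brings it back to about 1.0, the mean of the original activations.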

4 - Conclusions

Here are the results of our three models:

model	train accuracy	test accuracy
3-layer NN without regularization	95%	91.5%
3-layer NN with L2-regularization	94%	93%
3-layer NN with dropout	93%	95%

Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system.

Congratulations for finishing this assignment! And also for revolutionizing French football. :-)

This work is licensed under a Creative Commons Attribution 4.0 International License.