[Andrew Ng DL] Class2 Week2 Mini-batch Gradient Descent: Course Summary + Code Implementation
2017-12-22 16:30
This week's material moves from Batch gradient descent to Mini-batch gradient descent and, based on some properties of Mini-batch gradient descent, introduces optimization methods (i.e., ways of updating the parameters) that speed up its training. Details below.
I. Course Summary
1. Batch gradient descent vs. Mini-batch gradient descent
The gradient descent algorithm we have used so far is Batch gradient descent: in each iteration, the entire dataset must be processed before the parameters are updated. Mini-batch gradient descent instead splits the dataset into a number of subsets; during one pass over the data (one epoch), the parameters are updated after each subset is processed, so a single epoch yields many parameter updates.
Stochastic gradient descent is a special case of Mini-batch gradient descent in which every single example is its own subset: the parameters are updated after each individual example is processed.
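To make the difference concrete: with $m = 5{,}000{,}000$ training examples and a mini-batch size of 1,000 (the example used in lecture), Batch gradient descent makes 1 parameter update per epoch, Mini-batch gradient descent makes 5,000, and Stochastic gradient descent makes 5,000,000.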
2. Why use Mini-batch gradient descent
- With Batch gradient descent, parameter updates come too infrequently: there is only one update per pass over the entire dataset.
- With Stochastic gradient descent, the speed-up from vectorization is lost, since each update processes a single example.
Mini-batch gradient descent has neither drawback, which helps speed up training.
3. Choosing the mini-batch size
The mini-batch size is usually a power of two such as 32, 64, 128, 256, or 512. Building the mini-batches takes two steps:
Step 1: Shuffle the dataset.
Step 2: Partition the shuffled dataset into mini-batches.
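For example, with $m = 1{,}000$ examples and a mini-batch size of 64, the partition yields $\lfloor 1000/64 \rfloor = 15$ full mini-batches plus one final mini-batch of $1000 - 15 \times 64 = 40$ examples; this smaller final batch is the "end case" handled in the code in Part II.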
4. Characteristics of mini-batch training
Because each parameter update sees only part of the data, the updates oscillate (the cost does not decrease monotonically). To keep these oscillations from slowing training, several techniques for optimizing the parameter update are introduced: Momentum, RMSprop, and Adam (all built on exponentially weighted averages), plus learning rate decay.

5. Momentum
$$
\begin{cases}
v_{dW^{[l]}} = \beta\, v_{dW^{[l]}} + (1-\beta)\, dW^{[l]} \\[4pt]
W^{[l]} = W^{[l]} - \alpha\, v_{dW^{[l]}}
\end{cases} \tag{1}
$$

$$
\begin{cases}
v_{db^{[l]}} = \beta\, v_{db^{[l]}} + (1-\beta)\, db^{[l]} \\[4pt]
b^{[l]} = b^{[l]} - \alpha\, v_{db^{[l]}}
\end{cases} \tag{2}
$$
The larger β is, the more weight the past gradients carry, and the smoother the parameter updates become.
β is commonly chosen in the range 0.8–0.999; 0.9 is the usual default.
β = 0 is equivalent to not using Momentum at all.
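A rule of thumb from the lectures: an exponentially weighted average with parameter $\beta$ effectively averages over roughly $\frac{1}{1-\beta}$ past values, so $\beta = 0.9$ corresponds to averaging the last $\approx 10$ gradients and $\beta = 0.999$ the last $\approx 1000$.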
6. RMSprop
$$
\begin{cases}
s_{dW^{[l]}} = \beta\, s_{dW^{[l]}} + (1-\beta)\, \big(dW^{[l]}\big)^2 \\[6pt]
W^{[l]} = W^{[l]} - \alpha\, \dfrac{dW^{[l]}}{\sqrt{s_{dW^{[l]}}} + \varepsilon}
\end{cases} \tag{3}
$$

$$
\begin{cases}
s_{db^{[l]}} = \beta\, s_{db^{[l]}} + (1-\beta)\, \big(db^{[l]}\big)^2 \\[6pt]
b^{[l]} = b^{[l]} - \alpha\, \dfrac{db^{[l]}}{\sqrt{s_{db^{[l]}}} + \varepsilon}
\end{cases} \tag{4}
$$
Typically, $\beta = 0.999$ and $\varepsilon = 10^{-8}$.
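The programming assignment implements Momentum and Adam but not RMSprop on its own, so here is a minimal sketch following equations (3) and (4), written in the same style as the helper functions in Part II; the function name and the in-place updates are my own choices, not assignment code:

```python
import numpy as np

def update_parameters_with_rmsprop(parameters, grads, s, beta=0.999,
                                   learning_rate=0.01, epsilon=1e-8):
    # s holds running averages of squared gradients, one zero-initialized
    # array per gradient (same layout as initialize_adam's "s" in Part II).
    L = len(parameters) // 2  # number of layers in the neural network
    for l in range(L):
        # Moving average of the squared gradients (first line of eqs. (3)/(4))
        s["dW" + str(l+1)] = beta * s["dW" + str(l+1)] + (1 - beta) * np.square(grads["dW" + str(l+1)])
        s["db" + str(l+1)] = beta * s["db" + str(l+1)] + (1 - beta) * np.square(grads["db" + str(l+1)])
        # Divide the raw gradient by the root of the running average:
        # directions with large, oscillating gradients get smaller steps
        parameters["W" + str(l+1)] -= learning_rate * grads["dW" + str(l+1)] / (np.sqrt(s["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] -= learning_rate * grads["db" + str(l+1)] / (np.sqrt(s["db" + str(l+1)]) + epsilon)
    return parameters, s
```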
7. Adam
Adam can be understood as Momentum + RMSprop + bias correction:

$$
\begin{cases}
v_{dW^{[l]}} = \beta_1\, v_{dW^{[l]}} + (1-\beta_1)\, \dfrac{\partial J}{\partial W^{[l]}} \\[6pt]
v^{\text{corrected}}_{dW^{[l]}} = \dfrac{v_{dW^{[l]}}}{1-(\beta_1)^t} \\[6pt]
s_{dW^{[l]}} = \beta_2\, s_{dW^{[l]}} + (1-\beta_2)\, \left(\dfrac{\partial J}{\partial W^{[l]}}\right)^2 \\[6pt]
s^{\text{corrected}}_{dW^{[l]}} = \dfrac{s_{dW^{[l]}}}{1-(\beta_2)^t} \\[6pt]
W^{[l]} = W^{[l]} - \alpha\, \dfrac{v^{\text{corrected}}_{dW^{[l]}}}{\sqrt{s^{\text{corrected}}_{dW^{[l]}}}+\varepsilon}
\end{cases} \tag{5}
$$
The parameter $b$ is updated in the same way.
Typically, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\varepsilon = 10^{-8}$.
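The bias correction compensates for initializing $v$ and $s$ to zero. For example, at $t = 1$ with $\beta_1 = 0.9$, the raw estimate is $v = 0.1\, dW$, far too small; dividing by $1 - \beta_1^t = 0.1$ restores $v^{\text{corrected}} = dW$. As $t$ grows, $1 - \beta_1^t \to 1$ and the correction fades away.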
8. Learning rate decay
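The course presents several schedules for gradually shrinking $\alpha$ as training progresses, so that the larger steps early on give way to finer steps near the minimum. Two examples from the lectures:

$$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\, \alpha_0 \qquad \text{or} \qquad \alpha = 0.95^{\,\text{epoch\_num}}\, \alpha_0$$

A minimal sketch of the inverse-decay schedule; the helper name `decayed_learning_rate` is my own, not part of the assignment code below:

```python
def decayed_learning_rate(alpha0, epoch_num, decay_rate=1.0):
    # Inverse decay from the lectures: alpha shrinks as epochs accumulate.
    return alpha0 / (1 + decay_rate * epoch_num)

# e.g. alpha0 = 0.2, decay_rate = 1.0 gives 0.2, 0.1, 0.067, 0.05, ... per epoch
```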
9. Notes
- When the learning rate (α) is small and the dataset is simple, plain Momentum performs about the same as the vanilla update.
- On simple datasets, given enough iterations, the vanilla update, Momentum, and Adam all reach good results, but Adam gets there faster.
- To do better still, the hyperparameter α needs to be tuned.
II. Code Implementation
1. Mini-batch partitioning
```python
import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    np.random.seed(seed)  # To make your "random" minibatches the same as ours
    m = X.shape[1]        # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m / mini_batch_size)  # number of mini batches of size mini_batch_size in your partitioning
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size : (k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size : (k + 1) * mini_batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size :]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size :]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches
```
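A quick sanity check of the partitioning; the shapes here are hypothetical, chosen so the "end case" is exercised:

```python
# 148 examples with batch size 64 should give two full batches of 64
# plus a final batch of 148 - 2*64 = 20.
X = np.random.randn(12288, 148)
Y = (np.random.randn(1, 148) < 0.5)
mini_batches = random_mini_batches(X, Y, mini_batch_size=64)
print(len(mini_batches))          # 3
print(mini_batches[-1][0].shape)  # (12288, 20)
```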
2. Vanilla update (plain gradient descent)
```python
def update_parameters_with_gd(parameters, grads, learning_rate):
    L = len(parameters) // 2  # number of layers in the neural networks

    # Update rule for each parameter
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]

    return parameters
```
3. Momentum (two functions)
```python
#####################################################################
# Compute v_dW, v_db (initialize the velocity)
def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL"
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl

    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    L = len(parameters) // 2  # number of layers in the neural networks
    v = {}

    # Initialize velocity
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros(parameters["W" + str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters["b" + str(l+1)].shape)

    return v

##################################################################
# Update the parameters
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2  # number of layers in the neural networks

    # Momentum update for each parameter
    for l in range(L):
        # compute velocities
        v["dW" + str(l+1)] = beta * v["dW" + str(l+1)] + (1 - beta) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta * v["db" + str(l+1)] + (1 - beta) * grads["db" + str(l+1)]
        # update parameters
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v["db" + str(l+1)]

    return parameters, v
```
4. Adam (two functions)
```python
###########################################################################
# Initialize Adam
def initialize_adam(parameters):
    L = len(parameters) // 2  # number of layers in the neural networks
    v = {}
    s = {}

    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros(parameters["W" + str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters["b" + str(l+1)].shape)
        s["dW" + str(l+1)] = np.zeros(parameters["W" + str(l+1)].shape)
        s["db" + str(l+1)] = np.zeros(parameters["b" + str(l+1)].shape)

    return v, s

##########################################################################
# Update the parameters
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    L = len(parameters) // 2  # number of layers in the neural networks
    v_corrected = {}          # Initializing first moment estimate, python dictionary
    s_corrected = {}          # Initializing second moment estimate, python dictionary

    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1 - beta1) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1 - beta1) * grads["db" + str(l+1)]

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - np.power(beta1, t))
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - np.power(beta1, t))

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * np.power(grads["dW" + str(l+1)], 2)
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * np.power(grads["db" + str(l+1)], 2)

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - np.power(beta2, t))
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - np.power(beta2, t))

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon)

    return parameters, v, s
```
5. Defining the model (tested on a three-layer neural network)
See [Andrew Ng DL] Class1 Week4 Deep Neural Networks + Code Implementation for how to build the deep network used here.

```python
import matplotlib.pyplot as plt

def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64,
          beta=0.9, beta1=0.9, beta2=0.999, epsilon=1e-8,
          num_epochs=10000, print_cost=True):
    # 3-layer neural network model which can be run in different optimizer modes.
    L = len(layers_dims)  # number of layers in the neural networks
    costs = []            # to keep track of the cost
    t = 0                 # initializing the counter required for Adam update
    seed = 10             # For grading purposes, so that your "random" minibatches are the same as ours

    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass  # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)

    # Optimization loop
    for i in range(num_epochs):
        # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:
            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)

            # Compute cost
            cost = compute_cost(a3, minibatch_Y)

            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1  # Adam counter
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2, epsilon)

        # Print the cost every 1000 epochs
        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters
```
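A minimal usage sketch, assuming `load_dataset` and the network helpers (`initialize_parameters`, `forward_propagation`, `compute_cost`, `backward_propagation`) from the post referenced above are in scope; the layer sizes are illustrative:

```python
train_X, train_Y = load_dataset()          # assumed helper returning (n_x, m) / (1, m) arrays
layers_dims = [train_X.shape[0], 5, 2, 1]  # three-layer network

# Train with each optimizer and compare the cost curves
for opt in ("gd", "momentum", "adam"):
    parameters = model(train_X, train_Y, layers_dims, optimizer=opt)
```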