
Coursera | Andrew Ng (02-week-1-1.8) - Other Regularization Methods

2018-01-16 14:02
This series only adds personal study notes and supplementary derivations on top of the original course; corrections and criticism are welcome. Having studied Andrew Ng's course, I organized it into text to make review easier. Since I have been studying English, the series is primarily in English, and I suggest readers follow the English with the Chinese as support, as preparation for reading academic papers in this field later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom

Please credit the author and source when reposting: ZJ, WeChat official account "SelfImprovementLab"

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79072256

1.8 Other Regularization Methods

(Subtitle source: NetEase Cloud Classroom)



In addition to L2 regularization and dropout regularization, there are a few other techniques for reducing overfitting in your neural network. Let's take a look. Let's say you're fitting a cat classifier. If you are overfitting, getting more training data can help, but getting more training data can be expensive, and sometimes you just can't get more data. But what you can do is augment your training set by taking an image like this and, for example, flipping it horizontally and adding that to your training set as well. So now, instead of just this one example in your training set, you can add the flipped image as an additional training example. So by flipping the images horizontally, you could double the size of your training set. Because your training set is now a bit redundant, this isn't as good as if you had collected an additional set of brand new independent examples. But you could do this without needing to pay the expense of going out to take more pictures of cats.



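As a concrete illustration of the flipping idea, here is a minimal sketch, assuming the images are stored as a NumPy array of shape (m, height, width, channels); the function name is purely illustrative:

```python
import numpy as np

def augment_with_flips(images, labels):
    """Double a training set by adding horizontally flipped copies.

    images: array of shape (m, height, width, channels)
    labels: array of shape (m,)
    """
    flipped = images[:, :, ::-1, :]   # reverse the width axis = horizontal flip
    aug_images = np.concatenate([images, flipped], axis=0)
    aug_labels = np.concatenate([labels, labels], axis=0)
    return aug_images, aug_labels
```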

And then, other than flipping horizontally, you can also take random crops of the image. So here we've rotated and sort of randomly zoomed into the image, and this still looks like a cat. So by taking random distortions and translations of the image, you could augment your data set and make additional fake training examples. Again, these extra fake training examples don't add as much information as a brand new independent example of a cat would. But because you can do this almost for free, other than some computational cost, this can be an inexpensive way to give your algorithm more data and therefore sort of regularize it and reduce overfitting. And by synthesizing examples like this, what you're really telling your algorithm is that if something is a cat, then flipping it horizontally is still a cat.



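The random crops and translations described here can be sketched in the same spirit; the crop sizes and the `cat_image` name in the comment are illustrative assumptions:

```python
import numpy as np

def random_crop(image, crop_h, crop_w, rng=np.random):
    """Return a random crop, simulating a random zoom/translation of the image."""
    h, w = image.shape[:2]
    top = rng.randint(0, h - crop_h + 1)    # random vertical offset
    left = rng.randint(0, w - crop_w + 1)   # random horizontal offset
    return image[top:top + crop_h, left:left + crop_w]

# e.g. four extra "fake" examples from one (assumed) 224x224 cat image:
# extra = [random_crop(cat_image, 180, 180) for _ in range(4)]
```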

Notice I didn’t flip it vertically, because maybe we don’t want upside down cats, right? And then also maybe randomly zooming in to part of the image, it’s probably still a cat. For optical character recognition, you can also bring your data set by taking digits and imposing random rotations and distortions to it. So if you add these things to your training set, these are also still digit force. For illustration, I applied a very strong distortion. So this look very wavy for, in practice you don’t need to distort the four quite as aggressively, but just a more subtle distortion than what I’m showing here, to make this example clearer for you, right? But a more subtle distortion is usually used in practice, because this looks like really warped fours. So data augmentation can be used as a regularization technique, in fact similar to regularization.



(Note from ZJ: Hinton's capsule networks, published on 2017-10-26, aim to address problems such as scrambled facial features and upside-down images in recognition.)
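For the digit example, a mild random rotation is one way to apply the kind of subtle distortion described above; this sketch assumes SciPy is available and that digits are, say, 28x28 grayscale arrays:

```python
import numpy as np
from scipy.ndimage import rotate

def subtle_rotation(digit_image, max_degrees=15, rng=np.random):
    """Apply a small random rotation to a digit image.

    Keeping the distortion subtle matters: an aggressively warped
    four may no longer look like a four.
    """
    angle = rng.uniform(-max_degrees, max_degrees)
    # reshape=False keeps the output array the same size as the input
    return rotate(digit_image, angle, reshape=False, mode='nearest')
```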

There’s one other technique that is often used called early stopping. So what you’re going to do is as you run gradient descent, you’re going to plot your, either the training error, you’ll use 01 classification error on the training set, or just plot the cost function J optimizing, and that should decrease monotonically, like so, all right? Because as you’re training, hopefully your training error, your cost function J should decrease. So with early stopping, what you do is you plot this, and you also plot your dev set error. And again, this could be a classification error in a development set, or something like the cost function, like the logistic loss or the log loss of the dev set. And what you find is that your dev set error will usually go down for a while, and then it will increase from there.So what early stopping does is, you will say well, it looks like your neural network was doing best around that iteration, so we just want to stop training on your neural network halfway and take whatever value achieved this dev set error. So why does this work?



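A minimal sketch of that procedure is shown below; `model`, `train_step`, and `dev_error` are hypothetical placeholders for whatever training code you use, and the `patience` counter is a common practical variant of "stop once the dev error starts to rise":

```python
import copy

def train_with_early_stopping(model, train_step, dev_error, max_iters, patience=10):
    """Keep the parameters that achieved the lowest dev-set error and stop
    once that error has not improved for `patience` consecutive checks."""
    best_error = float('inf')
    best_params = copy.deepcopy(model.params)
    stale_checks = 0

    for it in range(max_iters):
        train_step(model)              # one gradient-descent step on J
        err = dev_error(model)         # evaluate on the dev set
        if err < best_error:
            best_error = err
            best_params = copy.deepcopy(model.params)
            stale_checks = 0
        else:
            stale_checks += 1
            if stale_checks >= patience:
                break                  # dev error has stopped improving

    model.params = best_params         # roll back to the best iteration
    return model, best_error
```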

Well when you’ve haven’t run many iterations for your neural network yet, your parameters w will be close to zero. Because with random initialization, you probably initialize w to small random values,so before you train for a long time, w is still quite small. And as you iterate, as you train, w will get bigger and bigger and bigger until here maybe you have a much larger value of the parameters w for your neural network. So what early stopping does is by stopping halfway you have only a mid-size rate w. And so similar to L2 regularization by picking a neural network with smaller norm for your parameters w, hopefully your neural network is over fitting less. And the term early stopping refers to the fact that you’re just stopping the training of your neural network earlier. I sometimes use early stopping when training a neural network.



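One way to see this effect in your own training run is to log the overall size of the weights over iterations; a small sketch, where `weights` is assumed to be the list of weight matrices W^[l] of your network:

```python
import numpy as np

def weight_norm(weights):
    """Frobenius-style norm over all weight matrices of the network."""
    return np.sqrt(sum(np.sum(W ** 2) for W in weights))

# Inside the training loop you might record:
#   norms.append(weight_norm(model.weights))
# Early on the norm is small (close to the random initialization); it tends
# to grow with more iterations, so stopping halfway leaves a mid-size norm,
# which is what links early stopping to L2 regularization.
```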

But it does have one downside; let me explain. I think of the machine learning process as comprising several different steps. One is that you want an algorithm to optimize the cost function J, and we have various tools to do that, such as gradient descent. And then we'll talk later about other algorithms, like Momentum, RMSprop, and Adam, and so on. But after optimizing the cost function J, you also want to not overfit, and we have some tools to do that, such as regularization, getting more data, and so on. Now, in machine learning we already have so many hyperparameters to search over; it's already very complicated to choose among the space of possible algorithms. And so I find machine learning easier to think about when you have one set of tools for optimizing the cost function J, and when you're focusing on optimizing the cost function J, all you care about is finding w and b so that J(w,b) is as small as possible. You just don't think about anything else other than reducing this. And then it's a completely separate task to not overfit, in other words, to reduce variance.




And when you’re doing that, you have a separate set of tools for doing it. And this principle is sometimes called Orthogonalization. And this is an idea that you want to be able to think about one task at a time. I’ll say more about Orthogonalization in a later video, so if you don’t fully get the concept yet, don’t worry about it. But, to me the main downside of early stopping is that this couples, these two tasks. So you no longer can work on these two problems independently, because by stopping gradient descent early, you’re sort of breaking whatever you’re doing to optimize cost function J, because now you’re not doing a great job reducing the cost function J. You’ve sort of not done that that well. And then you also simultaneously trying to not over fit. So instead of using different tools to solve the two problems, you’re using one tool that kind of mixes the two. And this just makes the set of things you could try more complicated to think about.




Rather than using early stopping, one alternative is to just use L2 regularization; then you can simply train the neural network as long as possible. I find that this makes the search space of hyperparameters easier to decompose and easier to search over. But the downside of this, though, is that you might have to try a lot of values of the regularization parameter lambda, and this makes searching over many values of lambda more computationally expensive. The advantage of early stopping is that, running the gradient descent process just once, you get to try out values of small w, mid-size w, and large w, without needing to try a lot of values of the L2 regularization hyperparameter lambda. If this concept doesn't completely make sense to you yet, don't worry about it; we're going to talk about orthogonalization in greater detail in a later video, and I think it will make a bit more sense then. Despite its disadvantages, many people do use it.

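To make the trade-off concrete, here is a hedged sketch of the L2-regularized cost from the earlier videos and the kind of lambda sweep this paragraph refers to; `train` and `dev_error` are hypothetical placeholders:

```python
import numpy as np

def l2_cost(cross_entropy, weights, lambd, m):
    """J = cross-entropy + (lambda / (2m)) * sum of squared Frobenius norms."""
    l2_term = (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return cross_entropy + l2_term

# Each candidate lambda requires a full training run, which is why this
# search is more computationally expensive than early stopping:
# for lambd in [0.001, 0.01, 0.1, 1.0, 10.0]:
#     model = train(lambd)                 # hypothetical full training run
#     dev_errors[lambd] = dev_error(model)
```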

I personally prefer to just use L2 regularization and try different values of lambda, assuming you can afford the computation to do so. But early stopping does let you get a similar effect without needing to explicitly try lots of different values of lambda. So you've now seen how to use data augmentation, as well as, if you wish, early stopping, in order to reduce variance or prevent overfitting in your neural network. Next, let's talk about some techniques for setting up your optimization problem to make your training go quickly.


Key takeaways:

Other regularization methods

Data augmentation: apply transformations to existing images (flips, crops, mild distortions) to obtain additional training examples;



Early stopping: stop iterating at the point just before the error on the cross-validation set starts to rise, to avoid overfitting. The drawback of this method is that it cannot treat bias and variance as two separate problems to be optimized independently.



References:

[1] 大树先生. Andrew Ng's Coursera Deep Learning course, DeepLearning.ai, distilled notes (2-1): Practical Aspects of Deep Learning.

