
Coursera | Andrew Ng (03-week2-2.4) - Training and Testing on Different Distributions

2018-01-26 13:55
This series adds some personal study notes and supplementary derivations on top of the original course material. If you find any errors, please point them out. Having taken Andrew Ng's course, I organized it into text for easier review. Since I am continually studying English, the series is primarily in English, and I also suggest readers rely mainly on the English, using Chinese only as an aid, to prepare for reading academic papers in this field later on. - ZJ

Coursera course | deeplearning.ai | 网易云课堂 (NetEase Cloud Classroom)

Please credit the author and source when reposting: ZJ, WeChat public account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79167505

2.4 Training and testing on different distributions

(Subtitle source: 网易云课堂)



Deep learning algorithms have a huge hunger for training data. They often work best when you can find enough labeled training data to put into the training set. This has resulted in many teams sometimes taking whatever data they can find and just shoving it into the training set, just to get more training data, even if some of this data, or maybe a lot of this data, doesn't come from the same distribution as your dev and test data. So in the deep learning era, more and more teams are now training on data that comes from a different distribution than their dev and test sets. And there are some subtleties and some best practices for dealing with the case when your training and test distributions differ from each other. Let's take a look.

Let's say that you're building a mobile app where users will upload pictures taken from their cell phones, and you want to recognize whether the pictures your users upload from the mobile app are cats or not. So you now have two sources of data. One is the distribution of data you really care about: data from the mobile app, like the app on the right, which tends to be less professionally shot, less well framed, and maybe even blurrier, because it's shot by amateur users. The other source of data is that you can crawl the web and, for the sake of this example, let's say you can download a lot of very professionally framed, high resolution, professionally taken images of cats.




And let's say you don't have a lot of users yet for your mobile app, so maybe you've gotten 10,000 pictures uploaded from the mobile app. But by crawling the web you can download huge numbers of cat pictures, and maybe you have 200,000 pictures of cats downloaded off the Internet. So what you really care about is that your final system does well on the mobile app distribution of images, right? Because in the end, your users will be uploading pictures like those on the right, and you need your classifier to do well on that. But you now have a bit of a dilemma, because you have a relatively small dataset, just 10,000 examples drawn from that distribution, and you have a much bigger dataset that's drawn from a different distribution, with a different appearance of image than the one you actually want. So you don't want to use just those 10,000 images, because that ends up giving you a relatively small training set. And using those 200,000 images seems helpful, but the dilemma is that these 200,000 images aren't from exactly the distribution you want.




So what can you do? Well, here's one option. One thing you can do is put both of these datasets together, so you now have 210,000 images. You can then take the 210,000 images and randomly shuffle them into a train, dev, and test set. And let's say, for the sake of argument, that you've decided your dev and test sets will be 2,500 examples each, so your training set will be 205,000 examples. Now, setting up your data this way has some advantages but also disadvantages. The advantage is that your training, dev, and test sets will all come from the same distribution, so that makes it easier to manage. But the disadvantage, and this is a huge disadvantage, is that if you look at your dev set, of these 2,500 examples, a lot of them will come from the web page distribution of images, rather than what you actually care about, which is the mobile app distribution of images.




So it turns out that of your total amount of data, 200,000, which I'll abbreviate as 200k, out of 210,000, which we'll write as 210k, comes from web pages. So of these 2,500 examples, on expectation, I think 2,381 of them will come from web pages. This is on expectation; the exact number will vary depending on how the random shuffle operation went. But on average, only 119 will come from mobile app uploads. So remember that setting up your dev set is telling your team where to aim the target. And the way you're aiming your target, you're saying spend most of the time optimizing for the web page distribution of images, which is really not what you want. So I would recommend against option one, because this is setting up the dev set to tell your team to optimize for a different distribution of data than what you actually care about.
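
To make the expectation concrete, the dev set is a random 2,500-example sample of the 210,000 pooled images, so about 2,500 × 200,000 / 210,000 ≈ 2,381 of its examples come from the web and only about 119 from the mobile app. Below is a minimal sketch, not course code, that simulates option one using simple source labels as stand-ins for the actual images:

```python
# Minimal sketch of option one: pool the web and mobile-app images,
# shuffle, and carve out 2,500-example dev and test sets.
# The strings "web" / "mobile" stand in for real images; only the source
# of each example matters for checking the dev-set composition.
import random

random.seed(0)
pool = ["web"] * 200_000 + ["mobile"] * 10_000
random.shuffle(pool)

dev, test, train = pool[:2_500], pool[2_500:5_000], pool[5_000:]

print("dev examples from web pages:", dev.count("web"))      # about 2,381 on average
print("dev examples from mobile app:", dev.count("mobile"))  # about 119 on average
print("training set size:", len(train))                      # 205,000
```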




So instead of doing this, I would recommend that you take another option, which is the following. The training set, let's say it's still 205,000 images: I would have the training set contain all 200,000 images from the web, and then you can, if you want, add in 5,000 images from the mobile app. And then for your dev and test sets, I guess my dataset sizes aren't drawn to scale, your dev and test sets would be all mobile app images. So the training set will include 200,000 images from the web and 5,000 from the mobile app. The dev set will be 2,500 images from the mobile app, and the test set will be 2,500 images, also from the mobile app. The advantage of this way of splitting up your data into train, dev, and test is that you're now aiming the target where you want it to be. You're telling your team, my dev set has data uploaded from the mobile app, and that's the distribution of images you really care about, so let's try to build a machine learning system that does really well on the mobile app distribution of images. The disadvantage, of course, is that now your training distribution is different from your dev and test set distributions. But it turns out that this split of your data into train, dev, and test will get you better performance over the long term. And we'll discuss later some specific techniques for dealing with your training set coming from a different distribution than your dev and test sets.
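
Here is a minimal sketch of this second option, again with placeholder records instead of real images (the (source, id) tuples are purely illustrative): all 200,000 web images go into training, 5,000 mobile-app images are added to training, and the remaining 5,000 mobile-app images are split evenly into dev and test.

```python
# Minimal sketch of option two: train on the web images plus a slice of the
# mobile-app images, and reserve mobile-app images for dev and test so that
# both sets match the distribution we actually care about.
import random

random.seed(0)
web_images = [("web", i) for i in range(200_000)]        # placeholder records
mobile_images = [("mobile", i) for i in range(10_000)]   # placeholder records
random.shuffle(mobile_images)

train = web_images + mobile_images[:5_000]   # 205,000 examples, mixed distribution
dev = mobile_images[5_000:7_500]             # 2,500 mobile-app examples
test = mobile_images[7_500:]                 # 2,500 mobile-app examples

print(len(train), len(dev), len(test))       # 205000 2500 2500
```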




Let's look at another example. Let's say you're building a brand new product, a speech activated rearview mirror for a car. This is a real product in China, and it's making its way into other countries. You can build a rearview mirror to replace this little thing there, so that you can now talk to the rearview mirror and basically say, dear rearview mirror, please help me find navigational directions to the nearest gas station, and it'll deal with it. So this is actually a real product, and let's say you're trying to build this for your own country. So how can you get data to train up a speech recognition system for this product? Well, maybe you've worked on speech recognition for a long time, so you have a lot of data from other speech recognition applications, just not from a speech activated rearview mirror. Here's how you could split up your training and your dev and test sets. For your training set, you can take all the speech data you have that you've accumulated from working on other speech problems, such as data you purchased over the years from various speech recognition data vendors. And today you can actually buy data from vendors of x, y pairs, where x is an audio clip and y is a transcript. Or maybe you've worked on smart speakers, smart voice activated speakers, so you have some data from that. Maybe you've worked on voice activated keyboards, and so on. And for the sake of argument, maybe you have 500,000 utterances from all of these sources.




And for your dev and test set, maybe you have a much smaller dataset that actually came from a speech activated rearview mirror, because users are asking for navigational queries or trying to find directions to various places. This dataset will maybe have a lot more street addresses, right? Please help me navigate to this street address, or please help me navigate to this gas station. So this distribution of data will be very different than the data on the left. But this is really the data you care about, because this is what you need your product to do well on, so this is what you set your dev and test set to be. So what you do in this example is set your training set to be the 500,000 utterances on the left, and then your dev and test sets, which I'll abbreviate D and T, could be maybe 10,000 utterances each, drawn from the actual speech activated rearview mirror. Or alternatively, if you think you don't need to put all 20,000 examples from your speech activated rearview mirror into the dev and test sets, maybe you can take half of that and put it in the training set. So then the training set could be 510,000 utterances, including all 500,000 from there and 10,000 from the rearview mirror. And then the dev and test sets could maybe be 5,000 utterances each. So of the 20,000 utterances, maybe 10,000 go into the training set, 5,000 into the dev set, and 5,000 into the test set. This would be another reasonable way of splitting your data into train, dev, and test. And this gives you a much bigger training set, over 500,000 utterances, than if you were to only use speech activated rearview mirror data for your training set.
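
The alternative split described above can be sketched the same way; the file names below are purely hypothetical placeholders for the audio clips, not the course's actual data.

```python
# Minimal sketch of the second speech split: all 500,000 utterances from
# other sources plus half of the 20,000 rearview-mirror utterances go into
# training; the remaining rearview-mirror utterances form dev and test.
import random

random.seed(0)
other_sources = [f"other_{i}.wav" for i in range(500_000)]   # hypothetical file names
rearview = [f"mirror_{i}.wav" for i in range(20_000)]        # hypothetical file names
random.shuffle(rearview)

train = other_sources + rearview[:10_000]   # 510,000 utterances
dev = rearview[10_000:15_000]               # 5,000 rearview-mirror utterances
test = rearview[15_000:]                    # 5,000 rearview-mirror utterances

print(len(train), len(dev), len(test))      # 510000 5000 5000
```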




So in this video, you've seen a couple of examples where allowing your training set data to come from a different distribution than your dev and test sets lets you have much more training data. And in these examples, it will cause your learning algorithm to perform better. Now, one question you might ask is, should you always use all the data you have? The answer is subtle; it is not always yes. Let's look at a counter-example in the next video.


Key points:

Training and testing on different distributions

In the deep learning era, because the amount of data required is so large, many teams now train on data that comes from a different distribution than their dev and test sets.

Below are some best practices for handling the case where the training set and the dev/test sets differ, using the cat classification problem from the previous week as an example:



We can obtain a large number of high-resolution cat images from the web for classification, say 200,000, but only a small number of blurry images taken with mobile phones, say 10,000. However, the purpose of our system is to perform classification on the mobile app.

In other words, our training set and our dev and test sets come from different distributions.

Option 1:

Merge the two sets of data to obtain 210,000 image examples in total, then randomly assign these examples to the training, dev, and test sets.

Advantage: the data in all three sets comes from the same distribution.

Disadvantage: the purpose of setting up a dev set is to aim the target, but now most of our effort goes into optimizing for the high-resolution images obtained from the web, which is not our true goal.

This is therefore not a good approach.

Option 2:

The training set consists of the 200,000 high-resolution images downloaded from the web, optionally plus 5,000 of the low-resolution mobile images; the dev and test sets consist entirely of low-resolution mobile images.

Advantage: the dev set comes entirely from mobile images, so it aims at the right target.

Disadvantage: the training set and the dev/test sets come from different distributions.

In the long run, however, this split gives us better system performance.

Personal notes:

0. Data is expensive and often has to be purchased; for an individual developer who cannot afford it, scraping data with a Python crawler is an option.

1. When the amount of target data you actually have is small and the non-target data is large, the training, dev, and test sets cannot all come from exactly the same distribution. In that case, prioritize keeping the dev and test sets on the same distribution, the one defined by your true target data, and put the small amount of remaining precious target data into the training set.

References:

[1] 大树先生. Distilled notes on Andrew Ng's Coursera deep learning course DeepLearning.ai (3-2) – Machine Learning Strategy (2).

PS: You are welcome to scan the QR code and follow the public account 「SelfImprovementLab」, which focuses on deep learning, machine learning, and artificial intelligence, with occasional group check-in activities on early rising, reading, exercise, English, and more.
