
Coursera | Andrew Ng (03-week2-2.6): Addressing data mismatch

2018-01-27 12:49

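This series adds my own study notes and supplementary derivations to selected points of the original course. If you find errors, corrections are welcome. After working through Andrew Ng's course, I wrote it up as text so that it is easier to look up and review. Because I am studying English myself, the series is mainly in English, and I suggest that readers also work primarily from the English and use Chinese only as an aid, as preparation for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom

When reposting, please credit the author and source: ZJ, WeChat public account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79177042

2.6 Addressing data mismatch

(Subtitle source: NetEase Cloud Classroom)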

If your training set comes from a different distribution than your dev and test sets, and if error analysis shows you that you have a data mismatch problem, what can you do? There aren't completely systematic solutions to this, but let's look at some things you could try. If I find that I have a large data mismatch problem, what I usually do is carry out manual error analysis and try to understand the differences between the training set and the dev/test sets. To avoid overfitting the test set, technically, for error analysis you should manually look only at the dev set and not at the test set. As a concrete example, if you're building the speech-activated rear-view mirror application, you might look at or, I guess, if it's speech, listen to examples in your dev set to try to figure out how your dev set is different from your training set. So, for example, you might find that a lot of dev set examples are very noisy and there's a lot of car noise. That is one way that your dev set differs from your training set. And maybe you find other categories of errors. For example, in the speech-activated rear-view mirror in your car, you might find that it's often misrecognizing street numbers, because there are a lot more navigational queries, which contain street addresses. So getting street numbers right is really important. Once you have insight into the nature of the dev set errors, or into how the dev set may be different from or harder than your training set, you can act on it.

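As a rough illustration of the manual error analysis described above, the sketch below tallies hand-assigned error categories for misrecognized dev-set clips. The clip IDs, the tag names (`car_noise`, `street_number`, `other`), and the `tally_error_categories` helper are hypothetical and do not come from the course. Only dev-set clips are inspected, in line with the advice not to look at the test set.

```python
from collections import Counter

# Hypothetical notes from listening to misrecognized dev-set clips.
# Each entry lists the error categories a reviewer assigned to one clip.
dev_error_notes = [
    {"clip": "dev_001", "tags": ["car_noise"]},
    {"clip": "dev_002", "tags": ["car_noise", "street_number"]},
    {"clip": "dev_003", "tags": ["street_number"]},
    {"clip": "dev_004", "tags": ["car_noise"]},
    {"clip": "dev_005", "tags": ["other"]},
]


def tally_error_categories(notes):
    """Return (tag, share of reviewed clips) pairs, most frequent tag first."""
    counts = Counter(tag for note in notes for tag in note["tags"])
    total = len(notes)
    return [(tag, count / total) for tag, count in counts.most_common()]


for tag, share in tally_error_categories(dev_error_notes):
    print(f"{tag}: {share:.0%} of reviewed dev errors")
```

Because one clip can carry several tags, the shares may add up to more than 100%. The ranking only suggests where the dev set differs most from the training set; it does not guarantee that fixing the top category will help.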

What you can do then is try to find ways to make the training data more similar to the dev set. Or, alternatively, try to collect more data similar to your dev and test sets. So, for example, if you find that car noise in the background is a major source of error, one thing you could do is simulate noisy in-car data. I'll say a little bit more about how to do this on the next slide. Or, if you find that you're having a hard time recognizing street numbers, maybe you can deliberately go and get more data of people speaking out numbers and add that to your training set. Now, I realize that this slide gives only a rough guideline for things you could try. This isn't a systematic process and, I guess, there's no guarantee that you get the insights you need to make progress. But I have found that this manual insight, together with trying to make the data more similar on the dimensions that matter, often helps on a lot of problems. So, if your goal is to make the training data more similar to your dev set, what are some things you can do? One of the techniques you can use is artificial data synthesis, so let's discuss that in the context of addressing the car noise problem.


So, to build a speech recognition system, maybe you don't have a lot of audio that was actually recorded inside the car with the background noise of a car, background noise of a highway, and so on. But, it turns out, there's a way to synthesize it. So, let's say that you've recorded a large amount of clean audio without this car background noise. Here's an example of a clip you might have in your training set. By the way, this sentence is used a lot in AI for testing because it is a short sentence that contains every letter of the alphabet from A to Z, so you see this sentence a lot. Given that recording of "the quick brown fox jumps over the lazy dog," you can then also get a recording of car noise like this. So, that's what the inside of a car sounds like if you're driving in silence. And if you take these two audio clips and add them together, you can then synthesize what saying "the quick brown fox jumps over the lazy dog" would sound like if you were saying it in a noisy car. So, it sounds like this. This is a relatively simple audio synthesis example. In practice, you might synthesize other audio effects like reverberation, which is the sound of your voice bouncing off the walls of the car, and so on. But through artificial data synthesis, you might be able to quickly create more data that sounds like it was recorded inside the car, without needing to go out there and collect tons of data, maybe thousands or tens of thousands of hours of data in a car that's actually driving along.

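The "add the two clips together" step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used in the course: the sine wave and Gaussian noise stand in for real recordings, and the `mix_with_noise` helper, the `snr_db` parameter, and the 16 kHz sample rate are choices made here for the example. Scaling the noise to a target signal-to-noise ratio is one common way to control how loud the added background is.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a list of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_with_noise(speech, noise, snr_db, rng):
    """Add a randomly positioned noise segment to speech at a target SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise recording must be at least as long as the speech clip")
    # Pick a random window of the noise recording instead of always using its start.
    start = rng.randrange(len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]
    # Scale the noise so that speech power / scaled noise power == 10 ** (snr_db / 10).
    scale = math.sqrt(mean_power(speech) / (mean_power(segment) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, segment)]


rng = random.Random(0)
sample_rate = 16000  # assumed sample rate for this example
# Stand-ins for real audio: one second of "clean speech" and three seconds of "car noise".
speech = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 0.1) for _ in range(3 * sample_rate)]

noisy = mix_with_noise(speech, noise, snr_db=10, rng=rng)
print(len(noisy), "samples synthesized")
```

With real data, the lists would be replaced by decoded audio arrays, and effects such as reverberation would need additional processing that this sketch does not attempt.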

So, if your error analysis shows you that you should try to make your data sound more like it was recorded inside the car, then this could be a reasonable process for synthesizing that type of data to give to your learning algorithm. Now, there is one note of caution I want to sound on artificial data synthesis. Let's say you have 10,000 hours of data that was recorded against a quiet background, and let's say that you have just one hour of car noise. One thing you could try is to take this one hour of car noise and repeat it 10,000 times in order to add it to the 10,000 hours of data recorded against a quiet background. If you do that, the audio will sound perfectly fine to the human ear, but there is a risk that your learning algorithm will overfit to the one hour of car noise. In particular, if this is the set of all audio that you could record in the car, or maybe the set of all car noise backgrounds you can imagine, then with just one hour of car noise you might be synthesizing from a very small subset of this space. To the human ear, all this audio sounds just fine, because one hour of car noise sounds just like any other hour of car noise. But it's possible that you're synthesizing data from a very small subset of this space, and the neural network might be overfitting to the one hour of car noise that you have.

I don't know if it will be practically feasible to inexpensively collect 10,000 hours of car noise, so that you don't need to repeat the same one hour of car noise over and over but instead have 10,000 unique hours of car noise to add to 10,000 hours of unique audio recorded against a clean background. There are no guarantees, but it is possible that using 10,000 hours of unique car noise, rather than just one hour, could result in better performance from your learning algorithm. And the challenge with artificial data synthesis is that, as far as your ears can tell, these 10,000 hours all sound the same as this one hour, so you might end up creating a very impoverished synthesized data set from a much smaller subset of the space without actually realizing it.

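The arithmetic behind this caution is simple enough to write down. The helper below is a hypothetical diagnostic, not something from the course: it reports how many times, on average, each hour of noise is reused across the synthesized set. The lecture does not give a safe value for this number, so the sketch only computes it for the two scenarios mentioned above.

```python
def noise_reuse_factor(speech_hours, unique_noise_hours):
    """Average number of times each hour of noise is reused in the synthesized set."""
    if unique_noise_hours <= 0:
        raise ValueError("need a positive amount of noise audio")
    return speech_hours / unique_noise_hours


# The two scenarios from the lecture: one hour of car noise versus 10,000 unique hours.
print(noise_reuse_factor(10_000, 1))       # 10000.0
print(noise_reuse_factor(10_000, 10_000))  # 1.0
```

A human listener cannot hear the difference between these two data sets, which is why a number like this has to be tracked explicitly rather than judged by ear.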

Here's another example of artificial data synthesis. Let's say you're building a self-driving car, and so you want to detect vehicles like this and put a bounding box around each one. One idea that a lot of people have discussed is: why not use computer graphics to simulate tons of images of cars? In fact, here are a couple of pictures of cars that were generated using computer graphics. I think these graphics effects are actually pretty good, and I can imagine that by synthesizing pictures like these, you could train a pretty good computer vision system for detecting cars.


Unfortunately, the picture that I drew on the previous slide applies again in this setting. Maybe this is the set of all cars and, if you synthesize just a very small subset of these cars, then to the human eye the synthesized images may look fine. But you might overfit to the small subset you're synthesizing. In particular, one idea that a lot of people have independently raised is to find a video game with good computer graphics of cars, grab images from it, and get a huge data set of pictures of cars. It turns out that if the video game has just 20 unique cars, the game still looks fine, because you're driving around in it, you see these 20 other cars, and it looks like a pretty realistic simulation. But the world has a lot more than 20 unique designs of cars, and if your entire synthesized training set has only 20 distinct cars, then your neural network will probably overfit to these 20 cars. And it's difficult for a person to tell that, even though these images look realistic, you're really covering only a tiny subset of the set of all possible cars.

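One way to notice this problem before training is to count how many distinct source assets a synthetic image set actually contains. The sketch below assumes each synthetic image comes with a metadata record naming the 3D car model that was rendered; the `car_model` field, the file names, and the `asset_diversity` helper are hypothetical.

```python
from collections import Counter


def asset_diversity(records, key="car_model"):
    """Return the number of distinct assets and the (asset, image count) of the most used one."""
    counts = Counter(record[key] for record in records)
    return len(counts), counts.most_common(1)[0]


# Hypothetical metadata: 1,000 screenshots rendered from only 20 car models.
records = [
    {"image": f"frame_{i:05d}.png", "car_model": f"model_{i % 20}"}
    for i in range(1000)
]

unique_models, (top_model, top_count) = asset_diversity(records)
print(f"{len(records)} images, {unique_models} distinct car models")
print(f"most used model appears in {top_count} images")
```

A large image count next to a small number of distinct models is the situation described above: the data set looks big and realistic but covers only a small part of the space of possible cars.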

So, to summarize, if you think you have a data mismatch problem, I recommend you do error analysis, or look at the training set, or look at the dev set, to try to gain insight into how these two distributions of data might differ. And then see if you can find some ways to get more training data that looks a bit more like your dev set. One of the ways we talked about is artificial data synthesis. And artificial data synthesis does work. In speech recognition, I've seen artificial data synthesis significantly boost the performance of what was already a very good speech recognition system. So, it can work very well. But if you're using artificial data synthesis, just be cautious and bear in mind whether or not you might be accidentally simulating data only from a tiny subset of the space of all possible examples. So, that's it for how to deal with data mismatch. Next, I'd like to share with you some thoughts on how to learn from multiple types of data at the same time.


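Key points:

Addressing data distribution mismatch

Suppose the error analysis from the previous section tells us that the model's error on the dev and test sets is ultimately caused by a mismatch between data distributions. How do we address it?

Carry out manual error analysis and try to understand the specific differences between the training set and the dev/test sets, for example noise;

Try to make the training data more similar to the dev set, or collect more data similar to the dev and test sets, for example by adding noise;

One way to obtain such data is artificial data synthesis. It does work, but be cautious: simulating data from only a small part of the space of all possibilities may lead to overfitting.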


