Abstract: This article is the transcript of Lecture 98, "Data for Machine Learning", from Chapter 12, "Machine Learning System Design", of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it to make it more concise and easier to read, so that it can be consulted later. I am sharing it here in the hope that it helps with your studies. If there are any mistakes, corrections are warmly welcome and sincerely appreciated.
————————————————
In the previous video, we talked about evaluation metrics. I'd like to switch tracks a bit and touch on another important aspect of machine learning system design that will often come up: the issue of how much data to train on. In some earlier videos, I cautioned against blindly going out and spending lots of time collecting lots of data, because it's only sometimes that it will actually help. But it turns out that under certain conditions, and I will say in this video what those conditions are, getting a lot of data and training a certain type of learning algorithm on it can be a very effective way to get a learning algorithm with very good performance. This arises often enough that, if those conditions hold true for your problem and you're able to get a lot of data, this could be a very good way to get a very high performance learning algorithm. So, in this video, let's talk more about that.
Let me start with a story. Many years ago, two researchers, Michele Banko and Eric Brill, ran the following fascinating study. They were interested in studying the effect of using different learning algorithms versus trying them out on different training set sizes. They were considering the problem of classifying between confusable words. For example, in the sentence "For breakfast I ate ___ eggs", should the blank be "to", "two" or "too"? Well, for this example, it's "For breakfast I ate two eggs". So this is one example of a set of confusable words, and there are other such sets. They took supervised learning problems like these, trying to categorize which word is appropriate for a certain position in an English sentence.

They took a few different learning algorithms that were considered state of the art back when they ran the study in 2001. They took a variant of logistic regression called the Perceptron. They also took some other algorithms: the Winnow algorithm, which is also very similar to regression but different in some ways, and a naive Bayes algorithm, which is something I'll actually talk about in this course. The exact details of these algorithms aren't important. What they did was vary the training set size, try out these learning algorithms over that range of training set sizes, and this is the result they got. The trends are very clear. First, most of these algorithms give remarkably similar performance. Second, as the training set size increases (the horizontal axis is the training set size in millions, going from a hundred thousand up to a thousand million, that is, a billion training examples), the performance of all the algorithms increases pretty much monotonically. In fact, if you picked any algorithm, maybe an "inferior" algorithm, but gave that "inferior" algorithm more data, then from these examples it looks like it will most likely beat even a "superior" algorithm.

This original study was very influential, and since then there has been a range of different studies showing similar results: many different learning algorithms can sometimes, depending on the details, give pretty similar ranges of performance, but what can really drive performance is giving the algorithm a ton of training data. Results like these have led to a saying in machine learning that often it's not who has the best learning algorithm that wins, it's who has the most data. So when is this true and when is this not true? Because if we have a learning algorithm for which this is true, then getting a lot of data is often maybe the best way to ensure we have an algorithm with very high performance, rather than debating and worrying about exactly which of these algorithms to use.
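To make that setup concrete, here is a minimal sketch of this kind of experiment in Python, assuming scikit-learn is available. The synthetic dataset and the particular models (Perceptron, logistic regression, Gaussian naive Bayes) are illustrative stand-ins, not the algorithms or data of the 2001 study; the point is only the shape of the experiment: fix a test set, grow the training set, and compare algorithms at each size.

```python
# Sketch of a Banko & Brill style experiment: train several off-the-shelf
# classifiers on increasing training-set sizes and compare test accuracy.
# The dataset and models are placeholders, not those of the original study.
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

models = {
    "Perceptron": Perceptron(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

for m in [1_000, 10_000, 100_000]:           # increasing training-set sizes
    for name, model in models.items():
        model.fit(X_train[:m], y_train[:m])  # train on the first m examples
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"m={m:>7}  {name:<20} accuracy={acc:.3f}")
```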
Let's try to lay out a set of assumptions under which having a massive training set will, we think, be able to help. Let's assume that in our machine learning problem, the features x have sufficient information with which we can predict y accurately. For example, take the confusable words problem from the previous slide, and say the features capture the surrounding words around the blank that we're trying to fill in. So the features capture the context: in the sentence "For breakfast I ate ___ eggs", that is pretty much enough information to tell me that the word I want in the middle is "two", and not the word "to" or "too". The features capture the surrounding words, and those give me enough information to pretty unambiguously decide what the label y is, or in other words which word out of this set of three confusable words should be used in that blank. So that's an example where the features x carry sufficient information to predict y.
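As a purely hypothetical illustration of such features (the window_features helper and its two-word window below are made up for this sketch, not something from the lecture), one could represent each blank by the words around it:

```python
# Hypothetical sketch of "surrounding word" features: for a blank in a
# sentence, use the nearby words as the features x; the word that actually
# filled the blank ("to"/"two"/"too") would be the label y.
def window_features(tokens, blank_index, window=2):
    """Return the words within `window` positions of the blank as a feature dict."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the blank itself
        pos = blank_index + offset
        if 0 <= pos < len(tokens):
            feats[f"word_at_{offset:+d}"] = tokens[pos].lower()
    return feats

tokens = ["For", "breakfast", "I", "ate", "__", "eggs"]
print(window_features(tokens, blank_index=4))
# {'word_at_-2': 'i', 'word_at_-1': 'ate', 'word_at_+1': 'eggs'}
```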
For a counterexample, consider the problem of predicting the price of a house from only the size of the house and no other features. So imagine I tell you that a house is 500 square feet, but I don't give you any other features. I don't tell you whether the house is in an expensive part of the city, or the number of rooms in the house, or how nicely furnished the house is, or whether the house is new or old. If I don't tell you anything other than that this is a 500 square foot house, there are so many other factors that affect the price of a house that, if all you know is the size, it's actually very difficult to predict the price accurately. So that would be a counterexample to this assumption that the features have sufficient information to predict the price to the desired level of accuracy.

The way I think about testing this assumption, one way I often think about it, is to ask myself: given the features x, given the same information available to the learning algorithm, can a human expert in this domain confidently predict the value of y? For the first example, if we go to an expert human English speaker, that expert would probably be able to predict what word should go in the blank. A good English speaker can predict this well, so this gives me confidence that x allows us to predict y accurately. But in contrast, if we go to an expert in housing prices, maybe an expert realtor, someone who sells houses for a living, and I just tell them the size of a house and ask them what the price is, even an expert in pricing or selling houses would not be able to tell me. So this is a sign that, for the housing price example, knowing only the size doesn't give me enough information to predict the price of the house. So, let's say this assumption holds. Let's see then when having a lot of data could help.
Suppose the features have enough information to predict the value of y, and suppose we use a learning algorithm with a large number of parameters, maybe logistic regression or linear regression with a large number of features, or, something I sometimes do, a neural network with many hidden units, which would be another learning algorithm with a lot of parameters. These are all powerful learning algorithms with a lot of parameters that can fit very complex functions. I am going to think of these as low-bias algorithms because they can fit very complex functions. And because we have a very powerful learning algorithm, chances are that, if we run it on the data set, it will be able to fit the training set well, and hopefully the training error will be small ($J_{train}(\theta)$ will be small).

Now, let's say we use a massive training set. In that case, even though we have a lot of parameters, if the training set is much larger than the number of parameters, then hopefully these algorithms will be unlikely to overfit, because we have such a massive training set. And by "unlikely to overfit" what I mean is that the training error will hopefully be close to the test error ($J_{train}(\theta) \approx J_{test}(\theta)$). Finally, putting these two together, if the training error is small and the training error is close to the test error, then these two together imply that hopefully the test set error ($J_{test}(\theta)$) will also be small.

Another way to think about this is that, in order to have a high performance learning algorithm, we want it not to have high bias and not to have high variance. The bias problem we address by making sure we have a learning algorithm with many parameters, which gives us a low-bias algorithm; and by using a very large training set, we ensure that we don't have a variance problem either, so hopefully our algorithm will have low variance. Putting these two together, we end up with a low bias and low variance learning algorithm, and this allows us to do well on the test set. Fundamentally, the key ingredients are, first, assuming that the features have enough information and that we have a rich class of functions, which is what guarantees low bias, and second, having a massive training set, which is what guarantees low variance.
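As a rough empirical illustration of this argument, here is a minimal sketch, again assuming scikit-learn and a synthetic placeholder dataset: as the training set grows, the gap between the training error and the test error of a reasonably flexible model should shrink, so a small $J_{train}(\theta)$ increasingly implies a small $J_{test}(\theta)$.

```python
# Sketch of the low-bias + large-training-set argument: with a flexible model,
# the gap between training error and test error shrinks as m grows, so a small
# training error eventually implies a small test error. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, n_features=100,
                           n_informative=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

for m in [200, 2_000, 20_000, 80_000]:       # growing training-set sizes
    model = LogisticRegression(max_iter=2000).fit(X_train[:m], y_train[:m])
    j_train = 1 - model.score(X_train[:m], y_train[:m])  # training error
    j_test = 1 - model.score(X_test, y_test)             # test error
    print(f"m={m:>6}  J_train={j_train:.3f}  J_test={j_test:.3f}")
```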
So this gives us a set of conditions, and hopefully some understanding, of the sort of problem where, if you have a lot of data and you train a learning algorithm with a lot of parameters, that might be a good way to get a high performance learning algorithm. And really, I think the key tests that I often ask myself are: first, can a human expert look at the features x and confidently predict the value of y? Because that's sort of a certification that y can be predicted accurately from the features x. And second, can we actually get a large training set and train a learning algorithm with a lot of parameters on that training set? If you can do both, then more often than not you will end up with a very high performance learning algorithm.
<end>