- Learning From Data: A Short Course
- Abu-Mostafa Y.S., Magdon-Ismail M., Lin H.-T. Learning From Data: A short course

The recommended textbook covers 14 out of the 18 lectures. The rest is covered by online material that is freely available to the book readers. This book, together with specially prepared online material freely accessible to our readers, provides a complete introduction to Machine Learning.


The book website AMLbook.com contains supporting material for instructors and readers. There is also a forum that covers additional topics in learning from data.

Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. This book was typeset by the authors and was printed and bound in the United States of America. No part of this publication may be reproduced. This work may not be translated or copied in whole or in part without the written permission of the authors. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. The authors shall not be liable for any loss of profit or any other commercial damages.

Imagine a situation in backgammon where you have a choice between different actions and you want to identify the best action.

It is not a trivial task to ascertain what the best action is at a given stage of the game, so we cannot easily create supervised learning examples. If you use reinforcement learning instead, all you need to do is to take some action and report how well things went, and you have a training example.

Figure caption: They still fall into clusters. The rule may be somewhat ambiguous, as type 1 and type 2 could be viewed as one cluster.

The reinforcement learning algorithm is left with the task of sorting out the information coming from different examples to find the best line of play. In unsupervised learning, by contrast, we are just given input examples $x_1, \dots, x_N$. You may wonder how we could possibly learn anything from mere inputs. Consider the coin classification problem that we discussed earlier in Figure 1.

Suppose that we didn't know the denomination of any of the coins in the data set. This unlabeled data is shown in Figure 1. We still get similar clusters, but they are now unlabeled so all points have the same 'color'. The decision regions in unsupervised learning may be identical to those in supervised learning, but without the labels Figure 1.

However, the correct clustering is less obvious now, and even the number of clusters may be ambiguous. Nonetheless, this example shows that we can learn something from the inputs by themselves. Unsupervised learning can be viewed as the task of spontaneously finding patterns and structure in input data. For instance, if our task is to categorize a set of books into topics, and we only use general properties of the various books, we can identify books that have similar properties and put them together in one category, without naming that category.

Imagine that you don't speak a word of Spanish, but your company will relocate you to Spain next month. They will arrange for Spanish lessons once you are there, but you would like to prepare yourself a bit before you go.

All you have access to is a Spanish radio station. For a full month, you continuously bombard yourself with Spanish; this is an unsupervised learning experience since you don't know the meaning of the words. However, you gradually develop a better representation of the language in your brain by becoming more tuned to its common sounds and structures.

When you arrive in Spain, you will be in a better position to start your Spanish lessons. Indeed, unsupervised learning can be a precursor to supervised learning. In other cases, it is a stand-alone technique.

If a task can fit more than one type, explain how and describe the training data for each type.

As a result, learning from data is a diverse subject with many aliases in the scientific literature. The main field dedicated to the subject is called machine learning, a name that distinguishes it from human learning. We briefly mention two other important fields that approach learning from data in their own ways. Statistics shares the basic premise of learning from data, namely the use of a set of observations to uncover an underlying process.

In this case, the process is a probability distribution and the observations are samples from that distribution. Because statistics is a mathematical field, emphasis is given to situations where most of the questions can be answered with rigorous proofs.

As a result, statistics focuses on somewhat idealized models and analyzes them in great detail. This is the main difference between the statistical approach and the approach taken here. We make less restrictive assumptions and deal with more general models than in statistics. Therefore, we end up with weaker results that are nonetheless broadly applicable.

Figure caption: The first two rows show the training examples (each input x is a 9-bit vector represented visually as a 3 x 3 black-and-white array). Your task is to learn from this data set what f is, then apply f to the test input at the bottom.

Data mining is a practical field that focuses on finding patterns, correlations, or anomalies in large relational databases.

For example, we could be looking at medical records of patients and trying to detect a cause-effect relationship between a particular drug and long-term effects. We could also be looking at credit card spending patterns and trying to detect potential fraud. Technically, data mining is the same as learning from data, with more emphasis on data analysis than on prediction.

Because databases are usually huge, computational issues are often critical in data mining. Recommender systems, which were illustrated in Section 1.

The target function f is the object of learning. The most important assertion about the target function is that it is unknown. We really mean unknown. This raises a natural question. How could a limited data set reveal enough information to pin down the entire target function? A simple learning task with 6 training examples of a ±1 target function is shown.

Try to learn what the function is then apply it to the test input given. Now, show the problem to your friends and see if they get the same answer. The chances are the answers were not unanimous, and for good reason.

Both functions agree with all the examples in the data set, so there isn't enough information to tell us which would be the correct answer.

This does not bode well for the feasibility of learning. To make matters worse, we will now see that the difficulty we experienced in this simple problem is the rule, not the exception. This doesn't mean that we have learned f, since it doesn't guarantee that we know anything about f outside of V.


Use binomial distribution. By contrast. We can get mostly green marbles in the sample while the bin has mostly red marbles.

It states that for any sample size N. $x_N$ in V are picked independently according to P. If not. The training examples play the role of a sample from the bin. P can be unknown to us as well. If the sample was not randomly selected but picked in a particular way. With this equivalence. Take any single hypothesis $h \in \mathcal{H}$ and compare it to f on each point $x \in \mathcal{X}$. In real learning. How does the bin model relate to the learning problem? The color that each point gets is not known to us.
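The bin model can be checked numerically. The sketch below assumes the inequality being described is the Hoeffding Inequality in its usual form, P[|ν − μ| > ε] ≤ 2·exp(−2ε²N), where μ is the fraction of red marbles in the bin and ν the fraction in a sample of size N. The function names and the values of μ, N and ε are illustrative choices, not taken from the text.

```python
import math
import random


def hoeffding_bound(n, eps):
    """Right-hand side of the Hoeffding Inequality: 2 * exp(-2 * eps^2 * n)."""
    return 2.0 * math.exp(-2.0 * eps * eps * n)


def estimate_deviation_probability(mu, n, eps, trials=20000, seed=0):
    """Monte Carlo estimate of P[|nu - mu| > eps] for samples of size n.

    mu is the probability that a single marble drawn from the bin is red.
    """
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        reds = sum(1 for _ in range(n) if rng.random() < mu)
        nu = reds / n  # fraction of red marbles in this sample
        if abs(nu - mu) > eps:
            bad += 1
    return bad / trials


if __name__ == "__main__":
    mu, n, eps = 0.9, 10, 0.3  # assumed illustrative values
    print("estimated probability:", estimate_deviation_probability(mu, n, eps))
    print("Hoeffding bound:      ", hoeffding_bound(n, eps))
```

The estimate should sit well below the bound, which is consistent with the bound being loose but valid for any μ.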

The two situations can be connected. Let us see if we can extend the bin equivalence to the case where we have multiple hypotheses in order to capture real learning. If v happens to be close to zero.

The learning problem is now reduced to a bin problem. If the inputs xi.

If we have only one hypothesis to begin with. The probability is based on the distribution P over X which is used to sample the data points x. Probability added to the basic learning setup To do that. The error rate within the sample. In the same way. We have made explicit the dependency of Ein on the particular h that we are considering.

Let us consider an entire hypothesis set $\mathcal{H}$ instead of just one hypothesis h. If you are allowed to change h after you generate the data set. Each bin still represents the input space $\mathcal{X}$. The probability of red marbles in the mth bin is $E_{out}(h_m)$ and the fraction of red marbles in the mth sample is $E_{in}(h_m)$. With multiple hypotheses in $\mathcal{H}$. The out-of-sample error $E_{out}$. Why is that?

The in-sample error $E_{in}$. $c_{min}$ is the coin that had the minimum frequency of heads (pick the earlier one in case of a tie).

$c_{rand}$ is a coin you choose at random. Run a computer simulation for flipping 1. The hypothesis g is not fixed ahead of time before generating the data. There is a simple but crude way of doing that. Since g has to be one of the $h_m$'s regardless of the algorithm and the sample. Let's focus on 3 coins as follows: $\nu_{rand}$ and $\nu_{min}$ be the fraction of heads you obtain for the respective three coins.

Let $\nu_1$. Flip each coin independently times. $\nu_{rand}$ and plot the histograms of the distributions of $\nu_1$. The next exercise considers a simple coin experiment that further illustrates the difference between a fixed h and the final hypothesis g selected by the learning algorithm. BM are any events. If we insist on a deterministic answer. If we accept a probabilistic answer. The question of whether V tells us anything outside of V that we didn't know before has two different answers.
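The coin experiment can be sketched in a few lines. The number of coins, flips and repetitions are elided in the text above, so the defaults below (1000 fair coins, 10 flips each, 200 repetitions) are assumptions made only for illustration, as are the function names. The first coin and the random coin play the role of a hypothesis fixed before seeing the data, while the minimum coin is chosen after looking at the data.

```python
import random


def run_once(num_coins, num_flips, rng):
    """Flip every fair coin num_flips times; return (v_1, v_rand, v_min)."""
    fractions = [
        sum(rng.random() < 0.5 for _ in range(num_flips)) / num_flips
        for _ in range(num_coins)
    ]
    v_1 = fractions[0]                            # the first coin
    v_rand = fractions[rng.randrange(num_coins)]  # a coin chosen at random
    v_min = min(fractions)  # only the value is used, so tie-breaking is irrelevant
    return v_1, v_rand, v_min


def average_fractions(num_coins=1000, num_flips=10, runs=200, seed=0):
    """Average v_1, v_rand and v_min over several independent runs."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0]
    for _ in range(runs):
        for i, v in enumerate(run_once(num_coins, num_flips, rng)):
            totals[i] += v
    return tuple(t / runs for t in totals)


if __name__ == "__main__":
    v_1, v_rand, v_min = average_fractions()
    print("average v_1:   ", v_1)
    print("average v_rand:", v_rand)
    print("average v_min: ", v_min)
```

The averages for the first and random coins should stay near 0.5, while the minimum coin lands far below it even though every coin is fair. That gap is the reason a single-hypothesis bound cannot be applied directly to a hypothesis selected after seeing the sample.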

Let us reconcile the two arguments. $B_1 \implies B_2$ means that event $B_1$ implies event $B_2$. One argument says that we cannot learn anything outside of V.

We would like to reconcile these two arguments and pinpoint the sense in which learning is feasible: Putting the two rules together.

We will improve on that in Chapter 2. We now apply two basic rules in probability. We have thus traded the condition $E_{out}(g) \approx 0$. Let us pin down what we mean by the feasibility of learning. Remember that $E_{out}(g)$ is an unknown quantity. S smart and crazy. We don't insist on using any particular probability distribution.

We cannot guarantee that we will find a hypothesis that achieves $E_{in}(g) \approx 0$. What enabled this is the Hoeffding Inequality 1. That's what makes the Hoeffding Inequality applicable. What we get instead is $E_{out}(g) \approx E_{in}(g)$. If learning is successful. We still have to make $E_{in}(g) \approx 0$ in order to conclude that $E_{out}(g) \approx 0$.

By adopting the probabilistic view. Assume in the probabilistic view that there is a probability distribution on X. We consider two learning algorithms. Is it possible that the hypothesis that produces turns out to be better than the hypothesis that S produces? Of course this ideal situation may not always happen in practice. The second question is answered after we run the learning algorithm on the actual data and see how small we can get $E_{in}$ to be. If the number of hypotheses M goes up.

She is willing to pay you to solve her problem and produce for her a g which approximates f. What is the best that you can promise her among the following: If you do return a hypothesis g.

Breaking down the feasibility of learning into these two questions provides further insight into the role that different components of the learning problem play.

Financial forecasting is an example where market unpredictability makes it impossible to get a forecast that has anywhere near zero error. Can we make $E_{in}(g)$ small enough? The Hoeffding Inequality 1. The feasibility of learning is thus split into two questions: One such insight has to do with the 'complexity' of these components. All we hope for is a forecast that gets it right more often than not. Can we make sure that $E_{out}(g)$ is close enough to $E_{in}(g)$? If we get that. This means that a hypothesis that has $E_{in}(g)$ somewhat below 0.
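The first of the two questions can be made quantitative. The sketch below assumes the finite-hypothesis bound P[|E_in(g) − E_out(g)| > ε] ≤ 2M·exp(−2ε²N), which is the Hoeffding Inequality combined with the union bound over M hypotheses. It solves that bound for the number of examples N needed to reach a chosen tolerance δ. The function names and numeric values are illustrative.

```python
import math


def union_hoeffding_bound(m, n, eps):
    """Bound 2 * M * exp(-2 * eps^2 * N) for M hypotheses and N examples."""
    return 2.0 * m * math.exp(-2.0 * eps ** 2 * n)


def samples_needed(m, eps, delta):
    """Smallest N for which the bound above is at most delta."""
    return math.ceil(math.log(2.0 * m / delta) / (2.0 * eps ** 2))


if __name__ == "__main__":
    m, eps, delta = 100, 0.1, 0.05  # assumed illustrative values
    n = samples_needed(m, eps, delta)
    print("examples needed:", n)
    print("bound at that N:", union_hoeffding_bound(m, n, eps))
```

Because M enters only through a logarithm, multiplying the number of hypotheses by ten adds a fixed number of examples rather than multiplying the requirement.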

Even when we cannot learn a particular f. In many situations. If we want an affirmative answer to the first question. If we fix the hypothesis set and the number of training examples. If the target function is complex. What are the ramifications of having such a 'noisy' target on the learning problem? The first notion is what approximation means when we say that our hypothesis approximates the target function well. In the extreme case.

Either way we look at it. Let us examine if this can be inferred from the two questions above. The second notion is about the nature of the target function. The complexity of f.

We might try to get around that by making our hypothesis set more complex so that we can fit the data better and get a lower $E_{in}(g)$. A close look at Inequality 1. This is obviously a practical observation. Remember that 1. This means that we will get a worse value for $E_{in}(g)$ when f is complex. The choice of an error measure affects the outcome of the learning process.

What are the criteria for choosing one error measure over another? We address this question here. J as the 'cost' of using h when you should use f. An error measure quantifies how well each hypothesis h in the model approximates the target function f. This cost depends on what h is used for.

The final hypothesis g is only an approximation of f. In an ideal world. Different error measures may lead to different choices of the final hypothesis. Example 1. Here is a case in point. So far. Consider the problem of verifying that a fingerprint belongs to a particular person. While E h.

One may view E h. If we define a pointwise error measure e h x. J should be user-specified. The same learning task in different contexts may warrant the use of different error measures. All future revenue from this annoyed customer is lost. The costs of the different types of errors can be tabulated in a matrix. In the supermarket and CIA scenarios.

You just gave away a discount to someone who didn't deserve it. On the other hand. We need to specify the error values for a false accept and for a false reject. For the supermarket. The right values depend on the application. Consider two potential clients of this fingerprint system.

This should be reflected in a much higher cost for the false accept. The inconvenience of retrying when rejected is just part of the job. The other is the CIA who will use it at the entrance to a secure facility to verify that you are authorized to enter that facility. For our examples. For the CIA. An unauthorized person will gain access to a highly sensitive facility. If the right person is accepted or an intruder is rejected. The moral of this example is that the choice of the error measure depends on how the system is going to be used.
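A cost matrix of this kind turns into a weighted in-sample error quite directly. The sketch below is a minimal illustration: the function name and the numeric costs (1 and 10 for the supermarket, 1000 and 1 for the CIA) are assumptions chosen only to reflect the qualitative ordering described above, with +1 meaning "accept" and −1 meaning "reject".

```python
def weighted_error(predictions, labels, false_accept_cost, false_reject_cost):
    """Average cost of predictions h(x) against true labels f(x).

    A false accept is h = +1 when f = -1; a false reject is h = -1 when f = +1.
    Correct decisions cost nothing.
    """
    total = 0.0
    for h, f in zip(predictions, labels):
        if h == +1 and f == -1:
            total += false_accept_cost
        elif h == -1 and f == +1:
            total += false_reject_cost
    return total / len(labels)


if __name__ == "__main__":
    labels = [+1, +1, -1, -1]
    predictions = [+1, -1, +1, -1]  # one false reject, one false accept

    # Hypothetical cost choices for the two clients.
    supermarket = weighted_error(predictions, labels,
                                 false_accept_cost=1, false_reject_cost=10)
    cia = weighted_error(predictions, labels,
                         false_accept_cost=1000, false_reject_cost=1)
    print("supermarket cost:", supermarket)
    print("CIA cost:        ", cia)
```

The same predictions receive very different scores under the two matrices, so a learning algorithm minimizing one would generally pick a different final hypothesis than one minimizing the other.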

False rejects. One is that the user may not provide an error specification. The other is that the weighted cost may be a difficult objective function for optimizers to work with.

We have already seen an example of this with the simple binary error used in this chapter. The general supervised learning problem that we can independently determine during the learning process. Assume we randomly picked all the y's according to the distribution $P(y \mid x)$ over the entire input space X.

While both distributions model probabilistic aspects of x and y. Remember the two questions of learning? With the same learning model. This situation can be readily modeled within the same framework that we have. This realization of $P(y \mid x)$ is effectively a target function. A data point x. The noisy target will look completely random. This does not mean that learning a noisy target is as easy as learning a deterministic one. $E_{out}$ may be as close to $E_{in}$ in the noisy case as it is in the deterministic case. This view suggests that a deterministic target function can be considered a special case of a noisy target.

If we use the same h to approximate a noisy version of f given by y f x. If y is real-valued for example. One can think of a noisy target as a deterministic target plus added noise. Our entire analysis of the feasibility of learning applies to noisy target functions as well.

In Chapter 2. N Yn wnxn. Use induction. For simplicity. Technically. You now pick the second ball from that same bag. When you look at the ball it is black. In more than two dimensions. One bag has 2 black balls and the other has a black and a white ball.

Problem 1. The following steps will guide you through the proof. You pick a bag at random and then pick one of the balls in that bag at random. What is the probability that this ball is also black?

Use Bayes' Theorem: Report the number of updates that the algorithm takes before converging. Compare your results with b. Compare your results with b. In practice. Comment on whether f is close to g.

Plot a histogram for the number of updates that the algorithm takes to converge. How many updates does the algorithm take to converge? In the iterations of each experiment. PLA converges more quickly than the bound p suggests. Compare your results with b. Be sure to mark the examples from different classes differently. This problem leads you to explore the algorithm further with data sets of different sizes and dimensions. Report the error on the test set. Plot the training data set.
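For readers who want to try these experiments, here is a minimal sketch of the perceptron learning algorithm (PLA) with an update counter. It assumes the standard update rule w ← w + y·x applied to a randomly chosen misclassified point. The data generator, its target weights and its margin are hypothetical choices made so that the data are guaranteed to be linearly separable.

```python
import random


def sign(value):
    """Return +1 for positive values and -1 otherwise."""
    return 1 if value > 0 else -1


def predict(w, x):
    """Perceptron output for input x; w[0] is the bias weight."""
    return sign(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def pla(points, labels, max_updates=1000, seed=0):
    """Run PLA from zero weights; return (weights, number_of_updates)."""
    rng = random.Random(seed)
    w = [0.0] * (len(points[0]) + 1)
    updates = 0
    while updates < max_updates:
        misclassified = [(x, y) for x, y in zip(points, labels)
                         if predict(w, x) != y]
        if not misclassified:
            break  # every training example is classified correctly
        x, y = rng.choice(misclassified)
        w[0] += y  # bias coordinate is the constant 1
        for i, xi in enumerate(x):
            w[i + 1] += y * xi
        updates += 1
    return w, updates


def make_separable_data(n, margin=0.1, seed=0):
    """Random 2-D points labeled by a hypothetical target line with a margin."""
    rng = random.Random(seed)
    target = (0.1, 1.0, -1.0)  # assumed target weights (bias, w1, w2)
    points, labels = [], []
    while len(points) < n:
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        score = target[0] + target[1] * x[0] + target[2] * x[1]
        if abs(score) >= margin:  # discard points too close to the boundary
            points.append(x)
            labels.append(1 if score > 0 else -1)
    return points, labels


if __name__ == "__main__":
    points, labels = make_separable_data(20)
    w, updates = pla(points, labels)
    print("final weights:", w)
    print("updates until convergence:", updates)
```

Repeating the run with different seeds and data sizes gives the update counts needed for the histogram, and the enforced margin keeps the worst-case number of updates small.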

In this problem. The algorithm above is a variant of the so-called Adaline (Adaptive Linear Neuron) algorithm for perceptron learning. To get g. In each iteration. Generate a test data set of size In each iteration t. That is. In Problem 1. Remember that for a single coin. One of the simplest forms of that law is the Chebyshev Inequality. For a given coin. The probability of obtaining k heads in N tosses of this coin is given by the binomial distribution: $u_N$ are iid random variables.

Assume we have a number of coins that generate different samples independently. On the same plot show the bound that would be obtained using the Hoeffding Inequality. Evaluate U(s) as a function of s. For a fixed V of size N. We focus on the simple case of flipping a fair coin. Tf3 N. Argue that for any two deterministic algorithms $A_1$ and $A_2$.

This in-sample error should weight the different types of errors based on the risk matrix. What happens to your two estimators $h_{mean}$ and $h_{med}$? Similar results can be proved for more general settings. You have now proved that in a noiseless setting. You have N data points $y_1, \dots, y_N$ and wish to estimate a 'representative' value. For the two risk matrices in Example 1.

Chapter 2 Training versus Testing

Before the final exam.
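The question about the two estimators can be explored numerically. The sketch below assumes the usual reading of this problem: $h_{mean}$ is the sample mean, which minimizes the sum of squared deviations, and $h_{med}$ is the sample median, which minimizes the sum of absolute deviations. The data values and the outlier are made up for illustration.

```python
import statistics


def squared_error(h, ys):
    """Sum of squared deviations of the data from the estimate h."""
    return sum((h - y) ** 2 for y in ys)


def absolute_error(h, ys):
    """Sum of absolute deviations of the data from the estimate h."""
    return sum(abs(h - y) for y in ys)


def estimators(ys):
    """Return (h_mean, h_med) for a list of data points."""
    return statistics.mean(ys), statistics.median(ys)


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    outlier_data = [1, 2, 3, 4, 100]  # last point perturbed into an outlier
    print("clean data:  ", estimators(data))
    print("with outlier:", estimators(outlier_data))
```

Pushing the last point further out drags the mean along with it while the median stays put, which is the contrast the problem is pointing at.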

Eout is based on the performance over the entire input space X. Doing well in the exam is not the goal in and of itself. It expressly measures training performance. The in sample error Ein. If the exam problems are known ahead of time. If the professor's goal is to help you do better in the exam. We will also discuss the conceptual and practical implications of the contrast between training and testing. They are the 'training set' in your learning.

Such performance has the benefit of looking at the solutions and adjusting accordingly. The exam is merely a way to gauge how well you have learned the material. Although these problems are not the exact ones that will appear on the exam. We began the analysis of in-sample error in Chapter 1. The same distinction between training and testing happens in learning from data. The goal is for you to learn the course material.

The error bound ln in 2. Eout 2: The mathematical results provide fundamental insights into learning from data. This can be rephrased as follows.

To make it easier on the not-so-mathematically inclined. The Eout h 2: Ein h. We will also make the contrast between a training set and a test set more precise. Not only do we want to know that the hypothesis g that we choose (say the one with the best training error) will continue to do well out of sample i. We would like to replace with M as

Footnote 1: Sometimes 'generalization error' is used as another name for $E_{out}$.

E also holds. Notice that the other side of $|E_{out} - E_{in}|$ Pick a tolerance level $\delta$. E for all $h \in \mathcal{H}$. E direction of the bound assures us that we couldn't do much better because every hypothesis with a higher $E_{in}$ than the g we have chosen will have a comparably higher $E_{out}$. To see that the Hoeffding Inequality implies 1. A word of warning: Generalization is a key issue in learning. We have already discussed how the value of $E_{in}$ does not always generalize to a similar value of $E_{out}$.

We may now $2Me^{-2N\epsilon^2}$. If $\mathcal{H}$ is an infinite set. Generalization error. This is important for learning. We then over-estimated the probability using the union bound. If you take the perceptron model for instance. Once we properly account for the overlaps of the different hypotheses. The mathematical theory of generalization hinges on this observation.

The union bound says that the total area covered by If the events B1. In a typical learning model. To do this. If $h_1$ is very similar to $h_2$ for instance. The definition of the growth function is based on the number of different hypotheses that $\mathcal{H}$ can implement. Let x1. Each $h \in \mathcal{H}$ generates a dichotomy on x1.

For any $\mathcal{H}$. To compute $m_{\mathcal{H}}(N)$. If $\mathcal{H}$ is capable of generating all possible dichotomies on x1. Definition 2. This signifies that $\mathcal{H}$ is as diverse as can be on this particular sample. These three steps will yield the generalization bound that we need. The dichotomies generated by $\mathcal{H}$ on these points are defined by $\mathcal{H}$ x1.

Such an N-tuple is called a dichotomy since it splits x1. If $h \in \mathcal{H}$ is applied to a finite sample x1. XN into two groups: A larger $\mathcal{H}$ x1.

The growth function is defined for a hypothesis set $\mathcal{H}$ by

$$m_{\mathcal{H}}(N) = \max_{x_1, \dots, x_N \in \mathcal{X}} |\mathcal{H}(x_1, \dots, x_N)|,$$

where $|\cdot|$ denotes the cardinality (number of elements) of a set. We will focus on binary target functions for the purpose of this analysis. These examples will confirm the intuition that $m_{\mathcal{H}}(N)$ grows faster when the hypothesis set $\mathcal{H}$ becomes more complex. One can verify that there are no 4 points that the perceptron can shatter.

Let us now illustrate how to compute $m_{\mathcal{H}}(N)$ for some simple hypothesis sets. The dichotomy of red versus blue on the 3 colinear points in part a cannot be generated by a perceptron. The most a perceptron can do on any 4 points is 14 dichotomies out of the possible 16. Positive rays: In the case of 4 points.

Example 2. At most 14 out of the possible 16 dichotomies on any 4 points can be generated. Figure 2. Illustration of the growth function for a two-dimensional perceptron. Let us find a formula for $m_{\mathcal{H}}(N)$ in each of the following cases.

If you connect the +1 points with a polygon. The dichotomy we get is decided by which two regions contain the end values of the interval.

Notice that $m_{\mathcal{H}}(N)$ grows as the square of N, faster than the linear growth of the 'simpler' positive ray case. Since this is the most we can get for any points.

Positive intervals: Per the next: The dichotomy we get on the points is decided by which region contains the value a. This does since it is defined based on the maximum 2. Adding up these possibilities. To compute $m_{\mathcal{H}}(N)$. N which is allowed. Each hypothesis is specified by the two end values of that interval. If both end values fall in the same region.
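The counting argument for positive rays and positive intervals can be verified by brute force. The sketch below places one candidate threshold in each of the N + 1 regions defined by N distinct points on the line and collects the distinct dichotomies that result. The expected counts, N + 1 for positive rays and C(N+1, 2) + 1 for positive intervals, are the standard closed forms for these two hypothesis sets; the function names are illustrative.

```python
from math import comb


def region_cuts(points):
    """One representative value in each of the N + 1 regions around the points."""
    xs = sorted(points)
    middles = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return xs, [xs[0] - 1.0] + middles + [xs[-1] + 1.0]


def positive_ray_dichotomies(points):
    """Dichotomies from h(x) = +1 if x > a else -1, over all thresholds a."""
    xs, cuts = region_cuts(points)
    return {tuple(+1 if x > a else -1 for x in xs) for a in cuts}


def positive_interval_dichotomies(points):
    """Dichotomies from h(x) = +1 inside an interval (lo, hi) and -1 outside."""
    xs, cuts = region_cuts(points)
    result = set()
    for i, lo in enumerate(cuts):
        for hi in cuts[i:]:  # hi == lo gives the all -1 dichotomy
            result.add(tuple(+1 if lo < x < hi else -1 for x in xs))
    return result


if __name__ == "__main__":
    for n in range(1, 7):
        pts = list(range(n))
        rays = len(positive_ray_dichotomies(pts))
        intervals = len(positive_interval_dichotomies(pts))
        print(n, rays, n + 1, intervals, comb(n + 1, 2) + 1)
```

Since the count depends only on the ordering of the points, any N distinct points give the same number, so the enumeration on one point set already gives the maximum.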

To compute $m_{\mathcal{H}}(N)$ in this case. For the dichotomies that have less than three +1 points. Convex sets: As we vary a. We now use the break point k to derive a bound on the growth function $m_{\mathcal{H}}(N)$ for all values of N. Verify that m If k is a break point. Exercise 2.

It is not practical to try to compute $m_{\mathcal{H}}(N)$ for every hypothesis set we use. Getting a good bound on $m_{\mathcal{H}}(N)$ will prove much easier than computing $m_{\mathcal{H}}(N)$ itself. In general. If no data set of size k can be shattered by $\mathcal{H}$. Since B N. A similar green box will tell you when to rejoin. The fact that the bound is polynomial is crucial. Absent a break point as is the case in the convex hypothesis example.

To evaluate B N. This means that we will generalize well given a sufficient number of examples. If you trust our math. The notation B comes from 'Binomial' and the reason will become clear shortly. If $m_{\mathcal{H}}(N)$ replaced M in Equation 2. The definition of B N. To prove the polynomial bound. This bound will therefore apply to any $\mathcal{H}$. We will exploit this idea to get a significant bound on $m_{\mathcal{H}}(N)$ in general. A second different dichotomy must differ on at least one point and then that subset of size 1 would be shattered.
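The reason the notation comes from 'Binomial' can be seen numerically. The sketch below assumes the standard recursion used for this bound, B(N, k) ≤ B(N−1, k) + B(N−1, k−1), with boundary values B(N, 1) = 1 and B(1, k) = 2 for k > 1. It checks that solving the recursion with equality gives the binomial sum over i from 0 to k−1 of C(N, i), which is a polynomial in N of degree k−1. The function names are illustrative.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def b_upper(n, k):
    """Upper bound on B(N, k) obtained by solving the recursion with equality."""
    if k == 1:
        return 1  # a single dichotomy already exhausts what break point 1 allows
    if n == 1:
        return 2  # one point, k > 1: both dichotomies are allowed
    return b_upper(n - 1, k) + b_upper(n - 1, k - 1)


def binomial_sum(n, k):
    """Closed form: sum of C(n, i) for i = 0, ..., k - 1."""
    return sum(comb(n, i) for i in range(k))


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, [b_upper(n, k) for k in range(1, 5)])
```

For a fixed break point k the values grow polynomially in N rather than as 2^N, which is what makes the bound useful once it replaces M.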

Consider the B N. We now assume N 2: We collect these dichotomies in the set S Let S1 have a rows. We collect these dichotomies in the set S2 which can be divided into two equal parts.