Machine Learning for Beginners — a How-to Guide
Addressing the audience at Open Data Science Conference 2017 in Boston, Kirill Eremenko and Hadelin de Ponteves stepped listeners through a collection of different machine learning techniques spanning a wide breadth, explaining the basics behind each to get users off the ground.
This is a great tutorial for getting the general outline of how different popular machine learning algorithms work, plus some code recipes you can use if you want to experiment. Read on for a brief on some of the topics they covered.
Branches of Machine Learning
Eremenko and Ponteves start by outlining some of the different branches of machine learning, as follows:
- Regression: techniques for predicting continuous outcomes. In this case, you have some inputs and are trying to understand how they map to a real value in a range. For example, if you know someone’s income and age, can you predict what their insurance rates will be?
- Classification: techniques for predicting categories. It’s similar to regression, but now we want to get a discrete value. Again, assume you know someone’s income and age, but now we’re interested in whether or not they plan on buying a house. In the real world, there’s more wishy-washiness around this, but in our worldview, there are only two outcomes: they do, or they don’t, and we want to predict which camp they fall in.
- Clustering: discovering groupings in the data based on their features that may otherwise be unobservable. Say you have data for viewing preferences with continuous variables for how much they liked or didn’t like a few different TV shows (think Netflix’s old rating system). You might find that the people who liked Bojack Horsemanalso loved Arrested Development, and that they tend to cluster together. Or people who love Better Call Saul also hated Peaky Blinders and were tepid on House of Cards, so much so that they cluster together.
- Association Rule Learning: discovering rules for observing A given B. This is closely related to clustering in that we’re attempting to find connections between events. However, the difference is the approach. Instead of drawing bounds on a region and seeing if someone fits in that bucket, we use the frequency of a collection of discrete observations to create priors: what’s the probability of observing A, or B, or C? From these, we figure out what the probability is of observing A given B, or observing C given A and B. This is just Bayesian statistics: P(A), P(B | A), P(C | A, B), and so forth.
- Reinforcement Learning: In Eremenko and Poteves’ telling, this is the closest to ‘true AI’. Reinforcement learning works by punishing or rewarding the algorithm for good or bad behavior, respectively. The machine uses a predetermined set of behaviors to attempt to earn rewards and avoid punishments.
- Natural Language Processing: statistical machine learning applied to text, including sentiment analysis, entity disambiguation and speech recognition.
- Deep Learning: the latest and greatest in the field of machine learning. Deep learning is built off of neural networks, which are capable of approximating any real function but require massive amounts of data to do so.
After this explainer, Eremenko and Ponteves get down into the nitty-gritty, running models built with Scikit-Learn on practice datasets. If you want to follow along, you can download their test datasets here.
The Usual Code Pattern
When watching their live code, I recommend paying attention to their recipe. They use a fairly standard pattern that you’ll see with most machine learning libraries, and especially Scikit-Learn, which breaks down like this:
- Pre-process the data. This means cleaning your data, splitting it 80–20 into training and test data, and scaling all of your variables so that they fall into the same range, usually on the interval 0 to 1.
- Train the model. This is usually done in three lines at most. Lots of subtleties are glazed over in this video, such as cross-validation, but the differences in the actual code are only slight.
- Evaluate the model. This is done by simply throwing the test examples into the model and checking how many it predicts correctly. In the case of a continuous output variable, you will measure the error using a metric such as Root Mean Square Error. Visualizing the decision bound can also help, as they do here. Pay special attention to how they use a mesh grid to do this, it’s some wizardry that is difficult to figure out on your own!
Support Vector Machines (SVMs)
Support vector machines are very powerful algorithms built on an extremely simple premise.
- Pick the two examples of your different classes that are closest to each other. In practice, this is typically done by finding the Euclidean distance between every training example. These two examples are your support vectors.
- Draw a line through the space between the support vectors. This is your hyperplane.
- Wiggle the hyperplane around until it maximizes the total distance between the plane and each support vector. This is your cushion.
That’s it!
This very simple technique gets state-of-the-art performance on many problems. It’s arguably less sophisticated than the single-layer perceptron, yet will often outperform it. If you have a classification problem, this will usually be your go-to.
For Eremenko and Ponteves’ explanation, fast-forward the video to 6:20. For their live code, go to 12:45.
Bayes Classifier
The Bayes Classifier is a very simple model that exploits Bayes’ Theorem to predict an outcome given prior knowledge.
For a quick refresher on Bayes’ Theorem, let’s restate it here:
P(A|B) = P(B|A) P(A)P(B)
All this states is that the probability of observing A given B is the same as the probability of observing B given A.
Using this rule, we can make predictions on what class an observation will be. Say we have three types of evidence, A, B, and C, and two outcomes, X or Y. We want to predict whether we’re looking at an X or a Y based on whether A, B, and C are true or false.
To do this, just count up the number of examples for which A is true, and divide by the total number of examples; that’s your P(A). Repeat that to get P(B) and P(C). From here, we can predict the likelihoods of X and Y: P(X | A,B,C) and P(Y | A,B,C). Just pick whichever one is more likely to get the final prediction, and you’re done!
Eremenko and Ponteves point out an interesting technique for making this work on continuous variables, where A, B, and C won’t always break down as true and false. Say they’re all continuous variables; just draw a hypersphere around your point (a circle in two dimensions) with some predetermined radius r. Any point that is in this sphere will be counted as an instance for each of the variables for a given label. For example, if a point is an X, create a true example for A, B, and C when considering the probability of observing X, then divide the number of true examples by the number of points that are in the hypersphere. Use that to get your priors, then you’re off to the races!
For their live coding example, fast-forward to 31:30.
Kernel Support Vector Machines
The kernel support vector machine is essentially the same as the standard SVM, with a cool trick that allows it to discover non-linear decision boundaries.
Instead of using the data as-is, we throw our data into something called a kernel. The kernel is any function that takes an input with a given dimensionality and returns an output with higher dimensionality, effectively adding more features to your examples.
This can mean tacking an extra 1 onto the end of every input vector, though this won’t have any effect and will likely just make the model less accurate. Instead, there are a number of tried-and-true kernels that can be used, most notably the Radial Basis Function kernel.
Once the data is in a higher dimension, we run our SVM on the new data. It’s the exact same process as the regular SVM, only we’ve transformed our data. We then use predictions from the higher dimensional data to classify the corresponding point in the lower dimensional space.
To clarify, an example x of length d goes into the kernel and comes out as x` with at least length d + 1. We train the SVM on all of our x` examples. Now when we want to predict a new point, we repeat the process and return the classification that pops out of the higher dimensional model. That’s all!
For their explanation, fast-forward to 37:25.
Reinforcement Learning with the Upper Confidence Bound
Recall the general setup for reinforcement learning: we have well-defined actions that we can take, so we let the machine figure out how to maximize its reward based on the consequences of those actions.
The Upper Confidence Bound algorithm is a formalization of this idea, where the machine attempts to determine a single action it can take that will maximize its expected return.
This is best understood when framed as the Multi-Armed Bandit Problem. Imagine you have five slot machines and want to figure out which one will give you the biggest returns. An A/B test is too expensive here — testing each machine hundreds of times and comparing the results is prohibitively expensive. Instead, we want to see if we can make some money at the same time we’re honing in on the winner.
At the beginning, we have the same expected value of pulling the lever on any machine, and the same confidence interval for each. The following is a frame from Eremenko and Ponteves’ slideshow:
The red line is our ‘investigated’ expected value of pulling the lever, and the colored lines are the true expected value. The boxes show our confidence interval on the expected value.
Say we pull the lever once — and nothing happens. We lose a little money, so our expected value of pulling the lever decreases.
The average expected value has dropped, and the confidence bound has shrunk. The upper confidence bound is now lower than the highest one, so we move on to the machine with the highest bound. The rest of the bounds are equally high, so we pick one at random.
That worked! We pulled the lever and got a return, so our ‘investigated’ expected value goes up. The up confidence bound still shrinks though; it is now lower than three other machines, so we pick one of those three and proceed.
We keep doing this over and over again, picking the machine that we think has the highest possible return based on the upper confidence bound. This will slowly shrink the bounds and adjust our investigated expected value towards the true expected value.
Now our bounds are so close to the ‘investigated’ expected value, and the investigated expected value so close to the true expected value, that we can sit at the same machine all night and maximize our returns.
For their explanation, jump to 44:34. Their live code begins at 54:50.
These are just a few techniques you can use to derive insights from data, but they are fairly instructive when it comes to how most algorithms work. Most problems boil down to drawing a hyperplane to separate your data, whether it’s probabilistic like a Bayes Classifier or deterministic like a Support Vector Machine.
There are lots of subtleties to each of these approaches that a one-hour video can’t capture, and which you’d do well to learn. However, if you’re just earning your sea legs in machine learning, these models will take you a long way.