Boosting ML models to create strong learners
Boosting ML models allows you to reduce bias, improve model accuracy, and boost performance.
How hybrid and ensemble techniques allow simple ML to excel
Machine learning is undoubtedly one of the most powerful techniques in AI. However, there are times when ML models are weak learners. Boosting is a way to take several weak models and combine them into a stronger one. Doing this allows you to reduce bias, improve model accuracy, and boost performance.
Background
Artificial intelligence has been transforming many industries. Undoubtedly, it is a disruptive technology on a par with the invention of digital computing itself. One of the most powerful forms of AI is machine learning (ML). Machine learning can be supervised, where you tell the computer whether it was right or not, or unsupervised.
In supervised learning, you take a large set of known data and use this to train the model to identify the correct features. A good example is support vector machines for feature classification in images. In unsupervised learning, the computer takes the raw data and tries to find interesting features or patterns. An example of this is k-means clustering.
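To make the distinction concrete, here is a minimal scikit-learn sketch. The iris dataset, the default SVM settings, and the choice of three clusters are purely illustrative assumptions: the support vector machine is trained on labeled examples, while k-means is given only the raw features.

```python
# A minimal sketch contrasting supervised and unsupervised learning.
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: the SVM is told the correct label for every training sample.
svm = SVC().fit(X_train, y_train)
print("SVM test accuracy:", svm.score(X_test, y_test))

# Unsupervised: k-means only sees the raw features and looks for clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels for the first five samples:", kmeans.labels_[:5])
```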
The problem with supervised learning is that you often need a really large set of classified training data to create a robust model. In many cases, you will struggle to find enough training data and your model will be a weak learner, or you won’t have enough computing power to create a strong model. So, how can we overcome this issue? Or, as Kearns and Valiant asked, "Can a set of weak learners create a single strong learner?" This is where “boosting” comes in. Boosting is an ensemble technique that takes a group of ML models that are weak learners and uses them to create a strong learner.
Supervised learning 101
Supervised learning is the classic form of machine learning. It is used to perform a wide range of tasks including classification, regression, and pattern recognition. It also underpins other forms of AI like image recognition, speech-to-text, and handwriting recognition.
The starting point is to find (or create) a large set of data with the appropriate features identified and labeled. Next, you need to choose a form of supervised learning algorithm. There are many choices, including neural networks trained with backpropagation, genetic programming, and the naïve Bayes classifier. As an example, we will look at how backpropagation works in artificial neural networks (ANNs).
The artificial neuron, or perceptron
The basis for any ANN is the artificial neuron or perceptron shown below.
By themselves, perceptrons aren’t very useful. But when you combine them into an artificial neural network, they become far more powerful. In an ANN, the number of inputs is determined by the number of features in your data, and the number of outputs by the number of states (classes) you are looking for. Between these, you have one or more hidden layers.
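As a rough sketch of what a single perceptron computes, the snippet below (the weights, bias, and inputs are purely illustrative, not taken from the example in this article) takes a weighted sum of the inputs and passes it through an activation function.

```python
import numpy as np

# A minimal sketch of a single perceptron: a weighted sum of the inputs
# passed through an activation function (here a sigmoid).
def perceptron(inputs, weights, bias):
    weighted_sum = np.dot(inputs, weights) + bias
    return 1.0 / (1.0 + np.exp(-weighted_sum))   # sigmoid activation

x = np.array([1.0, 0.0, 0.5])       # illustrative input features
w = np.array([0.4, -0.2, 0.7])      # weights adjusted during training
print(perceptron(x, w, bias=0.1))   # output between 0 and 1, read as a probability
```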
In an ANN, the aim is to take the inputs and apply appropriate weights at each stage such that the outputs give the “correct” answer. Typically, you can think of the result at each layer as a set of probabilities. During training, you feed your known data in as the input, then inspect the output and compare it to the expected one. For instance, in the example above, the correct output for Wednesday with no holiday would be 7 am, in other words {1.0, 0, 0}. But the current weights actually return a result of {0.5, 0.8, 0}, or 9 am.
To know whether to increase or decrease each weight, you use backpropagation. Starting at the output, you look at the error (the loss function) and find its derivative. The aim is to minimize the error using a technique such as gradient descent. You then step backward through the network and repeat this process until you reach the input layer. You run this cycle many times over the training data. Once you are confident the model is accurate, you verify it with a separate set of test data.
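The toy sketch below shows this loop in miniature, under some simplifying assumptions (made-up XOR-style data, a single hidden layer, squared-error loss, and an arbitrary learning rate): run a forward pass, take the derivative of the error at the output, and step backward through the layers adjusting the weights.

```python
import numpy as np

# Toy illustration of training a tiny one-hidden-layer network with
# gradient descent and backpropagation. Data, sizes, and learning rate
# are illustrative assumptions, not a production recipe.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for epoch in range(5000):
    # Forward pass: compute the outputs for the current weights.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: differentiate the error and step back through the
    # layers, nudging each weight in the direction that reduces the loss.
    err = out - y                          # derivative of 0.5 * (out - y)^2
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2))   # after training, typically close to the expected [0, 1, 1, 0]
```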
Weak and strong learners
Weak machine learning models only ever give results that are marginally better than pure chance. Other models are much stronger learners and, with the correct training data, can become highly accurate. So, surely, we should only ever consider strong learners? Well, there’s one key benefit to weak learners: typically, they are computationally very simple. One example is the decision stump, which is a decision tree with only a single split. Moreover, even with a perfect training data set, you may struggle to create strong models for some problems.
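To see just how simple a weak learner can be, here is a short scikit-learn sketch (the synthetic dataset is an assumption) that trains a single decision stump on its own. Because it can only ask one question about the data, its accuracy is limited compared with a deeper model.

```python
# A decision stump is just a decision tree limited to a single split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
print("Single stump accuracy:", stump.score(X_test, y_test))
```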
Boosting
So, how can you improve the performance of these weak models? The answer is to combine them into a better model. There are many approaches to this, but we will look at three common ones.
Adaptive Boosting
Adaptive boosting, or AdaBoost, combines multiple decision stumps into a single strong learner. At the start, all observations are weighted equally. At each step, you check which observations were classified correctly and give higher weights to the incorrect ones, ensuring the next decision stump focuses on them. The idea is to gradually force the ensemble toward the correct result. As a simple example, imagine we want to correctly classify regions as positive (blue) or negative (red). We start with a random decision stump, D1, as shown below.
The decision stump D1 incorrectly includes 3 positives in the negative area. In Box 2, these incorrect observations have been given higher weights, creating decision stump D2. Now, there are 3 negatives in the positive area. In Box 3, these incorrect observations are also given higher weights, resulting in decision stump D3. Finally, in Box 4, we combine the 3 decision stumps into a single weighted ensemble and see a correct classification.
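In practice, this reweighting loop is usually handled by a library. Below is a minimal scikit-learn sketch (the synthetic dataset and the number of boosting rounds are assumptions); AdaBoostClassifier reweights the observations and fits a new depth-one stump at every round.

```python
# AdaBoost over decision stumps: each round reweights the observations the
# previous stumps got wrong and fits a new stump to the reweighted data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default weak learner in scikit-learn is a depth-one decision stump.
ada = AdaBoostClassifier(n_estimators=100).fit(X_train, y_train)
print("Boosted stumps accuracy:", ada.score(X_test, y_test))
```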
Random Forest
Random forest also uses multiple decision trees. This time, the trees are more complex (not just decision stumps). A random selection of the data and a random set of features are used to create each tree, and simple majority voting is used to combine them. The random element reduces the risk of overfitting and also reduces the variance of the model.
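A sketch along the same lines (again assuming scikit-learn and a synthetic dataset) looks like this: each tree is grown on a bootstrap sample of the data with a random subset of features considered at every split, and the trees vote on the final class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,        # number of randomized trees that vote
    max_features="sqrt",     # random subset of features at each split
).fit(X_train, y_train)
print("Random forest accuracy:", forest.score(X_test, y_test))
```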
Gradient boosting
Gradient boosting works by creating an ensemble of increasingly accurate predictors. It uses the gradient descent approach that is common in backpropagation. At every iteration, it looks at the residual errors and fits a new predictor to them. The result is that each model has a smaller error than the previous one; thus, the final model should be far more robust.
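The hand-rolled sketch below illustrates that residual-fitting loop for a toy regression problem. The data, the small scikit-learn trees, the squared-error loss, and the learning rate are all illustrative assumptions rather than a prescribed implementation.

```python
# Gradient boosting idea for regression: each new small tree is fitted to
# the residual errors of the ensemble so far, and its prediction is added
# with a small learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

prediction = np.zeros_like(y)
learning_rate = 0.1
trees = []

for _ in range(100):
    residuals = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Mean squared error after boosting:", np.mean((y - prediction) ** 2))
```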
The benefits of boosting
Boosting has several benefits that make it a particularly powerful form of hybrid machine learning. Firstly, it allows you to create complex and accurate models from computationally simple elements. Secondly, boosting replicates the learning from mistakes that is the hallmark of true intelligence. Thirdly, it creates models that are much more robust against overfitting and bias. Overfitting is a common issue when training supervised ML models. In essence, the model becomes too good at spotting the features in the training data, to the extent that it stops working so well for new, unknown data. Bias in the training data is also a major problem for ML.
Applications of boosting
Nowadays, boosting techniques are used to help solve a wide range of AI problems. For instance, here at Functionize, we use an autonomous intelligent test agent to run all your automated tests. This test agent uses multiple forms of artificial intelligence, many of which rely in turn on boosting. For instance, our ML Engine uses computer vision as one way to identify and select elements on the screen.
Computer vision
Computer vision involves teaching a computer to recognize objects within an image. It consists of 3 steps.
- Object localization involves determining which pixels belong to which object. Your aim is to draw bounding boxes around each object.
- Object classification is the process of deciding what each object is.
- Semantic segmentation groups objects together to try and make sense of the overall image.
Many of the most robust object classifiers rely on ensemble techniques like these, including random forest and gradient-boosted trees, often alongside the support vector machine (which uses higher-dimensional hyperplanes to separate the data).
Self-driving vehicles
SDVs have a major problem with computational power. Firstly, high-power computers tend to be cumbersome, and secondly, they consume significant energy. However, SDVs also need to be able to make decisions in a timely manner. There’s no use identifying a pedestrian after you have run them over! This means that SDVs need algorithms that are computationally simple, accurate, and fast to run. Thus, SDVs are full of examples of boosting.
Conclusions
As we have seen, boosting allows you to combine multiple simple machine learning models into one much more robust model. It is widely used in many applications of AI, especially ones that seek to replicate human senses. This is partly because it replicates the way we humans learn to interpret the world around us. The Functionize intelligent test agent relies on many AI techniques that incorporate boosting, and it’s fair to say that makes boosting central to how our system works.