Producing models is one of the main activities in every data scientist’s life. While building a good model for a well-prepared dataset, you often end up with several trained models and pick the one that best fits your needs.

Source: Bruno Cestari

For example, suppose you are building a binary classification model to predict whether or not it will rain tomorrow. You train and test six different models. The predictions for a week are shown in the table below:

Which model should we select?


None of them performs perfectly in this task and some of them are almost equivalent.


But if you look carefully at the predictions, you can see that the models' errors are largely uncorrelated: when one model gets a day wrong, almost all the others get it right.


So, what happens if we take the modeling one step further, count how many votes each class receives for each day, and then decide by majority vote?

The final model predicts everything correctly! We’ve built a better predictor simply by aggregating the models we already have. This is known as “ensemble learning.”
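To make the voting step concrete, here is a minimal sketch in plain Python. The predictions are hypothetical values invented for illustration (they are not the table from this article), and `majority_vote` and `accuracy` are helper names made up for this sketch. Each made-up model errs on different days, so the vote cancels the errors out.

```python
# Hypothetical data, NOT the article's table: 1 = rain, 0 = no rain, one value per day.
truth = [1, 0, 1, 1, 0, 0, 1]

# One row per model; each (made-up) model is wrong on different days.
predictions = [
    [0, 0, 1, 1, 0, 0, 1],  # wrong on day 1
    [1, 1, 1, 1, 0, 0, 1],  # wrong on day 2
    [1, 0, 0, 1, 0, 0, 1],  # wrong on day 3
    [1, 0, 1, 0, 0, 0, 1],  # wrong on day 4
    [1, 0, 1, 1, 1, 0, 1],  # wrong on day 5
    [1, 0, 1, 1, 0, 1, 0],  # wrong on days 6 and 7
]


def majority_vote(preds):
    """Predict 1 for a day only if strictly more than half of the models vote 1."""
    n_models = len(preds)
    return [int(2 * sum(day_votes) > n_models) for day_votes in zip(*preds)]


def accuracy(pred, actual):
    """Fraction of days predicted correctly."""
    return sum(p == a for p, a in zip(pred, actual)) / len(actual)


ensemble = majority_vote(predictions)

for i, model in enumerate(predictions, start=1):
    print(f"model {i}: {accuracy(model, truth):.2f}")
print(f"ensemble: {accuracy(ensemble, truth):.2f}")
```

With an even number of models a 3-3 tie is possible; this sketch resolves it as "no rain", which is an arbitrary choice.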

Source: Bruno Cestari

This illustrative case is very simple. Real-life cases rarely work out so well. But how do we know this wasn’t luck? Is there anything that guarantees that this technique can increase performance?

To answer this question, let us suppose we have six completely uncorrelated models. What would the expected accuracy of the ensemble be if each model has an accuracy of 75%?

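For any observation, each model individually has a 75% chance of predicting the right class and a 25% chance of predicting the wrong one. For 6 independent models, the number of correct votes X follows the binomial distribution with p = 0.75 and n = 6:

$$P(X = k) = \binom{6}{k}\, 0.75^{k}\, 0.25^{6-k}$$

Picking only the cases where the most voted class is the correct answer, that is, where at least 4 of the 6 models are right (a 3-3 tie is counted as a miss):

$$P(X \ge 4) = \sum_{k=4}^{6} \binom{6}{k}\, 0.75^{k}\, 0.25^{6-k} \approx 0.297 + 0.356 + 0.178 \approx 0.83$$

The theoretical accuracy is 83%! An improvement achieved just by taking a simple majority vote of the already trained models!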
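The binomial calculation can be checked in a few lines of Python. This is a sketch under the same assumptions as the text (fully independent models, each equally accurate); `majority_accuracy` is a helper name introduced here, and it counts a tie as a wrong answer, which is what reproduces the 83% figure.

```python
from math import comb


def majority_accuracy(n, p):
    """Probability that strictly more than half of n independent models,
    each correct with probability p, predict the right class."""
    first_winning_count = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(first_winning_count, n + 1)
    )


print(f"{majority_accuracy(6, 0.75):.4f}")  # 0.8306
```

Calling the function with other values of `n` and `p` shows how the gain depends on the number of models and on their individual accuracy.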

For n = 50 models, the expected accuracy exceeds 99.9%! But it is not that simple… The greatest challenge here is finding 50 uncorrelated models. The outputs of different models predicting the same thing are usually correlated to some degree, so the real accuracy is lower than this estimate.
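To get a feel for how much correlation can cost, here is a rough sketch built on a toy correlation model that is purely an assumption of this example, not something measured on real models: for each observation, with probability `rho` all models copy one shared outcome (correct with probability `p`), and otherwise they vote independently. Under that assumption each model still has accuracy `p`, the pairwise correlation between the models' errors equals `rho`, and the ensemble accuracy is a simple mixture. The function names are hypothetical.

```python
from math import comb


def majority_accuracy(n, p):
    """Accuracy of a strict-majority vote over n independent models."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def correlated_majority_accuracy(n, p, rho):
    """Toy model (an assumption): with probability rho every model copies the
    same shared outcome, otherwise the models vote independently."""
    return rho * p + (1 - rho) * majority_accuracy(n, p)


for rho in (0.0, 0.25, 0.5, 1.0):
    acc = correlated_majority_accuracy(50, 0.75, rho)
    print(f"rho={rho:.2f}: ensemble accuracy {acc:.4f}")
```

At `rho = 0` this reduces to the independent case, and at `rho = 1` the 50 models behave like a single model with 75% accuracy, so the vote adds nothing.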

Fortunately, this is not the only way to build an ensemble model. In the next part of this series, I will delve deeper into how we can improve ensemble performance using the four main methods: weighted voting, stacking, bagging, and boosting.


Featured Image by Markus Spiske on Unsplash


About the author

Bruno Cestari is a Software Engineer at Poatek.