Ensembles of Classifiers
Review of Lior Rokach, “Ensemble-based classifiers”, Artificial Intelligence Review, Volume 33, Issue 1-2, 2010.
Main components of an ensemble learning framework.
According to Rokach, ensemble frameworks have four main building blocks: 1) training set, 2) base inducer, 3) diversity generator, and 4) combiner.
1) Training set. This is the labeled data used for training: the input attributes together with the target class the model should learn to predict. Resampling techniques such as bootstrapping or leave-one-out may be used to build the training sets.
2) Base inducer. This block is the induction algorithm that takes a training set and builds a classifier from it. By applying the inducer to different training sets we can generate different training scenarios, and thus different classifiers, from the same learning algorithm. Decision trees or k-nearest-neighbor algorithms are typical choices for this task.
3) Diversity generator. As different dataset samples are constructed to train the classifiers, this component is responsible for making the resulting members differ from one another. When each sample is a true representation of the dataset, the members' classification errors are largely uncorrelated, and the ensemble achieves better accuracy than a stand-alone classifier.
4) Combiner. Once several classifiers have been built on different datasets, their knowledge must be unified into one final classifier. The combiner may itself be another algorithm that represents them all. At this stage we need to define the criteria used to unify the classifiers; to that end we can use meta-learners or weight-based criteria. Meta-learners are a good choice when some classifiers are consistently biased toward misclassification or are consistently good; weight-based combiners suit ensembles of more evenly balanced classifiers. A minimal sketch combining all four components is shown below.
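To make the four components concrete, here is a minimal sketch, assuming scikit-learn's DecisionTreeClassifier as the base inducer: bootstrap resampling plays the role of the diversity generator and a majority vote plays the role of the combiner. The class name SimpleEnsemble and all parameter values are illustrative, not taken from the paper.

```python
# Minimal sketch: bootstrapped training sets (diversity generator),
# a decision tree as base inducer, and a majority-vote combiner.
# Assumes X is a NumPy array and y holds nonnegative integer labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleEnsemble:
    def __init__(self, n_members=10, random_state=0):
        self.n_members = n_members
        self.rng = np.random.default_rng(random_state)
        self.members = []

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.n_members):
            idx = self.rng.integers(0, n, size=n)  # bootstrap sample
            self.members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.members])
        # Combiner: majority vote (mode) over the members' predictions
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                                   0, votes)
```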
Huge databases pose several challenges for most machine learning algorithms, including computational complexity and storage problems.
Three modifications of AdaBoost that tackle issues arising from the size of the training dataset are a) Breiman's pasting of small votes, b) P-AdaBoost, and c) local boosting. These modifications help overcome problems of computational complexity, classification accuracy (under-fitting), and storage.
a) Breiman's pasting of small votes. Breiman proposes building the classification from many small votes: a large dataset is divided into small samples, a classifier is trained on each, and the individual predictions are then combined. Accuracy remains as good as if training had been performed on all available instances; on the other hand, the number of iterations required is considerable, resulting in a complex model.
b) P-AdaBoost. This version uses the principles of distributed systems and runs in two steps. In the first step the goal is to estimate the weight of each instance without updating the weights sequentially. The second step takes the weights from step one as input; several classifiers are executed in parallel, and the final classifier is built by combining their results.
c) Local boosting. The classifier performs boosting with resampling. In each boosting round an error is determined for each instance by comparing its classification with that of its neighbors in the sample. When an instance is misclassified with respect to its peers, it is treated as likely noise and its weight is reduced; if its neighbors are also misclassified, the instance lies in a genuinely hard region and its weight is increased. As in regular AdaBoost, correct classifications receive a lower weight. At the end each instance carries a score, which is used when classifying new instances. In contrast to AdaBoost, noisy cases are captured and down-weighted, resulting in a less biased model with improved accuracy and reliability. On the downside, storing all the instances needed for classification can be costly. A sketch of the neighborhood check follows.
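The following is a loose sketch of that neighborhood comparison, not the published algorithm: the function name, the choice of k, the 0.5 threshold, and the step size are all illustrative assumptions.

```python
# Hedged sketch of the neighborhood check described above: an
# instance misclassified while its neighbors are classified
# correctly is treated as noise (weight reduced); one misclassified
# together with its neighbors sits in a hard region (weight raised).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_weight_update(X, weights, miss, k=5, step=0.5):
    # miss: boolean array, True where the current classifier errs
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)          # idx[:, 0] is the point itself
    neighbor_miss = miss[idx[:, 1:]].mean(axis=1)
    up = miss & (neighbor_miss >= 0.5)   # hard region: increase weight
    down = miss & (neighbor_miss < 0.5)  # likely noise: decrease weight
    weights = weights * np.where(up, 1 + step, 1.0) \
                      * np.where(down, 1 - step, 1.0)
    return weights / weights.sum()
```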
The main methods for combining the base-classifiers’ outputs
Two main families of methods are discussed for combining classification outputs: i) weighting methods and ii) meta-learning methods.
i) Weighting methods. These are useful when the classifiers perform the same task and have comparable accuracy. Some of the combination methods are the following (a sketch of weighted voting is given after this list):
a. Majority voting. This is a basic ensemble technique that unifies the outputs of several classifiers into one by taking the mode of their predictions.
b. Performance weighting. The classification of an instance is driven by the accuracy of each independent classifier. Classifiers with the highest accuracy are recognized as better, and their weight in the final classification is increased.
c. Bayesian combination. This combiner is based on the posterior probabilities produced by the classifiers; weights are assigned according to the probability given to each class.
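As a concrete illustration of the weighting family, here is a minimal sketch of performance weighting; the function name and the assumption that each member's validation accuracy is known are mine, not the paper's. Plain majority voting is the special case where every weight is equal.

```python
# Hedged sketch: each member's vote counts in proportion to its
# (assumed known) validation accuracy; the heaviest class wins.
import numpy as np

def weighted_vote(predictions, accuracies, n_classes):
    # predictions: (n_members, n_samples) integer class labels
    # accuracies:  (n_members,) validation accuracy per member
    scores = np.zeros((n_classes, predictions.shape[1]))
    for preds, acc in zip(predictions, accuracies):
        for c in range(n_classes):
            scores[c] += acc * (preds == c)
    return scores.argmax(axis=0)
```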
ii) Meta-learning methods. These are appropriate when the classifiers consistently perform well, or consistently misclassify, on certain kinds of instances.
a. Stacking. This method aims to identify the reliable classifiers and can cope with different inducers. It produces a new dataset from the predictions of the base classifiers and fits the final model on it; the target class remains as in the original data. Its most delicate step is splitting the dataset so that each initial classifier is trained on a truly representative sample. When combining the results, it is possible to use the class probabilities produced at the base level. (See the sketch after this list.)
b. Arbiter trees. This type of combiner builds the final classifier as a tree. It is arbitrary in that, at each parent node, two inducers are selected at random for the partition, and the same learning algorithm is used at every partition. Several forms of arbiter trees exist, differing in their selection criteria; basically, the criteria start by selecting the instances on which the paired classifiers disagree, including at the end those that are classified correctly.
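A hedged sketch of stacking with scikit-learn's StackingClassifier follows: two different base inducers and a logistic-regression meta-learner fit on cross-validated base predictions. The particular estimators and parameter values are illustrative choices, not prescribed by the paper.

```python
# Stacking sketch: base-level class probabilities feed a meta-learner.
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier()),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,                       # internal split keeps meta-features honest
    stack_method='predict_proba')          # pass class probabilities up
# stack.fit(X_train, y_train); stack.predict(X_test)
```

And a loose sketch of the arbiter selection rule described above, training an arbiter on the instances where two sibling classifiers disagree. This shows a single level only, not the full recursive tree construction, and all names are illustrative.

```python
# One arbiter-tree level: the arbiter learns the disagreement set.
from sklearn.tree import DecisionTreeClassifier

def make_arbiter(clf_a, clf_b, X, y):
    disagree = clf_a.predict(X) != clf_b.predict(X)  # selection rule
    # The arbiter later breaks ties between clf_a and clf_b
    return DecisionTreeClassifier().fit(X[disagree], y[disagree])
```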
The concept of “diversity” for ensembles of classifiers and two approaches for creating diversity on ensembles of classifiers.
Diversity for ensembles of classifiers is the idea that members whose classifications are uncorrelated improve the accuracy of the ensemble. According to the paper, theory does not completely explain why and how combining individual models improves accuracy. Five approaches for creating diversity are mentioned; here I discuss two of them.
i) Manipulating the training sample. This approach uses a different training set for each classifier. It is most useful for unstable inducers, where small changes in the training set produce important changes in the trained classifier. Bagging and boosting are popular methods of this approach.
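Bagging in particular is readily available in scikit-learn; a brief sketch, with illustrative parameter values:

```python
# Bagging manipulates the training sample: each tree is fit on a
# bootstrap resample of the data.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=25,
                            bootstrap=True,   # sample with replacement
                            random_state=0)
# bagging.fit(X_train, y_train)
```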
Another method is partitioning. Designed for large datasets, it divides the dataset into random disjoint partitions. The result is not only a more accurate and diverse ensemble; it also helps handle large amounts of data. Clustering is another route to partitioning: the idea is to create clusters such that each sub-sample is a truly representative portion of the entire dataset. Partitioning helps create faster, more accurate, and more scalable algorithms.
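A minimal sketch of disjoint partitioning, assuming a simple random split; the function name and the number of partitions are illustrative:

```python
# Disjoint partitioning: shuffle the indices, split them into k
# blocks, and train one ensemble member per block.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_on_partitions(X, y, k=4, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    return [DecisionTreeClassifier().fit(X[block], y[block])
            for block in np.array_split(idx, k)]  # disjoint partitions
```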
ii) Manipulating the search space. This approach makes each classifier focus on a different part of the dataset. The resulting classifiers are independent, and together the final ensemble represents the entire dataset. The division of the dataset can be done with or without overlap.
Since each classifier looks at a different subset of the dataset, it is possible to use a different algorithm for each subset, aiming for the most accurate and reliable one in each case. That, however, can lead to significant differences among the classifiers.
Within this approach there are two common methods: divide-and-conquer and feature subset-based methods. In the first, the dataset is divided into several subspaces using clustering techniques or hybrid combinations such as naïve Bayes trees. A different classifier can be applied to each subspace; since performance is influenced by the size and homogeneity of the subspaces, overlap may be allowed.
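For the feature subset-based route, a loose sketch of training each member on a random subset of the attributes (the random subspace method is one well-known instance of this idea); the function name and parameters are illustrative:

```python
# Random feature subsets: every member sees only its own columns,
# both when fitting and, later, when predicting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_subspaces(X, y, n_members=10, n_feats=None, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    n_feats = n_feats or max(1, d // 2)
    members = []
    for _ in range(n_members):
        cols = rng.choice(d, size=n_feats, replace=False)
        members.append((cols, DecisionTreeClassifier().fit(X[:, cols], y)))
    return members
```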