A Diversity-Accuracy Measure for Homogenous Ensemble Selection

The aim of ensemble pruning methods is to improve both efficiency (prediction time) and predictive performance [35]: a large number of models increases the computational complexity but guarantees great diversity within the ensemble. This diversity comes from models with both good and poor predictive performance. Models with poor performance negatively affect the overall performance of the ensemble. Eliminating them while maintaining a large diversity among the remaining elements of the ensemble improves performance while reducing prediction time. We propose a new method to simplify homogeneous ensembles composed of C4.5 decision trees [39]. This method is based on a DHCEP (Directed Hill Climbing Ensemble Pruning) strategy with a multi-objective function that evaluates the relevance of an ensemble of trees. The function, used in a hill climbing process with Forward Selection (FS), selects ensembles with the best compromise between maximum diversity and minimum error rate. The motivation behind the joint use of the two criteria is that the individual performance of classifiers and their diversity are correlated: the more accurate the classifiers, the less they disagree. Using only one of the two properties is not sufficient to find the best performing ensemble.

I. Introduction
Decision trees are classification methods that generate mutually exclusive decision rules structured as trees, which must be simple while achieving maximum performance on the learning sample.
These methods have the advantage of generating intelligible rules but are less accurate than competing methods because of a significant variance due to the instability of trees [5], which increases their generalization error rate [20].
To alleviate this weakness and find a compromise between the complexity of a model and its generalization reliability, stopping criteria as well as post-pruning algorithms have been proposed. These techniques reduce the variance but degrade performance (they increase the bias). Another solution is to improve an unstable learning algorithm by running it several times to construct a set of different models [6], [40], [15], [18]. The generated models are then aggregated in order to combine their predictions. In regression, the aggregation is based on the predictions of the individual models, whereas in classification it is done by a majority vote among the classes predicted by the different models.
The effectiveness of these methods is related to the construction of a collection of classifiers that are both sufficiently precise (performance) and sufficiently diverse (diversity). Diversity offers the opportunity to benefit from the complementarity of the individual models that make up the ensemble.
If each classifier is precise and does not make errors on the same individuals as the others, then the uncorrelated errors of the different classifiers are removed by the voting process.
Ensemble methods are noise-resistant, do not suffer from overfitting, and give good performance [41], but they have the disadvantage of relying on a large number of models. This increases the learning time, the storage resources [31], and the prediction time, since all models in the set must be queried.
The aim of ensemble pruning methods is to improve both efficiency (prediction time) and predictive performance [35]: a large number of models increases the computational complexity but guarantees great diversity within the ensemble. This diversity comes from models with both good and poor predictive performance.
Models with poor performance negatively affect the overall performance of the ensemble. Eliminating them while maintaining a large diversity among the remaining elements of the ensemble improves performance while reducing prediction time.
We propose a new method to simplify homogeneous ensembles composed of C4.5 decision trees [39]. This method is based on a DHCEP (Directed Hill Climbing Ensemble Pruning) strategy with a multi-objective function that evaluates the relevance of an ensemble of trees. The function, used in a hill climbing process with Forward Selection (FS), selects ensembles with the best compromise between maximum diversity and minimum error rate. The motivation behind the joint use of the two criteria is that the individual performance of classifiers and their diversity are correlated: the more accurate the classifiers, the less they disagree. Using only one of the two properties is not sufficient to find the best performing ensemble.
The proposed multi-objective function is based on this compromise between the individual performance of trees and their diversity. A comparison with the UWA [36], Complementariness (COM) [32], and Margin Distance Minimization (MARGIN) [33] methods shows that, in most cases, the proposed method generates ensembles that are both smaller and more accurate than those of the methods cited above. The reduced number of trees saves memory space and computing time, which can be very significant for large samples.
The paper is organized as follows. In Section 2 we present the state of the art on ensemble selection methods, with more details on UWA [36], Complementariness [32], and Margin Distance Minimization [33].
These methods simplify ensembles and will serve as a basis for comparison (the source code for these methods is available at http://mlkd.csd.auth.gr/ensemblepruning.html [36]). In Section 3, we present the DIACES method, detailing the proposed function as well as the path strategy. Section 4 contains the experiments and the analysis of the results obtained. In the last section we conclude and propose some directions for future work.

II. Bagging, Aggregation, and Hill Climbing
In this section, we present the basic elements used in this paper, namely the bagging method used to generate the initial ensemble, aggregation by unweighted and weighted vote, and hill climbing.

A. Diversification by Bagging
Bagging (Bootstrap Aggregating) is a resampling method introduced by Breiman in 1996 [6]. Consider a learning sample $\Omega_L$ and a learning method that generates a predictor $\hat{h}(\cdot, \Omega_L)$ from $\Omega_L$. The principle of bagging is to draw several bootstrap samples $(\Omega_L^{\theta_1}, \ldots, \Omega_L^{\theta_q})$, to generate a predictor from each one using the base learning method, which gives the collection $(\hat{h}(\cdot, \Omega_L^{\theta_1}), \ldots, \hat{h}(\cdot, \Omega_L^{\theta_q}))$, and finally to aggregate these predictors.
A bootstrap sample $\Omega_L^{\theta_l}$ is obtained by randomly drawing $n$ observations with replacement from the starting sample $\Omega_L$, where $|\Omega_L| = n$. At each draw, every observation has probability $1/n$ of being drawn; the random variable $\theta_l$ represents the random drawing.
Bagging was initially introduced with a decision tree as the base rule, but the scheme is general and can be applied to other base rules. The principle of bagging is presented in Fig. 1.
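The bootstrap drawing described above can be illustrated with a short sketch. It is only a minimal illustration, assuming the learning sample is held in a Python list; the function names `bootstrap_sample` and `bagging_samples` are hypothetical and do not come from the paper or from Weka.

```python
import random

def bootstrap_sample(sample, rng):
    """Draw len(sample) observations uniformly, with replacement."""
    n = len(sample)
    return [sample[rng.randrange(n)] for _ in range(n)]

def bagging_samples(sample, q, seed=0):
    """Build q bootstrap samples from the same learning sample."""
    rng = random.Random(seed)
    return [bootstrap_sample(sample, rng) for _ in range(q)]

# Toy learning sample of n = 10 observations (identifiers only).
learning_sample = list(range(10))
samples = bagging_samples(learning_sample, q=3)
```

Each of the q samples would then be passed to the base learner (here a decision tree) to produce one predictor of the ensemble.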

B. Aggregation (Unweighted Vote, Weighted Vote)
Unweighted and weighted voting are the most widely used methods for combining (aggregating) models, whether homogeneous or heterogeneous. In ensemble methods, each model outputs a class value or a probability for an instance, and the ensemble assigns to the instance the class with the most votes or the highest average probability.
In weighted voting, the classification models are associated with weights assigned according to their classification accuracy. Formally, this can be written as follows [36]. Let $x$ be an instance and $m_i$, $i = 1, \ldots, k$, a set of models that output a probability distribution $m_i(x, c_j)$ for each class $c_j$, $j = 1, \ldots, n$. The output $y(x)$ of the (weighted) voting method for instance $x$ is given by:
$$y(x) = \arg\max_{c_j} \sum_{i=1}^{k} w_i \, m_i(x, c_j) \qquad (1)$$
where $w_i$ is the weight of model $m_i$; in unweighted voting, all weights are equal to one.
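As an illustration of the voting rule, the sketch below assumes the usual form y(x) = argmax over c_j of the sum over i of w_i m_i(x, c_j), with all weights equal to one in the unweighted case. The function name `weighted_vote` and the toy probabilities are hypothetical.

```python
def weighted_vote(distributions, weights=None):
    """Return the index j of the class maximizing sum_i w_i * m_i(x, c_j).

    distributions[i][j] is the probability that model i assigns to class j
    for one instance x. With weights=None, the vote is unweighted.
    """
    k = len(distributions)
    if weights is None:
        weights = [1.0] * k
    n_classes = len(distributions[0])
    scores = [
        sum(w * dist[j] for w, dist in zip(weights, distributions))
        for j in range(n_classes)
    ]
    return max(range(n_classes), key=scores.__getitem__)

# Three models, two classes (toy values).
distributions = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
unweighted = weighted_vote(distributions)                    # class 0 wins
weighted = weighted_vote(distributions, [0.2, 1.0, 1.0])     # class 1 wins
```

Down-weighting the first model changes the decision, which is the effect that accuracy-based weights are meant to produce.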

C. Hill Climbing
Hill climbing is an optimization technique belonging to the family of local search methods. The algorithm starts with an arbitrary solution to a problem, then iteratively tries to find a better solution by changing one element of the current solution. If the change produces a better solution (with respect to the evaluation function to be maximized or minimized), it is accepted and the search continues from the new solution. The process is repeated until no improvement can be found (the function has reached a maximum or a minimum).
Hill climbing attempts to maximize (or minimize) a target function f(X), where X is a vector of continuous and/or discrete values. At each iteration, hill climbing adjusts a single element of X and determines whether the change improves the value of f(X). Any change that improves f(X) is accepted, and the process continues until no further improvement of the function can be found.
For ensemble selection, DHCEP (Directed Hill Climbing Ensemble Pruning) is used; in this case, the vector X is composed of classifiers or predictors.
The search can be carried out either by backward elimination or by forward selection. In the first case, the whole ensemble is taken as the initial solution, and the elements that do not improve the evaluation function are eliminated one by one. In the second case, the search starts from a randomly chosen element, and the elements that improve the evaluation function are added one by one. The elements to be added or removed belong to the neighborhood of the current solution. Fig. 2 presents a hill climbing diagram for an ensemble composed of four models.
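The forward selection variant can be sketched as a generic greedy loop. This is a minimal sketch, assuming an evaluation function to be minimized; `forward_selection`, `evaluate`, and the toy objective below are hypothetical stand-ins, not the measure used in this paper.

```python
def forward_selection(models, evaluate, start):
    """Greedy forward hill climbing that minimizes evaluate(subset).

    At each step, the neighbor (current subset plus one model) with the
    lowest score is adopted, provided it strictly improves the score.
    """
    current = [start]
    best = evaluate(current)
    remaining = [m for m in models if m != start]
    while remaining:
        scored = [(evaluate(current + [m]), m) for m in remaining]
        score, candidate = min(scored, key=lambda pair: pair[0])
        if score >= best:
            break  # no neighbor improves the function: local optimum
        current.append(candidate)
        remaining.remove(candidate)
        best = score
    return current, best

# Toy objective: bring the sum of the selected "models" as close to 10 as possible.
toy_models = [1, 2, 3, 4, 5, 6]
subset, score = forward_selection(
    toy_models, evaluate=lambda s: abs(sum(s) - 10), start=3
)
```

The loop evaluates at most k neighbors per step and performs at most k steps, which is where the quadratic number of evaluated subsets comes from.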
Among these methods, a large number of ensemble pruning methods based on a hill climbing search process have recently been proposed [31], [17], [10], [32], [3]. The methods differ from each other in the search direction adopted, the evaluation measure, and the evaluation set. Some methods use the learning sample for evaluation, while others favor the use of a separate validation set. The latter choice depends mainly on the availability of data.
A first type of approach uses performance measures. Fan et al. [17] propose a profit-based evaluation function and dynamic scheduling to accelerate the prediction process. To reduce the ensemble size, the total benefit is used as the selection criterion in conjunction with a greedy search algorithm, with and without back fitting. The search begins with the model with the greatest benefit. Consider a set of instances, each of which can be positive or negative; $B(x)$ denotes the benefit of predicting $x$ as positive, and the total benefit is $B_T = \sum_x B(x)$. The authors choose the sub-ensembles that maximize the total benefit.
Empirical evaluations on several data sets have shown that the profit-driven greedy approach, with or without back fitting, reduces the size of an ensemble by 90% while maintaining, or sometimes exceeding, the total benefit of the original ensemble on the test sample. The authors have also studied the possibility of combining diversity and total benefit, but the experimental results have shown that the total benefit is a good criterion by itself.
Caruana et al. [10] use several performance metrics and a hill climbing strategy to build ensembles of models from libraries of thousands of models. The model libraries are generated using different learning algorithms. Forward stepwise selection consists in adding the models that maximize the performance of the ensemble.
The second type of approach uses diversity-based measures. Martínez-Muñoz et al. [32] use diversity with a hill climbing forward selection process. The diversity measure, called Complementariness, is defined as the complementarity of a model $h_k$ with respect to the current subset Sub and a set of instances of the evaluation sample $Eval = \{(x_i, y_i)\}$ such that $|Eval| = n$. The measure COM is calculated as follows:
$$COM_{Eval}(h_k, Sub) = \sum_{i=1}^{n} I\big(h_k(x_i) = y_i \wedge Sub(x_i) \neq y_i\big) \qquad (2)$$
where $I(True) = 1$, $I(False) = 0$, and $Sub(x_i)$ is the classification of the instance $x_i$ by the subset Sub. The principle of the measure is to add to Sub the model that correctly classifies the most examples misclassified by the subset.
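The Complementariness principle can be sketched as a count of the evaluation instances that the candidate classifies correctly while the current subset, voting by majority, misclassifies them. This is a minimal sketch of that reading; the function names and toy predictions are hypothetical, and ties in the subset vote are resolved in favor of the first label encountered, which is an assumption.

```python
from collections import Counter

def subset_vote(subset_preds, i):
    """Majority class predicted by the subset for instance i."""
    votes = Counter(preds[i] for preds in subset_preds)
    return votes.most_common(1)[0][0]

def complementariness(candidate_preds, subset_preds, labels):
    """Count instances where the candidate is right and the subset is wrong."""
    return sum(
        1
        for i, y in enumerate(labels)
        if candidate_preds[i] == y and subset_vote(subset_preds, i) != y
    )

labels = [0, 1, 1, 0]
subset_preds = [[0, 0, 1, 1]]        # current subset: one model, wrong on instances 1 and 3
candidate_a = [0, 1, 0, 0]           # fixes instances 1 and 3
candidate_b = [0, 0, 1, 0]           # fixes instance 3 only
com_a = complementariness(candidate_a, subset_preds, labels)
com_b = complementariness(candidate_b, subset_preds, labels)
```

Forward selection with this measure would add `candidate_a`, since it corrects more of the subset's errors.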
Martínez-Muñoz et al. [33] propose margin distance minimization, which measures diversity by associating with each classifier $h_t$ a vector $c_t$ whose dimension is equal to the number of individuals in the evaluation sample. An element $c_t(i)$ takes the value 1 if $h_t$ correctly classifies individual $i$ and -1 otherwise. An average vector $C_{Sub}$ associated with a subset Sub is calculated as $C_{Sub} = \frac{1}{|Sub|} \sum_{t=1}^{|Sub|} c_t$. The objective is to reduce the Euclidean distance $d(o, C_{Sub})$, where $o$ is a predefined vector. The margin measure of a candidate model is based on this distance.
Partalas et al. [35], [36] propose the Uncertainty Weighted Accuracy (UWA) measure in (4), which considers four cases when adding a model to a sub-ensemble and uses justified weights to distinguish the favorable cases from the others.
In (4), the parameters α and β represent, respectively, the number of models in the subset Sub that correctly classify the instance $(x_i, y_i)$ and the number of models that incorrectly classify the same instance.
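For the margin distance approach of [33] described above, the signature vectors and the distance to the reference vector can be sketched as follows. This is only an illustration of the quantities named in the text: the function names are hypothetical, and the value 0.5 used for every component of the reference vector o is an arbitrary assumption, not a value taken from [33].

```python
from math import sqrt

def signature(preds, labels):
    """c_t(i) = 1 if the classifier is right on instance i, -1 otherwise."""
    return [1 if p == y else -1 for p, y in zip(preds, labels)]

def average_signature(signatures):
    """Component-wise mean of the signature vectors of a subset (C_Sub)."""
    m = len(signatures)
    return [sum(column) / m for column in zip(*signatures)]

def distance_to_reference(signatures, o):
    """Euclidean distance d(o, C_Sub)."""
    c_sub = average_signature(signatures)
    return sqrt(sum((oi - ci) ** 2 for oi, ci in zip(o, c_sub)))

labels = [0, 1, 1]
sig_a = signature([0, 1, 0], labels)   # [1, 1, -1]
sig_b = signature([0, 0, 1], labels)   # [1, -1, 1]
reference = [0.5, 0.5, 0.5]            # assumed reference vector o
dist = distance_to_reference([sig_a, sig_b], reference)
```

A candidate whose signature moves the average vector closer to o reduces this distance.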
More recent related work [30] deals theoretically with the effect of diversity on the generalization performance of voting, using Probably Approximately Correct (PAC) learning. It shows that diversity is closely related to the complexity of the hypothesis space, and that diversity can be strengthened by applying regularization to ensemble methods. Based on this analysis, the authors apply an explicit diversity regularization for ensemble selection.
Dai [12] proposes an improvement of an earlier ensemble selection method by the same authors. This method uses backtracking, which is well suited to the systematic search for solutions to large combinatorial problems. The improvement concerns the response time of the method, which is considerably reduced in this study.
Zhou et al. [51] propose a new algorithm based on frequent item learning that links data and the simplified ensemble to a transactional database whose transactions are instances and items are classifiers. A Boolean classification matrix is used for each model of the pruned ensemble. Using this matrix, several candidate ensembles are obtained by iterative and incremental extraction of basic classifiers with the best performances.
Bhatnagar et al. [4] perform ensemble selection using a performance-based and diversity-based function that considers the individual performance of classifiers as well as the diversity between pairs of classifiers. A bottom-up search is performed to generate the sub-ensembles by adding diverse pairs of classifiers with high performance.
Simplifying a set of classifiers usually involves reducing the number of trees while maximizing performance. Qian et al. [38] treat this as a two-objective problem and solve it with an evolutionary Pareto optimization method combined with a local search operator. The method is applied in the field of mobile human activity recognition.
Based on the approximate ensembles, Guo et al. [22] propose a new framework for ensemble selection. In this context, the relationship between attributes in an approximate space is considered a priori as well as their degree of maximum dependence. This effectively reduces the search space and increases the diversity of selected sub-ensembles. Finally, to choose the appropriate sub-ensemble, an evaluation function that balances diversity and precision is used. The proposed method allows repetitively changing the search space of the relevant sub-ensembles and selecting the next sub-ensemble from a new search space.
Cavalcanti et al. [11] combine in pairs different matrices of diversity using a genetic algorithm. The combined diversity matrix is then used to group similar (not very diverse) models; they must not belong to the same ensemble. To generate candidate ensembles, the combined diversity matrix is transformed into one or more graphs and then a graph coloring technique is applied.
Guo et al. [23] propose a new metric using the margin (instances) and the diversity (of classifiers) to explicitly evaluate the importance of individual classifiers. By adding the models to the ensemble in decreasing order of the metric, the user can choose the first T models to form a sub-ensemble.
Dai et al. [13] emphasize the utility of optimizing predictive performance together with diversity, which are two indispensable and inseparable parameters for ensemble selection. There have been three measures proposed to simplify ensembles using a greedy algorithm: 1) The first measure simultaneously considers the difference (diversity) between the current subset and the candidate classifier and the performance of each one; 2) The second allows evaluating the diversity within the ensemble and; 3) the last measure reinforces the concern about the accuracy of the resulting sub-ensemble. Experimental results confirm the interest of the three measures which is illustrated by the improvement of performances.

IV. The DIACES Proposed Measure
Our goal is to construct a distribution of the number of errors associated with each case and to calculate the diversity of this distribution. We seek to maximize this diversity, that is, to minimize the concentration of errors, while maintaining good performance for each classifier.
The data set Ω is divided into two subsamples: $\Omega_L$ (generally 80% of Ω) for learning and pruning, and $\Omega_T$ (generally 20% of Ω) for testing. A bagging ensemble BE of $t$ C4.5 trees is constructed, $BE = \{T_1, \ldots, T_i, \ldots, T_t\}$, using $\Omega_L$ with $|\Omega_L| = n$. Each tree $T_i$ is represented by a vector $(x_{1i}, x_{2i}, \ldots, x_{ji}, \ldots, x_{ni})^T$. We have the following notations:
• $e_j$: The error rate associated with the tree
The evaluation function to optimize, noted S, connects the diversity $\theta_i$ and the error rate $e_j$.
The component C is a concentration index of the error distribution, derived from the quadratic entropy (or Gini index). The smallest value of C is 1/n, reached when all $x_{i+}$ have the same value. This situation is best conditioned by the value of X. Minimizing C + E, which combines two metrics (the error rate and the concentration index of the error distribution), gives a good compromise between the diversity of the trees and their average performance. C* and E* denote the normalized coefficients of C and E. Finally, the function to minimize is S = C* + αE*, where α is a parameter determined empirically by the user.
The selection algorithm, which returns a sub-ensemble Ψ_0 of BE, is the following:
Begin
  Initialize(Ψ_0);
  1. Calculate S(Ψ_0, Ω_L);
  If ∃ Ψ_j ∈ Neighborhood(Ψ_0) such that S(Ψ_j, Ω_L) < S(Ψ_0, Ω_L) Then
    Ψ_0 = argmin_{Ψ_j} S(Ψ_j, Ω_L);
    Goto 1;
End.
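Since the exact expressions of C, E, and their normalizations are not legible in this text, the sketch below only illustrates one reading that is consistent with the description: C is taken as the sum of squared shares of errors per individual (a Gini-type concentration whose minimum is 1/n when errors are spread evenly), and E as the mean error rate of the selected trees. The normalization step is omitted, and all names (`concentration_index`, `mean_error_rate`, `score`) are hypothetical.

```python
def concentration_index(errors, subset):
    """Assumed form of C: sum over individuals of (x_j+ / x_++) ** 2.

    errors[j][i] is 1 if tree i misclassifies individual j, else 0.
    """
    n = len(errors)
    row_totals = [sum(row[i] for i in subset) for row in errors]  # errors per individual
    total = sum(row_totals)                                        # all errors of the subset
    if total == 0:
        return 1.0 / n  # assumption: no error at all is treated as the evenly spread case
    return sum((r / total) ** 2 for r in row_totals)

def mean_error_rate(errors, subset):
    """Assumed form of E: average error rate of the selected trees."""
    n = len(errors)
    rates = [sum(row[i] for row in errors) / n for i in subset]
    return sum(rates) / len(rates)

def score(errors, subset, alpha=1.0):
    """Unnormalized stand-in for S = C* + alpha * E*."""
    return concentration_index(errors, subset) + alpha * mean_error_rate(errors, subset)

# 4 individuals (rows), 3 trees (columns), toy error matrix.
errors = [
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
]
s_01 = score(errors, [0, 1])  # errors on different individuals
s_02 = score(errors, [0, 2])  # errors overlap on individual 0
```

Under these assumptions, the pair of trees whose errors fall on different individuals gets the lower (better) score.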
The complexity of the algorithm follows from the complexity of the hill climbing search, which is $O(k^2)$, where $k$ is the number of classifiers in the ensemble. The function S is computed from a matrix with $n$ rows (the number of individuals in the validation set), so its complexity is $O(n)$. The calculation of the function is repeated $k'$ times, where $k'$ is the number of ensembles visited in the hill climbing scheme. The complexity of the proposed method is $O(n \cdot k' \cdot k^2)$.

A. Initialization and Path
The path strategy used by our method is a hill climbing strategy, whose principle is simple [44]. It reduces the number of ensembles generated compared with an exhaustive search, in which $2^k$ ensembles are explored ($k$ is the number of classifiers).
Hill climbing obtains a sub-optimal solution by going through subsets: it considers a set of states and selects the next state to be visited from the neighborhood of the current state. In this case, the states are the different ensembles of models, and the neighborhood of a subset Sub of BE (the set of all hypotheses) is composed of the ensembles constructed by adding a model to Sub (forward selection) or deleting a model from Sub (backward elimination). The method crosses the search space (all subsets of models) from one end to the other; one end is the empty set and the other is the set of all models. The complexity of hill climbing is $O(k^2)$.
In our case, we cross the search space by forward selection, and for the initialization we choose a tree that is neither very good nor very bad. Many methods that adopt hill climbing for ensemble selection start from the most accurate tree. We do not, because in some cases the most accurate tree is perfect (it makes no error) on the evaluation sample; the search then produces a subset with a single tree (the initialization tree), which can generalize very poorly.
The proposed solution consists in choosing a tree with "average" performance on the evaluation sample. First, the $k$ trees of the initial set BE are ordered by performance on the evaluation sample, in ascending or descending order. The ordered set BE is then decomposed into 3 subsets BE1, BE2, and BE3. Finally, the initialization tree is randomly selected from the subset BE2.
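The initialization step can be sketched as follows. The sizes of the three subsets are not specified in the text, so this sketch assumes (roughly) equal thirds; `pick_initial_tree` and the toy error rates are hypothetical.

```python
import random

def pick_initial_tree(error_rates, rng):
    """Pick a tree of 'average' quality: sort by error, draw from the middle third."""
    order = sorted(range(len(error_rates)), key=lambda i: error_rates[i])
    k = len(order)
    third = k // 3                      # assumed: three subsets of about equal size
    middle = order[third:k - third]     # BE2, the middle subset
    return rng.choice(middle)

# Error rates of 9 trees on the evaluation sample (toy values).
error_rates = [0.30, 0.10, 0.25, 0.05, 0.20, 0.40, 0.15, 0.35, 0.12]
initial_tree = pick_initial_tree(error_rates, random.Random(0))
```

With these values, the middle subset contains trees 6, 4, and 2, so neither the best tree (index 3) nor the worst (index 5) can be chosen as the starting point.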
According to experimental studies that we have carried out, the proposed multi-objective function directly affects the error rate in generalization (on the test sample Ω T ). In Fig. 3, the curve shows the correlation between the function and the error rate for the Ionosphere dataset.

V. Experiments
The experiments consist in building homogeneous ensembles by resampling the starting sample and using the C4.5 decision tree generation algorithm as the base rule.

A. Materials and Methods
The Weka platform [46] is used as the source of the C4.5 learning algorithm and for validation. For this purpose, we consider 24 benchmarks from the UCI Repository [2], which are described in Table I.
The initial sample decomposition is based on the results presented in [37], which concludes that, when simplifying a homogeneous ensemble, pruning on the learning sample itself gives better results than using separate samples. This is also an advantage when large amounts of data are not available.
The initial sample is subdivided into two subsamples: one composed of 80% of the individuals, used for learning (model generation) and evaluation (pruning), and the remaining 20%, used for testing.
The method we propose, DIACES, is compared with a set of diversity-based ensemble pruning methods: UWA, COM, and MARGIN, detailed in the state of the art. For all these methods, the unweighted majority vote is used for combining the models and calculating performance.
The methods use a forward selection strategy in a hill climbing scheme. The stopping criterion for the methods from the literature is the performance on the evaluation sample, which generates subsets of reduced size compared with the usual stopping criterion defined as a fixed number of models [35]. In our case, we use the same function for both the search and the stopping criterion.

B. Results and Analysis
A first criterion for comparing the different pruning methods is the performance of the subsets obtained. For each method and each dataset, 10 different draws are made, and the success rates obtained over the draws are averaged.
On the 24 benchmarks, DIACES shows the best performance in 15 cases, followed by the COM method with 6 victories, UWA with 3 victories, and finally MARGIN with two victories. For the average success rate over all datasets, DIACES is ranked first with an average rate of 80.48%, exceeding the other methods by at least 0.9%.
Compared to UWA, DIACES improves performance by about 3% on the Audiology, Breast cancer, Vehicle, Primary Tumor, and Titanic datasets. Moreover, DIACES improves the performance obtained by the COM and MARGIN methods by at least 0.2%, and by 4% on the Audiology data set.
To check whether the performance differences between DIACES and the other methods are significant, we use the sign test. The test assesses the effectiveness of the proposed method for any data collection. It takes the competing method achieving the best success rate and compares it to DIACES. Based on the results presented in Table II, we compare DIACES to COM, which gives an average success rate of 79.87%. The p-value obtained is below 0.05 (0.05 being the risk level), which implies that the performance difference between DIACES and COM is significant, and consequently that DIACES is better than all the other methods.
Depending on the average success rates, a rank is assigned to each method for each benchmark. The different methods are then compared according to their average ranks, which is an appropriate comparison criterion [14]. The ranks and average ranks of the methods are presented in Table III, where AVR represents the average rank of the generated sub-ensembles based on success rates. We note that the proposed method has the best average rank with 1.81, followed by the MARGIN method with 2.67; COM comes in third position with 2.67, and finally the UWA method with 2.81.
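The sign test used above can be sketched with an exact binomial computation. The text does not state whether the test is one-sided or two-sided, so the two-sided form is assumed here, with ties discarded beforehand; the function name and the win/loss counts in the example are hypothetical and are not taken from Table II.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Exact two-sided sign test under H0: wins and losses are equally likely.

    Ties between the two methods must be discarded before counting.
    """
    n = wins + losses
    k = min(wins, losses)
    # Probability of observing at most k successes out of n with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: one method beats the other on 8 of 10 datasets.
p_value = sign_test_p_value(wins=8, losses=2)
```

With these hypothetical counts the p-value is about 0.109, so the difference would not be significant at the 0.05 level; with more datasets the same win ratio can become significant.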
A second comparison criterion is the size of the ensemble obtained. For each benchmark, the average size of the subsets obtained over 10 draws is calculated. Table IV shows the average sizes (over 10 iterations) of the subsets obtained for each method. The number of models selected is reduced for all methods compared with the original size of the ensemble: the reductions vary between 93% and 97%.
The new method obtains smaller subsets than the other methods for 17 of the 24 data sets (about 71% of cases); UWA comes in second place with 6 victories, and finally the COM method counts one victory.
The study of algorithmic complexity makes it possible to calculate the quantity of resources (time or space) necessary for the execution of an algorithm. The time costs of the proposed method are measured on the 24 data sets (for each one, time and space are averaged over 10 iterations) on a machine processing $10^9$ instructions per second (1 GHz) with 3 gigabytes of memory. The hill climbing search is fast because it does not cover all possible combinations. The maximum time is observed for the cmc and credit-g data sets, for which the search lasts 5.12 ms; the minimum time is 0.15 ms, for the anneal data set. For the other data sets, the times vary between 0.5 ms and 3.7 ms. The 24 data sets have a total time of 0.82 seconds.

VI. Conclusion and Future Works
Ensemble methods improve the performance of an unstable classifier but have the disadvantage of reducing the readability of the model provided: composed of a large number of distinct trees, it is more difficult for humans to synthesize. This is why other methods have been proposed to synthesize not the results but the structure of a tree, in the form of a "consensus" built from a set of classifiers of the same type [42], [45], albeit with a deterioration in the quality of prediction.
In this paper, we presented a new evaluation function combining performance and diversity for selection in a homogeneous ensemble, used in a hill climbing search process. The method was evaluated on several benchmarks and compared with homogeneous ensemble pruning methods from the literature.
The results show that the proposed method obtains ensembles whose performance exceeds that of the ensembles obtained by the compared methods. In addition, the function can be used equally with homogeneous and heterogeneous ensembles (in which the base rules are not the same).
In future work, we plan to apply this new pruning measure to selection in a random forest ensemble, knowing that a random forest improves on the performance of bagging [7]. We also propose to study the possibility of using another path strategy for the selection. A third direction consists in using the method in the field of predicting student performance in educational institutions and comparing the results with those obtained in [24]. The authors of [48] use various classification methods (neural networks, k-nearest neighbors) separately for the diagnosis of breast cancer. We propose to use a heterogeneous ensemble composed of several classification methods for the same application. The ensemble will then be simplified using the proposed measure.