A Probability-based Evolutionary Algorithm with Mutations to Learn Bayesian Networks

— Bayesian networks are regarded as one of the essential tools to analyze causal relationships between events from data. Learning the structure of highly reliable Bayesian networks from data as quickly as possible is an important problem that several studies have tried to solve. In recent years, probability-based evolutionary algorithms have been proposed as a new efficient approach to learning Bayesian networks. In this paper, we focus on one of the probability-based evolutionary algorithms, PBIL (Population-Based Incremental Learning), and propose a new mutation operator. Through performance evaluation, we found that the proposed mutation operator performs well in learning Bayesian networks.


I. INTRODUCTION
Bayesian networks are a well-known probabilistic model that represents causal relationships among events, and they have been applied to many areas such as bioinformatics, medical analysis, document classification, information search, decision support, etc. Recently, due to several useful tools for constructing Bayesian networks, and also due to the rapid growth of computing power, Bayesian networks have come to be regarded as one of the promising analytic tools for detailed analyses of large data sets in a variety of important study areas.
To learn a near-optimal Bayesian network structure from a set of target data, an efficient optimization algorithm is required that searches an exponentially large solution space for near-optimal structures, as this problem has been proved to be NP-hard [1]. To find better Bayesian network structures in less time, several efficient search algorithms have been proposed. Cooper et al. proposed a well-known deterministic algorithm called K2 [2] that searches for near-optimal solutions by applying a constraint on the order of events. For the general case without the order constraint, several approaches have been proposed, many of which use genetic algorithms (GAs) to find good Bayesian network structures within a reasonable time [3][4][5]. However, because we now face increasingly large data sets, more efficient algorithms to find better Bayesian network models are needed.
To meet this requirement, a new category of algorithms called EDAs (Estimation of Distribution Algorithms) has recently been reported to provide better performance in learning Bayesian networks. An EDA is a kind of genetic algorithm that evolves statistical distributions used to produce individuals over generations. There are several types of EDAs, such as UMDA (Uni-variate Marginal Distribution Algorithm) [12], PBIL (Population-Based Incremental Learning) [7], and MIMIC (Mutual Information Maximization for Input Clustering) [13]. According to the results of Kim et al. [11], PBIL-based algorithms appear to be the most suitable for learning Bayesian networks.
The first PBIL-based algorithm for Bayesian networks was presented by Blanco et al. [9]; it learns good Bayesian networks within a short time. However, because this algorithm does not include mutation, it easily falls into local minimum solutions. To avoid converging at local minima, Handa et al. introduced a bitwise mutation into PBIL and showed that the mutation operator improved the quality of solutions in the four-peaks problem, the Fc4 function, and the max-sat problem [10]. Although this operator was not applied to Bayesian networks, Kim et al. later proposed a new mutation operator, transpose mutation, specifically for Bayesian networks, and compared the performance of EDA-based Bayesian network learning with several mutation variations including bitwise mutation [11].
In this paper, we propose a new mutation operator called probability mutation for PBIL-based Bayesian network learning. Through evaluation, we show that our new mutation operator is also effective in finding good Bayesian network structures.
The rest of this paper is organized as follows: In Section 2, we give the basic definitions of Bayesian networks and describe related work in this area of study. In Section 3, we propose a new mutation operator called probability mutation to achieve better learning performance of Bayesian networks. In Section 4, we describe the evaluation results, and finally we conclude this paper in Section 5.

To evaluate the quality of Bayesian network models, we use AIC [6], which is one of the best-known criteria used for Bayesian networks. Formally, the problem of learning Bayesian networks that we consider in this paper (Problem 1) is, given a set of observations over events X_1, X_2, ..., X_N, to find the Bayesian network structure that optimizes the criterion value.

B. K2 Algorithm

K2 [2] is one of the most widely used traditional algorithms to learn Bayesian network models. Note that searching for good Bayesian network models is generally time consuming because the problem of learning Bayesian networks is NP-hard [1]. K2 reduces the running time by limiting the search space through the constraint of a total order of events: for a given order of events X_1, X_2, ..., X_N, an edge is allowed only from an earlier event to a later one. Note that this constraint is suitable for some cases: if events have times of occurrence, an event X_k that occurred later than X_l cannot be a cause of X_l. Several practical scenarios fit this case. The process of the K2 algorithm applied to a set of events X_1, X_2, ..., X_N with the constraint order X_1, X_2, ..., X_N is described as follows:
(1) Select the best structure using the two events X_N and X_{N-1}. Here, the two structures, i.e., X_{N-1} → X_N and the independent case, are the candidates, and the one with the better criterion value is selected.
(2) Add X_{N-2} to the structure. Namely, select the best structure from every possible case where X_{N-2} has edges connected to the events already in the structure.
(3) Repeat step (2) to add the remaining events to the structure in the order X_{N-3}, ..., X_1.
However, in many practical cases such an order of events is not available in advance. In such cases, we have to tackle the NP-hard problem using a heuristic algorithm for approximate solutions.
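As an illustration, the greedy search behind K2 can be sketched as follows. This is a minimal sketch rather than the authors' implementation: `local_score` stands for any decomposable criterion evaluated per family (e.g., a per-node AIC term, higher is better), and `max_parents` is an assumed cap on the parent-set size.

```python
def k2_search(order, local_score, max_parents=3):
    """Greedy K2-style structure search (sketch).

    order: events listed so that an edge may only go from an earlier
           event to a later one (the K2 total-order constraint).
    local_score: callable (node, frozenset_of_parents) -> float,
                 higher is better.
    Returns a dict mapping each node to its chosen parent set.
    """
    parents = {x: frozenset() for x in order}
    for pos, node in enumerate(order):
        candidates = set(order[:pos])   # only predecessors may be parents
        best = local_score(node, parents[node])
        improved = True
        while improved and len(parents[node]) < max_parents and candidates:
            improved = False
            # try adding each remaining candidate parent, keep the best
            scored = [(local_score(node, parents[node] | {c}), c)
                      for c in candidates]
            s, c = max(scored)
            if s > best:                # greedy: accept only improvements
                best, parents[node] = s, parents[node] | {c}
                candidates.discard(c)
                improved = True
    return parents

# toy, hand-crafted score for illustration (hypothetical): it rewards
# exactly the families of the chain A -> B -> C
def _toy_score(node, ps):
    truth = {"A": frozenset(), "B": frozenset({"A"}), "C": frozenset({"B"})}
    return 1.0 if ps == truth[node] else 0.0

result = k2_search(["A", "B", "C"], _toy_score)  # recovers the chain
```

With a real criterion, `local_score` would be computed from the observation counts of each node given its candidate parents.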

C. Related Work for Un-ordered Bayesian Network Models
Even for cases where the order constraint is not available, several approaches to learn Bayesian network models have been proposed. One of the most basic methods is to use K2 with random orders, where randomly generated orders are applied repeatedly to K2 to search for good Bayesian network models.
As more sophisticated approaches, several ideas have been proposed. Hsu et al. proposed a method that applies K2 to orders evolved by a genetic algorithm [3]. Barrière et al. proposed an algorithm that evolves Bayesian network models based on a variation of genetic algorithms called co-evolving processes [4]. Tonda et al. proposed another variation of genetic algorithms that applies a graph-based evolution process [5]. However, the performance of these approaches seems to be limited, and a new paradigm of algorithms that learn Bayesian networks more efficiently is strongly required.

D. Population-Based Incremental Learning
Recently, a category of evolutionary algorithms called EDAs (Estimation of Distribution Algorithms) has appeared and has been reported to learn Bayesian network models efficiently. As one of the EDAs, PBIL [7] was proposed by Baluja et al. in 1994; it is based on genetic algorithms but is designed to evolve a probability vector. Later, Blanco et al. applied PBIL to Bayesian network learning and showed that PBIL works efficiently on this problem [9].
In PBIL, an individual s is defined as a vector of L elements v_1, v_2, ..., v_L, each of which takes the value 0 or 1. Let P = (p_1, p_2, ..., p_L) be the probability vector, where p_i represents the probability that v_i = 1. Then, the algorithm of PBIL is described as follows:
(1) As initialization, we let p_i = 0.5 for every i.
(2) Generate a set S that consists of C individuals according to P. Namely, element v_i of each individual is set to 1 with the corresponding probability p_i.
(3) Select the set S' of the top individuals in S according to their evaluation values.
(4) Update the probability vector as p_i^new = (1 - α) p_i + α ratio_i(S'), where p_i^new is the updated value of the new probability vector P^new (P is soon replaced with P^new), ratio_i(S') is the ratio of individuals in S' that include link i (i.e., v_i = 1), and α is the parameter called the learning ratio.
(5) Repeat steps (2)-(4).
By merging the top individuals, PBIL evolves the probability vector so that good individuals are more likely to be generated. Unlike other genetic algorithms, PBIL does not include "crossover" between individuals. Instead, it evolves the probability vector as a "parent" of the generated individuals.
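The steps above can be sketched in Python as follows. This is a minimal sketch under assumed parameter values; the elite size `top` plays the role of the selected set S', and the OneMax fitness (count of 1s) is only a stand-in objective.

```python
import random

def pbil(fitness, L, C=50, top=10, alpha=0.2, generations=100, seed=0):
    """Minimal PBIL loop (sketch): evolves a probability vector P of
    length L instead of crossing over individuals.

    fitness: callable(list_of_bits) -> float, higher is better.
    Returns the best individual found and its fitness.
    """
    rng = random.Random(seed)
    P = [0.5] * L                                   # step (1): initialize
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # step (2): sample C individuals from P
        S = [[1 if rng.random() < p else 0 for p in P] for _ in range(C)]
        # step (3): keep the `top` best individuals as S'
        S.sort(key=fitness, reverse=True)
        elite = S[:top]
        if fitness(elite[0]) > best_fit:
            best, best_fit = elite[0], fitness(elite[0])
        # step (4): p_i <- (1 - alpha) * p_i + alpha * ratio_i(S')
        for i in range(L):
            ratio_i = sum(ind[i] for ind in elite) / top
            P[i] = (1 - alpha) * P[i] + alpha * ratio_i
    return best, best_fit            # step (5): loop above repeats (2)-(4)

# usage: maximize the number of 1s (OneMax) over 20 bits
b, f = pbil(sum, L=20)
```

Note that the probability vector, not any individual, carries information between generations, which is exactly the "parent" role described above.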

III. PBIL-BASED BAYESIAN NETWORK LEARNING
In this section, we present a PBIL-based algorithm to learn Bayesian network models to which we apply a new mutation operator. Since our problem (i.e., Problem 1) to learn Bayesian networks is a little different from the general description of PBIL shown in the previous section, a little adjustment is required.
In our algorithm, an individual represents a Bayesian network model: element v_ij of an individual takes the value 1 if the model includes the edge from X_i to X_j, and the probability vector P consists of the corresponding probabilities p_ij. The process of the proposed algorithm basically follows the steps of PBIL; for example, step (2) generates S as a set of C individual models according to P. As in PBIL, the proposed algorithm evolves the probability vector so that better individual models are more likely to be generated. However, there is a point specific to Bayesian networks: a Bayesian network model is not allowed to have cycles. To handle this point in our algorithm, step (2) is detailed as follows:
(2a) Generate a random order of the candidate edges (i, j).
(2b) The edges (i, j) in the order are processed one by one; each edge is added to the model with probability p_ij unless it creates a cycle.
These steps enable us to treat the problem of learning good Bayesian network models within the framework of PBIL. Note that checking for cycle creation in step (2b) can be done efficiently using a simple table that manages the taboo edges, i.e., the edges that would create cycles if they were added to the model.
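Steps (2a)-(2b) can be sketched as follows. This is a minimal sketch: for brevity it uses a DFS reachability check in place of the taboo-edge table mentioned above, which gives the same acyclicity guarantee at a higher per-edge cost.

```python
import random

def sample_dag(P, rng):
    """Step (2) specialized to Bayesian networks (sketch): candidate
    edges (i, j) are visited in a random order and each is added with
    probability P[i][j], skipping any edge that would close a cycle.

    P: N x N matrix of edge probabilities (diagonal entries ignored).
    Returns an adjacency dict {i: set of children of i}.
    """
    N = len(P)
    adj = {i: set() for i in range(N)}

    def reaches(src, dst):              # DFS: is dst reachable from src?
        stack, seen = [src], set()
        while stack:
            u = stack.pop()
            if u == dst:
                return True
            if u not in seen:
                seen.add(u)
                stack.extend(adj[u])
        return False

    edges = [(i, j) for i in range(N) for j in range(N) if i != j]
    rng.shuffle(edges)                  # (2a): random processing order
    for i, j in edges:                  # (2b): add unless it creates a cycle
        if rng.random() < P[i][j] and not reaches(j, i):
            adj[i].add(j)               # j cannot reach i, so i->j is safe
    return adj

rng = random.Random(1)
dag = sample_dag([[0.9] * 4 for _ in range(4)], rng)  # always acyclic
```

Because each edge is tested against the partially built model, the sampled individuals are guaranteed to be valid DAGs regardless of the values in P.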

A. Mutation Operators
Note that the algorithm introduced in the previous section does not include a mutation operator. Thus, naturally, it easily converges to a local minimum solution. Actually, the PBIL-based algorithm to learn Bayesian networks proposed by Blanco et al. [9] stops when the solution converges to a minimal solution, i.e., when the score does not improve for the most recent k generations. However, local minimum solutions prevent us from searching for better solutions, and thus they should be avoided.
To avoid converging to local minimum solutions and to improve the performance of the algorithm, several mutation operations are typically inserted between steps (2) and (3). The most popular mutation operator is bitwise mutation (BM), introduced by Handa [10], which applies mutations to each link in each individual, as described in the following step:
BM: For each individual in S generated in step (2), we flip each edge with probability p_mut. Namely, for each pair of nodes i and j (1 ≤ i, j ≤ N, i ≠ j), the value of v_ij is flipped with probability p_mut.
The other mutation operator we try in this paper is transpose mutation (TM), introduced by [11]. This operator is based on the observation that reversed edges frequently appear in the solutions. To avoid this, transpose mutation changes the direction of edges in the individuals produced in each generation. The specific operation inserted between steps (2) and (3) is the following.
TM: For each individual in S generated in step (2), with probability p_mut, we reverse all edges in the individual, namely, v_ij ↔ v_ji for all i and j.
In contrast to these conventional mutations, our new mutation operator called probability mutation (PM) does not manipulate the individuals produced in each generation. Instead, it manipulates the probability vector P itself so as to generate better individuals in the next generation, and it is inserted between steps (4) and (5).
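The three operators can be sketched as follows. BM and TM follow the descriptions above directly; the PM body is an assumption on our part, since its exact update is not restated here, and it follows Baluja's original probability-vector mutation (shift each entry toward a random endpoint), with `shift` as a hypothetical step-size parameter.

```python
import random

def bitwise_mutation(ind, p_mut, rng):
    """BM: flip each edge variable v_ij independently with prob. p_mut."""
    return [1 - v if rng.random() < p_mut else v for v in ind]

def transpose_mutation(adj_matrix, p_mut, rng):
    """TM: with probability p_mut, reverse every edge (v_ij <-> v_ji),
    i.e., transpose the adjacency matrix of the individual."""
    if rng.random() < p_mut:
        return [list(row) for row in zip(*adj_matrix)]
    return adj_matrix

def probability_mutation(P, p_mut, shift, rng):
    """PM (sketch, assumed form): perturb the probability vector P
    between steps (4) and (5) instead of touching individuals. Each
    entry moves toward a random endpoint (0 or 1) with prob. p_mut."""
    out = []
    for p in P:
        if rng.random() < p_mut:
            target = rng.choice((0.0, 1.0))
            p = (1 - shift) * p + shift * target
        out.append(p)
    return out
```

The key structural difference is visible in the signatures: BM and TM consume sampled individuals, while PM consumes and returns the probability vector itself.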

IV. EVALUATION

A. Methods
In order to reveal the effectiveness of PBIL-based algorithms, we first evaluate the PBIL-based algorithm with probability mutation in comparison with K2 whose constraint (i.e., the order of events) is evolved with a genetic algorithm (K2-GA), which is a representative method among traditional approaches to learn Bayesian networks. In this conventional algorithm, we repeatedly create Bayesian network models, in which the constraints (i.e., the orders of nodes) are continuously evolved with a typical genetic algorithm over generations, and output the best score computed so far. The results are described in Sec. IV-B.

We next compare the performance of the three mutation operators BM, TM, and PM applied to the PBIL-based algorithm. With this evaluation, we show that the new mutation operator PM proposed in this paper performs well. The results are described in Sec. IV-C.

In our experiments, we use the Alarm Network [8] shown in Fig. 6, which is a Bayesian network model frequently used as a benchmark problem in this area of study. We create a set of 1000 observations according to the structure and the conditional probabilities of the Alarm Network, and then learn Bayesian network models from the observations using the two algorithms. As the evaluation criterion, we use AIC, one of the representative criteria in this area. Namely, we compare AIC values in order to evaluate how good the learned Bayesian network models are.
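For reference, the AIC criterion used in the comparison can be sketched for discrete data as follows. This is a minimal sketch of the standard form AIC = -2 log L + 2k (lower is better), not necessarily the exact implementation used in the experiments.

```python
import math
from collections import Counter

def aic_score(data, parents):
    """AIC of a Bayesian network structure over discrete data (sketch).

    data: list of observations, each a dict {variable: value}.
    parents: dict mapping each variable to a tuple of its parents.
    Returns -2 * (maximized log-likelihood) + 2 * (free parameters).
    """
    log_l, k = 0.0, 0
    for var, pa in parents.items():
        # counts of (parent configuration, value) and parent configuration
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        for (cfg, _val), n in joint.items():
            log_l += n * math.log(n / marg[cfg])  # MLE conditional log-lik.
        r = len({row[var] for row in data})       # number of states of var
        q = max(1, len(marg))                     # parent configurations
        k += (r - 1) * q                          # free parameters per family
    return -2 * log_l + 2 * k
```

Because the score decomposes per family (each variable with its parents), it fits directly into greedy searches such as K2 and into the per-generation evaluation of PBIL individuals.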

B. Result 1: Performance of PBIL-based Algorithms
The first result is shown in Fig. 7, which indicates the AIC score of the best Bayesian network model found as the generations grow. In this figure, the AIC score of the original Alarm Network, which is the optimal score, is denoted by "BASE MODEL." The proposed algorithm with probability mutation (represented as PBIL in the figure) converges to the optimal score as time passes, whereas K2-GA stops improving at an early stage. We can conclude that the performance of the PBIL-based algorithm is better than that of the conventional algorithm in that the PBIL-based algorithm computes better Bayesian network models for the time taken in execution. Note that the running time per generation of the proposed method is far shorter than that of K2-GA; the difference is more than 250 times in our implementation. Figs. 8 and 9 show the performance of the proposed algorithm with varied learning ratio α and mutation probability p_mut over 10,000 generations. These results show that the performance of the proposed method depends on α and p_mut, which indicates that these values should be chosen with care to improve the performance of the proposed algorithm. Note that, from these results, we obtained the best-performance values α = 0.2 and p_mut = 0.002, which are used as the default values in our experiments.

C. Result 2: Comparison of Mutation Variations
We further compared the performance of the PBIL-based algorithm with the three mutations: bitwise mutation (BM), transpose mutation (TM), and probability mutation (PM). For this experiment, we carefully chose the mutation probability of each method through preliminary experiments. For BM, we examined the performance with mutation probabilities in the range [0.001, 0.2] and chose the value with the best performance, 0.005. For TM, we similarly examined mutation probabilities in the range [0.05, 0.5] and chose 0.1 as the best value. For PM, from the result shown in Fig. 9, we chose the mutation probability 0.002, the same value as in our first result shown in Fig. 7.
The result is shown in Fig. 10. We see that BM and PM continue improving as generations pass, whereas TM stops improving at an early stage. Also, we see that the curves of BM and PM differ slightly: BM reaches better scores in the early stage, while PM outperforms BM in the late stage. This result shows that the newly proposed mutation operator PM is useful especially in long-term learning of Bayesian network models under PBIL-based algorithms.

V. CONCLUSION
In this paper, we reviewed the literature on PBIL-based learning of Bayesian network models and proposed a new mutation operator called probability mutation that manipulates the probability vector of PBIL. Through evaluation of these algorithms, we found that (i) the PBIL-based algorithm outperforms K2-based traditional algorithms through long-term continuous improvement, and (ii) probability mutation works well under PBIL-based algorithms especially in long-term computation to obtain high-quality Bayesian network models. Designing more efficient search algorithms based on EDAs is one of the most attractive future tasks.