Multilayer Perceptron: Architecture Optimization and Training

The multilayer perceptron has a wide range of classification and regression applications in many fields, such as pattern recognition, voice and classification problems. However, the choice of architecture has a great impact on the convergence of these networks. In this paper we introduce a new approach to optimizing the network architecture. We solve the resulting model with a genetic algorithm and train the network with the back-propagation algorithm. The numerical results demonstrate the effectiveness of the theoretical results presented in this paper and the advantages of the new model over the previous model in the literature.


I. Introduction
In recent years, neural networks have attracted considerable attention, as they have proved essential in applications such as content-addressable memory, pattern recognition and optimization.
Learning, or training, of an artificial neural network (ANN) is equivalent to finding the values of all weights such that the desired output is generated for the corresponding input. It can be viewed as the minimization of an error function computed from the difference between the output of the network and the desired output over a set of training observations [1].
The multilayer perceptron (MLP) is the most widely used model in neural network applications and is trained with the back-propagation algorithm. The definition of the architecture of an MLP is a very relevant point: a lack of connections can leave the network with insufficient adjustable parameters to solve the problem, while an excess of connections may cause over-fitting of the training data [3]. This is especially true when a high number of layers and neurons is used, which is the case in this paper.
Optimizing the number of connections and hidden layers of a multilayer perceptron for a given problem remains one of the unsolved tasks in this research area. A multilayer perceptron consists of an input layer, an output layer and hidden layers between these two layers. The number of these layers depends on the problem [8]. In this work, we optimize the number of hidden layers and the number of neurons in each hidden layer, and we keep only a few connections in order to increase the speed and efficiency of the neural network. We model this neural architecture problem as a mixed-integer non-linear problem with non-linear constraints.
The next section presents and discusses related work on neural network architecture optimization. Section III describes artificial neural networks. In Section IV, we present the neural architecture optimization problem and propose a new model. Before concluding, experimental results are given in Section V.

II. Related Works
A number of approaches in the literature have addressed architecture optimization. This section describes only those works that are more or less similar to ours.
Global search can avoid convergence to a non-optimal solution and determine the optimum number of ANN hidden layers. Recently, some studies of the architecture optimization problem have been introduced in order to determine neural network parameters, but not optimally [3].
Traditional algorithms fix the neural network architecture before learning [4]. Other studies propose constructive learning [5]-[6], which begins with a minimal structure of hidden layers: the hidden layers are initialized with a minimal number of neurons. Most researchers treat the construction of the neural architecture (structure) without finding the optimal neural architecture [7].
T. B. Ludermir et al. [14] propose an approach that deals with few connections in one hidden layer and trains with different hybrid optimization algorithms.
In our previous work, we optimized the hidden layers by introducing one decision variable per layer [1], and in another work we optimized the hidden nodes in the layers; to train these two models we used back-propagation algorithms [2].

III. Units Feed-Forward Neural Networks for Pattern Classification
A data set for pattern classification consists of a number of patterns together with their correct classification. Each pattern consists of a number of measurements (i.e., numerical values).
The goal is to generate a classifier that takes the measurements of a pattern as input and provides its correct classification as output. A popular type of classifier is the feed-forward neural network (NN) [9].
A feed-forward NN consists of an input layer of neurons, an arbitrary number of hidden layers, and an output layer. A feed-forward NN for pattern classification has as many input neurons as the patterns of the data set have measurements, i.e., for each measurement there exists exactly one input neuron. The output layer has as many neurons as the data set has classes. Given the weights of all the neuron connections, a pattern is classified by providing its measurements as input to the input neurons and propagating the output signals from layer to layer until the output signals of the output neurons are obtained. Each output neuron is identified with one of the possible classes, and the output neuron that produces the highest output signal classifies the respective pattern.
The number of neurons in the input layer is therefore equal to the number of measurements of the pattern problem, and the number of neurons in the output layer is equal to the number of classes. The choice of the number of hidden layers, the number of neurons in each layer and the connections is called the architecture problem. Our main objective is to optimize it in order to obtain a suitable network with sufficient parameters and good generalization for a classification or regression task.
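The decision rule described here (the output neuron with the highest signal gives the class) can be illustrated with a minimal sketch. The function name `classify` and the sample output signals are hypothetical and chosen only for illustration.

```python
def classify(output_signals, class_names):
    """Return the class whose output neuron produces the highest signal."""
    if len(output_signals) != len(class_names):
        raise ValueError("one output neuron per class is expected")
    # Index of the output neuron with the largest signal.
    best = max(range(len(output_signals)), key=lambda j: output_signals[j])
    return class_names[best]


# Hypothetical output signals for one pattern of a three-class problem.
print(classify([0.12, 0.81, 0.34], ["Setosa", "Versicolor", "Virginica"]))
# prints "Versicolor"
```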

B. Back-propagation and Learning for the MLP
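Learning for the MLP is the process of adapting the connection weights in order to obtain a minimal difference between the network output and the desired output. For this reason, several algorithms are used in the literature, such as ant colony optimization [11], but the most widely used is back-propagation, which is based on gradient descent techniques [12].
Assume an input layer with $n_0$ neurons, $X = (x_1, x_2, \ldots, x_{n_0})$, and a sigmoid activation function
$$f(x) = \frac{1}{1 + e^{-x}} \qquad (1)$$
To obtain the network output we need to compute the output of each unit in each layer. Consider a set of hidden layers $(h_1, h_2, \ldots, h_N)$, and let $n_i$ be the number of neurons in hidden layer $h_i$.
The output of neuron $j$ in the first hidden layer is
$$h_1^j = f\left(\sum_{k=1}^{n_0} w_{k,j}^{0} \, x_k\right), \quad j = 1, \ldots, n_1 \qquad (2)$$
The outputs $h_i^j$ of the neurons in the other hidden layers are computed as follows:
$$h_i^j = f\left(\sum_{k=1}^{n_{i-1}} w_{k,j}^{i-1} \, h_{i-1}^k\right), \quad i = 2, \ldots, N, \quad j = 1, \ldots, n_i \qquad (3)$$
where $w_{k,j}^{i}$ is the weight between neuron $k$ in hidden layer $i$ and neuron $j$ in hidden layer $i+1$, and $n_i$ is the number of neurons in the $i$th hidden layer. The output of the $i$th hidden layer can be written as the vector
$$h_i = (h_i^1, h_i^2, \ldots, h_i^{n_i})^T \qquad (4)$$
The network outputs are computed by
$$y_j = f\left(\sum_{k=1}^{n_N} w_{k,j}^{N} \, h_N^k\right), \quad j = 1, \ldots, n_{N+1} \qquad (5)$$
$$Y = (y_1, \ldots, y_j, \ldots, y_{n_{N+1}}) = F(W, X) \qquad (6)$$
where $w_{k,j}^{N}$ is the weight between neuron $k$ in the $N$th hidden layer and neuron $j$ in the output layer, $n_N$ is the number of neurons in the $N$th hidden layer, $n_{N+1}$ is the number of neurons in the output layer, $Y$ is the vector of the output layer, $F$ is the transfer function and $W$ is the matrix of weights, defined as follows:
$$W = [W^0, \ldots, W^i, \ldots, W^N]$$
Here $X$ is the input of the neural network, $f$ is the activation function, $W^i$ is the matrix of weights between the $i$th hidden layer and the $(i+1)$th hidden layer for $i = 1, \ldots, N-1$, $W^0$ is the matrix of weights between the input layer and the first hidden layer, and $W^N$ is the matrix of weights between the $N$th hidden layer and the output layer. To simplify, we can take $n_i = n$ for all hidden layers.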
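The layer-by-layer computation of the network output can be sketched as follows. This is a minimal illustration with a sigmoid activation, not the authors' implementation; the names `layer_output` and `forward`, and the convention that `weights[k][j]` links neuron `k` of one layer to neuron `j` of the next, are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(weights, inputs):
    """Output of one layer; weights[k][j] links input k to neuron j."""
    n_out = len(weights[0])
    return [
        sigmoid(sum(weights[k][j] * inputs[k] for k in range(len(inputs))))
        for j in range(n_out)
    ]


def forward(weight_matrices, x):
    """Propagate x through the list of weight matrices, input side first."""
    signal = x
    for w in weight_matrices:
        signal = layer_output(w, signal)
    return signal


# Two inputs, one hidden layer of two neurons, one output neuron.
# With all weights at zero every neuron outputs sigmoid(0) = 0.5.
zero_hidden = [[0.0, 0.0], [0.0, 0.0]]
zero_output = [[0.0], [0.0]]
print(forward([zero_hidden, zero_output], [1.0, 2.0]))  # prints [0.5]
```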

IV. Proposed Model to Optimize the MLP Weights and Architectures
The definition of the MLP architecture depends on the choice of the number of layers, the number of hidden nodes in each of these layers, and the objective function. The approach introduced in this paper additionally controls all connections between layers and allows some of them to be deleted: a node that has no connections to a layer is deleted, and a layer in which all neurons have been deleted is itself deleted.
In this work, we assign to each connection a binary variable that takes the value 1 if the connection exists in the network and 0 otherwise. We also associate a binary variable with each hidden layer. The output of the neural network is computed by the following formulation:

A. Output of first hidden layer
The neurons of the first hidden layer are directly connected to the input layer (data layer) of the neural network.
The output for each neuron in the first hidden layer is calculated by:

B. Output for the hidden layer i = 2,…, N
To calculate the output of each neuron of hidden layer $i$, where $i = 2, \ldots, N$, we propose the rule:

C. Output for the neural network (layer N+1)
The output of the neural network is defined by the following expression:
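One way to picture the role of the binary connection variables is as a 0/1 mask applied to each weight before the weighted sum. The sketch below is an illustration under that assumption, not the paper's formulation: the names `masked_layer_output` and `masked_forward` and the argument `masks` are hypothetical, and the binary variables attached to hidden layers are not modelled.

```python
import math


def sigmoid(x):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def masked_layer_output(weights, mask, inputs):
    """Layer output where mask[k][j] = 1 keeps connection k -> j, 0 removes it."""
    n_out = len(weights[0])
    return [
        sigmoid(sum(mask[k][j] * weights[k][j] * inputs[k]
                    for k in range(len(inputs))))
        for j in range(n_out)
    ]


def masked_forward(weight_matrices, masks, x):
    """Forward pass in which each layer uses only its active connections."""
    signal = x
    for w, u in zip(weight_matrices, masks):
        signal = masked_layer_output(w, u, signal)
    return signal


# Two inputs feeding one neuron; the second connection is switched off,
# so only the first weight contributes to the output.
weights = [[[2.0], [3.0]]]
masks = [[[1], [0]]]
print(masked_forward(weights, masks, [1.0, 1.0]))
```

Setting every mask entry to 1 gives the ordinary fully connected forward pass.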

D. Objective function
As in the previous work [2], the objective function of the proposed model is the error between the obtained output and the desired output: (10) We propose to modify this objective function by adding a term for the connections, weighted by a regularization parameter, in order to control the variations of the weights during the training and optimization phases.
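To make the idea of an error term plus a regularized connection term concrete, here is one plausible form, stated as an assumption rather than as the paper's exact objective: the squared output error plus a parameter `beta` times the number of active connections. The names `squared_error`, `count_connections`, `penalized_objective` and `beta` are hypothetical.

```python
def squared_error(outputs, targets):
    """Sum of squared differences between obtained and desired outputs."""
    return sum((y - d) ** 2 for y, d in zip(outputs, targets))


def count_connections(masks):
    """Number of active connections over all layers (mask entries equal to 1)."""
    return sum(u for layer in masks for row in layer for u in row)


def penalized_objective(outputs, targets, masks, beta):
    """Assumed form: error term plus beta times the number of connections."""
    return squared_error(outputs, targets) + beta * count_connections(masks)


# One layer with three active connections out of four, beta = 0.5:
# error = (1 - 0)^2 + 0 = 1.0 and penalty = 0.5 * 3 = 1.5.
masks = [[[1, 0], [1, 1]]]
print(penalized_objective([1.0, 0.0], [0.0, 0.0], masks, beta=0.5))  # prints 2.5
```

A larger `beta` favours sparser networks at the cost of a higher error term.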

E. Constraints
• The first constraint guarantees the existence of the hidden layers. (11)
• These constraints ensure communication between neurons, connections and layers. (12) (13)
• The weight values are real numbers.
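The deletion rule stated at the start of this section (a neuron without connections is removed, and a layer without neurons is removed) can be checked on a connection mask as in the sketch below. It reflects one reading of that rule, in which a neuron counts as deleted when it has no active incoming connection; the names `active_neurons` and `layer_is_kept` are hypothetical.

```python
def active_neurons(mask):
    """Indices of neurons with at least one active incoming connection.

    mask[k][j] = 1 means the connection from neuron k of the previous
    layer to neuron j of this layer exists.
    """
    n_out = len(mask[0])
    return [j for j in range(n_out) if any(row[j] == 1 for row in mask)]


def layer_is_kept(mask):
    """A layer is kept only if at least one of its neurons is still active."""
    return len(active_neurons(mask)) > 0


# Neuron 1 of this three-neuron layer has no incoming connection.
mask = [[1, 0, 0],
        [0, 0, 1]]
print(active_neurons(mask))                # prints [0, 2]
print(layer_is_kept([[0, 0], [0, 0]]))     # prints False
```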

V. Implementation And Numerical Results
To illustrate the advantages of the proposed approach, we apply our algorithm to a widely used dataset, the Iris dataset for classification [13]. It consists of three target classes: Iris Setosa, Iris Virginica and Iris Versicolor. Each species contains 50 data samples. Each sample has four real-valued features: sepal length, sepal width, petal length and petal width. By doing this, all features have the same contribution to Iris classification.
For this task, we use an MLP with a sigmoid activation function, trained by back-propagation after an optimal architecture has been obtained by the genetic algorithm. In this section we present the parameter settings, the implementation procedure and the final results.
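The back-propagation training step can be sketched for a small sigmoid network. For brevity the example uses a single hidden layer and the squared error, whereas the network studied here has several hidden layers. The function names, the learning rate `lr` and the numeric weights are hypothetical.

```python
import math


def sigmoid(x):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, x):
    """Return (hidden outputs, network outputs) for input x."""
    hidden = [sigmoid(sum(w_hidden[k][j] * x[k] for k in range(len(x))))
              for j in range(len(w_hidden[0]))]
    output = [sigmoid(sum(w_out[k][j] * hidden[k] for k in range(len(hidden))))
              for j in range(len(w_out[0]))]
    return hidden, output


def backprop_step(w_hidden, w_out, x, target, lr=0.5):
    """One gradient-descent update on E = 0.5 * sum((y - d)^2).

    Updates the weights in place and returns the error before the update.
    """
    hidden, output = forward(w_hidden, w_out, x)
    # Output deltas: derivative of E with respect to each output net input.
    delta_out = [(y - d) * y * (1.0 - y) for y, d in zip(output, target)]
    # Hidden deltas: propagate output deltas back through w_out (old values).
    delta_hidden = [
        h * (1.0 - h) * sum(w_out[k][j] * delta_out[j]
                            for j in range(len(delta_out)))
        for k, h in enumerate(hidden)
    ]
    for k in range(len(hidden)):
        for j in range(len(delta_out)):
            w_out[k][j] -= lr * delta_out[j] * hidden[k]
    for k in range(len(x)):
        for j in range(len(hidden)):
            w_hidden[k][j] -= lr * delta_hidden[j] * x[k]
    return 0.5 * sum((y - d) ** 2 for y, d in zip(output, target))


# Train on a single hypothetical pattern and compare first and last errors.
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [[0.5], [-0.5]]
errors = [backprop_step(w_hidden, w_out, [1.0, 0.0], [1.0]) for _ in range(50)]
print(errors[0] > errors[-1])
```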

A. Parameters setting
We use the genetic algorithm to solve the architecture optimization problem. To this end, we encode each individual as one chromosome; the fitness of each individual depends on the value of the objective function, which has two terms: one for the network error and one for the network connections.
The individuals of the initial population are randomly generated: the binary variables associated with the connections and the layers take the value 0 or 1, and the weights take random values in a given range. After creating the initial population, each individual is evaluated and assigned a fitness value according to the fitness function.
The fitness suggested in our work is the following function: (14) We then apply the crossover and mutation operators: in this step, new individuals, called children, are created from individuals selected from the population, called parents, in order to further explore the solution search space.
Table I presents all parameters used in our experiments.
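The encoding and the genetic operators can be illustrated with a minimal sketch. Several choices below are assumptions, since the text does not specify them: a chromosome is taken to be the binary connection genes followed by the real-valued weights, the crossover is one-point, the mutation flips bits and perturbs weights with Gaussian noise, and the weight range `(-1.0, 1.0)`, the mutation `rate` and `sigma` are hypothetical values.

```python
import random


def random_individual(n_connections, rng, weight_range=(-1.0, 1.0)):
    """Chromosome: n binary connection genes followed by n real weights."""
    bits = [rng.randint(0, 1) for _ in range(n_connections)]
    weights = [rng.uniform(*weight_range) for _ in range(n_connections)]
    return bits + weights


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents at a random cut point (assumed operator)."""
    point = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(individual, n_connections, rate, rng, sigma=0.1):
    """Flip binary genes and perturb weights, each with probability `rate`."""
    child = list(individual)
    for i in range(len(child)):
        if rng.random() < rate:
            if i < n_connections:
                child[i] = 1 - child[i]          # binary gene: flip 0 <-> 1
            else:
                child[i] += rng.gauss(0.0, sigma)  # weight gene: small noise
    return child


rng = random.Random(0)  # fixed seed so the run is reproducible
parent_a = random_individual(4, rng)
parent_b = random_individual(4, rng)
child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
print(mutate(child_a, 4, rate=0.1, rng=rng))
```

In a full loop, each child would be scored with the fitness function and the best individuals kept for the next generation.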

B. Results for the optimization methodology
After determining the optimal number of hidden layers, in this case three, and the total number of connections, we use an architecture that contains four neurons in each layer, and we initialize the neural network with the weight values obtained by the genetic algorithm. Our algorithm was tested on instances of the Iris data.
Table II presents the classification results obtained on the training and testing data. The proposed method correctly classifies all the training data except one Versicolor sample, and all the testing data except two samples of that same class.
These results on the testing data show that our method performs well, since all the testing data were correctly classified except two; the misclassified elements are from the Versicolor class. From the tables above we can see that the proposed method achieves a higher average classification accuracy rate while using fewer connections than the existing methods. We can conclude that the approach proposed in this paper gives better results than the neural methods that optimize the neurons of the hidden layers.

VI. Conclusion
A model is developed to optimize the architecture of artificial neural networks. The genetic algorithm is especially appropriate for obtaining the optimal solution of the nonlinear problem. This method is tested to determine the optimal number of hidden layers and connection weights in the multilayer perceptron and the most favorable weight matrix after training. We have proposed a new model of the multilayer perceptron architecture optimization problem as a mixed-integer problem with constraints. On the Iris data, the results obtained demonstrate the good generalization of the resulting neural network architectures. In conclusion, the optimal architecture of an artificial neural network can play an important role in the classification problem. The proposed model can also be solved with other metaheuristics, and we are going to try it on many other databases for real problems: Diabetes, Thyroid, Cancer, etc.

Mohamed Ettaouil holds a Doctorate Status in Operational Research and Optimization, FST University Sidi Mohamed Ben Abdellah (USMBA), Fez, and a PhD in Computer Science from the University of Paris 13, Galilee Institute, Paris, France. He is a professor at the Faculty of Science and Technology of Fez (FST), where he was responsible for the research team in modelization and pattern recognition, operational research and global optimization methods. He was the Director of the Unit Formation and Research (UFR): Scientific computing and computer science, Engineering Sciences. He is also responsible for the research team in Artificial Neural Networks and Learning, modelization and engineering sciences, FST Fez. He is an expert in the fields of modelization, optimization and engineering sciences.