Deep Belief Network and Auto-Encoder for Face Classification

Deep Learning models have drawn ever-increasing research interest owing to their intrinsic capability to overcome the drawbacks of traditional algorithms. Hence, we have adopted two representative Deep Learning methods, Deep Belief Network (DBN) and Stacked Auto-Encoder (SAE), to initialize deep supervised Neural Networks (NN), in addition to Back Propagation Neural Networks (BPNN), applied to face


I. Introduction
Recently, many applications have come to rely on face classification. Examples include a security system that grants access only to people who are members of a certain group, or a surveillance system that alerts law enforcement agencies to the presence of a person who belongs to, or is linked with, an international terrorist group. Each of these applications relies on the integration of a face classification system. This article is devoted to the problem of face classification, which is based on a feature extraction step and a classification step [1] [32]. The objective of the feature extraction procedure is to extract invariant features that represent the face information. Many different kinds of features have been used in previous works, such as the hand-crafted features given by the Local Binary Pattern (LBP) [2] [3] [31], the Discrete Cosine Transform (DCT) [4] and the Discrete Wavelet Transform (DWT) [5]. With the extracted features, general machine learning algorithms such as the Support Vector Machine (SVM) [6] and the Artificial Neural Network [36] can be used to discriminate between faces and non-faces.
Nowadays, Deep Learning has become one of the most powerful subfields of machine learning. It focuses on learning deep hierarchical models, which are widely applied in a variety of areas such as speech recognition, pattern recognition and computer vision. Many Deep Learning algorithms have been proposed and successfully applied, such as the Convolutional Neural Network (CNN), which has been used to extract distinctive features from face patterns and to classify them [7] [8]. Restricted Boltzmann Machines (RBMs) are usually used as feature extractors for another algorithm or to provide a good initialization for deep feed-forward neural network classifiers. Deep Belief Networks (DBN) and Deep Boltzmann Machines (DBM) are powerful techniques for pattern recognition tasks [9] [10]. In the latest progress of Deep Learning, researchers have achieved new records in face verification by exploiting different CNN structures [11], [8], [12], [13], [14].
In this paper, we propose a new contribution based on Deep Learning models for the face classification task. We investigate two different kinds of Deep Learning architectures, DBN and Stacked Auto-Encoder (SAE), in addition to Back Propagation Neural Networks (BPNN), in order to explicitly capture various latent facial features. These representations offer several advantages over those obtained through hand-crafted descriptors, because they can capture higher-order statistics such as corners and contours.
The rest of the paper is organized as follows. In Section II, we briefly describe some Deep Learning architectures, namely DBN and SAE, as well as BPNN. The proposed approach is presented in Section III. The experimental results of the proposed contribution, along with a comparative analysis, are discussed in Section IV. Finally, we draw conclusions and give avenues for future work in Section V.

A. Deep Learning
Deep Learning is one of the most powerful branches of the wider machine learning field, focused on learning representations and abstractions of data. The idea behind Deep Learning grew out of the study of Artificial Neural Networks (ANNs) [15]. Such a network applies a large number of successive transformations, making it possible to discover increasingly abstract representations. These transformations represent the data at different levels of abstraction: the first layer corresponds to the raw data, the last layer corresponds to the outputs, and each successive layer uses the output of the previous layer as input. The core philosophy of the Deep Learning framework is to let the network directly learn useful feature representations while being trained end to end on the prediction task, which has helped Deep Learning set new records on many vision tasks. There are various Deep Learning architectures, such as CNNs, DBN and SAE. These architectures are often built with a greedy layer-by-layer technique.

B. Feature Learning
Feature learning [16] is a set of methods that allows a system to automatically learn, from raw data, the representations required for feature detection or classification, which in turn influence the final result. This technique replaces hand-crafted descriptors such as LBP [2] [3], DCT [4] and DWT [5]: it allows a machine to learn the features and then use them to carry out a particular task such as classification, detection or recognition. Feature learning generally gives satisfactory results and yields a good classification rate for face classification. Therefore, the most important function of Deep Learning is to pick up the most pertinent features from a large amount of training data, in order to build a machine learning model with several hidden layers that improves classification accuracy.

1) Back-Propagation Neural Network
A Back Propagation Neural Network (BPNN) is a multilayered, feed-forward Neural Network [23]. It is a kind of Neural Network (NN) architecture trained with the Back Propagation (BP) algorithm, an efficient gradient descent algorithm that is of great importance for NNs. BP is a popular technique known for its accuracy, because it learns and improves iteratively and can therefore achieve a higher precision [17]. The principle of this algorithm is to optimize the parameters of the NN. When a result is obtained, the classification error is calculated. This error is then propagated back from one layer to the next, starting from the output layer, so that the weights can be modified according to the obtained error; the process is repeated until the minimal Mean Squared Error (MSE) is reached. In brief, BP is a training algorithm that includes two processes [19] [28]. The first feeds the values forward to generate the output. The second calculates the error between the output and the target output, then propagates it back to the earlier layers in order to update the weights and biases. The BPNN is used as a classifier mainly because of its capacity to produce complex decision boundaries in the feature space [29]. Different activation functions can be used in a BPNN. Among the most prevalent is the sigmoid, expressed by (1):

$$f(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

The BPNN comprises three layers: an input layer, a hidden layer and an output layer. Each layer has a specific number of neurons, depending on the complexity of the problem. Fig. 1 illustrates the structure of a standard BPNN.
To adjust the weight of one neuron j, we used equations (2) and (3), in which the output error of neuron j in the output layer is defined by (4), the output of neuron j is given by (5), and α is the learning rate. We took only the example of one neuron j, but the same procedure is applied to all neurons. According to the experiments, the optimal parameters of the BPNN are reported in Table I.
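The two BP processes described above (feed-forward, then backward propagation of the error) can be illustrated with a short sketch. It is a minimal example under stated assumptions, not necessarily the implementation used in the experiments: it assumes the standard delta rule for a sigmoid network with one hidden layer and one output neuron, and the names `train_bpnn`, `forward`, `n_hidden` and the default values of `alpha` and `epochs` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def forward(x, w_hid, w_out):
    """Feed-forward pass; the last entry of each weight row is the bias."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in w_hid]
    o = sigmoid(sum(w * v for w, v in zip(w_out, h)) + w_out[-1])
    return h, o


def train_bpnn(samples, targets, n_hidden=3, alpha=0.5, epochs=500, seed=0):
    """Train a one-hidden-layer network; returns weights and per-epoch MSE."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    w_hid = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    mse_history = []
    for _ in range(epochs):
        squared_error = 0.0
        for x, t in zip(samples, targets):
            # Process 1: feed the values forward to generate the output.
            h, o = forward(x, w_hid, w_out)
            err = t - o
            squared_error += err * err
            # Process 2: compute the output error term, then propagate it
            # back to the hidden neurons (assumed standard delta rule).
            delta_o = err * o * (1.0 - o)
            delta_h = [delta_o * w_out[j] * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            # Weight and bias updates, scaled by the learning rate alpha.
            for j in range(n_hidden):
                w_out[j] += alpha * delta_o * h[j]
                for i in range(n_in):
                    w_hid[j][i] += alpha * delta_h[j] * x[i]
                w_hid[j][-1] += alpha * delta_h[j]
            w_out[-1] += alpha * delta_o
        mse_history.append(squared_error / len(samples))
    return w_hid, w_out, mse_history
```

Calling `train_bpnn` on a small labelled set returns the learned weights together with the MSE of each epoch, which is the quantity monitored when deciding when to stop training.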

2) Deep Belief Network
A DBN is a deep structured learning architecture introduced by Hinton in 2006 [20]. It is a generative model that can be interpreted as a stack of Restricted Boltzmann Machines (RBMs). The DBN is composed of multiple layers of stochastic latent variables and can be regarded as a special form of Bayesian probabilistic generative model. When the DBN is used for a classification task, it is treated as a MultiLayer Perceptron (MLP) by adding a logistic regression layer on top, but the DBN is more effective than an MLP. To improve the flexibility of the DBN, a novel model of Convolutional Deep Belief Networks (CDBNs) was introduced by Arel [21]. The first step in training a DBN is to learn a layer of features from the visible units, using the Contrastive Divergence (CD) algorithm [18]. The next step is to treat the activations of the previously trained features as visible units and to learn features in a second hidden layer. The whole DBN is trained once the learning of the last hidden layer is complete. According to the experiments, the optimal parameters of the DBN are depicted in Table II.
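The layer-by-layer training procedure described above can be sketched as follows. This is an illustrative sketch only, assuming one step of Contrastive Divergence (CD-1) with binary hidden samples and per-sample updates; the class `RBM`, the function `pretrain_dbn` and the values of `lr` and `epochs` are hypothetical and are not the settings of Table II.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class RBM:
    """Restricted Boltzmann Machine trained with one CD step (CD-1)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
        self.b_v = [0.0] * n_visible
        self.b_h = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_h[j] + sum(v[i] * self.w[i][j] for i in range(len(v))))
                for j in range(len(self.b_h))]

    def visible_probs(self, h):
        return [sigmoid(self.b_v[i] + sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i in range(len(self.b_v))]

    def cd1_update(self, v0, lr):
        # Positive phase: hidden probabilities given the data.
        h0 = self.hidden_probs(v0)
        h0_sample = [1.0 if self.rng.random() < p else 0.0 for p in h0]
        # Negative phase: one reconstruction step.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
            self.b_v[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.b_h[j] += lr * (h0[j] - h1[j])


def pretrain_dbn(data, layer_sizes, epochs=20, lr=0.1, seed=0):
    """Greedy layer-wise pre-training: each RBM is trained on the
    activations of the previously trained layer."""
    rng = random.Random(seed)
    rbms = []
    layer_input = data
    for n_hidden in layer_sizes:
        rbm = RBM(len(layer_input[0]), n_hidden, rng)
        for _ in range(epochs):
            for v in layer_input:
                rbm.cd1_update(v, lr)
        rbms.append(rbm)
        # Activations of this layer become the "visible" data of the next one.
        layer_input = [rbm.hidden_probs(v) for v in layer_input]
    return rbms, layer_input
```

The second value returned by `pretrain_dbn` holds the top-layer activations, which is the kind of feature vector that would then be passed to a supervised classifier.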

3) Stacked Auto-Encoder
A SAE is a Neural Network consisting of multiple layers of Auto-Encoders in which the outputs of each layer are wired to the inputs of the successive layer. The SAE enjoys the greater expressive power of any deep network. It is built as a deep network architecture by hierarchically training layers of Auto-Encoders.
Auto-Encoders were proposed in [22] [23] as methods for dimensionality reduction. In addition, they often capture a "hierarchical grouping" or a useful "partial-total decomposition" of the input. The first layer of a SAE tends to learn first-order features of the raw input (such as the edges of an image). The second layer tends to learn second-order features corresponding to patterns in the appearance of the first-order features. Upper layers of the SAE tend to learn higher-order features in order to form a good representation of the input.
The SAE model is characterized by 5 parameters. A series of tests was performed to determine their optimal values, as depicted in Table III. In brief, the SAE relies on unsupervised layer-wise learning that trains each Auto-Encoder independently, as in the case of the DBN; each neuron has an associated weight vector that is tuned to respond to a particular visual characteristic. Fig. 4 and 5 illustrate the representation of these facial characteristics for the BOSS and MIT databases respectively, based on the SAE algorithm. However, according to Fig. 2, 3, 4 and 5, the DBN learns more useful representations than the SAE model, which demonstrates the robustness of our proposed approach for face classification, especially on the BOSS database.
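The hierarchical training of Auto-Encoders described in this subsection can be sketched as below. This is a minimal sketch under assumptions: each Auto-Encoder uses sigmoid units, untied encoder and decoder weights, and per-sample gradient descent on the squared reconstruction error. The names `AutoEncoder` and `train_sae` and the default `epochs` and `lr` are hypothetical and do not correspond to the five parameters of Table III.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class AutoEncoder:
    """Single-hidden-layer Auto-Encoder with sigmoid encoder and decoder."""

    def __init__(self, n_in, n_hidden, rng):
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b_enc = [0.0] * n_hidden
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
        self.b_dec = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(b + sum(w * v for w, v in zip(row, x)))
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, h):
        return [sigmoid(b + sum(w * v for w, v in zip(row, h)))
                for row, b in zip(self.w_dec, self.b_dec)]

    def train_step(self, x, lr):
        """One gradient step on 0.5 * ||decode(encode(x)) - x||^2."""
        h = self.encode(x)
        y = self.decode(h)
        d_out = [(yi - xi) * yi * (1.0 - yi) for yi, xi in zip(y, x)]
        d_hid = [hj * (1.0 - hj) * sum(d_out[i] * self.w_dec[i][j] for i in range(len(x)))
                 for j, hj in enumerate(h)]
        for i, di in enumerate(d_out):
            for j, hj in enumerate(h):
                self.w_dec[i][j] -= lr * di * hj
            self.b_dec[i] -= lr * di
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate(x):
                self.w_enc[j][i] -= lr * dj * xi
            self.b_enc[j] -= lr * dj
        return sum((yi - xi) ** 2 for yi, xi in zip(y, x))


def train_sae(data, layer_sizes, epochs=100, lr=0.5, seed=0):
    """Train each Auto-Encoder on the codes produced by the previous one."""
    rng = random.Random(seed)
    layers, errors = [], []
    layer_input = data
    for n_hidden in layer_sizes:
        ae = AutoEncoder(len(layer_input[0]), n_hidden, rng)
        history = []
        for _ in range(epochs):
            total = sum(ae.train_step(x, lr) for x in layer_input)
            history.append(total / len(layer_input))
        layers.append(ae)
        errors.append(history)
        # The outputs of this layer are wired to the inputs of the next.
        layer_input = [ae.encode(x) for x in layer_input]
    return layers, layer_input, errors
```

Only the encoders are kept once training is over; the decoders serve solely to define the reconstruction objective of each layer.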

III. Proposed Approach
One of the important stages of a face classification system is to characterize every face image as a feature vector. Subsequently, many learning algorithms, for instance SVM [24] [25], can be applied to perform the classification task. Thus, the efficiency of a face classification algorithm depends primarily on the usefulness of the features. For the feature extraction step, many comparative studies have suggested various hand-crafted features such as LBP and the Scale Invariant Feature Transform (SIFT), but in our work we exploit a robust feature learning method to discriminate between faces and non-faces in order to ensure an efficient classification. Deep Learning models are adopted to replace the hand-crafted methods and overcome their limitations, since they enhance the features through a high-level representation of each face image.
The main objective of our work is to test and compare the effectiveness and the robustness of our proposed approach to face classification, which consists of adopting Deep Learning as a technique for feature extraction and classification based on its representative models, namely DBN, SAE and BPNN.
After the preprocessing step, which includes the normalization of our data, the Deep Learning models (DBN, SAE and BPNN) are applied to two facial databases, BOSS and MIT, in order to extract the high-level features. The feature vectors that result from these deep feature extraction operations are then used as inputs of the NN classifier, as presented in Fig. 6, 7 and 8, which describe the adopted procedure in detail.
Several metrics are commonly used to assess the classification performance of a system, for instance the accuracy (Acc) and the error (Er), which are defined by (6) and (7).
Training holds the key to an accurate solution, so the criterion used to stop training must be described more precisely. In general, it is known that a network with enough weights will always learn the training set better as the number of iterations increases [33]; for this reason, we stopped training at 5 epochs, when the error stopped decreasing.
The weights of the system are continually adjusted to reduce the difference between the output of the system and the desired response. This difference is referred to as the error and can be measured in different ways. The Training Error (TE) is the error obtained when the trained model is run back on the training data in order to update the weights. The MSE is the most common measurement; it computes the average squared difference between the desired output and each predicted output. Besides the MSE, there are many other kinds of errors, including the Root-Mean-Square Error (RMSE), the Sum of Squared Errors (SSE) and the Mean Absolute Percent Error (MAPE), which can be used to measure the error of a Neural Network [33] [34].
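The metrics named in this section can be written compactly as follows. The exact forms of (6) and (7) are not reproduced here; this sketch assumes the usual definitions, with accuracy as the fraction of correctly classified samples and error as its complement, and the usual definitions of MSE, RMSE, SSE and MAPE. All function names are illustrative.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def error_rate(y_true, y_pred):
    """Complement of the accuracy."""
    return 1.0 - accuracy(y_true, y_pred)


def sse(targets, outputs):
    """Sum of Squared Errors."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))


def mse(targets, outputs):
    """Mean Squared Error: average squared difference."""
    return sse(targets, outputs) / len(targets)


def rmse(targets, outputs):
    """Root-Mean-Square Error."""
    return math.sqrt(mse(targets, outputs))


def mape(targets, outputs):
    """Mean Absolute Percent Error; undefined when a target equals zero."""
    return 100.0 * sum(abs((t - o) / t) for t, o in zip(targets, outputs)) / len(targets)
```

Under these assumed definitions the accuracy and the error rate always sum to one, which is consistent with the classification rates and error rates reported in the results.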

A. Description of BOSS Database
BOSS is a new database of faces and non-faces. Currently, our database is private and can be delivered upon request. Most face images were captured in uncontrolled environments and situations, with variations in illumination, facial expression (neutral expression, anger, scream, sad, sleepy, surprised, wink, frontal smile, frontal smile with teeth, open / closed eyes), head pose, contrast, sharpness and occlusion. The majority of individuals are between 18-20 years old, but some older individuals are also present, with distinct appearances, hair styles and accessories, some of them wearing a scarf. The database was created to provide more diversity of lighting, age and ethnicity than currently available landmarked 2D face databases. All images were taken with a 26 ZOOM CMOS digital camera with full HD characteristics. The majority of images are frontal, nearly frontal or upright. Fig. 9 shows some images of people from the BOSS database. We detect the faces in our BOSS database by using the cascade detector of the Viola-Jones algorithm. All the faces are scaled to a size of 30*30 pixels. This database contains 9619 images, with 2431 training images (771 faces and 1660 non-faces) and 7188 test images (178 faces and 7010 non-faces). The face images are stored in JPG format. Fig. 10 presents some detected face images of the BOSS database.

B. Description of MIT Database
The proposed approach has been tested on the well-known MIT database. This database is publicly available [35]. It is a database of faces and non-faces that has been used extensively at the Center for Biological and Computational Learning at MIT. It has 6977 training images (2429 faces and 4548 non-faces) and 24045 test images (472 faces and 23573 non-faces). The face images are stored in PGM format. The size of each image is 19*19 pixels, with 256 gray levels. Some samples of the MIT database are presented in Fig. 11.
Fig. 11. Some images of MIT facial database.

C. Results and Discussion
The preprocessing of our data comprises the normalization of the training and testing sets, the standardization of all values, the randomization of data within batches to ensure class stratification, and data augmentation.
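The preprocessing steps listed above can be sketched as follows. This is only one possible reading of those steps, under assumptions: normalization is taken to mean scaling gray levels to [0, 1], standardization uses the mean and standard deviation of the training set, and stratification is obtained by dealing the shuffled indices of each class across the batches. Data augmentation is not covered, and all function names are hypothetical.

```python
import math
import random


def scale_pixels(image, max_value=255.0):
    """Normalize raw gray levels to the range [0, 1]."""
    return [p / max_value for p in image]


def fit_standardizer(train_rows):
    """Per-feature mean and standard deviation, computed on the training set."""
    n = len(train_rows)
    columns = list(zip(*train_rows))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, m in zip(columns, means):
        std = math.sqrt(sum((v - m) ** 2 for v in col) / n)
        stds.append(std if std > 0 else 1.0)  # guard against constant features
    return means, stds


def standardize(rows, means, stds):
    """Apply training-set statistics to any set (training or testing)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def stratified_batches(labels, batch_size, seed=0):
    """Return batches of sample indices with near-proportional class shares."""
    rng = random.Random(seed)
    n_batches = math.ceil(len(labels) / batch_size)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    batches = [[] for _ in range(n_batches)]
    slot = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of this class round-robin over the batches.
        for idx in indices:
            batches[slot % n_batches].append(idx)
            slot += 1
    for batch in batches:
        rng.shuffle(batch)  # randomize the order inside each batch
    return batches
```

Fitting the standardizer on the training set alone and reusing its statistics on the test set keeps test information out of the training pipeline.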
We typically train for several epochs, but we retain only 5 epochs, the point at which the validation error no longer decreases. This choice depends on the number of data samples and on the batch size.
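The stopping rule described above, which halts training once the validation error no longer decreases, can be expressed as a small helper. This is a sketch under assumptions: the function name `stopping_epoch` and the `patience` and `min_delta` parameters are hypothetical, and the error values in the usage example are made up for illustration, not measured results.

```python
def stopping_epoch(val_errors, patience=1, min_delta=0.0):
    """Return the 1-based epoch with the best validation error, stopping the
    scan once the error has failed to improve for `patience` epochs."""
    best_error = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, err in enumerate(val_errors, start=1):
        if err < best_error - min_delta:
            best_error = err
            best_epoch = epoch
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


# Illustrative (made-up) validation errors: improvement stops after epoch 5.
example_errors = [0.30, 0.20, 0.12, 0.08, 0.05, 0.05, 0.06]
print(stopping_epoch(example_errors))
```

A larger `patience` tolerates short plateaus before stopping, which matters when the validation error is noisy because of small batches.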
We have carried out several experiments on both databases, BOSS and MIT, to evaluate the impact of the high-level features on face classification and to prove the effectiveness of our proposed approach. We can observe that the performance of the three methods, (SAE,NN), (DBN,NN) and BPNN, differs slightly over the 5 epochs. For BOSS, the BPNN outperforms the combinations (SAE,NN) and (DBN,NN) during the 5 epochs, while (SAE,NN) and (DBN,NN) remain competitive. For the MIT database, the combination (DBN,NN) performs better than (SAE,NN) and BPNN during the 5 epochs in terms of MSE and TE.

D. Comparison of MIT and BOSS Facial Databases
In this section, we carry out a comparative study between the two facial databases, BOSS and MIT, based on the three Deep Learning models, DBN, SAE and BPNN. To better demonstrate the performance of the three models on BOSS and MIT, we used their classification error rate as the evaluation criterion. Table IV presents the results on BOSS. The results on MIT are reported in Table V, which illustrates the performance of BPNN, (DBN,NN) and (SAE,NN). The superiority of (DBN,NN) over BPNN and (SAE,NN) is clearly demonstrated, with an error rate of 1.96%.
In brief, the combination (DBN,NN) always achieves the best result in terms of classification error rate, using only two hidden layers.

E. Comparison with the State-of-the-Art Methods
This section provides a comparative study of our approach and several well-known, recently published techniques for face classification on both facial databases, BOSS and MIT. Referring to Table VI, the classification accuracy is good for the majority of algorithms. Our contribution (DBN,NN) produced the best classification accuracy compared to the other published methods, with classification rates of 98.86% and 98.04% on BOSS and MIT respectively. According to this result, our proposed approach yields an improvement over the state-of-the-art methods.

V. Conclusion
In this paper, we investigate Deep Learning models for the face classification task. We have progressively improved performance by investigating three different kinds of Deep Learning models, DBN, SAE and BPNN, using the sigmoid activation function. The Deep Learning algorithms are used to extract high-level features, which serve as inputs of an NN classifier for the purpose of discriminating between faces and non-faces. We performed an experimental evaluation on two facial databases, BOSS and MIT. The results indicate the robustness and the efficiency of our proposed approach (DBN,NN) for face classification, which obtains error rates of 1.14% and 1.96% on BOSS and MIT respectively.