Multilayer Feedforward Neural Network for Internet Traffic Classification



I. Introduction
Recently, internet traffic classification has been widely used to improve service quality in enterprise networks, develop new service packets, share bandwidth resources and perform traffic analysis effectively. There are various methods of internet traffic classification, such as the port based, payload based and flow statistical based approaches. Port based classification is very simple but produces inaccurate results because of the dynamic assignment of protocols [1], [2]. Payload based methods, which analyze packet content to identify the internet traffic, have been proposed by many researchers. The payload method is popular but has a few limitations, such as the legal constraints on inspecting packet content and the inability to handle encrypted data. To overcome these issues, a flow statistical based approach is used for IP traffic classification [3], [4]. These classification problems can be solved using machine learning approaches.
Machine learning consists of supervised learning and unsupervised learning. In unsupervised learning, class labels are unknown, and a common technique of this type is clustering, in which samples with similar features are grouped into clusters. Classification, on the other hand, can be performed using supervised learning. Supervised learning infers the classes from labeled training data, and test data are then classified using what was learned from the training data [5]. The most widely used data discriminators in the flow based method are given in [6] for the assessment of the classification task.
The paper is organized as follows: Section II presents the literature, Section III describes the proposed model, Section IV discusses the results and Section V presents the conclusion with future work.

II. Related Work
Many network applications run on the Internet and generate a large amount of traffic every day. Managing this huge and mixed traffic is a daunting task. Network traffic classification is becoming more and more important as the Internet develops.
The basic variants of machine learning approaches are supervised and unsupervised learning. The unsupervised method has only the desired inputs and no output labels. A common unsupervised method is clustering [7], [8]. The supervised method has both desired inputs and desired outputs and is applied to analyze the learning data. One of the important early works in this area is the study in [9], which classifies network traffic by applying the supervised Naïve Bayes method to flow statistical features. In [10], internet traffic is classified by analyzing packets with a capturing tool and determining 248 features. Later, traffic classification was done using a support vector machine with an accuracy of 90% on various types of internet traffic datasets [11]. Bayesian neural networks were applied to the Cambridge dataset and achieved class-wise accuracies ranging from 91% to 99.9% using 2441 flows for 7 classes [12]. Various classification algorithms are compared and evaluated in [13]: Naïve Bayes using both Naïve Bayes Discretization (NBD) and Kernel density estimation (NBK), the C4.5 decision tree, Bayesian Network (BayesNet) and Naïve Bayes Tree (NBTree). The experimental results show that C4.5 identifies network flows faster than the other four algorithms with respect to computational performance. Further, other studies reported using the machine learning algorithms Support Vector Machine (SVM) and Decision Tree. The results show that the decision tree is computationally efficient compared to SVM, as seen in [14]. Identification of Voice over Internet Protocol (VoIP) traffic using machine learning approaches such as C5.0, AdaBoost and Genetic Programming is presented in [15]. The results show that C5.0 performs better than the other two algorithms.
An Artificial Immune System (AIS) is proposed and used to classify malicious packets in an Intrusion Detection System [16]. Inspired by this approach, AIS based algorithms are proposed to classify internet traffic flows, and the performance of the proposed method is evaluated with and without a kernel function [17]. The size of the dataset used is 1000 flows. The accuracy achieved without a kernel function ranges between 81.8% and 91%. With a linear kernel, the accuracy achieved was 92.3%. Two other algorithms, Naïve Bayes and SVM, achieved accuracies of 82.2% and 44.10% respectively [18]. As stated in [19], the imbalanced nature of the dataset used for the experimentation affects the proposed algorithm for a few classes. A Data Gravitation based Classification (DGC) model is proposed in [20], and 12 UCI datasets were used for the experimentation. DGC achieves better accuracy in general but lower accuracy on imbalanced datasets. To address this problem, the Imbalanced DGC (IDGC) model is proposed, in which an Amplified Gravitation Coefficient (AGC) is used to compute the gravitation [21]. Based on the results of the proposed models, it can be concluded that most standard machine learning classification models are not very effective on imbalanced internet traffic data [22]. To enhance the performance of imbalanced classification, Under-sampling DGC is proposed in [23]. The proposed method generates a set of training groups for rebalancing. The newly generated training sets are selected and fed to the DGC model to get better classification accuracy. Later, a voting policy is used to obtain the classification model. As a result, it gives promising results on imbalanced datasets. The proposed method is applied to 22 less imbalanced and 22 highly imbalanced datasets.
Class Oriented Feature Selection (COFS) is proposed in [24] to improve the recall and precision for minority classes while maintaining high overall accuracy. Cost sensitive learning is studied for the multiclass imbalance problem in [25], where the misclassification cost with respect to the dependent class is investigated. A Weighted Cost Matrix (WCM) is proposed and compared with the Flow rate based Cost Matrix (FCM). The results obtained show better accuracy for the proposed method.
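The rebalancing idea described above (several balanced training groups followed by a vote) can be illustrated generically. The following is a minimal sketch of random under-sampling with majority voting, not the Under-sampling DGC method of [23] itself; the function names, the class labels and the choice of sampling every class down to the minority class size are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def balanced_groups(samples, labels, n_groups, seed=0):
    """Build n_groups training sets in which every class has the same size.

    Each class is randomly under-sampled to the size of the smallest class
    (an assumed rebalancing rule, chosen only for illustration).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    minority_size = min(len(items) for items in by_class.values())
    groups = []
    for _ in range(n_groups):
        group = []
        for label, items in by_class.items():
            group.extend((s, label) for s in rng.sample(items, minority_size))
        groups.append(group)
    return groups

def majority_vote(predictions):
    """Combine the predictions of several classifiers for one flow."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical toy data: 9 flows of one class and 3 of another.
samples = list(range(12))
labels = ["WWW"] * 9 + ["MAIL"] * 3
groups = balanced_groups(samples, labels, n_groups=4)
```

One classifier would then be trained per group, and `majority_vote` would merge their predictions for each test flow.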

A. Motivation
This work is carried out for two reasons: a) to build a multilayer neural network, and b) to propose a model which can handle the class imbalance problem. The authors of [26] proposed an Extreme Learning Machine (ELM) to classify internet traffic. ELM is a fast-learning neural network model which assigns its hidden layer weights randomly. It was proposed for training single hidden layer feedforward neural networks. In traditional single hidden layer feedforward neural networks, the weight and threshold values must be updated by gradient-based learning algorithms. This can both prolong the learning process and leave the error trapped at a local minimum. The momentum value can be changed to address the local minimum problem, but this change does not speed up the learning process. In ELM, the randomly assigned weights and thresholds of the single hidden layer are reported not to affect the performance of the network.
As the application of neural networks is increasing day by day, we apply a multilayer feedforward neural network to the problem of internet traffic classification. The proposed network is a step towards basic deep neural networks, and thus we use multiple layers in our architecture. The main reason for using a multilayer feedforward architecture is its capability of handling large datasets. A multilayer feedforward network can detect complex nonlinear relationships between dependent and independent variables. Further, it offers advantages such as adaptability in learning, nonlinearity, input-output mapping and robustness to noise. Combining machine learning with multilayer feedforward networks and other parameters should make it possible to construct a deep learning architecture in the near future.

III. Proposed Methodology
In this model, we propose a multilayer feedforward neural network for internet traffic classification. The main reason for using this network is its high learning capability on nonlinear data [27]. The attributes of the dataset are not all on the same scale, which makes classification difficult without normalization. Therefore, we first normalize the data before passing it to the proposed model. Normalization is done using eq. (1):

$$x' = \frac{x - \mu}{\sigma} \tag{1}$$

where x is the column vector (one feature), µ is the mean of x and σ is the standard deviation of x.
After normalization, each feature has zero mean and unit standard deviation, so most values fall in or near the range [-1, 1]. This helps us classify the data without bias towards features with large magnitudes.
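As an illustration of eq. (1), the following minimal sketch standardizes a single feature column using only the Python standard library. The function name, the sample values and the use of the population standard deviation are assumptions; the text does not say which estimator of σ was used.

```python
import statistics

def standardize(column):
    """Return (x - mean) / std for every value of one feature column."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population standard deviation (assumed)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0] * len(column)
    return [(x - mu) / sigma for x in column]

# Hypothetical values of one flow feature, e.g. packet sizes in bytes.
packet_sizes = [40.0, 60.0, 1500.0, 52.0, 576.0]
normalized = standardize(packet_sizes)
```

The normalized column has zero mean and unit standard deviation. Note that such scores are not strictly bounded, so an outlier like the 1500-byte value can land outside [-1, 1].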
The proposed methodology uses a multilayer feedforward neural network with six layers: one input layer, four hidden layers and one output layer. Finally, we use the softmax activation function for classification. Assuming the preprocessed features are encoded, we first extract the features and expand them to 1000 dimensions by decoding. After the model has learned a deep representation of the data, the representation is transformed back to lower dimensions by encoding, so that only the required features are kept. This process can be understood through the loss function. In this procedure, all the hidden layers can be viewed as feature transformation units.
The first layer, as mentioned earlier, is the input layer, whose number of neurons always equals the number of attributes in the dataset. The dataset consists of 248 attributes.
The next layer is the first hidden layer. Every neuron of the previous layer is connected to every neuron of this layer through the dot product with the weights (eq. (2)):

$$z = w \cdot x + b \tag{2}$$

where w is the weights of the current layer, x is the output of the previous layer (here, the input features) and b is the bias.
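The weighted sum of eq. (2) for a whole layer can be sketched as follows. This is a minimal pure-Python illustration with made-up numbers; the function name `dense` and the row-per-neuron layout of the weights are assumptions.

```python
def dense(x, weights, bias):
    """Compute z_j = sum_i weights[j][i] * x[i] + bias[j] for every neuron j."""
    return [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]

# Hypothetical layer with 2 inputs and 3 neurons.
x = [1.0, 2.0]
weights = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]
bias = [0.0, 1.0, -1.0]
z = dense(x, weights, bias)  # [-1.5, 3.0, 2.0]
```

Each row of `weights` holds the incoming weights of one neuron, so the output has one value per neuron of the current layer.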
Initially, the weights are initialized with Gaussian (normally distributed) values. This type of weight initialization is used across various neural network descendants such as the Convolutional Neural Network and the Recurrent Neural Network. The first hidden layer consists of 500 neurons. To achieve better performance, the number of neurons in a hidden layer of a multilayer perceptron should be about twice the number of neurons in the preceding layer, as suggested in [28]. Therefore, we double the number of input features and round it off to get 500 neurons. Further, we apply eq. (2) and pass the result to the activation function or kernel. The authors in [26] used a wavelet kernel as their activation function. The proposed model uses the Rectified Linear Unit (ReLU), shown in eq. (3), as the activation function, as it is good at capturing data with a moderate or large number of features [29]. It also helps in faster convergence and is by far simpler than the wavelet kernel.

$$f(x) = \max(0, x) \tag{3}$$

where x is the input to the ReLU function. Fig. 2 shows the graph of the ReLU activation function, which squashes every negative input to zero and leaves every positive input unchanged.
The second hidden layer consists of 1000 neurons, double the number in the previous layer, because further feature extraction from the data is still needed. In particular, this layer helps in understanding the data much more deeply. ReLU is again used as the activation function. The following layer is the same as the previous one: it also contains 1000 neurons and uses ReLU as the activation function.
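The behavior of the ReLU activation described above can be sketched in a few lines. The function names are illustrative only.

```python
def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def relu_layer(values):
    """Apply ReLU element-wise to the pre-activations of one layer."""
    return [relu(v) for v in values]

activations = relu_layer([-2.0, -0.5, 0.0, 0.5, 2.0])  # [0.0, 0.0, 0.0, 0.5, 2.0]
```

Because the function is a single comparison per value, it is cheap to evaluate, which is one reason it is simpler than a wavelet kernel.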
In the fourth hidden layer, we project the representation to a lower dimension by reducing the number of neurons, and hence this layer contains 500 neurons. It also uses ReLU as the activation function. The main objective of doing this is to extract only the relevant features.
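To make the expand-then-contract shape of the network concrete, the sketch below counts the trainable parameters for the layer sizes given in the text (248 inputs, hidden layers of 500, 1000, 1000 and 500 neurons, and the 10-neuron output layer) and runs a forward pass through a scaled-down copy of the same shape. The standard deviation of 1.0 for the Gaussian weights, the zero biases and the scaled-down sizes are assumptions made to keep the example small and fast.

```python
import random

LAYER_SIZES = [248, 500, 1000, 1000, 500, 10]

def parameter_count(sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(sizes, sizes[1:]))

def init_layers(sizes, rng):
    """One (weights, bias) pair per layer, with Gaussian-initialized weights."""
    return [
        ([[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

def forward(x, layers):
    """ReLU on every hidden layer; the last layer returns raw class scores."""
    for index, (weights, bias) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        is_output = index == len(layers) - 1
        x = z if is_output else [max(0.0, value) for value in z]
    return x

total_parameters = parameter_count(LAYER_SIZES)  # 2,132,010 for the sizes above

# Scaled-down copy of the same shape, used only to keep the demo fast.
rng = random.Random(0)
small_sizes = [8, 16, 32, 32, 16, 10]
features = [rng.gauss(0.0, 1.0) for _ in range(small_sizes[0])]
scores = forward(features, init_layers(small_sizes, rng))
```

The count shows that the two 1000-neuron layers account for most of the roughly 2.1 million parameters of the full-size network.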
The last layer is the output layer, which contains 10 neurons as there are 10 classes. After the output layer we have the softmax layer, which is used as the loss function and is given by:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{10} e^{z_j}}$$

where z_i is the score of the i-th output neuron. The softmax function is chosen among other loss functions because its outputs lie in the compact range [0, 1] and the sum of all the scores is equal to 1. Hence each output is a probabilistic score, and losses can be computed relative to one another.
The back propagation algorithm is used to propagate the error back through the network, updating the weights with the Adam optimizer [30], which automatically adapts and updates its parameters. The model is trained for 1000 epochs with a batch size of 2486. Increasing the number of epochs further leads to overfitting; therefore, the number of epochs was set to 1000.
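The properties of the softmax scores mentioned above (values in [0, 1] that sum to 1) can be checked with a small sketch. Pairing softmax with the cross-entropy loss, as done here, is an assumption; the text only states that the softmax layer is used as the loss function.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class (assumed loss)."""
    return -math.log(probabilities[true_class])

# Hypothetical raw scores of the 10 output neurons for one flow.
scores = [2.0, 0.5, -1.0, 0.0, 0.3, -0.7, 1.2, 0.1, -2.0, 0.4]
probabilities = softmax(scores)
loss = cross_entropy(probabilities, true_class=0)
```

The class with the largest raw score also receives the largest probability, so the predicted class is unchanged by the softmax.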
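The adaptive update performed by the Adam optimizer can be sketched for a single scalar weight. The hyperparameter defaults below are the commonly used ones for Adam, not values reported in this work, and the quadratic toy objective and its learning rate of 0.1 are purely illustrative.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t, lr=0.1)
```

In a real network the same update is applied to every weight and bias, with the gradients supplied by back propagation over each batch.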

A. Experimental Setup
The experiment was carried out on a 64-bit machine running Ubuntu 16.04 with 12 GB of RAM and 1 TB of secondary storage. The processor is an Intel Core i7-7700HQ CPU at 2.80 GHz × 8.

B. Dataset
The dataset used is the publicly available Cambridge dataset, which consists of 248 features and 10 classes. Two variants of the same dataset are used: the standard dataset, which is highly imbalanced, and a derived dataset, which is less imbalanced. In the standard dataset, flows are most biased towards the WWW class and least biased towards the multimedia class. Each dataset is divided in the ratio of 70% for training and 30% for testing. Table I shows the characteristics of the datasets used for experimentation.
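A 70%/30% division of this kind can be sketched as a seeded random split. Whether the split in this work was random or stratified by class is not stated, so the plain shuffle below, the function name and the seed are assumptions.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

flows = list(range(1000))  # stand-ins for 1000 hypothetical flow records
train, test = train_test_split(flows)
```

With strongly imbalanced classes, a stratified split that preserves the class proportions in both parts is a common alternative, since a plain shuffle can leave very few minority class flows in the test set.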

C. Results and Discussions
In this section, we first discuss the results obtained on dataset01, the highly imbalanced dataset. The main objective of this experiment is to overcome the problem of imbalanced datasets, which is one of the major challenges in this research area. The proposed model avoids biased results and gives better results.
As we can see from Table II, which contains the confusion matrix for the imbalanced dataset, almost all the classes have been classified correctly, which leads to an accuracy of 99.08%, around 3% greater than the accuracy achieved in [26]. We can also see that the number of misclassified samples of each class is far smaller than the number of correctly classified samples. Table III shows that we have compelling precision and recall values. The average precision is 0.87, which is quite high. The average recall is 0.91, which is also very high and indicates a low false-negative rate. The F1 score of 0.89 is a good result for dataset01, shown in Table I. The precision, recall and F1 score show that the model is successful in handling even highly imbalanced data.
Further, we discuss the results achieved using dataset02, the less imbalanced (derived) dataset. As we can see from Table IV, the confusion matrix for dataset02, almost all the classes have been classified correctly, which leads to an accuracy of 97.25%, again higher than that reported in [26]. Table V presents the precision, recall and F1 score. From the results, we can observe that the average precision is 0.95, which is quite high. The average recall of 0.96 is also very high and indicates a low false-negative rate. The F1 score of 0.96 is a good result for the less imbalanced dataset. Table VI shows that the ELM neural network with one hidden layer achieved an accuracy of 79.16% with the Sigmoid kernel, 92.11% with the Tangent Sigmoid kernel, 87.30% with the Radial basis kernel and 96.57% with the Wavelet kernel. The proposed model outperformed these with an accuracy of 99.08% for dataset01 and 97.25% for dataset02, which is more than the accuracy reported in [26].
The increase in accuracy is due to the architecture of the neural network model discussed in Section III. The model first extracts the features and projects them to higher dimensions. Later, once the model has learned a deep representation of the data, the representation is transformed to lower dimensions to keep only the required features. Hence the proposed classifier has successfully addressed the problem of internet traffic classification for both the less imbalanced and the highly imbalanced dataset. Even though the model is intended for highly imbalanced datasets, Tables IV, V and VI show that it also performs well on a less imbalanced dataset.
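The relation between a confusion matrix and the reported metrics can be sketched as follows. The 3-class matrix is hypothetical (it is not Table II), and macro-averaging over classes is an assumption, since the averaging scheme behind the reported averages is not stated.

```python
def per_class_metrics(confusion):
    """Precision, recall and F1 per class; rows are true classes, columns predictions."""
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results

def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)

# Hypothetical 3-class confusion matrix with one dominant class.
confusion = [
    [90, 5, 5],
    [2, 8, 0],
    [1, 0, 9],
]
metrics = per_class_metrics(confusion)
macro_precision = sum(p for p, _, _ in metrics) / len(metrics)
macro_recall = sum(r for _, r, _ in metrics) / len(metrics)
```

In this toy matrix the accuracy (about 0.89) is well above the macro-averaged precision (about 0.74), because the dominant class hides errors on the small classes. This is why precision, recall and F1 are reported alongside accuracy for imbalanced data. For reference, the harmonic mean of 0.87 and 0.91 is about 0.89.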

V. Conclusion
Internet traffic classification is becoming popular due to the growth in internet usage and utility applications. However, network management is a key challenge for service providers. Efficient network management helps to provide quality of service and to ensure the security and high computing performance of the network. Another issue with internet traffic is the handling of imbalanced datasets. Hence there is a need to address this problem, i.e., internet traffic data that is imbalanced in particular. To solve these issues, we proposed a multilayer feedforward neural network model with four hidden layers (mountain mirror network). The effective transformation of the dimensions of the features helps in accurate classification, thereby increasing the accuracy. In future, we can also try various kernels and compare the results achieved. Further, we intend to use a dimensionality reduction technique to reduce the high dimensional internet traffic data to lower dimensions. To test the scalability of the proposed method, we also plan to work on datasets of massive traffic in real networks.