Anomaly based Intrusion Detection using Modified Fuzzy Clustering

I. Introduction
Computer security has become an increasingly vital field in computer science in response to the proliferation of private sensitive information. The term "intrusion" refers to any unauthorized access that attempts to compromise the confidentiality, integrity and availability of information resources [1] [14] [32]. Traditional intrusion prevention techniques such as firewalls, access control and encryption have failed to fully protect systems from sophisticated attacks. As a result, the Intrusion Detection System (IDS) has become an indispensable component of computer security, used to detect such threats. In 1987, Denning [7] first proposed an intrusion detection model. Since then, many researchers have focused on developing efficient and accurate IDS models. Intrusion detection techniques fall into two types: misuse (or signature) based methods and anomaly based methods. Signature based methods detect only known attacks whose signatures are stored in a database, and so fail to detect unknown intrusions. Anomaly based methods, on the other hand, detect attacks as deviations from a profile of normal behaviour.
In the early days, intrusion detection was done using rule based approaches, where experts define a set of rules for normal and abnormal conditions. These systems work well for known attacks but fail to detect unknown attacks. In the late 1990s, researchers concentrated on developing automatic intrusion detection methods. Many researchers used data mining and machine learning algorithms to detect unknown attacks. Among the various intrusion detection techniques, fuzzy logic based methods play a very important role. The literature shows that clustering methods are widely used in intrusion detection systems. Jianliang et al. [13] developed an intrusion detection system using the K-means clustering algorithm. The experimentation was carried out on the standard KDD-99 dataset. Cluster to class mapping, No class and Class Dominance are the key problems in K-means clustering. To overcome these drawbacks, Bharti et al. [4] developed two variants of the traditional K-means algorithm. Ren et al. [23] developed a Fuzzy C-Means (FCM) algorithm to detect intrusions. The intrusion detection model was built by carrying out fuzzy partitioning and clustering of the data. The experimental results show that the algorithm can effectively separate normal and abnormal data. To overcome the cluster centre initialization and convergence problems of FCM, Wang et al. [29] proposed a hybrid algorithm for intrusion detection. This hybrid method combines FCM with Quantum-behaved Particle Swarm Optimization. The Particle Swarm Optimization algorithm is used to overcome the drawbacks of FCM and to achieve global optimization and fast convergence. Guorui et al. [10] developed a semi-supervised Fuzzy C-Means clustering algorithm for intrusion detection. This method overcomes the drawbacks of FCM, i.e., sensitivity to the initial values and convergence to local minima, by using a small amount of labelled data to improve the learning ability of Fuzzy C-Means.
Sampat and Sonawani [25] developed an intrusion detection system using Improved Dynamic Fuzzy C-Means (IDFCM) clustering. IDFCM is a variant of the traditional FCM which adaptively updates the cluster centres. The experimental results show that IDFCM gives a better detection accuracy rate than traditional FCM. Hameed et al. [11] developed a hybrid clustering algorithm for intrusion detection. This hybrid algorithm combines Modified Fuzzy Possibilistic C-Means (MFPCM) and symbolic fuzzy clustering. This method uses 30 features with optimal sensitivity and the highest discriminatory power. Ganapathy et al. [9] proposed an intrusion detection system based on Weighted FCM and an Immune Genetic Algorithm (GA). Weighted FCM is a modification of FCM which builds a system for more accurate attack detection. The Immune GA improves the performance and the probability of reaching the global optimum, and solves the high dimensionality problem. Khazaee and Rad [17] developed a novel method based on Fuzzy C-Means for improving intrusion detection performance. The experiments were conducted on the KDD Cup-99 dataset. Kumar et al. [19] developed a novel method of intrusion detection which involves fuzzy feature clustering. In this method, fuzzy feature clustering is used to reduce the dimensionality of system calls. Pandeeswari and Kumar [21] developed a hybrid detection system for the cloud environment. This hybrid model works in three steps. In the first step, fuzzy clustering is used to improve the learning capability of the ANN. In the second step, various ANN modules are trained according to their cluster values. In the last step, a fuzzy aggregation module is used to combine the results of the various ANNs.
Karthik and Nagappan [16] developed an intrusion detection system using Kernel Fuzzy C-Means and a Bayesian Neural Network. This hybrid model consists of two steps. In the first step, Fuzzy Bisector Kernel Fuzzy C-Means is used to obtain the cluster centers. In the second step, the centroids obtained in the previous step are used for the learning of the Bayesian network. The experiments were conducted on the standard KDD Cup 99 dataset. Rustam and Talita [24] proposed an intrusion detection algorithm based on Fuzzy Kernel C-Means (KFCM). Khazaee and Faez [18] developed a hybrid classification method for network intrusion detection. This hybrid model combines fuzzy clustering with a multilayer perceptron neural network. Here, the training samples are initially clustered using fuzzy clustering, and the inappropriate data are detected and moved to another dataset. The multilayer perceptron is then trained using the new labels. To classify outliers, a fuzzy ARTMAP neural network is employed. Surana [27] developed a hybrid algorithm which combines Fuzzy C-Means and neural networks for intrusion detection. This approach divides the training data into smaller groups using Fuzzy C-Means. Neural networks are then trained on these subsets. Finally, the individual neural network results are aggregated. Kumar and Harish [2] proposed the Robust Spatial Fuzzy C-Means algorithm, which considers spatial information and uses a kernel distance metric. Xie et al. [31] developed an intrusion detection system using a hybrid clustering method. This hybrid method is a combination of Fuzzy C-Means clustering, Average Information Entropy, Support Vector Machine and a Fuzzy Genetic algorithm.
It is evident from the above discussion that researchers have been attempting to come up with more efficient and robust intrusion detection techniques. Most of the existing methods are based on supervised or unsupervised learning approaches. Supervised learning requires a large volume of training samples and fails to detect unknown attacks. Unsupervised learning, on the other hand, has the advantage of detecting unknown attacks. However, unsupervised learning has the following main limitations: it has a higher False Alarm Rate (FAR), it fails to identify the specific type of attack, and it is severely affected by the curse of dimensionality. Further, most of the existing unsupervised methods use the Euclidean distance as the distance metric. Unfortunately, the Euclidean metric is very sensitive to noise, which degrades the system accuracy.
Against this backdrop, and to address the above problems, we present in this paper a network anomaly detection method based on fuzzy clustering. The proposed method consists of three steps: Pre-processing, Feature Selection and Clustering. To evaluate the proposed method, we conducted experiments on a standard intrusion detection dataset and compared the results with other variants of FCM.
In summary, the main contributions of this paper are as follows:
• A Modified Fuzzy Clustering method (RSKFCM) for anomaly detection is presented. The method can also identify specific types of attacks.
• To handle the curse of dimensionality, we employed Principal Component Analysis (PCA) as the feature selection method.
• To overcome the existing drawbacks of FCM methods, we considered neighborhood information, which in turn reduces the false alarm rate and increases the system accuracy.
• We used a Gaussian kernel as the distance measure to compute the distance between cluster centers and samples. The advantage of using a Gaussian kernel is that it reduces the effect of noise.
• We validated the proposed method on a standard intrusion dataset using four cluster validity functions, accuracy and false alarm rate. Further, we also compared our results with contemporary methods.
The rest of the paper is organized as follows: Section II presents details about the KDD Cup 99 dataset. Traditional Fuzzy C-Means is presented in Section III. Section IV presents the proposed method. The experimental setup, the dataset used for experimentation and the results are presented in Section V. Conclusions are drawn in Section VI.

II. KDD Cup 99 Dataset
Since 1999, many researchers have evaluated their intrusion detection models on the KDD Cup 99 dataset [26]. This dataset was originally created by the 1998 DARPA intrusion detection evaluation program. The dataset contains five million training samples and two million testing samples. Each sample has a set of 41 features derived from a connection and a label which specifies the connection as normal or attack. These 41 features can be categorized into 4 groups.
• Basic features: These features can be derived from packet headers without inspecting the payload.
• Content features: These features contain the information present in the payload of the original TCP packets and are extracted using domain knowledge.
• Time based Traffic features: These features examine the connections in the past two seconds that have the same service as the current connection.
• Host based Traffic features: These features examine the connections established in the past two seconds that have the same destination host as the current connection.
The dataset contains 21 different types of attacks which can be categorized into 4 types.
• Denial of Service (DoS): Attackers try to prevent legitimate users from utilizing the requested services/resources.
• Probe (PRB): Attackers attempt to gain information about the target host.
• Remote to Local (R2L): Attackers do not have an account on the victim computer, so they try to gain access to it.
• User to Root (U2R): An unauthorized user tries to gain local super user (root) access through the network.

III. Traditional Fuzzy C-Means
The Fuzzy C-Means (FCM) clustering algorithm, proposed by Bezdek [3], is based on traditional fuzzy set theory. FCM is an improvement of the K-means algorithm which groups the data points based on their membership values. In FCM, the membership of a data point depends on the similarity of the data point to a particular class relative to all other classes. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ data points to be partitioned into $c$ clusters. FCM minimizes the objective function

$$J = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, \lVert x_j - v_i \rVert^{2} \qquad (1)$$

where $u_{ij}$ is the degree of membership of $x_j$ in cluster $i$ (with $\sum_{i=1}^{c} u_{ij} = 1$ for every $j$), $v_i$ is the $i$-th cluster center, $\lVert \cdot \rVert$ is a distance metric and $m$ is a constant which controls the fuzziness of the resulting partition. Using the Lagrangian method, Bezdek derived two necessary conditions for minimizing the objective function $J$ as follows:

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{\lVert x_j - v_i \rVert}{\lVert x_j - v_k \rVert} \right)^{2/(m-1)}} \qquad (2)$$

$$v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} \, x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \qquad (3)$$

The clustering process begins by randomly choosing the $c$ cluster centers. The memberships are then calculated from the relative distance of each data point $x_j$ to each centroid $v_i$ using equation (2). Data points close to a centroid are assigned a high membership value, whereas data points far from it are assigned a low membership value. After computing the memberships of all the data points, the cluster centers are updated using equation (3). The clustering process stops when the successive difference of the objective function is less than the predefined threshold value ($\varepsilon$).
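The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `fcm`, the `init` and `seed` parameters, the iteration cap and the small floor on distances are all hypothetical choices made for the example.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-4, max_iter=100, init=None, seed=0):
    """Fuzzy C-Means sketch: returns (centers V, memberships U, objective J)."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Randomly choose c distinct samples as the initial cluster centers.
        init = X[rng.choice(len(X), size=c, replace=False)]
    V = np.array(init, dtype=float)
    prev_J = np.inf
    J = prev_J
    for _ in range(max_iter):
        # d2[i, j] = squared Euclidean distance between center v_i and sample x_j.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # assumed floor to avoid division by zero
        # Membership update, equation (2), written with squared distances.
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)
        # Objective function, equation (1).
        J = float(((U ** m) * d2).sum())
        # Center update, equation (3).
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # Stop when the successive difference of J falls below eps.
        if abs(prev_J - J) < eps:
            break
        prev_J = J
    return V, U, J
```

Each column of `U` sums to one, so a hard label for sample `j` can be read off as `U[:, j].argmax()`.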

IV. Proposed Method
Fuzzy clustering has become very important in intrusion detection, largely because of the uncertain nature of attacks. Motivated by this, in this paper we propose a fuzzy clustering based anomaly intrusion detection method. The proposed method consists of three steps: Pre-processing, Feature Selection and Clustering. Fig. 1 shows the block diagram of the proposed system.

A. Pre-processing
The pre-processing step makes the dataset suitable for use in the clustering step, and it has a great impact on intrusion detection efficiency. We performed two pre-processing operations: removing duplicate samples and filling in missing values. The dataset contains a large number of duplicate samples. With duplicates present, clustering takes more time and gives inefficient results, so we removed the duplicate samples from the dataset to achieve accurate results. To fill in the missing values, we first divide the whole dataset by class and compute the mean value of each feature within each class. Each missing value is then replaced with the mean value of the corresponding feature.
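The two operations can be illustrated with a short sketch. It assumes, purely for the example, that each sample is a `(features, label)` pair and that a missing value is represented by `None`; the function name `preprocess` is hypothetical.

```python
from collections import defaultdict

def preprocess(samples):
    """samples: list of (features, label); a missing feature value is None."""
    # Step 1: drop exact duplicate samples, keeping the first occurrence.
    seen, unique = set(), []
    for features, label in samples:
        key = (tuple(features), label)
        if key not in seen:
            seen.add(key)
            unique.append((list(features), label))

    # Step 2: accumulate per-class sums and counts of the observed values.
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for features, label in unique:
        for k, value in enumerate(features):
            if value is not None:
                sums[label][k] += value
                counts[label][k] += 1

    # Step 3: replace each missing value with the class-wise mean of its feature.
    for features, label in unique:
        for k, value in enumerate(features):
            if value is None and counts[label][k] > 0:
                features[k] = sums[label][k] / counts[label][k]
    return unique
```

A feature that is missing for every sample of a class is left as `None` here, since no class-wise mean exists for it.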

B. Feature Selection
In this paper, we employed Principal Component Analysis (PCA) [12][15] to select the most discriminative features. PCA is a widely used feature selection method based on a linear transformation which maps data from a high dimensional feature space to a lower dimensional one. The first principal component has the highest contribution to the variance in the original dataset, and each succeeding component accounts for as much of the remaining variance as possible. The primary advantages of PCA are that it is simple and non-parametric, and that it can preserve a large percentage of the total variance with only a few components. These characteristics motivated us to adopt PCA for feature selection.
Let us consider a set of training network samples. Next, we compute the covariance matrix, where $\Phi_i$ is the standard deviation, which is computed as indicated in (6).
In the next step, we employ the Robust Spatial Kernel FCM (RSKFCM) to cluster the reduced feature matrix.
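As an illustration of the reduction step, the following sketch applies textbook PCA: center the samples, form the sample covariance matrix, and project onto the eigenvectors with the largest eigenvalues. It is an assumption that the method here follows this standard formulation; the function name `pca_reduce` and its return values are hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    mean = X.mean(axis=0)
    centered = X - mean                          # deviation of each sample from the mean
    cov = centered.T @ centered / (len(X) - 1)   # sample covariance matrix
    # eigh is intended for symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]               # one principal direction per column
    explained = eigvals[order] / eigvals.sum()   # fraction of variance per component
    return centered @ components, components, explained
```

The `explained` ratios give a simple way to choose how many components to keep, for example the smallest number whose ratios add up to a chosen fraction of the total variance.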

C. Robust Spatial Fuzzy C-Means
Traditional Fuzzy C-Means (FCM) gives non-robust results mainly for two reasons: it does not utilize neighbourhood information, and it uses the Euclidean distance. To overcome these problems, Kumar and Harish [2] proposed a Robust Spatial Kernel FCM (RSKFCM) technique. RSKFCM incorporates spatial information into the conventional FCM membership function and uses a kernel distance metric. Experimental results reveal that RSKFCM gives better clustering results than other FCM variants. Inspired by the good performance reported in [2], we applied the RSKFCM method to detect network anomalies.
The main aim of RSKFCM is to minimize the objective function $J$ given in equation (10), where $\Phi$ is an implicit nonlinear map. To compute the neighbourhood membership function, we calculate the distance from each sample to every other sample and consider the $k$ nearest samples. The neighbourhood membership function is defined over $NK(x_j)$, which represents the array of the $k$ nearest samples of $x_j$.
This spatial function represents the probability that sample $x_j$ belongs to the $i$-th cluster. In the new membership function, the parameters $p$ and $q$ control the relative importance of the two functions.
The RSKFCM algorithm is carried out through an iterative optimization of the objective function shown in equation (10), with updates of the membership values and cluster centers. The cluster centers are updated using equation (14). The clustering process stops when the successive difference of the objective function is less than the predefined threshold value ($\varepsilon$). The individual stages of Robust Spatial Kernel Fuzzy C-Means (RSKFCM) are described in Algorithm 1.
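For orientation, the quantities named above can be written out in the form commonly used for kernel FCM and spatial FCM. This is a sketch under the assumption that RSKFCM follows those standard formulations; the kernel width $\sigma$ and the primed notation $u'_{ij}$ are symbols introduced here for illustration only.

```latex
% Gaussian kernel and the kernel-induced distance (note K(x, x) = 1)
K(x_j, v_i) = \exp\!\left( -\frac{\lVert x_j - v_i \rVert^{2}}{\sigma^{2}} \right),
\qquad
\lVert \Phi(x_j) - \Phi(v_i) \rVert^{2} = 2 \bigl( 1 - K(x_j, v_i) \bigr)

% Objective function in the kernel-induced feature space
J = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, \lVert \Phi(x_j) - \Phi(v_i) \rVert^{2}

% Kernel membership
u_{ij} = \frac{\bigl( 1 - K(x_j, v_i) \bigr)^{-1/(m-1)}}
              {\sum_{k=1}^{c} \bigl( 1 - K(x_j, v_k) \bigr)^{-1/(m-1)}}

% Neighbourhood membership over the k nearest samples NK(x_j)
h_{ij} = \sum_{x_k \in NK(x_j)} u_{ik}

% New membership weighted by p and q
u'_{ij} = \frac{u_{ij}^{p} \, h_{ij}^{q}}{\sum_{k=1}^{c} u_{kj}^{p} \, h_{kj}^{q}}

% Cluster center update
v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} \, K(x_j, v_i) \, x_j}
           {\sum_{j=1}^{n} u_{ij}^{m} \, K(x_j, v_i)}
```

With $p = 1$ and $q = 0$ the new membership reduces to the plain kernel membership, which shows how $q$ controls the influence of the neighbourhood.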

Algorithm 1: Robust Spatial Kernel Fuzzy C-Means (RSKFCM)
Input: Intrusion samples
Output: Labels
Initialize the cluster centers, ε and m
Repeat
{
Step 1: Compute the membership values of each sample against the cluster centers.
Step 2: Compute the new membership values using equation (11).
Step 3: Calculate the objective function J as in equation (16).
Step 4: Calculate the new cluster center values v_i according to the expression in (17).
}
Until the successive difference of J is less than ε
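The four steps of Algorithm 1 can be sketched as follows. This is a hedged illustration rather than the reference implementation: it assumes a Gaussian kernel, the standard kernel FCM membership and center updates, and a neighbourhood term that sums the memberships of the k nearest samples. The function name `rskfcm` and the defaults for `k`, `p`, `q`, `sigma`, `init` and `seed` are hypothetical.

```python
import numpy as np

def rskfcm(X, c, k=3, m=2.0, p=1.0, q=1.0, sigma=1.0,
           eps=1e-4, max_iter=100, init=None, seed=0):
    """Spatial kernel FCM sketch: returns (centers V, memberships U, labels)."""
    rng = np.random.default_rng(seed)
    if init is None:
        init = X[rng.choice(len(X), size=c, replace=False)]
    V = np.array(init, dtype=float)
    # k nearest samples of every sample; column 0 of the sort is the sample
    # itself, assuming duplicates were removed during pre-processing.
    pair_d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    neighbours = np.argsort(pair_d2, axis=1)[:, 1:k + 1]
    prev_J = np.inf
    for _ in range(max_iter):
        # Gaussian kernel K[i, j] between center v_i and sample x_j.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        K = np.exp(-d2 / sigma ** 2)
        dist = np.maximum(1.0 - K, 1e-12)  # kernel distance, up to a factor 2
        # Step 1: kernel membership of each sample against each center.
        inv = dist ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)
        # Step 2: combine with the neighbourhood membership h_ij.
        H = U[:, neighbours].sum(axis=2)
        W = (U ** p) * (H ** q)
        U = W / W.sum(axis=0, keepdims=True)
        # Step 3: objective function in the kernel-induced space.
        J = float(2.0 * ((U ** m) * dist).sum())
        # Step 4: kernel-weighted center update.
        G = (U ** m) * K
        V = (G @ X) / G.sum(axis=1, keepdims=True)
        if abs(prev_J - J) < eps:
            break
        prev_J = J
    return V, U, U.argmax(axis=0)
```

The pairwise distance matrix needs memory quadratic in the number of samples, so for a dataset of realistic size a nearest-neighbour index would be a more practical way to build `neighbours`.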

V. Experimental Results
To evaluate the proposed method, we conducted experiments on the EDA dataset [5], a variant of the original KDD dataset. Since smurf, neptune and normal traffic represent 99.3% of the total samples in the original KDD dataset, the EDA dataset contains only these three classes.
To handle categorical features, this dataset adds one dummy variable per category to the original KDD dataset, which in turn increases the number of features to 122.
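The dummy-variable encoding can be sketched as below. The feature count is consistent with the usual description of the KDD data, assuming its three categorical features (protocol type, service and flag) take 3, 70 and 11 distinct values: the 38 numeric features plus 3 + 70 + 11 dummy variables give 122. The function name `one_hot_encode` and the toy rows in the usage note are illustrative only.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 dummy variable per category."""
    # Collect the sorted set of categories observed in every categorical column.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        out = []
        for col, value in enumerate(row):
            if col in categories:
                # One dummy variable per category, set to 1 for the matching one.
                out.extend(1.0 if value == cat else 0.0 for cat in categories[col])
            else:
                out.append(float(value))
        encoded.append(out)
    return encoded, categories
```

For example, a row such as `[0, "tcp", "http", "SF", 181]` encoded with `categorical_columns=[1, 2, 3]` keeps its two numeric values and gains one dummy variable for every protocol, service and flag value observed in the data.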
The performance of the proposed method is evaluated using four cluster validity indices, accuracy and false positive rate. For all the algorithms in the comparison, we set the fuzzy coefficient m to the widely used value 2. All the cluster centers are initialized randomly. We set the stopping criterion to ε = 0.0001. We implemented and simulated all the algorithms in MATLAB 2013.

A. Evaluation using Cluster Validity Indices
In this section, we evaluate the performance of the proposed method using four cluster validity indices. These indices help to validate whether a clustering method accurately represents the structure of the dataset. A wide variety of cluster validity indices has been proposed in the literature. In this paper, we use four widely used cluster validity functions, namely the Partition Coefficient ($V_{pc}$), the Partition Entropy ($V_{pe}$), the Fukuyama-Sugeno function ($V_{fs}$) and the Xie-Beni function ($V_{xb}$). Bezdek [20] [22] proposed the Partition Coefficient ($V_{pc}$) and the Partition Entropy ($V_{pe}$), which use only the membership values to evaluate cluster validity, as indicated in (18) and (19). The value of $V_{pc}$ varies between $1/c$ and 1. The Fukuyama-Sugeno function $V_{fs}$ uses both the membership information and the input data. The smaller the value of $V_{fs}$, the better the clustering result.
The Xie-Beni function ($V_{xb}$), which was initially proposed by Xie and Beni in [30] and modified by Pal and Bezdek in [18], is defined as indicated in equation (21). In $V_{xb}$, the numerator indicates the compactness of the fuzzy partition and the denominator indicates the strength of the separation between clusters. The best clustering result is achieved when $V_{xb}$ is minimal.
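The four indices can be computed from the data, the cluster centers and the membership matrix. The sketch below uses their commonly cited definitions, which is an assumption about the exact form of equations (18) to (21); in particular, the Fukuyama-Sugeno term measures each center against the mean of the cluster centers. The function name `validity_indices` is hypothetical.

```python
import numpy as np

def validity_indices(X, V, U, m=2.0):
    """X: (n, d) samples, V: (c, d) centers, U: (c, n) memberships."""
    n = X.shape[0]
    # d2[i, j] = squared distance between center v_i and sample x_j.
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    # Partition coefficient: mean squared membership (larger is better).
    v_pc = float((U ** 2).sum() / n)
    # Partition entropy (smaller is better); memberships are floored for the log.
    v_pe = float(-(U * np.log(np.maximum(U, 1e-12))).sum() / n)
    # Fukuyama-Sugeno: compactness minus separation from the mean center.
    spread = ((V - V.mean(axis=0)) ** 2).sum(axis=1)
    v_fs = float(((U ** m) * (d2 - spread[:, None])).sum())
    # Xie-Beni: compactness over n times the minimum center separation.
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    min_sep = sep[~np.eye(len(V), dtype=bool)].min()
    v_xb = float(((U ** m) * d2).sum() / (n * min_sep))
    return {"V_pc": v_pc, "V_pe": v_pe, "V_fs": v_fs, "V_xb": v_xb}
```

For a crisp partition, where every membership is 0 or 1, the partition coefficient reaches its maximum of 1 and the partition entropy its minimum of 0.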
To evaluate the performance, we compared our proposed algorithm with traditional Fuzzy C-Means (FCM), Kernel FCM (KFCM) and Spatial FCM (SFCM) methods. Table I presents the performance comparison of the proposed method.

B. Performance Comparison with State-of-the-art Methods
We compared our proposed method with six unsupervised anomaly detection methods: K-Means [28], Improved K-Means [28], K-Medoids [28], Expectation Maximization [28], Fuzzy C-Means [6] and Fuzzy Rough Clustering [6]. We used accuracy and false positive rate as the evaluation metrics. Fig. 2 shows the comparison of the proposed method with the other methods in terms of accuracy. Fig. 3 shows the comparison in terms of false positive rate.
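The two metrics can be computed from true and predicted labels as in the sketch below. It assumes a binary view in which every label other than the normal class counts as an attack; the function name and the `normal_label` parameter are hypothetical.

```python
def accuracy_and_fpr(y_true, y_pred, normal_label="normal"):
    """Return (accuracy, false positive rate) for attack-versus-normal detection."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        is_attack = truth != normal_label
        flagged = pred != normal_label
        if is_attack and flagged:
            tp += 1      # attack correctly flagged
        elif is_attack:
            fn += 1      # attack missed
        elif flagged:
            fp += 1      # normal traffic raised as an alarm
        else:
            tn += 1      # normal traffic correctly passed
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # False positive rate: share of normal samples that raised an alarm.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr
```

Because clustering produces cluster indices rather than class names, each cluster first has to be mapped to a class, for instance by its majority label, before these metrics can be applied.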

C. Discussion
From Table I, Fig. 2 and Fig. 3, we can observe that our proposed method outperforms all the other methods. As mentioned earlier, our proposed method uses neighborhood information and a kernel distance metric. The other methods in the comparison cluster a given sample based only on its membership value or distance value, whereas our proposed method considers the neighborhood membership value along with the membership value of that sample. This neighborhood information increases the accuracy of the proposed method. In addition, the other methods use the Euclidean distance, which is very sensitive to noise, whereas our proposed method uses a Gaussian kernel distance metric. This reduces the effect of noise and in turn increases the accuracy.
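The noise argument can be made concrete with a small numerical sketch. With a Gaussian kernel, the kernel-induced distance is bounded above by 2, so a far-away noisy sample cannot dominate the objective the way it does under the squared Euclidean distance. The kernel width `sigma` and the sample values in the usage note are assumptions chosen for illustration.

```python
import math

def squared_euclidean(x, v):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, v))

def kernel_distance(x, v, sigma=1.0):
    """Gaussian kernel-induced distance 2 * (1 - K(x, v)); never exceeds 2."""
    return 2.0 * (1.0 - math.exp(-squared_euclidean(x, v) / sigma ** 2))
```

For a center at `(0, 0)`, a nearby sample `(0.5, 0.5)` and a noisy sample `(50, 50)`, the squared Euclidean distances differ by a factor of ten thousand, while the kernel distances are both below the bound of 2.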

VI. Conclusion
The existing computer security technologies fail to prevent threats completely. As a result, the Intrusion Detection System (IDS) has become an important component of network security. An IDS reduces the manpower needed for monitoring and increases the detection efficiency. In this paper, we presented a Fuzzy C-Means based intrusion detection system. Principal Component Analysis is employed to select the most discriminative features. A Robust Spatial Kernel FCM is then used to cluster the network samples. To evaluate the efficiency of the proposed method, we conducted experiments on a standard dataset and compared the results with variants of traditional Fuzzy C-Means and with other clustering methods. The results show that the proposed method outperforms the other methods. The advantage of the proposed clustering method is that it considers the neighbourhood membership value and uses a kernel distance metric, which increases the clustering accuracy and reduces the effect of noise. However, the performance of the RSKFCM algorithm depends on the cluster center initialization. In future work, an evolutionary algorithm can be used to initialize the cluster centers.