N-grams Based Supervised Machine Learning Model for Mobile Agent Platform Protection against Unknown Malicious Mobile Agents

For many years, detecting unknown malicious mobile agents before they invade the Mobile Agent Platform (MAP) has been an active and challenging research topic. The ever-growing threat of malicious agents calls for automated detection techniques. In this context, machine learning (ML) methods are acknowledged to be more effective than Signature-based and Behavior-based detection methods. The prime contribution of this paper is therefore the detection of unknown malicious mobile agents using n-gram features and a supervised ML approach, which has not been done so far in the sphere of Mobile Agent System (MAS) security. To carry out the study, n-grams with n ranging from 3 to 9 are extracted from a dataset containing 40 malicious and 40 non-malicious mobile agents. Classification is then performed using different classifiers. A nested 5-fold cross validation scheme is employed to avoid bias in the selection of the optimal classifier parameters. The observations of extensive experiments demonstrate that the

the agent or discard it; and to decide the trust level, privileges, resources and services that should be granted to the agent if it is permitted. Path History contains the identities of the current platform as well as the next platform in the itinerary. Cao et al. [17] proposed using agent path history information for role activation and permission activation. The roles activated for an agent are filtered by path patterns, whereas the permissions for roles are finely tuned by a set of host patches. Furthermore, Idrissi et al. [34,36] proposed an authentication process based on Diffie-Hellman Key Exchange integrated with the DSA digital signature to prevent the vulnerabilities that arise from the lack of authentication, which makes it resistant to the Man in the Middle attack; they also proposed another mobile agent platform security technique based on Elliptic Curve Cryptography (ECC) and dynamic role assignment using a Role Based Access Control (RBAC) policy.
Venkatesan et al. [18] proposed the Malicious Identification Police (MIP), which uses an Attack Identification Scanner (AIS) to scan the incoming agent byte code and diagnose any maliciousness in it. In the Policy Based MIP proposed by Venkatesan et al. [19], the privileges of an agent are checked in addition to the AIS scan [18], to determine whether the agent attempts to do more than the privileges granted to it. Otherwise, the Intelligent AIS (IAIS) decides on its own to start the lexical analyzer, which turns the agent byte code into tokens and identifies the non-matching tokens by comparing them with the tokens present in the Knowledge Base (KB). Afterwards, the unknown tokens are executed and tested in an isolated environment to check for malicious intentions, and the KB of malicious codes is updated with any newly diagnosed malicious code. To reduce the resulting waiting time of the agent, Venkatesan et al. [20] further introduced agent clones to handle multiple incoming agents simultaneously. Additionally, a pipelining concept was introduced by separating the operations, i.e. the tasks of scanning, pattern extraction and unknown code detection are performed by different agents, which ultimately reduces the time complexity. Clearly, many researchers have worked extensively in the field of MAP security. However, detecting unknown malicious mobile agents before they invade the MAP is still a challenge and a concern, owing to the growth of malicious agents in recent years.
Nowadays, malicious code detection techniques employ one of two approaches: Signature-based or Behavior-based. Signature-based methods involve the identification of distinctive tokens in the binary code [5], whereas Behavior-based methods rely on rules created by experts that define the malicious or non-malicious behavior of code [6]. While very precise, signature-based methods are unable to diagnose previously unknown malicious codes, and behavior-based methods can only detect the presence of malicious content after the code has been executed [7]. In response to the need for a method that detects unknown malicious code, machine learning (classification) algorithms have been successfully employed in recent years, an approach strongly inspired by the Text Categorization problem [8]-[9], [23]-[25].
In this paper, an attempt has been made to detect unknown malicious mobile agents using Machine Learning algorithms, which represents a novel contribution in the field of MAP security according to the survey done by the authors. This attempt addresses several facets of the detection challenge: mobile agent representation, classification and performance evaluation. The present work also aims to achieve a very high classification accuracy rate while keeping false negatives low (i.e. misclassifying a malicious agent as non-malicious). There are various representations of executable files: "Portable Executable (PE)", "Byte Sequence n-grams", and "plain-text string features" [9]. In this paper, the n-gram representation of the agent executable is used as the feature set for the classification process, since an extensive n-gram analysis is also one of the major focuses of this paper. N-grams are overlapping substrings obtained in a sliding-window fashion [10].
The extracted n-gram features are then fed into four commonly used classification algorithms (Naive Bayesian, SMO, IBK and J48 Decision Tree), using the WEKA tool [21], to discriminate between the two categories of agents (malicious mobile agent and non-malicious mobile agent). Extensive experiments are performed on a collection of 80 files, half of which are malicious. The experimental results are evaluated using standard performance evaluation measures such as "Sensitivity Rate", "Specificity Rate", "Positive Predictive Value", "Negative Predictive Value", "F-score", "Receiver Operating Characteristics - Area Under Curve", "Miss Rate", "Fall-out" and "Accuracy Rate", while employing the 5-fold nested cross validation scheme.

A. Dataset Used
To the best of the authors' knowledge, there is no standard dataset available for the detection of malicious mobile agents. Therefore, the benchmark dataset of malicious files known as the CSDMC2010 API sequence corpus, containing Windows API/System-Call trace files, is selected for the purpose of classification. The dataset contains 388 files, comprising 320 malware traces and 68 benign traces (considered as non-malicious in this paper). For the training dataset, only 40 malicious files and 40 non-malicious files are selected by random sampling (an equal number of malicious and non-malicious files is used in order to avoid the class-imbalance problem). This standard dataset is suitable for the proposed approach since agent byte code can be viewed as a sequence of agent API function calls. This assumption is based on previous studies that extract API call sequences from byte codes [31], [32].

B. Performance Evaluation Measures
To evaluate how well the classifiers detect malicious mobile agents, it is necessary to identify appropriate performance metrics. The measures derived from the Confusion Matrix (Figure 1) and applied to classifier evaluation are described in Table I [26]. The confusion matrix records the correct and incorrect classification outcomes predicted by the classifier, compared with the actual classification outcomes. The measures other than Accuracy Rate and Misclassification Rate are used to determine whether the present framework performs well for the classification of malicious mobile agents, non-malicious mobile agents, or both.
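As an illustration of how such measures follow from the four cells of a confusion matrix, the sketch below uses the standard textbook definitions, which are assumed (not confirmed) to match Table I. The function name `confusion_metrics` and the example counts are hypothetical; malicious agents are treated as the positive class.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard confusion-matrix measures (assumed to correspond to Table I).

    tp: malicious agents classified as malicious
    fn: malicious agents classified as non-malicious (misses)
    fp: non-malicious agents classified as malicious (false alarms)
    tn: non-malicious agents classified as non-malicious
    Assumes every denominator below is non-zero.
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)                  # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_score": 2 * ppv * sensitivity / (ppv + sensitivity),
        "miss_rate": fn / (fn + tp),      # false negative rate
        "fall_out": fp / (fp + tn),       # false positive rate
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical counts for a balanced set of 40 malicious and 40 non-malicious files.
m = confusion_metrics(tp=39, fn=1, fp=2, tn=38)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note that the miss rate is always one minus the sensitivity, and the fall-out is one minus the specificity, so the two pairs carry the same information from opposite viewpoints.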

C. Methodology
The major objectives of the present methodology are as follows:
• To automate the detection of malicious mobile agents before they invade the Mobile Agent Platform.
• To evaluate the performance of n-gram representation of mobile agent.
• To use Machine Learning algorithms for the task of unknown malicious mobile agent detection.
• To scrutinize the performance of various classifiers for classifying the mobile agents.
The methodology used in this paper is shown in Figure 2. It mainly consists of two consecutive steps: n-gram feature extraction of mobile agent and classification. These steps are described in detail in subsequent sub-sections.

Mobile Agent Representation using Byte n-grams - Data Preparation
A standard n-gram analysis is used to extract features from the malicious and non-malicious files. This is a purely machine-learning-based method that also draws on Natural Language Processing (NLP) [30]. The n-grams are extracted in a sliding-window fashion, where a window of fixed length (n) slides one byte at a time. In general, n-grams are all substrings of length "n" of a larger string [24]. In the present context, byte n-grams are viewed as API call based features. Much recent research has recognized the importance of n-gram based methods in malware detection, since this feature extraction technique is simple and easy to implement. Each n-gram is analogous to a word or a term of a text document in the Text Categorization problem. For instance, there are eight 3-grams in the text "abc_dabc_e": "abc", "bc_", "c_d", "_da", "dab", "abc", "bc_" and "c_e". To prepare the data, the unique n-grams are identified in all the mobile agent files and merged together. In the above example, there are only six distinct n-grams, i.e. "abc", "bc_", "c_d", "_da", "dab" and "c_e". The n-gram extraction procedure is repeated for different values of n. To limit the experiments in the present study, n is varied from 3 to 9 only, because the number of unique n-gram features increases as n increases. The number of distinct n-grams extracted from the dataset files is 1403, 2236, 3074, 4055, 5137, 6445 and 7727 for 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams and 9-grams respectively.
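The sliding-window extraction described above can be sketched in a few lines. The function names `extract_ngrams` and `build_vocabulary` are hypothetical, and the code works on text strings for readability; the same slicing applies to `bytes` objects.

```python
def extract_ngrams(data, n):
    """Return all overlapping n-grams of `data`, sliding one position at a time."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_vocabulary(files, n):
    """Merge the distinct n-grams of several files, keeping first-seen order."""
    vocabulary = {}
    for content in files:
        for gram in extract_ngrams(content, n):
            vocabulary.setdefault(gram, None)
    return list(vocabulary)


text = "abc_dabc_e"
grams = extract_ngrams(text, 3)
distinct = build_vocabulary([text], 3)

print(len(grams), grams)        # all 3-grams, including repeats
print(len(distinct), distinct)  # distinct 3-grams only
```

A string of length L yields L - n + 1 overlapping n-grams, which is why the ten-character example produces eight 3-grams. Repeating the call for n = 3, ..., 9 gives one vocabulary per value of n.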

Classification
Since an unknown mobile agent can be classified as either malicious or non-malicious, the task is treated as Binary Classification. Standard, commonly used classification algorithms are applied: Naïve Bayesian [22], Instance based Learner [22], Sequential Minimal Optimization [27]-[29], and J48 Decision Tree [22]. These classification algorithms differ in performance across domains. In this paper, the best-fitting algorithm for the dataset is identified by experimentation, as shown in the subsequent section.

i) IBK
IBK is a WEKA implementation of k-Nearest Neighbor (k-NN). In general, nearest neighbor classifiers compare a given test tuple with the training tuples that are most similar to it. The training tuples are characterized using n features, so each tuple represents a point in an n-dimensional feature space. When an unknown tuple is given as input, a k-NN classifier searches the feature space for the k training tuples closest to the unknown tuple [22].
Closeness is defined in terms of a distance metric such as Chebyshev distance, Manhattan distance or Euclidean distance. The unknown tuple is labeled with the most common class among its k nearest neighbors. The value of k is usually an odd number to avoid tied votes; however, the value of k must be chosen carefully. A smaller value of k makes the result more sensitive to noise, whereas a larger value of k makes the classification computationally expensive. The pseudo code of IBK is shown in Algorithm 1.

Algorithm 1. IBK classification algorithm
Algorithm IBK(k, X, Y, x)
//Input: k - an odd integer (number of nearest neighbors), X - training data consisting of n tuples, Y - class labels of X, x - unknown tuple
//Output: class label of x
1. for i ← 1 to n do
2.   compute distance(X_i, x)
3. end for
4. sort the distances in ascending order
5. select the first k points from the sorted list (these are the k nearest training tuples to the unknown tuple)
6. return the class label that belongs to the majority of the k selected tuples
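The steps of Algorithm 1 can be sketched as follows. This is a minimal illustration rather than WEKA's IBk implementation; the Euclidean metric, the function names and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def ibk_classify(k, X, Y, x, distance=euclidean):
    """Label x with the majority class of its k nearest training tuples."""
    # Steps 1-3: distance from every training tuple to the unknown tuple.
    scored = [(distance(xi, x), yi) for xi, yi in zip(X, Y)]
    # Step 4: sort by distance, ascending.
    scored.sort(key=lambda pair: pair[0])
    # Step 5: keep the labels of the k closest tuples.
    nearest_labels = [label for _, label in scored[:k]]
    # Step 6: majority vote.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature training data.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
Y = ["non-malicious"] * 3 + ["malicious"] * 3

print(ibk_classify(3, X, Y, (0.5, 0.5)))
print(ibk_classify(3, X, Y, (5.5, 5.5)))
```

Passing a different `distance` function (for example a Manhattan or Chebyshev distance) changes the notion of closeness without touching the voting logic.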

ii) Naïve Bayesian
This classifier is so called because it relies on Bayes' theorem. It is called "Naive" because it assumes that the attributes (or features) are independent of one another given the class, which is known as "class conditional independence" [22]. The classifier takes as input an unknown tuple, for which the class label is not known, and returns the class label with the maximum posterior probability. This classifier is particularly suited to high-dimensional feature spaces. The pseudo code of Naïve Bayesian is shown in Algorithm 2.
[...] (2)
and, in Equation (2), $x_k$ refers to the value of attribute $A_k$ for tuple X
3. end for
4. return class label $C_i$ with maximum $P(C_i \mid X)$
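A minimal sketch of a Naïve Bayesian classifier for categorical features (for example, presence or absence of an n-gram) is given below. It is not WEKA's implementation: the add-one (Laplace) smoothing, the use of log-probabilities, the function names and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(X, Y):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(Y)
    n_features = len(X[0])
    # value_counts[c][k][v] = number of class-c tuples whose k-th feature equals v
    value_counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    for row, label in zip(X, Y):
        for k, v in enumerate(row):
            value_counts[label][k][v] += 1
    # Number of distinct values seen for each feature (used by the smoothing).
    n_values = [len({row[k] for row in X}) for k in range(n_features)]
    return class_counts, value_counts, n_values


def predict_nb(model, x):
    """Return the class maximizing log P(C) + sum_k log P(x_k | C)."""
    class_counts, value_counts, n_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_c in class_counts.items():
        score = math.log(n_c / total)  # class prior
        for k, v in enumerate(x):
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score += math.log((value_counts[label][k][v] + 1) / (n_c + n_values[k]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical binary features: 1 if a given n-gram occurs in the file, else 0.
X = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
Y = ["malicious"] * 3 + ["non-malicious"] * 3
model = train_nb(X, Y)

print(predict_nb(model, (1, 1, 0)))
print(predict_nb(model, (0, 0, 1)))
```

Summing logarithms instead of multiplying probabilities keeps the computation numerically stable when there are thousands of features, as is the case for the larger n-gram vocabularies.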

iii) J48 Decision Tree
A decision tree classifies tuples according to a set of tree-structured if-then rules. Each internal node represents a test on an attribute (or a feature), each branch represents an outcome of the test, and each leaf node holds a class label. J48 is a WEKA implementation of the C4.5 decision tree. It consists of two steps: decision tree induction and tree pruning [22]. The pseudo code of the decision tree algorithm is shown in Algorithm 3.
[...]
3.   return N as a leaf node labeled with class C and terminate.
4. if A is empty, then
5.   return N as a leaf node labeled with the most common class in D (Majority Voting)
6. apply Gain Ratio feature selection method to find the best criterion 'a' ∈ A
7. label N with 'a'
8. for each value j of 'a' do
9.   grow a branch from N with condition a=j
10.   let Dj be the set of tuples in D with a=j
11.   if Dj is empty then
12.     add a leaf node labeled with most common class in D to N
13.   else add the node returned by decisionTree(Dj, A-a) to N
14. end for
15. return N

iv) SMO
SMO is a WEKA implementation of the Support Vector Machine (SVM). SVM originated from statistical learning theory, with the objective of solving the problem of interest without solving a more difficult problem as an intermediate step [27]. Statistical learning theory offers a framework that helps to choose the hyperplane space such that it closely represents the underlying function in the target space [28].
SVM is designed to avoid over-fitting to the training dataset.
Given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, N$, where $x_i \in R^n$ and $y \in \{1, -1\}^N$, SVM finds the solution of an optimization problem as shown in Algorithm 4 [29]. The training tuples $x_i$ are mapped into a higher (possibly infinite) dimensional space by the function $\phi$. In other words, $\phi$ transforms data from the input space to the feature space, and the kernel function $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ computes inner products in that space. There are four basic kernels: Linear, Polynomial, Radial Basis Function (RBF) and Sigmoid.

Algorithm 4. SMO Optimization Problem
Optimization Problem: Given the constraints in Equation (3):
$$y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \ldots, N \tag{3}$$
minimize the error function in Equation (4):
$$\frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \tag{4}$$
where
C > 0 is the penalty parameter of the error term, also called the Capacity Constant. C must be chosen carefully in order to avoid over-fitting: the larger the C, the more the error is penalized.
w is the vector of coefficients.
b is a constant.
$\xi_i$ represents the parameters for handling non-separable data (inputs).
i labels the N training cases.
$x_i$ represents the independent variables.
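To make the role of C and the slack parameters concrete, the sketch below evaluates the standard soft-margin SVM objective, (1/2) w·w + C Σ ξ_i, for a given candidate (w, b). It assumes the standard primal formulation with a linear kernel (φ(x) = x), and it only evaluates the objective; it does not perform SMO training. Function names and data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y * (w.x + b) >= 1 - xi (linear kernel assumed)."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def svm_objective(w, b, X, Y, C):
    """Soft-margin objective: 0.5 * w.w + C * sum of slacks."""
    total_slack = sum(slack(w, b, x, y) for x, y in zip(X, Y))
    return 0.5 * dot(w, w) + C * total_slack


# Hypothetical data: the third point lies exactly on the decision boundary.
X = [(2.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
Y = [1, -1, 1]
w, b = (1.0, 0.0), -1.0

print([slack(w, b, x, y) for x, y in zip(X, Y)])
print(svm_objective(w, b, X, Y, C=2.0))
```

With these values only the third point needs slack, so doubling C doubles the penalty paid for that single margin violation while leaving the (1/2) w·w term unchanged.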

III. Results and Discussions
The classification of mobile agents into two categories based on their n-gram features has been performed on the 80 agent files of the API call sequence dataset. The parameters of each classification algorithm (NB, SMO, IBK and J48) are tuned extensively to optimize its performance, such as "value of k", "distance measure" and "nearest neighbor search algorithm" in IBK; "pruning" and "confidence factor" in the J48 decision tree; and "complexity parameter" and "kernel" in SMO. The nested five-fold cross validation scheme is used to obtain unbiased evaluation results [33]. In nested five-fold cross validation, the data is randomly divided into five disjoint folds. Four folds are used for tuning the classifier parameters (using a cross validation scheme) and the tuned classifier is then validated on the left-out fold. This procedure is repeated five times, each time with a different left-out fold. This nesting of cross validation loops avoids the so-called resubstitution bias [33]. Additionally, the standard measures, namely Sensitivity, Specificity, ROC, PPV, NPV, FNR, FPR, F-measure and Accuracy, are used to evaluate the performance, and the results of all iterations are averaged to obtain the final outcome. The results show that the performance of the present work depends strongly on the choice of classifier.
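The nested scheme described above can be sketched as two loops: an outer loop that holds out one fold for validation, and an inner loop that tunes a parameter using only the remaining four folds. The sketch below is an illustration, not the WEKA workflow used in the experiments. It tunes only the k of a simple k-NN classifier, uses five inner folds and unstratified random splits, and runs on synthetic data; all of these are assumptions.

```python
import random
from collections import Counter


def knn_predict(k, X, Y, x):
    """Majority label among the k training tuples closest to x (squared Euclidean)."""
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
    return Counter(Y[i] for i in order[:k]).most_common(1)[0][0]


def make_folds(indices, n_folds, rng):
    """Randomly split a list of indices into n_folds disjoint folds."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def accuracy(k, train, test, X, Y):
    """Accuracy on `test` of a k-NN classifier built from the `train` indices."""
    X_train = [X[i] for i in train]
    Y_train = [Y[i] for i in train]
    hits = sum(knn_predict(k, X_train, Y_train, X[i]) == Y[i] for i in test)
    return hits / len(test)


def nested_cv(X, Y, k_grid=(1, 3, 5), n_folds=5, seed=0):
    """Average outer-fold accuracy, with k tuned by an inner cross validation."""
    rng = random.Random(seed)
    outer = make_folds(list(range(len(X))), n_folds, rng)
    scores = []
    for f, held_out in enumerate(outer):
        train = [i for g, fold in enumerate(outer) if g != f for i in fold]
        inner = make_folds(train, n_folds, rng)

        def inner_score(k):
            # Mean validation accuracy of k over the inner folds only.
            total = 0.0
            for h, val in enumerate(inner):
                fit = [i for g, fold in enumerate(inner) if g != h for i in fold]
                total += accuracy(k, fit, val, X, Y)
            return total / n_folds

        best_k = max(k_grid, key=inner_score)   # tuned without seeing held_out
        scores.append(accuracy(best_k, train, held_out, X, Y))
    return sum(scores) / len(scores)


# Synthetic, well-separated data: 20 tuples per class.
X_demo = [(i * 0.1, 0.0) for i in range(20)] + [(10 + i * 0.1, 0.0) for i in range(20)]
Y_demo = [0] * 20 + [1] * 20
print(nested_cv(X_demo, Y_demo))
```

The key property is that the held-out fold never influences the choice of `best_k`, so the averaged outer score is not inflated by parameter tuning.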
The results of the various classifiers (NB, SMO, IBK and J48) for different values of n for n-grams, i.e. n=3 to 9, have been investigated and are presented in Table II. To support nested cross validation, the WEKA tool has been used to adjust the classifier settings repeatedly in order to obtain the results for suitable parameter values. The results demonstrate that the IBK, J48, SMO and NB classifiers give maximum accuracy rates of 96.25%, 97.50%, 95.00% and 93.75% respectively (as shown in Figure 3), with miss rates of 2.50%, 2.50%, 5.00% and 7.50% for trigrams, 9-grams, 5-grams to 9-grams and 7-grams respectively (as depicted in Figure 4), in distinguishing malicious files from non-malicious ones. The value of the Area under the ROC curve is more than 0.92 for each classifier. IBK gives a maximum sensitivity of 97.50% and a specificity of 95.00% for 3-gram features. J48 gives a maximum sensitivity of 97.50% as well as a specificity of 97.50% for 9-gram features, as shown in Figure 5 and Figure 6. Moreover, the highest values of PPV and NPV (97.50% each) belong to the J48 classifier for 9-grams. This means that IBK and the J48 decision tree are the best at distinguishing actual malicious agents (positives) from actual non-malicious agents (negatives), using 3-gram and 9-gram features respectively. The performance of the other classification algorithms reduces as the number of features increases.
Figure 5 shows that the sensitivity rate increases up to n=7 using the NB classifier and then decreases. Using SMO, the sensitivity rate increases up to n=4 and then remains the same as n increases further. Using J48 and IBK classifiers, sensitivity rate is minimum for n=3, which decreases and then again increases with increase in value of n. Figure 6 indicates that the minimum specificity rate (85.00%) is obtained using the NB classifier for 3-grams, and that it increases as n increases. The SMO and J48 classifiers provide a minimum specificity rate of 92.50%. From 5-grams to 9-grams, specificity is constant with the SMO classifier, whereas from 6-grams to 8-grams, the specificity of IBK increases with n.

IV. Conclusions and Future Scope
This paper examines the suitability of machine learning algorithms for classifying a mobile agent as either malicious or non-malicious in a Mobile Agent Environment, using a specific dataset. In particular, n-grams are used as features during the classification process. The classification algorithms used are Naïve Bayesian, Sequential Minimal Optimization, Instance Based Learner, and J48 Decision Tree. Compared to the other classification algorithms, the J48 decision tree algorithm gives a higher accuracy rate of 97.50% with a miss rate of 2.50% for 9-grams, whereas IBK gives a higher accuracy rate of 96.25% with a miss rate of 2.50% for 3-grams. In other words, out of all the classifiers, J48 gives its best result with 7727 features whereas IBK does so with 1403 features, and J48 therefore incurs a higher computational cost due to the larger feature space. For this reason, IBK is declared to be the best classifier. These encouraging results support the use of the present approach for MAP protection.
In the near future, the work can be extended by using additional classifiers and higher values of n for the n-gram features in order to evaluate the performance of the classification task. Moreover, a large number of n-gram features burdens the classification process; therefore, feature selection methods can be applied in the future. The work can also be repeated on different datasets, since the classifiers may give different results on different datasets.