Automatic Irony Detection using Feature Fusion and Ensemble Classifier



I. Introduction
In the era of the internet, micro-blogging sites like Twitter, Facebook, and review forums allow users to express their sentiments or opinions [56]. The sentiment or opinion in micro-blogs may relate to a product, event or political debate in the form of text, images and video clips, where text plays an essential role in expressing opinions. Micro-blogging textual information is rich and progressively expanding in volume, with a variety of information ranging from products to political events [1]. This textual information plays a vital role in determining the sentiment of the population. An enormous amount of textual information provides valuable insight to governments, business organizations and individual decision makers [2]. The manual summarization of micro-blog textual information is time consuming. Hence, the automatic summarization of subjective information is essential to determine the polarity of the population [3]. The automatic text polarity identification process is known as Sentiment Analysis (SA) or Opinion Mining. SA aims to classify a given text into positive, negative or neutral polarity [4] [53]. There are many challenges related to SA which need to be addressed and resolved. Some of the challenges are: (a) language utterances, (b) punctuation marks used to express sentiments, (c) shortened forms of words (mainly in micro-blogs), (d) sarcasm/irony present in a text snippet, and many more. Sarcasm/irony detection in text is one of the major challenges in sentiment analysis.
Sarcasm has been studied across disciplines by sociologists [5], psychologists [6], linguists [7] and computer scientists [8] for different types of text: tweets, product reviews, internet dialogs, etc. [19]. Over time, humans have developed the ability to recognize sarcastic/ironic intent in utterances from childhood through social interaction [10]. Sarcasm is portrayed as ironic, intended to insult, mock or amuse. However, irony or sarcasm is a complicated mode of communication, which is informally connected with the expression of feelings, attitudes and emotions [11]. Sarcasm is closely related to irony [16]. Irony shifts the polarity of an apparently positive/negative utterance into its opposite [12]. Human intervention to recognize irony is extremely laborious and time consuming. For this reason, researchers aim to develop automatic systems to recognize the ironic utterances present in text.
Understanding ironic utterances from both a semantic and a grammatical stance is another task of Natural Language Processing [13]. Irony detection techniques are roughly categorized into machine learning and lexicon based approaches [14]. A lexicon based approach uses a dictionary/corpus with statistical and semantic features to detect ironic utterances in a given text. On the other hand, a machine learning approach uses text features to classify ironic utterances using supervised/unsupervised techniques based on labeled or unlabeled text. Both approaches perform well in detecting ironic utterances present in sentences [47].
In irony detection, feature extraction and selection play a vital role in determining the ironic utterances present in sentences. The features are extracted based on linguistic and content based approaches. The linguistic approach is a broad term covering the extraction of textual features such as lexical, hyperbole and pragmatic features. The lexical approach uses text properties such as unigrams, bigrams, n-grams, etc. for detecting irony in text [42]. In a lexical approach, a dictionary or corpus of vocabulary is used to identify irony present in a text snippet [15]. Similarly, hyperbole is another key feature often used in irony detection from textual data. A hyperbolic text contains interjections (wow, aha, etc.), punctuation marks (question marks and exclamation marks), quotes (' ', " ") and intensifiers (nouns, adverbs, adjectives), which are used to detect irony in tweets [43]. The pragmatic features include symbolic or figurative texts such as happy, sad, laughing and crying emoticons expressed in the sentences [15]. Researchers [17] [47] [51] [52] used various linguistic features to detect ironic utterances in short texts. However, identifying appropriate patterns to detect ironic utterances remains an open challenge.
In addition to the wide range of linguistic features, many researchers [9] [22] [46] [55] studied the content based approach, i.e., the presence or absence of terms/features in reviews. In the content based approach, the number of features plays a vital role in accurate classification. High dimensionality and sparsity are among the major challenges faced during the classification task. To mitigate the curse of dimensionality, many researchers [48] [49] [50] used feature selection methods to select discriminative features from a high dimensional feature space. Conventional feature selection methods such as Chi-square (χ²), Information Gain (IG) and Mutual Information (MI) are used for this purpose. However, the selected feature subset may have features which convey similar information [26]. Along this line, we propose a two stage feature subset selection using conventional feature selection methods and a clustering method to select the most discriminative features from a high dimensional feature space. On the other hand, linguistic features are extracted to detect sarcastic utterances in short text. Five groups of linguistic features are extracted, viz.: Rating Feature, Word Feature, Acronym Feature, Symbol Feature and Emoticon Feature. Further, the features are fused to capture various dimensions of the characteristics of a review. The fused features are classified using various classifiers such as Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF) and Decision Tree (DT). To enhance the performance of the classifiers, we ensemble SVM, LR, RF and DT using a weighted majority voting schema.
The main contributions of the paper are as follows:
• Five types of features are extracted using the linguistic approach, viz.: Rating, Word, Acronym, Symbol and Emoticon Features.
• A two stage feature subset selection is used to select the most discriminative features.
• Features are fused (linguistic and content based feature subsets) to capture various dimensions of the characteristics of a review.
• A weighted majority voting schema is used to ensemble the decisions of the individual classifiers.
The remainder of the paper is structured as follows: Section II depicts related works on irony detection. Section III presents the methodology of the proposed work. Experimentation and related results along with discussion are presented in Section IV. Finally, the work is concluded with future research directions in Section V.

II. Related Work
In recent decades, prominent research has been carried out by various researchers [17]-[21] [27] on the automatic detection of irony in micro-blogs and review platforms such as Twitter, product reviews and movie reviews. A brief survey on automatic sarcasm detection by Joshi et al. [16] described various datasets, approaches, trends and issues in sarcasm detection. Some of the related works in the literature are reviewed based on supervised, semi-supervised and rule based approaches. Similarly, Wicana et al. [13] described sarcasm detection from the machine-learning perspective. Their survey explored supervised, unsupervised, rule based and hybrid approaches to process data. Dave and Desai [14] examined various lexicon based and machine learning techniques for sarcasm detection on textual data. Their comprehensive survey highlights the use of hybrid techniques, i.e., the use of both lexicon based and machine learning techniques together for sarcasm classification.
Ravi and Ravi [17] proposed a framework to automatically detect satire, sarcasm and irony found in news and customer reviews. The framework extracts linguistic, semantic, psychological and unigram features. Various feature selection techniques are used to select the relevant feature subset from the unigram features. The extracted and selected feature subsets are fused and classified using Support Vector Machine (SVM) with various kernels, Logistic Regression (LR), Random Forest (RF), Naive Bayes (NB), Multilayer Perceptron (MLP), etc. Similarly, Buschmeier et al. [18] described the impact of features in a classification approach to detect irony in product reviews [20]. The method uses 29 special features such as positive/negative imbalance of reviews, hyperbole, positive/negative words with punctuation, quotes, etc., along with bag-of-words features (21,773 features). The various feature sets are compared across different classifiers such as Linear SVM, LR, Decision Tree (DT), RF and NB. Filatova [19] identifies the sentiment shifts in a sarcastic product review dataset [20]. The method demonstrated sentiment flow shifts (from negative to positive and likewise) using bi-gram features along with 8 classification features (very negative-positive, very negative-very positive, negative-positive, negative-very positive and likewise). Justo et al. [21] proposed a method to detect sarcasm and nastiness in the social web. Various features such as mechanical turk, statistical cues, linguistic information, semantic information using Linguistic Inquiry Word Count (LIWC) and the n-gram distribution of Part-of-Speech (POS) tags are extracted. The feature subset is selected using the chi-square (χ²) feature selection method. The binary classification was performed using rule-based and NB classifiers.
In the literature, many researchers proposed various feature extraction techniques to classify ironic reviews. The existing methods use NLP and machine learning approaches to extract various patterns to classify ironic content in reviews. However, some of the observations made from the literature are as follows: (a) the number of features is too large, (b) feature extraction relies on a POS tagger, and (c) searching each word in a sentiment dictionary is clumsy. Hence, the proposed research develops a new approach to address these issues through (a) feature extraction (use of linguistic and content based features), (b) feature selection methods to select discriminative features in the content based approach and (c) fusion of both feature types to provide useful insights into ironic content. The content based feature subset is selected using a two stage feature selection method. In the first stage, conventional feature selection methods such as Chi-square (χ²), Information Gain (IG) and Mutual Information (MI) are used to select relevant feature subsets from a high dimensional feature space. The selected feature subset may have features which convey similar information. For this reason, the second stage groups the features exhibiting similar information and selects a representative feature from each group. A clustering algorithm is used to group the features that carry similar information. In this work, the k-means clustering algorithm is used so that the members of each group are as similar (close) as possible to one another. The feature nearest to each cluster center is considered the representative feature among the features within the cluster.
On the other hand, the linguistic features are extracted and categorized into five groups: Rating, Word, Acronym, Symbol and Emoticon Features. Overall, twenty special features are extracted using linguistic based feature extraction and categorized into these groups. The special features represent the frequency of occurrence of each feature in a review. Further, the special features and the content based feature subset are fused and evaluated using various classifiers such as Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF) and Decision Tree (DT) on the product review dataset [20]. Furthermore, we construct an ensemble of the four classifiers (SVM, LR, RF and DT) based on a combination rule, the weighted majority voting scheme, to enhance the classification performance.

III. Methodology
The proposed approach is a hybrid feature fusion method, which integrates linguistic features and content based text features. It is used to classify product reviews into ironic or non-ironic content. The general architecture of the proposed feature fusion approach is outlined in Fig. 1. As depicted in Fig. 1, the proposed approach comprises three phases: (i) feature extraction and selection, (ii) feature fusion and (iii) ensemble classification. The details of the product review dataset are presented in Section IV (A) and the phases are described in this section. In order to identify ironic customer reviews, we developed a feature fusion of linguistic and content based features. The fused features are classified using individual and ensemble classifiers built from Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT) and Random Forest (RF) classifiers.

A. Feature Extraction and Selection for Irony Detection
In the classification of irony in product reviews, feature extraction and selection play a vital role. Usually, ironic utterances express the opposite of the intended content. To extract the ironic utterances present in reviews, linguistic based feature extraction is used. On the other hand, text features present in reviews yield promising results in the areas of text classification [28] and sentiment analysis [25]. Hence, content based feature extraction and selection are also used to detect ironic utterances in text.

B. Linguistic Feature Extraction
In this work, features are extracted using a linguistic approach (hereafter, special features) to identify ironic utterances in text. To extract special features from sentences, we use syntactic information such as interjections, pragmatic cues, intensifiers and many more. The special features are grouped into five sets of features. Table I provides an overview of the groups of features extracted using the linguistic approach.
The ironic features are motivated by the observation in [22] that irony always expresses the opposite of its literal content. The rating feature set groups the star rating (i.e., 1* to 5*) of the reviews and the imbalance feature.
The imbalance between the star rating and the overall polarity of the words in the review is considered. The work in [18] assumes there is an imbalance when the star rating (i.e., 4* or 5*) marks a review as positive but the polarity of the review text is negative. Similarly, an imbalance exists when the star rating (1* or 2*) marks a review as negative but the polarity of the review text is positive. This mismatch between the star rating and the polarity of the text is considered the imbalance feature. The polarity of the review text is determined based on the dictionary of [23], which consists of 6,800 words with positive and negative polarity. In general, users tend to exaggerate their sentiment by quoting certain words in sequence or between symbols. The word feature set consists of the features related to the polarity of words present in the reviews. The feature hyperbole [24] captures the exaggeration present in sentences and is extracted based on the appearance of three consecutive positive or negative words in a row. The feature quotes considers two consecutive intensifiers such as nouns, adverbs and adjectives, which have positive or negative polarity, in quotation marks [18]. In linguistics, ellipsis refers to the omission of words rather than repeating them unnecessarily, and it is represented as three consecutive dots ("..."). In this work, the feature ellipsis is considered as any positive or negative word ending with an ellipsis. The feature punctuation considers a positive or negative word with a punctuation mark such as a question mark or an exclamation mark. The feature interjection indicates the occurrence of terms such as "wow", "ah", "aha" and many more in the sentences.
The use of language is constantly changing across space and across social groups. The usage of acronyms in micro-blog text has grown enormously over time. The feature acronym for laughter (lol, lawl, luls and many more) captures short forms of laughing. The onomatopoeia feature mimics the verbal expression of laughter, such as "haha", "mu-ba", "hehe", "hihi" and many more in the sentence. The feature acronym for grin (*g*, *gg* and many more) depicts the expression of smiling broadly. These features are grouped into the acronym feature set to describe the emotional jargon present in reviews.
Usually, users tend to highlight their emotion by making it more intense through exclamation marks, question marks, ellipses and combinations of these symbols. The symbol feature set groups these features, which intensify the sentiment present in the review text. The exclamation mark ("!") expresses strong emotion in a review. Similarly, the question mark ("?") represents uncertainty about something in the sentence. Hence, exclamation mark and question mark symbols are considered as features. The ellipsis only feature indicates the situation in which words are left out of a sentence but the sentence can still be understood from its context. Three consecutive dots ("...") are counted as the ellipsis feature. Further, the combination of ellipsis and punctuation (an ellipsis followed by multiple exclamation marks or by a combination of exclamation and question marks) and the combination of question mark and exclamation mark are considered as further features.
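The imbalance and hyperbole features described above can be illustrated with a minimal sketch. The function names (`imbalance`, `hyperbole`) and the tiny word lists are hypothetical stand-ins: the paper relies on the roughly 6,800-word dictionary of [23], and its exact polarity rule is not given, so a simple positive-minus-negative word count is assumed here.

```python
import re

# Illustrative stand-ins for the polarity dictionary of [23] (assumption).
POSITIVE = {"great", "good", "love", "best", "perfect", "amazing"}
NEGATIVE = {"bad", "terrible", "worst", "awful", "broken", "useless"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def polarity(tokens):
    """Return +1, -1 or 0 from the balance of positive and negative words."""
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return (score > 0) - (score < 0)

def imbalance(stars, text):
    """Return 1 if the star rating and the text polarity disagree, else 0."""
    pol = polarity(tokenize(text))
    if stars >= 4 and pol < 0:
        return 1
    if stars <= 2 and pol > 0:
        return 1
    return 0

def hyperbole(text):
    """Count windows of three consecutive words that are all positive
    or all negative."""
    tokens = tokenize(text)
    count = 0
    for i in range(len(tokens) - 2):
        window = tokens[i:i + 3]
        if all(t in POSITIVE for t in window) or all(t in NEGATIVE for t in window):
            count += 1
    return count
```

For example, a 5* review whose text is dominated by negative words yields an imbalance value of 1, which is the kind of rating/text mismatch the rating feature set is meant to capture.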
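As a rough illustration of the symbol feature set, the counts can be obtained with regular expressions. This is only a sketch: the feature names in the dictionary and the exact patterns are assumptions, since the paper does not give its matching rules.

```python
import re

def symbol_features(text):
    """Count occurrences of each symbol feature in one review."""
    return {
        "exclamation": text.count("!"),
        "question": text.count("?"),
        # three consecutive dots
        "ellipsis": len(re.findall(r"\.{3}", text)),
        # an ellipsis immediately followed by one or more ! or ? marks
        "ellipsis_punct": len(re.findall(r"\.{3}[!?]+", text)),
        # a run that mixes both marks, such as "?!" or "!?"
        "question_exclamation": len(re.findall(r"(?:\?+!|!+\?)[!?]*", text)),
    }
```

Because the special features are frequencies of occurrence, each entry is a count rather than a presence flag.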

C. Content based Feature Extraction
In addition to special features, content based features are extracted. To extract content based features, the raw data are preprocessed by removing non-informative and trivial information. The review text contains digits, punctuation, HTML tags and stop words, which occur often and do not contribute to the analysis [57]. The preprocessed text is represented using unigram features with the term frequency (tf) schema. The work in [32] suggests that unigrams with term frequency (tf) perform well on sentiment analysis for micro-blogging data. Hence, we use unigrams with the term frequency (tf) schema to represent review text in machine understandable form. The extracted features form a high dimensional feature space, which needs to be reduced to a low dimensional feature space by applying feature selection methods. The aim of feature selection is to select relevant and non-redundant features from the high dimensional feature space. In this work, two stages of Feature Subset Selection (FSS) are used to select the most discriminative features from the high dimensional feature space.
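A minimal sketch of the unigram term frequency representation follows. The stop word list and the helper names `preprocess` and `build_dtm` are illustrative assumptions, not the preprocessing pipeline actually used in the paper.

```python
import re
from collections import Counter

# Illustrative stop word list only (assumption).
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "of", "to"}

def preprocess(text):
    """Strip HTML tags, digits and punctuation, lowercase, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_dtm(reviews):
    """Return (vocabulary, DTM), where DTM[i][j] is the term frequency
    of vocabulary term j in review i."""
    counts = [Counter(preprocess(r)) for r in reviews]
    vocab = sorted(set().union(*counts))
    dtm = [[c[term] for term in vocab] for c in counts]
    return vocab, dtm
```

Each row of the resulting matrix describes one review, so m reviews and n distinct unigrams give an m × n matrix.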

D. Feature Subset Selection (FSS)
The feature subset selection consists of two stages. In the first stage, conventional feature selection methods are used to select relevant feature subsets from a high dimensional feature space. The selected feature subset may have features which convey similar information. For this reason, the features exhibiting similar information are grouped, and the second stage of feature selection selects the representative feature from each group formed from the first stage feature subset.
In the first stage, conventional feature selection methods such as Chi-square (χ²), Information Gain (IG) and Mutual Information (MI) are used. These feature selection methods are widely used to select relevant feature subsets in the text processing domain. Chi-square (χ²) is a statistical method used to test whether a specific feature is correlated with the class. A higher χ² score indicates that the occurrence of the feature is highly dependent on the occurrence of the class. IG is frequently used in the field of machine learning as a term-goodness criterion. IG measures the information that is gained by knowing the value of the attribute, which is the difference between the entropy of the distribution before the split and the entropy of the distribution after the split. A higher IG value indicates that a feature contributes more information for predicting the category of the review, whereas a lower IG value indicates that it does not add much information. Similarly, MI measures how much information a feature contains about a class. If a feature is distributed identically within a class and across classes, the MI value reaches its minimum; it reaches its maximum when the feature occurs within a single class only.
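To make the scoring concrete, the sketch below computes χ² and IG for a single term from a 2 × 2 contingency table of term presence against class label. It assumes binary presence and two classes (1 = ironic, 0 = regular); the function names are hypothetical and the formulas are the standard text-categorization forms, not necessarily the exact variants used in the paper.

```python
import math

def contingency(presence, labels):
    """Count (a, b, c, d): term present in ironic / present in regular /
    absent in ironic / absent in regular reviews."""
    a = sum(1 for p, y in zip(presence, labels) if p and y == 1)
    b = sum(1 for p, y in zip(presence, labels) if p and y == 0)
    c = sum(1 for p, y in zip(presence, labels) if not p and y == 1)
    d = sum(1 for p, y in zip(presence, labels) if not p and y == 0)
    return a, b, c, d

def chi_square(a, b, c, d):
    """Chi-square statistic of the 2 x 2 table; 0.0 if a margin is empty."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(k / total * math.log2(k / total) for k in counts if k)

def information_gain(a, b, c, d):
    """Class entropy before the split minus the weighted entropy after
    splitting the reviews on presence of the term."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    before = entropy([a + c, b + d])
    after = (a + b) / n * entropy([a, b]) + (c + d) / n * entropy([c, d])
    return before - after
```

A term that occurs in every ironic review and in no regular review receives the maximum scores, while a term spread evenly over both classes scores zero under both criteria.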
Let m be the number of reviews and n the total number of features in the feature space. The preprocessed texts are represented in the form of a Document Term Matrix DTM (m × n). The feature selection methods are applied to select relevant features from the feature space. Each conventional feature selection method generates a score (S) for every feature. The feature scores are arranged in descending order to select the top rated features. The feature subsets are selected by fixing a threshold value (l) empirically. The first stage of feature subset selection using conventional feature selection methods is shown in Algorithm 1.
Step 2: $S = [S_1, S_2, \ldots, S_n]$ // n = total number of features
Step 3: Sort S in descending order // to select top ranked feature scores
Step 4: Select the first l features from S
The selected feature subset is represented using a Document Term Matrix DTM (m × l), where m represents the number of reviews and l indicates the number of features selected by the feature selection methods (l << n).
The work presented in [26] suggested that features may convey similar information in the feature space. Accordingly, features exhibiting similar information are grouped in order to select the most representative feature from each group. For this reason, the second stage of feature selection is applied to select the most representative features from the feature subset obtained in the first stage.
In the second stage, the features are grouped based on the similarity of the information they carry. A clustering algorithm is used to group the features that carry similar information. In this work, the k-means clustering algorithm is used so that the members of each group are as similar (close) as possible to one another. The k-means clustering algorithm works iteratively to assign each feature to one of the k clusters based on feature similarity. To determine the optimal number of clusters (k), as mentioned in [26] [29], the number of clusters is varied from $\sqrt{l}$ to $l/2$. The feature nearest to each cluster center is considered the representative feature among the features within the cluster. The cosine similarity measure is used to determine the similarity between the features and the cluster center.
The DTM (m × l) obtained from the first stage of FSS is transposed and represented as a Term Document Matrix TDM (l × m), where l indicates the number of features in the subset and m represents the number of reviews. Algorithm 2 presents the second stage of FSS, which selects the most discriminative features among the features exhibiting similar information.
Step 1: for i ← 1 to k do
Step 2: for j ← 1 to l do // compute distance between cluster center $U_i$ and feature $F_j$
Step 3: Assign $F_j$ to nearest cluster center $U_i$
// update cluster center
Step 4: for i ← 1 to k do
// find representative feature for each cluster center
Step 5: for i ← 1 to k do
Further, the feature subset F, which consists of k discriminative features, is represented as TDM' (k × m), where m is the number of reviews and k is the reduced number of features, k < l < n.
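The ranking and thresholding steps of Algorithm 1 can be sketched as follows. The function name `select_top_features` is hypothetical, and the scores are assumed to have been computed beforehand by one of the feature selection methods.

```python
def select_top_features(dtm, scores, l):
    """Keep the l highest-scoring columns of an m x n document-term matrix.

    Returns the kept column indices (best first) and the reduced m x l matrix.
    """
    # Rank feature indices by score, highest first (ties keep original order).
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = ranked[:l]
    reduced = [[row[j] for j in keep] for row in dtm]
    return keep, reduced
```

With scores `[0.2, 0.9, 0.5]` and l = 2, the second and third columns are retained, in that order.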
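The second stage can be sketched as a small k-means loop over the rows of the term-document matrix, using cosine similarity and returning the feature closest to each cluster center. This is an illustrative approximation only: seeding the centers with the first k features, the fixed iteration cap and the name `representative_features` are assumptions that the paper does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def representative_features(tdm, k, iterations=20):
    """Cluster the rows (features) of an l x m term-document matrix and
    return, for each non-empty cluster, the index of the feature that is
    most similar to the cluster center."""
    centers = [list(row) for row in tdm[:k]]  # assumption: first k rows as seeds
    assign = [0] * len(tdm)
    for _ in range(iterations):
        # Assign every feature to its most similar center.
        new_assign = [max(range(k), key=lambda i: cosine(f, centers[i]))
                      for f in tdm]
        # Update each center as the mean of its member features.
        for i in range(k):
            members = [tdm[j] for j, c in enumerate(new_assign) if c == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
        if new_assign == assign:
            break
        assign = new_assign
    # Pick the member closest to each center as the representative.
    reps = []
    for i in range(k):
        idx = [j for j, c in enumerate(assign) if c == i]
        if idx:
            reps.append(max(idx, key=lambda j: cosine(tdm[j], centers[i])))
    return sorted(reps)
```

The returned indices identify the k rows kept in the reduced matrix TDM' (k × m); fewer than k indices are returned if a cluster ends up empty.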

E. Feature Fusion
In addition to linguistic features, content based features play a significant role in sentiment analysis [17]. In order to capture various dimensions of the characteristics of a review, the linguistic features and the subset of content based features are fused. Overall, twenty special features are extracted using linguistic based feature extraction and categorized into five groups. The special features represent the frequency of occurrence of each feature in a review. On the other hand, content based feature subset selection is applied to select the most discriminative features from a high dimensional feature space. The special features and the content based features obtained from Feature Subset Selection (FSS) are fused and evaluated using various classifiers.
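One simple reading of the fusion step is per-review concatenation of the two feature vectors, sketched below. The paper does not state the fusion operator explicitly, so concatenation and the function name `fuse` are assumptions.

```python
def fuse(special, fss):
    """Concatenate, review by review, the special feature vector with the
    selected content based feature vector."""
    if len(special) != len(fss):
        raise ValueError("both feature sets must describe the same reviews")
    return [list(s) + list(f) for s, f in zip(special, fss)]
```

Under this reading, 20 special features fused with a subset of 4,500 content based features give 4,520 features per review.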

F. Ensemble Classifiers
In this work, predictive classifiers such as Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT) and Random Forest (RF) are used to evaluate the efficiency of the proposed method. The SVM classifier works by finding the hyper-plane which maximizes the margin between the two classes. The vectors that define the hyper-plane are known as support vectors. SVM is used for both classification and regression problems [34]. LR is a statistical method used for binary classification problems (problems with two class values) [33]. DT is a non-parametric approach that constructs a tree in a top-down, recursive divide-and-conquer manner [17]. DT is mainly used for classification and regression problems. Similarly, RF is an automatic learning technique which combines the concepts of random subspaces and bagging [36]. Random Forest operates by constructing a multitude of decision trees at training time and outputting the class based on the decisions of the individual trees [35]. These four classifiers are widely used in the classification of ironic content in reviews [18] [30].
On the other hand, the works of [37] [38] emphasize that the individual predictions of various classifiers can be ensembled so that a more robust and accurate classification model can be built. Ensemble learning plays a vital role in recent research on pattern recognition and machine learning [38]. The main aim of ensemble learning is to weigh several individual classifiers and combine their predictions so that the combination outperforms the predictions of the individual classifiers [31]. Majority voting and weighted majority voting are the most popular combination schemas and are widely used in ensemble classification [39]. The simple majority voting schema selects the predicted class with the most votes [40]. The weighted majority voting schema assigns a weight to each classifier's prediction based on the performance of the classifier [41]. In this work, the weighted majority voting schema is used to assign the weight for each decision of the classifiers. Let there be classifiers $(h_1, h_2, \ldots, h_I)$ with accuracies $(a_1, a_2, \ldots, a_I)$, respectively, and let $h_{i,j}$ denote the decision of the $i$-th classifier for the $j$-th label from class $C$. In the weighted majority voting schema, weights $w_i$ are assigned to the individual decisions of the classifiers $h_i$, and the ensemble decision $H(X)$ is the label with the largest weighted support:
$$H(X) = \arg\max_{j} \sum_{i=1}^{I} w_i \, h_{i,j}(X)$$
The optimal weight $w_i$ of each classifier is assigned based on $a_i$, the accuracy of the independent classifier $h_i$. In [54], each classifier output is given a weight between 0 and 1 ($0 \le w_i \le 1$), and the weights $w_i$ sum to 1. Hence, the weights $w_i$ are assigned based on the performance of the individual classifiers as indicated in [54].
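A minimal sketch of weighted majority voting for a single review is given below. The function name `weighted_vote` and the dictionary-based interface are hypothetical; the weights must lie between 0 and 1 and sum to 1, as required in [54].

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers carry the largest
    total weight.

    predictions: classifier name -> predicted label for one review
    weights:     classifier name -> weight in [0, 1]; weights sum to 1
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    totals = defaultdict(float)
    for name, label in predictions.items():
        totals[label] += weights[name]
    return max(totals, key=totals.get)
```

With weights of 0.3, 0.1, 0.3 and 0.3 for SVM, LR, RF and DT, two agreeing high-weight classifiers (total 0.6) outvote a disagreeing pair made up of LR and one other classifier (total 0.4).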

IV. Experimental Results and Discussion
In this section, the effectiveness of feature fusion is examined on a publicly available product review dataset [20].

A. Dataset Description
The Amazon product review dataset created by [20] consists of 1,254 reviews, of which 437 are ironic and 817 are non-ironic (regular). Each entry contains a star rating ranging from 1* to 5*, along with the labeled ironic or regular review content. Table II depicts the distribution of reviews by star rating.

B. Experimental Setup
In this work, we conducted experiments based on an 80/20 training/testing split of the product review dataset [20]. The experiments are conducted similarly to [18], [30] to allow comparison with existing methods. The experiments are based on two baselines: the Feature Subset Selection (FSS) baseline, which performs the two stage feature selection, and the Feature Fusion baseline, which fuses the linguistic features with the FSS features. Initially, twenty Special Features (SF) are extracted from each review using the linguistic approach. On the other hand, various preprocessing techniques are applied to extract content based features from the review text. In total, 20,985 distinct features are extracted and represented using the unigram term frequency schema. Further, the extracted features are processed to select discriminative features using FSS. The FSS consists of two stages. In the first stage, conventional feature selection methods such as Chi-square, IG and MI are used to select feature subsets. The features are arranged in descending order based on the scores obtained from the individual feature selection methods. A threshold value is used to select the top scored features from the feature space. The threshold value is varied empirically between 1,000 and 20,000 features. In the second stage, k-means clustering is applied to select the most discriminative features from the first stage feature subsets. The optimal number of clusters is determined based on [26], as explained in the previous section. The classifiers are ensembled based on the weighted majority voting schema as explained in Section III (F). The efficiency of the proposed method is evaluated based on the Precision, Recall and F-measure obtained from the individual classifiers and the ensemble classifier.
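For reference, Precision, Recall and F-measure for the ironic class can be computed as in the sketch below. Treating "ironic" as the positive class and the function name `precision_recall_f` are assumptions; the paper does not state how the measures are averaged over classes.

```python
def precision_recall_f(y_true, y_pred, positive="ironic"):
    """Return (precision, recall, F-measure) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

The F-measure here is the harmonic mean of precision and recall, so it is high only when both are high.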

C. Experimental Results
To report the performance of the proposed method, the experiments are conducted with two baselines. In the first baseline, FSS consists of the two stage feature selection method. In the first stage, threshold values are varied from 1,000 to 20,000 features, and it is found empirically that 10,000 features give competitive results for various classifiers. Table III compares the full feature set with the conventional feature selection methods of the first stage of FSS using SVM, DT and RF classifiers. Mutual Information (MI) with 10,000 features outperforms the other selection methods in accuracy, precision, recall and F-measure using SVM, DT and RF classifiers. Information Gain (IG) with a 10,000 feature subset achieves a recall of 0.658, which compares favorably with the other feature selection methods using the DT classifier. Overall, from Table III we can observe that the feature selection method plays a vital role in reducing the high dimensional feature space and maximizing the performance of the classifiers. Further, the second stage of FSS is applied to the 10,000 features obtained from the various feature selection methods.
In the second stage of the FSS method, the k-means clustering algorithm is used by varying the number of clusters between 100 and 5000 with a difference of 500. The k-means clustering with cosine distance yields promising results compared to the Euclidean, City block, Correlation and Hamming distance metrics [44] [45]. Hence, k-means clustering with cosine distance is used to select the most discriminative features from the first stage FSS method. As explained in Algorithm 2, similar features are clustered and the features nearest to the cluster centers are considered the most discriminative feature subset. Fig. 2 depicts the classification accuracy of the proposed method with varying feature subsets using SVM, RF, LR and DT classifiers. Fig. 2 shows feature subsets ranging from 1000 to 5000 features with differences of 500 features. Feature subsets of fewer than 1000 features do not yield promising results and are therefore excluded from Fig. 2. From Fig. 2 (a), we observe that the accuracy of FSS with MI and k-means gradually increases and reaches its maximum for 5000 features using the SVM classifier. Similar variation is observed in Fig. 2 (b) and Fig. 2 (c), where MI with k-means clustering reaches maximum accuracy for 5000 features using RF and LR classifiers, respectively. In Fig. 2 (d), Chi-square with k-means clustering achieves maximum accuracy for 2500 features using the DT classifier. Overall, the performance of the MI with k-means feature subset increases with the number of features using SVM, RF, LR and DT classifiers. The feature fusion baseline consists of Special Features (SF) and FSS features. Both feature types use the same weighting scheme, i.e., the frequency of occurrence of each feature. From Fig. 2, we observe that feature subsets ranging from 4000 to 5000 features yield promising results using various classifiers.
Hence, feature fusion experiments are conducted by varying FSS from 4000 to 5000 features, combined with the 20 Special Features, using 10-fold cross-validation. The numbers of features considered in the feature fusion (FSS + SF) experiments are 4020, 4520 and 5020. Fig. 3 uses box plots to describe the variation in F-measure of feature fusion with the various FSS methods and classifiers. Each box plot represents the minimum, maximum and mean over the 10 folds of cross-validation using SVM, LR, RF and DT classifiers. Fig. 3 (a), (b) and (c) present the results of feature fusion of FSS (Chi-square) + SF feature subsets using SVM, LR, RF and DT classifiers. In Fig. 3 (b), feature fusion of FSS (Chi-square) + SF achieves the maximum F-measure at 4520 features using the LR classifier. Fig. 3 (d), (e) and (f) present the F-measure results of feature fusion of FSS (IG) + SF feature subsets. Fig. 3 (e) shows feature fusion of FSS (IG) + SF achieving the maximum F-measure at 4520 features using the RF classifier. Fig. 3 (g), (h) and (i) present the results of feature fusion of FSS (MI) + SF using SVM, LR, RF and DT classifiers. From Fig. 3 (h), we observe that feature fusion of FSS (MI) + SF achieves the maximum F-measure at 4520 features using the LR classifier. From Fig. 3, we can conclude that feature fusion (FSS (MI) + SF) with the 4520-feature subset achieves the maximum F-measure using SVM, LR, RF and DT classifiers. Hence, we adopt the feature fusion method (FSS (MI) + SF) with 4520 features for comparison with existing methods.
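Since both feature groups share the same frequency weighting, the fusion step can be pictured as a simple concatenation of two count vectors. The following is a minimal sketch under that assumption; the vocabularies, the tokenised review and the helper names `count_vector` and `fuse` are hypothetical and stand in for the 4000 to 5000 FSS features and 20 Special Features used in the experiments.

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Frequency of each vocabulary entry in one tokenised review."""
    counts = Counter(tokens)
    return [counts[entry] for entry in vocabulary]

def fuse(tokens, fss_vocab, special_vocab):
    """Concatenate content-based (FSS) counts with special-feature (SF) counts."""
    return count_vector(tokens, fss_vocab) + count_vector(tokens, special_vocab)

# Hypothetical vocabularies, far smaller than those used in the paper.
fss_vocab = ["great", "works", "broke"]
special_vocab = ["!", ":)"]

review = ["great", "!", "great", ":)", "broke", "!"]
fused = fuse(review, fss_vocab, special_vocab)

# The fused dimensionality is the sum of the two parts (e.g. 4500 + 20 = 4520).
fused_size = len(fss_vocab) + len(special_vocab)
```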
Further, the classifiers are combined into an ensemble that pools their collective opinions to enhance model performance. Here, a weighted majority voting scheme is used to assign a weight to the decision of each classifier. As explained in Section III (F), the weights are varied between 0 and 1 such that they sum to 1. The best combination of weights is assigned empirically for each model. Table IV presents the maximum accuracy obtained in 10-fold cross-validation for the proposed feature fusion method (FSS (MI) + SF) with 4520 features using the various classifiers. From Table IV, it can be inferred that SVM, DT and RF outperform the LR classifier by a good margin in terms of accuracy, precision, recall and F-measure. The LR model performs worst among the models, possibly due to overfitting. Hence, the best weights assigned are 0.3 for SVM, 0.1 for LR, 0.3 for RF and 0.3 for DT. The results in Table IV are used for comparison with existing methods on the product review dataset [20].
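The voting step can be sketched as follows, using the weights reported above. This is an illustrative reading of weighted majority voting rather than the exact procedure of Section III (F): the function name `weighted_majority_vote` and the example predictions are assumptions.

```python
def weighted_majority_vote(predictions, weights):
    """Return the label whose supporting classifiers carry the largest total weight.

    predictions: dict mapping classifier name -> predicted label
    weights:     dict mapping classifier name -> weight (weights must sum to 1)
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("classifier weights must sum to 1")
    totals = {}
    for name, label in predictions.items():
        totals[label] = totals.get(label, 0.0) + weights[name]
    return max(totals, key=totals.get)

# Weights reported in the text for the four classifiers.
weights = {"SVM": 0.3, "LR": 0.1, "RF": 0.3, "DT": 0.3}

# Hypothetical per-review predictions: 1 = ironic, 0 = regular.
split_vote = {"SVM": 1, "LR": 0, "RF": 1, "DT": 0}
lr_outvoted = {"SVM": 1, "LR": 1, "RF": 0, "DT": 0}

decision_a = weighted_majority_vote(split_vote, weights)   # 0.6 for label 1 vs 0.4 for label 0
decision_b = weighted_majority_vote(lr_outvoted, weights)  # 0.4 for label 1 vs 0.6 for label 0
```

With these weights, the low-weighted LR classifier cannot tip a two-against-two split in its favour, which reflects the intent of down-weighting the weakest model.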

D. Comparison with Existing Methods
In the literature, Buschmeier et al. [18] and Reganti et al. [30] describe irony/satire detection on the product review dataset [20] using SVM, LR, DT and RF classifiers. We conduct a similar set of experiments and compare the results. Table V compares the proposed method with the methods of [18] and [30]. In [18], the models are evaluated on precision, recall and F-measure using linear SVM, LR, DT and RF classifiers. The features are extracted and fused using a lexicon-based approach (29 special features) and a content-based approach (21744 Bag of Words features), giving a total of 21773 features. On the other hand, Reganti et al. [30] perform feature fusion on 42 special features and baseline features. The baseline features are character n-grams, word n-grams and word skip-grams, but the total number of features is not clearly stated. The fused features are classified using linear SVM, LR, DT and RF classifiers. Further, the decisions of the individual classifiers are combined using a weighted majority voting scheme, with weights varied from 0 to 1. The performance of the model is evaluated using F-measure. Hence, we conduct our experiment along the same lines with 20 Special Features (SF) and 4500 text features, using individual classifiers and an ensemble classifier based on weighted majority voting. However, the product review dataset [20] has also been used to describe the impact of sarcasm detection on sentiment shift [19]. Therefore, the proposed method is compared with the methods proposed in [18] and [30]. From Table V, it can be observed that the proposed feature fusion method outperforms the existing methods in terms of precision, recall and F-measure using both individual and ensemble classifiers.

E. Discussion
Feature fusion is the process of integrating features with different characteristics to produce more consistent, accurate and useful information. In irony detection, feature extraction and selection play a vital role in identifying the ironic utterances present in sentences. In this work, linguistic features and content-based Feature Subset Selection (FSS) features are extracted and fused to capture various characteristics of the product review dataset [20]. Table II shows the distribution of product reviews by star rating. The 20 linguistic features are extracted and represented by their frequency of occurrence in each review. Table I presents the 20 Special Features (SF) extracted using the linguistic approach. These features are used to detect ironic utterances present in reviews. On the other hand, content-based features play a vital role in the accurate classification of ironic reviews. High dimensionality and sparsity are among the major challenges faced during the classification task. To address the curse of dimensionality, a two-stage content-based Feature Subset Selection (FSS) is proposed. In the first stage of FSS, a conventional Feature Selection Method (FSM) is applied to select a relevant feature set. In this experiment, the conventional FSMs Chi-square, IG and MI are used to select the relevant feature set from the original features based on the scores of the individual FSM. Table III shows that first-stage FSS using a conventional FSM reduces the high-dimensional feature space by selecting a subset of relevant features. The MI feature selection method yields the best performance at 10,000 features compared to the other FSMs using SVM, LR, DT and RF classifiers. However, the selected feature subset may contain features that convey similar information. For this reason, the second stage of FSS is applied to select the more discriminative features within the subset.
In the second stage, the k-means clustering algorithm is used to group features that convey similar information. Fig. 2 (a), (b), (c) and (d) present the classification accuracy of the proposed method with a varying number of features (1000 to 5000) using SVM, RF, LR and DT classifiers. MI with k-means consistently outperforms the other feature selection combinations across the classifiers because MI compares the probability of observing a feature and a class together (the joint probability) with the probabilities of observing them independently. MI reaches its maximum value when the feature is a perfect indicator of class membership. k-means, on the other hand, groups features carrying similar information. The feature closest to the cluster center is considered the most representative feature of each cluster, and these representative features are taken to be discriminative across clusters. Hence, the combination of these methods exhibits maximum accuracy on all the classifiers. Further, the linguistic features and content-based FSS features are fused to detect ironic utterances present in reviews. A crucial observation comes from Fig. 3 (a) to (i), which depict the F-measure over 10-fold cross-validation of feature fusion using the various classifiers. Fig. 3 (h) shows the maximum F-measure for feature fusion (FSS (MI + k-means) + SF) with the 4520-feature set. Further, the predictions of the classifiers are combined into an ensemble to enhance classification performance. The weighted majority voting scheme is used to combine the different classifiers based on the prediction of each classifier. Table IV compares the performance of the ensemble model with that of the individual classifiers. The ensemble model outperforms the individual classifiers in accuracy, precision, recall and F-measure. To evaluate its effectiveness, the proposed method is compared with existing methods in the literature.
From Table V, it can be observed that the proposed method outperforms the existing methods, using a smaller feature set while achieving higher classification performance with both individual and ensemble classifiers.
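The second-stage selection discussed above (cluster similar features with cosine distance, then keep the feature nearest each cluster center) can be sketched in a few lines. This is a simplified illustration and not a reproduction of Algorithm 2: representing each feature by its vector of per-document counts, normalizing those vectors to unit length, initializing the centers from the first k feature names, and the name `kmeans_representatives` are all assumptions made for the example.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm else list(v)

def kmeans_representatives(features, k, max_iter=20):
    """features: dict mapping feature name -> per-document count vector.
    Returns, for each non-empty cluster, the feature nearest its center."""
    names = sorted(features)
    k = min(k, len(names))
    vecs = {n: unit(features[n]) for n in names}
    # Assumption: deterministic initialization from the first k feature names.
    centers = [vecs[n] for n in names[:k]]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each feature to its nearest center.
        clusters = [[] for _ in range(k)]
        for n in names:
            nearest = min(range(k), key=lambda j: cosine_distance(vecs[n], centers[j]))
            clusters[nearest].append(n)
        # Update step: each center becomes the mean of its members.
        new_centers = [
            [sum(col) / len(members) for col in zip(*(vecs[m] for m in members))]
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return [
        min(members, key=lambda m: cosine_distance(vecs[m], centers[j]))
        for j, members in enumerate(clusters) if members
    ]

# Hypothetical features described by their counts in four reviews.
features = {
    "awful": [2, 1, 0, 0], "terrible": [1, 1, 0, 0], "worst": [3, 1, 0, 0],
    "best": [0, 0, 1, 2], "great": [0, 0, 1, 1], "love": [0, 0, 1, 3],
}
representatives = kmeans_representatives(features, k=2)
```

Here six features collapse into two clusters, and one representative per cluster is retained, which is the redundancy-reduction effect the second stage is meant to achieve.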

V. Conclusion and Future Work
In this paper, we employed feature fusion to capture various characteristics of the reviews. The proposed approach extracts linguistic features and content-based text features. Five types of features are extracted using the linguistic approach, viz., Rating, Word, Acronym, Symbol and Emoticon features. The content-based text features, on the other hand, are obtained through two stages of Feature Subset Selection (FSS) that select the most discriminative features. Both sets of features are fused and classified using SVM, LR, RF and DT classifiers. Through a series of experiments, we demonstrated that the proposed approach is able to capture ironic utterances present in the reviews. To enhance the performance of the classifiers, we use a weighted majority voting scheme to create an ensemble from the decisions of the individual classifiers. The results show that the proposed feature fusion outperforms the existing methods on the benchmark dataset.
In future, the proposed approach can be extended by (i) extracting more linguistic features to identify ironic utterances present in text, (ii) using more sophisticated feature selection methods and (iii) employing various clustering algorithms along with ensemble methods to enhance classification accuracy. Further, the proposed approach can also be extended to other tasks such as sentiment classification and spam detection.