Recognition of Emotions using Energy Based Bimodal Information Fusion and Correlation

Multi-sensor information fusion is a rapidly developing research area that forms the backbone of numerous essential technologies such as intelligent robotic control, sensor networks, and video and image processing. In this paper, we develop a novel technique to analyze and correlate human emotions expressed in voice tone and facial expression. Audio and video streams are captured to populate bimodal data sets, from which the emotions expressed in voice tone and facial expression, respectively, are sensed. An energy based mapping is applied to overcome the inherent heterogeneity of the recorded bimodal signal. The fusion process uses the sampled and mapped energy signals of both modalities' data streams and recognizes the overall emotional component using a Support Vector Machine (SVM) classifier with an accuracy of 93.06%.


I. INTRODUCTION
MULTI-SENSOR information fusion is a rapidly developing area of research and development which forms the foundation of intelligent robotic control. It comprises methods and techniques that collect input from multiple similar or dissimilar sources and sensors, extract the required information, and fuse it to achieve better inference accuracy than could be achieved with a single data source alone. In this contribution, we discuss a novel approach to fusing heterogeneous data sets obtained from multiple sensors with the aim of analyzing human emotional behavior.
Emotions play an important role in human-to-human communication and interaction, allowing people to express themselves beyond the verbal domain. The ability to understand human emotions is desirable for computers in applications such as computer-aided learning or user-friendly on-line help. During an interaction, an individual uses multiple modalities such as eye gaze, hand gestures, facial expressions, body posture, and tone of voice. Human behavior is thus inherently multimodal. In addition to its multimodal nature, the emotional state of an individual is an integral component of human experience and plays a significant role in developing intelligent systems for human-computer communication. It influences numerous phenomena such as cognition, perception, learning, creativity and decision-making. Besides problem solving, reasoning, perception and cognitive tasks, emotion recognition also plays a pivotal role in functions which are essential for artificial intelligence.
Considering these two aspects of human behavior, we have designed and developed a technique to analyze and correlate bimodal data sets and further recognize the emotional component from these fused data sets. This new technology ensures a proper balance between emotion recognition and cognition tasks.
The existing fusion methods listed in the related work section (Section II) do not address how to bridge the heterogeneity present in the captured data of each individual modality. The energy based mapping method is inspired by how the different stimuli sensed by humans are mapped as energy onto designated areas of the brain. This method brings homogeneity among heterogeneous emotional cues by transforming them into their corresponding energy levels. The achieved fusion accuracy of 93.06% can ensure a proper balance between emotion recognition and cognition tasks.
The rest of the paper has been organized in the following manner: in Section II, along with existing fusion approaches, we discuss an energy based method for fusion of multimodal data sets. In Section III, we explain the architectural framework of our model. The implementation of the solution is delineated in Section IV. Section V outlines the applications of this model. Lastly, we conclude the research study in Section VI.

II. RELATED WORK
The wide use of multimodal data fusion technologies in diverse application areas has attracted ever-increasing interest from researchers all over the globe. Multimodal data fusion techniques are used in numerous areas such as intelligent systems, robotics, sensor networks, and video and image processing. Multimodal data fusion can be performed at three levels: feature, decision and hybrid level fusion.
Feature level fusion has been used in [1] for fusing a range of spatial cues, with linear weights assigned to them. However, the authors were unable to resolve how the weights should be assigned to reflect the relevance and importance of the different cues.
Neti [2] performed decision level fusion for speaker recognition and speech event detection. They analyzed audio features (e.g. phonemes) and visual features (e.g. visemes) independently to arrive at a recognition decision for each single modality. Thereafter, they employed a linear weighted sum strategy to fuse these individual decisions. The authors used the training data to determine the relative reliability of the different modalities and adjusted their weights accordingly. In [3], by contrast, the results of different classifiers are combined by decision level fusion for speaker identification. From the speech corpora, a set of patterns is identified for each speaker on the basis of predefined features by two different classifiers. The majority decision regarding the identity of the unknown speaker is obtained by fusing the output scores of all the classifiers using a late integration approach.
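As an illustration of the linear weighted sum strategy for decision level fusion, the following minimal Python sketch combines per-class scores from two modalities. It is not the implementation of [2]; the function name `fuse_decisions`, the class labels, the scores and the weights are all hypothetical and chosen only for the example.

```python
def fuse_decisions(audio_scores, video_scores, w_audio=0.5, w_video=0.5):
    """Combine per-class scores of two modalities by a linear weighted sum."""
    if audio_scores.keys() != video_scores.keys():
        raise ValueError("both modalities must score the same classes")
    fused = {
        label: w_audio * audio_scores[label] + w_video * video_scores[label]
        for label in audio_scores
    }
    # The fused decision is the class with the highest combined score.
    return max(fused, key=fused.get), fused


# Hypothetical scores: the audio classifier is trusted more than the video one.
label, fused = fuse_decisions(
    {"speaker_a": 0.7, "speaker_b": 0.3},
    {"speaker_a": 0.4, "speaker_b": 0.6},
    w_audio=0.75,
    w_video=0.25,
)
```

In such a scheme the weights would typically be tuned on training data so that the more reliable modality contributes more to the fused decision.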
A multimodal integration approach using custom-defined rules has been suggested by Holzapfel et al. [4]. They demonstrated smooth human-robot interaction in a kitchen setting by fusing the results of speech and 3D pointing gestures. This multimodal fusion is performed at the decision level, based on the n-best lists generated by each of the event parsers. The approach demonstrates a close temporal correlation between speech and gesture, but it incurs processing-time overhead in determining the best action from the n-best fused input.
In [5], two techniques are suggested: (1) gradient-descent-optimization linear fusion (GLF) and (2) super-kernel nonlinear fusion (NLF). Each of them seeks an optimal combination of multimodal information for video concept detection. In GLF, individual kernel matrices are first constructed and then fused together using a weighted linear combination scheme. Unlike GLF, the NLF method performs a nonlinear combination of multimodal information.
In [6], the authors use the NLF method and first construct an SVM classifier for each individual modality. Thereafter, a super-kernel nonlinear fusion is applied for optimal combination of the individual classifier models. Experiments conducted on the TREC-2003 Video Track benchmark show that NLF performs on average 3.0% better than GLF.
To classify images, Zhu et al. [7] have given a hybrid level multimodal fusion framework. They used an SVM to classify images with embedded text within their spatial coordinates. The fusion process is done in two steps. First, on the basis of low-level visual features, a bag-of-words model [8] is used to classify the given image. At the same time, a text detector records the existence of text in the image using text color, size, location, edge density, brightness, contrast, etc. In the second step, a pair-wise SVM classifier is used to fuse the visual and textual features.
A time-delayed neural network was employed by Cutler and Davis [9] for feature level multimodal data fusion to locate the speaking person in a scene. This is done by identifying the correlation between the audio and visual streams.
In another work, related to detecting human activities, Gandetto et al. [10] used a neural network based decision level fusion method to combine sensory data. The experiment considers an environment equipped with a heterogeneous network of state sensors (sensing CPU load, login process, and network load) and cameras (for visual observation), together with computational units working in a LAN. Human activity is monitored by fusing the data from these two types of sensors at the decision level.
A framework that fuses textual and visual information is given in [16]. The author proposes additional pre-processing before combining these modalities in a linear weighted fashion at the feature and scoring levels. The pre-processing, called latent semantic mixing, handles the overlapping information between the two modalities by mapping the bimodal feature space onto a low-dimensional semantic space.
In this paper, we propose a feature level linear weighted fusion model based on a human-inspired brain energy mapping concept. Humans collect sensory data through their biological senses (sight, hearing, touch, smell and taste) and map these data as energy stimuli onto designated regions of the brain. The brain then fuses them together to obtain an inference. This analogy is employed in designing the architectural framework of our work. This phenomenon is depicted in

Stage I - Data Pre-Processing
The bimodal input is obtained and then split into two components: audio and video. Thereafter, audio processing and video processing are performed simultaneously and in synchronization. The synchronization is necessary to ensure that no data is lost and that the audio and video samples at any particular instant are processed together.
The audio component is segmented at the rate of 20 samples (utterances) per second. Video sampling is done at the rate of 20 frames per second.
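The segmentation step can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name `split_into_subsamples` is hypothetical, the audio is assumed to be a mono sequence of raw samples, and the 16 kHz sampling rate is an assumed value that the text does not specify.

```python
def split_into_subsamples(samples, sample_rate, per_second=20):
    """Split a 1-D sample sequence into consecutive sub-samples.

    `per_second` sub-samples are produced for every second of signal,
    matching the rate of 20 per second used for audio and video.
    """
    if sample_rate % per_second != 0:
        raise ValueError("sample_rate must be divisible by per_second")
    size = sample_rate // per_second
    # Drop an incomplete trailing chunk so all sub-samples have equal length.
    usable = len(samples) - len(samples) % size
    return [samples[i:i + size] for i in range(0, usable, size)]


# One second of placeholder audio at an assumed 16 kHz sampling rate.
one_second = list(range(16000))
chunks = split_into_subsamples(one_second, sample_rate=16000)
```

Using the same count of 20 units per second for both modalities keeps the i-th audio sub-sample aligned with the i-th video frame.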

Stage II - Feature Extraction
The prosodic feature, mean intensity of the audio component between times t1 and t2, is computed as:

$$\bar{x} = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} x(t)\,dt \qquad (1)$$

where x(t) is the intensity as a function of time (in dB).
To compute the intensity, the values in the sound are first squared, then convolved with a Gaussian analysis window (Kaiser-20; sidelobes below -190 dB). The effective duration of this analysis window is 3.2 / (minimum_pitch), which guarantees that a periodic signal is analysed as having a pitch-synchronous intensity ripple not greater than 0.00001 dB.
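A minimal Python sketch of the two quantities described here is given below. It assumes the intensity contour is already available as a list of dB values with matching frame times, and it approximates the time average of x(t) over [t1, t2] by the mean of the frames inside that interval. The function names and the example values are hypothetical.

```python
def analysis_window_duration(minimum_pitch_hz):
    """Effective duration (s) of the analysis window: 3.2 / minimum pitch."""
    if minimum_pitch_hz <= 0:
        raise ValueError("minimum pitch must be positive")
    return 3.2 / minimum_pitch_hz


def mean_intensity_db(intensity_db, frame_times, t1, t2):
    """Average the intensity frames (in dB) whose times lie in [t1, t2]."""
    selected = [x for x, t in zip(intensity_db, frame_times) if t1 <= t <= t2]
    if not selected:
        raise ValueError("no intensity frames fall inside [t1, t2]")
    return sum(selected) / len(selected)


# Hypothetical contour: four frames, 0.1 s apart.
window = analysis_window_duration(100.0)
mean_db = mean_intensity_db([60.0, 70.0, 80.0, 90.0], [0.0, 0.1, 0.2, 0.3], 0.1, 0.2)
```

For uniformly spaced frames, the frame mean is a simple discrete approximation of the integral average; a finer approximation would interpolate at the interval boundaries.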
The processing of video frames is done in two steps: (1) facial feature extraction, in which the Discrete Area Filters (DAF) library [13] is used to extract the coordinates of 15 facial feature points; and (2) energy (gradient) computation (Fig. 3) at the extracted facial feature coordinates using the OpenCV library. In Fig. 3, each lowercase letter represents the brightness (sum of the red, blue, and green values) of the corresponding pixel. To compute the energy of edge pixels, we consider the image to be surrounded by a 1-pixel-wide border of black pixels (with 0 brightness).
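The gradient energy step can be sketched in plain Python as below. The exact operator of Fig. 3 is not reproduced in the text, so this sketch assumes a Sobel-style gradient over the 3x3 neighbourhood a..i of each pixel; that choice, the function names, and the use of pure Python instead of OpenCV are assumptions made for illustration. The brightness definition and the black 1-pixel border follow the description above.

```python
import math


def brightness(pixel):
    """Brightness of an (r, g, b) pixel: the sum of its three channels."""
    r, g, b = pixel
    return r + g + b


def energy_map(image):
    """Gradient energy per pixel of an image given as rows of (r, g, b)."""
    height, width = len(image), len(image[0])
    # Surround the image with a 1-pixel border of zero brightness.
    padded = [[0] * (width + 2) for _ in range(height + 2)]
    for y in range(height):
        for x in range(width):
            padded[y + 1][x + 1] = brightness(image[y][x])
    energy = [[0.0] * width for _ in range(height)]
    for y in range(1, height + 1):
        for x in range(1, width + 1):
            # 3x3 neighbourhood, row by row: a b c / d e f / g h i.
            a, b, c = padded[y - 1][x - 1], padded[y - 1][x], padded[y - 1][x + 1]
            d, f = padded[y][x - 1], padded[y][x + 1]
            g, h, i = padded[y + 1][x - 1], padded[y + 1][x], padded[y + 1][x + 1]
            gx = a + 2 * d + g - c - 2 * f - i  # assumed Sobel-style x gradient
            gy = a + 2 * b + c - g - 2 * h - i  # assumed Sobel-style y gradient
            energy[y - 1][x - 1] = math.sqrt(gx * gx + gy * gy)
    return energy


def feature_energy(energy, points):
    """Sum the energy at a list of (x, y) facial feature coordinates."""
    return sum(energy[y][x] for x, y in points)


# Uniform 3x3 test image: interior energy is zero, borders are not.
flat = [[(10, 10, 10)] * 3 for _ in range(3)]
e = energy_map(flat)
```

Because the border is treated as black, pixels on the image edge receive a non-zero energy even in a uniform image, while interior pixels of a uniform region have zero energy.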

Stage III - Fusion via Energy Mapping
Our framework uses the technique of feature level linear weighted fusion. Consider a feature set <Ev, Ea>, where Ev is the total energy of the video features and Ea is the total energy of the audio features. The feature set is computed at intervals of 1 second for the bimodal input.

$$E_a = \sum_{i=1}^{n} w\,E_{ai} \qquad (2)$$

where Ea is the audio energy of the audio sample of 1 s duration, Eai is the audio energy of the i-th sub-sample, w is the weight assigned to each sub-sample, and n = 20 is the number of sub-samples.

$$E_{vi} = w_1 E_{vi}^{LE} + w_2 E_{vi}^{RE} + w_3 E_{vi}^{N} + w_4 E_{vi}^{M} \qquad (3)$$

where Evi is the energy of the i-th frame of the video sub-sample; w1, w2, w3 and w4 are the weights assigned to the left eye, right eye, nose and mouth fiducial points, respectively; and the superscripts LE, RE, N and M denote the energies of the corresponding fiducial points in frame i.

$$E_v = \sum_{i=1}^{n} w\,E_{vi} \qquad (4)$$

where Ev is the combined energy of the video sub-samples, Evi is the energy of the i-th video sub-sample, w is the weight assigned to each sub-sample, and n = 20 is the number of sub-samples.
We further label the feature set with the appropriate class (1 - Happy, 2 - Anger, 3 - Fear) depending on the emotional state of the user. This feature set is then used to train the machine model designed for predicting the mood of the user.
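The linear weighted fusion of Stage III can be sketched as follows. This is a hedged illustration, not the authors' implementation: the text does not give the weight values, so equal region weights (0.25 each) and a uniform sub-sample weight w = 1/n are assumptions, and all function names and energy values are hypothetical.

```python
def weighted_sum(values, weights):
    """Linear weighted combination of equally long value and weight lists."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * v for w, v in zip(weights, values))


def frame_energy(left_eye, right_eye, nose, mouth,
                 region_weights=(0.25, 0.25, 0.25, 0.25)):
    """Energy Evi of one frame from its four facial regions (w1..w4 assumed)."""
    return weighted_sum((left_eye, right_eye, nose, mouth), region_weights)


def fused_feature(audio_energies, frame_energies, w=None):
    """Return the feature set <Ev, Ea> for one second of bimodal input."""
    n = len(audio_energies)
    if n == 0 or n != len(frame_energies):
        raise ValueError("need the same non-zero number of audio and video sub-samples")
    if w is None:
        w = 1.0 / n  # assumed uniform weight per sub-sample
    e_a = weighted_sum(audio_energies, [w] * n)
    e_v = weighted_sum(frame_energies, [w] * n)
    return e_v, e_a


# Hypothetical energies for n = 20 sub-samples of one second.
audio = [2.0] * 20
frames = [frame_energy(1.0, 1.0, 2.0, 4.0)] * 20
e_v, e_a = fused_feature(audio, frames)
labelled = (e_v, e_a, 1)  # class label 1 = Happy
```

Mapping both modalities to a single energy value per second is what makes the two otherwise heterogeneous streams directly comparable in one feature vector.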

Stage IV - Emotion Prediction
The Support Vector Machine (SVM) classifier is used to predict the emotional state of the bimodal input. We use LibSVM [14] to develop the C-SVC (C-Support Vector Classification) SVM with an RBF (radial basis function) kernel, exp(-gamma*|u-v|^2).
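For reference, the RBF kernel exp(-gamma*|u-v|^2) can be evaluated directly, as in the short Python sketch below. The function name `rbf_kernel`, the example feature vectors and the gamma value are illustrative assumptions; in practice the kernel is evaluated inside the SVM library.

```python
import math


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * ||u - v||^2)."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Two hypothetical <Ev, Ea> feature sets.
similarity = rbf_kernel([2.0, 1.5], [2.5, 1.0], gamma=0.5)
```

The kernel equals 1 for identical feature sets and decays towards 0 as they move apart, with gamma controlling how quickly the similarity falls off.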
We further develop a machine model using C-SVC SVM and train it using the feature sets obtained in Stage III. The feature sets of the input to be tested are then labeled with an arbitrary label. These are then tested using the trained machine model. Finally, the predicted emotional state of the bi-modal input is displayed to the user.

IV. RESULTS AND ACCURACY CALCULATION
The energy based bimodal data fusion model was tested on the eNTERFACE [15] database with three discrete emotions: happy, anger and fear. The specifications of the database are as follows:
- 648 samples
- 43 subjects enacting 5 sentences for each of 3 emotions (happy, anger and fear)
- Samples from both male and female subjects
- Frontal views with moderate lighting conditions
- Single person input
An 80:20 ratio of training to testing samples is used for cross validation. The total number of samples for each emotion is 215. The process was repeated three times with different training and testing sets in the ratio 80:20. On average, in each round 485 samples are classified correctly on the basis of 518 samples. The average classification percentage for the three emotions is shown in Table 1. The model shows 93.06% accuracy for recognition of the happy, anger and fear emotions using the energy mapping model.
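The repeated 80:20 partitioning can be sketched in Python as below. This is an assumed procedure for illustration only: the text does not say how the partitions were drawn, so random shuffling with a fixed seed, truncation of the training size to a whole number, and the function name `holdout_rounds` are all assumptions.

```python
import random


def holdout_rounds(n_samples, train_fraction=0.8, rounds=3, seed=0):
    """Return `rounds` random (train_indices, test_indices) partitions."""
    rng = random.Random(seed)
    n_train = int(n_samples * train_fraction)  # truncation is an assumption
    splits = []
    for _ in range(rounds):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


# Three rounds over the 648 samples of the database.
splits = holdout_rounds(648)
train_size = len(splits[0][0])
test_size = len(splits[0][1])
```

With 648 samples, an 80% share truncated to a whole number gives 518 samples, leaving 130 for the other partition in each round.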

V. CONCLUSION
In this research study, we have developed a tool to analyze and correlate bimodal data sets of emotional cues using an energy based fusion model, and to recognize the emotional component from these bimodal data sets using a Support Vector Machine classifier. We have mapped the audio and video features of the bimodal input to their corresponding energy levels. The model was tested on the eNTERFACE 2005 database, and an accuracy of 93.06% was obtained for recognition of the happy, anger and fear emotions. The tool developed for the bimodal energy based fusion model can further be used as a wrapper tool to develop intelligent applications which require multimodal data fusion and emotion recognition, such as real-time emotion recognition, expressive embodied conversational agents, virtual tutors, and questionnaires which analyse verbal and nonverbal behavior.