A Robust Framework for Speech Emotion Recognition Using Attention Based Convolutional Peephole LSTM.

Authors

  • Ramya Paramasivam, Mahendra Engineering College Autonomous (India).
  • K. Lavanya, Velammal Engineering College (India).
  • Parameshachari Bidare Divakarachari, Nitte Meenakshi Institute of Technology (India).
  • David Camacho, Universidad Politécnica de Madrid (Spain).

DOI:

https://doi.org/10.9781/ijimai.2025.02.002

Keywords:

Attention Mechanisms, Convolutional Peephole Long Short-Term Memory, Feature Selection, Improved Jellyfish Optimization Algorithm, Speech Emotion Recognition
Supporting Agencies

This work has been supported by the project PCI2022-134990-2 (MARTINI) of the CHIST-ERA IV Cofund 2021 program; by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR under the XAI-Disinfodemics grant (PLEC2021-007681); by the European Commission under IBERIFIER Plus - Iberian Digital Media Observatory (DIGITAL-2023-DEPLOY-04-EDMO-HUBS 101158511); and by EMIF, managed by the Calouste Gulbenkian Foundation, in the project MuseAI.

Abstract

Speech Emotion Recognition (SER) plays an important role in affective computing and is widely utilized in various applications related to medicine, entertainment and so on. Emotional understanding improves user-machine interaction by making it more responsive. The issues faced during SER are the difficulty of identifying relevant features and the increased complexity of analyzing huge datasets. Therefore, this research introduces a well-organized framework that applies the Improved Jellyfish Optimization Algorithm (IJOA) for feature selection and performs classification using Convolutional Peephole Long Short-Term Memory (CP-LSTM) with an attention mechanism. Raw data acquisition takes place using five datasets, namely EMO-DB, IEMOCAP, RAVDESS, Surrey Audio-Visual Expressed Emotion (SAVEE) and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). Undesired partitions are removed from the audio signal during pre-processing, after which features are extracted and the most relevant ones are selected using IJOA. In the final stage, emotion classification is performed by the CP-LSTM with attention mechanism. Experimental outcomes clearly show that the proposed CP-LSTM with attention mechanism is more efficient than the existing DNN-DHO, DH-AS, D-CNN and CEOAS methods in terms of accuracy. The classification accuracies of the proposed CP-LSTM with attention mechanism for the EMO-DB, IEMOCAP, RAVDESS and SAVEE datasets are 99.59%, 99.88%, 99.54% and 98.89%, respectively, which are considerably higher than those of other existing techniques.
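
The classification stage described in the abstract couples convolutional feature learning with a peephole LSTM and an attention layer. Since the abstract does not give the exact architecture, the following PyTorch snippet is only a minimal sketch of such an attention-based convolutional recurrent classifier; the class name, layer sizes, number of emotion classes and the use of a standard (non-peephole) LSTM cell are assumptions made purely for illustration.

import torch
import torch.nn as nn

class ConvPeepholeLSTMAttention(nn.Module):
    """Sketch of an attention-based convolutional recurrent SER classifier."""
    def __init__(self, n_features=40, n_classes=7, hidden=128):
        super().__init__()
        # 1-D convolution over time applied to the selected acoustic features
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # A standard LSTM stands in here; the paper's peephole connections
        # would require a custom recurrent cell.
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        # Additive attention: one weight per time step of the LSTM output
        self.attn = nn.Linear(hidden, 1)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                                   # x: (batch, frames, n_features)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)    # (batch, frames/2, 64)
        h, _ = self.lstm(z)                                  # (batch, frames/2, hidden)
        w = torch.softmax(self.attn(h), dim=1)               # attention weights over time
        context = (w * h).sum(dim=1)                         # attention-weighted summary
        return self.fc(context)                              # emotion class logits

# Example: a batch of 8 utterances, each with 200 frames of 40 selected features
model = ConvPeepholeLSTMAttention()
logits = model(torch.randn(8, 200, 40))                      # -> shape (8, 7)

In this sketch the input would be frame-level acoustic features (for example, MFCC frames) computed from the silence-trimmed signals and reduced to the IJOA-selected subset before entering the network.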

References

A. A. Abdelhamid, E. S. M. El-Kenawy, B. Alotaibi, G. M. Amer, M. Y. Abdelkader, A. Ibrahim, and M. M. Eid, “Robust speech emotion recognition using CNN+LSTM based on stochastic fractal search optimization algorithm,” IEEE Access, vol. 10, pp. 49265-49284, 2022, https://doi.org/10.1109/ACCESS.2022.3172954.

Z. Chen, J. Li, H. Liu, X. Wang, H. Wang, and Q. Zheng, “Learning multi-scale features for speech emotion recognition with connection attention mechanism,” Expert Systems with Applications, vol. 214, p. 118943, 2023, https://doi.org/10.1016/j.eswa.2022.118943.

A. Aggarwal, A. Srivastava, A. Agarwal, N. Chahal, D. Singh, A. A. Alnuaim, and H. N. Lee, “Two-way feature extraction for speech emotion recognition using deep learning,” Sensors, vol. 22, no. 6, p. 2378, 2022, https://doi.org/10.3390/s22062378.

S. Kumar, M. A. Haq, A. Jain, C. A. Jason, N. R. Moparthi, N. Mittal, and Z. S. Alzamil, “Multilayer Neural Network Based Speech Emotion Recognition for Smart Assistance,” Computers, Materials & Continua, vol. 75, no. 1, 2023, https://doi.org/10.32604/cmc.2023.028631.

S. Jothimani and K. Premalatha, “MFF-SAug: Multi feature fusion with spectrogram augmentation of speech emotion recognition using convolution neural network,” Chaos, Solitons & Fractals, vol. 162, p. 112512, 2022, https://doi.org/10.1016/j.chaos.2022.112512.

D. Yang, S. Huang, Y. Liu, and L. Zhang, “Contextual and cross-modal interaction for multi-modal speech emotion recognition,” IEEE Signal Processing Letters, vol. 29, pp. 2093-2097, 2022, https://doi.org/10.1109/LSP.2022.3210836.

J. Lei, X. Zhu, and Y. Wang, “BAT: Block and token self-attention for speech emotion recognition,” Neural Networks, vol. 156, pp. 67-80, 2022, https://doi.org/10.1016/j.neunet.2022.09.022.

B. Maji and M. Swain, “Advanced fusion-based speech emotion recognition system using a dual-attention mechanism with Conv-Caps and Bi-GRU features,” Electronics, vol. 11, no. 9, p. 1328, 2022, https://doi.org/10.3390/electronics11091328.

Y. Liu, H. Sun, W. Guan, Y. Xia, and Z. Zhao, “Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework,” Speech Communication, vol. 139, pp. 1-9, 2022, https://doi.org/10.1016/j.specom.2022.02.006.

S. Latif, R. Rana, S. Khalifa, R. Jurdak, and B. W. Schuller, “Self-supervised adversarial domain adaptation for cross-corpus and cross-language speech emotion recognition,” IEEE Transactions on Affective Computing, vol. 14, no. 3, pp. 1912-1926, 2022, https://doi.org/10.1109/TAFFC.2022.3167013.

S. Kakuba, A. Poulose, and D. S. Han, “Attention-based multi-learning approach for speech emotion recognition with dilated convolution,” IEEE Access, vol. 10, pp. 122302-122313, 2022, https://doi.org/10.1109/ACCESS.2022.3223705.

P. R. Prakash, D. Anuradha, J. Iqbal, M. G. Galety, R. Singh, and S. Neelakandan, “A novel convolutional neural network with gated recurrent unit for automated speech emotion recognition and classification,” Journal of Control and Decision, vol. 10, no. 1, pp. 54-63, 2023, https://doi.org/10.1080/23307706.2022.2085198.

R. K. Das, N. Islam, M. R. Ahmed, S. Islam, S. Shatabda, and A. M. Islam, “BanglaSER: A speech emotion recognition dataset for the Bangla language,” Data in Brief, vol. 42, p. 108091, 2022, https://doi.org/10.1016/j.dib.2022.108091.

B. B. Al-onazi, M. A. Nauman, R. Jahangir, M. M. Malik, E. H. Alkhammash, and A. M. Elshewey, “Transformer-based multilingual speech emotion recognition using data augmentation and feature fusion,” Applied Sciences, vol. 12, no. 18, p. 9188, 2022, https://doi.org/10.3390/app12189188.

J. Li, X. Zhang, L. Huang, F. Li, S. Duan, and Y. Sun, “Speech Emotion Recognition Using a Dual-Channel Complementary Spectrogram and the CNN-SSAE Neutral Network,” Applied Sciences, vol. 12, no. 19, p. 9518, 2022, https://doi.org/10.3390/app12199518.

J. Ancilin and A. Milton, “Improved speech emotion recognition with Mel frequency magnitude coefficient,” Applied Acoustics, vol. 179, p. 108046, 2021, https://doi.org/10.1016/j.apacoust.2021.108046.

H. A. Abdulmohsin, “A new proposed statistical feature extraction method in speech emotion recognition,” Computers & Electrical Engineering, vol. 93, p. 107172, 2021, https://doi.org/10.1016/j.compeleceng.2021.107172.

C. Zhang and L. Xue, “Autoencoder with emotion embedding for speech emotion recognition,” IEEE Access, vol. 9, pp. 51231-51241, 2021, https://doi.org/10.1109/ACCESS.2021.3069818.

P. T. Krishnan, A. N. Joseph Raj, and V. Rajangam, “Emotion classification from speech signal based on empirical mode decomposition and non-linear features: Speech emotion recognition,” Complex & Intelligent Systems, vol. 7, pp. 1919-1934, 2021, https://doi.org/10.1007/s40747-021-00295-z.

K. S. Chintalapudi, I. A. K. Patan, H. V. Sontineni, V. S. K. Muvvala, S. V. Gangashetty, and A. K. Dubey, “Speech Emotion Recognition Using Deep Learning,” in 2023 International Conference on Computer Communication and Informatics (ICCCI), IEEE, Coimbatore, India, 2023, pp. 1-5, https://doi.org/10.1109/ICCCI56745.2023.10128612.

S. Yildirim, Y. Kaya, and F. Kılıç, “A modified feature selection method based on metaheuristic algorithms for speech emotion recognition,” Applied Acoustics, vol. 173, p. 107721, 2021, https://doi.org/10.1016/j.apacoust.2020.107721.

G. Agarwal and H. Om, “Performance of deer hunting optimization based deep learning algorithm for speech emotion recognition,” Multimedia Tools and Applications, vol. 80, pp. 9961-9992, 2021, https://doi.org/10.1007/s11042-020-10118-x.

K. Manohar and E. Logashanmugam, “Hybrid deep learning with optimal feature selection for speech emotion recognition using improved meta-heuristic algorithm,” Knowledge-Based Systems, vol. 246, p. 108659, 2022, https://doi.org/10.1016/j.knosys.2022.108659.

Mustaqeem and S. Kwon, “Optimal feature selection based speech emotion recognition using two‐stream deep convolutional neural network,” International Journal of Intelligent Systems, vol. 36, no. 9, pp. 5116-5135, 2021, https://doi.org/10.1002/int.22505.

S. Chattopadhyay, A. Dey, P. K. Singh, A. Ahmadian, and R. Sarkar, “A feature selection model for speech emotion recognition using clustering-based population generation with hybrid of equilibrium optimizer and atom search optimization algorithm,” Multimedia Tools and Applications, vol. 82, no. 7, pp. 9693-9726, 2023, https://doi.org/10.1007/s11042-021-11839-3.

S. Kanwal and S. Asghar, “Speech emotion recognition using clustering based GA-optimized feature set,” IEEE Access, vol. 9, pp. 125830-125842, 2021, https://doi.org/10.1109/ACCESS.2021.3111659.

N. Barsainyan and D. K. Singh, “Optimized cross-corpus speech emotion recognition framework based on normalized 1D convolutional neural network with data augmentation and feature selection,” International Journal of Speech Technology, vol. 26, no. 4, pp. 947-961, 2023, https://doi.org/10.1007/s10772-023-10063-8.

C. Sun, H. Li, and L. Ma, “Speech emotion recognition based on improved masking EMD and convolutional recurrent neural network,” Frontiers in Psychology, vol. 13, p. 1075624, 2023, https://doi.org/10.3389/fpsyg.2022.1075624.

L. T. C. Ottoni, A. L. C. Ottoni, and J. D. J. F. Cerqueira, “A Deep Learning Approach for Speech Emotion Recognition Optimization Using Meta-Learning,” Electronics, vol. 12, no. 23, p. 4859, 2023, https://doi.org/10.3390/electronics12234859.

N. T. Pham, D. N. M. Dang, N. D. Nguyen, T. T. Nguyen, H. Nguyen, B. Manavalan, V. P. Lim, and S. D. Nguyen, “Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition,” Expert Systems with Applications, vol. 230, p. 120608, 2023.

M. Khan, W. Gueaieb, A. El Saddik, and S. Kwon, “MSER: Multimodal speech emotion recognition using cross-attention with deep fusion,” Expert Systems with Applications, vol. 245, p. 122946, 2024.

F. A. Dal Rì, F. C. Ciardi, and N. Conci, “Speech Emotion Recognition and Deep Learning: An Extensive Validation using Convolutional Neural Networks,” IEEE Access, vol. 11, pp. 116638-116649, 2023.

Y. Du, R. G. Crespo, and O. S. Martínez, “Human emotion recognition for enhanced performance evaluation in e-learning,” Progress in Artificial Intelligence, vol. 12, no. 2, pp. 199-211, 2023.

S. Debnath, P. Roy, S. Namasudra, and R. G. Crespo, “Audio-Visual Automatic Speech Recognition Towards Education for Disabilities,” Journal of Autism and Developmental Disorders, vol. 53, no. 9, pp. 3581-3594, 2023.

L. Pipiras, R. Maskeliūnas, and R. Damaševičius, “Lithuanian speech recognition using purely phonetic deep learning,” Computers, vol. 8, no. 4, p. 76, 2019.

A. Lauraitis, R. Maskeliūnas, R. Damaševičius, and T. Krilavičius, “Detection of speech impairments using cepstrum, auditory spectrogram and wavelet time scattering domain features,” IEEE Access, vol. 8, pp. 96162-96172, 2020.

Link for EMO-DB dataset: https://www.kaggle.com/datasets/piyushagni5/berlin-database-of-emotional-speech-emodb

Link for RAVDESS dataset: https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio

Link for IEMOCAP dataset: https://sail.usc.edu/iemocap/

Link for SAVEE dataset: http://kahlan.eps.surrey.ac.uk/savee/

Link for CREMA-D dataset: https://github.com/CheyneyComputerScience/CREMA-D

M. M. Rahman and F. H. Siddiqui, “Multi‐layered attentional peephole convolutional LSTM for abstractive text summarization,” ETRI Journal, vol. 43, no. 2, pp. 288-298, 2021.

Published

2025-08-29

How to Cite

Paramasivam, R., Lavanya, K., Bidare Divakarachari, P., and Camacho, D. (2025). A Robust Framework for Speech Emotion Recognition Using Attention Based Convolutional Peephole LSTM. International Journal of Interactive Multimedia and Artificial Intelligence, 9(4), 45–58. https://doi.org/10.9781/ijimai.2025.02.002