Speech Emotion Recognition Using Convolutional Neural Network and Data Augmentation Technique

Article type: Research Article

Authors

  • Masoume Shafieian 1
  • Vahid Ahmadian 2
  • Majid Behdad 3

1 Department of Technology and Media Engineering, IRIB University, Tehran, Iran
2 M.Sc., IRIB University
3 Assistant Professor, IRIB University

Abstract

The purpose of speech emotion recognition (SER) systems is to create an emotional connection between humans and machines, since recognizing human emotions and intentions helps improve human-machine interaction. Recognizing emotion from speech has been a challenge for researchers over the past decade, but with advances in artificial intelligence these challenges have faded.

In this study, we take steps to improve the performance of such systems using deep learning methods. In the first stage, three-dimensional convolutional neural networks (3D CNNs) are used to learn the spectral-temporal features of speech. In the second stage, to strengthen the proposed model, we use a new pyramidal concatenated 3D CNN, a multi-scale architecture that applies 3D convolutions to the input at several scales. Finally, to learn the spectral-temporal features extracted by the pyramidal concatenated 3D CNN while fully accounting for the spatial and temporal relationships in the data, we use a temporal capsule network. We call the resulting structure, a powerful model for spectral-temporal features, MSID 3DCNN + Temporal Capsule. A simplified sketch of these two building blocks follows.
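To make the two components concrete, here is a minimal PyTorch sketch, not the authors' implementation: a multi-scale, concatenated 3D-CNN front end and a simplified capsule layer with dynamic routing. The channel counts, kernel scales (3, 5, 7), the routing variant, and the grouping of the fused feature map into primary capsules are all assumptions, and the temporal modeling of the paper's temporal capsule network is reduced here to plain routing.

```python
import torch
import torch.nn as nn


class MultiScale3DCNN(nn.Module):
    """Parallel 3-D convolution branches at several kernel scales,
    concatenated along channels: the multi-scale 'pyramidal' idea."""

    def __init__(self, in_channels: int = 1):
        super().__init__()

        def branch(k: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv3d(in_channels, 16, kernel_size=k, padding=k // 2),
                nn.BatchNorm3d(16),
                nn.ReLU(),
                nn.MaxPool3d(2),
            )

        self.branches = nn.ModuleList(branch(k) for k in (3, 5, 7))
        self.fuse = nn.Conv3d(3 * 16, 64, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, mels, frames, depth) spectro-temporal cubes.
        feats = [b(x) for b in self.branches]          # identical shapes
        return self.fuse(torch.cat(feats, dim=1))


class CapsuleHead(nn.Module):
    """Simplified capsule output layer with dynamic routing (after
    Sabour et al., 2017); one output capsule per emotion class."""

    def __init__(self, in_caps: int, in_dim: int,
                 num_classes: int = 6, out_dim: int = 16, iters: int = 3):
        super().__init__()
        self.W = nn.Parameter(
            0.01 * torch.randn(num_classes, in_caps, out_dim, in_dim))
        self.iters = iters

    @staticmethod
    def squash(s: torch.Tensor) -> torch.Tensor:
        n2 = (s ** 2).sum(dim=-1, keepdim=True)
        return (n2 / (1 + n2)) * s / (n2.sqrt() + 1e-8)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, in_caps, in_dim) primary capsules, e.g. the fused
        # 3D-CNN feature map regrouped along its channel axis.
        u_hat = torch.einsum('kjoi,bji->bkjo', self.W, u)
        b = torch.zeros(u.size(0), *self.W.shape[:2], device=u.device)
        for _ in range(self.iters):
            c = b.softmax(dim=1)                       # routing coefficients
            v = self.squash((c.unsqueeze(-1) * u_hat).sum(dim=2))
            b = b + (u_hat * v.unsqueeze(2)).sum(dim=-1)
        return v.norm(dim=-1)                          # capsule lengths as class scores
```

A plausible pipeline would flatten MultiScale3DCNN's fused output into (batch, in_caps, in_dim) groups before CapsuleHead, with in_caps and in_dim chosen to match the fused map; that glue step is omitted above.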

The final model was applied to a combination of the speech and song subsets of RAVDESS, a multimodal database. Comparing the results of the proposed model with conventional models shows the better performance of our approach: the proposed SER model achieved an accuracy of 81.77% for six emotional classes, separated by gender.
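The title also mentions a data augmentation technique, which the abstract does not specify. As a hypothetical stand-in, the sketch below builds log-Mel spectro-temporal cubes for the 3-D CNN and applies two augmentations commonly used for SER on RAVDESS, additive noise and pitch shifting; the file path, Mel parameters, and cube sizes are illustrative only (librosa assumed).

```python
import numpy as np
import librosa


def log_mel(y: np.ndarray, sr: int, n_mels: int = 64) -> np.ndarray:
    """Log-scaled Mel spectrogram, shape (n_mels, frames)."""
    m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(m, ref=np.max)


def augment(y: np.ndarray, sr: int):
    """Yield the clean signal plus two simple augmented variants
    (placeholders; the paper's actual augmentation is not given here)."""
    yield y
    yield y + 0.005 * np.random.randn(len(y))                 # additive noise
    yield librosa.effects.pitch_shift(y=y, sr=sr, n_steps=2)  # pitch shift


def to_cubes(spec: np.ndarray, depth: int = 16, hop: int = 8) -> np.ndarray:
    """Stack overlapping frame windows into (n_cubes, n_mels, depth)
    blocks so a 3-D CNN can convolve over frequency, time, and context."""
    steps = range(0, spec.shape[1] - depth, hop)
    return np.stack([spec[:, i:i + depth] for i in steps])


# Illustrative RAVDESS file name (modality-channel-emotion-... coding).
y, sr = librosa.load("RAVDESS/Actor_01/03-01-03-01-01-01-01.wav", sr=None)
cubes = [to_cubes(log_mel(v, sr)) for v in augment(y, sr)]
```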

Keywords

  • Speech emotion recognition
  • Three-dimensional convolutional neural network
  • Temporal capsule
  • RAVDESS