Speech Emotion Recognition Using Convolutional Neural Network and Data Augmentation Technique

Article type: Research Article

Authors

  • Masoume Shafieian 1
  • Vahid Ahmadian 2
  • Majid Behdad 3

1 Department of Technology and Media Engineering, IRIB University, Tehran, Iran
2 M.Sc., IRIB University
3 Assistant Professor, IRIB University

Abstract

The purpose of speech emotion recognition (SER) systems is to create an emotional connection between humans and machines, since recognizing human emotions and intentions helps improve human-machine interaction. Recognizing emotion from speech has been a challenge for researchers over the past decade, but with advances in artificial intelligence these challenges have faded.

In this study, we take steps to improve the performance of such systems using deep learning methods. In the first stage, three-dimensional convolutional neural networks (3D CNNs) are used to learn the spectral-temporal features of speech. In the second stage, to strengthen the proposed model, we use a new pyramidal concatenated 3D CNN, a multi-scale architecture that applies 3D convolutions to the input at several scales. Finally, to learn the spectral-temporal features extracted by the pyramidal concatenated 3D CNN while fully accounting for the spatial and temporal relationships in the data, we use a temporal capsule network. We call the resulting structure, a powerful model for spectral-temporal features, MSID 3DCNN + Temporal Capsule. A simplified sketch of these two building blocks follows.
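To make the two components concrete, here is a minimal PyTorch sketch, not the authors' implementation: a multi-scale, concatenated 3D-CNN front end and a simplified capsule layer with dynamic routing. The channel counts, kernel scales (3, 5, 7), the routing variant, and the grouping of the fused feature map into primary capsules are all assumptions, and the temporal modeling of the paper's temporal capsule network is reduced here to plain routing.

```python
import torch
import torch.nn as nn


class MultiScale3DCNN(nn.Module):
    """Parallel 3-D convolution branches at several kernel scales,
    concatenated along channels: the multi-scale 'pyramidal' idea."""

    def __init__(self, in_channels: int = 1):
        super().__init__()

        def branch(k: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv3d(in_channels, 16, kernel_size=k, padding=k // 2),
                nn.BatchNorm3d(16),
                nn.ReLU(),
                nn.MaxPool3d(2),
            )

        self.branches = nn.ModuleList(branch(k) for k in (3, 5, 7))
        self.fuse = nn.Conv3d(3 * 16, 64, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, mels, frames, depth) spectro-temporal cubes.
        feats = [b(x) for b in self.branches]          # identical shapes
        return self.fuse(torch.cat(feats, dim=1))


class CapsuleHead(nn.Module):
    """Simplified capsule output layer with dynamic routing (after
    Sabour et al., 2017); one output capsule per emotion class."""

    def __init__(self, in_caps: int, in_dim: int,
                 num_classes: int = 6, out_dim: int = 16, iters: int = 3):
        super().__init__()
        self.W = nn.Parameter(
            0.01 * torch.randn(num_classes, in_caps, out_dim, in_dim))
        self.iters = iters

    @staticmethod
    def squash(s: torch.Tensor) -> torch.Tensor:
        n2 = (s ** 2).sum(dim=-1, keepdim=True)
        return (n2 / (1 + n2)) * s / (n2.sqrt() + 1e-8)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, in_caps, in_dim) primary capsules, e.g. the fused
        # 3D-CNN feature map regrouped along its channel axis.
        u_hat = torch.einsum('kjoi,bji->bkjo', self.W, u)
        b = torch.zeros(u.size(0), *self.W.shape[:2], device=u.device)
        for _ in range(self.iters):
            c = b.softmax(dim=1)                       # routing coefficients
            v = self.squash((c.unsqueeze(-1) * u_hat).sum(dim=2))
            b = b + (u_hat * v.unsqueeze(2)).sum(dim=-1)
        return v.norm(dim=-1)                          # capsule lengths as class scores
```

A plausible pipeline would flatten MultiScale3DCNN's fused output into (batch, in_caps, in_dim) groups before CapsuleHead, with in_caps and in_dim chosen to match the fused map; that glue step is omitted above.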

The final model was applied to a combination of the speech and song subsets of RAVDESS, a multimodal database. Comparing the results of the proposed model with conventional models shows the better performance of our approach: the proposed SER model achieved an accuracy of 81.77% for six emotional classes, separated by gender.
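The title also mentions a data augmentation technique, which the abstract does not specify. As a hypothetical stand-in, the sketch below builds log-Mel spectro-temporal cubes for the 3-D CNN and applies two augmentations commonly used for SER on RAVDESS, additive noise and pitch shifting; the file path, Mel parameters, and cube sizes are illustrative only (librosa assumed).

```python
import numpy as np
import librosa


def log_mel(y: np.ndarray, sr: int, n_mels: int = 64) -> np.ndarray:
    """Log-scaled Mel spectrogram, shape (n_mels, frames)."""
    m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(m, ref=np.max)


def augment(y: np.ndarray, sr: int):
    """Yield the clean signal plus two simple augmented variants
    (placeholders; the paper's actual augmentation is not given here)."""
    yield y
    yield y + 0.005 * np.random.randn(len(y))                 # additive noise
    yield librosa.effects.pitch_shift(y=y, sr=sr, n_steps=2)  # pitch shift


def to_cubes(spec: np.ndarray, depth: int = 16, hop: int = 8) -> np.ndarray:
    """Stack overlapping frame windows into (n_cubes, n_mels, depth)
    blocks so a 3-D CNN can convolve over frequency, time, and context."""
    steps = range(0, spec.shape[1] - depth, hop)
    return np.stack([spec[:, i:i + depth] for i in steps])


# Illustrative RAVDESS file name (modality-channel-emotion-... coding).
y, sr = librosa.load("RAVDESS/Actor_01/03-01-03-01-01-01-01.wav", sr=None)
cubes = [to_cubes(log_mel(v, sr)) for v in augment(y, sr)]
```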

Keywords

  • Speech emotion recognition
  • Three-dimensional convolutional neural network
  • Temporal capsule
  • RAVDESS