Journal of Vibration and Sound

Improving the Quality of Bandwidth Extended Speech Signal by Increasing Noise Robustness, using Convolutional Neural Networks

Document Type: Research Article

Authors
1 Department of Technology and Media Engineering, IRIB University, Tehran, Iran
2 Audio Engineering Department, Faculty of Technology and Media Engineering, IRIB University, Tehran, Iran
Abstract
Bandwidth extension, or bandwidth expansion (BWE), aims to increase the frequency range of an input audio signal. Because widening the bandwidth improves the perceived quality and clarity of speech, the task is also called speech super-resolution (SSR).

Neural networks have become a common approach to speech bandwidth extension in recent years. One such architecture is the convolutional neural network (CNN), which is also widely used in speech processing, image processing, and data science. In the proposed method, additive noise is first applied to the DAPS dataset as a pre-processing step to degrade the signal. The noisy audio is then denoised with a wavelet-transform noise reduction method (a wavelet denoising algorithm) and passed to the CNN, which uses the Huber loss function to produce the extended wideband signal. Finally, the proposed model is evaluated using the SNR, LSD, PESQ, and STOI criteria and compared with several similar audio bandwidth extension methods. With external noise present, the network's results on all criteria came close to those of the other networks; without noise, it achieved better results in both the evaluation criteria and learning speed.
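The pre-processing and loss steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the wavelet family, the thresholding rule, the decomposition depth, or the Huber parameter, so the Haar wavelet, the universal soft threshold, four decomposition levels, and `delta=1.0` below are all assumptions. A synthetic sine wave with Gaussian noise stands in for DAPS audio, and the CNN itself is omitted.

```python
import math
import random

def haar_dwt(x):
    """One level of the orthonormal Haar transform (len(x) must be even)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t, zeroing the small ones."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def wavelet_denoise(x, levels=4):
    """Decompose, soft-threshold the detail coefficients, reconstruct.

    Assumed choices: Haar wavelet and the universal threshold
    sigma * sqrt(2 * ln(N)), with sigma estimated from the finest details.
    len(x) must be divisible by 2 ** levels.
    """
    approx, details = list(x), []
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)
    finest = sorted(abs(c) for c in details[0])
    sigma = finest[len(finest) // 2] / 0.6745  # robust noise estimate
    t = sigma * math.sqrt(2.0 * math.log(len(x)))
    for d in reversed(details):
        approx = haar_idwt(approx, soft_threshold(d, t))
    return approx

def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, y in zip(pred, target):
        r = abs(p - y)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)

def snr_db(reference, estimate):
    """Signal-to-noise ratio of estimate against reference, in dB."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / noise)

# Synthetic stand-in for a clean recording, plus additive Gaussian noise.
random.seed(0)
n = 1024
clean = [math.sin(2 * math.pi * 4 * i / n) for i in range(n)]
noisy = [c + random.gauss(0.0, 0.3) for c in clean]
denoised = wavelet_denoise(noisy, levels=4)

print("SNR before denoising (dB):", snr_db(clean, noisy))
print("SNR after denoising (dB):", snr_db(clean, denoised))
print("Huber loss after denoising:", huber_loss(denoised, clean))
```

The Huber loss is quadratic for residuals below `delta` and linear above it, which makes it less sensitive to outliers than the squared error. That property is a plausible reason for choosing it when the input may still contain residual noise after denoising.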
