Journal of Vibration and Sound

Exploring Parametric Filters in Deep Learning Architectures for Speech Processing Applications: A Review

Authors
1 Shahid Beheshti University, Tehran, Iran
2 Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Abstract
In traditional speech processing, feature extraction and classification were conducted as separate steps. The advent of deep neural networks has enabled methods that simultaneously model the relationship between acoustic and phonetic characteristics of speech while classifying it directly from the raw waveform. The first convolutional layer in these networks acts as a filter bank. To enhance interpretability and reduce the number of parameters, researchers have explored the use of parametric filters, with the SincNet architecture being a notable advancement. In SincNet's initial convolutional layer, rectangular bandpass filters are learned instead of fully trainable filters. This approach allows for modeling with fewer parameters, thereby improving the network's convergence speed and accuracy. Analyzing the learned filter bank also provides valuable insights into the model's performance. The reduction in parameters, along with increased accuracy and interpretability, has led to the adoption of various parametric filters and deep architectures across diverse speech processing applications. This paper introduces different types of parametric filters and discusses their integration into various deep architectures. Additionally, it examines the specific applications in speech processing where these filters have proven effective.
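The parameter saving described above can be illustrated with a minimal sketch. It assumes each filter is a windowed band-pass kernel fully determined by two cutoff frequencies, formed as the difference of two low-pass sinc responses. The function names, the Hamming window, the 16 kHz sample rate, and the layer sizes (80 filters of 251 taps) are illustrative assumptions, not details stated in this abstract.

```python
import cmath
import math


def sinc_bandpass_kernel(f_low_hz, f_high_hz, kernel_size=251, sample_rate=16000):
    """FIR taps of a band-pass filter defined only by its two cutoffs (hypothetical helper)."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the kernel is symmetric")
    f1 = f_low_hz / sample_rate   # normalised cutoffs in cycles per sample
    f2 = f_high_hz / sample_rate
    half = kernel_size // 2
    taps = []
    for i in range(kernel_size):
        n = i - half
        if n == 0:
            ideal = 2.0 * (f2 - f1)  # limit of the expression below as n -> 0
        else:
            # Difference of two ideal low-pass responses gives a rectangular band-pass.
            ideal = (math.sin(2 * math.pi * f2 * n)
                     - math.sin(2 * math.pi * f1 * n)) / (math.pi * n)
        # Hamming window (assumed here) smooths the truncation of the ideal response.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        taps.append(ideal * window)
    return taps


def gain_at(taps, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at freq_hz."""
    omega = 2 * math.pi * freq_hz / sample_rate
    return abs(sum(t * cmath.exp(-1j * omega * i) for i, t in enumerate(taps)))


# Trainable parameters in the first layer under the assumed sizes.
num_filters, kernel_size = 80, 251
standard_conv_params = num_filters * kernel_size  # every tap is learned
parametric_params = num_filters * 2               # only two cutoffs per filter

taps = sinc_bandpass_kernel(300.0, 3400.0, kernel_size)
in_band_gain = gain_at(taps, 1000.0)       # inside the 300-3400 Hz pass band
out_of_band_gain = gain_at(taps, 6000.0)   # well outside the pass band
```

In a trainable version the two cutoffs would be the only learned quantities per filter, and the taps would be recomputed from them at every forward pass. Reading those cutoffs back after training is also what makes the learned filter bank straightforward to inspect.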