A Review of Recent Speech Recognition Methods

Article Type: Review Article

Authors

  • Hossein Hadian 1
  • Soroush Gooran 1
  • Sadra Sabouri 1
  • Sara Sadeghi 1
  • Yasin Amini 1
  • Hossein Sameti 2

1 Department of Computer Engineering, Sharif University of Technology
2 Associate Professor, Department of Computer Engineering, Sharif University of Technology

Abstract

This article reviews both traditional and modern speech recognition methods. Speech recognition has a history spanning several decades, beginning with methods based on signal processing and dynamic time warping. Statistical methods gained attention from the 1980s onward, and among them, methods based on hidden Markov models came to be regarded as the leading approach. From the 2000s, however, statistical methods gradually gave way to neural network models, and with the advent of deep neural networks, these models achieved better results than hidden Markov models. Deep neural network models themselves continued to evolve, and many architectural variants were devised; they were in turn succeeded by transformer-based and pre-trained models, which reached higher accuracies. In this article, after an overview of methods based on hidden Markov models, we discuss methods based on deep neural networks and their various architectures; finally, we describe methods based on pre-trained models and survey the latest work of this kind. We conclude by presenting the results of the reviewed methods in terms of word error rate and comparing them.
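Since the comparisons at the end of the article are reported in terms of word error rate (WER), a minimal sketch of the metric may be helpful: WER is the minimum edit (Levenshtein) distance between the reference and hypothesis word sequences, divided by the number of reference words, i.e. (substitutions + deletions + insertions) / N. The Python function below is an illustrative implementation written for this overview, not the scoring tool used by any of the surveyed systems.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] holds the minimum edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to turn ref[:i] into an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions to turn an empty reference into hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words: WER = 1/6 ≈ 0.167
print(wer("the cat sat on the mat", "the cat sat on mat"))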

Keywords

  • Speech Recognition
  • Hidden Markov Model
  • Deep Neural Networks
  • Transformers
  • Pre-trained Models