[1] X. Huang, A. Acero, and H.-W. Hon (foreword by R. Reddy), Spoken language processing: A guide to theory, algorithm, and system development. Prentice Hall PTR, 2001.
[2] “How the human auditory system works,” 2020. [Online]. Available: https://www.khouzeyannews.ir/blog/how-the-human-auditory-system-works.
[3] M. Gales and S. Young, “The application of hidden Markov models in speech recognition,” Found. Trends Signal Process., 2008, Vol.1, no.3, pp.195–304.
[4] R. Haeb-Umbach and H. Ney, “Linear discriminant analysis for improved large vocabulary continuous speech recognition,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol.1, 1992, pp.13–16.
[5] C. D. Manning and H. Schütze, Foundations of statistical natural language processing. MIT Press, 1999.
[6] M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted finite-state transducers,” in Springer Handbook of Speech Processing, Springer, 2008, pp.559–584.
[7] M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state transducers in speech recognition,” Comput. Speech Lang., 2002, Vol.16, no.1, pp.69–88.
[8] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
[9] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, 1989, Vol.77, no.2, pp.257–286.
[10] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, A. Rastrow et al., “The subspace Gaussian mixture model—A structured model for speech recognition,” Comput. Speech Lang., 2011, Vol.25, no.2, pp.404–439.
[11] B.-H. Juang, S. Levinson, and M. Sondhi, “Maximum likelihood estimation for multivariate mixture observations of Markov chains (corresp.),” IEEE Trans. Inf. Theory, 1986, Vol.32, no.2, pp.307–309.
[12] S. J. Young, J. J. Odell, and P. C. Woodland, “Tree-based state tying for high accuracy acoustic modelling,” in Proceedings of the workshop on Human Language Technology, 1994, pp.307–312.
[13] C. M. Bishop, Neural networks for pattern recognition. Oxford University Press, 1995.
[14] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Proceedings of the Interspeech, 2013, pp.2345–2349.
[15] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., 2012, Vol.29, no.6, pp.82–97.
[16] H. A. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach, Vol.247. Springer Science & Business Media, 1994.
[17] H. Bourlard and N. Morgan, “Continuous speech recognition by connectionist statistical methods,” IEEE Trans. Neural Networks, 1993, Vol.4, no.6, pp.893–909.
[18] A. Senior, G. Heigold, M. Bacchiani, and H. Liao, “GMM-free DNN acoustic model training,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp.5602–5606.
[19] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio, Speech, Lang. Process., 2012, Vol.20, no.1, pp.30–42.
[20] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams et al., “Recent advances in deep learning for speech research at Microsoft,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp.8604–8608.
[21] L. Deng and D. Yu, “Deep convex net: A scalable architecture for speech pattern classification,” in Proceedings of the Interspeech, 2011.
[22] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp.4277–4280.
[23] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., 2006, Vol.18, no.7, pp.1527–1554.
[24] Z. Huang, J. Li, C. Weng, and C.-H. Lee, “Beyond cross-entropy: towards better frame-level objective functions for deep neural network training in automatic speech recognition,” in Proceedings of the Interspeech, 2014, pp.1214–1218.
[25] D. Povey, “Discriminative training for large vocabulary speech recognition,” Ph.D. dissertation, University of Cambridge, 2005.
[26] V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young, “Lattice-based discriminative training for large vocabulary speech recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996, Vol.2, pp.605–608.
[27] V. Doumpiotis and W. Byrne, “Lattice segmentation and minimum Bayes risk discriminative training for large vocabulary continuous speech recognition,” Speech Commun., 2006, Vol.48, no.2, pp.142–160.
[28] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Proceedings of ICASSP, 1986, pp.701–704.
[29] P. C. Woodland and D. Povey, “Large scale discriminative training for speech recognition,” in Automatic Speech Recognition: Challenges for the New Millennium ISCA Tutorial and Research Workshop (ITRW), 2000.
[30] P. C. Woodland and D. Povey, “Large scale discriminative training of hidden Markov models for speech recognition,” Comput. Speech Lang., 2002, Vol.16, no.1, pp.25–47.
[31] M. Gibson and T. Hain, “Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition,” in Proceedings of the Interspeech, 2006.
[32] B. Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp.3761–3764.
[33] G. Saon, H.-K. J. Kuo, S. Rennie, and M. Picheny, “The IBM 2015 English conversational telephone speech recognition system,” in Proceedings of the Interspeech, 2015.
[34] M. J. F. Gales and S. J. Young, “Robust continuous speech recognition using parallel model combination,” IEEE Trans. Speech Audio Process., 1996, Vol.4, no.5, pp.352–359.
[35] NVIDIA, “CUDA C Programming Guide,” 2011. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[36] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Trans. Acoust., Speech, Signal Process., 1989, Vol.37, no.3, pp.328–339.
[37] H. Sak, A. Senior, K. Rao, and F. Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” arXiv preprint arXiv:1507.06947, 2015.
[38] D. Povey et al., “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proceedings of the Interspeech, 2016.
[39] H. Su, G. Li, D. Yu, and F. Seide, “Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp.6664–6668.
[40] A. Zeyer, E. Beck, R. Schlüter, and H. Ney, “CTC in the context of generalized full-sum HMM training,” in Proceedings of the Interspeech, 2017, pp.944–948.
[41] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “Flat-start single-stage discriminatively trained HMM-based models for ASR,” IEEE/ACM Trans. Audio, Speech, Lang. Process., 2018, Vol.26, no.11, pp.1949–1961.
[42] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to-end speech recognition using lattice-free MMI,” in Proceedings of the Interspeech, 2018, pp.12–16.
[43] C. M. Bishop, Pattern recognition and machine learning. New York: Springer, 2006.
[44] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the 31st International Conference on Machine Learning, 2014, pp.1764–1772.
[45] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp.369–376.
[46] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp.167–174.
[47] A. Hannun et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
[48] A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, “First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs,” arXiv preprint arXiv:1408.2873, 2014.
[49] A. Maas, Z. Xie, D. Jurafsky, and A. Ng, “Lexicon-free conversational speech recognition with neural networks,” in Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp.345–354.
[50] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
[51] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2017, pp.193–199.
[52] E. Battenberg et al., “Exploring neural transducers for end-to-end speech recognition,” arXiv preprint arXiv:1707.07413, 2017.
[53] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
[54] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: first results,” arXiv preprint arXiv:1412.1602, 2014.
[55] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp.4945–4949.
[56] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp.3104–3112.
[57] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp.1724–1734, doi: 10.3115/v1/D14-1179.
[58] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[59] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., 2014, Vol.22, no.10, pp.1533–1545.
[60] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
[61] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” arXiv preprint arXiv:1904.05862, 2019.
[62] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Adv. Neural Inf. Process. Syst., 2020, Vol.33, pp.12449–12460.
[63] A. Vaswani et al., “Attention is all you need,” in Proceedings of Advances in Neural Information Processing Systems, 2017, pp.6000–6010.
[64] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio, Speech, Lang. Process., 2021, Vol.29, pp.3451–3460.
[65] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, and J. Wu, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE J. Sel. Top. Signal Process., 2022, Vol.16, no.6, pp.1505–1518.
[66] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, 2022.
[67] Y. Miao, M. Gowayyed, X. Na, T. Ko, F. Metze, and A. Waibel, “An empirical exploration of CTC acoustic models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp.2623–2627.
[68] J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” in Proceedings of the Interspeech, 2017.
[69] Papers with Code, “Automatic Speech Recognition on LibriSpeech (clean),” 2022. Accessed: Mar. 18, 2023. [Online]. Available: https://paperswithcode.com/sota/automatic-speech-recognition-on-librispeech-7.