Self-adaptation of the acoustic encoder for end-to-end automatic speech recognition under diverse acoustic scenes
Abstract: A scene-adaptive acoustic encoder (SAE) design method is proposed for diverse acoustic scenes. By learning the differences among the acoustic features of speech in different scenes, the method adaptively designs a suitable acoustic encoder for end-to-end speech recognition. Introducing neural architecture search improves the effectiveness of the encoder design and, in turn, the performance of the downstream recognition task. Experiments on three commonly used Chinese and English datasets, Aishell-1, HKUST and SWBD, show that the encoders obtained by the proposed scene-adaptive design method achieve an average relative character error rate reduction of more than 5% over the best hand-designed encoders. The proposed method is an effective way to analyze the speech characteristics of a specific scene in depth and to design high-performance acoustic encoders in a targeted manner.
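The abstract names the ingredients but not the mechanics. Below is a minimal sketch of the kind of differentiable module selection such a search can use, assuming a DARTS-style supernet in which every encoder slot mixes candidate operations weighted by learned architecture logits; the "SAE (softmax)" rows in Tables 1-3 are consistent with the full method sampling these weights via Gumbel-softmax and the ablation replacing it with a plain softmax. All class names, dimensions, and candidate modules here are illustrative assumptions, not the authors' code.

```python
# A minimal sketch (illustrative, not the paper's implementation) of one
# searchable encoder slot relaxed as a differentiable mix of candidates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedModule(nn.Module):
    """One searchable slot: a weighted mix of candidate operations."""
    def __init__(self, dim: int):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads=4, batch_first=True),           # self-attention
            nn.Conv1d(dim, dim, kernel_size=15, padding=7),                      # convolution
            nn.Sequential(nn.Linear(dim, 1024), nn.ReLU(), nn.Linear(1024, dim)),  # FFN
        ])
        # One architecture logit per candidate, learned on held-out data.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # Gumbel-softmax gives a near-one-hot yet differentiable choice;
        # swapping in F.softmax here yields the "SAE (softmax)" ablation.
        w = F.gumbel_softmax(self.alpha, tau=tau, hard=False)
        outs = []
        for op in self.candidates:
            if isinstance(op, nn.MultiheadAttention):
                y, _ = op(x, x, x)
            elif isinstance(op, nn.Conv1d):
                y = op(x.transpose(1, 2)).transpose(1, 2)  # (B,T,C) <-> (B,C,T)
            else:
                y = op(x)
            outs.append(y)
        return sum(w[i] * outs[i] for i in range(len(outs)))
```

When the search converges, only the highest-logit candidate in each slot would be kept, and the derived encoder retrained from scratch; this matches the separate search and retraining costs reported in Table 4.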
Table 1. Character error rate (CER, %) of SAE encoders versus hand-designed encoders under the CTC framework

| Model | Config | Aishell-1 params | Aishell-1 dev | Aishell-1 test | HKUST params | HKUST test | Hub5'00 params | Hub5'00 swbd1 | Hub5'00 callhm |
|---|---|---|---|---|---|---|---|---|---|
| Transformer | H4 | 8.19 M | 7.1 | 7.7 | 8.19 M | 24.1 | 8.19 M | 13.4 | 24.7 |
| Transformer | H8 | 8.19 M | 6.8 | 7.5 | 8.19 M | 23.7 | 8.19 M | 13.7 | 25.2 |
| Transformer | H16 | 8.19 M | 6.8 | 7.4 | 8.19 M | 23.9 | 8.19 M | 13.6 | 25.0 |
| Conformer | H4C7 | 9.68 M | 5.8 | 6.4 | 9.68 M | 23.4 | 9.68 M | 11.8* | 22.1 |
| Conformer | H4C15 | 9.75 M | 6.0* | 6.6* | 9.75 M | 22.8* | 9.75 M | 12.0 | 22.4 |
| Conformer | H4C31 | 9.92 M | 6.5 | 7.2 | 9.92 M | 23.7 | 9.92 M | 11.8* | 22.0* |
| Conformer | H8C15 | 9.75 M | 6.0* | 6.7 | 9.75 M | 22.8* | 9.75 M | 12.3 | 22.9 |
| Conformer | H16C15 | 9.75 M | 6.2 | 6.8 | 9.75 M | 23.4 | 9.75 M | 12.2 | 22.7 |
| Random search | Searched | 9.96 M | 6.3 | 6.0 | 9.13 M | 23.9 | 9.69 M | 12.5 | 23.2 |
| SAE (no pretraining) | Searched | 9.43 M | 5.9 | 6.4 | 9.84 M | 22.8 | 9.39 M | 11.8 | 22.0 |
| SAE (softmax) | Searched | 9.23 M | 6.1 | 6.7 | 9.97 M | 23.4 | 9.57 M | 12.0 | 22.3 |
| SAE | Searched | 9.11 M | 5.6 | 6.1 | 9.07 M | 22.1 | 9.78 M | 11.5 | 21.5 |
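For orientation across Tables 1-3: the same encoder is evaluated under three training frameworks that differ only in the head attached to it. The sketch below covers the CTC case of Table 1, assuming the searched encoder behaves as an ordinary sequence encoder module; the stand-in encoder, vocabulary size, and shapes are illustrative, not the paper's configuration.

```python
# A minimal sketch of the CTC framework from Table 1, with a stock
# Transformer encoder standing in for a searched SAE encoder.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(  # stand-in for the searched encoder
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
vocab_size = 4000                           # illustrative character set size
ctc_head = nn.Linear(256, vocab_size + 1)   # +1 output for the CTC blank

feats = torch.randn(8, 100, 256)            # (batch, frames, feature dim)
log_probs = ctc_head(encoder(feats)).log_softmax(-1).transpose(0, 1)  # (T, B, V)

targets = torch.randint(1, vocab_size, (8, 20))
loss = nn.CTCLoss(blank=vocab_size)(
    log_probs,
    targets,
    input_lengths=torch.full((8,), 100),
    target_lengths=torch.full((8,), 20),
)
```

The AED and RNN-T rows in Tables 2 and 3 replace this linear CTC head with an attention decoder or a prediction-plus-joint network, respectively, while the encoder itself is unchanged.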
Table 2. Character error rate (CER, %) of SAE encoders versus hand-designed encoders under the AED framework

| Model | Config | Aishell-1 params | Aishell-1 dev | Aishell-1 test | HKUST params | HKUST test | Hub5'00 params | Hub5'00 swbd1 | Hub5'00 callhm |
|---|---|---|---|---|---|---|---|---|---|
| Transformer | H4 | 8.19 M | 5.7 | 6.3 | 8.19 M | 22.8 | 8.19 M | 8.4 | 17.6 |
| Transformer | H8 | 8.19 M | 6.2 | 6.8 | 8.19 M | 22.6 | 8.19 M | 8.5 | 17.1 |
| Transformer | H16 | 8.19 M | 5.4 | 5.9 | 8.19 M | 22.6 | 8.19 M | 8.7 | 18.0 |
| Conformer | H4C7 | 9.68 M | 5.3 | 5.9 | 9.68 M | 21.5 | 9.68 M | 8.1 | 16.0 |
| Conformer | H4C15 | 9.75 M | 5.1* | 5.7* | 9.75 M | 21.4* | 9.75 M | 8.2 | 16.3 |
| Conformer | H4C31 | 9.92 M | 5.2 | 5.8 | 9.92 M | 21.8 | 9.92 M | 8.0* | 15.9* |
| Conformer | H8C15 | 9.75 M | 5.2 | 5.7* | 9.75 M | 21.7 | 9.75 M | 8.1 | 16.2 |
| Conformer | H16C15 | 9.75 M | 5.1* | 5.7* | 9.75 M | 22.0 | 9.75 M | 8.2 | 16.5 |
| Random search | Searched | 9.38 M | 5.4 | 6.1 | 9.52 M | 22.3 | 9.09 M | 8.4 | 17.0 |
| SAE (no pretraining) | Searched | 9.46 M | 5.0 | 5.5 | 9.93 M | 21.4 | 9.52 M | 8.0 | 16.0 |
| SAE (softmax) | Searched | 9.77 M | 5.2 | 5.7 | 9.29 M | 21.7 | 9.43 M | 8.2 | 16.5 |
| SAE | Searched | 9.05 M | 4.8 | 5.3 | 9.11 M | 21.0 | 9.82 M | 7.7 | 15.3 |
Table 3. Character error rate (CER, %) of SAE encoders versus hand-designed encoders under the RNN-T framework

| Model | Config | Aishell-1 params | Aishell-1 dev | Aishell-1 test | HKUST params | HKUST test | Hub5'00 params | Hub5'00 swbd1 | Hub5'00 callhm |
|---|---|---|---|---|---|---|---|---|---|
| Transformer | H4 | 8.19 M | 6.5 | 7.1 | 8.19 M | 27.8 | 8.19 M | 11.3 | 20.5 |
| Transformer | H8 | 8.19 M | 6.3 | 6.8 | 8.19 M | 27.9 | 8.19 M | 11.1 | 20.4 |
| Transformer | H16 | 8.19 M | 6.3 | 7.0 | 8.19 M | 27.6 | 8.19 M | 11.5 | 20.9 |
| Conformer | H4C7 | 9.68 M | 6.0* | 6.8 | 9.68 M | 27.2 | 9.68 M | 10.3* | 20.3 |
| Conformer | H4C15 | 9.75 M | 6.0* | 6.8 | 9.75 M | 26.6* | 9.75 M | 10.5 | 19.7 |
| Conformer | H4C31 | 9.92 M | 6.1 | 6.8 | 9.92 M | 26.8 | 9.92 M | 10.7 | 20.1 |
| Conformer | H8C15 | 9.75 M | 6.0* | 6.7* | 9.75 M | 27.0 | 9.75 M | 10.7 | 19.6* |
| Conformer | H16C15 | 9.75 M | 6.0* | 6.7* | 9.75 M | 27.2 | 9.75 M | 10.4 | 20.0 |
| Random search | Searched | 9.71 M | 6.2 | 6.9 | 9.84 M | 27.2 | 9.47 M | 10.9 | 20.5 |
| SAE (no pretraining) | Searched | 9.18 M | 6.0 | 6.5 | 9.81 M | 26.4 | 9.32 M | 10.6 | 19.9 |
| SAE (softmax) | Searched | 9.02 M | 6.0 | 6.8 | 9.39 M | 27.1 | 9.55 M | 10.2 | 19.4 |
| SAE | Searched | 9.85 M | 5.6 | 6.2 | 9.33 M | 25.9 | 9.72 M | 9.8 | 18.8 |
Table 4. Training time cost of SAE encoders versus hand-designed encoders (unit: h)

| Stage | Aishell-1 CTC | Aishell-1 AED | Aishell-1 RNN-T | HKUST CTC | HKUST AED | HKUST RNN-T | SWBD CTC | SWBD AED | SWBD RNN-T |
|---|---|---|---|---|---|---|---|---|---|
| Hand-designed baseline training | 156.3 | 195.4 | 231.5 | 210.8 | 265.0 | 316.0 | 476.6 | 598.1 | 725.7 |
| SAE pretraining | 2.1 | 2.5 | 3.2 | 2.8 | 3.4 | 4.1 | 6.2 | 7.9 | 9.3 |
| SAE search | 12.5 | 15.0 | 18.8 | 16.4 | 20.5 | 24.7 | 37.2 | 46.7 | 60.3 |
| SAE retraining | 20.6 | 25.1 | 31.4 | 27.4 | 34.2 | 41.1 | 62.1 | 77.6 | 94.1 |
| SAE total | 35.2 | 42.6 | 53.4 | 46.6 | 58.1 | 69.9 | 105.5 | 132.2 | 163.7 |
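Table 4 times three SAE stages against full training of the hand-designed baselines; across all nine dataset/framework pairs the SAE total comes to roughly a fifth to a quarter of the baseline cost (e.g. 35.2 h versus 156.3 h for Aishell-1 CTC). The schematic below outlines what each timed stage does; every function body is a placeholder (the actual recipes run in ESPnet), so only the control flow is meaningful.

```python
# Schematic outline of the stages timed in Table 4 (placeholder bodies).
def pretrain(supernet):
    """Warm up all candidate-module weights with the architecture frozen."""

def search(supernet):
    """Update architecture logits on held-out data."""

def derive(supernet):
    """Keep the top-scoring candidate per slot; return the final encoder."""
    return supernet

def retrain(encoder):
    """Train the derived encoder from scratch."""

def run_sae(supernet):
    pretrain(supernet)           # "SAE pretraining" row
    search(supernet)             # "SAE search" row
    encoder = derive(supernet)
    retrain(encoder)             # "SAE retraining" row
    return encoder
```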
Table 5. Character error rate (CER, %) of different SAE encoders versus hand-designed encoders on Aishell-Noisy

| Encoder | Config | Aishell-Noisy dev |
|---|---|---|
| Transformer | H4 | 20.1 |
| Transformer | H8 | 19.8 |
| Transformer | H16 | 20.2 |
| Conformer | H4C7 | 19.1 |
| Conformer | H4C15 | 17.3 |
| Conformer | H4C31 | 17.1* |
| Conformer | H8C15 | 18.0 |
| Conformer | H16C15 | 17.5 |
| SAE | Searched on Aishell-1 | 17.8 |
| SAE | Searched on Aishell-Noisy | 15.0 |
Table 6. Character error rate (CER, %) of ablated versus full search spaces on Aishell-1

| Search space | CTC dev | CTC test | AED dev | AED test | RNN-T dev | RNN-T test |
|---|---|---|---|---|---|---|
| Transformer (best) | 6.8 | 7.4 | 5.4 | 5.9 | 6.3 | 6.8 |
| Conformer (best) | 5.8 | 6.4 | 5.1 | 5.7 | 6.0 | 6.7 |
| MHA fixed to MHSA4 | 5.9 | 6.6 | 5.0 | 5.5 | 5.9 | 6.4 |
| CNN fixed to CNN15 | 5.8 | 6.3 | 5.2 | 5.7 | 6.1 | 6.8 |
| FFN fixed to FFN1024 | 5.6 | 6.1 | 4.8 | 5.3 | 5.6 | 6.2 |
| Stacking identical modules | 5.8 | 6.3 | 5.1 | 5.7 | 5.9 | 6.6 |
| SAE | 5.6 | 6.1 | 4.8 | 5.3 | 5.6 | 6.2 |
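Each ablation row in Table 6 pins one searchable module family to a single choice. A minimal sketch of that setup follows, using the candidate head counts and kernel sizes visible in the configurations of Tables 1-3; any FFN candidate other than FFN1024 is an assumption for illustration.

```python
# Search space implied by Table 6; candidates not visible in Tables 1-3
# (e.g. FFN2048) are illustrative assumptions.
SEARCH_SPACE = {
    "MHA": ["MHSA4", "MHSA8", "MHSA16"],   # self-attention head counts
    "CNN": ["CNN7", "CNN15", "CNN31"],     # convolution kernel sizes
    "FFN": ["FFN1024", "FFN2048"],         # feed-forward inner dims (assumed)
}

def ablated_space(fixed):
    """Pin some families to one op, as in the Table 6 ablation rows."""
    return {k: [fixed[k]] if k in fixed else v for k, v in SEARCH_SPACE.items()}

# e.g. the "FFN fixed to FFN1024" row:
space = ablated_space({"FFN": "FFN1024"})
```

In Table 6 itself, fixing the FFN width leaves the SAE result unchanged, while fixing the attention or convolution choice costs up to about half a CER point, which suggests most of the gain comes from searching over the attention and convolution modules.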