Consistency self-supervised learning method for robust automatic speech recognition
-
Abstract: A robust automatic speech recognition (ASR) method based on consistency self-supervised learning (CSSL) is proposed. The method uses speech signal simulation to generate copies of an utterance under different acoustic conditions. While learning speech representations in a self-supervised manner, it maximizes the similarity between the representations of the same utterance across these conditions, so that representations invariant to environmental interference are obtained and the performance of the downstream ASR model improves. Experiments on the far-field dataset CHiME-4 and the meeting dataset AMI show that, combined with an appropriate pre-training pipeline, the proposed CSSL algorithm achieves a relative word error rate reduction of more than 30% over the existing wav2vec2.0 self-supervised baseline. This demonstrates that the proposed method is an effective way to obtain noise-invariant speech representations and improve robust ASR performance.
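To make the training objective concrete, here is a minimal sketch of the consistency idea described in the abstract. It is not the paper's implementation: a toy convolutional encoder stands in for wav2vec2.0, acoustic simulation is reduced to additive noise at a fixed SNR, and the consistency term is the negative frame-wise cosine similarity between the representations of two simulated copies of the same utterance. In practice this term would be added, with a weight, to the usual self-supervised pre-training loss; all names and hyperparameters below are illustrative assumptions.

```python
# Minimal sketch of a consistency objective between simulated copies of the
# same utterance. ToyEncoder, simulate_acoustics, and all hyperparameters
# are illustrative stand-ins, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for a wav2vec2.0-style feature encoder."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=400, stride=320)  # ~20 ms hop

    def forward(self, wave):              # wave: (batch, samples)
        x = self.conv(wave.unsqueeze(1))  # (batch, dim, frames)
        return x.transpose(1, 2)          # (batch, frames, dim)

def simulate_acoustics(wave, snr_db=10.0):
    """Toy acoustic simulation: additive noise at a given SNR.
    The paper's full simulation (noise plus reverberation) would go here."""
    noise = torch.randn_like(wave)
    scale = wave.norm() / (noise.norm() * 10 ** (snr_db / 20))
    return wave + scale * noise

def consistency_loss(encoder, wave):
    """Negative mean frame-wise cosine similarity between two copies,
    so that minimizing the loss maximizes representation similarity."""
    rep_a = encoder(simulate_acoustics(wave))
    rep_b = encoder(simulate_acoustics(wave))
    return -F.cosine_similarity(rep_a, rep_b, dim=-1).mean()

encoder = ToyEncoder()
wave = torch.randn(2, 16000)            # two 1-second utterances at 16 kHz
loss = consistency_loss(encoder, wave)  # added, weighted, to the SSL loss
loss.backward()
```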
-
Table 1 Word error rates (%) on CHiME-4 of wav2vec2.0 models with different pre-training strategies (dt: development set; et: evaluation set; -s: simulated data; -r: real recordings; 1ch/6ch: single-/six-channel input)

| Training strategy | Language model | 1ch dt-s | 1ch dt-r | 1ch et-s | 1ch et-r | 6ch dt-s | 6ch dt-r | 6ch et-s | 6ch et-r |
|---|---|---|---|---|---|---|---|---|---|
| Source-data pre-training | Trigram | 10.8 | 8.5 | 17.2 | 15.5 | 6.2 | 5.5 | 10.2 | 9.2 |
| Mixed-data pre-training | Trigram | 12.8 | 10.3 | 19.2 | 16.5 | 8.5 | 7.2 | 12.4 | 10.6 |
| Continual pre-training | Trigram | 9.3 | 7.6 | 16.8 | 13.7 | 6.3 | 5.4 | 10.1 | 8.6 |
| +1:1 data replay | Trigram | 9.0 | 7.3 | 15.8 | 13.1 | 5.9 | 5.1 | 9.4 | 8.0 |
| +1:3 data replay | Trigram | 9.5 | 7.4 | 15.9 | 13.7 | 5.8 | 5.2 | 9.4 | 8.0 |
| +1:9 data replay | Trigram | 9.8 | 7.8 | 16.3 | 13.9 | 6.0 | 5.2 | 9.7 | 8.3 |
| Source-data pre-training | Neural network | 7.7 | 5.8 | 13.3 | 11.3 | 4.2 | 3.4 | 7.1 | 6.2 |
| Mixed-data pre-training | Neural network | 9.4 | 7.0 | 15.0 | 12.1 | 5.7 | 4.6 | 8.8 | 7.2 |
| Continual pre-training | Neural network | 6.7 | 5.0 | 13.2 | 10.1 | 3.8 | 3.3 | 7.2 | 5.6 |
| +1:1 data replay | Neural network | 6.4 | 4.8 | 12.3 | 9.7 | 3.8 | 3.2 | 6.6 | 5.4 |
| +1:3 data replay | Neural network | 6.5 | 4.7 | 12.1 | 9.7 | 3.6 | 3.0 | 6.4 | 5.1 |
| +1:9 data replay | Neural network | 6.9 | 4.8 | 12.4 | 10.1 | 3.6 | 3.0 | 6.3 | 5.3 |
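The data replay rows above interleave source-domain data back into continual pre-training on the target domain. As a hedged illustration only, assuming that "1:k data replay" means k source-domain utterances are replayed for every target-domain utterance (the table itself does not define the ratio), batch construction could look like the following; the dataset contents are placeholders:

```python
# Illustrative batch construction for continual pre-training with data
# replay. Assumption (not stated in the table): "1:k data replay" draws
# k source-domain utterances per target-domain utterance.
import random

def replay_batches(target_set, source_set, ratio=(1, 1), batch_size=8):
    """Yield batches mixing target data with replayed source data
    according to the given sampling ratio (target weight, source weight)."""
    weights = list(ratio)
    while True:
        batch = [
            random.choice(random.choices([target_set, source_set],
                                         weights=weights)[0])
            for _ in range(batch_size)
        ]
        yield batch

# Example: 1:3 replay samples source data three times as often as target.
target_set = [f"chime4_utt_{i:03d}" for i in range(100)]       # placeholder
source_set = [f"librispeech_utt_{i:03d}" for i in range(100)]  # placeholder
batches = replay_batches(target_set, source_set, ratio=(1, 3))
print(next(batches))
```

Under this reading, a larger replay proportion (1:9) keeps pre-training closer to the source domain, which would be consistent with the slightly higher error rates in those rows.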
Table 2 Word error rates (%) on CHiME-4: wav2vec2.0 self-supervised baseline vs. the proposed CSSL algorithm

| Training strategy | Language model | 1ch dt-s | 1ch dt-r | 1ch et-s | 1ch et-r | 6ch dt-s | 6ch dt-r | 6ch et-s | 6ch et-r |
|---|---|---|---|---|---|---|---|---|---|
| Source-data pre-training | Trigram | 10.8 | 8.5 | 17.2 | 15.5 | 6.2 | 5.5 | 10.2 | 9.2 |
| +DA-SIMU | Trigram | 10.5 | 8.9 | 16.8 | 15.5 | 6.4 | 5.6 | 10.0 | 8.8 |
| +DA-CSSL | Trigram | 9.5 | 7.3 | 15.7 | 13.3 | 5.9 | 5.1 | 9.4 | 8.2 |
| Continual pre-training (1:1) | Trigram | 9.0 | 7.3 | 15.8 | 13.1 | 5.9 | 5.1 | 9.4 | 8.0 |
| +DA-SIMU | Trigram | 9.0 | 7.2 | 15.5 | 13.0 | 5.9 | 5.2 | 9.5 | 7.9 |
| +DA-CSSL | Trigram | 8.7 | 6.5 | 15.2 | 12.0 | 5.7 | 4.8 | 9.1 | 7.8 |
| +MC-CSSL | Trigram | 9.0 | 7.0 | 15.0 | 12.7 | 5.6 | 4.9 | 8.8 | 7.5 |
| +DAMC-CSSL | Trigram | 8.4 | 6.3 | 14.7 | 11.4 | 5.3 | 4.6 | 8.4 | 7.3 |
| +DAMCSE-CSSL | Trigram | 8.8 | 6.5 | 15.2 | 12.2 | 5.7 | 4.8 | 9.1 | 7.6 |
| Source-data pre-training | Neural network | 7.7 | 5.8 | 13.3 | 11.3 | 4.2 | 3.4 | 7.1 | 6.2 |
| +DA-SIMU | Neural network | 7.3 | 5.5 | 12.9 | 10.9 | 3.9 | 3.2 | 6.8 | 5.6 |
| +DA-CSSL | Neural network | 6.6 | 4.5 | 11.9 | 9.3 | 3.6 | 2.9 | 6.2 | 5.3 |
| Continual pre-training (1:1) | Neural network | 6.4 | 4.8 | 12.3 | 9.7 | 3.8 | 3.2 | 6.6 | 5.4 |
| +DA-SIMU | Neural network | 6.4 | 4.5 | 11.8 | 9.2 | 3.7 | 3.0 | 6.6 | 5.1 |
| +DA-CSSL | Neural network | 6.2 | 4.2 | 11.7 | 8.5 | 3.7 | 2.9 | 6.3 | 4.9 |
| +MC-CSSL | Neural network | 6.2 | 4.5 | 11.6 | 9.0 | 3.5 | 2.8 | 5.7 | 4.8 |
| +DAMC-CSSL | Neural network | 5.8 | 3.7 | 11.0 | 7.7 | 3.4 | 2.6 | 5.4 | 4.2 |
| +DAMCSE-CSSL | Neural network | 6.2 | 4.3 | 11.6 | 8.6 | 3.5 | 2.8 | 6.3 | 4.8 |
Table 3 Word error rates (%) on AMI SDM: wav2vec2.0 self-supervised baseline vs. the proposed CSSL algorithm

| Training strategy | Trigram LM, dev | Trigram LM, test | Neural network LM, dev | Neural network LM, test |
|---|---|---|---|---|
| Source-data pre-training | 29.8 | 32.9 | 29.2 | 32.5 |
| +DA-CSSL | 28.2 | 31.4 | 26.9 | 30.2 |
| Continual pre-training | 29.2 | 32.5 | 28.0 | 31.3 |
| +DA-CSSL | 27.9 | 31.1 | 26.7 | 30.2 |
| +MC-CSSL | 28.8 | 32.3 | 27.7 | 31.2 |
| +DAMC-CSSL | 27.6 | 31.3 | 26.4 | 29.8 |
Table 4 Word error rates (%) on CHiME-4: the proposed method vs. existing robust ASR systems

| Method | Model architecture | 1ch dt-r | 1ch et-r | 6ch dt-r | 6ch et-r |
|---|---|---|---|---|---|
| Kaldi [23] | Hybrid-TDNN | 5.6 | 11.4 | 1.9 | 2.7 |
| Du et al. [24] | Hybrid-DCNN | 4.6 | 9.2 | 2.1 | 2.2 |
| Wang et al. [25] | Hybrid-BLSTM | 3.5 | 6.8 | 1.5 | 2.0 |
| ESPnet [26] | E2E-Conformer | 11.7 | 20.6 | 7.9 | 14.2 |
| Guo et al. [13] | E2E-Transformer | — | — | 15.8 | 26.8 |
| Tsunoo et al. [27] | E2E-Conformer | 8.4 | 15.7 | — | — |
| Wang et al. [28] | E2E-Wav2vec2.0 | 5.0 | 9.0 | — | — |
| HuBERT [16] | E2E-HuBERT-base | 5.5 | 10.8 | 3.3 | 5.5 |
| Proposed method | E2E-Wav2vec2.0 | 3.7 | 7.7 | 2.0 | 3.5 |
-
[1] Anguera X, Wooters C, Hernando J. Acoustic beamforming for speaker diarization of meetings. IEEE Trans. Audio Speech Lang. Process., 2007; 15(7): 2011—2022. doi: 10.1109/TASL.2007.902460
[2] 陈明建, 胡振彪, 陈林. A new wideband DOA estimation method based on weighted TOPS (in Chinese). Journal of Data Acquisition and Processing, 2019; 34(3): 453—461. doi: 10.16337/j.1004-9037.2019.03.009
[3] 王子腾, 孙兴伟, 李军锋, et al. Minimum variance distortionless response beamforming under the approximate narrowband assumption (in Chinese). Acta Acustica, 2020; 45(2): 161—168. doi: 10.15949/j.cnki.0371-0025.2020.02.002
[4] 石倩, 陈航艇, 张鹏远. Speech enhancement using a spatial mixture probability model initialized by direction of arrival (in Chinese). Acta Acustica, 2022; 47(1): 139—150. doi: 10.15949/j.cnki.0371-0025.2022.01.016
[5] Higuchi T, Ito N, Yoshioka T, et al. Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise. IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 2016: 5210—5214
[6] 柯雨璇, 厉剑, 彭任华, et al. Estimation of masking functions in the spherical harmonic domain for adaptive beamforming speech enhancement (in Chinese). Acta Acustica, 2021; 46(1): 67—80. doi: 10.15949/j.cnki.0371-0025.2021.01.007
[7] 高飞, 黄哲莹, 王子腾, et al. Effect of early/late reverberation division on the speech recognition performance of the ideal ratio mask (in Chinese). Acta Acustica, 2019; 44(4): 788—795. doi: 10.15949/j.cnki.0371-0025.2019.04.041
[8] Wang Z Q, Wang D L. A joint training framework for robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process., 2016; 24(4): 796—806. doi: 10.1109/TASLP.2016.2528171
[9] 陈欢. Research on noise robustness methods for automatic speech recognition (in Chinese). Ph.D. dissertation, Nanjing: Nanjing University of Posts and Telecommunications, 2014
[10] 杜俊. Noise robustness methods in automatic speech recognition (in Chinese). Ph.D. dissertation, Hefei: University of Science and Technology of China, 2009
[11] 柏财通, 崔翛龙, 郑会吉, et al. Robust speech recognition technology based on self-supervised knowledge transfer (in Chinese). Journal of Computer Applications, 2022; 42(10): 3217—3223. doi: 10.11772/j.issn.1001-9081.2021050808
[12] 张开生, 赵小芬. Robust speech recognition based on adaptive deep neural networks in complex environments (in Chinese). Computer Engineering & Science, 2022; 44(6): 1105—1113. doi: 10.3969/j.issn.1007-130X.2022.06.019
[13] Guo Y, Chen Y, Cheng G, et al. Far-field speech recognition based on complex-valued neural networks and inter-frame similarity difference method. IEEE Automatic Speech Recognition and Understanding Workshop, Cartagena, Colombia, 2021: 1003—1010
[14] Fan C, Yi J, Tao J, et al. Gated recurrent fusion with joint training framework for robust end-to-end speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process., 2021; 29: 198—209. doi: 10.1109/TASLP.2020.3039600
[15] Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 2020; 33: 12449—12460
[16] Hsu W N, Bolte B, Tsai Y H H, et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process., 2021; 29: 3451—3460. doi: 10.1109/TASLP.2021.3122291
[17] Gao C, Cheng G, Yang R, et al. Pre-training transformer decoder for end-to-end ASR model with unpaired text data. IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Canada, 2021: 6543—6547
[18] Liu X, Zhang F, Hou Z, et al. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng., 2023; 35(1): 857—876. doi: 10.1109/TKDE.2021.3090866
[19] Barker J, Marxer R, Vincent E, et al. The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines. IEEE Workshop on Automatic Speech Recognition and Understanding, Scottsdale, USA, 2015: 504—511
[20] Renals S, Hain T, Bourlard H. Interpretation of multiparty meetings: The AMI and AMIDA projects. Hands-Free Speech Communication and Microphone Arrays (HSCMA), 2008: 115—118
[21] Panayotov V, Chen G, Povey D, et al. Librispeech: An ASR corpus based on public domain audio books. IEEE International Conference on Acoustics, Speech and Signal Processing, South Brisbane, Australia, 2015: 5206—5210
[22] Ko T, Peddinti V, Povey D, et al. A study on data augmentation of reverberant speech for robust speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 5220—5224
[23] Chen S J, Subramanian A S, Xu H, et al. Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline. Proc. Interspeech, Hyderabad, India, 2018: 1571—1575
[24] Du J, Tu Y H, Sun L, et al. The USTC-iFlytek system for CHiME-4 challenge. Proc. CHiME, 2016: 36—38
[25] Wang Z Q, Wang P, Wang D L. Complex spectral mapping for single- and multi-channel speech enhancement and robust ASR. IEEE/ACM Trans. Audio Speech Lang. Process., 2020; 28: 1778—1787. doi: 10.1109/TASLP.2020.2998279
[26] Guo P, Boyer F, Chang X, et al. Recent developments on ESPnet toolkit boosted by Conformer. IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Canada, 2021: 5874—5878
[27] Tsunoo E, Shibata K, Narisetty C, et al. Data augmentation methods for end-to-end speech recognition on distant-talk scenarios. Proc. Interspeech, Brno, Czechia, 2021: 301—305
[28] Wang H, Qian Y, Wang X, et al. Improving noise robustness of contrastive speech representation learning with speech reconstruction. IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 2022: 6062—6066