Self-adaptation of the acoustic encoder in end-to-end automatic speech recognition under diverse acoustic scenes

Yukun Liu (刘育坤); Lin Zheng (郑霖); Ta Li (黎塔); Pengyuan Zhang (张鹏远)

doi:10.12395/0371-0025.2022114


Abstract

This paper proposes SAE (scene-adaptive acoustic encoder), a method for adaptively designing the acoustic encoder for diverse acoustic scenes. By learning how the acoustic features contained in speech differ across acoustic scenes, the method designs an acoustic encoder suited to each end-to-end speech recognition task. Neural architecture search (NAS) is introduced to make encoder design more effective, which in turn improves performance on the downstream recognition task. Experiments on three commonly used Chinese and English datasets, Aishell-1, HKUST and SWBD, show that the encoders obtained with the proposed scene-adaptive design achieve a relative error-rate reduction of more than 5% on average compared with the best existing human-designed encoders. The proposed method is thus an effective way to analyze the characteristics of speech in a specific scene and to design a high-performance acoustic encoder targeted at that scene.
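The abstract does not say which NAS algorithm SAE uses. As a minimal sketch only, assuming a DARTS-style continuous relaxation, the code below shows how one searchable encoder layer can mix several candidate operations through softmax-weighted architecture logits `alpha`, and how a discrete layer is then derived by keeping the highest-weighted candidate. The operation names (`identity`, `conv1d_k3`, `self_attention`), the feature shapes and the `alpha` values are all hypothetical illustrations, not the paper's search space or results; in a real search `alpha` would be learned by gradient descent on validation data, and searching separately per scene (e.g. per dataset) could yield a different derived architecture for each.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the architecture logits
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical candidate operations; each maps a (time, dim) sequence to the same shape.
def identity(x):
    return x

def conv1d_k3(x):
    # Zero-padded moving average over time (kernel size 3), standing in for a conv block
    padded = np.pad(x, ((1, 1), (0, 0)))
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def self_attention(x):
    # Single-head dot-product self-attention without learned projections
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

CANDIDATE_OPS = {
    "identity": identity,
    "conv1d_k3": conv1d_k3,
    "self_attention": self_attention,
}

def mixed_op(x, alpha):
    """Continuous relaxation: softmax(alpha)-weighted sum of all candidate ops."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def derive_op(alpha):
    """After the search, keep only the candidate with the largest architecture weight."""
    names = list(CANDIDATE_OPS)
    return names[int(np.argmax(alpha))]

rng = np.random.default_rng(0)
features = rng.standard_normal((50, 8))  # 50 frames of 8-dim features (hypothetical)
alpha = np.array([0.1, 0.5, 1.2])        # hypothetical logits after a search
out = mixed_op(features, alpha)
print(out.shape, derive_op(alpha))
```

The relaxation keeps every candidate in the forward pass so that `alpha` can be optimized with ordinary gradients; only the final `derive_op` step turns the search result into a fixed, discrete encoder layer.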
