Abstract:
In this paper, a scene-adaptive acoustic encoder (SAE) is proposed for different speech scenes. The method adaptively designs an appropriate acoustic encoder for end-to-end speech recognition by learning the differences among acoustic features in different acoustic scenes. By applying neural architecture search, both the effectiveness of encoder design and the performance of downstream recognition tasks are improved. Experiments on three commonly used Chinese and English datasets, Aishell-1, HKUST and SWBD, show that the proposed SAE achieves an average relative character error rate reduction of 5% over the best human-designed encoders. The results demonstrate that the proposed method is effective for analyzing acoustic features in specific scenes and for the targeted design of high-performance acoustic encoders.