Consistency self-supervised learning method for robust automatic speech recognition
-
Abstract
A robust automatic speech recognition (ASR) method based on consistency self-supervised learning (CSSL) is proposed. The method uses speech simulation to generate versions of the speech under different acoustic environments, then uses self-supervised learning to extract speech representations and maximize the similarity between the representations of the simulated speech. As a result, speech representations that are invariant across acoustic environments can be extracted, and ASR performance improves. The proposed method is evaluated on the far-field dataset CHiME-4 and the meeting dataset AMI. With CSSL and an appropriate pre-training pipeline, a relative word error rate reduction of up to 30% is achieved compared with wav2vec 2.0. This shows that CSSL can extract noise-invariant speech features and effectively improve ASR performance.
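The consistency idea summarized in the abstract can be illustrated with a minimal sketch. It is not the implementation evaluated in this work. The white-noise simulation (`add_noise`), the toy frame-energy encoder (`encode`), and the cosine-based `consistency_loss` are all hypothetical stand-ins chosen for illustration, since the abstract does not specify the simulation method, the encoder, or the exact form of the similarity objective.

```python
import math
import random


def add_noise(wave, snr_db, rng):
    """Hypothetical stand-in for speech simulation: add white Gaussian
    noise to the waveform at a target signal-to-noise ratio (in dB)."""
    power = sum(x * x for x in wave) / len(wave)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_std) for x in wave]


def encode(wave, frame=4):
    """Toy encoder (assumption): mean absolute amplitude per frame.
    A real system would use a learned self-supervised model here."""
    return [
        sum(abs(x) for x in wave[i:i + frame]) / frame
        for i in range(0, len(wave) - frame + 1, frame)
    ]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def consistency_loss(rep_a, rep_b):
    """Assumed loss form: 1 - cosine similarity. Minimizing it maximizes
    the similarity between the two representations."""
    return 1.0 - cosine_similarity(rep_a, rep_b)


# Two simulated views of the same utterance under different noise levels.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
view_a = add_noise(clean, snr_db=20, rng=rng)
view_b = add_noise(clean, snr_db=5, rng=rng)

loss = consistency_loss(encode(view_a), encode(view_b))
```

In this sketch, a training loop would update the encoder parameters to reduce `loss`, pulling the representations of the two simulated views together so that they depend less on the acoustic environment.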
-
-