Indexed in EI / SCOPUS / CSCD

Chinese Core Journal

TANG Guichen, LIANG Ruiyu, KONG Fanliu, XIE Yue, JU Mengjie. A non-invasive speech quality evaluation algorithm combining auxiliary target learning and convolutional recurrent network[J]. ACTA ACUSTICA, 2022, 47(5): 692-702. DOI: 10.15949/j.cnki.0371-0025.2022.05.003

A non-invasive speech quality evaluation algorithm combining auxiliary target learning and convolutional recurrent network

  • Objective evaluation of speech quality can replace expensive manual scoring, but current objective metrics usually require a clean reference signal, which is difficult to obtain in many practical acoustic systems. A non-invasive speech quality evaluation algorithm combining auxiliary target learning and a Convolutional Recurrent Network (CRN) is proposed. Bark Frequency Cepstral Coefficients (BFCCs), which are based on filters that mimic human hearing, are used as the input to the CRN to reduce network complexity. First, a Convolutional Neural Network (CNN) extracts frame-level features from the BFCCs. Then, Bidirectional Long Short-Term Memory (BiLSTM) networks model the long-term temporal dependencies and sequence characteristics of these frame-level features. Finally, a self-attention mechanism is introduced into the CRN to adaptively extract useful information from the frame-level features; this information is aggregated into utterance-level features and mapped to the final objective score. In addition, a multi-task training strategy is adopted, with Voice Activity Detection (VAD) introduced as an auxiliary learning target to improve performance. Experiments on public databases show that, compared with other non-invasive algorithms, the proposed algorithm correlates better with the Mean Opinion Score (MOS). Moreover, it has a small parameter count and generalizes well to the distorted speech database with MOS labels released by ITU-T P.808, where its accuracy is close to that of the Perceptual Evaluation of Speech Quality (PESQ).
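The last two steps of the abstract, attention-based aggregation of frame-level features into an utterance-level representation and multi-task training with a VAD auxiliary target, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify the attention form, the loss functions, or the task weighting, so the scaled dot-product scoring against a single `query` vector, the MSE and binary cross-entropy terms, and the `aux_weight` value are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse T frame-level vectors into one utterance-level vector.

    frames: list of T feature vectors of length D (e.g. BiLSTM outputs).
    query:  hypothetical learned vector of length D that scores each frame.
    """
    dim = len(query)
    # Assumed scoring rule: scaled dot product between each frame and the query.
    scores = [
        sum(f * q for f, q in zip(frame, query)) / math.sqrt(dim)
        for frame in frames
    ]
    weights = softmax(scores)
    # Weighted sum over time gives the utterance-level representation.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


def multitask_loss(mos_pred, mos_true, vad_probs, vad_labels, aux_weight=0.5):
    """Assumed objective: MSE on the MOS plus weighted frame-level BCE on VAD."""
    mse = (mos_pred - mos_true) ** 2
    eps = 1e-7  # clamp probabilities to keep log() finite
    bce = 0.0
    for p, y in zip(vad_probs, vad_labels):
        p = min(max(p, eps), 1.0 - eps)
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    bce /= len(vad_labels)
    return mse + aux_weight * bce


# Toy usage: three 2-dimensional frames and a zero query (uniform attention).
frames = [[0.0, 0.0], [1.0, 2.0], [1.0, 2.0]]
pooled, weights = attention_pool(frames, [0.0, 0.0])
loss = multitask_loss(3.0, 3.0, [0.5, 0.5], [1, 0])
```

In a trained model the query, the mapping from the pooled vector to a score, and the VAD head would all be learned jointly; here they are fixed toy values so that the data flow is easy to follow.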
