Analysis of human-machine perception differences in speech forgery detection
Abstract: Drawing on how the human ear perceives speech naturalness, this paper examines how well human listeners detect forged speech and how their judgments differ from those of machine detectors. A subjective listening test on Chinese synthetic speech compares the detection accuracy of human listeners with that of machines, analyzes how objective features affect both subjective and objective detection results, and further contrasts human and machine results in terms of high-level audio features and factors that influence naturalness. The experimental data indicate that timbre, emotion, and prosody all help listeners discriminate. Listeners identify natural speech more accurately than synthetic speech, and both speaker gender and the choice of speech synthesis algorithm produce differences between human and machine results. Further analysis of objective acoustic features shows that the wider the dynamic range and the larger the mean of the zero-crossing rate, the harder it is for listeners to judge; real audio with a wide dynamic range of the 75% spectral roll-off and a narrow fundamental-frequency dynamic range is more likely to be judged correctly by both humans and machines; and synthetic audio whose log spectral flatness varies more steadily and has a smaller mean is more likely to be detected by both.
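To make the objective features named in the abstract concrete, the following is a minimal sketch of how per-utterance statistics for the zero-crossing rate, the 75% spectral roll-off, and the log spectral flatness could be computed. It is not the authors' implementation. The frame length, hop size, Hann window, and the definitions of "dynamic range" (maximum minus minimum over frames) and "stability" (standard deviation over frames) are all assumptions made for illustration, and the function names are hypothetical. Fundamental frequency is omitted because it requires a separate pitch tracker.

```python
import numpy as np


def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    starts = hop * np.arange(n_frames)[:, None]
    return x[starts + np.arange(frame_len)[None, :]]


def summarize_features(x, sr, frame_len=400, hop=160, eps=1e-12):
    """Frame-level features reduced to mean, range, and std per utterance.

    The "range" (max - min) and "std" summaries are illustrative assumptions,
    not definitions taken from the paper.
    """
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)

    # Zero-crossing rate: fraction of adjacent sample pairs whose signs differ.
    negative = np.signbit(frames)
    zcr = np.mean(negative[:, 1:] != negative[:, :-1], axis=1)

    # Power spectrum of each Hann-windowed frame.
    power = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)

    # 75% roll-off: lowest frequency below which 75% of the frame energy lies.
    cumulative = np.cumsum(power, axis=1)
    rolloff_bin = np.argmax(cumulative >= 0.75 * cumulative[:, -1:], axis=1)
    rolloff = freqs[rolloff_bin]

    # Log spectral flatness: log(geometric mean / arithmetic mean) of the power.
    log_flatness = (np.mean(np.log(power + eps), axis=1)
                    - np.log(np.mean(power, axis=1) + eps))

    def stats(v):
        return {"mean": float(np.mean(v)),
                "range": float(np.max(v) - np.min(v)),
                "std": float(np.std(v))}

    return {"zcr": stats(zcr),
            "rolloff75_hz": stats(rolloff),
            "log_flatness": stats(log_flatness)}


# Synthetic test signals (stand-ins for real recordings, which are not included).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t + 0.3)
noise = np.random.default_rng(0).standard_normal(sr)

tone_stats = summarize_features(tone, sr)
noise_stats = summarize_features(noise, sr)
print("tone :", tone_stats)
print("noise:", noise_stats)
```

On these synthetic signals, the pure tone should show a low zero-crossing rate and a strongly negative log flatness, while white noise should show a zero-crossing rate near 0.5 and a flatness much closer to zero. In a study like the one summarized above, such per-utterance statistics would then be compared against human and machine detection outcomes.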