A Frequency Bin Analysis of Distinctive Ranges Between Human and Deepfake Generated Voices

Maltby, Harry; Wall, Julie; Glackin, Cornelius; Moniri, Mansour; Cannings, Nigel; Salami, Iwa

Maltby, Harry, Wall, Julie, Glackin, Cornelius, Moniri, Mansour, Cannings, Nigel and Salami, Iwa (2024) A Frequency Bin Analysis of Distinctive Ranges Between Human and Deepfake Generated Voices. In: International Joint Conference on Neural Networks (IJCNN), 30 Jun - 5 July 2024, Yokohama, Japan.

ijcnn_2024_submission_paper.pdf - Accepted Version
Available under License Creative Commons Attribution.

Abstract

Deepfake technology has advanced rapidly in recent years. The widespread availability of deepfake audio technology has raised concerns about its potential misuse for malicious purposes, and the need for more robust countermeasure systems is becoming ever more pressing. Here we analyse the differences between human and deepfake audio and introduce a novel audio pre-processing approach. Our analysis aims to show the specific locations in the frequency spectrum where artefacts and distinctions between human and deepfake audio can be found. Our approach emphasises specific frequency ranges that we show are transferable across synthetic speech datasets. To exploit commonalities across generation algorithms, we construct a bespoke filter bank from a frequency bin analysis of the WaveFake dataset. We apply this filter bank to adjust gain and attenuation so as to improve the effective signal-to-noise ratio, reducing the similarities between human and deepfake audio while accentuating the differences. We then take a baseline model and experiment with improving its performance using these frequency ranges, to show where the artefacts lie and whether this knowledge is transferable across algorithms that generate audio in the mel-spectrum. We show that there exist exploitable commonalities between deepfake voice generation methods that generate audio in the mel-spectrum and that artefacts are left behind in similar frequency regions. Our approach is evaluated on the ASVspoof 2019 Logical Access dataset, whose test set contains unseen generative methods, to test the efficacy and transferability of our filter bank approach. Our experiments show that classification performance improves when these transferable frequency bands, which contain more artefacts and distinctions, are utilised. Our highest-performing model provided a 14.75% improvement in Equal Error Rate against our baseline model.
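
The gain/attenuation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' filter bank. It assumes a simple FFT-domain approach in which each frequency bin inside a chosen band is scaled by a fixed gain in decibels. The function name `apply_band_gains`, the band edges, and the gain values are all hypothetical and chosen only for illustration; the abstract does not state which bands the WaveFake analysis identified.

```python
import numpy as np


def apply_band_gains(signal, sample_rate, bands):
    """Scale frequency bands of a mono signal by gains given in dB.

    bands is an iterable of (low_hz, high_hz, gain_db) tuples. Every FFT
    bin with low_hz <= f < high_hz is multiplied by 10 ** (gain_db / 20).
    Bins outside all bands are left unchanged.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    gains = np.ones_like(freqs)
    for low_hz, high_hz, gain_db in bands:
        mask = (freqs >= low_hz) & (freqs < high_hz)
        gains[mask] = 10.0 ** (gain_db / 20.0)
    # Transform back to the time domain, keeping the original length.
    return np.fft.irfft(spectrum * gains, n=len(signal))


# Hypothetical bands (assumption, not taken from the paper):
# attenuate 0-1 kHz by 6 dB and boost 4-8 kHz by 6 dB.
EXAMPLE_BANDS = [(0.0, 1000.0, -6.0), (4000.0, 8000.0, 6.0)]

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate  # one second of audio
test_signal = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 5000 * t)
filtered = apply_band_gains(test_signal, sample_rate, EXAMPLE_BANDS)
```

In this toy example the 500 Hz tone is attenuated and the 5000 Hz tone is boosted, which mirrors the idea of suppressing regions where human and deepfake audio look alike and emphasising regions where they differ. A practical pre-processing stage would more likely use smooth filters applied frame by frame rather than hard masking over a whole utterance.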