Robust deepfake speech algorithm recognition: classifying generative algorithms via speaker x-vectors and deep learning.

Maltby, Harry, Wall, Julie (ORCID: https://orcid.org/0000-0001-6714-4867), Glackin, Cornelius, Moniri, Mansour, Shrestha, Roman, Cannings, Nigel and Salami, Iwa (2025) Robust deepfake speech algorithm recognition: classifying generative algorithms via speaker x-vectors and deep learning. In: IEEE International Joint Conference on Neural Networks (IJCNN), 30 June - 5 July 2025, Rome, Italy. (In Press)

PDF (PDF/A): Robust Deepfake Speech Algorithm Recognition_2025_ijcnn__WallJ_accessible.pdf - Accepted Version
Available under License Creative Commons Attribution.

Abstract

The rapid advancement of deepfake voice technologies has resulted in alarming cases of impersonation and deception, highlighting the urgent need for robust tools that can not only distinguish real audio from fake but also recognise the generative algorithms responsible. This capability is essential for forensic investigations, legal proceedings, and regulatory enforcement; without robust and explainable detection frameworks, legal professionals and investigators lack the tools needed to effectively monitor, investigate, and prosecute cases involving deepfake misuse. In this work, we take a voice biometrics approach, shifting the focus from identifying who is speaking to identifying which algorithm is speaking. This allows our approach to inherently handle unseen classes while achieving competitive performance for deepfake speech algorithm recognition. Our system leverages a voice-focused ResNet101-based x-vector extraction model and combines diverse audio features, including our experimental novel feature LFCC-HF, enhanced with Linear Discriminant Analysis and cosine similarity clustering. This approach allows for a more transparent and interpretable decision-making process by using a single voice similarity decision boundary, in contrast to the ensemble-based methods commonly used in the literature. Unlike previous works that rely on an ensemble of models, which convolutes the decision-making process, our method achieves comparable results with a significantly lighter-weight architecture: our model has 14.84 M parameters, compared to 95 M and 317 M for Wav2Vec2 base and large. Furthermore, we demonstrate the benefits of targeted data augmentation, which, combined with feature fusion and our novel feature, improves system robustness and adaptability, increasing our F1 score from 0.624 to 0.763, a 22.275% increase over our best single feature and a 40.775% increase over the best ADD 2023 Track 3 baseline. Importantly, the system achieves interpretability through its back-end classification process, where decisions are based on a transparent, learned threshold for voice similarity to known voiceprints. This work offers a foundation for advancing more robust and interpretable solutions in the field of deepfake speech detection.
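To make the back-end described in the abstract concrete, the following is a minimal illustrative sketch of a voiceprint-style classifier: x-vectors are reduced with Linear Discriminant Analysis, averaged into one centroid per known generation algorithm, and a test x-vector is assigned to the most cosine-similar centroid, or to an "unseen algorithm" class if no similarity exceeds a threshold. This is not the authors' implementation; the scikit-learn API, the function names, and the fixed example threshold are assumptions made purely for illustration (the paper learns its decision boundary from data).

```python
# Illustrative sketch only (not the paper's code): LDA + cosine-similarity
# back-end for deepfake speech algorithm recognition from pre-computed
# x-vectors. Assumes numpy and scikit-learn are available.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics.pairwise import cosine_similarity


def fit_backend(xvectors, labels):
    """Fit LDA on enrolment x-vectors and build one centroid ('voiceprint')
    per known generation algorithm."""
    labels = np.asarray(labels)
    lda = LinearDiscriminantAnalysis()
    reduced = lda.fit_transform(np.asarray(xvectors), labels)
    centroids = {
        algo: reduced[labels == algo].mean(axis=0)
        for algo in np.unique(labels)
    }
    return lda, centroids


def classify(xvector, lda, centroids, threshold=0.5):
    """Assign an x-vector to the most similar known algorithm, or flag it as
    'unseen' when the best cosine similarity falls below the threshold.
    The threshold value here is a placeholder, not the learned one."""
    reduced = lda.transform(np.asarray(xvector).reshape(1, -1))
    scores = {
        algo: cosine_similarity(reduced, c.reshape(1, -1))[0, 0]
        for algo, c in centroids.items()
    }
    best_algo, best_score = max(scores.items(), key=lambda kv: kv[1])
    return (best_algo if best_score >= threshold else "unseen"), best_score
```

Framing the decision as a single similarity comparison against per-algorithm voiceprints is what gives the approach its interpretability: the output can be explained as "closest to algorithm X with similarity s", and utterances from generators never seen during enrolment fall out naturally as below-threshold cases rather than requiring a dedicated ensemble or retraining.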

Item Type: Conference or Workshop Item (Paper)
ISSN: 2161-4407; 2161-4393
Page Range: pp. 1-8
Additional Information: Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Keywords: Deepfake Detection, Deepfake Audio, Generative Algorithm Recognition, Synthetic Speech Detection
Subjects: Computing
Related URLs:
Depositing User: Julie Wall
Date Deposited: 18 Jun 2025 09:31
Last Modified: 18 Jun 2025 10:00
URI: https://repository.uwl.ac.uk/id/eprint/13791
Sustainable Development Goals: Goal 9: Industry, Innovation, and Infrastructure
