Robust deepfake speech algorithm recognition: classifying generative algorithms via speaker x-vectors and deep learning.

Maltby, Harry, Wall, Julie (ORCID: https://orcid.org/0000-0001-6714-4867), Glackin, Cornelius, Moniri, Mansour, Shrestha, Roman, Cannings, Nigel and Salami, Iwa (2025) Robust deepfake speech algorithm recognition: classifying generative algorithms via speaker x-vectors and deep learning. In: IEEE International Joint Conference on Neural Networks (IJCNN), 30 June - 5 July 2025, Rome, Italy. (In Press)

PDF (PDF/A): Robust Deepfake Speech Algorithm Recognition_2025_ijcnn__WallJ_accessible.pdf - Accepted Version
Available under License Creative Commons Attribution.

Abstract

The rapid advancement of deepfake voice technologies has resulted in alarming cases of impersonation and deception, highlighting the urgent need for robust tools that can not only distinguish real audio from fake but also recognise the generative algorithms responsible. This capability is essential for forensic investigations, legal proceedings, and regulatory enforcement; without robust and explainable detection frameworks, legal professionals and investigators lack the tools needed to effectively monitor, investigate, and prosecute cases involving deepfake misuse. In this work, we take a voice biometrics approach, shifting the focus from identifying who is speaking to identifying which algorithm is speaking. This allows our approach to inherently handle unseen classes while achieving competitive performance for deepfake speech algorithm recognition. Our system leverages a voice-focused ResNet101-based x-vector extraction model and combines diverse audio features, including our experimental novel feature LFCC-HF, enhanced with Linear Discriminant Analysis and cosine similarity clustering. This approach allows for a more transparent and interpretable decision-making process by using a single voice similarity decision boundary, in contrast to the ensemble-based methods commonly used in the literature. Unlike previous works that rely on an ensemble of models, which convolutes the decision-making process, our method achieves comparable results with a significantly lighter-weight architecture: our model has 14.84 M parameters, compared to 95 M and 317 M for Wav2Vec2 base and large. Furthermore, we demonstrate the benefits of targeted data augmentation, which, combined with feature fusion and our novel feature, improves system robustness and adaptability, increasing our F1 score from 0.624 to 0.763, a 22.275% increase over our best single feature and a 40.775% increase over the best ADD 2023 Track 3 baseline. Importantly, the system achieves interpretability through its back-end classification process, where decisions are based on a transparent, learned threshold for voice similarity to known voiceprints. This work offers a foundation for advancing more robust and interpretable solutions in the field of deepfake speech detection.
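To make the back-end described in the abstract concrete, the following is a minimal illustrative sketch of a voiceprint-style classifier: x-vectors are reduced with Linear Discriminant Analysis, averaged into one centroid per known generation algorithm, and a test x-vector is assigned to the most cosine-similar centroid, or to an "unseen algorithm" class if no similarity exceeds a threshold. This is not the authors' implementation; the scikit-learn API, the function names, and the fixed example threshold are assumptions made purely for illustration (the paper learns its decision boundary from data).

```python
# Illustrative sketch only (not the paper's code): LDA + cosine-similarity
# back-end for deepfake speech algorithm recognition from pre-computed
# x-vectors. Assumes numpy and scikit-learn are available.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics.pairwise import cosine_similarity


def fit_backend(xvectors, labels):
    """Fit LDA on enrolment x-vectors and build one centroid ('voiceprint')
    per known generation algorithm."""
    labels = np.asarray(labels)
    lda = LinearDiscriminantAnalysis()
    reduced = lda.fit_transform(np.asarray(xvectors), labels)
    centroids = {
        algo: reduced[labels == algo].mean(axis=0)
        for algo in np.unique(labels)
    }
    return lda, centroids


def classify(xvector, lda, centroids, threshold=0.5):
    """Assign an x-vector to the most similar known algorithm, or flag it as
    'unseen' when the best cosine similarity falls below the threshold.
    The threshold value here is a placeholder, not the learned one."""
    reduced = lda.transform(np.asarray(xvector).reshape(1, -1))
    scores = {
        algo: cosine_similarity(reduced, c.reshape(1, -1))[0, 0]
        for algo, c in centroids.items()
    }
    best_algo, best_score = max(scores.items(), key=lambda kv: kv[1])
    return (best_algo if best_score >= threshold else "unseen"), best_score
```

Framing the decision as a single similarity comparison against per-algorithm voiceprints is what gives the approach its interpretability: the output can be explained as "closest to algorithm X with similarity s", and utterances from generators never seen during enrolment fall out naturally as below-threshold cases rather than requiring a dedicated ensemble or retraining.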

Item Type: Conference or Workshop Item (Paper)
ISSN: 2161-4407; 2161-4393
Page Range: pp. 1-8
Additional Information: Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Keywords: Deepfake Detection, Deepfake Audio, Generative Algorithm Recognition, Synthetic Speech Detection
Subjects: Computing
Related URLs:
Depositing User: Julie Wall
Date Deposited: 18 Jun 2025 09:31
Last Modified: 18 Jun 2025 10:00
URI: https://repository.uwl.ac.uk/id/eprint/13791
Sustainable Development Goals: Goal 9: Industry, Innovation, and Infrastructure
