The increasing prevalence of audio deepfakes has raised serious concerns due to their potential misuse in identity theft, disinformation, and the compromise of voice authentication systems. Detecting these manipulations requires models capable of handling a wide range of audio features and attack strategies. In this paper, we introduce HCN-TA (Hierarchical Capsule Network with Temporal Attention), a novel architecture designed for scalable and generalizable audio deepfake detection. The hierarchical capsule networks capture local and global audio patterns, while the multi-resolution temporal attention focuses on key segments likely to contain deepfake artifacts. Temporal locality awareness prioritizes critical, rapidly changing regions. We validate HCN-TA on the ASVspoof 2019 (LA) and FoR datasets, achieving equal error rates (EER) of 0.42% and 0.11%, respectively.
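For readers unfamiliar with the reported metric, the following is a minimal sketch of how an equal error rate can be computed from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the toy scores, and the convention that a higher score means "more likely bona fide" are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumed convention: a sample is accepted as bona fide when its
    score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    # Add one threshold above every score, so "reject everything" is covered.
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, best_eer = None, None
    for t in thresholds:
        # False acceptance rate: spoofed samples accepted as bona fide.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide samples rejected as spoofed.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # The EER is reported where the two error rates are closest.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores (hypothetical, not taken from the paper).
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.4]
eer = equal_error_rate(bonafide, spoof)
print(f"EER = {100 * eer:.2f}%")
```

The EER is the operating point at which the false acceptance and false rejection rates are equal, so a lower value indicates better separation between bona fide and spoofed audio.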
Publication details
2025, Proceedings of the 2023 International Conference on Communication, Signal Processing and Computer Engineering, Pages 775-777
HCN-TA: Hierarchical Capsule Network with Temporal Attention for a Generalizable Approach to Audio Deepfake Detection (04b Conference paper in volume)
Wani T. M., Uecker M., Wani F. A., Amerini I.
keywords