Clarke, J., Gotoh, Y. orcid.org/0000-0003-1668-0867 and Goetze, S. (2025) Face-voice association for audiovisual active speaker detection in egocentric recordings. In: 2025 33rd European Signal Processing Conference (EUSIPCO). 2025 33rd European Signal Processing Conference (EUSIPCO), 08-12 Sep 2025, Palermo, Italy. Institute of Electrical and Electronics Engineers, pp. 66-70. ISBN: 9798350391831.
Abstract
Audiovisual active speaker detection (ASD) is conventionally performed by modelling the temporal synchronisation of acoustic and visual speech cues. In egocentric recordings, however, the efficacy of synchronisation-based methods is compromised by occlusions, motion blur, and adverse acoustic conditions. In this work, a novel framework is proposed that exclusively leverages cross-modal face-voice associations to determine speaker activity. An existing face-voice association model is integrated with a transformer-based encoder that aggregates facial identity information by dynamically weighting each frame based on its visual quality. This system is then coupled with a front-end utterance segmentation method, producing a complete ASD system. This work demonstrates that the proposed system, Self-Lifting for audiovisual active speaker detection (SL-ASD), achieves performance comparable to, and in certain cases exceeding, that of parameter-intensive synchronisation-based approaches with significantly fewer learnable parameters, thereby validating the feasibility of substituting strict audiovisual synchronisation modelling with flexible biometric associations in challenging egocentric scenarios. Code is available at https://github.com/jclarke98/SL_ASD.
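The idea of weighting frames by visual quality and then matching a face to a voice can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract describes a transformer-based encoder and a learned face-voice association model, whereas the sketch below substitutes a plain softmax over externally supplied quality scores and a cosine similarity in an assumed shared embedding space. All function names, the `threshold` value, and the toy embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_face_embeddings(frame_embeddings, quality_scores):
    """Quality-weighted average of per-frame face embeddings.

    Assumption: quality_scores are given; in the paper the weighting is
    produced by a transformer-based encoder, which is not modelled here.
    """
    weights = softmax(quality_scores)
    dim = len(frame_embeddings[0])
    return [
        sum(w * emb[i] for w, emb in zip(weights, frame_embeddings))
        for i in range(dim)
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_active_speaker(frame_embeddings, quality_scores, voice_embedding,
                      threshold=0.5):
    """Decide whether a face track matches the voice of an utterance.

    Assumption: face and voice embeddings live in a shared space and a
    fixed (hypothetical) threshold on cosine similarity is sufficient.
    """
    face = aggregate_face_embeddings(frame_embeddings, quality_scores)
    return cosine_similarity(face, voice_embedding) >= threshold


# Toy usage: the first frame is sharp (high score), the second is blurred.
frames = [[1.0, 0.0], [0.0, 1.0]]
quality = [10.0, -10.0]
print(is_active_speaker(frames, quality, voice_embedding=[1.0, 0.0]))
```

In this toy case the blurred frame receives almost no weight, so the aggregated face embedding is dominated by the sharp frame, which is the behaviour the abstract attributes to quality-based frame weighting.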
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Authors/Creators: | Clarke, J.; Gotoh, Y. (orcid.org/0000-0003-1668-0867); Goetze, S. |
| Copyright, Publisher and Additional Information: | © 2025 The Authors. Except as otherwise noted, this author-accepted version of a conference proceeding published in 2025 33rd European Signal Processing Conference (EUSIPCO) is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
| Keywords: | Biometrics; Visualization; Adaptation models; Biological system modeling; Pipelines; Transformers; Acoustics; Recording; Synchronization; Noise measurement |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Date Deposited: | 20 Jan 2026 09:17 |
| Last Modified: | 21 Jan 2026 11:14 |
| Status: | Published |
| Publisher: | Institute of Electrical and Electronics Engineers |
| Refereed: | Yes |
| Identification Number: | 10.23919/eusipco63237.2025.11226795 |
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:236720 |
Download
Filename: Jason_EUSIPCO_2025_camera_ready.pdf
Licence: CC-BY 4.0
