Clarke, J., Gotoh, Y. orcid.org/0000-0003-1668-0867 and Goetze, S. (2025) Speaker embedding informed audiovisual active speaker detection for egocentric recordings. In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Proceedings. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 06-11 Apr 2025, Hyderabad, India. Institute of Electrical and Electronics Engineers (IEEE) ISBN 9798350368758
Abstract
Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from candidate speaker reference speech. This paper proposes the speaker comparison auxiliary network (SCAN), which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes when the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. In keeping with the recent proliferation of wearable devices, this work focuses on improving speaker-embedding-informed ASD in the context of egocentric recordings, which can be characterised by acoustic noise and highly dynamic scenes. SCAN is implemented with two well-established baselines, namely TalkNet and Light-ASD, yielding relative improvements in mAP of 14.5% and 10.3%, respectively, on the Ego4D benchmark.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Clarke, J.; Gotoh, Y. (orcid.org/0000-0003-1668-0867); Goetze, S. |
Copyright, Publisher and Additional Information: | © 2025 The Author(s). Except as otherwise noted, this author-accepted version of a paper published in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Proceedings is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
Keywords: | Diarization; Audiovisual Active Speaker Detection; Video-based Face Recognition; Speaker Recognition |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 18 Feb 2025 11:11 |
Last Modified: | 14 Mar 2025 16:29 |
Status: | Published |
Publisher: | Institute of Electrical and Electronics Engineers (IEEE) |
Refereed: | Yes |
Identification Number: | 10.1109/ICASSP49660.2025.10890414 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:223485 |