Speaker embedding informed audiovisual active speaker detection for egocentric recordings

Clarke, J., Gotoh, Y. orcid.org/0000-0003-1668-0867 and Goetze, S. (2025) Speaker embedding informed audiovisual active speaker detection for egocentric recordings. In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Proceedings. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 06-11 Apr 2025, Hyderabad, India. Institute of Electrical and Electronics Engineers (IEEE) ISBN 9798350368758

Abstract

Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from candidate speaker reference speech. This paper proposes the speaker comparison auxiliary network (SCAN), which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes when the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. In keeping with the recent proliferation of wearable devices, this work focuses on improving speaker-embedding-informed ASD in the context of egocentric recordings, which can be characterised by acoustic noise and highly dynamic scenes. SCAN is implemented with two well-established baselines, namely TalkNet and Light-ASD, yielding relative improvements in mAP of 14.5% and 10.3%, respectively, on the Ego4D benchmark.
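
The idea of comparing speaker-specific information from reference speech with the candidate audio signal can be illustrated with a minimal sketch. This is not the SCAN architecture described in the paper, only a generic cosine-similarity comparison between two embedding vectors, which is a common way of scoring speaker similarity. The function name, the variable names, and the three-dimensional toy vectors are all hypothetical; real speaker embeddings are produced by a trained model and typically have hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented purely for illustration.
reference_embedding = [1.0, 0.0, 1.0]   # from the candidate's reference speech
same_speaker_audio = [2.0, 0.0, 2.0]    # scene audio pointing the same way
other_speaker_audio = [0.0, 3.0, 0.0]   # scene audio orthogonal to the reference

same_score = cosine_similarity(reference_embedding, same_speaker_audio)
other_score = cosine_similarity(reference_embedding, other_speaker_audio)
```

In this toy setting, `same_score` is close to 1 and `other_score` is 0, so a downstream detector could in principle use such a score as an extra cue when the visual signal is uninformative. How SCAN actually combines this information with the audiovisual features is specified in the paper, not here.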

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Clarke, J.; Gotoh, Y. (https://orcid.org/0000-0003-1668-0867); Goetze, S.
Copyright, Publisher and Additional Information:	© 2025 The Author(s). Except as otherwise noted, this author-accepted version of a paper published in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Proceedings is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/
Keywords:	Diarization; Audiovisual Active Speaker Detection; Video-based Face Recognition; Speaker Recognition
Dates:	Accepted: 20 December 2024; Published (online): 7 March 2025; Published: 7 March 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Depositing User:	Symplectic Sheffield
Date Deposited:	18 Feb 2025 11:11
Last Modified:	14 Mar 2025 16:29
Status:	Published
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
Refereed:	Yes
Identification Number:	10.1109/ICASSP49660.2025.10890414
Related URLs:	Conference
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:223485

Download

Accepted Version

Filename: Jason_ICASSP_2025.pdf

Licence: CC-BY 4.0
