Improving audiovisual active speaker detection in egocentric recordings with the data-efficient image transformer

Clarke, J., Gotoh, Y. and Goetze, S. orcid.org/0000-0003-1044-7343 (2024) Improving audiovisual active speaker detection in egocentric recordings with the data-efficient image transformer. In: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2023), 16-20 Dec 2023, Taipei, Taiwan. Institute of Electrical and Electronics Engineers (IEEE). ISBN 979-8-3503-0690-3

Abstract

Future augmented reality devices have the capacity to enhance human perception and provide assistive functions in complex communication scenarios. Active speaker detection (ASD) systems that are robust to egocentric data are critical to this. Egocentric ASD is challenging due to overlapping speech, single-channel recording, and dynamic scenes. This paper proposes a novel module that uses a data-efficient image transformer (DeiT) to extract features encapsulating the acoustic properties of each scene, together with a positional conditioning mechanism. The module is evaluated in conjunction with TalkNet, an existing ASD architecture, on two audiovisual datasets: Ego4D (egocentric) and AVA-ActiveSpeaker (exocentric), achieving 29% and 0.38% relative improvement in mean Average Precision (mAP), respectively, while retaining a parameter-efficient build. A qualitative analysis is also presented, implicitly demonstrating that contextual information is leveraged.

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Clarke, J.; Gotoh, Y.; Goetze, S. https://orcid.org/0000-0003-1044-7343
Copyright, Publisher and Additional Information:	© 2023 The Authors. Except as otherwise noted, this author-accepted version of a proceedings paper published in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/
Keywords:	Active speaker detection; context modelling; data-efficient image transformers
Dates:	Accepted: 22 September 2023; Published (online): 19 January 2024; Published: 19 January 2024
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Funding Information:	META PLATFORM INC (grant number: UNSPECIFIED); Engineering and Physical Sciences Research Council (grant number: 2638501); Engineering and Physical Sciences Research Council (grant number: 2588133)
Depositing User:	Symplectic Sheffield
Date Deposited:	12 Oct 2023 15:43
Last Modified:	09 Feb 2024 15:24
Status:	Published
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
Refereed:	Yes
Identification Number:	10.1109/ASRU57964.2023.10389764
Related URLs:	Conference
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:204202

Download

Accepted Version

Filename: ASRU_AV-ASD_DeiT_cr.pdf

Licence: CC-BY 4.0