Clarke, J., Gotoh, Y. and Goetze, S. orcid.org/0000-0003-1044-7343 (2024) Improving audiovisual active speaker detection in egocentric recordings with the data-efficient image transformer. In: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2023), 16-20 Dec 2023, Taipei, Taiwan. Institute of Electrical and Electronics Engineers (IEEE). ISBN 979-8-3503-0690-3
Abstract
Future augmented reality devices have the capacity to enhance human perception and provide assistive functions in complex communication scenarios. Active speaker detection (ASD) systems that are robust to egocentric data are critical to this. Egocentric ASD is challenging due to overlapping speech, single-channel recording, and dynamic scenes. This paper proposes a novel module that uses a data-efficient image transformer (DeiT) to extract features encapsulating the acoustic properties of each scene, together with a positional conditioning mechanism. The module is evaluated in conjunction with TalkNet, an existing ASD architecture, on two audiovisual datasets: Ego4D (egocentric) and AVA-ActiveSpeaker (exocentric), achieving 29% and 0.38% relative improvement in mean Average Precision (mAP), respectively, while retaining a parameter-efficient build. A qualitative analysis is also presented, implicitly demonstrating that contextual information is leveraged.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Clarke, J., Gotoh, Y. and Goetze, S. |
Copyright, Publisher and Additional Information: | © 2023 The Authors. Except as otherwise noted, this author-accepted version of a proceedings paper published in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
Keywords: | Active speaker detection; context modelling; data-efficient image transformers |
Dates: | |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Funding Information: | Funder (grant number): META PLATFORM INC (UNSPECIFIED); Engineering and Physical Sciences Research Council (2638501); Engineering and Physical Sciences Research Council (2588133) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 12 Oct 2023 15:43 |
Last Modified: | 09 Feb 2024 15:24 |
Status: | Published |
Publisher: | Institute of Electrical and Electronics Engineers (IEEE) |
Refereed: | Yes |
Identification Number: | 10.1109/ASRU57964.2023.10389764 |
Related URLs: | |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:204202 |