Detection of activity and position of speakers by using deep neural networks and acoustic data augmentation

Abstract

Speaker LOCalization (SLOC) has been the focus of numerous research works, in which it is typically performed on pure speech data and therefore requires an oracle Voice Activity Detection (VAD) algorithm. This ideal working condition is not satisfied in real-world scenarios, where the VADs actually employed do commit errors. This work addresses the issue with an extensive analysis of the relationship between several data-driven VAD and SLOC models, and finally proposes a reliable framework for VAD and SLOC. The effectiveness of the approach is assessed in a multi-room scenario, which is close to a real-world environment. Furthermore, to the best of the authors’ knowledge, only one other contribution proposes a single framework for VAD and SLOC in this scenario, and that solution does not rely on data-driven approaches.

This work extends the authors’ previous research on the VAD and SLOC tasks by proposing numerous advancements to the original neural network architectures. In detail, four different models based on convolutional neural networks (CNNs) are tested, in order to highlight the advantages of the introduced novelties. In addition, two different CNN models are studied for SLOC. Furthermore, the training of the data-driven models is improved through a specific data augmentation technique. In this procedure, the room impulse responses (RIRs) of two virtual rooms are generated from knowledge of the room size, the reverberation time, and the placement of microphones and sources. Finally, the only other framework for simultaneous detection and localization in a multi-room scenario is taken into account for a fair comparison with the proposed method.

As a result, the proposed method is more accurate than the baseline framework, and remarkable improvements are observed especially when the data augmentation techniques are applied to both the VAD and SLOC tasks.
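
The augmentation step mentioned in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the authors' RIR generator, which works from the room size, reverberation time, and microphone and source placement. Instead it assumes a much simpler stand-in: an exponentially decaying noise tail parameterised only by the reverberation time, which is then convolved with clean speech. The function names `synthetic_rir` and `augment`, the 16 kHz sampling rate, and the peak normalisation are all assumptions made for illustration.

```python
import numpy as np


def synthetic_rir(rt60_s, fs=16000, length_s=None, seed=0):
    """Toy RIR: white noise shaped by an exponential decay matching rt60_s.

    This is a simplification (assumption), not a geometry-based room simulation.
    """
    rng = np.random.default_rng(seed)
    n = int(fs * (length_s if length_s is not None else rt60_s))
    t = np.arange(n) / fs
    # Amplitude envelope that has dropped by 60 dB after rt60_s seconds.
    envelope = 10.0 ** (-3.0 * t / rt60_s)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # crude direct-path component
    return rir / np.max(np.abs(rir))


def augment(speech, rir):
    """Reverberate a clean signal by convolving it with an RIR.

    The output is truncated to the input length and peak-normalised.
    """
    wet = np.convolve(speech, rir)[: len(speech)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


if __name__ == "__main__":
    fs = 16000
    # One second of noise stands in for a clean speech recording.
    clean = np.random.default_rng(1).standard_normal(fs)
    reverberant = augment(clean, synthetic_rir(rt60_s=0.5, fs=fs))
    print(reverberant.shape)
```

In a training pipeline, each clean utterance would be passed through `augment` with RIRs drawn for different reverberation times, so the VAD and SLOC models see more acoustic variety than the recorded data alone provides.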

Metadata

Item Type:	Article
Authors/Creators:	Vecchiotti, P.; Pepe, G.; Principi, E.; Squartini, S.
Copyright, Publisher and Additional Information:	© 2019 Elsevier. This is an author produced version of a paper subsequently published in Expert Systems with Applications. Uploaded in accordance with the publisher's self-archiving policy. Article available under the terms of the CC-BY-NC-ND licence (https://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords:	Voice activity detection; Speaker localization; Data augmentation; Multi-room environment; Deep learning
Dates:	Accepted: 13 May 2019; Published (online): 16 May 2019; Published: November 2019
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Depositing User:	Symplectic Sheffield
Date Deposited:	02 Oct 2019 10:38
Last Modified:	16 May 2020 00:38
Status:	Published
Publisher:	Elsevier
Refereed:	Yes
Identification Number:	10.1016/j.eswa.2019.05.017
Related URLs:	Author
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:151548

Download

Accepted Version

Filename: ESWA_Journal.pdf

Licence: CC-BY-NC-ND 4.0
