Vecchiotti, P., Pepe, G., Principi, E. et al. (1 more author) (2019) Detection of activity and position of speakers by using deep neural networks and acoustic data augmentation. Expert Systems with Applications, 134. pp. 53-65. ISSN 0957-4174
Abstract
The task of Speaker LOCalization (SLOC) has been the focus of numerous works in the research field, where SLOC is performed on pure speech data, requiring the presence of an Oracle Voice Activity Detection (VAD) algorithm. Nevertheless, this perfect working condition is not satisfied in a real world scenario, where employed VADs do commit errors. This work addresses this issue with an extensive analysis focusing on the relationship between several data-driven VAD and SLOC models, finally proposing a reliable framework for VAD and SLOC. The effectiveness of the approach here discussed is assessed against a multi-room scenario, which is close to a real-world environment. Furthermore, up to the authors’ best knowledge, only one contribution proposes a unique framework for VAD and SLOC acting in this addressed scenario; however, this solution does not rely on data-driven approaches.
This work comes as an extension of the authors’ previous research addressing the VAD and SLOC tasks, by proposing numerous advancements to the original neural network architectures. In details, four different models based on convolutional neural networks (CNNs) are here tested, in order to easily highlight the advantages of the introduced novelties. In addition, two different CNN models go under study for SLOC. Furthermore, training of data-driven models is here improved through a specific data augmentation technique. During this procedure, the room impulse responses (RIRs) of two virtual rooms are generated from the knowledge of the room size, reverberation time and microphones and sources placement. Finally, the only other framework for simultaneous detection and localization in a multi-room scenario is here taken into account to fairly compare the proposed method.
As result, the proposed method is more accurate than the baseline framework, and remarkable improvements are specially observed when the data augmentation techniques are applied for both the VAD and SLOC tasks.
Metadata
Item Type: | Article |
---|---|
Authors/Creators: |
|
Copyright, Publisher and Additional Information: | © 2019 Elsevier. This is an author produced version of a paper subsequently published in Expert Systems with Applications. Uploaded in accordance with the publisher's self-archiving policy. Article available under the terms of the CC-BY-NC-ND licence (https://creativecommons.org/licenses/by-nc-nd/4.0/). |
Keywords: | Voice activity detection; Speaker localization; Data augmentation; Multi-room environment; Deep learning |
Dates: |
|
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 02 Oct 2019 10:38 |
Last Modified: | 16 May 2020 00:38 |
Status: | Published |
Publisher: | Elsevier |
Refereed: | Yes |
Identification Number: | 10.1016/j.eswa.2019.05.017 |
Related URLs: | |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:151548 |