Ravenscroft, W. orcid.org/0000-0002-0780-3303, Goetze, S. and Hain, T. (Accepted: 2023) Combining conformer and dual-path-transformer networks for single channel noisy reverberant speech separation. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), 14-19 Apr 2024, Seoul, Korea. Institute of Electrical and Electronics Engineers (IEEE), pp. 11491-11495. ISBN 979-8-3503-4486-8
Abstract
Separation of overlapping speakers remains an active area of speech technology research. Many deep neural network (DNN) separation models model local and global temporal context separately using alternating DNN layers. Two such models are SepFormer and TD-Conformer. The largest configurations of each have comparable computational cost and similar performance, with SepFormer performing better on anechoic data and TD-Conformer yielding better results on noisy reverberant data. This work combines these two model types to gain insights into how their computational characteristics affect their performance. The generalization benefits of the larger model size of the conformer layers are demonstrated on both WHAMR and the out-of-domain far-field evaluation set MC-WSJ-AV across a number of evaluation metrics. The proposed model achieves 22.1 dB and 14.7 dB average scale-invariant signal-to-distortion ratio (SISDR) improvement when trained and evaluated on WSJ0-2Mix and WHAMR, respectively. The model trained on WHAMR achieves 4.3 dB average SISDR improvement on the out-of-domain MC-WSJ-AV dataset.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Ravenscroft, W., Goetze, S. and Hain, T. |
Copyright, Publisher and Additional Information: | © 2024 The Author(s). Except as otherwise noted, this author-accepted version of a conference paper published in International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
Keywords: | speech separation; speech enhancement; neural networks; conformer; dual-path transformer |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Funding Information: | Funder: Engineering and Physical Sciences Research Council; Grant number: 2268977 |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 25 Jan 2024 09:44 |
Last Modified: | 28 Mar 2024 12:15 |
Status: | Published |
Publisher: | Institute of Electrical and Electronics Engineers (IEEE) |
Refereed: | Yes |
Identification Number: | 10.1109/ICASSP48485.2024.10447644 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:207516 |