Shi, Y. and Hain, T. orcid.org/0000-0003-0939-3464 (2021) Supervised speaker embedding de-mixing in two-speaker environment. In: 2021 IEEE Spoken Language Technology Workshop (SLT). 2021 IEEE Spoken Language Technology Workshop (SLT), 19-22 Jan 2021, Shenzhen, China. Institute of Electrical and Electronics Engineers, pp. 758-765. ISBN 9781728170671
Abstract
Separating the properties of different speakers in a multi-speaker environment is challenging. Instead of separating a two-speaker signal in signal space, as speech source separation does, a speaker embedding de-mixing approach is proposed that separates the speaker properties of a two-speaker signal in embedding space. The approach has two steps. In step one, clean speaker embeddings are learned and collected by a residual TDNN-based network. In step two, the two-speaker signal and the embedding of one of the speakers are both input to a speaker embedding de-mixing network, which is trained with a reconstruction loss to generate the embedding of the other speaker. Speaker identification accuracy and the cosine similarity score between the clean embeddings and the de-mixed embeddings are used to evaluate the quality of the obtained embeddings. Experiments are conducted on two kinds of data: artificially augmented two-speaker data (TIMIT) and real-world recordings of two-speaker data (MC-WSJ). Six different speaker embedding de-mixing architectures are investigated. Compared with the performance of the clean speaker embeddings, one of the proposed architectures achieves close performance, reaching 96.9% identification accuracy and 0.89 cosine similarity.
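The cosine similarity score mentioned in the abstract can be illustrated with a minimal sketch. The function names, the averaging over pairs, and the toy three-dimensional vectors below are assumptions made for illustration only; the abstract does not specify how the paper aggregates the scores or what the embedding dimension is.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_cosine_score(clean, demixed):
    """Average pairwise cosine similarity (hypothetical aggregation).

    clean[i] is the reference embedding of a speaker and demixed[i] is the
    embedding produced for the same speaker by a de-mixing network.
    """
    if not clean or len(clean) != len(demixed):
        raise ValueError("need two non-empty lists of equal length")
    scores = [cosine_similarity(c, d) for c, d in zip(clean, demixed)]
    return sum(scores) / len(scores)


# Toy vectors, not data from the paper.
clean = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
demixed = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
score = mean_cosine_score(clean, demixed)
print(f"mean cosine similarity: {score:.3f}")
```

Because cosine similarity ignores vector length, the first toy pair scores 1.0 even though the two vectors differ in magnitude; only the direction of the de-mixed embedding relative to the clean one is measured.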
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Shi, Y. and Hain, T. (orcid.org/0000-0003-0939-3464) |
Copyright, Publisher and Additional Information: | © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. Reproduced in accordance with the publisher's self-archiving policy. |
Keywords: | Speaker Embeddings; Speech Source Separation; Speaker De-mixing; Speaker Identification; Two-Speaker Signal |
Dates: | |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 15 Jul 2022 11:00 |
Last Modified: | 17 Jul 2022 15:20 |
Status: | Published |
Publisher: | Institute of Electrical and Electronics Engineers |
Refereed: | Yes |
Identification Number: | 10.1109/slt48900.2021.9383580 |
Related URLs: | |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:189097 |