Supervised speaker embedding de-mixing in two-speaker environment

Abstract

Separating different speaker properties from a multi-speaker environment is challenging. Instead of separating a two-speaker signal in signal space, as speech source separation does, a speaker embedding de-mixing approach is proposed. The proposed approach separates different speaker properties from a two-speaker signal in embedding space, and consists of two steps. In step one, clean speaker embeddings are learned and collected by a residual TDNN-based network. In step two, the two-speaker signal and the embedding of one of the speakers are both input to a speaker embedding de-mixing network. The de-mixing network is trained with a reconstruction loss to generate the embedding of the other speaker. Speaker identification accuracy and the cosine similarity score between the clean embeddings and the de-mixed embeddings are used to evaluate the quality of the obtained embeddings. Experiments are conducted on two kinds of data: artificially augmented two-speaker data (TIMIT) and real-world recordings of two-speaker data (MC-WSJ). Six different speaker embedding de-mixing architectures are investigated. Compared with the performance of the clean speaker embeddings, the results show that one of the proposed architectures achieves close performance, reaching 96.9% identification accuracy and 0.89 cosine similarity.
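The two evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy three-dimensional embeddings, and the nearest-enrolled-embedding identification rule are all assumptions made for illustration, and the paper may identify speakers with a trained classifier instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_cosine_similarity(clean, demixed):
    """Average cosine similarity over paired clean and de-mixed embeddings."""
    scores = [cosine_similarity(c, d) for c, d in zip(clean, demixed)]
    return sum(scores) / len(scores)


def identify(embedding, enrolled):
    """Assumed rule: pick the enrolled speaker with the most similar embedding."""
    return max(enrolled, key=lambda spk: cosine_similarity(embedding, enrolled[spk]))


def identification_accuracy(demixed, labels, enrolled):
    """Fraction of de-mixed embeddings assigned to the correct speaker."""
    correct = sum(identify(e, enrolled) == y for e, y in zip(demixed, labels))
    return correct / len(labels)


# Hypothetical toy data: two enrolled speakers and two de-mixed embeddings.
enrolled = {"spk_a": [1.0, 0.0, 0.0], "spk_b": [0.0, 1.0, 0.0]}
demixed = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
labels = ["spk_a", "spk_b"]
clean = [enrolled[y] for y in labels]

print(identification_accuracy(demixed, labels, enrolled))
print(mean_cosine_similarity(clean, demixed))
```

In this sketch a de-mixed embedding counts as correct when it lies closest to the clean embedding of the speaker it is meant to represent, so both measures compare the de-mixed embeddings against the clean ones from step one.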

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Shi, Y.; Hain, T. (ORCID: https://orcid.org/0000-0003-0939-3464)
Copyright, Publisher and Additional Information:	© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. Reproduced in accordance with the publisher's self-archiving policy.
Keywords:	Speaker Embeddings; Speech Source Separation; Speaker De-mixing; Speaker Identification; Two-Speaker Signal
Dates:	Published (online): 25 March 2021 Published: 25 March 2021
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Depositing User:	Symplectic Sheffield
Date Deposited:	15 Jul 2022 11:00
Last Modified:	17 Jul 2022 15:20
Status:	Published
Publisher:	Institute of Electrical and Electronics Engineers
Refereed:	Yes
Identification Number:	10.1109/slt48900.2021.9383580
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:189097
