H-vectors : utterance-level speaker embedding using a hierarchical attention model

Shi, Y., Huang, Q. and Hain, T. orcid.org/0000-0003-0939-3464 (2020) H-vectors : utterance-level speaker embedding using a hierarchical attention model. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing, 04-08 May 2020, Barcelona, Spain (virtual). Institute of Electrical and Electronics Engineers, pp. 7579-7583. ISBN 9781509066322

Abstract

This paper proposes a hierarchical attention network that generates utterance-level embeddings (H-vectors) for speaker identification and verification. Because different parts of an utterance may contribute differently to speaker identity, the hierarchical structure aims to learn speaker-related information both locally and globally. In the proposed approach, a frame-level encoder and attention are applied to segments of an input utterance to generate individual segment vectors. Segment-level attention is then applied to the segment vectors to construct an utterance representation. For evaluation, data from NIST SRE2008 Part1 is used for training, and two datasets, Switchboard Cellular (Part1) and CallHome American English Speech, are used to assess the quality of the extracted utterance embeddings on speaker identification and verification tasks. Compared with two baselines, X-vectors and X-vectors+Attention, H-vectors achieve significantly better performance. Furthermore, the learned utterance-level embeddings are more discriminative than those of the two baselines when mapped into a 2D space using t-SNE.
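
The two-level pooling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the encoder, the attention parameterisation, or the segment length, so the sketch assumes an identity encoder, simple dot-product attention against a query vector, and non-overlapping fixed-length segments. The names `attention_pool`, `utterance_embedding`, `frame_query`, and `segment_query` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weighted average of `vectors`; weights are a softmax over dot products
    with `query` (an assumed, simplified form of attention)."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def utterance_embedding(frames, seg_len, frame_query, segment_query):
    """Hierarchical pooling: frames -> segment vectors -> one utterance vector."""
    # Split the utterance into non-overlapping segments (assumption; the last
    # segment may be shorter than seg_len).
    segments = [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]
    # Local step: frame-level attention inside each segment.
    segment_vectors = [attention_pool(seg, frame_query)[0] for seg in segments]
    # Global step: segment-level attention over the segment vectors.
    embedding, segment_weights = attention_pool(segment_vectors, segment_query)
    return embedding, segment_weights


# Toy input: five 2-dimensional "frames" with made-up values.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
embedding, segment_weights = utterance_embedding(
    frames, seg_len=2, frame_query=[0.5, -0.5], segment_query=[1.0, 1.0]
)
```

In a trained model the query vectors and the encoder would be learned jointly with a speaker classification objective; here they are fixed constants so that the data flow from frames to segments to a single utterance vector is easy to follow.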

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Shi, Y.; Huang, Q.; Hain, T. (https://orcid.org/0000-0003-0939-3464)
Copyright, Publisher and Additional Information:	© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. Reproduced in accordance with the publisher's self-archiving policy.
Keywords:	Speaker Embeddings; Speaker Identification; Hierarchical Attention; X-vectors; Attention Mechanism
Dates:	Published (online): 9 April 2020; Published: 9 April 2020
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Funding Information:	Innovate UK (grant number 104264)
Depositing User:	Symplectic Sheffield
Date Deposited:	15 Jul 2022 10:48
Last Modified:	20 Jul 2022 09:50
Status:	Published
Publisher:	Institute of Electrical and Electronics Engineers
Refereed:	Yes
Identification Number:	10.1109/icassp40776.2020.9054448
Related URLs:	Author Conference
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:189093
