Nomo Sudro, P., Ragni, A. and Hain, T. orcid.org/0000-0003-0939-3464 (2023) Adapting pretrained models for adult to child voice conversion. In: 2023 31st European Signal Processing Conference (EUSIPCO) Proceedings. 2023 31st European Signal Processing Conference (EUSIPCO), 04-08 Sep 2023, Helsinki, Finland. Institute of Electrical and Electronics Engineers (IEEE), pp. 271-275. ISBN 9789464593600
Abstract
Due to the widespread lack of parallel data for adult-to-child voice conversion (VC), non-parallel VC techniques have grown in popularity. Methods such as encoder-decoder models have achieved good performance in adult-to-adult VC. They offer flexibility, since each module can either be trained separately or replaced by a pretrained model. However, these pretrained models are only available for adult speech. For children's speech, there is not enough data to train all the modules of a robust encoder-decoder based VC system. In a limited-data scenario, only the decoder module can be trained, using target speech. Specifically, we find that adult-to-child VC using a pretrained encoder and a decoder trained on child speech does not reproduce the spectral variability of child speech. The reason is the gross spectral mismatch between adult and child speech. We address this mismatch by exploiting a warping mechanism that modifies the acoustic attributes based on child speech. We conduct objective and subjective evaluations on the CMU and CSLU kids corpora and on data from one adult actress. Results show that the proposed method reduces MCD and F0 RMSE by 0.67 and 0.03, respectively. In subjective evaluations, we observe a relative MOS improvement of 10.7% for naturalness and 18.23% for similarity.
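For readers unfamiliar with the objective measures named in the abstract, the following is a minimal sketch of how mel-cepstral distortion (MCD), F0 RMSE, and a relative improvement are commonly computed. It is not the authors' evaluation code: the abstract does not state the exact definitions, units, or alignment procedure used, so the function names, the assumption of pre-aligned frames, the exclusion of the energy coefficient, and the voiced-frame rule (F0 > 0) are all illustrative assumptions.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB over frames of mel-cepstral coefficients.

    Assumption: both sequences are already time-aligned (e.g. by DTW)
    and the 0th (energy) coefficient has been dropped beforehand.
    """
    if len(ref_frames) != len(conv_frames) or not ref_frames:
        raise ValueError("inputs must be non-empty and time-aligned")
    # Standard scaling constant: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        total += scale * math.sqrt(sum((r - c) ** 2 for r, c in zip(ref, conv)))
    return total / len(ref_frames)


def f0_rmse(ref_f0, conv_f0):
    """RMSE between two F0 contours, using only frames voiced in both.

    Assumption: unvoiced frames are marked with F0 == 0.
    """
    pairs = [(r, c) for r, c in zip(ref_f0, conv_f0) if r > 0 and c > 0]
    if not pairs:
        raise ValueError("no frames voiced in both contours")
    return math.sqrt(sum((r - c) ** 2 for r, c in pairs) / len(pairs))


def relative_improvement(baseline, proposed):
    """Relative improvement of a score (e.g. MOS) in percent."""
    return 100.0 * (proposed - baseline) / baseline
```

Lower MCD and F0 RMSE indicate that converted speech is closer to the target, whereas a higher MOS is better, which is why the abstract reports reductions for the former and a relative improvement for the latter.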
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Nomo Sudro, P.; Ragni, A.; Hain, T. (orcid.org/0000-0003-0939-3464) |
Copyright, Publisher and Additional Information: | © 2023 The Authors. Except as otherwise noted, this author-accepted version of a paper published in 2023 31st European Signal Processing Conference (EUSIPCO) Proceedings is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
Keywords: | Child speech; adult speech; voice conversion; encoder-decoder model |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 06 Oct 2023 09:07 |
Last Modified: | 13 Nov 2023 16:20 |
Status: | Published |
Publisher: | Institute of Electrical and Electronics Engineers (IEEE) |
Refereed: | Yes |
Identification Number: | 10.23919/EUSIPCO58844.2023.10289993 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:203759 |