Flynn, R. and Ragni, A. orcid.org/0000-0003-0634-4456 (Submitted: 2023) How much context does my attention-based ASR system need? [Preprint - arXiv] (Submitted)
Abstract
For the task of speech recognition, the use of more than 30 seconds of acoustic context during training is uncommon and under-investigated in the literature. In this work, we examine how scaling the sequence length used to train and evaluate dense-attention-based acoustic and language models affects speech recognition performance. For these experiments we use a dataset of roughly 100,000 pseudo-labelled Spotify podcasts, exploring context lengths from 5 seconds to 1 hour. Zero-shot evaluations on the long-format datasets Earnings-22 and Tedlium demonstrate a benefit from training with around 80 seconds of acoustic context, showing up to a 14.9% relative improvement over a limited-context baseline. Furthermore, we combine the acoustic model with long-context transformer language models via beam search to obtain a fully long-context ASR system, with results that are competitive with the current state of the art.
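The system combination mentioned in the abstract interpolates acoustic-model and language-model scores inside beam search. A common way to do this is shallow fusion, where each hypothesis is scored as the acoustic log-probability plus a weighted language-model log-probability. The following is a minimal, self-contained sketch of that idea with toy hand-written score tables standing in for the real models; the vocabulary, score values, and `lm_weight` are illustrative assumptions, not the paper's actual configuration.

```python
import math

# Toy vocabulary; "</s>" terminates a hypothesis.
VOCAB = ["a", "b", "</s>"]

def am_logprob(prefix, tok):
    # Hypothetical acoustic-model scores: after two tokens, ending is likeliest.
    if len(prefix) >= 2:
        table = {"a": 0.2, "b": 0.1, "</s>": 0.7}
    else:
        table = {"a": 0.6, "b": 0.3, "</s>": 0.1}
    return math.log(table[tok])

def lm_logprob(prefix, tok):
    # Hypothetical language-model scores: prefers ending once context exists.
    if prefix:
        table = {"a": 0.2, "b": 0.1, "</s>": 0.7}
    else:
        table = {"a": 0.5, "b": 0.4, "</s>": 0.1}
    return math.log(table[tok])

def beam_search(beam_size=2, lm_weight=0.5, max_len=5):
    """Beam search with shallow fusion: score = log p_AM + lm_weight * log p_LM."""
    beams = [((), 0.0)]  # (token tuple, combined log-score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok in VOCAB:
                s = score + am_logprob(prefix, tok) + lm_weight * lm_logprob(prefix, tok)
                candidates.append((prefix + (tok,), s))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, s in candidates[:beam_size]:
            # Hypotheses ending in "</s>" are complete; others survive to the next step.
            (finished if prefix[-1] == "</s>" else beams).append((prefix, s))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

best, score = beam_search()
print(best, round(score, 4))  # ('a', 'a', '</s>') -2.708
```

With `lm_weight=0.0` the search reduces to acoustic-only decoding; tuning that weight on held-out data is the usual way such combinations are calibrated.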
Metadata
| Item Type: | Preprint |
|---|---|
| Authors/Creators: | Flynn, R.; Ragni, A. |
| Copyright, Publisher and Additional Information: | © 2023 The Author(s). This preprint is made available under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). |
| Dates: | Submitted: 2023 |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Funding Information: | Funder: META PLATFORM INC; Grant number: UNSPECIFIED |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 07 Jun 2024 13:28 |
| Last Modified: | 10 Jun 2024 11:59 |
| Status: | Submitted |
| Identification Number (DOI): | 10.48550/arXiv.2310.15672 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:213154 |