Flynn, R. and Ragni, A. (ORCID: 0000-0003-0634-4456) (2023) How much context does my attention-based ASR system need? [Preprint, arXiv] (Submitted)
Abstract
For the task of speech recognition, the use of more than 30 seconds of acoustic context during training is uncommon and under-investigated in the literature. In this work, we examine the effect on speech recognition performance of scaling the sequence length used to train and evaluate dense-attention-based acoustic and language models. For these experiments, a dataset of roughly 100,000 pseudo-labelled Spotify podcasts is used, with context lengths of 5 seconds to 1 hour being explored. Zero-shot evaluations on the long-format datasets Earnings-22 and TED-LIUM demonstrate a benefit from training with around 80 seconds of acoustic context, showing up to a 14.9% relative improvement over a limited-context baseline. Furthermore, we perform system combination with long-context transformer language models via beam search, yielding a fully long-context ASR system whose results are competitive with the current state of the art.
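The system combination with a language model via beam search that the abstract describes is commonly realised as shallow fusion, where each hypothesis expansion is scored by a weighted sum of acoustic-model (AM) and language-model (LM) log-probabilities. The sketch below illustrates that general idea only; the function names, the `am_step`/`lm_step` interfaces, and the `lm_weight` value are illustrative assumptions, not the authors' implementation (the abstract does not specify the combination weights).

```python
from typing import Callable, Dict, List

def shallow_fusion_beam_search(
    am_step: Callable[[List[int]], Dict[int, float]],  # prefix -> {token: AM log-prob}
    lm_step: Callable[[List[int]], Dict[int, float]],  # prefix -> {token: LM log-prob}
    eos_id: int,
    beam_size: int = 4,
    lm_weight: float = 0.5,  # assumed value; the paper's actual weight is not given here
    max_len: int = 200,
) -> List[int]:
    """Beam search where each expansion is scored by the fused
    log-probability: log p_AM(token) + lm_weight * log p_LM(token)."""
    beams = [(0.0, [])]  # (cumulative fused score, token prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            am_logp = am_step(prefix)
            lm_logp = lm_step(prefix)
            for tok, am_lp in am_logp.items():
                # fuse AM and LM scores for this candidate token
                fused = score + am_lp + lm_weight * lm_logp.get(tok, float("-inf"))
                candidates.append((fused, prefix + [tok]))
        # keep only the top `beam_size` expansions
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for fused, prefix in candidates[:beam_size]:
            if prefix[-1] == eos_id:
                finished.append((fused, prefix))  # hypothesis ended
            else:
                beams.append((fused, prefix))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[0])
    return best[1]
```

In practice the fusion weight is typically tuned on a development set, and length normalisation is often added when comparing finished hypotheses; both are omitted here for brevity.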
Metadata
| Item Type: | Preprint |
|---|---|
| Authors/Creators: | Flynn, R. and Ragni, A. (ORCID: 0000-0003-0634-4456) |
| Copyright, Publisher and Additional Information: | © 2023 The Author(s). This preprint is made available under a Creative Commons Attribution 4.0 International License. (https://creativecommons.org/licenses/by/4.0/) |
| Dates: | Submitted: 2023 |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Funding Information: | Funder: META PLATFORM INC; Grant number: UNSPECIFIED |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 07 Jun 2024 13:28 |
| Last Modified: | 10 Jun 2024 11:59 |
| Status: | Submitted |
| Identification Number (DOI): | 10.48550/arXiv.2310.15672 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:213154 |
