Flynn, R. and Ragni, A. (ORCID: 0000-0003-0634-4456) (2023) How much context does my attention-based ASR system need? [Preprint, arXiv] (Submitted)
Abstract
For the task of speech recognition, the use of more than 30 seconds of acoustic context during training is uncommon and under-investigated in the literature. In this work, we examine the effect on speech recognition performance of scaling the sequence length used to train and evaluate dense-attention-based acoustic and language models. For these experiments, a dataset of roughly 100,000 pseudo-labelled Spotify podcasts is used, with context lengths of 5 seconds to 1 hour being explored. Zero-shot evaluations on the long-format datasets Earnings-22 and TED-LIUM demonstrate a benefit from training with around 80 seconds of acoustic context, showing up to a 14.9% relative improvement over a limited-context baseline. Furthermore, we perform system combination with long-context transformer language models via beam search, yielding a fully long-context ASR system whose results are competitive with the current state of the art.
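The system combination with a language model via beam search that the abstract describes is commonly realised as shallow fusion, where each hypothesis expansion is scored by a weighted sum of acoustic-model (AM) and language-model (LM) log-probabilities. The sketch below illustrates that general idea only; the function names, the `am_step`/`lm_step` interfaces, and the `lm_weight` value are illustrative assumptions, not the authors' implementation (the abstract does not specify the combination weights).

```python
from typing import Callable, Dict, List

def shallow_fusion_beam_search(
    am_step: Callable[[List[int]], Dict[int, float]],  # prefix -> {token: AM log-prob}
    lm_step: Callable[[List[int]], Dict[int, float]],  # prefix -> {token: LM log-prob}
    eos_id: int,
    beam_size: int = 4,
    lm_weight: float = 0.5,  # assumed value; the paper's actual weight is not given here
    max_len: int = 200,
) -> List[int]:
    """Beam search where each expansion is scored by the fused
    log-probability: log p_AM(token) + lm_weight * log p_LM(token)."""
    beams = [(0.0, [])]  # (cumulative fused score, token prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            am_logp = am_step(prefix)
            lm_logp = lm_step(prefix)
            for tok, am_lp in am_logp.items():
                # fuse AM and LM scores for this candidate token
                fused = score + am_lp + lm_weight * lm_logp.get(tok, float("-inf"))
                candidates.append((fused, prefix + [tok]))
        # keep only the top `beam_size` expansions
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for fused, prefix in candidates[:beam_size]:
            if prefix[-1] == eos_id:
                finished.append((fused, prefix))  # hypothesis ended
            else:
                beams.append((fused, prefix))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[0])
    return best[1]
```

In practice the fusion weight is typically tuned on a development set, and length normalisation is often added when comparing finished hypotheses; both are omitted here for brevity.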
Metadata
| Item Type: | Preprint |
|---|---|
| Authors/Creators: | Flynn, R. and Ragni, A. (ORCID: 0000-0003-0634-4456) |
| Copyright, Publisher and Additional Information: | © 2023 The Author(s). This preprint is made available under a Creative Commons Attribution 4.0 International License. (https://creativecommons.org/licenses/by/4.0/) |
| Dates: | Submitted: 2023 |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Funding Information: | Funder: META PLATFORM INC; Grant number: UNSPECIFIED |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 07 Jun 2024 13:28 |
| Last Modified: | 10 Jun 2024 11:59 |
| Status: | Submitted |
| Identification Number (DOI): | 10.48550/arXiv.2310.15672 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:213154 |
