Whittaker, S. and Amento, B. (2004) Semantic speech editing. In: Dykstra-Erickson, E. and Tscheligi, M. (eds.) Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 24-29 April 2004, Vienna, Austria. New York: ACM Press, pp. 527-534. ISBN 1-58113-702-8
Editing speech data is currently time-consuming and error-prone. Speech editors rely on acoustic waveform representations, which force users to repeatedly sample the underlying speech to identify words and phrases to edit. Instead we developed a semantic editor that reduces the need for extensive sampling by providing access to meaning. The editor shows a time-aligned errorful transcript produced by applying automatic speech recognition (ASR) to the original speech. Users visually scan the words in the transcript to identify important phrases. They then edit the transcript directly using standard word processing 'cut and paste' operations, which extract the corresponding time-aligned speech. ASR errors mean that users must supplement what they read in the transcript by accessing the original speech. Even when there are transcript errors, however, the semantic representation still provides users with enough information to target what they edit and play, reducing the need for extensive sampling. A laboratory evaluation showed that semantic editing is more efficient than acoustic editing even when ASR is highly inaccurate.
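The core mechanism the abstract describes — a time-aligned transcript in which word-processing edits on text extract the corresponding speech — can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes a hypothetical `Word` record carrying per-word start/end times such as ASR alignment would produce, and shows how the indices of words kept after a textual cut map back to merged audio spans.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Word:
    text: str
    start: float  # seconds into the recording (from ASR time alignment)
    end: float

def spans_for_edit(words: List[Word], keep: List[int]) -> List[Tuple[float, float]]:
    """Map indices of transcript words retained after editing to merged audio spans."""
    spans: List[Tuple[float, float]] = []
    for i in sorted(keep):
        w = words[i]
        if spans and abs(spans[-1][1] - w.start) < 1e-6:
            # consecutive in time: extend the previous span instead of splitting
            spans[-1] = (spans[-1][0], w.end)
        else:
            spans.append((w.start, w.end))
    return spans

transcript = [Word("semantic", 0.0, 0.5), Word("speech", 0.5, 0.9),
              Word("um", 0.9, 1.2), Word("editing", 1.2, 1.8)]
# Deleting the filler "um" in the text keeps words 0, 1, 3:
print(spans_for_edit(transcript, [0, 1, 3]))  # [(0.0, 0.9), (1.2, 1.8)]
```

The resulting spans would then be concatenated from the original audio, so the user never manipulates the waveform directly.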
Institution: The University of Sheffield
Academic Units: The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Depositing User: Information Studies
Date Deposited: 25 Mar 2009 10:19
Last Modified: 19 May 2009 17:05
Publisher: New York: ACM Press