Sharoff, S (2021) Genre Annotation for the Web: text-external and text-internal perspectives. Register Studies, 3 (1). pp. 1-32. ISSN 2542-9477
Abstract
This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as enTenTen, ukWac or ruWac, by means of automatic text classification. First, the paper presents a Deep Learning model suitable for classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting. Second, it describes the results of applying the automatic classification model to these corpora and compares their composition. Finally, the paper introduces a framework for interpreting the results of automatic genre classification using linguistic features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across languages.
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Sharoff, S |
Copyright, Publisher and Additional Information: | © 2021, John Benjamins Publishing Company. This is an author produced version of a paper published in Register Studies. Please contact the publisher (John Benjamins) for permission to re-use or reprint this material in any form. Uploaded in accordance with the publisher's self-archiving policy. |
Keywords: | automatic genre identification; interpreting neural networks; deep learning |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures & Societies (Leeds) > Translation Studies (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 26 Feb 2021 09:39 |
Last Modified: | 05 Jul 2022 14:15 |
Status: | Published |
Publisher: | John Benjamins |
Identification Number: | 10.1075/rs.19015.sha |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:170250 |