Sharoff, S (2018) Functional Text Dimensions for the annotation of web corpora. Corpora, 13 (1). pp. 65-95. ISSN 1749-5032
Abstract
This paper presents an approach to classifying large web corpora into genres by means of Functional Text Dimensions (FTDs). This offers a topological approach to text typology in which the texts are described in terms of their similarity to prototype genres. The suggested set of categories is designed to be applicable to any text on the web and to be reliable in annotation practice. Interannotator agreement results show that the suggested categories produce Krippendorff's α above 0.76. In addition to the functional space of eighteen dimensions, similarity between annotated documents can be described visually within a space of reduced dimensions obtained through t-distributed Stochastic Neighbour Embedding. Reliably annotated texts also provide the basis for automatic genre classification, which can be done in each FTD, as well as within the space of reduced dimensions. An example comparing texts from the Brown Corpus, the BNC and ukWaC, a large web corpus, is provided.
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Sharoff, S |
Copyright, Publisher and Additional Information: | © Edinburgh University Press. This is an Accepted Manuscript of an article published by Edinburgh University Press in Corpora. The Version of Record is available online at: https://doi.org/10.3366/cor.2018.0136 |
Keywords: | genre classification, web corpora |
Dates: | |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures & Societies (Leeds) > Translation Studies (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 26 Jul 2016 12:18 |
Last Modified: | 01 Oct 2020 14:14 |
Status: | Published |
Publisher: | Edinburgh University Press |
Identification Number: | 10.3366/cor.2018.0136 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:102914 |