Sharov, S (ORCID: orcid.org/0000-0002-4877-0210) (2020) Know thy corpus! Robust methods for digital curation of Web corpora. In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). 12th Conference on Language Resources and Evaluation (LREC 2020), 11-16 May 2020, Marseille. ISBN 979-10-95546-34-4
Abstract
This paper proposes a novel framework for digital curation of Web corpora in order to provide robust estimation of their parameters, such as their composition and lexicon. In recent years, language models pre-trained on large corpora have emerged as clear winners in numerous NLP tasks, but no proper analysis of the corpora which led to their success has been conducted. The paper presents a procedure for robust frequency estimation, which helps in establishing the core lexicon for a given corpus, as well as a procedure for estimating the corpus composition via unsupervised topic models and via supervised genre classification of Web pages. The results of the digital curation study applied to several Web-derived corpora demonstrate considerable differences between them. First, the corpora differ in their frequency bursts, which affect the core lexicon obtained from each corpus. Second, they differ in the kinds of texts they contain. For example, OpenWebText contains considerably more topical news and political argumentation than ukWaC or Wikipedia. The tools and the results of the analysis have been released.
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Authors/Creators: | Sharov, S (ORCID: orcid.org/0000-0002-4877-0210) |
| Copyright, Publisher and Additional Information: | © The European Language Resources Association (ELRA), 2020. The LREC 2020 Proceedings are licensed under a Creative Commons Attribution Non-Commercial 4.0 International License. |
| Keywords: | Validation of language resources, Text analytics, Language Modelling, Digital curation |
| Dates: | |
| Institution: | The University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures & Societies (Leeds) > Translation Studies (Leeds) |
| Depositing User: | Symplectic Publications |
| Date Deposited: | 17 Apr 2020 13:10 |
| Last Modified: | 06 Feb 2021 09:02 |
| Published Version: | http://www.elra.info/en/lrec/proceedings/ |
| Status: | Published |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:159411 |