Goetze, T.S. orcid.org/0000-0002-3435-3264 and Abramson, D. (2021) Bigger isn't better: the ethical and scientific vices of extra-large datasets in language models. In: WebSci '21: 13th ACM Web Science Conference 2021 Proceedings. WebSci '21: 13th ACM Web Science Conference, 21-25 Jun 2021, Virtual conference. Association for Computing Machinery, pp. 69-75. ISBN 9781450385251
Abstract
The use of language models in Web applications and other areas of computing and business has grown significantly over the last five years. One reason for this growth is the improvement in performance of language models on a number of benchmarks, but a side effect of these advances has been the adoption of a “bigger is always better” paradigm when it comes to the size of training, testing, and challenge datasets. Drawing on previous criticisms of this paradigm as applied to large training datasets crawled from pre-existing text on the Web, we extend the critique to challenge datasets custom-created by crowdworkers. We present several sets of criticisms, where ethical and scientific issues in language model research reinforce each other: labour injustices in crowdwork, dataset quality and inscrutability, inequities in the research community, and centralized corporate control of the technology. We also present a new type of tool for researchers to use in examining large datasets when evaluating them for quality.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Goetze, T.S. (ORCID: 0000-0002-3435-3264); Abramson, D. |
Copyright, Publisher and Additional Information: | © 2021 The Authors. This is an author-produced version of a paper subsequently published in WebSci '21: 13th ACM Web Science Conference Proceedings. Uploaded in accordance with the publisher's self-archiving policy. |
Keywords: | computer ethics; Natural Language Processing; Computing profession; Free and open source software; philosophy of computer science |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Arts and Humanities (Sheffield) > Department of Philosophy (Sheffield) |
Funding Information: | Funder: Social Sciences and Humanities Research Council; Grant number: BPF-162695 |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 30 Jun 2021 10:05 |
Last Modified: | 30 Jun 2021 10:42 |
Status: | Published |
Publisher: | Association for Computing Machinery |
Refereed: | Yes |
Identification Number: | 10.1145/3462741.3466809 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:175082 |