Toxic language detection in social media for Brazilian Portuguese : new dataset and multilingual analysis

Leite, J.A., Silva, D.F., Bontcheva, K. and Scarton, C. (2020) Toxic language detection in social media for Brazilian Portuguese : new dataset and multilingual analysis. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. 10th International Joint Conference on Natural Language Processing - AACL-IJCNLP 2020, 04-07 Dec 2020, Suzhou, China (online). Association for Computational Linguistics (ACL), pp. 914-924. ISBN 9781952148910

Abstract

Hate speech and toxic comments are a common concern of social media platform users. Although these comments are, fortunately, a minority on these platforms, they are still capable of causing harm. Therefore, identifying these comments is an important task for studying and preventing the proliferation of toxicity in social media. Previous work on automatically detecting toxic comments focuses mainly on English, with very little work on languages like Brazilian Portuguese. In this paper, we propose a new large-scale dataset for Brazilian Portuguese with tweets annotated either as toxic or non-toxic or with different types of toxicity. We present our dataset collection and annotation process, where we aimed to select candidates covering multiple demographic groups. State-of-the-art BERT models were able to achieve a 76% macro-F1 score using monolingual data in the binary case. We also show that large-scale monolingual data is still needed to create more accurate models, despite recent advances in multilingual approaches. An error analysis and experiments with multi-label classification show the difficulty of classifying certain types of toxic comments that appear less frequently in our data and highlight the need to develop models that are aware of different categories of toxicity.

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Leite, J.A.; Silva, D.F.; Bontcheva, K.; Scarton, C. (https://orcid.org/0000-0002-0103-4072)
Copyright, Publisher and Additional Information:	© 2020 Association for Computational Linguistics. This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Dates:	Published (online): December 2020; Published: December 2020
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Depositing User:	Symplectic Sheffield
Date Deposited:	08 Jan 2021 12:06
Last Modified:	08 Jan 2021 12:06
Published Version:	https://www.aclweb.org/anthology/2020.aacl-main.91
Status:	Published
Publisher:	Association for Computational Linguistics (ACL)
Refereed:	Yes
Related URLs:	arXiv URL Conference
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:169793

Download

Published Version

Filename: 2020.aacl-main.91.pdf

Licence: CC-BY 4.0
