Zhao, Z., Zhang, Z. and Hopfgartner, F. orcid.org/0000-0003-0380-6088 (Submitted: 2019) Detecting toxic content online and the effect of training data on classification performance. EasyChair. (Submitted)
Abstract
The spread of toxic content online has attracted a wealth of research into methods of automatic detection and classification in recent years. However, two limitations still exist: 1) the lack of support for multi-label classification; and 2) the lack of understanding of the impact of the typically unbalanced datasets on such tasks. In this work, we build three state-of-the-art methods for the task of multi-label classification of toxic content online, and compare the effect of training data size on their performance. The three methods of choice are based on Support Vector Machine (SVM), Convolutional Neural Networks (CNN) and Long Short-Term Memory Networks (LSTM), respectively. We conduct learning curve analysis and show that CNN is the most robust method, as it outperforms the other two regardless of the size of the dataset, even on very small amounts of data. This challenges the conventional belief that Neural Networks require significant amounts of data to train accurate models. We also empirically derive indicative thresholds of training data size to help determine a reliable estimate of classifier performance, or maximise potential classifier performance in such tasks.
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Zhao, Z.; Zhang, Z.; Hopfgartner, F. (orcid.org/0000-0003-0380-6088) |
Copyright, Publisher and Additional Information: | © 2019 The Author(s). For re-use permissions please contact the Author(s). |
Keywords: | classifier performance; Convolutional Neural Network; deep learning; Deep Neural Network; detecting hate speech; hate speech; learning curve; machine learning; multi-label classification; Natural Language Processing; neural network; NLP; offensive language; text classification; text mining; toxic comment; toxic content; toxic content classification; training data |
Dates: | Submitted: 2019 |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 13 Jul 2020 13:04 |
Last Modified: | 13 Jul 2020 14:05 |
Published Version: | https://easychair.org/publications/preprint/XGmR |
Status: | Submitted |
Identification Number: | 10.29007/z5xk |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:163193 |