Poulston, A., Waseem, Z. and Stevenson, M. orcid.org/0000-0002-9483-6006 (2017) Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: Notebook for PAN at CLEF 2017. In: Cappellato, L., Ferro, N., Goeuriot, L. and Mandl, T., (eds.) CEUR Workshop Proceedings. Conference and Labs of the Evaluation Forum (CLEF 2017), 11-14 Sep 2017, Dublin, Ireland. CEUR
Abstract
This paper presents our approach and results for the 2017 PAN Author Profiling Shared Task. Language-specific corpora were provided for four languages: Spanish, English, Portuguese, and Arabic. Each corpus consisted of tweets authored by a number of Twitter users, each labeled with their gender and the specific variant of their language used in the documents (e.g. Brazilian or European Portuguese). The task was to develop a system to infer the same attributes for unseen Twitter users. Our system employs an ensemble of two probabilistic classifiers: a logistic regression classifier trained on TF-IDF transformed n-grams and a Gaussian Process classifier trained on word embedding clusters derived from an additional, external corpus of tweets.
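The two feature views and the ensemble step described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation. The function names, the TF-IDF variant (raw term frequency with idf = log(N/df) + 1), the `word_to_cluster` mapping, and the rule of averaging the two classifiers' probabilities are all assumptions made for illustration; the abstract does not state how the two classifiers are combined. Classifier training is omitted: the probability dictionaries stand in for the outputs of a logistic regression model and a Gaussian Process classifier.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs, n_max=2):
    """TF-IDF weights per document over word n-grams.

    Assumed variant: raw term frequency times (log(N / df) + 1).
    """
    grams = [Counter(word_ngrams(d.lower().split(), n_max)) for d in docs]
    df = Counter(g for counts in grams for g in counts)  # document frequency
    n_docs = len(docs)
    return [{g: tf * (math.log(n_docs / df[g]) + 1.0)
             for g, tf in counts.items()}
            for counts in grams]


def cluster_histogram(tokens, word_to_cluster, n_clusters):
    """Normalised counts of embedding-cluster IDs for one user's tokens.

    word_to_cluster is a hypothetical mapping, e.g. obtained by clustering
    word embeddings trained on an external tweet corpus.
    """
    counts = [0.0] * n_clusters
    for token in tokens:
        if token in word_to_cluster:
            counts[word_to_cluster[token]] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def average_probabilities(p_a, p_b):
    """Assumed ensemble rule: mean of two class-probability dictionaries."""
    return {label: (p_a[label] + p_b[label]) / 2.0 for label in p_a}


def predict(p_a, p_b):
    """Return the label with the highest averaged probability."""
    combined = average_probabilities(p_a, p_b)
    return max(combined, key=combined.get)
```

In this sketch, each user's tweets would be turned into one TF-IDF vector and one cluster histogram, each view would be passed to its own probabilistic classifier, and `predict` would choose the label from the combined probabilities.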
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Poulston, A., Waseem, Z. and Stevenson, M. |
Editors: | Cappellato, L., Ferro, N., Goeuriot, L. and Mandl, T. |
Copyright, Publisher and Additional Information: | © 2017 The Author(s). Reproduced in accordance with the publisher's self-archiving policy. |
Dates: | 2017 |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 15 Mar 2018 15:51 |
Last Modified: | 19 Dec 2022 13:49 |
Published Version: | http://ceur-ws.org/Vol-1866/ |
Status: | Published |
Publisher: | CEUR |
Refereed: | Yes |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:128573 |