Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: Notebook for PAN at CLEF 2017

Poulston, A., Waseem, Z. and Stevenson, M. orcid.org/0000-0002-9483-6006 (2017) Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: Notebook for PAN at CLEF 2017. In: Cappellato, L., Ferro, N., Goeuriot, L. and Mandl, T., (eds.) CEUR Workshop Proceedings. Conference and Labs of the Evaluation Forum (CLEF 2017), 11-14 Sep 2017, Dublin, Ireland. CEUR

Abstract

This paper presents our approach and results for the 2017 PAN Author Profiling Shared Task. Language-specific corpora were provided for four languages: Spanish, English, Portuguese, and Arabic. Each corpus consisted of tweets authored by a number of Twitter users, each labeled with their gender and the specific variant of their language used in the documents (e.g. Brazilian or European Portuguese). The task was to develop a system to infer the same attributes for unseen Twitter users. Our system employs an ensemble of two probabilistic classifiers: a logistic regression classifier trained on TF-IDF transformed n-grams and a Gaussian Process classifier trained on word embedding clusters derived for an additional, external corpus of tweets.
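
As a rough illustration of two ideas named in the abstract, the sketch below computes TF-IDF weights over word n-grams and averages the class probabilities of two classifiers. It is not the authors' implementation: the abstract does not state the n-gram type, the TF-IDF variant, or how the two classifiers are combined, so the `tf * log(N / df)` weighting, the word-level n-grams, the averaging rule, the function names, and the example tweets and probabilities are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max (whitespace tokenisation)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs, n_max=2):
    """Weight each n-gram by tf * log(N / df); one dict per document.

    This is the textbook TF-IDF form, assumed here for illustration.
    """
    counts_per_doc = [Counter(word_ngrams(d, n_max)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(g for counts in counts_per_doc for g in counts)
    n_docs = len(docs)
    return [{g: tf * math.log(n_docs / df[g]) for g, tf in counts.items()}
            for counts in counts_per_doc]


def soft_vote(probs_a, probs_b):
    """Average two class-probability dicts (an assumed combination rule)."""
    return {label: (probs_a[label] + probs_b[label]) / 2 for label in probs_a}


# Hypothetical example data, not taken from the PAN corpora.
docs = ["good morning everyone", "good night everyone", "morning coffee time"]
weights = tfidf(docs)

# Hypothetical outputs of the two classifiers for one user.
combined = soft_vote({"female": 0.7, "male": 0.3},
                     {"female": 0.4, "male": 0.6})
```

In this sketch an n-gram that occurs in every document receives a weight of zero, while one that occurs in a single document receives the largest weight, which is the behaviour that makes TF-IDF features useful for distinguishing authors.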

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Poulston, A.; Waseem, Z.; Stevenson, M. (https://orcid.org/0000-0002-9483-6006)
Editors:	Cappellato, L.; Ferro, N.; Goeuriot, L.; Mandl, T.
Copyright, Publisher and Additional Information:	© 2017 The Author(s). Reproduced in accordance with the publisher's self-archiving policy.
Dates:	Published (online): 13 July 2017; Published: 13 July 2017
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Depositing User:	Symplectic Sheffield
Date Deposited:	15 Mar 2018 15:51
Last Modified:	19 Dec 2022 13:49
Published Version:	http://ceur-ws.org/Vol-1866/
Status:	Published
Publisher:	CEUR
Refereed:	Yes
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:128573
