Forsyth, R and Sharoff, S (2013) Document Dissimilarity within and across Languages: a Benchmarking Study. Literary and Linguistic Computing, 28. ISSN 0268-1145
Abstract
Quantifying the similarity or dissimilarity between documents is an important task in authorship attribution, information retrieval, plagiarism detection, text mining, and many other areas of linguistic computing. Numerous similarity indices have been devised and used, but relatively little attention has been paid to calibrating such indices against externally imposed standards, mainly because of the difficulty of establishing agreed reference levels of inter-text similarity. The present article introduces a multi-register corpus gathered for this purpose, in which each text has been located in a similarity space based on ratings by human readers. This provides a resource for testing similarity measures derived from computational text-processing against reference levels derived from human judgement, i.e. external to the texts themselves. We describe the results of a benchmarking study in five different languages in which some widely used measures perform comparatively poorly. In particular, several alternative correlational measures (Pearson r, Spearman rho, tetrachoric correlation) consistently outperform cosine similarity on our data. A method of using what we call ‘anchor texts’ to extend this method from monolingual inter-text similarity-scoring to inter-text similarity-scoring across languages is also proposed and tested.
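As a rough illustration of how the measures named in the abstract differ, the sketch below compares cosine similarity with Pearson r and Spearman rho on simple word-count vectors. It is not the authors' implementation: the whitespace tokenisation, the use of raw word counts as features, and all function names are assumptions made for this example, and tetrachoric correlation is left out for brevity. The point it shows is that Pearson r is the cosine of the mean-centred vectors, and Spearman rho is Pearson r computed on ranks.

```python
import math
from collections import Counter


def freq_vectors(text_a, text_b):
    """Aligned word-count vectors over the union vocabulary (hypothetical feature choice)."""
    ca = Counter(text_a.lower().split())
    cb = Counter(text_b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    return [ca[w] for w in vocab], [cb[w] for w in vocab]


def cosine(x, y):
    """Cosine of the angle between x and y (undefined for an all-zero vector)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson r: cosine similarity after subtracting each vector's mean."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: Pearson r computed on the rank-transformed vectors."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Two toy sentences, invented for this example only.
    a, b = freq_vectors("the cat sat on the mat", "the dog sat on the log")
    print("cosine  :", cosine(a, b))
    print("pearson :", pearson(a, b))
    print("spearman:", spearman(a, b))
```

Because cosine similarity is not mean-centred, two non-negative count vectors with no words in common score 0, whereas Pearson r scores them negatively; for example, the vectors [1, 0] and [0, 1] give cosine 0 and Pearson r of -1.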
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Forsyth, R; Sharoff, S |
Dates: |
|
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures & Societies (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 24 Jun 2013 08:36 |
Last Modified: | 04 Nov 2016 02:28 |
Published Version: | http://dx.doi.org/10.1093/llc/fqt002 |
Status: | Published |
Publisher: | Oxford University Press |
Identification Number: | 10.1093/llc/fqt002 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:75761 |