Clough, P., Gaizauskas, R., Piao, S.S.L. et al. (1 more author) (2002) METER: MEasuring TExt Reuse. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. ACL-02, 07-12 Jul 2002, Philadelphia. ACL, 152-159.
Abstract
In this paper we present results from the METER (MEasuring TExt Reuse) project, whose aim is to explore issues pertaining to text reuse and derivation, especially in the context of newspapers using newswire sources. Although the reuse of text by journalists has been studied in linguistics, we are not aware of any investigation using existing computational methods for this particular task. We investigate the classification of newspaper articles according to their degree of dependence upon, or derivation from, a newswire source using a simple 3-level scheme designed by journalists. Three approaches to measuring text similarity are considered: n-gram overlap, Greedy String Tiling, and sentence alignment. Measured against a manually annotated corpus of source and derived news text, we show that a combined classifier with automatically selected features performs best overall for the ternary classification, achieving an average F1-measure score of 0.664 across all three categories.
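As a rough illustration of the first of the three approaches named in the abstract, n-gram overlap between a derived article and its candidate source could be sketched as below. The abstract does not give the exact formulation, so this uses a containment-style score (the share of the derived text's distinct n-grams that also appear in the source) as an assumption. The function names `ngrams` and `containment`, the whitespace tokenisation, and the example sentences are all hypothetical and not taken from the paper.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(derived: str, source: str, n: int = 1) -> float:
    """Fraction of the derived text's distinct n-grams found in the source.

    Assumption: lowercased whitespace tokenisation stands in for whatever
    preprocessing a real system would apply.
    """
    derived_grams = ngrams(derived.lower().split(), n)
    source_grams = ngrams(source.lower().split(), n)
    if not derived_grams:
        return 0.0  # text too short to form any n-gram
    return len(derived_grams & source_grams) / len(derived_grams)


# Made-up example texts, for illustration only.
newswire = "the court heard that the man was arrested on friday"
article = "the man was arrested on friday night police said"

# 5 of the article's 8 distinct bigrams also occur in the newswire text.
print(containment(article, newswire, n=2))  # 0.625
```

A score near 1 would suggest the article draws heavily on the source and a score near 0 would suggest little shared wording. How such a score is mapped onto the 3-level scheme is decided by the classifier described in the paper, not by this sketch.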
Metadata
Item Type: | Proceedings Paper |
---|---|
Copyright, Publisher and Additional Information: | © 2002 ACL. This is an author produced version of a paper subsequently published in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Uploaded in accordance with the publisher's self-archiving policy. |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield); The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 15 Apr 2014 15:47 |
Last Modified: | 19 Dec 2022 13:26 |
Published Version: | http://dx.doi.org/10.3115/1073083.1073110 |
Status: | Published |
Publisher: | ACL |
Refereed: | Yes |
Identification Number: | 10.3115/1073083.1073110 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:78530 |