
Duplicate Detection in the Reuters Collection

Sanderson, M. (1997) Duplicate Detection in the Reuters Collection. Technical Report. Department of Computing Science, University of Glasgow.




Experiments with the Reuters collection revealed that it contains a number of documents that are exact duplicates of each other (see Figure 1). A short study was conducted to determine how many such documents there were. The results showed that the notion of a duplicate document was not as simple as first thought.

This report is organized as follows. A brief review of previous duplicate detection research is presented, followed by a description of the methods and results of the duplicate detection work conducted here. An appendix lists the document ids of the various types of duplicate found.

Item Type: Monograph (Technical Report)
Institution: The University of Sheffield
Academic Units: The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Depositing User: Repository Officer
Date Deposited: 16 Sep 2008 13:04
Last Modified: 07 Jun 2014 06:14
Status: Published
Publisher: Department of Computing Science
Identification Number: Technical Report (TR-1997-5) of the Department of Computing Science at the University of Glasgow
URI: http://eprints.whiterose.ac.uk/id/eprint/4571
