Sanderson, M. (1997) Duplicate Detection in the Reuters Collection. Technical Report. Department of Computing Science, University of Glasgow.
Abstract
During experiments with the Reuters collection, it was discovered that the collection contained a number of documents that were exact duplicates of each other (see Figure 1). A short study was conducted to determine how many such documents there were. The results of this study revealed that the notion of a duplicate document was not as simple as first thought.
The contents of this report are as follows. A brief review of previous duplicate detection research is presented, followed by a description of the methods and results of the duplicate detection work conducted here. In addition, an appendix lists the document IDs of the various types of duplicate found.
Metadata
Item Type: | Monograph |
---|---|
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield) |
Depositing User: | Repository Officer |
Date Deposited: | 16 Sep 2008 13:04 |
Last Modified: | 07 Jun 2014 06:14 |
Status: | Published |
Publisher: | Department of Computing Science |
Identification Number: | Technical Report (TR-1997-5) of the Department of Computing Science at the University of Glasgow |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:4571 |