Sanderson, M. (1997) Duplicate Detection in the Reuters Collection. Technical Report. Department of Computing Science, University of Glasgow.
Full text available as: Duplicates.pdf (49Kb)
Abstract
While conducting some experiments with the Reuters collection, it was discovered that contained within it were a number of documents that were exact duplicates of each other (see Figure 1). A short study was conducted to try to discover how many such documents there were. The results of this study revealed that the notion of a duplicate document was not as simple as first thought.
The contents of this report are as follows. A brief review of previous duplicate detection research will be presented, followed by a description of the methods and results of the duplicate detection work conducted here. In addition, there is an appendix holding the document ids of the various types of duplicate found.
| Item Type: | Monograph (Technical Report) |
|---|---|
| Academic Units: | The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield) |
| Depositing User: | Repository Officer |
| Date Deposited: | 16 Sep 2008 13:04 |
| Last Modified: | 08 Feb 2013 16:56 |
| Status: | Published |
| Publisher: | Department of Computing Science |
| Identification Number: | Technical Report (TR-1997-5) of the Department of Computing Science at the University of Glasgow |
| URI: | http://eprints.whiterose.ac.uk/id/eprint/4571 |