Sanderson, M. (1997) Duplicate Detection in the Reuters Collection. Technical Report. Department of Computing Science, University of Glasgow.
While conducting some experiments with the Reuters collection, it was discovered that the collection contained a number of documents that were exact duplicates of each other (see Figure 1). A short study was conducted to discover how many such documents there were. The results of this study revealed that the notion of a duplicate document was not as simple as first thought.
The contents of this report are as follows. A brief review of previous duplicate detection research is presented, followed by a description of the methods and results of the duplicate detection work conducted here. In addition, an appendix lists the document ids of the various types of duplicate found.
|Item Type:||Monograph (Technical Report)|
|Institution:||The University of Sheffield|
|Academic Units:||The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)|
|Depositing User:||Repository Officer|
|Date Deposited:||16 Sep 2008 13:04|
|Last Modified:||07 Jun 2014 06:14|
|Publisher:||Department of Computing Science|
|Identification Number:||Technical Report (TR-1997-5) of the Department of Computing Science at the University of Glasgow|