Duplicate Detection in the Reuters Collection

Abstract

During experiments with the Reuters collection, it was discovered that the collection contained a number of documents that were exact duplicates of each other (see Figure 1). A short study was conducted to determine how many such documents there were. The study revealed that the notion of a duplicate document was not as simple as first thought.

This report is organised as follows. A brief review of previous duplicate detection research is presented, followed by a description of the methods and results of the duplicate detection work conducted here. An appendix lists the document ids of the various types of duplicate found.

Metadata

Item Type:	Monograph
Authors/Creators:	Sanderson, M.
Dates:	Published: 1997
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Depositing User:	Repository Officer
Date Deposited:	16 Sep 2008 13:04
Last Modified:	07 Jun 2014 06:14
Status:	Published
Publisher:	Department of Computing Science
Identification Number:	Technical Report (TR-1997-5) of the Department of Computing Science at the University of Glasgow
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:4571
