Query-related data extraction of hidden web documents

Hedley, Y., Younas, M., James, A. and Sanderson, M. (2004) Query-related data extraction of hidden web documents. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 25-29, 2004, Sheffield, UK. ACM, New York, USA, pp. 558-559. ISBN 1-58113-881-4

Abstract

Most information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is generated dynamically by querying these databases, which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.
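
To make the idea of template detection concrete, the following is a minimal illustrative sketch and not the authors' algorithm, whose details are not given in this record. It assumes that several result pages from the same database are available, that a text segment belongs to the template when the same text occurs with the same adjacent tags on every page, and that whatever remains is query-related data. The names `segments` and `extract_data` and the regular-expression tokenizer are hypothetical choices made for this example.

```python
import re
from collections import Counter

# Assumption: a crude tokenizer that splits HTML into tags and text runs.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def segments(html):
    """Return (previous tag, text, next tag) triples for each text run."""
    tokens = [t.strip() for t in TOKEN.findall(html)]
    tokens = [t for t in tokens if t]  # drop whitespace-only runs
    result = []
    for i, tok in enumerate(tokens):
        if tok.startswith("<"):
            continue  # tags only serve as context for text segments
        prev_tag = tokens[i - 1] if i > 0 else ""
        next_tag = tokens[i + 1] if i + 1 < len(tokens) else ""
        result.append((prev_tag, tok, next_tag))
    return result


def extract_data(pages):
    """Return, per page, the text segments that are not shared by all pages."""
    per_page = [segments(p) for p in pages]
    # Count in how many pages each (tag, text, tag) triple appears.
    counts = Counter(seg for segs in per_page for seg in set(segs))
    # Assumption: a triple present on every page is template content.
    template = {seg for seg, n in counts.items() if n == len(pages)}
    return [[seg[1] for seg in segs if seg not in template] for segs in per_page]
```

Given two pages that share a heading and a navigation link but differ in one paragraph, `extract_data` returns only the differing paragraph text for each page. A real system would need a proper HTML parser and a tolerance for near-duplicate template text; exact matching is used here only to keep the sketch short.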

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Hedley, Y.; Younas, M.; James, A.; Sanderson, M.
Keywords:	hidden web databases, data extraction
Dates:	Published: 2004
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Depositing User:	Repository Officer
Date Deposited:	28 Nov 2008 13:38
Last Modified:	08 Feb 2013 16:57
Published Version:	http://dx.doi.org/10.1145/1008992.1009119
Status:	Published
Publisher:	ACM
Identification Number:	10.1145/1008992.1009119
Related URLs:	Author
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:4545
