Query-related data extraction of hidden web documents

Hedley, Y., Younas, M., James, A. and Sanderson, M. (2004) Query-related data extraction of hidden web documents. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 25-29, 2004, Sheffield, UK. ACM, New York, USA, pp. 558-559. ISBN 1-58113-881-4


A large amount of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is generated dynamically by querying these databases, which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.
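
The abstract does not give the algorithm itself, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes that a text segment is part of the template when the same text, with the same immediately preceding and following tags, also occurs in a second result page from the same database; everything else is treated as query-related. The names `SegmentParser`, `extract_segments` and `query_related` are hypothetical.

```python
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collect (preceding tag, text, following tag) triples from a page."""

    def __init__(self):
        super().__init__()
        self.segments = []     # completed triples
        self._last_tag = None  # most recent tag seen before the current text
        self._pending = None   # (preceding tag, text) awaiting its following tag

    def _see_tag(self, name):
        # A tag closes any pending text segment and becomes the new context.
        if self._pending is not None:
            prev_tag, text = self._pending
            self.segments.append((prev_tag, text, name))
            self._pending = None
        self._last_tag = name

    def handle_starttag(self, tag, attrs):
        self._see_tag("<" + tag + ">")

    def handle_endtag(self, tag):
        self._see_tag("</" + tag + ">")

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if not text:
            return
        if self._pending is None:
            self._pending = (self._last_tag, text)
        else:
            # Consecutive text chunks belong to the same segment.
            prev_tag, earlier = self._pending
            self._pending = (prev_tag, earlier + " " + text)

    def flush(self):
        # Text at the very end of the page has no following tag.
        if self._pending is not None:
            prev_tag, text = self._pending
            self.segments.append((prev_tag, text, None))
            self._pending = None


def extract_segments(html):
    """Return the list of (preceding tag, text, following tag) triples."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.segments


def query_related(page, other_page):
    """Keep text whose content and adjacent tags do not recur in other_page.

    Assumption for illustration: segments shared by both pages are template.
    """
    template = set(extract_segments(other_page))
    return [seg[1] for seg in extract_segments(page) if seg not in template]
```

Comparing only two pages is a simplification chosen to keep the sketch short; text that happens to recur in both result sets would be wrongly discarded as template, so a practical system would need more evidence than this.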

Item Type: Proceedings Paper
Keywords: hidden web databases, data extraction
Institution: The University of Sheffield
Academic Units: The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Depositing User: Repository Officer
Date Deposited: 28 Nov 2008 13:38
Last Modified: 08 Feb 2013 16:57
Published Version: http://dx.doi.org/10.1145/1008992.1009119
Status: Published
Publisher: ACM
Identification Number: 10.1145/1008992.1009119
URI: http://eprints.whiterose.ac.uk/id/eprint/4545