Hedley, Y., Younas, M., James, A. et al. (1 more author) (2004) Query-related data extraction of hidden web documents. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 25-29, 2004, Sheffield, UK. ACM, New York, USA, pp. 558-559. ISBN 1-58113-881-4
Abstract
Most of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is dynamically generated by querying these databases, which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.
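The general idea in the abstract, separating template text from query-related text by looking at each text segment together with its adjacent tag, can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes that a text segment preceded by the same opening tag on more than a given share of result pages belongs to the template. The names `SegmentParser`, `segments`, `extract_query_related`, and `min_share` are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collect (preceding opening tag, text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.last_tag = None   # most recent opening tag seen
        self.segments = []     # list of (tag, normalised text)

    def handle_starttag(self, tag, attrs):
        self.last_tag = tag

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.segments.append((self.last_tag, text))


def segments(html):
    """Return the (tag, text) segments of one page."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return parser.segments


def extract_query_related(pages, min_share=0.5):
    """Drop segments that recur on more than `min_share` of the pages.

    Assumption for this sketch: recurring (tag, text) pairs are template
    content; whatever remains on each page is treated as query-related.
    """
    per_page = [segments(page) for page in pages]
    # Count each distinct segment once per page.
    counts = Counter(seg for segs in per_page for seg in set(segs))
    threshold = min_share * len(pages)
    template = {seg for seg, n in counts.items() if n > threshold}
    return [
        [text for (tag, text) in segs if (tag, text) not in template]
        for segs in per_page
    ]
```

Calling `extract_query_related` on a list of result pages sampled from the same database with different queries returns, for each page, the text left over once the shared segments are removed. Exact string matching is the simplest possible choice here; a real system would likely need a more tolerant similarity measure.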
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Authors/Creators: | |
| Keywords: | hidden web databases, data extraction |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield) |
| Depositing User: | Repository Officer |
| Date Deposited: | 28 Nov 2008 13:38 |
| Last Modified: | 08 Feb 2013 16:57 |
| Published Version: | http://dx.doi.org/10.1145/1008992.1009119 |
| Status: | Published |
| Publisher: | ACM |
| Identification Number: | 10.1145/1008992.1009119 |
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:4545 |