
Sampling, information extraction and summarisation of hidden web databases

Hedley, Y., Younas, M., James, A. and Sanderson, M. (2006) Sampling, information extraction and summarisation of hidden web databases. Data & Knowledge Engineering, 59 (2). pp. 213-230. ISSN 0169-023X

Full text not available from this repository.


Hidden Web databases maintain collections of specialised documents, which are dynamically generated in response to users’ queries. The majority of these documents are generated through Web page templates, which contain information that is often irrelevant to the queries. In this paper, we present a system designed to detect and extract query-related information from documents sampled from such databases. The proposed system, 2PS, is based on a two-phase framework for the sampling, extraction and summarisation of Hidden Web documents. In the first phase, 2PS queries a database with random terms selected from those contained in its search interface pages and in the subsequently retrieved documents. This phase continues until a predetermined number of sampled documents has been retrieved. In the second phase, 2PS detects Web page templates in the sampled documents in order to extract the information relevant to the respective queries, from which a content summary is generated. 2PS is validated through the implementation of a prototype system. Its evaluation is performed through experiments on a number of real-world Hidden Web databases. The experimental results demonstrate that 2PS effectively eliminates irrelevant information contained in Web page templates and generates terms and frequencies with improved accuracy.
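
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the 2PS implementation: the function names, the toy in-memory database and the `threshold` parameter are all hypothetical, and template detection is replaced here by a much simpler heuristic (lines that recur in most sampled documents are treated as template text).

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def sample_documents(search, seed_terms, target, rng, max_queries=1000):
    """Phase one: query with random terms until `target` documents are sampled.

    `search` is an assumed callable mapping a term to (doc_id, text) pairs.
    The term pool starts from the seed terms (standing in for the search
    interface page) and grows with terms from every newly retrieved document.
    """
    vocabulary = list(dict.fromkeys(seed_terms))
    known = set(vocabulary)
    sampled = {}
    for _ in range(max_queries):
        if len(sampled) >= target or not vocabulary:
            break
        term = rng.choice(vocabulary)
        for doc_id, text in search(term):
            if doc_id in sampled:
                continue
            sampled[doc_id] = text
            for token in tokenize(text):
                if token not in known:
                    known.add(token)
                    vocabulary.append(token)
            if len(sampled) >= target:
                break
    return sampled


def strip_template(docs, threshold=0.8):
    """Phase two (simplified): drop lines shared by most sampled documents."""
    if len(docs) < 2:
        return dict(docs)  # too few documents to tell template from content
    line_counts = Counter()
    for text in docs.values():
        line_counts.update(set(text.splitlines()))
    cutoff = threshold * len(docs)
    template = {line for line, n in line_counts.items() if n >= cutoff}
    return {
        doc_id: "\n".join(l for l in text.splitlines() if l not in template)
        for doc_id, text in docs.items()
    }


def summarise(contents):
    """Build a content summary: term frequencies over the extracted text."""
    freqs = Counter()
    for text in contents.values():
        freqs.update(tokenize(text))
    return freqs


# Toy stand-in for a Hidden Web database: every result page wraps the
# document body in the same header and footer.
HEADER = "Home | Search | Contact"
FOOTER = "Copyright Example Database"
TOY_DB = {
    1: "aspirin reduces fever",
    2: "insulin regulates glucose",
    3: "fever follows infection",
    4: "glucose fuels cells",
}


def toy_search(term):
    """Return the generated pages whose body contains the query term."""
    return [
        (doc_id, f"{HEADER}\n{body}\n{FOOTER}")
        for doc_id, body in TOY_DB.items()
        if term in tokenize(body)
    ]


sampled = sample_documents(toy_search, ["fever", "glucose"], 4, random.Random(0))
summary = summarise(strip_template(sampled))
```

In this toy setting the header and footer appear in every sampled page, so they are removed before term frequencies are counted; without that step, template terms such as "home" or "copyright" would appear in the summary with a frequency equal to the sample size.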

Item Type: Article
Copyright, Publisher and Additional Information: © 2006 Published by Elsevier B.V. This is an author produced version of the published paper. Uploaded in accordance with the publisher's self-archiving policy.
Keywords: Hidden Web Databases, Information Extraction, Document Sampling
Institution: The University of Sheffield
Academic Units: The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Depositing User: Repository Officer
Date Deposited: 19 Sep 2008 09:19
Last Modified: 19 Sep 2008 09:19
Published Version: http://dx.doi.org/10.1016/j.datak.2006.01.009
Status: Published
Publisher: Elsevier
Refereed: Yes
Identification Number: 10.1016/j.datak.2006.01.009
URI: http://eprints.whiterose.ac.uk/id/eprint/4546
