Hedley, Y., Younas, M., James, A. and Sanderson, M. (2006) Sampling information extraction and summarisation of hidden web databases. Data & Knowledge Engineering, 59 (2). pp. 213-230. ISSN 0169-023X
Full text not available from this repository.
Hidden Web databases maintain a collection of specialised documents, which are dynamically generated in response to users’ queries. The majority of these documents are generated through Web page templates, which contain information that is often irrelevant to queries. In this paper, we present a system designed to detect and extract query-related information from documents sampled from databases. The proposed system, 2PS, is based on a two-phase framework for the sampling, extraction and summarisation of Hidden Web documents. In the first phase, 2PS queries databases with terms selected at random from their search interface pages and from the documents subsequently retrieved, and continues until a pre-determined number of documents has been sampled. In the second phase, it detects Web page templates in the sampled documents in order to extract the information relevant to the respective queries, from which a content summary is generated. 2PS is validated through the implementation of a prototype system and evaluated through experiments on a number of real-world Hidden Web databases. The experimental results demonstrate that 2PS effectively eliminates irrelevant information contained in Web page templates and generates terms and frequencies with improved accuracy.
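The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' 2PS implementation: the function names (`sample_documents`, `detect_template`, `summarise`, `toy_search`), the in-memory toy database and the `min_share` threshold are all hypothetical. Template detection is also replaced here by a much simpler heuristic, which treats any line shared by most sampled documents as template content.

```python
import random
from collections import Counter


def sample_documents(search, seed_terms, target, rng, max_queries=1000):
    """Phase 1 sketch: query with random terms until `target` documents are sampled."""
    vocabulary = sorted(set(seed_terms))  # terms taken from the search interface page
    sampled = {}                          # document id -> document text
    for _ in range(max_queries):
        if len(sampled) >= target or not vocabulary:
            break
        term = rng.choice(vocabulary)
        for doc_id, text in search(term):
            if doc_id not in sampled and len(sampled) < target:
                sampled[doc_id] = text
                # Terms from retrieved documents become candidate query terms.
                vocabulary = sorted(set(vocabulary) | set(text.lower().split()))
    return sampled


def detect_template(docs, min_share=0.8):
    """Phase 2 sketch (simplified assumption): lines found in most documents are template."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.splitlines()))
    return {line for line, n in counts.items() if n / len(docs) >= min_share}


def summarise(docs, template):
    """Term frequencies over the lines that remain once template lines are removed."""
    freq = Counter()
    for text in docs:
        for line in text.splitlines():
            if line not in template:
                freq.update(line.lower().split())
    return freq


# Hypothetical toy database: every page wraps its content in the same template.
TEMPLATE_TOP = "Home | Search | Contact"
TEMPLATE_BOTTOM = "Copyright Example Library"
BODIES = {
    1: "heart disease treatment",
    2: "heart surgery outcomes",
    3: "disease prevention diet",
    4: "diet and surgery recovery",
}
DATABASE = {i: f"{TEMPLATE_TOP}\n{body}\n{TEMPLATE_BOTTOM}" for i, body in BODIES.items()}


def toy_search(term):
    """Stand-in for a Hidden Web search interface: return pages containing the term."""
    return [(i, text) for i, text in DATABASE.items() if term in text.lower().split()]


sampled = sample_documents(toy_search, ["heart", "diet"], target=3, rng=random.Random(0))
template = detect_template(list(sampled.values()))
summary = summarise(list(sampled.values()), template)
```

In this sketch the summary counts only terms from the query-related lines, so navigation and copyright text that appears on every page does not inflate the term frequencies.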
|Copyright, Publisher and Additional Information:||© 2006 Published by Elsevier B.V. This is an author produced version of the published paper. Uploaded in accordance with the publisher's self-archiving policy.|
|Keywords:||Hidden Web Databases, Information Extraction, Document Sampling|
|Institution:||The University of Sheffield|
|Academic Units:||The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)|
|Depositing User:||Repository Officer|
|Date Deposited:||19 Sep 2008 09:19|
|Last Modified:||19 Sep 2008 09:19|