Information extraction from template-generated hidden web documents

Hedley, Y., Younas, M., James, A. and Sanderson, M. (2004) Information extraction from template-generated hidden web documents. In: Isaías, P.T., Karmakar, N., Rodrigues, L. and Barbosa, P., (eds.) Proceedings of the IADIS International Conference WWW/Internet 2004, Madrid, Spain, 2 Volumes. IADIS 2004, 06-09 Oct 2004, Madrid, Spain, pp. 627-634. ISBN 972-99353-0-0

Abstract

The larger part of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (such as Google and Yahoo). Such databases, referred to as Hidden Web databases, dynamically generate a list of documents in response to a user query. The documents are typically presented to users as template-generated Web pages. This paper presents a new approach that identifies Web page templates in order to extract query-related information from documents. We propose two forms of representation to analyse the content of a document: Text with Immediate Adjacent Tag Segments (TIATS) and Text with Neighbouring Adjacent Tag Segments (TNATS). Our techniques exploit the tag structures that surround the textual content of documents to detect Web page templates and thereby extract query-related information. Experimental results demonstrate that TNATS detects Web page templates most effectively and extracts information with high recall and precision.
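
As a rough illustration of the general idea of pairing text with its surrounding tags, the following Python sketch may help. It is not the paper's TIATS or TNATS definition, which the abstract does not spell out. It assumes that each text segment is represented together with the tag immediately before and after it, and that a segment appearing with identical tags in two result pages belongs to the template. The function names `text_with_adjacent_tags` and `non_template_text` are hypothetical.

```python
from html.parser import HTMLParser


class _Tokenizer(HTMLParser):
    """Flattens an HTML page into an ordered list of ("tag", s) / ("text", s) tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored: only the tag name is kept (an assumption).
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.tokens.append(("text", text))


def text_with_adjacent_tags(html):
    """Return (tag_before, text, tag_after) for every text segment in the page."""
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()
    tokens = parser.tokens
    segments = []
    for i, (kind, value) in enumerate(tokens):
        if kind != "text":
            continue
        prev_is_tag = i > 0 and tokens[i - 1][0] == "tag"
        next_is_tag = i + 1 < len(tokens) and tokens[i + 1][0] == "tag"
        before = tokens[i - 1][1] if prev_is_tag else ""
        after = tokens[i + 1][1] if next_is_tag else ""
        segments.append((before, value, after))
    return segments


def non_template_text(page, other_page):
    """Keep text whose (tag, text, tag) triple does not also occur in other_page.

    Assumption for illustration: triples shared by both pages are template content.
    """
    shared = set(text_with_adjacent_tags(other_page))
    return [text for (before, text, after) in text_with_adjacent_tags(page)
            if (before, text, after) not in shared]
```

Calling `non_template_text` on two result pages generated by the same database for different queries would, under these assumptions, keep only the text that differs between them. A real system would likely need a more tolerant comparison than exact matching.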

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Hedley, Y.; Younas, M.; James, A.; Sanderson, M.
Editors:	Isaías, P.T.; Karmakar, N.; Rodrigues, L.; Barbosa, P.
Keywords:	hidden web databases, information extraction
Dates:	Published: 2004
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Depositing User:	Repository Officer
Date Deposited:	22 Sep 2008 18:12
Last Modified:	19 Dec 2022 13:20
Status:	Published
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:4544
