INFORMATION EXTRACTION IN WEB DOCUMENTS USING CLUSTERING TECHNIQUE

Authors

D. Saravanan (Faculty of Operations & IT)
ICFAI Business School (IBS), Hyderabad, The ICFAI Foundation for Higher Education (IFHE)
(Deemed to be University u/s 3 of the UGC Act 1956), Hyderabad, India.

Abstract

The problem of extracting a template from web documents that conform to a common template has been studied. Because existing solutions assume that all documents are generated from a single common template, they are applicable only when every document is guaranteed to conform to that template. In real applications, however, it is not trivial to classify massively crawled documents into homogeneous partitions so that these techniques can be used. Since subtle changes in scripts or CGI parameters may result in significant differences, we cannot simply group the web documents by URL and apply these methods to each group separately. The problem therefore requires clustering the web documents so that the documents in the same group belong to the same template, and the correctness of the extracted templates depends on the quality of the clustering. To address this, we propose a hypergraph-based clustering mechanism for extracting HTML tags and templates from a large number of web documents.