ViDE: A Vision-Based Approach for Deep Web Data Extraction
Abstract:
Deep Web contents are accessed
by queries submitted to Web databases and the returned data records are
enwrapped in dynamically generated Web pages (they will be called deep Web
pages in this paper). Extracting structured data from deep Web pages is a
challenging problem due to the underlying intricate structures of such pages.
Until now, a large number of techniques have been proposed to address this
problem, but all of them have inherent limitations because they are
Web-page-programming-language dependent. As a popular two-dimensional medium,
Web pages always display their contents regularly for users to browse.
This motivates us to seek a different way for deep Web data extraction to
overcome the limitations of previous works by utilizing some interesting common
visual features on the deep Web pages. In this paper, a novel vision-based
approach that is Web-page-programming-language-independent is proposed. This approach
primarily utilizes the visual features on the deep Web pages to implement deep
Web data extraction, including data record extraction and data item extraction.
We also propose a new evaluation measure, revision, to capture the amount of
human effort needed to produce perfect extraction. Our experiments on a large
set of Web databases show that the proposed vision-based approach is highly
effective for deep Web data extraction.
Existing System:
The problem of Web data
extraction has received a lot of attention in recent years and most of the
proposed solutions are based on analyzing the HTML source code or the tag trees
of the Web pages (see Section 2 for a review of these works). These solutions
have the following main limitations: First, they are
Web-page-programming-language dependent, or more precisely, HTML-dependent. As
most Web pages are written in HTML, it is not surprising that all previous
solutions are based on analyzing the HTML source code of Web pages. However,
HTML itself is still evolving (from version 2.0 to the current version 4.01,
and version 5.0 is being drafted) and when new versions or new tags are
introduced, the previous works will have to be amended repeatedly to adapt to
new versions or new tags. Furthermore, HTML is no longer the exclusive Web page
programming language, and other languages have been introduced, such as XHTML
and XML (combined with XSLT and CSS). The previous solutions now face the
following dilemma: should they be significantly revised or even abandoned? Or
should other approaches be proposed to accommodate the new languages? Second,
they are incapable of handling the ever-increasing complexity of HTML source
code of Web pages. Most previous works have not considered the scripts, such as
JavaScript and CSS, in the HTML files. In order to make Web pages vivid and
colorful, Web page designers are using more and more complex JavaScript and
CSS. Based on our observation from a large number of real Web pages, especially
deep Web pages, the underlying structure of current Web pages is more
complicated than ever and is far different from their layouts on Web browsers.
This makes it more difficult for existing solutions to infer the regularity of
the structure of Web pages by only analyzing the tag structures.
Disadvantages:
- In the existing system, all data extraction methods are Web-page-programming-language dependent.
- Most previous works have not considered scripts, such as JavaScript and CSS, in the HTML files.
Proposed System:
In this paper, we explore the visual
regularity of the data records and data items on deep Web pages and propose a
novel vision-based approach, Vision-based Data Extractor (ViDE), to extract
structured results from deep Web pages automatically. ViDE is primarily based
on the visual features human users can capture on the deep Web pages while also
utilizing
some simple nonvisual information such as data types and frequent symbols to
make the solution more robust. ViDE consists of two main components, Vision-based
Data Record extractor (ViDRE) and Vision-based Data Item extractor (ViDIE). By using visual features for data
extraction, ViDE avoids the limitations of those solutions that need to analyze
complex Web page source files. Our approach employs a four-step strategy.
First, given a sample deep Web page from a Web database, obtain its visual
representation and transform it into a Visual Block tree which will be
introduced later; second, extract data records from the Visual Block tree;
third, partition extracted data records into data items and align the data
items of the same semantic together; and fourth, generate visual wrappers (a
set of visual extraction rules) for the Web database based on sample deep Web
pages such that both data record extraction and data item extraction for new
deep Web pages that are from the same Web database can be carried out more
efficiently using the visual wrappers.
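The four-step strategy above can be sketched as a small pipeline. The sketch below is illustrative only: the class names, the `VisualBlock` fields, the stubbed page rendering, and the grouping rule are our assumptions for exposition, not ViDE's actual rules.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative skeleton of the four-step ViDE pipeline (names are assumptions).
public class VidePipeline {

    // A node of the Visual Block tree: a rendered block with layout geometry.
    static class VisualBlock {
        final int left, top, width, height;
        final List<VisualBlock> children = new ArrayList<>();
        VisualBlock(int left, int top, int width, int height) {
            this.left = left; this.top = top; this.width = width; this.height = height;
        }
    }

    // Step 1: transform a rendered sample page into its Visual Block tree.
    // A real implementation would query a browser's layout engine; here the
    // tree is hard-coded with three visually similar sibling blocks.
    static VisualBlock buildVisualBlockTree() {
        VisualBlock root = new VisualBlock(0, 0, 800, 2000);
        for (int i = 0; i < 3; i++) {
            root.children.add(new VisualBlock(100, 200 + i * 150, 600, 140));
        }
        return root;
    }

    // Step 2 (a crude stand-in for ViDRE): children of the data region that
    // share the same left edge and width are taken as data records.
    static List<VisualBlock> extractDataRecords(VisualBlock region) {
        List<VisualBlock> records = new ArrayList<>();
        if (region.children.isEmpty()) return records;
        VisualBlock first = region.children.get(0);
        for (VisualBlock b : region.children) {
            if (b.left == first.left && b.width == first.width) {
                records.add(b);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Steps 3 and 4 (item alignment and wrapper generation) would follow.
        List<VisualBlock> records = extractDataRecords(buildVisualBlockTree());
        System.out.println(records.size());
    }
}
```

Steps 3 and 4 are omitted from the sketch; they operate on the record blocks returned here.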
Advantages:
- In this paper, we introduce a vision-based approach for extracting data from deep Web pages that is Web-page-programming-language independent.
- Based on these visual features, we propose a novel vision-based approach to extract structured data from deep Web pages, which avoids the limitations of previous works.
- Visual wrapper generation produces wrappers that improve the efficiency of both data record extraction and data item extraction.
Architecture:
HARDWARE & SOFTWARE REQUIREMENTS:
HARDWARE REQUIREMENTS:
· System : Pentium IV 2.4 GHz
· Hard Disk : 40 GB
· Floppy Drive : 1.44 MB
· Monitor : 15" VGA Colour
· Mouse : Logitech
· RAM : 512 MB
SOFTWARE REQUIREMENTS:
· Operating System : Windows XP Professional
· Coding Language : Java (JSP & Servlets)
· Front End : JSP & Servlets
· Back End : Oracle 10g
Modules Description:
1. Web crawling and meta searching
2. Web data record and item extraction
3. Visual wrapper generation
4. Precision and recall
1. Web crawling and meta searching:
To make the data records and the data items in them machine-processable,
which is needed in many applications such as deep Web crawling and
metasearching, the structured data need to be extracted from the deep Web
pages. Each data record on a deep Web page corresponds to an
object. For instance, Fig. 1 shows a typical deep Web page from Amazon.com. On
this page, the books are presented in the form of data records, and each data
record contains some data items such as title, author, etc.
2. Web data record and item extraction:
Data record extraction aims to
discover the boundary of data records and extract them from the deep Web pages.
An ideal record extractor should achieve the following: 1) all data records in
the data region are extracted and 2) for each extracted data record, no data
item is missed and no incorrect data item is included.
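One way to picture the record-boundary criterion is that candidate blocks in a data region are kept as records only when they are visually similar to one another. The sketch below is a crude illustration under assumed rules (equal width, heights within 30 percent of each other); the paper's actual similarity measure is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: keep only candidate blocks visually similar to the first one.
// The similarity rule and threshold here are illustrative assumptions.
public class RecordSimilarity {

    // Each block is {width, height} in pixels.
    static boolean similar(int[] a, int[] b) {
        // Records in one data region are usually equally wide, and their
        // heights vary only moderately (threshold: 30% of the taller block).
        return a[0] == b[0]
            && Math.abs(a[1] - b[1]) <= 0.3 * Math.max(a[1], b[1]);
    }

    static List<int[]> groupRecords(List<int[]> blocks) {
        List<int[]> records = new ArrayList<>();
        for (int[] b : blocks) {
            if (records.isEmpty() || similar(records.get(0), b)) {
                records.add(b);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<int[]> blocks = Arrays.asList(
            new int[]{600, 140},  // book record
            new int[]{600, 120},  // book record, slightly shorter
            new int[]{600, 400}); // an ad banner, far too tall: excluded
        System.out.println(groupRecords(blocks).size());
    }
}
```

With the sample blocks above, the over-tall banner is rejected and two records remain.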
3. Visual wrapper generation:
Reapplying the full extraction process to every page has two problems. First,
the complex extraction processes are too slow to support real-time
applications. Second, the extraction processes would fail if there is only one
data record on the page.
Since all deep Web pages from the same Web database share the same visual
template, once the data records and data items on a deep Web page have been
extracted, we can use these extracted data records and data items to generate
the extraction wrapper for the Web database so that new deep Web pages from the
same Web database can be processed using the wrappers quickly without
reapplying the entire extraction process.
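The reuse idea amounts to caching one wrapper per Web database and applying it to later pages. A minimal sketch, assuming a per-database cache keyed by a database identifier (the class and field names are ours, not ViDE's):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of wrapper reuse: the full extraction runs once per Web database,
// and later pages from the same database reuse the cached wrapper.
public class WrapperCache {

    // Stand-in for a visual wrapper: a real one would hold the visual
    // extraction rules derived from the sample pages.
    static class VisualWrapper {
        final String databaseId;
        VisualWrapper(String databaseId) { this.databaseId = databaseId; }
    }

    private final Map<String, VisualWrapper> cache = new HashMap<>();

    // First call for a database builds the wrapper; later calls reuse it.
    VisualWrapper wrapperFor(String databaseId) {
        return cache.computeIfAbsent(databaseId, VisualWrapper::new);
    }

    public static void main(String[] args) {
        WrapperCache cache = new WrapperCache();
        VisualWrapper w1 = cache.wrapperFor("amazon-books");
        VisualWrapper w2 = cache.wrapperFor("amazon-books");
        System.out.println(w1 == w2); // the second page reuses the wrapper
    }
}
```

The design point is simply that wrapper generation is an amortized cost: all deep Web pages from one database share one visual template, so one wrapper serves them all.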
4. Precision and recall:
The basic idea of our vision-based data item wrapper is as follows. Given a
sequence of attributes {a1, a2, ..., an} obtained from the sample page, each
described by its features (f, l, d), and a sequence of data items
{item1, item2, ..., itemm} obtained from a new data record, the wrapper
processes the data items in order to decide which attribute the current data
item can be matched to. For item_i and a_j, if they are the same on f, l, and
d, their match is recognized, and the wrapper next judges whether item_i+1 and
a_j+1 are matched; if item_i and a_j do not match, it judges item_i and a_j+1.
This process repeats until all data items are matched to their correct
attributes.
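The matching loop above advances both sequences on a match and skips an attribute on a mismatch (an attribute may be absent from a given record). A minimal sketch, assuming (f, l, d) stand for font, layout position, and data type, encoded here as plain strings purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the (f, l, d) data item matching loop. The feature encoding
// (strings for font, layout, and data type) is an illustrative assumption.
public class ItemMatcher {

    static class Features {
        final String font, layout, dataType;
        Features(String f, String l, String d) { font = f; layout = l; dataType = d; }
        boolean matches(Features o) {
            return font.equals(o.font) && layout.equals(o.layout)
                && dataType.equals(o.dataType);
        }
    }

    // Returns, for each data item, the index of the matched attribute
    // (or -1 if none matched). Both sequences are in page order, so the
    // scan only moves forward: match -> advance both, mismatch -> skip
    // the attribute (it is missing from this record).
    static int[] match(List<Features> attrs, List<Features> items) {
        int[] assignment = new int[items.size()];
        int j = 0; // current attribute index
        for (int i = 0; i < items.size(); i++) {
            while (j < attrs.size() && !items.get(i).matches(attrs.get(j))) {
                j++;
            }
            assignment[i] = (j < attrs.size()) ? j : -1;
            j++;
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<Features> attrs = new ArrayList<>();
        attrs.add(new Features("bold", "line-start", "text"));   // title
        attrs.add(new Features("plain", "inline", "text"));      // author
        attrs.add(new Features("plain", "inline", "currency"));  // price
        List<Features> items = new ArrayList<>();
        items.add(new Features("bold", "line-start", "text"));   // title
        items.add(new Features("plain", "inline", "currency"));  // price only
        int[] a = match(attrs, items);
        System.out.println(a[0] + "," + a[1]);
    }
}
```

In the usage example, the second record has no author, so its price item skips the author attribute and matches attribute index 2, printing `0,2`.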
Algorithm:
· The algorithm of blocks regrouping.
· The algorithm of data item matching.
· The algorithm of data item alignment.