FiVaTech: Page-Level Web Data Extraction From Template Pages
Abstract
Web data extraction has been an important part of many Web data analysis
applications. In this paper, we
formulate the data extraction problem as the decoding process of page
generation based on structured data and tree templates. We propose an
unsupervised, page-level data extraction approach to deduce the schema and
templates for each individual Deep Website, which contains either singleton or
multiple data records in one Webpage. FiVaTech applies tree matching, tree
alignment, and mining techniques to achieve the challenging task. In
experiments, FiVaTech has much higher precision than EXALG and is comparable
with other record-level extraction systems like ViPER and MSE. The experiments
show an encouraging result for the test pages used in many state-of-the-art Web
data extraction works.
Existing System:
Generally speaking, templates, as a common model for all pages, remain largely
fixed, whereas data values vary across pages. Finding such a common template
requires either multiple pages or a single page containing multiple records as
input. When multiple pages are given, the extraction target is page-wide
information (e.g., RoadRunner and EXALG). When single pages are given, the
extraction target is usually constrained to record-wide information, which
involves the additional issue of record-boundary detection. Page-level
extraction tasks, although they do not involve the additional problem of
boundary detection, are much more complicated than record-level extraction
tasks since more data are concerned. A common technique used to find the
template is alignment: either string alignment (e.g., IEPAD, RoadRunner) or
tree alignment (e.g., DEPTA). As for the problem of distinguishing template
from data, most approaches assume that HTML tags are part of the template,
while EXALG considers a general model in which word tokens can also be part of
the template and tag tokens can also be data. However, EXALG’s approach,
without explicit use of alignment, produces many accidental equivalence
classes, making the reconstructed schema incomplete.
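The string-alignment idea mentioned above can be illustrated with a minimal sketch. It uses Python's standard difflib.SequenceMatcher as a stand-in aligner; this is not the algorithm used by IEPAD, RoadRunner, or FiVaTech. The token lists and the function name split_template_and_data are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def split_template_and_data(
    page_a: List[str], page_b: List[str]
) -> Tuple[List[str], List[Tuple[List[str], List[str]]]]:
    """Align two token lists: shared tokens are template candidates,
    stretches where the pages differ are data candidates."""
    matcher = SequenceMatcher(None, page_a, page_b, autojunk=False)
    template: List[str] = []
    data: List[Tuple[List[str], List[str]]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            template.extend(page_a[i1:i2])
        else:
            # 'replace', 'delete', or 'insert': the pages disagree here.
            data.append((page_a[i1:i2], page_b[j1:j2]))
    return template, data


# Hypothetical token streams from two pages of the same site.
page_1 = ["<b>", "Title:", "</b>", "Dune", "<br>"]
page_2 = ["<b>", "Title:", "</b>", "Emma", "<br>"]
template, data = split_template_and_data(page_1, page_2)
```

In this toy input the word token "Title:" lands in the template alongside the tags, which is the kind of case EXALG's general model allows for.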
Disadvantages:
• Complex Schema: The “schema” of the information encoded in the web pages
could be very complex, with arbitrary levels of nesting. For instance, each book
page can contain a set of authors, with each author having a set of addresses,
and so on.
• Template vs. Data: Syntactically, nothing distinguishes the text that is part
of the template from the text that is part of the data.
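To make the nested-schema point concrete, here is a minimal sketch of the book example as a data instance. The class and field names (Book, Author, addresses) and the sample values are illustrative assumptions, not part of FiVaTech.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Author:
    name: str
    # A set nested inside another set: each author has several addresses.
    addresses: List[str] = field(default_factory=list)


@dataclass
class Book:
    title: str
    authors: List[Author] = field(default_factory=list)


def nesting_depth(book: Book) -> int:
    """Count how many levels of set nesting this instance actually uses."""
    if not book.authors:
        return 0
    return 2 if any(author.addresses for author in book.authors) else 1


# Hypothetical instance: two authors, only one of whom lists addresses.
book = Book(
    "Example Title",
    [Author("A. Writer", ["Address 1", "Address 2"]), Author("B. Writer")],
)
```

An extractor working at page level has to recover both levels of repetition (authors, then addresses per author) from the rendered HTML alone.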
Proposed System:
In this paper, we focus on page-level extraction
tasks and propose a new approach, called FiVaTech, to automatically detect the
schema of a Website. The proposed technique presents a new structure, called
fixed/variant pattern tree, a tree that carries all of the required information
needed to identify the template and detect the data schema. We combine several
techniques: alignment, pattern mining, as well as the idea of tree templates to
solve the much more difficult problem of page-level template construction. In
experiments, FiVaTech has much higher precision than EXALG, one of the few
page-level extraction systems, and is comparable with other record-level
extraction systems like ViPER and MSE.
Advantages:
· We focus on page-level extraction tasks and propose a new approach, called
FiVaTech, to automatically detect the schema of a Website.
· The proposed technique presents a new structure, called fixed/variant pattern
tree, a tree that carries all of the required information needed to identify
the template and detect the data schema.
Architecture:
General Description of EXALG
HARDWARE & SOFTWARE REQUIREMENTS:
HARDWARE REQUIREMENTS:
· System : Pentium IV 2.4 GHz.
· Hard Disk : 40 GB.
· Floppy Drive : 1.44 MB.
· Monitor : 15 VGA Color.
· Mouse : Logitech.
· RAM : 512 MB.
SOFTWARE REQUIREMENTS:
· Operating System : Windows XP Professional.
· Coding Language : Java (JDK 1.6.0)
· Front End : Struts Framework
· Back End : Oracle 10g
· IDE : MyEclipse 8.0
Modules Description:
User Registration:
A user registers by supplying the required (sensitive) details, which are
stored in the database. The user then receives credentials such as a username
and password.
Input Web pages:
Template pages are generated by embedding a data instance in a predefined
template via a CGI program. Thus, the reverse engineering of finding the
template and the data schema from the input Web pages should be based on some
page generation model, which we describe next. In this paper, we propose a
tree-based page generation model, which encodes data by subtree concatenation
instead of string concatenation. This is because both data schemas and Web
pages are tree-like structures. Thus, we also consider templates as tree
structures. The advantage of the tree-based page generation model is that it
does not include ending tags (e.g., </html>, </body>) in the templates, as the
string-based page generation model applied in EXALG does.
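The idea of encoding data by subtree concatenation can be sketched as follows. This is a simplified illustration, not the paper's page generation model: the Node and Slot classes, the slot names, and the sample data are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Slot:
    name: str  # hypothetical placeholder where data subtrees are attached


@dataclass
class Node:
    tag: str  # element name, or the text itself when is_text is True
    children: List[Union["Node", Slot]] = field(default_factory=list)
    is_text: bool = False


def text(value: str) -> Node:
    return Node(value, is_text=True)


def generate(template: Node, data: Dict[str, List[Node]]) -> Node:
    """Return a page tree with every Slot replaced by its data subtrees."""
    page = Node(template.tag, is_text=template.is_text)
    for child in template.children:
        if isinstance(child, Slot):
            # Subtree concatenation: append whole subtrees under the parent.
            page.children.extend(data.get(child.name, []))
        else:
            page.children.append(generate(child, data))
    return page


def to_html(node: Node) -> str:
    if node.is_text:
        return node.tag
    inner = "".join(to_html(child) for child in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


template = Node("html", [Node("body", [
    Node("h1", [Slot("title")]),
    Node("ul", [Slot("items")]),
])])
data = {
    "title": [text("Books")],
    "items": [Node("li", [text("A")]), Node("li", [text("B")])],
}
page_html = to_html(generate(template, data))
```

Closing tags appear only when the finished tree is serialized by to_html; the template itself never stores them, which mirrors the advantage claimed for the tree-based model.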
DOM Trees Creation:
The first module merges all input DOM trees at the same time into a structure
called fixed/variant
pattern tree, which can then be used to detect the template and the schema of
the Website in the second module. In this section, we will introduce how input
DOM trees can be recognized and merged into the pattern tree for schema
detection.
Tree Merging:
In the peer node recognition step, two nodes with the same tag name are
compared to check whether they are peer subtrees. All peer subtrees are denoted
by the same symbol. In the matrix alignment step, the system tries to align
nodes (symbols) in the peer matrix to get a list of aligned nodes, childList.
In addition to alignment, the other important task is to recognize variant leaf
nodes that correspond to basic-typed data. In the pattern mining step, the
system takes the aligned childList as input and detects every repetitive
pattern in this list, starting with length 1. For each detected repetitive
pattern, all occurrences of the pattern except the first are deleted so that
longer repeats can be mined. The result of this mining step is a modified list
of nodes without any repetitive patterns. In the last step, the system
recognizes a node as optional if it disappears in some columns of the matrix,
and groups nodes according to their occurrence vectors. After these four steps,
the system inserts the nodes in the modified childList as children of P. For
each nonleaf child node c, if c is not a fixed template tree (as defined in the
next section), the algorithm recursively calls the tree merging algorithm with
the peer subtrees of c (obtained by calling procedure peerNode(c, M), which
returns the nodes in M having the same symbol as c) to build the pattern tree.
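The pattern mining step can be sketched as below. This is a simplified reading of the description above, assuming that only adjacent (tandem) repeats are collapsed; the function name collapse_repeats and the sample symbols are hypothetical, and the paper's actual mining procedure may differ in its details.

```python
from typing import List, Sequence, Tuple


def collapse_repeats(
    symbols: Sequence[str],
) -> Tuple[List[str], List[Tuple[str, ...]]]:
    """Collapse adjacent repeats, shortest patterns first.

    Returns the reduced symbol list and the repeated patterns found.
    """
    items = list(symbols)
    patterns: List[Tuple[str, ...]] = []
    length = 1
    # Patterns longer than half the list cannot occur twice.
    while length <= len(items) // 2:
        i = 0
        while i + 2 * length <= len(items):
            unit = items[i:i + length]
            if items[i + length:i + 2 * length] == unit:
                # Keep the first occurrence, delete every adjacent copy.
                while items[i + length:i + 2 * length] == unit:
                    del items[i + length:i + 2 * length]
                if tuple(unit) not in patterns:
                    patterns.append(tuple(unit))
            i += 1
        length += 1
    return items, patterns


# Hypothetical aligned child list: "b" repeats, then "a b c" repeats.
reduced, found = collapse_repeats(["a", "b", "b", "c", "a", "b", "c"])
```

Collapsing the length-1 repeat of "b" first is what exposes the longer repeat "a b c", which is why the mining proceeds from short patterns to long ones.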
Peer Matrix Alignment:
After peer node recognition, all peer subtrees are given the same symbol. For
leaf nodes, two text nodes take the same symbol when they have the same text
values, and two <img> tag nodes take the same symbol when they have the same
SRC attribute values. To convert M into an aligned peer matrix, we work row by
row so that each row (ignoring empty columns) either has the same symbol in
every column or consists of text nodes with variant text values (or <img>
nodes with variant SRC attribute values). In the latter case, the row is marked
as basic-typed. From the aligned matrix M, we get a list of nodes, where each
node corresponds to a row in the aligned matrix.
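The row conditions of an aligned peer matrix, together with the occurrence vectors used for optional nodes, can be checked with a small sketch. It does not perform the alignment itself; it only inspects rows that are assumed to be aligned already. The cell encoding and the function names are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# A cell is None (the node is missing in that column) or a (kind, value) pair:
# kind is "element", "text", or "img"; value is the peer symbol, the text
# value, or the SRC attribute. This encoding is an assumption of the sketch.
Cell = Optional[Tuple[str, str]]


def classify_row(row: List[Cell]) -> str:
    """Label a row as 'fixed', 'basic-typed', or 'unaligned'."""
    present = [cell for cell in row if cell is not None]
    if not present:
        raise ValueError("a row needs at least one non-empty column")
    if all(cell == present[0] for cell in present):
        return "fixed"  # same symbol in every non-empty column
    kinds = {kind for kind, _ in present}
    if len(kinds) == 1 and kinds <= {"text", "img"}:
        return "basic-typed"  # variant text or variant SRC values
    return "unaligned"


def occurrence_vector(row: List[Cell]) -> Tuple[int, ...]:
    """1 where the node occurs in a column, 0 where it is missing."""
    return tuple(0 if cell is None else 1 for cell in row)


def is_optional(row: List[Cell]) -> bool:
    return 0 in occurrence_vector(row)


# Hypothetical rows; each column corresponds to one input DOM tree.
fixed_row = [("element", "A"), ("element", "A"), ("element", "A")]
price_row = [("text", "Price: 10"), ("text", "Price: 12"), ("text", "Price: 9")]
logo_row = [("img", "logo.png"), None, ("img", "logo.png")]
```

Rows that share the same occurrence vector are candidates for being grouped together, as described for the optional-node step.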
Algorithms:
Multiple trees merging algorithm