Record matching over query results from multiple web databases
Abstract:
Record matching, which identifies the records that represent the same real-world entity, is an important step in data integration. Most state-of-the-art record matching methods are supervised and require the user to provide training data. These methods are not applicable in the Web database scenario, where the records to match are query results dynamically generated on the fly. Such records are query-dependent, and a pre-learned method using training examples from previous query results may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates among the query result records of multiple Web databases. After removal of the same-source duplicates, the "presumed" non-duplicate records from the same source can be used as training examples, alleviating the burden of users having to manually label training data. Starting from the non-duplicate set, we use two cooperating classifiers, a weighted component similarity summing (WCSS) classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well for the Web database scenario, where existing supervised methods do not apply.
Existing System:
When designing a system that helps users integrate and, more importantly, compare the query results returned from multiple Web databases, a crucial task is to match the different sources' records that refer to the same real-world entity. For example, consider the query results returned by two online bookstores, booksamillion.com and abebooks.com, in response to the same query "Harry Potter" over the Title field. Before comparing the results (records), we have to determine the decision-making attributes, i.e., the weights of the attributes. Until now, this comparison has mainly depended on supervised record matching methods, which require the user to provide training data. These methods are based on predefined matching rules hand-coded by domain experts or on matching rules learned offline by a learning method from a set of training examples. Such approaches work well in a traditional database environment, where all instances of the target databases can be readily accessed, as long as a set of high-quality representative records can be examined by experts or selected for the user to label.
Disadvantages:
- The main challenge of this system is reducing the duplicate records returned from different URLs.
- Existing record matching methods are supervised and require the user to provide training data. They are not applicable in the Web database scenario, where the records to match are query results dynamically generated on the fly.
- Most existing work requires human-labeled training data (positive examples, negative examples, or both), which places a heavy burden on users.
- Web database records (results) are query-dependent, so a pre-learned (supervised) method using training examples from previous query results may fail on the results of a new query.
Proposed System:
In the Web database scenario, the records to match are highly query-dependent, since they can only be obtained through online queries. Moreover, they are only a partial and biased portion of all the data in the source Web databases. Consequently, hand-coding or offline-learning (supervised) approaches are not appropriate, for two reasons. First, the full data set is not available beforehand, and therefore good representative data for training are hard to obtain. Second, and most importantly, even if good representative data are found and labeled for learning, the rules learned on the representatives of a full data set may not work well on a partial and biased part of that data set. To illustrate this problem, consider a query for books by a specific author, such as "J. K. Rowling." Depending on how the Web databases process such a query, all the result records for this query may well have only "J. K. Rowling" as the value of the Author field. In this case, the Author field of these records is ineffective for distinguishing the records that should be matched from those that should not. To reduce the influence of such fields in determining which records should match, their weighting should be adjusted to be much lower than the weighting of other fields, or even zero. Moreover, for each new query, depending on the results returned, the field weights should probably change too, which is not possible in supervised-learning-based methods.
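As a minimal sketch of this idea, the following Java snippet lowers the weight of any field whose values barely vary across the result records of a query. The distinct-value-ratio heuristic used here is our own illustrative assumption, not the exact weighting scheme used by UDD.

```java
import java.util.*;

// Sketch: lower the weight of a field whose values are (nearly) identical
// across all result records, e.g. the Author field for the query "J. K. Rowling".
// The weighting heuristic (distinct-value ratio) is an illustrative assumption.
public class FieldWeighting {

    /** Returns a weight per field: 0 when every record shares one value,
     *  approaching 1 as the values become more varied. */
    static double[] fieldWeights(List<String[]> records, int numFields) {
        double[] weights = new double[numFields];
        for (int f = 0; f < numFields; f++) {
            Set<String> distinct = new HashSet<String>();
            for (String[] r : records) {
                distinct.add(r[f].trim().toLowerCase());
            }
            // One distinct value => field cannot distinguish records => weight 0.
            weights[f] = (distinct.size() - 1) / (double) Math.max(1, records.size() - 1);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String[]> results = Arrays.asList(
            new String[]{"Harry Potter and the Goblet of Fire", "J. K. Rowling"},
            new String[]{"Harry Potter and the Chamber of Secrets", "J. K. Rowling"},
            new String[]{"Harry Potter and the Goblet of Fire", "J. K. Rowling"});
        System.out.println(Arrays.toString(fieldWeights(results, 2)));
        // Title gets a non-zero weight; Author collapses to 0 for this query.
    }
}
```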
Advantages:
- By using unsupervised methods, we overcome the training-data problem that makes supervised methods inapplicable in the Web database scenario.
- We propose a new record matching method, Unsupervised Duplicate Detection (UDD), for the specific record matching problem of identifying duplicates among records in query results from multiple Web databases.
- Two classifiers, WCSS and SVM, are used cooperatively in record matching to iteratively identify the duplicate pairs from all potential duplicate pairs.
- Experimental results show that our approach achieves higher performance than previous work that requires training examples.
Architecture:
[System architecture diagram]
Software Requirements Specification:
Software Requirements:
- Front End: JSP, Servlet
- Back End: Oracle 10g
- IDE: MyEclipse 8.0
- Language: Java (JDK 1.6.0)
- Operating System: Windows XP
Hardware Requirements:
- System: Pentium IV, 2.4 GHz
- Hard Disk: 80 GB
- Floppy Drive: 1.44 MB
- Monitor: 14" Colour Monitor
- Mouse: Optical Mouse
- RAM: 512 MB
- Keyboard: 101-key Keyboard
Modules Description:
- Get the records from multiple web databases
- Identifying the similarity function
- Unsupervised Duplicate Detection
Get the records from multiple web databases:
In this project, our focus is on getting records from multiple Web databases of the same domain, i.e., Web databases that provide the same type of records in response to user queries. Suppose there are s records in data source A and t records in data source B, with each record having a set of fields/attributes. Each of the t records in data source B can potentially be a duplicate of each of the s records in data source A, so there are s x t candidate pairs to consider, as sketched below.
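The following Java sketch enumerates those s x t candidate pairs. The Record class and its field layout are assumptions made for illustration; they are not part of the paper.

```java
import java.util.*;

// Sketch: every record from source B is a potential duplicate of every record
// from source A, giving s * t candidate pairs before any pruning.
public class CandidatePairs {

    static class Record {
        final String source;    // which Web database the record came from
        final String[] fields;  // e.g. {title, author}
        Record(String source, String... fields) {
            this.source = source;
            this.fields = fields;
        }
    }

    /** Pairs each of the s records in A with each of the t records in B. */
    static List<Record[]> crossPairs(List<Record> a, List<Record> b) {
        List<Record[]> pairs = new ArrayList<Record[]>(a.size() * b.size());
        for (Record ra : a) {
            for (Record rb : b) {
                pairs.add(new Record[]{ra, rb});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Record> a = Arrays.asList(
            new Record("booksamillion.com", "Harry Potter", "J. K. Rowling"));
        List<Record> b = Arrays.asList(
            new Record("abebooks.com", "Harry Potter", "J.K. Rowling"),
            new Record("abebooks.com", "Harry Potter 2", "J. K. Rowling"));
        System.out.println(crossPairs(a, b).size() + " candidate pairs"); // 1 * 2 = 2
    }
}
```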
Identifying the similarity function:
In the Web database scenario, the records to match are highly query-dependent, since they can only be obtained through online queries. Consequently, hand-coding or offline-learning approaches are not applicable. In this module, to identify the similarity between records, we assign a weight to each attribute contained in the records. Based on these attribute weights, we compute the similarity between records and determine the duplicates, as in the sketch below.
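This sketch shows a weighted component similarity summing (WCSS)-style score: each attribute contributes its similarity scaled by its weight. The per-field similarity measure (a simple token-overlap/Jaccard measure) is our assumption; the paper's exact measure may differ.

```java
import java.util.*;

// Sketch: record similarity = sum over fields of weight_i * sim(field_i).
// Weights are assumed to be normalized so they sum to 1.
public class WeightedSimilarity {

    /** Jaccard similarity between the token sets of two field values. */
    static double fieldSim(String x, String y) {
        Set<String> a = new HashSet<String>(Arrays.asList(x.toLowerCase().split("\\s+")));
        Set<String> b = new HashSet<String>(Arrays.asList(y.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<String>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : inter.size() / (double) union.size();
    }

    /** Weighted sum of per-field similarities between two records. */
    static double recordSim(String[] r1, String[] r2, double[] weights) {
        double score = 0.0;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * fieldSim(r1[i], r2[i]);
        }
        return score;
    }

    public static void main(String[] args) {
        String[] r1 = {"Harry Potter and the Goblet of Fire", "J. K. Rowling"};
        String[] r2 = {"Harry Potter & the Goblet of Fire", "Rowling J. K."};
        double[] w = {0.7, 0.3};   // title weighted higher than author
        System.out.printf("similarity = %.3f%n", recordSim(r1, r2, w));
    }
}
```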
Unsupervised Duplicate Detection:
An important aspect of duplicate detection is reducing the number of record-pair comparisons. Several methods have been proposed for this purpose, including standard blocking, the sorted neighborhood method, bigram indexing, and record clustering. Even though these methods differ in how they partition the data set into blocks, they all considerably reduce the number of comparisons by comparing only records from the same block. Since any of these methods can be incorporated into UDD to reduce the number of record-pair comparisons, we do not consider this issue further; a sketch of standard blocking follows.
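The sketch below shows standard blocking, one of the methods named above: records are grouped by a blocking key, and only records in the same block are compared. The key choice (first token of the title field) is an illustrative assumption.

```java
import java.util.*;

// Sketch of standard blocking: partition records by a blocking key so that
// only records in the same block are compared, cutting pairwise comparisons.
public class Blocking {

    /** Partitions records into blocks keyed by the first title token. */
    static Map<String, List<String[]>> block(List<String[]> records) {
        Map<String, List<String[]>> blocks = new HashMap<String, List<String[]>>();
        for (String[] r : records) {
            String key = r[0].toLowerCase().split("\\s+")[0];  // blocking key
            List<String[]> bucket = blocks.get(key);
            if (bucket == null) {
                bucket = new ArrayList<String[]>();
                blocks.put(key, bucket);
            }
            bucket.add(r);
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[]{"Harry Potter", "J. K. Rowling"},
            new String[]{"Harry Potter 2", "J. K. Rowling"},
            new String[]{"Hobbit", "J. R. R. Tolkien"});
        // Records keyed "harry" never get compared against those keyed "hobbit".
        for (Map.Entry<String, List<String[]>> e : block(records).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " record(s)");
        }
    }
}
```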
Algorithm:
· Duplicate vector identification algorithm
· Component weight assignment algorithm
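The following is a high-level sketch of how these two algorithms might cooperate in the iterative loop described in the abstract: component weights are reassigned, the WCSS classifier flags confident duplicates, and the SVM, trained on those duplicates versus the presumed non-duplicates, finds more, until no new duplicates appear. Pair, WcssClassifier, and SvmClassifier are hypothetical placeholders, not the paper's actual API, and the exact control flow in the paper may differ.

```java
import java.util.*;

// Hypothetical sketch of the iterative UDD loop, based only on this document.
public class UddLoop {

    static class Pair { /* a record pair with its field-similarity vector */ }

    interface WcssClassifier {
        /** Component weight assignment: recompute field weights from the
         *  current non-duplicate and duplicate sets. */
        void assignWeights(Set<Pair> nonDuplicates, Set<Pair> duplicates);
        /** Duplicate vector identification via weighted similarity summing. */
        Set<Pair> identifyDuplicates(Set<Pair> candidates);
    }

    interface SvmClassifier {
        void train(Set<Pair> positives, Set<Pair> negatives);
        Set<Pair> identifyDuplicates(Set<Pair> candidates);
    }

    static Set<Pair> run(Set<Pair> potential, Set<Pair> nonDuplicates,
                         WcssClassifier wcss, SvmClassifier svm) {
        Set<Pair> duplicates = new HashSet<Pair>();
        Set<Pair> found;
        do {
            // 1. Re-derive component weights, then let WCSS flag duplicates.
            wcss.assignWeights(nonDuplicates, duplicates);
            found = wcss.identifyDuplicates(potential);
            duplicates.addAll(found);
            potential.removeAll(found);
            // 2. Train the SVM on duplicates vs. presumed non-duplicates and
            //    let it identify further duplicates among the remaining pairs.
            if (!duplicates.isEmpty()) {
                svm.train(duplicates, nonDuplicates);
                Set<Pair> more = svm.identifyDuplicates(potential);
                found.addAll(more);
                duplicates.addAll(more);
                potential.removeAll(more);
            }
        } while (!found.isEmpty());   // iterate until no new duplicates appear
        return duplicates;
    }
}
```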