A Two-Step Method for Clustering Mixed Categorical And Numeric Data
Abstract:
Various clustering algorithms have been developed to group data into clusters in diverse domains. However, these algorithms work effectively on either purely numeric or purely categorical data; most of them perform poorly on mixed categorical and numeric data. In this paper, a new two-step clustering method is presented to find clusters in this kind of data. In this approach, the items in categorical attributes are first processed to construct the similarities or relationships among them based on the idea of co-occurrence; all categorical attributes are then converted into numeric attributes based on these relationships. Once all categorical data have been converted into numeric form, existing clustering algorithms can be applied to the data set directly. Because existing clustering algorithms each have weaknesses of their own, the proposed two-step method integrates hierarchical and partitioning clustering algorithms and adds attributes to the objects being clustered. The method defines the relationships among items and mitigates the weaknesses of applying a single clustering algorithm. Experimental evidence shows that robust results can be achieved by applying this method to cluster mixed numeric and categorical data.
Existing System:
Various clustering applications have emerged in diverse domains. However, most traditional clustering algorithms are designed for either numeric data or categorical data, whereas data collected in the real world often contain both numeric and categorical attributes. It is difficult to apply traditional clustering algorithms directly to such data. Typically, when people need to apply traditional distance-based clustering algorithms (e.g., k-means) to mixed data, a numeric value is assigned to each category of the categorical attributes. Some categorical values, for example “low”, “medium” and “high”, can easily be mapped to numeric values. But values such as “red”, “white” and “blue” have no natural order, so assigning numeric values to these kinds of categorical attributes is a challenging task.
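The difficulty with unordered categories can be seen in a small Java sketch. It is illustrative only: the class `NaiveEncodingDemo` and its methods are hypothetical names, and the encoding (1-based position in a list of levels) is just one naive choice.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: class and method names are hypothetical.
final class NaiveEncodingDemo {

    // Encode a category as its 1-based position in a fixed list of levels.
    static int encode(List<String> levels, String value) {
        int index = levels.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        return index + 1;
    }

    // Distance between two categories after the naive encoding.
    static int distance(List<String> levels, String a, String b) {
        return Math.abs(encode(levels, a) - encode(levels, b));
    }

    public static void main(String[] args) {
        List<String> ordinal = Arrays.asList("low", "medium", "high");
        List<String> nominal = Arrays.asList("red", "white", "blue");

        // Meaningful: "low" really is further from "high" than from "medium".
        System.out.println(distance(ordinal, "low", "high"));   // prints 2

        // Arbitrary: nothing makes "red" closer to "white" than to "blue",
        // yet the encoding says so, and k-means would act on it.
        System.out.println(distance(nominal, "red", "white"));  // prints 1
        System.out.println(distance(nominal, "red", "blue"));   // prints 2
    }
}
```

Reordering the nominal list changes the distances, which shows that they come from the encoding rather than from the data.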
Disadvantages:
1. The major problem of existing clustering algorithms is that most of them treat every attribute as a single entity and ignore the relationships among attributes, even though such relationships may exist.
2. Traditional clustering algorithms cannot handle mixed categorical and numeric data effectively.
3. By contrast, the experimental results show that the proposed approach can achieve high-quality clustering results.
Proposed System:
The TMCM algorithm is based on the above observation and produces purely numeric attributes. A sample data set is used to illustrate the proposed ideas. The first step in the proposed approach is to read the input data and normalize the values of the numeric attributes into the range zero to one. The goal of this step is to prevent attributes with a large range of values from dominating the clustering results. Additionally, the categorical attribute A with the largest number of items is selected as the base attribute, and the items appearing in the base attribute are defined as base items. This strategy ensures that a non-base item can map to multiple base items. If an attribute with fewer items were selected as the base attribute, the probability of mapping several non-base items to the same base items would be higher, which could cause different categorical items to receive the same numeric value.
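The two preprocessing steps described above can be sketched in Java as follows. This is a minimal illustration, not the authors' implementation: the class name `TmcmPreprocessing`, the method names, the column-wise data layout, the handling of constant columns, and the tie-break between attributes with equally many items are all assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; names and data layout are assumptions.
final class TmcmPreprocessing {

    // Min-max normalization of one numeric attribute into the range [0, 1].
    static double[] normalize(double[] column) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] scaled = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            // Assumption: a constant column carries no information, so map it to 0.
            scaled[i] = (range == 0.0) ? 0.0 : (column[i] - min) / range;
        }
        return scaled;
    }

    // Returns the index of the categorical attribute with the most distinct
    // items. Each inner array holds one attribute's values for all objects.
    // Assumption: ties go to the earliest attribute.
    static int selectBaseAttribute(String[][] categoricalColumns) {
        int best = -1;
        int bestCount = -1;
        for (int c = 0; c < categoricalColumns.length; c++) {
            Set<String> items = new HashSet<String>();
            for (String item : categoricalColumns[c]) {
                items.add(item);
            }
            if (items.size() > bestCount) {
                bestCount = items.size();
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scaled = normalize(new double[] {10.0, 20.0, 30.0});
        System.out.println(scaled[0] + " " + scaled[1] + " " + scaled[2]);

        String[][] columns = {
            {"a", "b", "a"},   // 2 distinct items
            {"x", "y", "z"}    // 3 distinct items, so this is the base attribute
        };
        System.out.println(selectBaseAttribute(columns));
    }
}
```

Choosing the attribute with the most items gives the largest set of base items, which is what lowers the chance that two different non-base items end up with the same numeric value.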
Advantages:
1) Clustering is considered an important tool for data mining. The goal of data clustering is to divide the data set into several groups such that objects in the same group are highly similar to each other and highly dissimilar to objects in other groups.
2) The TMCM algorithm is based on the above observation and produces purely numeric attributes.
3) The TMCM algorithm integrates HAC and k-means clustering algorithms to cluster mixed-type data. Applying other algorithms or more sophisticated similarity measures within TMCM may yield better results.
Software Requirements Specification:
Software Requirements:
Front End : Java Swing
Back End : No Database
IDE : MyEclipse 8.0
Language : Java (jdk1.6.0)
Operating System : Windows XP
Hardware Requirements:
System : Pentium IV 2.4 GHz
Hard Disk : 80 GB
Floppy Drive : 1.44 MB
Monitor : 14" Colour Monitor
Mouse : Optical Mouse
RAM : 512 MB
Keyboard : 101-key Keyboard
Module Description:
- In the first clustering step, similar objects are grouped into subsets, and these subsets are treated as objects that are input to the second clustering step. Noise and outliers can thus be smoothed out in the k-means clustering process.
- The added attributes not only offer useful information for clustering but also reduce the influence of noise and outliers.
- In the second clustering step, the initial centroids are chosen from groups of similar objects. This strategy is expected to be better than the random selection used in most applications.
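The two clustering steps can be sketched in Java as below. This is a simplified illustration under stated assumptions, not the TMCM implementation: the class `TwoStepSketch` and its methods are hypothetical, the first step uses centroid-linkage agglomeration on squared Euclidean distance, the added attributes are omitted, each subset is represented only by its centroid, and the k-means seeds are simply the first `k` subset centroids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names, linkage, and seeding rule are assumptions.
final class TwoStepSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Step 1: merge the two closest subsets until numSubsets remain.
    // Returns the centroid of each remaining subset.
    static List<double[]> agglomerate(double[][] points, int numSubsets) {
        if (numSubsets < 1) throw new IllegalArgumentException("numSubsets must be >= 1");
        List<double[]> centroids = new ArrayList<double[]>();
        List<Integer> sizes = new ArrayList<Integer>();
        for (double[] p : points) {
            centroids.add(p.clone());   // every object starts as its own subset
            sizes.add(1);
        }
        while (centroids.size() > numSubsets) {
            int bestI = 0, bestJ = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < centroids.size(); i++) {
                for (int j = i + 1; j < centroids.size(); j++) {
                    double dist = squaredDistance(centroids.get(i), centroids.get(j));
                    if (dist < best) { best = dist; bestI = i; bestJ = j; }
                }
            }
            // Merge subset bestJ into subset bestI using a size-weighted mean.
            double[] a = centroids.get(bestI);
            double[] b = centroids.get(bestJ);
            int na = sizes.get(bestI), nb = sizes.get(bestJ);
            for (int d = 0; d < a.length; d++) {
                a[d] = (a[d] * na + b[d] * nb) / (na + nb);
            }
            sizes.set(bestI, na + nb);
            centroids.remove(bestJ);    // remove(int index), since bestJ is an int
            sizes.remove(bestJ);
        }
        return centroids;
    }

    // Step 2: k-means over the subsets, seeded with subset centroids
    // instead of random objects. Returns one cluster label per subset.
    static int[] kMeans(List<double[]> subsets, int k, int maxIterations) {
        if (k < 1 || k > subsets.size()) throw new IllegalArgumentException("bad k");
        int dim = subsets.get(0).length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = subsets.get(c).clone();
        int[] labels = new int[subsets.size()];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int s = 0; s < subsets.size(); s++) {
                int nearest = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(subsets.get(s), centers[c])
                            < squaredDistance(subsets.get(s), centers[nearest])) {
                        nearest = c;
                    }
                }
                if (labels[s] != nearest) { labels[s] = nearest; changed = true; }
            }
            if (iter > 0 && !changed) break;   // assignments are stable
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int s = 0; s < subsets.size(); s++) {
                counts[labels[s]]++;
                for (int d = 0; d < dim; d++) sums[labels[s]][d] += subsets.get(s)[d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;  // keep an empty cluster's old center
                for (int d = 0; d < dim; d++) centers[c][d] = sums[c][d] / counts[c];
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10} };
        List<double[]> subsets = agglomerate(points, 3);
        System.out.println(Arrays.toString(kMeans(subsets, 2, 20)));  // prints [0, 1, 1]
    }
}
```

Because k-means only sees subset centroids, a single outlying object has less pull on the final centers than it would if every object were clustered directly, which is the smoothing effect the first bullet describes.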
Algorithm:
1. K-means algorithm
2. K-medoids algorithm
3. Agglomerative algorithm