Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud
Abstract
In recent years, ad-hoc parallel data processing has emerged as one of the killer applications for Infrastructure-as-a-Service (IaaS) clouds. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. However, the processing frameworks currently in use were designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. Consequently, the allocated compute resources may be inadequate for large parts of the submitted job and unnecessarily increase processing time and cost. In this paper we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution. Particular tasks of a processing job can be assigned to different types of virtual machines, which are automatically instantiated and terminated during job execution. Based on this new framework, we perform extended evaluations of MapReduce-inspired processing jobs on an IaaS cloud system and compare the results to the popular data processing framework Hadoop.
Index Terms: Many-Task Computing, High-Throughput Computing, Loosely Coupled Applications, Cloud Computing
Existing System:
A growing number of companies have to process huge amounts of data in a cost-efficient manner. Classic representatives of these companies are operators of Internet search engines. The vast amount of data they have to deal with every day has made traditional database solutions prohibitively expensive. Instead, these companies have popularized an architectural paradigm based on a large number of commodity servers.
Problems like processing crawled documents or regenerating a web index are split into several independent subtasks, distributed among the available nodes, and computed in parallel.
Disadvantage
The challenge for both frameworks consists of two abstract tasks: given a set of random integer numbers, the first task is to determine the k smallest of those numbers. The second task is then to calculate the average of these k smallest numbers. The job is a classic representative of a variety of data analysis jobs whose particular tasks vary in their complexity and hardware demands.
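The two tasks above can be sketched in plain Java. This is only an illustrative, single-machine version (class and method names are our own, not part of Nephele or Hadoop): a bounded max-heap keeps the k smallest numbers, and a second step aggregates them into their average.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class KSmallestAverage {

    // First task: keep the k smallest values seen so far in a max-heap
    // of size k; the largest candidate is evicted when the heap overflows.
    public static int[] kSmallest(int[] numbers, int k) {
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>((a, b) -> b - a);
        for (int n : numbers) {
            maxHeap.offer(n);
            if (maxHeap.size() > k) {
                maxHeap.poll(); // drop the largest of the k+1 candidates
            }
        }
        int[] result = maxHeap.stream().mapToInt(Integer::intValue).toArray();
        Arrays.sort(result);
        return result;
    }

    // Second task: aggregate the k smallest values into their average.
    public static double average(int[] values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.length;
    }

    public static void main(String[] args) {
        int[] data = {42, 7, 19, 3, 88, 15, 4};
        int[] smallest = kSmallest(data, 3);   // [3, 4, 7]
        System.out.println(Arrays.toString(smallest));
        System.out.println(average(smallest)); // 14 / 3
    }
}
```

The heap keeps memory usage bounded by k even for very large inputs, which mirrors the paper's point that the two tasks have very different hardware demands.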
Proposed System:
In recent years a variety of systems to facilitate MTC have been developed. Although these systems typically share common goals (e.g. to hide issues of parallelism or fault tolerance), they aim at different fields of application. MapReduce is designed to run data analysis jobs on a large amount of data, which is expected to be stored across a large set of shared-nothing commodity servers. Once a user has fitted his program into the required map and reduce pattern, the execution framework takes care of splitting the job into subtasks, distributing them, and executing them. A single MapReduce job always consists of a distinct map and reduce program.
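The map and reduce pattern mentioned above can be illustrated with a minimal, single-machine sketch in modern Java (names are our own; a real MapReduce system additionally splits the input, distributes subtasks across servers, and handles failures). The user supplies only the map and reduce functions; the framework groups the emitted pairs by key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

public class MiniMapReduce {

    // Apply the user's map function to every record, group the emitted
    // (key, value) pairs by key, and reduce each group to a single value.
    public static <R, K, V> Map<K, V> run(
            List<R> records,
            Function<R, Map.Entry<K, V>> map,
            BiFunction<V, V, V> reduce) {
        Map<K, V> grouped = new HashMap<>();
        for (R record : records) {
            Map.Entry<K, V> kv = map.apply(record);
            grouped.merge(kv.getKey(), kv.getValue(), reduce);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Classic example: count word occurrences.
        List<String> words = Arrays.asList("cloud", "data", "cloud");
        Map<String, Integer> counts = run(
                words,
                w -> Map.entry(w, 1), // map: emit (word, 1)
                Integer::sum);        // reduce: sum the counts per word
        System.out.println(counts);
    }
}
```

Because map emits independent pairs and reduce is applied per key, both phases can in principle be distributed across many servers, which is exactly what the execution framework automates.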
Advantage
- The first task has to sort the entire data set and can therefore take advantage of large amounts of main memory and parallel execution.
- The second aggregation task requires almost no main memory and, at least eventually, cannot be parallelized.
Hardware Requirements
- System: Pentium IV 2.4 GHz
- Hard Disk: 40 GB
- Floppy Drive: 1.44 MB
- Monitor: 15" VGA Color
- Mouse: Logitech
- RAM: 512 MB
Software Requirements
- Operating System: Windows XP, Linux
- Language: Java 1.4 or higher
- Technology: Swing, AWT
- Back End: Oracle 10g
- IDE: MyEclipse 8.6
Module Description
Modules:
- Network Module
- LBS Services
- System Model
- Scheduled Task
- Query Processing
Network Module:
A network channel lets two subtasks exchange data via a TCP connection. Network channels allow pipelined processing, so the records emitted by the producing subtask are immediately transported to the consuming subtask. As a result, two subtasks connected via a network channel may be executed on different instances. However, since they must be executed at the same time, they are required to run in the same Execution Stage.
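The pipelined behavior of a network channel can be sketched as follows. This is an illustrative example, not Nephele's actual code: a producer thread writes records over a TCP socket as soon as it emits them, while the consumer reads them as they arrive; both sides must run concurrently, matching the Execution Stage restriction above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NetworkChannelDemo {

    public static List<String> transfer(List<String> records) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // any free port
            // Producing subtask: writes one record per line, shipped
            // immediately thanks to the auto-flushing writer.
            Thread producer = new Thread(() -> {
                try (Socket s = new Socket("localhost", server.getLocalPort());
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    for (String r : records) {
                        out.println(r);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            producer.start();

            // Consuming subtask: processes records as soon as they arrive.
            List<String> received = new ArrayList<>();
            try (Socket s = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    received.add(line);
                }
            }
            producer.join();
            return received;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transfer(Arrays.asList("r1", "r2", "r3")));
    }
}
```

Because nothing is persisted, both endpoints of the channel must be alive at the same time; this is why such subtasks can run on different instances but never in different Execution Stages.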
LBS Service:
Many people are familiar with
wireless Internet, but many don't realize the value and potential to make
information services highly personalized. One of the best ways to personalize
information services is to enable them to be location based. An example would
be someone using their Wireless Application Protocol (WAP) based phone to
search for a restaurant. The LBS application would interact with other location
technology components to determine the user's location and provide a list of
restaurants within a certain proximity to the mobile user.
In this age of significant telecommunications competition, mobile network operators continuously seek new and innovative ways to create differentiation and increase profits. One of the best ways to accomplish this is through the delivery of highly personalized services, and one of the most powerful ways to personalize mobile services is based on location.
Scheduled Task:
A file channel allows two subtasks to exchange records via the local file system. The records of the producing task are first entirely written to an intermediate file and afterwards read by the consuming subtask. Nephele requires two such subtasks to be assigned to the same instance. Moreover, the consuming Group Vertex must be scheduled to run in a higher Execution Stage than the producing Group Vertex. In general, Nephele only allows subtasks to exchange records across different stages via file channels, because they are the only channel type that stores the intermediate records in a persistent manner.
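The write-then-read discipline of a file channel can be sketched as below (an illustrative example, not Nephele's actual code): the producer writes all records to an intermediate file before the consumer starts, which is exactly why the consumer may run in a later Execution Stage.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class FileChannelDemo {

    // Producing subtask: all records are persisted to an intermediate
    // file before the consumer ever runs.
    public static Path produce(List<String> records) throws IOException {
        Path intermediate = Files.createTempFile("channel", ".tmp");
        Files.write(intermediate, records);
        return intermediate;
    }

    // Consuming subtask: reads the complete intermediate file,
    // possibly long after the producer has finished.
    public static List<String> consume(Path intermediate) throws IOException {
        return Files.readAllLines(intermediate);
    }

    public static void main(String[] args) throws IOException {
        Path file = produce(Arrays.asList("rec1", "rec2"));
        System.out.println(consume(file)); // [rec1, rec2]
        Files.delete(file);
    }
}
```

Since the intermediate file survives the producer's termination, the instance running the producer can even be shut down between stages, which is what makes file channels suitable for stage boundaries.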
System Model:
Query Processing:
Similar to a network channel, an in-memory channel also enables pipelined query processing. However, instead of using a TCP connection, the respective subtasks exchange data using the instance's main memory. An in-memory channel typically represents the fastest way to transport records in Nephele; however, it also implies the most scheduling restrictions: the two connected subtasks must be scheduled to run on the same instance and in the same Execution Stage.
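An in-memory channel can be sketched with a shared blocking queue (an illustrative example, not Nephele's actual code): both subtasks run as threads in the same process and exchange records through main memory, with a sentinel record marking the end of the channel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class InMemoryChannelDemo {

    private static final String END_OF_CHANNEL = "<EOF>"; // sentinel record

    public static List<String> transfer(List<String> records) throws Exception {
        BlockingQueue<String> channel = new LinkedBlockingQueue<>();

        // Producing subtask: puts records directly into main memory.
        Thread producer = new Thread(() -> {
            try {
                for (String r : records) {
                    channel.put(r);
                }
                channel.put(END_OF_CHANNEL);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consuming subtask: takes records as soon as they are available.
        List<String> received = new ArrayList<>();
        String record;
        while (!(record = channel.take()).equals(END_OF_CHANNEL)) {
            received.add(record);
        }
        producer.join();
        return received;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transfer(Arrays.asList("a", "b"))); // [a, b]
    }
}
```

Sharing a queue in one address space is what makes this the fastest channel type, and also what forces both subtasks onto the same instance and into the same Execution Stage.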