Cloud Technologies for Bioinformatics Applications
Abstract:
Executing a large number of independent tasks, or tasks that perform minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience in applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. We also compare them with traditional MPI and Apache Hadoop MapReduce implementations in one example. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package to identify HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and an Azure cloud. All the applications start with a “doubly data parallel step” involving independent data chosen from two similar (EST, Alu) or two different databases (PhyloD). There are different structures for the final stages in each application.
Keywords: Cloud, bioinformatics, Multicore, Dryad,
Hadoop, MPI
Existing System:
There have been several papers discussing data analysis using a variety of cloud and more traditional cluster/Grid technologies, with the Chicago paper influential in posing the broad importance of this type of problem. The Notre Dame all-pairs system clearly identified the “doubly data parallel” structure seen in all of our applications. We discuss, in the Alu case, the linking of an initial doubly data parallel step to more traditional “singly data parallel” MPI applications. BLAST is a well-known doubly data parallel problem. The Swarm project successfully uses traditional distributed clustering scheduling to address the EST and CAP3 problem.
Note that approaches like Condor have significant startup times that dominate performance. For basic operations, we find that Hadoop and Dryad give similar performance on bioinformatics, particle physics, and the well-known kernels. Wilde has emphasized the value of scripting to control these (task parallel) problems, and here DryadLINQ offers some capabilities that we exploited. We note that most previous work has used Linux-based systems and technologies. Our work shows that Windows HPC Server based systems can also be very effective.
Disadvantages:
Experience has shown that the initial (and often most time-consuming) parts of data analysis are naturally data parallel, and the processing can be made independent with perhaps some collective (reduction) operation.
Proposed System:
The applications each start with a “doubly data-parallel” (all-pairs) phase that can be implemented in MapReduce, MPI, or using cloud resources on demand. The flexibility of clouds and MapReduce suggests they will become the preferred approaches. We showed how one can support an application (Alu) requiring a detailed output structure to allow follow-on iterative MPI computations. The applications differed in the heterogeneity of the initial data sets, but in each case good performance is observed, with the new cloud technologies competitive with MPI performance. The simple structure of the data/compute flow and the minimal inter-task communication requirements of these “pleasingly parallel” applications enabled them to be implemented using a wide variety of technologies. The support for handling large data sets, the concept of moving computation to the data, and the better quality of service provided by the cloud technologies simplify the implementation of some problems over traditional systems.
We find that the different programming constructs available in cloud technologies, such as independent “maps” in MapReduce, “homomorphic Apply” in Dryad, and “worker roles” in Azure, are all suitable for implementing applications of the type we examine. In the Alu case, we show that Dryad and Hadoop can be programmed to prepare data for use in later parallel MPI/threaded applications used for further analysis. Our Dryad and Azure work was all performed on Windows machines and achieved very large speedups.
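To make the “doubly data parallel” (all-pairs) structure concrete, the sketch below shows the pattern with a simple Java thread pool. The class name, the placeholder pairScore kernel, and the toy sequences are illustrative assumptions, not part of the Dryad, Hadoop, or Azure implementations described above; the point is that each (i, j) pair is an independent task, which is why the same structure maps directly onto MapReduce “maps”, Dryad “Apply” nodes, or Azure worker roles.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of the "doubly data parallel" (all-pairs) pattern:
// every (i, j) pair of sequences is an independent task.
public class AllPairsSketch {

    // Hypothetical pairwise kernel (e.g. an alignment score); placeholder only.
    static double pairScore(String a, String b) {
        return Math.abs(a.length() - b.length());
    }

    public static void main(String[] args) throws Exception {
        List<String> sequences = Arrays.asList("ACGT", "ACGGT", "TTGCA", "ACGTA");
        int n = sequences.size();
        double[][] scores = new double[n][n];

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> tasks = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                final int fi = i, fj = j;
                // Each (i, j) pair is independent: no inter-task communication.
                tasks.add(pool.submit(() -> {
                    double s = pairScore(sequences.get(fi), sequences.get(fj));
                    scores[fi][fj] = s;
                    scores[fj][fi] = s; // symmetric output block
                }));
            }
        }
        for (Future<?> t : tasks) t.get(); // simple barrier: wait for all pairs
        pool.shutdown();
        System.out.println("Computed " + n * (n - 1) / 2 + " independent pair scores");
    }
}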
Advantages:
· The naturally data-parallel structure of these applications is generating justified interest in new runtimes and programming models that, unlike traditional parallel models (such as MPI), directly address the data-specific issues.
· This structure has motivated the important MapReduce paradigm and many follow-on extensions.
SYSTEM REQUIREMENTS:
Hardware Requirements
· System : Pentium IV 2.4 GHz
· Hard Disk : 40 GB
· Floppy Drive : 1.44 MB
· Monitor : 15" VGA Color
· Mouse : Logitech
· RAM : 512 MB
Software Requirements
· Operating System : Windows XP, Linux
· Language : Java 1.4 or above
· Technology : Swing, AWT
· Back End : Oracle 10g
· IDE : MyEclipse 8.6
Module Description
Modules:
· Alu Sequence Classification
· CAP3 Application EST and Its Software CAP
Alu Sequence Classification:
The Alu clustering problem is one of the most challenging problems for sequence clustering because Alus represent the largest repeat families in the human genome. There are about 1 million copies of Alu sequences in the human genome, of which most insertions can be found in other primates and only a small fraction (~7,000) is human-specific. This indicates that the classification of Alu repeats can be deduced solely from the 1 million human Alu elements. Alu clustering can be viewed as a classical case study for the capacity of computational infrastructures because it is not only of great intrinsic biological interest, but also a problem of a scale that will remain the upper limit of many other clustering problems in bioinformatics for the next few years, such as automated protein family classification for millions of proteins. In our previous work, we have examined Alu samples of 35,339 and 50,000 sequences using the pipeline of Fig. 1.
CAP3 Application EST and Its Software CAP:
An Expressed Sequence Tag (EST) corresponds to messenger RNAs (mRNAs) transcribed from the genes residing on chromosomes. Each individual EST sequence represents a fragment of mRNA, and EST assembly aims to reconstruct full-length mRNA sequences for each expressed gene. Because ESTs correspond to the gene regions of a genome, EST sequencing has become a standard practice for gene discovery, especially for the genomes of many organisms that may be too complex for whole-genome sequencing. EST assembly is addressed by the software CAP3, a DNA sequence assembly program developed by Huang and Madan. CAP3 performs several major assembly steps, including computation of overlaps, construction of contigs, construction of multiple sequence alignments, and generation of consensus sequences, for a given set of gene sequences. The program reads a collection of gene sequences from an input file (FASTA file format) and writes its output to several output files, as well as to the standard output.
CAP3 is often required to process large numbers of FASTA-formatted input files, which can be processed independently, making it an embarrassingly parallel application requiring no interprocess communication. We have implemented a parallel version of CAP3 using Hadoop and DryadLINQ. In both these implementations, we adopt the following algorithm to parallelize CAP3.
1. Distribute the input files to the storage used by the runtime. In Hadoop, the files are distributed to HDFS, and in DryadLINQ, the files are distributed to the individual shared directories of the compute nodes.
2. Instruct each parallel process (in Hadoop, these are the map tasks, and in DryadLINQ, these are the vertices) to execute the CAP3 program on each input file.
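The following is a minimal sketch of step 2 as a Hadoop map task. It assumes each input record's value is the path of a FASTA file already staged on the node (step 1) and that the cap3 executable is available on every compute node; the class name Cap3Mapper and these details are illustrative assumptions, not the exact code of the implementation described above.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only sketch of step 2: each map task runs the CAP3 executable on one
// FASTA file. The input value is assumed to be a path to a locally staged file.
public class Cap3Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String fastaPath = value.toString().trim();
        if (fastaPath.isEmpty()) {
            return; // skip blank records
        }

        // Launch CAP3 as an external process; its output files (.contigs, .ace,
        // consensus sequences) are written next to the input file as usual.
        Process p = new ProcessBuilder("cap3", fastaPath)
                .inheritIO()
                .start();
        int exitCode = p.waitFor();

        // Emit the file name and exit status so failed assemblies are easy to spot.
        context.write(new Text(fastaPath), new IntWritable(exitCode));
    }
}

The DryadLINQ version follows the same pattern, with a vertex (via a homomorphic Apply) invoking the CAP3 executable on the files in its node's shared directory.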
System Architecture:
Algorithm:
· Smith-Waterman-Gotoh algorithm
· MPI algorithm
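For illustration, below is a minimal Java sketch of the Smith-Waterman-Gotoh local alignment score (affine gap penalties), the pairwise kernel used for the Alu alignment. The match/mismatch and gap parameters are placeholder assumptions, not the values used in the Alu pipeline, and the sketch computes only the score, not the traceback.

// Minimal sketch of the Smith-Waterman-Gotoh local alignment score
// with affine gap penalties (gap open plus per-residue gap extension).
public class SmithWatermanGotoh {

    static final double MATCH = 2.0, MISMATCH = -1.0;      // placeholder scores
    static final double GAP_OPEN = 4.0, GAP_EXTEND = 1.0;  // penalties (subtracted)

    static double score(String a, String b) {
        int n = a.length(), m = b.length();
        double[][] h = new double[n + 1][m + 1]; // best local score ending at (i, j)
        double[][] e = new double[n + 1][m + 1]; // best score ending with a gap in a
        double[][] f = new double[n + 1][m + 1]; // best score ending with a gap in b
        double best = 0.0;

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                e[i][j] = Math.max(h[i][j - 1] - GAP_OPEN, e[i][j - 1] - GAP_EXTEND);
                f[i][j] = Math.max(h[i - 1][j] - GAP_OPEN, f[i - 1][j] - GAP_EXTEND);
                double sub = (a.charAt(i - 1) == b.charAt(j - 1)) ? MATCH : MISMATCH;
                h[i][j] = Math.max(0.0,
                        Math.max(h[i - 1][j - 1] + sub, Math.max(e[i][j], f[i][j])));
                best = Math.max(best, h[i][j]); // local alignment: take the overall maximum
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(score("ACACACTA", "AGCACACA")); // small demo pair
    }
}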