Genome Assembly with Short Reads
The recent availability of
high-throughput sequencing technologies has "democratized" genome
sequencing by providing individual labs with a sequencing capacity
similar to what was previously only available at large genome centers.
Several companies have announced the availability of genome
sequencers capable of sequencing up to about 2Gbp of DNA in a single
run for costs as low as $1000. Machines produced by 454 Life Sciences/Roche and Solexa/Illumina are already in active use in many labs, and competing technologies from Applied Biosystems and Helicos have recently become available.
These
new technologies have several characteristics that complicate the
analysis of the resulting data using software tools originally
developed for the "traditional" Sanger sequencing technology. In
particular, the sequence reads are much shorter than those produced
through Sanger sequencing. The 454 technology currently generates reads
of approximately 250bp (compared to about 1000bp commonly achieved
through the Sanger method), while the other technologies generate reads
of just 30-40bp in length. Furthermore, the new sequencing
machines generate large amounts of data, up to several terabytes per
run, requiring the development of highly efficient software for
analyzing the resulting sequences.
Researchers at the CBCB are
actively involved in the development of new software tools and
algorithms for the analysis of the data generated by the new
technologies. Several software tools are already available for
this purpose, and we have been applying them to
several sequencing projects. The relevant software, along with
instructions on how to use it, is described below.
All our software is freely released, without restrictions, under the open-source Artistic License.
Comparative assembly with short reads
Comparative assembly refers to the assembly of a genome using the sequence of a
close relative as a reference, and is frequently referred to as
"templated assembly" or "resequencing". Our software, AMOScmp,
was originally developed in the context of Sanger data; however, with
small modifications it is directly applicable to short-read sequencing
data.
* ABBA (Assembly Boosted By Amino Acid) - A short-read assembler that uses a reference amino acid sequence to guide the assembly of a gene. Amino acid sequence is more conserved than DNA, which allows a more distant relative to be used as the reference.
* AMOScmp - The main page describing the AMOScmp package, also
containing instructions on how to obtain and install the software.
* AMOScmp-shortReads - Page that describes the pipeline designed for
comparative assembly of short reads.
* AMOScmp-shortReads-alignmentTrimmed - Page
that describes the pipeline designed for alignment-based trimming and assembly of short reads.
Short tutorial on using AMOScmp with short read data
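The idea behind comparative assembly can be illustrated with a minimal Python sketch. This is not the AMOScmp algorithm or its interface: the function names, the ungapped brute-force placement, and the `max_mismatches` threshold are all assumptions made for illustration. Each read is placed at the reference position where it has the fewest mismatches, and a consensus is then called by majority vote in each column.

```python
from collections import Counter


def best_placement(reference, read, max_mismatches=2):
    """Return (offset, mismatches) of the best ungapped placement, or None.

    Hypothetical helper: tries every offset on the reference and keeps the
    first one with the fewest mismatches, if within max_mismatches.
    """
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best


def comparative_consensus(reference, reads, max_mismatches=2):
    """Layout reads along the reference and call a majority-vote consensus.

    Positions not covered by any placed read are reported as 'N'.
    """
    columns = [Counter() for _ in reference]
    for read in reads:
        placement = best_placement(reference, read, max_mismatches)
        if placement is None:
            continue  # read could not be placed on the reference
        offset, _ = placement
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)
```

A production pipeline would use an indexed aligner rather than a scan of every offset, and would handle insertions and deletions, which this sketch ignores.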
Fast alignment of short reads
MUMmerGPU
- Extension of our genome aligner MUMmer allowing it to take advantage
of the specialized hardware available in many graphics cards to
dramatically speed up the alignment process. MUMmerGPU is
specifically targeted at the large volumes of data generated by
new-generation sequencing machines.
Short read mapper
- Efficient read mapper for short-read data based on
bit-wise operations on compressed reads. Initial tests indicate
our tool is several times faster than other aligners. (...coming
soon)
MUMmer for short reads
- With the use of a simple script available here, the MUMmer tool itself can be used to
map short reads to reference genomes.
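As an illustration of what "bit-wise operations on compressed reads" can look like, the following Python sketch packs each base into two bits and counts mismatches with an XOR. It is a sketch under stated assumptions, not the CBCB read mapper: the encoding table, the function names, and the sliding-window scan are all hypothetical, and only the characters A, C, G and T are supported.

```python
# Assumed 2-bit encoding; any fixed assignment of the four bases would work.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(packed_a, packed_b, length):
    """Count differing bases between two packed sequences of equal length."""
    diff = packed_a ^ packed_b
    low_mask = int("01" * length, 2)  # selects the low bit of every base
    # A base differs if either of its two bits differs, so fold the high
    # bit of each base onto its low bit before counting.
    per_base = (diff | (diff >> 1)) & low_mask
    return bin(per_base).count("1")


def map_read(reference, read, max_mismatches=1):
    """Return reference offsets where the read aligns within max_mismatches."""
    n = len(read)
    window_mask = (1 << (2 * n)) - 1
    packed_read = pack(read)
    hits = []
    window = 0
    for i, base in enumerate(reference):
        # Slide the packed window one base to the right.
        window = ((window << 2) | CODE[base]) & window_mask
        if i >= n - 1 and mismatches(window, packed_read, n) <= max_mismatches:
            hits.append(i - n + 1)
    return hits
```

Python integers stand in here for machine words. In a compiled implementation, a 32-base read fits in a single 64-bit word, so each comparison costs a handful of instructions.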
De novo assembly with short reads
Minimus
- Lightweight assembler originally developed for the assembly of small
sets of reads. In conjunction with an efficient overlapper,
Minimus can be applied to large short-read datasets. Minimus
avoids mis-assembling repeats (a common challenge when analyzing
short-read data) by using a highly conservative assembly algorithm.
Short read overlapper
- Efficient read overlapper for short-read data based
on bit-wise operations. Currently the only general-purpose
overlapper available for this task (...coming soon).
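The overlap step that feeds a de novo assembler can be sketched as follows. This is a minimal Python illustration, not the overlapper described above: it finds only exact suffix-prefix overlaps, uses plain strings rather than bit-wise operations, and the function names and the `min_length` parameter are assumptions.

```python
def longest_overlap(a, b, min_length):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_overlaps(reads, min_length):
    """Map (i, j) read-index pairs to their exact suffix-prefix overlap length."""
    # Index each read by its first min_length bases, so that a read is only
    # compared against reads whose prefix occurs somewhere inside it.
    by_prefix = {}
    for j, read in enumerate(reads):
        if len(read) >= min_length:
            by_prefix.setdefault(read[:min_length], []).append(j)

    overlaps = {}
    for i, read in enumerate(reads):
        candidates = set()
        for start in range(len(read) - min_length + 1):
            candidates.update(by_prefix.get(read[start:start + min_length], []))
        for j in candidates:
            if i == j:
                continue  # ignore self-overlaps
            length = longest_overlap(read, reads[j], min_length)
            if length:
                overlaps[(i, j)] = length
    return overlaps
```

The prefix index avoids comparing every pair of reads, which matters when the input contains millions of short reads. A real overlapper would also tolerate sequencing errors and consider reverse-complement overlaps.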
Selected Publications
- Mihai Pop and Steven L. Salzberg. Bioinformatics challenges of new sequencing technology. Trends in Genetics. 24(3):142-149. 2008
- Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics 8:474, 2007.
- Mihai Pop, Adam M. Phillippy, Arthur L. Delcher and Steven L. Salzberg. Comparative genome assembly. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.