Short Read Alignment and Assembly
The growing availability of
high-throughput sequencing technologies has "democratized" genome
sequencing by providing individual labs with a sequencing capacity
similar to what was previously only available at large genome centers.
The latest genome
sequencers can produce as much as 200 billion base pairs (Gbp) in
a single one-week run, enough to cover the human genome 60 times over.
Scientists are using these new machines to sequence a wealth of genomes,
ranging from bacteria to plants to animals. The new RNA-seq technology
uses sequencing to capture all the genes being expressed in a cell,
allowing us to detect thousands of previously unknown genes and variants
of known genes in a single experiment.
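The coverage figure quoted above follows from simple arithmetic: total bases sequenced divided by genome length. The sketch below assumes a haploid human genome of roughly 3.1 Gbp; that size is our assumption for illustration and is not stated on this page.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average number of times each genome position is covered."""
    return total_bases / genome_size


run_yield_bp = 200e9      # 200 Gbp per run, as quoted above
human_genome_bp = 3.1e9   # ASSUMPTION: approximate haploid human genome size

coverage = fold_coverage(run_yield_bp, human_genome_bp)
print(f"{coverage:.1f}x average coverage")
```

With these numbers the result is a little over 60-fold, consistent with the rounded figure in the text.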
These new technologies have several characteristics that complicate the
analysis of the resulting data using software tools originally
developed for the "traditional" Sanger sequencing technology. In
particular, the sequence reads are shorter than those produced
through Sanger sequencing, and they have different error characteristics.
Furthermore, the new sequencing
machines generate enormous amounts of data, up to several terabytes per
run, requiring the development of highly efficient software for
analyzing the resulting sequences.
Researchers at the CBCB are
actively involved in the development of new software tools and
algorithms for the analysis of the data generated by the new
technologies. A number of our software tools, particularly those
for short read alignment, have moved into widespread use in the
research community. We continue to refine and improve these
programs, which are described on this page and elsewhere on the
All our software is freely released, without restrictions, under the
Fast alignment of short reads
Bowtie - an ultrafast,
memory-efficient short read aligner that aligns short DNA sequences
(reads) to the human genome at a rate of 25-35 million reads per
hour on a typical workstation with 2 GB of memory. Bowtie indexes the
genome with a Burrows-Wheeler index to keep its memory footprint small,
as low as 1.1 GB for the human genome. Bowtie has been downloaded well
over 20,000 times and its user base is growing rapidly. Read the Bowtie paper
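To give a feel for how a Burrows-Wheeler index supports exact matching, here is a toy sketch of the general technique (backward search over a Burrows-Wheeler transform). It is not Bowtie's implementation: the function names are hypothetical, the suffix array is built naively and stored in full, and mismatches are not handled.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy index: (suffix array, C table, occurrence table)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    # Naive suffix sorting; fine for a toy genome, far too slow for a real one.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    counts = Counter(text)
    # c_table[ch] = number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, c_table, occ


def backward_search(read, sa, c_table, occ):
    """Return sorted start positions of exact occurrences of read."""
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for ch in reversed(read):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


genome = "ACGTACGTGACG"
index = build_fm_index(genome)
print(backward_search("ACG", *index))
```

The read is consumed from its last character to its first, and each step narrows a range of rows in the sorted suffix list. Practical indexes keep memory low by compressing or sampling these tables rather than storing them in full, as this sketch does.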
TopHat - a fast splice
junction mapper for RNA-Seq reads. It aligns RNA-Seq
reads to mammalian-sized genomes using the ultra high-throughput short
read aligner Bowtie,
and then analyzes the mapping results to identify splice junctions
between exons. TopHat was the first spliced alignment program for short
reads that operated independently of annotation: because it does not use
a list of gene coordinates, it is able to find any number of previously
unknown and un-annotated splice sites. Read
the TopHat paper here.
Cufflinks is the first
comprehensive package for transcript assembly and quantification with
RNA-Seq. In our initial study using developing muscle cells, we identified
thousands of unannotated mRNAs and tracked differential splicing and
promoter use in hundreds of genes. Cufflinks uses Bowtie and TopHat
for alignment and spliced alignment, and then assembles these alignments
into full-length transcripts, quantitates the transcripts, and computes
differential expression for alternative isoforms.
Read the Cufflinks paper here.
MUMmerGPU - Extension of our genome aligner MUMmer allowing it to take advantage
of the specialized hardware available in many graphics cards to
dramatically speed up the alignment process. MUMmerGPU is
specifically targeted at the large volumes of data generated by new
generation sequencing machines. Read the MUMmerGPU
MUMmer for short reads
- With the use of a simple script available
here, the MUMmer tool itself can be used to
map short reads to reference genomes.
De novo assembly with short reads
Minimus - Lightweight assembler originally developed for the assembly of small
sets of reads. In conjunction with an efficient overlapper,
Minimus can be applied to large short-read datasets. Minimus
avoids mis-assembling repeats (a common challenge when analyzing
short-read data) by using a highly conservative assembly algorithm.
Contrail, a new short-read
assembler that uses de Bruijn graphs. It is still in development but should
be widely available soon; check the website.
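As a rough illustration of the de Bruijn graph idea in general (not of Contrail itself), the sketch below breaks reads into k-mers, links each (k-1)-mer prefix to its (k-1)-mer suffix, and then walks unambiguous edges to spell out a contig. The function names and the toy reads are hypothetical, and real assemblers must also deal with sequencing errors, reverse complements, and branching caused by repeats.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend from start while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ACGTT", "GTTCA", "TCAGG"]  # toy overlapping reads
graph = build_de_bruijn(reads, k=4)
print(walk_unambiguous(graph, "ACG"))
```

The walk stops at the first node with zero or several successors, which is where a repeat or the end of the sequence would force a real assembler to make a decision.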
Comparative assembly with short reads
Comparative assembly refers to the assembly of a genome using the sequence of a
close relative as a reference, and is frequently referred to as
"templated assembly" or "resequencing". Our software, AMOScmp,
was originally developed in the context of Sanger data; however, with
small modifications it is directly applicable to short read sequencing.
ABBA (Assembly Boosted By Amino Acid) - A short-read assembler that
uses a reference amino acid sequence to guide the assembly of a gene.
Amino acid sequence is more conserved than DNA, which allows a more
distant relative to be used as the reference.
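The point about conservation can be seen in a small example: two DNA sequences that differ only at synonymous codon positions translate to the same protein. The sketch below uses the standard genetic code and two made-up sequences. It illustrates the principle only and is not part of ABBA.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical gene fragments from a target and a more distant relative.
target = "ATGGCTAAAGGT"
relative = "ATGGCCAAGGGA"

print("DNA identity:    ", identity(target, relative))
print("Protein identity:", identity(translate(target), translate(relative)))
```

Here a quarter of the DNA positions differ, yet the encoded amino acid sequence is unchanged, so a protein reference can still guide assembly when a DNA reference from the same relative would align poorly.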
- The main page describing the AMOScmp package, also containing
instructions on how to obtain and install the software.
- Page that describes the pipeline designed for comparative assembling
of short reads.
- Page that describes the pipeline designed for alignment based
trimming and assembling of short reads.
tutorial on using AMOScmp with short read data
- [The Bowtie paper] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg.
Ultrafast and memory-efficient alignment of short DNA sequences to the human
genome. Genome Biol 10:R25.
- [The TopHat paper]
Cole Trapnell, Lior Pachter, and Steven L. Salzberg.
TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009
- [The Cufflinks paper] Cole Trapnell, Brian A Williams, Geo Pertea,
Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg,
Barbara J Wold, and Lior Pachter (2010).
Transcript assembly and quantification by RNA-Seq reveals unannotated
transcripts and isoform switching during cell differentiation.
Nature Biotechnology 28, 511–515 (2010). doi:10.1038/nbt.1621
- Mihai Pop and Steven L. Salzberg. Bioinformatics
challenges of new sequencing technology. Trends in Genetics.
- Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. High-throughput
sequence alignment using Graphics Processing Units. BMC
Bioinformatics 8:474, 2007.
- Mihai Pop, Adam M. Phillippy, Arthur L. Delcher and Steven L. Salzberg.
Comparative genome assembly. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.