Short Read Alignment and Assembly

The growing availability of high-throughput sequencing technologies has "democratized" genome sequencing by providing individual labs with a sequencing capacity similar to what was previously only available at large genome centers. The latest genome sequencers can produce as much as 200 billion base pairs (Gbp) in a single one-week run, enough to cover the human genome 60 times over. Scientists are using these new machines to sequence a wealth of genomes, ranging from bacteria to plants to animals. The new RNA-seq technology uses sequencing to capture all the genes being expressed in a cell, allowing us to detect thousands of previously unknown genes and variants of known genes in a single experiment.
These new technologies have several characteristics that complicate the analysis of the resulting data using software tools originally developed for the "traditional" Sanger sequencing technology. In particular, the sequence reads are shorter than those produced through Sanger sequencing, and they have different error characteristics. Furthermore, the new sequencing machines generate enormous amounts of data, up to several terabytes per run, requiring the development of highly efficient software for analyzing the resulting sequences.

Researchers at the CBCB are actively involved in the development of new software tools and algorithms for the analysis of the data generated by the new technologies. A number of our software tools, particularly those for short read alignment, have moved into widespread use in the research community. We continue to refine and improve these programs, which are described on this page and elsewhere on the CBCB site.

All our software is freely released, without restrictions, under the open-source Artistic License.

Fast alignment of short reads



Bowtie - an ultrafast, memory-efficient short read aligner that aligns short DNA sequences (reads) to the human genome at a rate of 25-35 million reads per hour on a typical workstation with 2 GB of memory. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small, as low as 1.1 GB for the human genome. Bowtie has been downloaded well over 20,000 times and its user base is growing rapidly.  Read the Bowtie paper here.
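To illustrate why a Burrows-Wheeler index suits short read alignment, here is a minimal Python sketch of exact matching by backward search. It is not Bowtie's implementation: the names bwt_index and find are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, whereas a production aligner would typically compress or sample these structures and would also tolerate mismatches.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy Burrows-Wheeler index of `text` (hypothetical helper)."""
    text += "$"  # terminator that sorts before A, C, G, T
    # Suffix array by naive sorting: fine for a demo, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character that precedes each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # first[c]: number of characters in the text that sort before c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def find(read, index):
    """Return the sorted start positions of exact matches of `read`."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for c in reversed(read):  # backward search: extend the match leftwards
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = bwt_index("ACGTACGTTACG")
print(find("ACG", index))  # [0, 4, 9]
```

Each character of the read costs two table lookups regardless of genome size, which is the property that makes this family of indexes attractive for aligning millions of reads.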

TopHat - a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. TopHat was the first spliced alignment program for short reads that operated independently of annotation: because it does not use a list of gene coordinates, it is able to find any number of previously unknown and un-annotated splice sites.  Read the TopHat paper here.

Cufflinks - The first comprehensive package for transcript assembly and quantification with RNA-Seq. In our initial study using developing muscle cells, we identified thousands of unannotated mRNAs and tracked differential splicing and promoter use in hundreds of genes. Cufflinks uses Bowtie and TopHat for alignment and spliced alignment, and then assembles these alignments into full-length transcripts, quantifies the transcripts, and computes differential expression for alternative isoforms. Read the Cufflinks paper here.

MUMmerGPU - An extension of our genome aligner MUMmer that takes advantage of the specialized hardware available in many graphics cards to dramatically speed up the alignment process. MUMmerGPU is specifically targeted at the large volumes of data generated by new-generation sequencing machines. Read the MUMmerGPU paper here.


MUMmer for short reads - With the use of a simple script available here, the MUMmer tool itself can be used to map short reads to reference genomes.

De novo assembly with short reads


Minimus - Lightweight assembler originally developed for the assembly of small sets of reads.  In conjunction with an efficient overlapper, Minimus can be applied to large short-read datasets.  Minimus avoids mis-assembling repeats (a common challenge when analyzing short-read data) by using a highly conservative assembly algorithm.

Contrail - A new short-read assembler that uses de Bruijn graphs. It is still in development and should be widely available soon; check the website for updates.
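As a rough illustration of the de Bruijn graph idea in general (not of Contrail's own code), the Python sketch below builds a graph whose nodes are (k-1)-mers and whose edges are the k-mers seen in the reads, then walks an unambiguous path to recover a contig. The function names, the toy reads, and the choice k = 4 are all assumptions made for the example; error correction, reverse complements, and in-degree checks are deliberately left out.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads drawn from the sequence ACGTCAGTT.
reads = ["ACGTCA", "GTCAGT", "CAGTT"]
graph = de_bruijn(reads, k=4)
print(walk_unambiguous(graph, "ACG"))  # ACGTCAGTT
```

Because the graph is built from k-mers rather than from pairwise read overlaps, its size depends on the number of distinct k-mers and not on the number of reads, which is one reason the approach is popular for large short-read datasets.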

Comparative assembly with short reads


Comparative assembly refers to the assembly of a genome using the sequence of a close relative as a reference, and is frequently referred to as "templated assembly" or "resequencing". Our software, AMOScmp, was originally developed in the context of Sanger data; with small modifications, however, it is directly applicable to short read sequencing data.
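The core idea of comparative assembly can be sketched in a few lines of Python: once reads have been placed on the reference, a consensus is called column by column. This is a simplified illustration and not the AMOScmp algorithm; the consensus function, the (start, read) placement format, the gap-free alignments, and the use of N for uncovered positions are all assumptions made for the example.

```python
from collections import Counter


def consensus(reference, placements, min_depth=1):
    """Call a majority-vote consensus from reads placed on `reference`.

    `placements` is a list of (start, read) pairs: 0-based, gap-free.
    Positions covered by fewer than `min_depth` reads are reported as 'N'.
    """
    columns = [Counter() for _ in reference]
    for start, read in placements:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1
    out = []
    for col in columns:
        if sum(col.values()) < min_depth:
            out.append("N")
        else:
            out.append(col.most_common(1)[0][0])
    return "".join(out)


reference = "ACGTACGTAC"
placements = [(0, "ACGAAC"), (2, "GAACGT"), (3, "AACGTA")]
print(consensus(reference, placements))  # ACGAACGTAN
```

In this toy case all three reads agree on an A at position 3 where the reference has a T, so the consensus reports the sequenced genome's base rather than the reference's.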

* ABBA (Assembly Boosted By Amino Acid) - A short-read assembler that uses a reference amino acid sequence to guide the assembly of a gene. Amino acid sequence is more conserved than DNA, which allows a more distant relative to be used as the reference.

* AMOScmp - The main page describing the AMOScmp package, also containing instructions on how to obtain and install the software.
* AMOScmp-shortReads - Page that describes the pipeline designed for comparative assembly of short reads.
* AMOScmp-shortReads-alignmentTrimmed - Page that describes the pipeline designed for alignment-based trimming and assembly of short reads.
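The ABBA entry above rests on the observation that amino acid sequence is more conserved than DNA. The short Python sketch below illustrates that point with two made-up gene fragments that differ at several synonymous codon positions yet translate to the same peptide. The sequences and the helper names translate and identity are hypothetical; only the standard genetic code table is real.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate `dna` in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical gene fragments from two relatives (synonymous changes only).
gene_a = "ATGGCTCTGAAAGGT"
gene_b = "ATGGCCTTAAAGGGA"

print(translate(gene_a), translate(gene_b))               # MALKG MALKG
print(round(identity(gene_a, gene_b), 2))                 # 0.67
print(identity(translate(gene_a), translate(gene_b)))     # 1.0
```

Here a third of the nucleotides differ while the protein is unchanged, which is why a protein reference can remain usable at evolutionary distances where a DNA reference no longer aligns well.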

Short tutorial on using AMOScmp with short read data


Selected Publications