Genome Assembly with Short Reads

The recent availability of high-throughput sequencing technologies has "democratized" genome sequencing by giving individual labs a sequencing capacity that was previously available only at large genome centers. Several companies have announced genome sequencers capable of sequencing up to about 2Gbp of DNA in a single run for costs as low as $1000. Machines produced by 454 Life Sciences/Roche and Solexa/Illumina are already in active use in many labs, and competing technologies from Applied Biosystems and Helicos have recently become available.

These new technologies have several characteristics that complicate the analysis of the resulting data with software tools originally developed for "traditional" Sanger sequencing. In particular, the sequence reads are much shorter than those produced through Sanger sequencing. The 454 technology currently generates reads of approximately 250bp (compared to about 1000bp commonly achieved through the Sanger method), while the other technologies generate reads of just 30-40bp in length. Furthermore, the new sequencing machines generate large amounts of data, up to several terabytes per run, which requires highly efficient software for analyzing the resulting sequences.

Researchers at the CBCB are actively involved in developing new software tools and algorithms for analyzing the data generated by the new technologies. Several software tools are already available for this purpose, and we have been applying them to several sequencing projects. The relevant software, along with instructions on how to use it, is described below.

All our software is freely released, without restrictions, under the open-source Artistic License.

Comparative assembly with short reads


Comparative assembly refers to the assembly of a genome using the sequence of a close relative as a reference, and is frequently referred to as "templated assembly" or "resequencing". Our software, AMOScmp, was originally developed in the context of Sanger data; with small modifications, however, it is directly applicable to short-read sequencing data.

* ABBA (Assembly Boosted By Amino Acid) - A short-read assembler that uses a reference amino acid sequence to guide the assembly of a gene. Amino acid sequence is more conserved than DNA, which allows a more distant relative to be used as the reference.

* AMOScmp - The main page describing the AMOScmp package, also containing instructions on how to obtain and install the software.
* AMOScmp-shortReads - Page that describes the pipeline designed for comparative assembly of short reads.
* AMOScmp-shortReads-alignmentTrimmed - Page that describes the pipeline designed for alignment-based trimming and assembly of short reads.
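
To make the idea of comparative assembly concrete, the following is a minimal sketch in Python. It is not the AMOScmp algorithm: it assumes ungapped alignments, tries every position on the reference, and calls the consensus by simple majority vote. The function name `comparative_assembly` and the `max_mismatches` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter


def comparative_assembly(reference, reads, max_mismatches=1):
    """Toy comparative assembly (illustrative only, not AMOScmp).

    Each read is placed at its best ungapped position on the reference,
    and the consensus is the majority base in each reference column.
    """
    columns = [Counter() for _ in reference]
    for read in reads:
        best_pos, best_mm = None, max_mismatches + 1
        for pos in range(len(reference) - len(read) + 1):
            window = reference[pos:pos + len(read)]
            mm = sum(1 for a, b in zip(read, window) if a != b)
            if mm < best_mm:  # keep the first position with the fewest mismatches
                best_pos, best_mm = pos, mm
        if best_pos is None:
            continue  # no placement within the mismatch budget; read is skipped
        for offset, base in enumerate(read):
            columns[best_pos + offset][base] += 1
    # Columns with no read coverage are reported as 'N' instead of being
    # copied from the reference.
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAC"
    # Reads sampled from a relative that differs from the reference at one base.
    reads = ["ACGTTGCT", "TGCTAGGC", "AGGCTTAC"]
    print(comparative_assembly(reference, reads))
```

The reference only decides where each read is placed; the bases of the result come from the reads themselves, which is why differences between the sequenced genome and its relative survive into the consensus.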

Short tutorial on using AMOScmp with short read data

Fast alignment of short reads



MUMmerGPU - Extension of our genome aligner MUMmer that takes advantage of the specialized hardware available in many graphics cards to dramatically speed up the alignment process. MUMmerGPU is specifically targeted at the large volumes of data generated by new-generation sequencing machines.

Short read mapper - Efficient read mapper for short read data based on bit-wise operations on compressed reads. Initial tests indicate our tool is several times faster than other aligners. (...coming soon)

MUMmer for short reads - With the use of a simple script available here, the MUMmer tool itself can be used to map short reads to reference genomes.
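
As an illustration of what "bit-wise operations on compressed reads" can look like, here is a minimal Python sketch. It is an assumption about the general technique, not a description of the mapper mentioned above: bases are packed two bits each into an integer, and mismatches between a read and a reference window are counted with XOR and a mask. The names `pack`, `mismatches`, and `map_read` are hypothetical.

```python
# Two-bit code per base. Reads containing other symbols (such as 'N')
# are not handled in this sketch and raise a KeyError.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(x, y, length):
    """Count differing bases between two packed sequences of `length` bases."""
    diff = x ^ y
    low_mask = ((1 << (2 * length)) - 1) // 3  # bit pattern 0101...01
    # A base differs if either of its two bits differs, so fold the high
    # bit of each pair onto the low bit and count the set low bits.
    per_base = (diff | (diff >> 1)) & low_mask
    return bin(per_base).count("1")


def map_read(reference, read, max_mismatches=1):
    """Return (position, mismatches) for every ungapped hit of `read`."""
    k = len(read)
    if k == 0 or len(reference) < k:
        return []
    mask = (1 << (2 * k)) - 1
    packed_read = pack(read)
    window = pack(reference[:k])
    hits = []
    for pos in range(len(reference) - k + 1):
        if pos > 0:
            # Slide the window by one base: shift in the new base, drop the oldest.
            window = ((window << 2) | CODE[reference[pos + k - 1]]) & mask
        mm = mismatches(window, packed_read, k)
        if mm <= max_mismatches:
            hits.append((pos, mm))
    return hits


if __name__ == "__main__":
    print(map_read("ACGTTGCAAGGCTTAC", "TGCT", max_mismatches=1))
```

Packing the read once and sliding the reference window one base at a time means each candidate position costs a handful of integer operations rather than a character-by-character comparison.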

De novo assembly with short reads


Minimus - Lightweight assembler originally developed for the assembly of small sets of reads.  In conjunction with an efficient overlapper, Minimus can be applied to large short-read datasets.  Minimus avoids mis-assembling repeats (a common challenge when analyzing short-read data) by using a highly conservative assembly algorithm.

Short read overlapper - Efficient read overlapper for short read data based on bit-wise operations. Currently the only general-purpose overlapper available for this task (...coming soon).
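
The sketch below illustrates, in Python, how an overlapper and a conservative assembly step can fit together. It is a simplification written for this page, not the Minimus algorithm or the overlapper described above: overlaps are found by plain string comparison instead of bit-wise operations, transitive overlaps are not removed, and reads are joined only when the overlap is unambiguous in both directions. The names `overlap` and `conservative_contigs` and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), max(min_len, 1) - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def conservative_contigs(reads, min_len=3):
    """Join reads only across unambiguous overlaps (illustrative sketch)."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    succ = {i: [] for i in range(len(reads))}
    pred = {i: [] for i in range(len(reads))}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    succ[i].append((j, n))
                    pred[j].append(i)

    def unambiguous_next(i):
        # The edge i -> j is used only if it is the sole edge out of i
        # and the sole edge into j.
        if len(succ[i]) == 1 and len(pred[succ[i][0][0]]) == 1:
            return succ[i][0]
        return None

    def has_unambiguous_pred(i):
        return len(pred[i]) == 1 and len(succ[pred[i][0]]) == 1

    used = set()

    def extend(start):
        contig, cur = reads[start], start
        used.add(start)
        while True:
            step = unambiguous_next(cur)
            if step is None or step[0] in used:
                break  # stop at a branch (possible repeat) or a dead end
            cur, n = step
            contig += reads[cur][n:]
            used.add(cur)
        return contig

    contigs = [extend(i) for i in range(len(reads))
               if i not in used and not has_unambiguous_pred(i)]
    # Any reads still unused lie on a cycle of unambiguous overlaps.
    contigs += [extend(i) for i in range(len(reads)) if i not in used]
    return contigs


if __name__ == "__main__":
    # "AACCGG" overlaps both "CGGTTA" and "CGGAAT", so it is left on its own.
    print(conservative_contigs(["AACCGG", "CGGTTA", "TTAGAC", "CGGAAT"]))
```

Stopping at every branch yields shorter contigs, but it avoids guessing which copy of a repeat a read belongs to. Because this sketch does not remove transitive overlaps, it is only meaningful on inputs where each read overlaps its immediate neighbours alone.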


Selected Publications