How to use MUMmer to align short reads to the human genome

Step 1: Download

The current MUMmer release can be downloaded from our SourceForge.net project page.

Step 2: Compile and install

(Instructions copied from section 2.3 of the MUMmer manual.)

For explanation purposes, let's suppose you just downloaded the MUMmer3.0.tar.gz distribution from the SourceForge site. The first step would be to move this file to the desired installation directory and type:

tar -xvzf MUMmer3.0.tar.gz

to extract the MUMmer source into a MUMmer3.0 subdirectory. Switch to this newly created subdirectory and execute:

make check

to assure the makefile can identify the necessary utilities. If no error messages appear, the diagnostics were successful and you may continue. However, if error messages are displayed, the listed programs are not accessible via your system path. Install the utilities if necessary, add them to your system PATH variable, and continue with the MUMmer installation by typing:

make install

This will attempt to compile the MUMmer scripts and executables. If the make command issues no errors, the compilation was successful and you are ready to begin using MUMmer. If the command fails, it is likely that make was confused by the existence of more than one copy of the same utility, such as two versions of gcc. When this happens, it is important to arrange you system PATH variable so that the more recent versions are listed first, or to hard code the location of your utility location in the makefile. The same advice goes for your LD_LIBRARY_PATH variable if your system is having a difficult time locating the appropriate C or C++ libraries at runtime.

It is important to note that the make command dynamically builds the MUMmer scripts to reference the install directory, therefore if the install directory is moved after the make command is issued the MUMmer scripts will fail. If you need certain MUMmer executables in a directory other than the install directory, it is recommend to leave the install directory untouched and link the needed executables to the desired destination. An alternative would be to move the install directory and reissue the make command at the new location.

Step 3: Add MUMmer binaries to your PATH

For example, if you are using the bash shell and have compiled MUMmer in $HOME/software/MUMmer3.20, then add MUMmer binaries to your path with this command:

export PATH=$PATH:$HOME/software/MUMmer3.20

Step 4: Invoke nucmer via nucmer_reads.pls

Finally, download the nucmer_reads.pls script (to download, right-click the link and select "Save target as" or "Save link as") and use it to invoke nucmer with arguments appropriate for mapping short reads. Usage for nucmer_reads.pls is:

Usage:
  perl nucmer_reads.pls [-v] <read-length> <max-substs> <ref-seq> <qry-seq>
Options:
  -v    verbose output

Where read-length is the length of reads in your query file, and max-substs is the maximum number of substitutions to allow in a hit. ref-seq and qry-seq must both be multi-FASTA files.

Discussion

Nucmer uses exact-matching "anchors" from reads as seeds for building a longer inexact match.  Nucmer performs better (finishes sooner and uses less intermediate space) if the anchors are longer.  Allowing for substitutions and gaps, however, necessitates shorter anchors.  The tradeoff between performance and degree of inexactness can be sensitive, so it is important to set nucmer's parameters appropriately.

Note that nucmer treats wildcards as not matching any base, so matching a read with x wildcards is similar to matching a read with no wildcards and allowing x substitutions.

Here are some general rules for how to set the parameters to nucmer when aligning reads of length r, allowing up to d wildcards and substitutions combined.

Let d = maximum number of wildcards & substitutiones allowed
Let rmin = read length (or, if lengths vary, the minimum read length)
Let rmax = read length (or, if lengths vary, the maximum read length)

--maxmatch

  • Necessary; otherwise legitimate hits may be missed

--minmatch x

  • x should be ceil((rmin-d) / (d+1)); larger values may cause legitimate hits to be missed

  • E.g.: 32-base reads with single substitution or wildcard: ceil((32-1)/(1+1)) = 16

  • E.g.: 32-base reads with total of two substitutions and wildcards: ceil((32-2)/(2+1)) = 10

  • Default is 20

--mincluster x

  • Should be the same as the setting for --minmatch

  • Default is 65

--breaklen x

  • x should be rmax - ceil((rmin-d) / (d+1)); smaller values may cause legitimate hits to be missed

  • E.g.: 32-base reads with single substitution or wildcard: 32 - ceil((32-1)/(1+1)) = 16

  • E.g.: 32-base reads with total of two substitutions and wildcards: 32 - ceil((32-2)/(2+1)) = 22

  • Default is 200

--maxgap x

  • x should be d; smaller values may cause legitimate hits to be missed

  • Default is 90

--nooptimize

  • This encourages nucmer to extend the alignment all the way to the end of the read; without this, nucmer may fail to include ends where a substitution occurs close to an end

For exmaple, for 32-base reads allowing one substitution or wildcard, use:

nucmer --maxmatch --minmatch 16 --mincluster 16 --breaklen 16 --maxgap 1 --nooptimize <reference file> <query file>

For 32-base reads allowing up to two substitutions or wildcards, use:

nucmer --maxmatch --minmatch 10 --mincluster 10 --breaklen 22 --maxgap 2 --nooptimize <reference file> <query file>

The following is a code listing for nucmer_reads.pls a simple perl wrapper for nucmer that takes a read length, a number of substitutions, a reference, and a query and then invokes nucmer appropriately:

#!/usr/bin/perl -w

# Invoke nucmer with the appropriate arguments for matching short
# reads with some number of allowed mismatches.
#
# Usage:
#   perl nucmer_reads.pls <min-read-length> <max-substs> <ref-seq> <qry-seq>

use strict;
use warnings;
use POSIX qw(ceil floor);
use Getopt::Std;

sub usage($) {
    my $msg = shift;
    print "$msg\n\n";
    print "Usage:\n";
    print "  perl nucmer_reads.pls [-v] <min-read-length> <max-substs> <ref-seq> <qry-seq>\n\n";
    print "Options:\n";
    print "  -v    verbose output\n\n";
    exit 1;
}

# Must be able to run nucmer
system("nucmer --version 2> /dev/null") == 0
    || die "nucmer is not in your PATH; please add it and re-run";

# Get -v option
my %options=();
getopts("v",\%options);
my $verbose = 0; $verbose = $options{v} if defined $options{v};

defined($ARGV[0]) || usage("Must specify minimum read length as arg 1");
my $r = $ARGV[0]; print "Min read length: $r\n" if $verbose;

defined($ARGV[1]) || usage("Must specify maximum number of substitutions as arg 2");
my $d = $ARGV[1]; print "Max differences: $d\n" if $verbose;

defined($ARGV[2]) || usage("Must specify reference sequence file as arg 3");
my $ref = $ARGV[2]; print "Reference seq: $ref\n" if $verbose;

defined($ARGV[3]) || usage("Must specify query sequence file as arg 4");
my $qry = $ARGV[3]; print "Query seq: $qry\n" if $verbose;

# Calculate nucmer parameters
my $minmatch = ceil(($r-$d) / ($d+1));
my $mincluster = $minmatch;
my $maxgap = $d;
my $breaklen = $r - $minmatch;

# Run nucmer
my $cmd = "nucmer --maxmatch -b $breaklen -c $mincluster -g $maxgap -l $minmatch --nooptimize $ref $qry";
print "Command: $cmd\n" if $verbose;
exec($cmd);