How to use MUMmer to align short reads to the human genome
Step 1: Download
The current MUMmer release can be
downloaded
from our SourceForge.net
project page.
Step 2: Compile and install
(Instructions copied from section 2.3 of the
MUMmer
manual.)
For explanation purposes, let's suppose you just downloaded the
MUMmer3.0.tar.gz
distribution from the SourceForge site. The first step would be to move this
file to the desired installation directory and type:
tar -xvzf MUMmer3.0.tar.gz
to extract the MUMmer source into a MUMmer3.0 subdirectory. Switch to this
newly created subdirectory and execute:
make check
to assure the makefile can identify the necessary utilities. If no error
messages appear, the diagnostics were successful and you may continue.
However, if error messages are displayed, the listed programs are not
accessible via your system path. Install the utilities if necessary, add them
to your system PATH variable, and continue with the MUMmer installation by
typing:
make install
This will attempt to compile the MUMmer scripts and executables. If the make
command issues no errors, the compilation was successful and you are ready to
begin using MUMmer. If the command fails, it is likely that make was confused
by the existence of more than one copy of the same utility, such as two
versions of gcc. When this happens, it is important to arrange you system PATH
variable so that the more recent versions are listed first, or to hard code
the location of your utility location in the makefile. The same advice goes
for your LD_LIBRARY_PATH variable if your system is having a difficult time
locating the appropriate C or C++ libraries at runtime.
It is important to note that the make command dynamically builds the MUMmer
scripts to reference the install directory, therefore if the install directory
is moved after the make command is issued the MUMmer scripts will fail. If you
need certain MUMmer executables in a directory other than the install
directory, it is recommend to leave the install directory untouched and link
the needed executables to the desired destination. An alternative would be to
move the install directory and reissue the make command at the new location.
Step 3: Add MUMmer binaries to your PATH
For example, if you are using the bash shell and have compiled MUMmer in
$HOME/software/MUMmer3.20,
then add MUMmer binaries to your path with this command:
export PATH=$PATH:$HOME/software/MUMmer3.20
Step 4: Invoke nucmer via nucmer_reads.pls
Finally, download the nucmer_reads.pls script (to download, right-click the link and select "Save target as" or "Save link as") and use it to invoke nucmer with arguments appropriate for mapping short reads. Usage for nucmer_reads.pls is:
Usage:
perl nucmer_reads.pls [-v] <read-length> <max-substs> <ref-seq> <qry-seq>
Options:
-v verbose output
Where read-length is the length of reads in your query file, and max-substs is the maximum number of substitutions to allow in a hit. ref-seq and qry-seq must both be multi-FASTA files.
Discussion
Nucmer uses exact-matching "anchors" from reads as seeds for building a longer
inexact match. Nucmer performs better (finishes sooner and uses less
intermediate space) if the anchors are longer. Allowing for substitutions
and gaps, however, necessitates shorter anchors. The tradeoff between
performance and degree of inexactness can be sensitive, so it is important to
set nucmer's parameters appropriately.
Note that nucmer treats wildcards as not matching any base, so matching a read
with x wildcards is similar to matching a read with no wildcards
and allowing x substitutions.
Here are some general rules for how to set the parameters to nucmer when
aligning reads of length r, allowing up to d
wildcards and substitutions combined.
Let d = maximum number of wildcards & substitutiones
allowed
Let rmin = read length (or, if lengths vary, the minimum read
length)
Let rmax = read length (or, if lengths vary, the maximum read
length)
--maxmatch
--minmatch x
-
x should be ceil((rmin-d) / (d+1)); larger
values may cause legitimate hits to be missed
-
E.g.: 32-base reads with single substitution or wildcard:
ceil((32-1)/(1+1)) = 16
-
E.g.: 32-base reads with total of two substitutions and wildcards:
ceil((32-2)/(2+1)) = 10
-
Default is 20
--mincluster x
--breaklen x
-
x should be rmax - ceil((rmin-d) / (d+1)); smaller values
may cause legitimate hits to be missed
-
E.g.: 32-base reads with single substitution or wildcard: 32 -
ceil((32-1)/(1+1)) = 16
-
E.g.: 32-base reads with total of two substitutions and wildcards:
32 - ceil((32-2)/(2+1)) = 22
-
Default is 200
--maxgap x
--nooptimize
For exmaple, for 32-base reads allowing one substitution or wildcard, use:
nucmer --maxmatch --minmatch 16 --mincluster 16 --breaklen 16 --maxgap 1 --nooptimize <reference file> <query file>
For 32-base reads allowing up to two substitutions or wildcards, use:
nucmer --maxmatch --minmatch 10 --mincluster 10 --breaklen 22 --maxgap 2 --nooptimize <reference file> <query file>
The following is a code listing for
nucmer_reads.pls
a simple perl wrapper for nucmer that takes a read length, a number of
substitutions, a reference, and a query and then invokes nucmer appropriately:
#!/usr/bin/perl
-w
# Invoke nucmer with the
appropriate arguments for matching
short
# reads with some number
of allowed
mismatches.
#
#
Usage:
# perl
nucmer_reads.pls <min-read-length> <max-substs>
<ref-seq>
<qry-seq>
use
strict;
use
warnings;
use POSIX qw(ceil
floor);
use
Getopt::Std;
sub usage($)
{
my
$msg = shift;
print
"$msg\n\n";
print
"Usage:\n";
print
" perl nucmer_reads.pls [-v] <min-read-length>
<max-substs> <ref-seq>
<qry-seq>\n\n";
print
"Options:\n";
print
" -v verbose
output\n\n";
exit
1;
}
# Must be able to run
nucmer
system("nucmer --version
2> /dev/null") ==
0
||
die "nucmer is not in your PATH; please add it and
re-run";
# Get -v
option
my
%options=();
getopts("v",\%options);
my $verbose = 0;
$verbose = $options{v} if defined
$options{v};
defined($ARGV[0]) ||
usage("Must specify minimum read length as arg
1");
my $r = $ARGV[0]; print
"Min read length: $r\n" if
$verbose;
defined($ARGV[1]) ||
usage("Must specify maximum number of substitutions as arg
2");
my $d = $ARGV[1]; print
"Max differences: $d\n" if
$verbose;
defined($ARGV[2]) ||
usage("Must specify reference sequence file as arg
3");
my $ref = $ARGV[2];
print "Reference seq: $ref\n" if
$verbose;
defined($ARGV[3]) ||
usage("Must specify query sequence file as arg
4");
my $qry = $ARGV[3];
print "Query seq: $qry\n" if
$verbose;
# Calculate nucmer
parameters
my $minmatch =
ceil(($r-$d) /
($d+1));
my $mincluster =
$minmatch;
my $maxgap =
$d;
my $breaklen = $r -
$minmatch;
# Run
nucmer
my $cmd = "nucmer
--maxmatch -b $breaklen -c $mincluster -g $maxgap -l $minmatch
--nooptimize $ref
$qry";
print "Command: $cmd\n"
if $verbose;
exec($cmd);
|
|