PBcR pipeline

The results below are for CA 8.1. With CA 8.2 onward, a novel fast overlapping algorithm, MHAP is included which reduces D. melanogaster correction time to < 400 CPU hours and total runtime to < 700 CPU hours, taking approximately a day on a single 16-core machine. For the latest usage, see the PBcR page and Release notes.

The Celera Assembler (CA) PBcR pipeline [1] is for the correction (also referred to as "pre-assembly") and assembly of long-read sequencing data. Our recent publication in Genome Biology [2] demonstrated that this approach is sufficient to finish the majority of known bacterial genomes using only a single PacBio sequencing library. The resulting assemblies exceed the quality and accuracy possible from high-quality, short-read data. The CA 8.1 release includes updates to the PBcR pipeline to support correction and assembly of eukaryotic genomes as well. By incorporating a new consensus algorithm developed by Chin et al. for the HGAP assembler [3], the correction step is 3-4-fold faster than before. Other algorithm improvements make the subsequent assembly 2-3-fold faster as well. A tutorial and sample data is available on the PBcR wiki.

This page describes the application of the PBcR pipeline to a Drosophila melanogaster dataset recently released by Pacific Biosciences. The results demonstrate that hierarchical assembly [1], recently demonstrated for microbial genomes [2][3], is also applicable to larger, eukaryotic genomes and can result in highly continuous assemblies, including near-fully assembled chromosome arms.

D. melanogaster Sequencing

In collaboration with Casey Bergman, Susan Celniker, and Roger Hoskins, Pacific Biosciences recently sequenced the Berkeley Drosophila Genome Project (BDGP) reference subline of D. melanogaster using 42 PacBio SMRT cells of the latest P5-C3 chemistry. This is the same stock first sequenced and assembled in 2000 by Myers et al. [4]. The new single-molecule sequencing dataset and a preliminary assembly is described on the PacBio blog and is available from the PacBio DevNet. Additional analysis, including mapping of these reads to the current reference assembly, has been blogged by Casey Bergman.
The summary statistics of the filtered dataset are:
# Bases 15.74Gbp
# Reads 1.690M
Mean 9,317
Median 8,103
Max 44,766

Pre-assembly using PBcR

Raw sequences were filtered using SMRTportal and input to the PBcR pipeline included in CA 8.1. Only PacBio sequences were used for the pre-assembly (correction).

pacBioToCA -length 500 -s dros_p5.spec -t 32 -partitions 334 -pbCNS -l pbcns -fastq filtered_subreads.fastq maxCoverage=100

Approximately 621K CPU hours were required for correction. The majority of the time (608K) was used for BLASR alignments. It is recommended to correct only the longest subset of data (rather than all data as done here), which would reduce the runtime 3-fold. Based on a comparison to the reference, the corrected data is over 99.3% accurate. The pre-assembled reads (download) have the following characteristics:
# Bases 8.84Gp
# Reads 1.02M
Mean 8,697
Median 7,750
Max 36,765


Two assemblies were generated using CA 8.1 using the parameters recommended in [2] and relaxed parameters to merge haplotypes. The longest 25X of corrected data was used for assembly (mean length: 20,814bp). The conservative assembly is available both pre and post-Quiver (download). The statistics for both preliminary assemblies are:
Asm Conservative Asm Relaxed
# Bases 138.36Mbp 142.438Mbp
# Contigs 128 109
Mean 1,080,972 1,306,769
Max 24,622,056 27,637,272
N25 18,654,797 15,159,996
N50 15,297,019 11,926,096

read length comparison
Figure 1 above shows a histogram comparing the filtered sequence length to the pre-assembled lengths. The shaded region indicates those sequences used to generate an assembly. The assembly was run with the command:

gatekeeper -T -F -o asm.gkpStore pbcns.frg
gatekeeper -dumplongestlength 0 3250000000 -dumpfastq pbcns.25X asm.gkpStore
fastqToCA -libraryname pbcns25X -technology pacbio-corrected -type sanger -reads pbcns.25X.unmated.fastq > pbcns.25X.frg
runCA -s asm.spec -p asm -d asm ovlMinLen=3000 pbcns.25X.frg
runCA -s asm.relax.spec -p asm -d asmRelaxed pbcns.25X.frg

read length comparison
Approximately 8K CPU hours were required for assembly. Only contigs composed of at least 50 sequences were used for downstream analysis. Figure 2 shows the contig length and cumulative length plots for the assembly. Since the contig sizes in both assemblies are very similar, the conservative assembly was selected for analysis.

read length comparison
Figure 3 shows the alignment of the assembly against the current reference. The largest contig comprises the majority of chromosome arm 3L. All autosomal arms (2L, 2R, 3R, and 4) are in less than six large contigs. Chromosome X is at lower coverage because of hemizygosity and can likely be improved by re-assembling with a higher coverage subset of pre-assembled reads. This plot was generating by MUMmer using the following commands:
nucmer -mumreference -c 1000 -l 100 -banded -d 10 DmelRefv5.fa contigs.fa
delta-filter -1 out.delta > out.1delta
mummerplot -large -fat out.1delta
There are no large-scale, structural discrepancies between the assembly and current v5 reference genome. Testament to both the quality of the reference genome and the quality of the PacBio assembly. The base calls of the assembly also largely agree with the reference, but there are a number of SNPs and small Indels. This is improved with Quiver. The average percent identity as reported by dnadiff is 99.92% pre-Quiver and 99.97% post-Quiver. The remaining discrepancies are likely a mix of true allelic variation, assembly errors, and reference errors that must be investigated further.


[1] Koren S., Schatz, M. C., Walenz, B. P., Martin, J., Howard, J. T., Ganapathy, G., Wang, Z., Rasko, D. A., McCombie, W. R., Jarvis, E. D., and Phillippy, A. M. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotech. (2012)
[2] Koren S., Harhay G. P., Smith T. P. L., Bono J. L., Harhay D. M., Mcvey D. S., Radune D., Bergman N. H., and Phillippy A. M. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biology 14:R101, 2013.
[3] Chin, C. S., Alexander, D. H., Marks, P., Klammer, A. A., Drake, J., Heiner, C., Clum, A., ..., and Korlach, J. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods, 2013.
[4] Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J., Kravitz S. A., ..., and Venter, J. C. A whole-genome assembly of Drosophila. Science, 287(5461), 2196-2204, 2000.