TY - Generic T1 - Orchestrating high-throughput genomic analysis with Bioconductor. Y1 - 2015 A1 - Huber, Wolfgang A1 - Carey, Vincent J A1 - Gentleman, Robert A1 - Anders, Simon A1 - Carlson, Marc A1 - Carvalho, Benilton S A1 - Bravo, Héctor Corrada A1 - Davis, Sean A1 - Gatto, Laurent A1 - Girke, Thomas A1 - Gottardo, Raphael A1 - Hahne, Florian A1 - Hansen, Kasper D A1 - Irizarry, Rafael A A1 - Lawrence, Michael A1 - Love, Michael I A1 - MacDonald, James A1 - Obenchain, Valerie A1 - Oleś, Andrzej K A1 - Pagès, Hervé A1 - Reyes, Alejandro A1 - Shannon, Paul A1 - Smyth, Gordon K A1 - Tenenbaum, Dan A1 - Waldron, Levi A1 - Morgan, Martin KW - Computational Biology KW - Gene Expression Profiling KW - Genomics KW - High-Throughput Screening Assays KW - Programming Languages KW - software KW - User-Computer Interface AB -

Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.

JA - Nat Methods VL - 12 CP - 2 M3 - 10.1038/nmeth.3252 ER - TY - JOUR T1 - Automated ensemble assembly and validation of microbial genomes. JF - BMC Bioinformatics Y1 - 2014 A1 - Koren, Sergey A1 - Treangen, Todd A1 - Hill, Christopher M A1 - Pop, Mihai A1 - Phillippy, Adam M KW - Genome, Bacterial KW - Genome, Microbial KW - Genomics KW - Mycobacterium tuberculosis KW - Rhodobacter sphaeroides KW - Sequence Analysis, DNA KW - software AB -

BACKGROUND: The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible.

RESULTS: To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers.

CONCLUSIONS: Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.

VL - 15 M3 - 10.1186/1471-2105-15-126 ER - TY - Generic T1 - BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution. Y1 - 2014 A1 - Ye, Chengxi A1 - Hsiao, Chiaowen A1 - Corrada Bravo, Hector KW - algorithms KW - High-Throughput Nucleotide Sequencing KW - HUMANS KW - Probability KW - Reproducibility of Results KW - Sequence Analysis, DNA KW - software KW - Time factors AB -

MOTIVATION: Base-calling of sequencing data produced by high-throughput sequencing platforms is a fundamental process in current bioinformatics analysis. However, existing third-party probabilistic or machine-learning methods that significantly improve the accuracy of base-calls on these platforms are impractical for production use due to their computational inefficiency.

RESULTS: We directly formulated base-calling as a blind deconvolution problem and implemented BlindCall as an efficient solver for this inverse problem. BlindCall produced base-calls at accuracy comparable to state-of-the-art probabilistic methods while processing data at rates 10 times faster in most cases. The computational complexity of BlindCall scales linearly with read length, making it better suited for new long-read sequencing technologies.

JA - Bioinformatics VL - 30 CP - 9 M3 - 10.1093/bioinformatics/btu010 ER - TY - Generic T1 - Epiviz: interactive visual analytics for functional genomics data. Y1 - 2014 A1 - Chelaru, Florin A1 - Smith, Llewellyn A1 - Goldstein, Naomi A1 - Bravo, Héctor Corrada KW - algorithms KW - Chromosome mapping KW - Data Mining KW - database management systems KW - Databases, Genetic KW - Genomics KW - Internet KW - software KW - User-Computer Interface AB -

Visualization is an integral aspect of genomics data analysis. Algorithmic-statistical analysis and interactive visualization are most effective when used iteratively. Epiviz (http://epiviz.cbcb.umd.edu/), a web-based genome browser, and the Epivizr Bioconductor package allow interactive, extensible and reproducible visualization within a state-of-the-art data-analysis platform.

JA - Nat Methods VL - 11 CP - 9 M3 - 10.1038/nmeth.3038 ER - TY - Generic T1 - Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Y1 - 2014 A1 - Aryee, Martin J A1 - Jaffe, Andrew E A1 - Corrada-Bravo, Hector A1 - Ladd-Acosta, Christine A1 - Feinberg, Andrew P A1 - Hansen, Kasper D A1 - Irizarry, Rafael A KW - Aged KW - algorithms KW - Colonic Neoplasms KW - DNA Methylation KW - Genome KW - High-Throughput Nucleotide Sequencing KW - HUMANS KW - Oligonucleotide Array Sequence Analysis KW - Polymorphism, Single Nucleotide KW - software AB -

MOTIVATION: The recently released Infinium HumanMethylation450 array (the '450k' array) provides a high-throughput assay to quantify DNA methylation (DNAm) at ∼450 000 loci across a range of genomic features. Although less comprehensive than high-throughput sequencing-based techniques, this product is more cost-effective and promises to be the most widely used DNAm high-throughput measurement technology over the next several years.

RESULTS: Here we describe a suite of computational tools that incorporate state-of-the-art statistical techniques for the analysis of DNAm data. The software is structured to easily adapt to future versions of the technology. We include methods for preprocessing, quality assessment and detection of differentially methylated regions from the kilobase to the megabase scale. We show how our software provides a powerful and flexible development platform for future methods. We also illustrate how our methods empower the technology to make discoveries previously thought to be possible only with sequencing-based methods.

AVAILABILITY AND IMPLEMENTATION: http://bioconductor.org/packages/release/bioc/html/minfi.html.

CONTACT: khansen@jhsph.edu; rafa@jimmy.harvard.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

JA - Bioinformatics VL - 30 CP - 10 M3 - 10.1093/bioinformatics/btu049 ER - TY - JOUR T1 - Quantitative 4D analyses of epithelial folding during Drosophila gastrulation. JF - Development Y1 - 2014 A1 - Khan, Zia A1 - Wang, Yu-Chiun A1 - Wieschaus, Eric F A1 - Kaschube, Matthias KW - Animals KW - Body Patterning KW - Cell Shape KW - Cell Tracking KW - Drosophila melanogaster KW - Epithelial Cells KW - Epithelium KW - Gastrulation KW - Image Processing, Computer-Assisted KW - software AB -

Understanding the cellular and mechanical processes that underlie the shape changes of individual cells and their collective behaviors in a tissue during dynamic and complex morphogenetic events is currently one of the major frontiers in developmental biology. The advent of high-speed time-lapse microscopy and its use in monitoring the cellular events in fluorescently labeled developing organisms demonstrate tremendous promise in establishing detailed descriptions of these events and could potentially provide a foundation for subsequent hypothesis-driven research strategies. However, obtaining quantitative measurements of dynamic shapes and behaviors of cells and tissues in a rapidly developing metazoan embryo using time-lapse 3D microscopy remains technically challenging, with the main hurdle being the shortage of robust imaging processing and analysis tools. We have developed EDGE4D, a software tool for segmenting and tracking membrane-labeled cells using multi-photon microscopy data. Our results demonstrate that EDGE4D enables quantification of the dynamics of cell shape changes, cell interfaces and neighbor relations at single-cell resolution during a complex epithelial folding event in the early Drosophila embryo. We expect this tool to be broadly useful for the analysis of epithelial cell geometries and movements in a wide variety of developmental contexts.

VL - 141 CP - 14 M3 - 10.1242/dev.107730 ER - TY - Generic T1 - Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Y1 - 2014 A1 - Patro, Rob A1 - Mount, Stephen M A1 - Kingsford, Carl KW - algorithms KW - Brain Chemistry KW - Computational Biology KW - HUMANS KW - Models, Biological KW - RNA Isoforms KW - Sequence Analysis, RNA KW - software AB -

We introduce Sailfish, a computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data. Because Sailfish entirely avoids mapping reads, a time-consuming step in all current methods, it provides quantification estimates much faster than do existing approaches (typically 20 times faster) without loss of accuracy. By facilitating frequent reanalysis of data and reducing the need to optimize parameters, Sailfish exemplifies the potential of lightweight algorithms for efficiently processing sequencing reads.

JA - Nat Biotechnol VL - 32 CP - 5 M3 - 10.1038/nbt.2862 ER - TY - JOUR T1 - InterPro in 2011: new developments in the family and domain prediction database. JF - Nucleic Acids Res Y1 - 2012 A1 - Hunter, Sarah A1 - Jones, Philip A1 - Mitchell, Alex A1 - Apweiler, Rolf A1 - Attwood, Teresa K A1 - Bateman, Alex A1 - Bernard, Thomas A1 - Binns, David A1 - Bork, Peer A1 - Burge, Sarah A1 - de Castro, Edouard A1 - Coggill, Penny A1 - Corbett, Matthew A1 - Das, Ujjwal A1 - Daugherty, Louise A1 - Duquenne, Lauranne A1 - Finn, Robert D A1 - Fraser, Matthew A1 - Gough, Julian A1 - Haft, Daniel A1 - Hulo, Nicolas A1 - Kahn, Daniel A1 - Kelly, Elizabeth A1 - Letunic, Ivica A1 - Lonsdale, David A1 - Lopez, Rodrigo A1 - Madera, Martin A1 - Maslen, John A1 - McAnulla, Craig A1 - McDowall, Jennifer A1 - McMenamin, Conor A1 - Mi, Huaiyu A1 - Mutowo-Muellenet, Prudence A1 - Mulder, Nicola A1 - Natale, Darren A1 - Orengo, Christine A1 - Pesseat, Sebastien A1 - Punta, Marco A1 - Quinn, Antony F A1 - Rivoire, Catherine A1 - Sangrador-Vegas, Amaia A1 - Selengut, Jeremy D A1 - Sigrist, Christian J A A1 - Scheremetjew, Maxim A1 - Tate, John A1 - Thimmajanarthanan, Manjulapramila A1 - Thomas, Paul D A1 - Wu, Cathy H A1 - Yeats, Corin A1 - Yong, Siew-Yit KW - Databases, Protein KW - Protein Structure, Tertiary KW - Proteins KW - Sequence Analysis, Protein KW - software KW - Terminology as Topic KW - User-Computer Interface N1 - http://www.ncbi.nlm.nih.gov/pubmed/22096229?dopt=Abstract AB -

InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.

VL - 40 CP - Database issue M3 - 10.1093/nar/gkr948 ER - TY - JOUR T1 - Accurate proteome-wide protein quantification from high-resolution 15N mass spectra. JF - Genome Biol Y1 - 2011 A1 - Khan, Zia A1 - Amini, Sasan A1 - Bloom, Joshua S A1 - Ruse, Cristian A1 - Caudy, Amy A A1 - Kruglyak, Leonid A1 - Singh, Mona A1 - Perlman, David H A1 - Tavazoie, Saeed KW - algorithms KW - Amino Acid Sequence KW - Bacterial Proteins KW - Escherichia coli KW - Isotope Labeling KW - Mass Spectrometry KW - Molecular Sequence Data KW - Nitrogen Isotopes KW - Proteome KW - proteomics KW - Sensitivity and Specificity KW - software AB -

In quantitative mass spectrometry-based proteomics, the metabolic incorporation of a single source of 15N-labeled nitrogen has many advantages over using stable isotope-labeled amino acids. However, the lack of a robust computational framework for analyzing the resulting spectra has impeded wide use of this approach. We have addressed this challenge by introducing a new computational methodology for analyzing 15N spectra in which quantification is integrated with identification. Application of this method to an Escherichia coli growth transition reveals significant improvement in quantification accuracy over previous methods.

VL - 12 CP - 12 M3 - 10.1186/gb-2011-12-12-r122 ER - TY - Generic T1 - Computing the Tree of Life: Leveraging the Power of Desktop and Service Grids T2 - Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on Y1 - 2011 A1 - Adam L. Bazinet A1 - Michael P. Cummings KW - (artificial KW - (mathematics) KW - analysis KW - BOINC KW - COMPUTATION KW - computational KW - computing KW - data KW - Estimation KW - evolutionary KW - GARLI KW - genetic KW - Grid KW - GRIDS KW - handling KW - heterogeneous KW - History KW - HPC KW - information KW - intelligence) KW - interface KW - interfaces KW - Internet KW - jobs KW - lattice KW - learning KW - life KW - likelihood KW - load KW - machine KW - maximum KW - method KW - model KW - molecular KW - phylogenetic KW - portal KW - Portals KW - power KW - project KW - resource KW - Science KW - sequence KW - service KW - services KW - sets KW - software KW - substantial KW - system KW - systematics KW - tree KW - TREES KW - user KW - Web AB - The trend in life sciences research, particularly in molecular evolutionary systematics, is toward larger data sets and ever-more detailed evolutionary models, which can generate substantial computational loads. Over the past several years we have developed a grid computing system aimed at providing researchers the computational power needed to complete such analyses in a timely manner. Our grid system, known as The Lattice Project, was the first to combine two models of grid computing - the service model, which mainly federates large institutional HPC resources, and the desktop model, which harnesses the power of PCs volunteered by the general public. Recently we have developed a "science portal" style web interface that makes it easier than ever for phylogenetic analyses to be completed using GARLI, a popular program that uses a maximum likelihood method to infer the evolutionary history of organisms on the basis of genetic sequence data. 
This paper describes our approach to scheduling thousands of GARLI jobs with diverse requirements to heterogeneous grid resources, which include volunteer computers running BOINC software. A key component of this system provides a priori GARLI runtime estimates using machine learning with random forests. JA - Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on ER - TY - JOUR T1 - ProPhylo: partial phylogenetic profiling to guide protein family construction and assignment of biological process. JF - BMC Bioinformatics Y1 - 2011 A1 - Basu, Malay K A1 - Selengut, Jeremy D A1 - Haft, Daniel H KW - algorithms KW - Archaea KW - Archaeal Proteins KW - DNA KW - Methane KW - Phylogeny KW - software N1 - http://www.ncbi.nlm.nih.gov/pubmed/22070167?dopt=Abstract AB -

BACKGROUND: Phylogenetic profiling is a technique of scoring co-occurrence between a protein family and some other trait, usually another protein family, across a set of taxonomic groups. In spite of several refinements in recent years, the technique still invites significant improvement. To be its most effective, a phylogenetic profiling algorithm must be able to examine co-occurrences among protein families whose boundaries are uncertain within large homologous protein superfamilies.

RESULTS: Partial Phylogenetic Profiling (PPP) is an iterative algorithm that scores a given taxonomic profile against the taxonomic distribution of families for all proteins in a genome. The method works through optimizing the boundary of each protein family, rather than by relying on prebuilt protein families or fixed sequence similarity thresholds. Double Partial Phylogenetic Profiling (DPPP) is a related procedure that begins with a single sequence and searches for optimal granularities for its surrounding protein family in order to generate the best query profiles for PPP. We present ProPhylo, a high-performance software package for phylogenetic profiling studies through creating individually optimized protein family boundaries. ProPhylo provides precomputed databases for immediate use and tools for manipulating the taxonomic profiles used as queries.

CONCLUSION: ProPhylo results show universal markers of methanogenesis, a new DNA phosphorothioation-dependent restriction enzyme, and efficacy in guiding protein family construction. The software and the associated databases are freely available under the open source Perl Artistic License from ftp://ftp.jcvi.org/pub/data/ppp/.
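
The co-occurrence scoring described in the BACKGROUND section can be illustrated with a minimal sketch. It assumes binary presence/absence profiles over a fixed list of taxa and uses a Jaccard index as the score. Both choices are assumptions made for illustration and are not taken from the paper; in particular, this is not the PPP algorithm, which additionally optimizes protein family boundaries. The function name and the toy profiles are hypothetical.

```python
def jaccard_cooccurrence(profile_a, profile_b):
    """Score co-occurrence of two presence/absence profiles over the same taxa.

    Each profile is a sequence of 0/1 values, one per taxon, in the same order.
    The Jaccard index is an illustrative choice, not the score used by ProPhylo.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same taxa")
    # Taxa in which both the family and the trait are present.
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    # Taxa in which at least one of the two is present.
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical profiles over five taxa (1 = present, 0 = absent).
family_profile = [1, 1, 0, 1, 0]
trait_profile = [1, 1, 0, 0, 0]
print(jaccard_cooccurrence(family_profile, trait_profile))
```

A score near 1 would suggest that the family and the trait tend to appear in the same taxa; a real analysis would also need a significance estimate, which this sketch omits.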

VL - 12 M3 - 10.1186/1471-2105-12-434 ER - TY - JOUR T1 - TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes JF - Nucleic Acids Research Y1 - 2007 A1 - Selengut, Jeremy D. A1 - Haft, Daniel H. A1 - Davidsen, Tanja A1 - Ganapathy, Anurhada A1 - Gwinn-Giglio, Michelle A1 - Nelson, William C. A1 - Richter, R. Alexander A1 - White, Owen KW - Archaeal Proteins KW - Bacterial Proteins KW - Databases, Protein KW - Genome, Bacterial KW - Genomics KW - Internet KW - Phylogeny KW - software KW - User-Computer Interface AB - TIGRFAMs is a collection of protein family definitions built to aid in high-throughput annotation of specific protein functions. Each family is based on a hidden Markov model (HMM), where both cutoff scores and membership in the seed alignment are chosen so that the HMMs can classify numerous proteins according to their specific molecular functions. Most TIGRFAMs models describe 'equivalog' families, where both orthology and lateral gene transfer may be part of the evolutionary history, but where a single molecular function has been conserved. The Genome Properties system contains a queriable set of metabolic reconstructions, genome metrics and extractions of information from the scientific literature. Its genome-by-genome assertions of whether or not specific structures, pathways or systems are present provide high-level conceptual descriptions of genomic content. These assertions enable comparative genomics, provide a meaningful biological context to aid in manual annotation, support assignments of Gene Ontology (GO) biological process terms and help validate HMM-based predictions of protein function. The Genome Properties system is particularly useful as a generator of phylogenetic profiles, through which new protein family functions may be discovered. 
The TIGRFAMs and Genome Properties systems can be accessed at http://www.tigr.org/TIGRFAMs and http://www.tigr.org/Genome_Properties. VL - 35 N1 - http://www.ncbi.nlm.nih.gov/pubmed/17151080?dopt=Abstract ER - TY - JOUR T1 - Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics JF - Bioinformatics (Oxford, England) Y1 - 2005 A1 - Haft, Daniel H. A1 - Selengut, Jeremy D. A1 - Brinkac, Lauren M. A1 - Zafar, Nikhat A1 - White, Owen KW - Chromosome mapping KW - database management systems KW - Databases, Genetic KW - documentation KW - Gene Expression Profiling KW - Gene Expression Regulation KW - Genomics KW - Information Storage and Retrieval KW - Microbiological Techniques KW - natural language processing KW - Prokaryotic Cells KW - Proteome KW - signal transduction KW - software KW - User-Computer Interface KW - Vocabulary, Controlled AB - MOTIVATION: The presence or absence of metabolic pathways and structures provides a context that makes protein annotation far more reliable. Compiling such information across microbial genomes improves the functional classification of proteins and provides a valuable resource for comparative genomics. RESULTS: We have created a Genome Properties system to present key aspects of prokaryotic biology using standardized computational methods and controlled vocabularies. Properties reflect gene content, phenotype, phylogeny and computational analyses. The results of searches using hidden Markov models allow many properties to be deduced automatically, especially for families of proteins (equivalogs) conserved in function since their last common ancestor. Additional properties are derived from curation, published reports and other forms of evidence. 
The Genome Properties system was applied to 156 complete prokaryotic genomes, and is easily mined to find differences between species, correlations between metabolic features and families of uncharacterized proteins, or relationships among properties. AVAILABILITY: Genome Properties can be found at http://www.tigr.org/Genome_Properties SUPPLEMENTARY INFORMATION: http://www.tigr.org/tigr-scripts/CMR2/genome_properties_references.spl. VL - 21 N1 - http://www.ncbi.nlm.nih.gov/pubmed/15347579?dopt=Abstract ER - TY - JOUR T1 - Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. JF - Nucleic Acids Res Y1 - 2003 A1 - Haas, Brian J A1 - Delcher, Arthur L A1 - Mount, Stephen M A1 - Wortman, Jennifer R A1 - Smith, Roger K A1 - Hannick, Linda I A1 - Maiti, Rama A1 - Ronning, Catherine M A1 - Rusch, Douglas B A1 - Town, Christopher D A1 - Salzberg, Steven L A1 - White, Owen KW - algorithms KW - Alternative Splicing KW - Arabidopsis KW - DNA, Complementary KW - Expressed Sequence Tags KW - Genome, Plant KW - Introns KW - Plant Proteins KW - RNA, Plant KW - sequence alignment KW - software KW - Transcription, Genetic KW - Untranslated Regions AB -

The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the approximately 27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.

VL - 31 CP - 19 ER - TY - JOUR T1 - Splicing signals in Drosophila: intron size, information content, and consensus sequences. JF - Nucleic Acids Res Y1 - 1992 A1 - Mount, S M A1 - Burks, C A1 - Hertz, G A1 - Stormo, G D A1 - White, O A1 - Fields, C KW - Animals KW - Base Sequence KW - Consensus Sequence KW - Databases, Factual KW - Drosophila KW - Introns KW - Molecular Sequence Data KW - RNA Splicing KW - RNA, Messenger KW - software AB -

A database of 209 Drosophila introns was extracted from Genbank (release number 64.0) and examined by a number of methods in order to characterize features that might serve as signals for messenger RNA splicing. A tight distribution of sizes was observed: while the smallest introns in the database are 51 nucleotides, more than half are less than 80 nucleotides in length, and most of these have lengths in the range of 59-67 nucleotides. Drosophila splice sites found in large and small introns differ in only minor ways from each other and from those found in vertebrate introns. However, larger introns have greater pyrimidine-richness in the region between 11 and 21 nucleotides upstream of 3' splice sites. The Drosophila branchpoint consensus matrix resembles C T A A T (in which branch formation occurs at the underlined A), and differs from the corresponding mammalian signal in the absence of G at the position immediately preceding the branchpoint. The distribution of occurrences of this sequence suggests a minimum distance between 5' splice sites and branchpoints of about 38 nucleotides, and a minimum distance between 3' splice sites and branchpoints of 15 nucleotides. The methods we have used detect no information in exon sequences other than in the few nucleotides immediately adjacent to the splice sites. However, Drosophila resembles many other species in that there is a discontinuity in A + T content between exons and introns, which are A + T rich.
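
Two of the summary statistics mentioned in this abstract, the A + T content of a sequence and the share of introns below a length cutoff, are simple to compute. The following is a minimal sketch under stated assumptions: the function names, the toy sequences and the toy length list are all hypothetical and are not taken from the 209-intron database or from the authors' methods.

```python
def at_content(seq):
    """Return the fraction of A and T bases in a DNA sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


def fraction_shorter_than(lengths, cutoff):
    """Return the fraction of intron lengths strictly below `cutoff`."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < cutoff) / len(lengths)


# Hypothetical toy data for illustration only.
exon = "ATGGCCGGCAAGCTG"
intron = "GTAAGTATTTTAATTCTAATTTTAG"
intron_lengths = [51, 59, 61, 63, 67, 74, 120, 850]

# Compare A + T content of the toy exon and the toy intron.
print(at_content(exon), at_content(intron))
# Share of toy introns shorter than 80 nucleotides.
print(fraction_shorter_than(intron_lengths, 80))
```

Comparing `at_content` across exon and intron sequences is one way to look for the kind of compositional discontinuity the abstract describes, though a real analysis would work on annotated GenBank records, not hand-written strings.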

VL - 20 CP - 16 ER -