TY  - Generic
T1  - Orchestrating high-throughput genomic analysis with Bioconductor.
Y1  - 2015
A1  - Huber, Wolfgang
A1  - Carey, Vincent J
A1  - Gentleman, Robert
A1  - Anders, Simon
A1  - Carlson, Marc
A1  - Carvalho, Benilton S
A1  - Bravo, Héctor Corrada
A1  - Davis, Sean
A1  - Gatto, Laurent
A1  - Girke, Thomas
A1  - Gottardo, Raphael
A1  - Hahne, Florian
A1  - Hansen, Kasper D
A1  - Irizarry, Rafael A
A1  - Lawrence, Michael
A1  - Love, Michael I
A1  - MacDonald, James
A1  - Obenchain, Valerie
A1  - Oleś, Andrzej K
A1  - Pagès, Hervé
A1  - Reyes, Alejandro
A1  - Shannon, Paul
A1  - Smyth, Gordon K
A1  - Tenenbaum, Dan
A1  - Waldron, Levi
A1  - Morgan, Martin
KW  - Computational Biology
KW  - Gene Expression Profiling
KW  - Genomics
KW  - High-Throughput Screening Assays
KW  - Programming Languages
KW  - software
KW  - User-Computer Interface
AB  - Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.
JA  - Nat Methods
VL  - 12
CP  - 2
M3  - 10.1038/nmeth.3252
ER  - 

TY  - JOUR
T1  - Automated ensemble assembly and validation of microbial genomes.
JF  - BMC Bioinformatics
Y1  - 2014
A1  - Koren, Sergey
A1  - Treangen, Todd
A1  - Hill, Christopher M
A1  - Pop, Mihai
A1  - Phillippy, Adam M
KW  - Genome, Bacterial
KW  - Genome, Microbial
KW  - Genomics
KW  - Mycobacterium tuberculosis
KW  - Rhodobacter sphaeroides
KW  - Sequence Analysis, DNA
KW  - software
AB  - BACKGROUND: The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible. RESULTS: To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers. CONCLUSIONS: Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.
VL  - 15
M3  - 10.1186/1471-2105-15-126
ER  - 

TY  - Generic
T1  - BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution.
Y1  - 2014
A1  - Ye, Chengxi
A1  - Hsiao, Chiaowen
A1  - Corrada Bravo, Hector
KW  - algorithms
KW  - High-Throughput Nucleotide Sequencing
KW  - HUMANS
KW  - Probability
KW  - Reproducibility of Results
KW  - Sequence Analysis, DNA
KW  - software
KW  - Time factors
AB  - MOTIVATION: Base-calling of sequencing data produced by high-throughput sequencing platforms is a fundamental process in current bioinformatics analysis. However, existing third-party probabilistic or machine-learning methods that significantly improve the accuracy of base-calls on these platforms are impractical for production use due to their computational inefficiency. RESULTS: We directly formulate base-calling as a blind deconvolution problem and implemented BlindCall as an efficient solver to this inverse problem. BlindCall produced base-calls at accuracy comparable to state-of-the-art probabilistic methods while processing data at rates 10 times faster in most cases. The computational complexity of BlindCall scales linearly with read length making it better suited for new long-read sequencing technologies.
JA  - Bioinformatics
VL  - 30
CP  - 9
M3  - 10.1093/bioinformatics/btu010
ER  - 

TY  - Generic
T1  - Epiviz: interactive visual analytics for functional genomics data.
Y1  - 2014
A1  - Chelaru, Florin
A1  - Smith, Llewellyn
A1  - Goldstein, Naomi
A1  - Bravo, Héctor Corrada
KW  - algorithms
KW  - Chromosome mapping
KW  - Data Mining
KW  - database management systems
KW  - Databases, Genetic
KW  - Genomics
KW  - Internet
KW  - software
KW  - User-Computer Interface
AB  - Visualization is an integral aspect of genomics data analysis. Algorithmic-statistical analysis and interactive visualization are most effective when used iteratively. Epiviz (http://epiviz.cbcb.umd.edu/), a web-based genome browser, and the Epivizr Bioconductor package allow interactive, extensible and reproducible visualization within a state-of-the-art data-analysis platform.
JA  - Nat Methods
VL  - 11
CP  - 9
M3  - 10.1038/nmeth.3038
ER  - 

TY  - Generic
T1  - Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays.
Y1  - 2014
A1  - Aryee, Martin J
A1  - Jaffe, Andrew E
A1  - Corrada-Bravo, Hector
A1  - Ladd-Acosta, Christine
A1  - Feinberg, Andrew P
A1  - Hansen, Kasper D
A1  - Irizarry, Rafael A
KW  - Aged
KW  - algorithms
KW  - Colonic Neoplasms
KW  - DNA Methylation
KW  - Genome
KW  - High-Throughput Nucleotide Sequencing
KW  - HUMANS
KW  - Oligonucleotide Array Sequence Analysis
KW  - Polymorphism, Single Nucleotide
KW  - software
AB  - MOTIVATION: The recently released Infinium HumanMethylation450 array (the '450k' array) provides a high-throughput assay to quantify DNA methylation (DNAm) at ∼450 000 loci across a range of genomic features. Although less comprehensive than high-throughput sequencing-based techniques, this product is more cost-effective and promises to be the most widely used DNAm high-throughput measurement technology over the next several years. RESULTS: Here we describe a suite of computational tools that incorporate state-of-the-art statistical techniques for the analysis of DNAm data. The software is structured to easily adapt to future versions of the technology. We include methods for preprocessing, quality assessment and detection of differentially methylated regions from the kilobase to the megabase scale. We show how our software provides a powerful and flexible development platform for future methods. We also illustrate how our methods empower the technology to make discoveries previously thought to be possible only with sequencing-based methods. AVAILABILITY AND IMPLEMENTATION: http://bioconductor.org/packages/release/bioc/html/minfi.html. CONTACT: khansen@jhsph.edu; rafa@jimmy.harvard.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
JA  - Bioinformatics
VL  - 30
CP  - 10
M3  - 10.1093/bioinformatics/btu049
ER  - 

TY  - JOUR
T1  - Quantitative 4D analyses of epithelial folding during Drosophila gastrulation.
JF  - Development
Y1  - 2014
A1  - Khan, Zia
A1  - Wang, Yu-Chiun
A1  - Wieschaus, Eric F
A1  - Kaschube, Matthias
KW  - Animals
KW  - Body Patterning
KW  - Cell Shape
KW  - Cell Tracking
KW  - Drosophila melanogaster
KW  - Epithelial Cells
KW  - Epithelium
KW  - Gastrulation
KW  - Image Processing, Computer-Assisted
KW  - software
AB  - Understanding the cellular and mechanical processes that underlie the shape changes of individual cells and their collective behaviors in a tissue during dynamic and complex morphogenetic events is currently one of the major frontiers in developmental biology. The advent of high-speed time-lapse microscopy and its use in monitoring the cellular events in fluorescently labeled developing organisms demonstrate tremendous promise in establishing detailed descriptions of these events and could potentially provide a foundation for subsequent hypothesis-driven research strategies. However, obtaining quantitative measurements of dynamic shapes and behaviors of cells and tissues in a rapidly developing metazoan embryo using time-lapse 3D microscopy remains technically challenging, with the main hurdle being the shortage of robust imaging processing and analysis tools. We have developed EDGE4D, a software tool for segmenting and tracking membrane-labeled cells using multi-photon microscopy data. Our results demonstrate that EDGE4D enables quantification of the dynamics of cell shape changes, cell interfaces and neighbor relations at single-cell resolution during a complex epithelial folding event in the early Drosophila embryo. We expect this tool to be broadly useful for the analysis of epithelial cell geometries and movements in a wide variety of developmental contexts.
VL  - 141
CP  - 14
M3  - 10.1242/dev.107730
ER  - 

TY  - Generic
T1  - Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.
Y1  - 2014
A1  - Patro, Rob
A1  - Mount, Stephen M
A1  - Kingsford, Carl
KW  - algorithms
KW  - Brain Chemistry
KW  - Computational Biology
KW  - HUMANS
KW  - Models, Biological
KW  - RNA Isoforms
KW  - Sequence Analysis, RNA
KW  - software
AB  - We introduce Sailfish, a computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data. Because Sailfish entirely avoids mapping reads, a time-consuming step in all current methods, it provides quantification estimates much faster than do existing approaches (typically 20 times faster) without loss of accuracy. By facilitating frequent reanalysis of data and reducing the need to optimize parameters, Sailfish exemplifies the potential of lightweight algorithms for efficiently processing sequencing reads.
JA  - Nat Biotechnol
VL  - 32
CP  - 5
M3  - 10.1038/nbt.2862
ER  - 

TY  - JOUR
T1  - InterPro in 2011: new developments in the family and domain prediction database
JF  - Nucleic Acids Research
Y1  - 2012
A1  - Hunter, Sarah
A1  - Jones, Philip
A1  - Mitchell, Alex
A1  - Apweiler, Rolf
A1  - Attwood, Teresa K.
A1  - Bateman, Alex
A1  - Bernard, Thomas
A1  - Binns, David
A1  - Bork, Peer
A1  - Burge, Sarah
A1  - de Castro, Edouard
A1  - Coggill, Penny
A1  - Corbett, Matthew
A1  - Das, Ujjwal
A1  - Daugherty, Louise
A1  - Duquenne, Lauranne
A1  - Finn, Robert D.
A1  - Fraser, Matthew
A1  - Gough, Julian
A1  - Haft, Daniel
A1  - Hulo, Nicolas
A1  - Kahn, Daniel
A1  - Kelly, Elizabeth
A1  - Letunic, Ivica
A1  - Lonsdale, David
A1  - Lopez, Rodrigo
A1  - Madera, Martin
A1  - Maslen, John
A1  - McAnulla, Craig
A1  - McDowall, Jennifer
A1  - McMenamin, Conor
A1  - Mi, Huaiyu
A1  - Mutowo-Muellenet, Prudence
A1  - Mulder, Nicola
A1  - Natale, Darren
A1  - Orengo, Christine
A1  - Pesseat, Sebastien
A1  - Punta, Marco
A1  - Quinn, Antony F.
A1  - Rivoire, Catherine
A1  - Sangrador-Vegas, Amaia
A1  - Selengut, Jeremy D.
A1  - Sigrist, Christian J. A.
A1  - Scheremetjew, Maxim
A1  - Tate, John
A1  - Thimmajanarthanan, Manjulapramila
A1  - Thomas, Paul D.
A1  - Wu, Cathy H.
A1  - Yeats, Corin
A1  - Yong, Siew-Yit
KW  - Databases, Protein
KW  - Protein Structure, Tertiary
KW  - Proteins
KW  - Sequence Analysis, Protein
KW  - software
KW  - Terminology as Topic
KW  - User-Computer Interface
AB  - InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
VL  - 40
N1  - http://www.ncbi.nlm.nih.gov/pubmed/22096229?dopt=Abstract
ER  - 

TY  - JOUR
T1  - InterPro in 2011: new developments in the family and domain prediction database.
JF  - Nucleic Acids Res
Y1  - 2012
A1  - Hunter, Sarah
A1  - Jones, Philip
A1  - Mitchell, Alex
A1  - Apweiler, Rolf
A1  - Attwood, Teresa K
A1  - Bateman, Alex
A1  - Bernard, Thomas
A1  - Binns, David
A1  - Bork, Peer
A1  - Burge, Sarah
A1  - de Castro, Edouard
A1  - Coggill, Penny
A1  - Corbett, Matthew
A1  - Das, Ujjwal
A1  - Daugherty, Louise
A1  - Duquenne, Lauranne
A1  - Finn, Robert D
A1  - Fraser, Matthew
A1  - Gough, Julian
A1  - Haft, Daniel
A1  - Hulo, Nicolas
A1  - Kahn, Daniel
A1  - Kelly, Elizabeth
A1  - Letunic, Ivica
A1  - Lonsdale, David
A1  - Lopez, Rodrigo
A1  - Madera, Martin
A1  - Maslen, John
A1  - McAnulla, Craig
A1  - McDowall, Jennifer
A1  - McMenamin, Conor
A1  - Mi, Huaiyu
A1  - Mutowo-Muellenet, Prudence
A1  - Mulder, Nicola
A1  - Natale, Darren
A1  - Orengo, Christine
A1  - Pesseat, Sebastien
A1  - Punta, Marco
A1  - Quinn, Antony F
A1  - Rivoire, Catherine
A1  - Sangrador-Vegas, Amaia
A1  - Selengut, Jeremy D
A1  - Sigrist, Christian J A
A1  - Scheremetjew, Maxim
A1  - Tate, John
A1  - Thimmajanarthanan, Manjulapramila
A1  - Thomas, Paul D
A1  - Wu, Cathy H
A1  - Yeats, Corin
A1  - Yong, Siew-Yit
KW  - Databases, Protein
KW  - Protein Structure, Tertiary
KW  - Proteins
KW  - Sequence Analysis, Protein
KW  - software
KW  - Terminology as Topic
KW  - User-Computer Interface
AB  - InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
VL  - 40
CP  - Database issue
M3  - 10.1093/nar/gkr948
ER  - 

TY  - JOUR
T1  - Accurate proteome-wide protein quantification from high-resolution 15N mass spectra.
JF  - Genome Biol
Y1  - 2011
A1  - Khan, Zia
A1  - Amini, Sasan
A1  - Bloom, Joshua S
A1  - Ruse, Cristian
A1  - Caudy, Amy A
A1  - Kruglyak, Leonid
A1  - Singh, Mona
A1  - Perlman, David H
A1  - Tavazoie, Saeed
KW  - algorithms
KW  - Amino Acid Sequence
KW  - Bacterial Proteins
KW  - Escherichia coli
KW  - Isotope Labeling
KW  - Mass Spectrometry
KW  - Molecular Sequence Data
KW  - Nitrogen Isotopes
KW  - Proteome
KW  - proteomics
KW  - Sensitivity and Specificity
KW  - software
AB  - In quantitative mass spectrometry-based proteomics, the metabolic incorporation of a single source of 15N-labeled nitrogen has many advantages over using stable isotope-labeled amino acids. However, the lack of a robust computational framework for analyzing the resulting spectra has impeded wide use of this approach. We have addressed this challenge by introducing a new computational methodology for analyzing 15N spectra in which quantification is integrated with identification. Application of this method to an Escherichia coli growth transition reveals significant improvement in quantification accuracy over previous methods.
VL  - 12
CP  - 12
M3  - 10.1186/gb-2011-12-12-r122
ER  - 

TY  - Generic
T1  - Computing the Tree of Life: Leveraging the Power of Desktop and Service Grids
T2  - Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on
Y1  - 2011
A1  - Bazinet, Adam L.
A1  - Cummings, Michael P.
KW  - analysis
KW  - BOINC
KW  - COMPUTATION
KW  - computational
KW  - computing
KW  - data
KW  - Estimation
KW  - evolutionary
KW  - GARLI
KW  - genetic
KW  - Grid
KW  - GRIDS
KW  - handling
KW  - heterogeneous
KW  - History
KW  - HPC
KW  - information
KW  - interface
KW  - interfaces
KW  - Internet
KW  - jobs
KW  - lattice
KW  - learning (artificial intelligence)
KW  - life
KW  - likelihood
KW  - load
KW  - machine
KW  - maximum
KW  - method
KW  - model
KW  - molecular
KW  - phylogenetic
KW  - portal
KW  - Portals
KW  - power
KW  - project
KW  - resource
KW  - Science
KW  - sequence
KW  - service
KW  - services
KW  - sets
KW  - software
KW  - substantial
KW  - system
KW  - systematics
KW  - tree
KW  - trees (mathematics)
KW  - user
KW  - Web
AB  - The trend in life sciences research, particularly in molecular evolutionary systematics, is toward larger data sets and ever-more detailed evolutionary models, which can generate substantial computational loads. Over the past several years we have developed a grid computing system aimed at providing researchers the computational power needed to complete such analyses in a timely manner. Our grid system, known as The Lattice Project, was the first to combine two models of grid computing - the service model, which mainly federates large institutional HPC resources, and the desktop model, which harnesses the power of PCs volunteered by the general public. Recently we have developed a "science portal" style web interface that makes it easier than ever for phylogenetic analyses to be completed using GARLI, a popular program that uses a maximum likelihood method to infer the evolutionary history of organisms on the basis of genetic sequence data. This paper describes our approach to scheduling thousands of GARLI jobs with diverse requirements to heterogeneous grid resources, which include volunteer computers running BOINC software. A key component of this system provides a priori GARLI runtime estimates using machine learning with random forests.
JA  - Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on
ER  - 

TY  - JOUR
T1  - ProPhylo: partial phylogenetic profiling to guide protein family construction and assignment of biological process
JF  - BMC Bioinformatics
Y1  - 2011
A1  - Basu, Malay K.
A1  - Selengut, Jeremy D.
A1  - Haft, Daniel H.
KW  - algorithms
KW  - Archaea
KW  - Archaeal Proteins
KW  - DNA
KW  - Methane
KW  - Phylogeny
KW  - software
AB  - BACKGROUND: Phylogenetic profiling is a technique of scoring co-occurrence between a protein family and some other trait, usually another protein family, across a set of taxonomic groups. In spite of several refinements in recent years, the technique still invites significant improvement. To be its most effective, a phylogenetic profiling algorithm must be able to examine co-occurrences among protein families whose boundaries are uncertain within large homologous protein superfamilies. RESULTS: Partial Phylogenetic Profiling (PPP) is an iterative algorithm that scores a given taxonomic profile against the taxonomic distribution of families for all proteins in a genome. The method works through optimizing the boundary of each protein family, rather than by relying on prebuilt protein families or fixed sequence similarity thresholds. Double Partial Phylogenetic Profiling (DPPP) is a related procedure that begins with a single sequence and searches for optimal granularities for its surrounding protein family in order to generate the best query profiles for PPP. We present ProPhylo, a high-performance software package for phylogenetic profiling studies through creating individually optimized protein family boundaries. ProPhylo provides precomputed databases for immediate use and tools for manipulating the taxonomic profiles used as queries. CONCLUSION: ProPhylo results show universal markers of methanogenesis, a new DNA phosphorothioation-dependent restriction enzyme, and efficacy in guiding protein family construction. The software and the associated databases are freely available under the open source Perl Artistic License from ftp://ftp.jcvi.org/pub/data/ppp/.
VL  - 12
N1  - http://www.ncbi.nlm.nih.gov/pubmed/22070167?dopt=Abstract
ER  - 

TY  - JOUR
T1  - ProPhylo: partial phylogenetic profiling to guide protein family construction and assignment of biological process.
JF  - BMC Bioinformatics
Y1  - 2011
A1  - Basu, Malay K
A1  - Selengut, Jeremy D
A1  - Haft, Daniel H
KW  - algorithms
KW  - Archaea
KW  - Archaeal Proteins
KW  - DNA
KW  - Methane
KW  - Phylogeny
KW  - software
AB  - BACKGROUND: Phylogenetic profiling is a technique of scoring co-occurrence between a protein family and some other trait, usually another protein family, across a set of taxonomic groups. In spite of several refinements in recent years, the technique still invites significant improvement. To be its most effective, a phylogenetic profiling algorithm must be able to examine co-occurrences among protein families whose boundaries are uncertain within large homologous protein superfamilies. RESULTS: Partial Phylogenetic Profiling (PPP) is an iterative algorithm that scores a given taxonomic profile against the taxonomic distribution of families for all proteins in a genome. The method works through optimizing the boundary of each protein family, rather than by relying on prebuilt protein families or fixed sequence similarity thresholds. Double Partial Phylogenetic Profiling (DPPP) is a related procedure that begins with a single sequence and searches for optimal granularities for its surrounding protein family in order to generate the best query profiles for PPP. We present ProPhylo, a high-performance software package for phylogenetic profiling studies through creating individually optimized protein family boundaries. ProPhylo provides precomputed databases for immediate use and tools for manipulating the taxonomic profiles used as queries. CONCLUSION: ProPhylo results show universal markers of methanogenesis, a new DNA phosphorothioation-dependent restriction enzyme, and efficacy in guiding protein family construction. The software and the associated databases are freely available under the open source Perl Artistic License from ftp://ftp.jcvi.org/pub/data/ppp/.
VL  - 12
M3  - 10.1186/1471-2105-12-434
ER  - 

TY  - JOUR
T1  - TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes
JF  - Nucleic Acids Research
Y1  - 2007
A1  - Selengut, Jeremy D.
A1  - Haft, Daniel H.
A1  - Davidsen, Tanja
A1  - Ganapathy, Anurhada
A1  - Gwinn-Giglio, Michelle
A1  - Nelson, William C.
A1  - Richter, R. Alexander
A1  - White, Owen
KW  - Archaeal Proteins
KW  - Bacterial Proteins
KW  - Databases, Protein
KW  - Genome, Bacterial
KW  - Genomics
KW  - Internet
KW  - Phylogeny
KW  - software
KW  - User-Computer Interface
AB  - TIGRFAMs is a collection of protein family definitions built to aid in high-throughput annotation of specific protein functions. Each family is based on a hidden Markov model (HMM), where both cutoff scores and membership in the seed alignment are chosen so that the HMMs can classify numerous proteins according to their specific molecular functions. Most TIGRFAMs models describe 'equivalog' families, where both orthology and lateral gene transfer may be part of the evolutionary history, but where a single molecular function has been conserved. The Genome Properties system contains a queriable set of metabolic reconstructions, genome metrics and extractions of information from the scientific literature. Its genome-by-genome assertions of whether or not specific structures, pathways or systems are present provide high-level conceptual descriptions of genomic content. These assertions enable comparative genomics, provide a meaningful biological context to aid in manual annotation, support assignments of Gene Ontology (GO) biological process terms and help validate HMM-based predictions of protein function. The Genome Properties system is particularly useful as a generator of phylogenetic profiles, through which new protein family functions may be discovered. The TIGRFAMs and Genome Properties systems can be accessed at http://www.tigr.org/TIGRFAMs and http://www.tigr.org/Genome_Properties.
VL  - 35
N1  - http://www.ncbi.nlm.nih.gov/pubmed/17151080?dopt=Abstract
ER  - 

TY  - JOUR
T1  - Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics
JF  - Bioinformatics (Oxford, England)
Y1  - 2005
A1  - Haft, Daniel H.
A1  - Selengut, Jeremy D.
A1  - Brinkac, Lauren M.
A1  - Zafar, Nikhat
A1  - White, Owen
KW  - Chromosome mapping
KW  - database management systems
KW  - Databases, Genetic
KW  - documentation
KW  - Gene Expression Profiling
KW  - Gene Expression Regulation
KW  - Genomics
KW  - Information Storage and Retrieval
KW  - Microbiological Techniques
KW  - natural language processing
KW  - Prokaryotic Cells
KW  - Proteome
KW  - signal transduction
KW  - software
KW  - User-Computer Interface
KW  - Vocabulary, Controlled
AB  - MOTIVATION: The presence or absence of metabolic pathways and structures provide a context that makes protein annotation far more reliable. Compiling such information across microbial genomes improves the functional classification of proteins and provides a valuable resource for comparative genomics. RESULTS: We have created a Genome Properties system to present key aspects of prokaryotic biology using standardized computational methods and controlled vocabularies. Properties reflect gene content, phenotype, phylogeny and computational analyses. The results of searches using hidden Markov models allow many properties to be deduced automatically, especially for families of proteins (equivalogs) conserved in function since their last common ancestor. Additional properties are derived from curation, published reports and other forms of evidence. Genome Properties system was applied to 156 complete prokaryotic genomes, and is easily mined to find differences between species, correlations between metabolic features and families of uncharacterized proteins, or relationships among properties. AVAILABILITY: Genome Properties can be found at http://www.tigr.org/Genome_Properties SUPPLEMENTARY INFORMATION: http://www.tigr.org/tigr-scripts/CMR2/genome_properties_references.spl.
VL  - 21
N1  - http://www.ncbi.nlm.nih.gov/pubmed/15347579?dopt=Abstract
ER  - 

TY  - JOUR
T1  - Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies.
JF  - Nucleic Acids Res
Y1  - 2003
A1  - Haas, Brian J
A1  - Delcher, Arthur L
A1  - Mount, Stephen M
A1  - Wortman, Jennifer R
A1  - Smith, Roger K
A1  - Hannick, Linda I
A1  - Maiti, Rama
A1  - Ronning, Catherine M
A1  - Rusch, Douglas B
A1  - Town, Christopher D
A1  - Salzberg, Steven L
A1  - White, Owen
KW  - algorithms
KW  - Alternative Splicing
KW  - Arabidopsis
KW  - DNA, Complementary
KW  - Expressed Sequence Tags
KW  - Genome, Plant
KW  - Introns
KW  - Plant Proteins
KW  - RNA, Plant
KW  - sequence alignment
KW  - software
KW  - Transcription, Genetic
KW  - Untranslated Regions
AB  - The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the approximately 27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.
VL  - 31
CP  - 19
ER  - 

TY  - JOUR
T1  - Splicing signals in Drosophila: intron size, information content, and consensus sequences.
JF  - Nucleic Acids Res
Y1  - 1992
A1  - Mount, S M
A1  - Burks, C
A1  - Hertz, G
A1  - Stormo, G D
A1  - White, O
A1  - Fields, C
KW  - Animals
KW  - Base Sequence
KW  - Consensus Sequence
KW  - Databases, Factual
KW  - Drosophila
KW  - Introns
KW  - Molecular Sequence Data
KW  - RNA Splicing
KW  - RNA, Messenger
KW  - software
AB  - A database of 209 Drosophila introns was extracted from Genbank (release number 64.0) and examined by a number of methods in order to characterize features that might serve as signals for messenger RNA splicing. A tight distribution of sizes was observed: while the smallest introns in the database are 51 nucleotides, more than half are less than 80 nucleotides in length, and most of these have lengths in the range of 59-67 nucleotides. Drosophila splice sites found in large and small introns differ in only minor ways from each other and from those found in vertebrate introns. However, larger introns have greater pyrimidine-richness in the region between 11 and 21 nucleotides upstream of 3' splice sites. The Drosophila branchpoint consensus matrix resembles C T A A T (in which branch formation occurs at the underlined A), and differs from the corresponding mammalian signal in the absence of G at the position immediately preceding the branchpoint. The distribution of occurrences of this sequence suggests a minimum distance between 5' splice sites and branchpoints of about 38 nucleotides, and a minimum distance between 3' splice sites and branchpoints of 15 nucleotides. The methods we have used detect no information in exon sequences other than in the few nucleotides immediately adjacent to the splice sites. However, Drosophila resembles many other species in that there is a discontinuity in A + T content between exons and introns, which are A + T rich.
VL  - 20
CP  - 16
ER  -