TY - Generic T1 - Orchestrating high-throughput genomic analysis with Bioconductor. Y1 - 2015 A1 - Huber, Wolfgang A1 - Carey, Vincent J A1 - Gentleman, Robert A1 - Anders, Simon A1 - Carlson, Marc A1 - Carvalho, Benilton S A1 - Bravo, Héctor Corrada A1 - Davis, Sean A1 - Gatto, Laurent A1 - Girke, Thomas A1 - Gottardo, Raphael A1 - Hahne, Florian A1 - Hansen, Kasper D A1 - Irizarry, Rafael A A1 - Lawrence, Michael A1 - Love, Michael I A1 - MacDonald, James A1 - Obenchain, Valerie A1 - Oleś, Andrzej K A1 - Pagès, Hervé A1 - Reyes, Alejandro A1 - Shannon, Paul A1 - Smyth, Gordon K A1 - Tenenbaum, Dan A1 - Waldron, Levi A1 - Morgan, Martin KW - Computational Biology KW - Gene Expression Profiling KW - Genomics KW - High-Throughput Screening Assays KW - Programming Languages KW - software KW - User-Computer Interface AB -

Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.

JA - Nat Methods VL - 12 CP - 2 M3 - 10.1038/nmeth.3252 ER - TY - JOUR T1 - Derepression of Cancer/testis antigens in cancer is associated with distinct patterns of DNA hypomethylation JF - BMC Cancer Y1 - 2013 A1 - Kim, R. A1 - Kulkarni, P. A1 - Sridhar Hannenhalli KW - *DNA Methylation KW - *Gene Expression Regulation, Neoplastic KW - *Genes, X-Linked KW - Antigens, Neoplasm/*genetics KW - Binding Sites KW - Cluster Analysis KW - CpG Islands KW - Gene Expression Profiling KW - HUMANS KW - Male KW - Neoplasms/*genetics/*metabolism KW - Promoter Regions, Genetic KW - Protein Binding KW - Protein Interaction Domains and Motifs KW - Testis/*metabolism AB - BACKGROUND: The Cancer/Testis Antigens (CTAs) are a heterogeneous group of proteins whose expression is typically restricted to the testis. However, they are aberrantly expressed in most cancers that have been examined to date. Broadly speaking, the CTAs can be divided into two groups: the CTX antigens that are encoded by the X-linked genes and the non-X CT antigens that are encoded by the autosomes. Unlike the non-X CTAs, the CTX antigens form clusters of closely related gene families and their expression is frequently associated with advanced disease with poorer prognosis. Regardless, the mechanism(s) underlying their selective derepression and stage-specific expression in cancer remain poorly understood, although promoter DNA demethylation is believed to be the major driver. METHODS: Here, we report a systematic analysis of DNA methylation profiling data from various tissue types to elucidate the mechanism underlying the derepression of the CTAs in cancer. We analyzed the methylation profiles of 501 samples including sperm, several cancer types, and their corresponding normal somatic tissue types. RESULTS: We found strong evidence for specific DNA hypomethylation of CTA promoters in the testis and cancer cells but not in their normal somatic counterparts. 
We also found that hypomethylation was clustered on the genome into domains that coincided with nuclear lamina-associated domains (LADs) and that these regions appeared to be insulated by CTCF sites. Interestingly, we did not observe any significant differences in the hypomethylation pattern between the CTAs without CpG islands and the CTAs with CpG islands in the proximal promoter. CONCLUSION: Our results corroborate that widespread DNA hypomethylation appears to be the driver in the derepression of CTA expression in cancer and, furthermore, demonstrate that these hypomethylated domains are associated with the nuclear lamina-associated domains (LADs). Taken together, our results suggest that widespread methylation changes in cancer are linked to derepression of germ-line-specific genes that is orchestrated by the three-dimensional organization of the cancer genome. VL - 13 SN - 1471-2407 (Electronic)
1471-2407 (Linking) N1 - Kim, Robert
Kulkarni, Prakash
Hannenhalli, Sridhar
eng
R01 GM100335/GM/NIGMS NIH HHS/
R01GM100335/GM/NIGMS NIH HHS/
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
England
2013/03/26 06:00
BMC Cancer. 2013 Mar 22;13:144. doi: 10.1186/1471-2407-13-144. U2 - 3618251 J1 - BMC cancer ER - TY - JOUR T1 - Gene expression anti-profiles as a basis for accurate universal cancer signatures JF - BMC Bioinformatics Y1 - 2012 A1 - Héctor Corrada Bravo A1 - Pihur, Vasyl A1 - McCall, Matthew A1 - Irizarry, Rafael A. A1 - Leek, Jeffrey T. KW - Area Under Curve KW - Colonic Neoplasms KW - Gene Expression Profiling KW - Genetic Variation KW - Genomics KW - HUMANS KW - Oligonucleotide Array Sequence Analysis KW - Prognosis KW - Transcriptome KW - Tumor Markers, Biological AB - BACKGROUND: Early screening for cancer is arguably one of the greatest public health advances over the last fifty years. However, many cancer screening tests are invasive (digital rectal exams), expensive (mammograms, imaging) or both (colonoscopies). This has spurred growing interest in developing genomic signatures that can be used for cancer diagnosis and prognosis. However, progress has been slowed by heterogeneity in cancer profiles and the lack of effective computational prediction tools for this type of data. RESULTS: We developed anti-profiles as a first step towards translating experimental findings suggesting that stochastic across-sample hyper-variability in the expression of specific genes is a stable and general property of cancer into predictive and diagnostic signatures. Using single-chip microarray normalization and quality assessment methods, we developed an anti-profile for colon cancer in tissue biopsy samples. To demonstrate the translational potential of our findings, we applied the signature developed in the tissue samples, without any further retraining or normalization, to screen patients for colon cancer based on genomic measurements from peripheral blood in an independent study (AUC of 0.89). 
This method achieved higher accuracy than the signature underlying commercially available peripheral blood screening tests for colon cancer (AUC of 0.81). We also confirmed the existence of hyper-variable genes across a range of cancer types and found that a significant proportion of tissue-specific genes are hyper-variable in cancer. Based on these observations, we developed a universal cancer anti-profile that accurately distinguishes cancer from normal regardless of tissue type (ten-fold cross-validation AUC > 0.92). CONCLUSIONS: We have introduced anti-profiles as a new approach for developing cancer genomic signatures that specifically takes advantage of gene expression heterogeneity. We have demonstrated that anti-profiles can be successfully applied to develop peripheral-blood based diagnostics for cancer and used anti-profiles to develop a highly accurate universal cancer signature. By using single-chip normalization and quality assessment methods, no further retraining of signatures developed by the anti-profile approach would be required before their application in clinical settings. Our results suggest that anti-profiles may be used to develop inexpensive and non-invasive universal cancer screening tests. VL - 13 N1 - http://www.ncbi.nlm.nih.gov/pubmed/23088656?dopt=Abstract ER - TY - JOUR T1 - The partitioned LASSO-patternsearch algorithm with application to gene expression data JF - BMC Bioinformatics Y1 - 2012 A1 - Shi, Weiliang A1 - Wahba, Grace A1 - Irizarry, Rafael A. A1 - Héctor Corrada Bravo A1 - Wright, Stephen J. KW - algorithms KW - Breast Neoplasms KW - Computer simulation KW - Female KW - Gene expression KW - Gene Expression Profiling KW - Genomics KW - HUMANS KW - Models, Genetic AB - BACKGROUND: In systems biology, the task of reverse engineering gene pathways from data has been limited not just by the curse of dimensionality (the interaction space is huge) but also by systematic error in the data. 
The gene expression barcode reduces spurious association driven by batch effects and probe effects. The binary nature of the resulting expression calls lends itself perfectly to modern regularization approaches that thrive in high-dimensional settings. RESULTS: The Partitioned LASSO-Patternsearch algorithm is proposed to identify patterns of multiple dichotomous risk factors for outcomes of interest in genomic studies. A partitioning scheme is used to identify promising patterns by solving many LASSO-Patternsearch subproblems in parallel. All variables that survive this stage proceed to an aggregation stage where the most significant patterns are identified by solving a reduced LASSO-Patternsearch problem in just these variables. This approach was applied to genetic data sets with expression levels dichotomized by the gene expression barcode. Most of the genes and second-order interactions thus selected are known to be related to the outcomes. CONCLUSIONS: We demonstrate with simulations and data analyses that the proposed method not only selects variables and patterns more accurately, but also provides smaller models with better prediction accuracy, in comparison to several alternative methodologies. VL - 13 N1 - http://www.ncbi.nlm.nih.gov/pubmed/22587526?dopt=Abstract ER - TY - Generic T1 - Quantitative measurement of allele-specific protein expression in a diploid yeast hybrid by LC-MS. Y1 - 2012 A1 - Khan, Zia A1 - Bloom, Joshua S A1 - Amini, Sasan A1 - Singh, Mona A1 - Perlman, David H A1 - Caudy, Amy A A1 - Kruglyak, Leonid KW - Alleles KW - Chromatography, Liquid KW - Fungal Proteins KW - Gene Expression Profiling KW - Gene Expression Regulation, Fungal KW - HUMANS KW - Mass Spectrometry KW - proteomics KW - Regression Analysis KW - Saccharomyces KW - Saccharomyces cerevisiae KW - Saccharomyces cerevisiae Proteins KW - Species Specificity AB -

Understanding the genetic basis of gene regulatory variation is a key goal of evolutionary and medical genetics. Regulatory variation can act in an allele-specific manner (cis-acting) or it can affect both alleles of a gene (trans-acting). Differential allele-specific expression (ASE), in which the expression of one allele differs from another in a diploid, implies the presence of cis-acting regulatory variation. While microarrays and high-throughput sequencing have enabled genome-wide measurements of transcriptional ASE, methods for measurement of protein ASE (pASE) have lagged far behind. We describe a flexible, accurate, and scalable strategy for measurement of pASE by liquid chromatography-coupled mass spectrometry (LC-MS). We apply this approach to a hybrid between the yeast species Saccharomyces cerevisiae and Saccharomyces bayanus. Our results provide the first analysis of the relative contribution of cis-acting and trans-acting regulatory differences to protein expression divergence between yeast species.

JA - Mol Syst Biol VL - 8 M3 - 10.1038/msb.2012.34 ER - TY - Generic T1 - Transcript expression analysis of putative Trypanosoma brucei GPI-anchored surface proteins during development in the tsetse and mammalian hosts. Y1 - 2012 A1 - Savage, Amy F A1 - Cerqueira, Gustavo C A1 - Regmi, Sandesh A1 - Wu, Yineng A1 - El Sayed, Najib M A1 - Aksoy, Serap KW - Animals KW - Computational Biology KW - Gastrointestinal Tract KW - Gene Expression Profiling KW - GPI-Linked Proteins KW - HUMANS KW - Male KW - Membrane Proteins KW - Protozoan Proteins KW - Real-Time Polymerase Chain Reaction KW - Salivary Glands KW - Trypanosoma brucei brucei KW - Trypanosomiasis, African KW - Tsetse Flies AB -

Human African Trypanosomiasis is a devastating disease caused by the parasite Trypanosoma brucei. Trypanosomes live extracellularly in both the tsetse fly and the mammal. Trypanosome surface proteins can directly interact with the host environment, allowing parasites to effectively establish and maintain infections. Glycosylphosphatidylinositol (GPI) anchoring is a common posttranslational modification associated with eukaryotic surface proteins. In T. brucei, three GPI-anchored major surface proteins have been identified: variant surface glycoproteins (VSGs), procyclic acidic repetitive protein (PARP or procyclins), and brucei alanine rich proteins (BARP). The objective of this study was to select genes encoding predicted GPI-anchored proteins with unknown function(s) from the T. brucei genome and characterize the expression profile of a subset during cyclical development in the tsetse and mammalian hosts. An initial in silico screen of putative T. brucei proteins by the Big PI algorithm identified 163 predicted GPI-anchored proteins, 106 of which had no known functions. Application of a second GPI-anchor prediction algorithm (FragAnchor), signal peptide and trans-membrane domain prediction software resulted in the identification of 25 putative hypothetical proteins. Eighty-one gene products with hypothetical functions were analyzed for stage-regulated expression using semi-quantitative RT-PCR. The expression of most of these genes was found to be upregulated in trypanosomes infecting tsetse salivary gland and proventriculus tissues, and 38% were specifically expressed only by parasites infecting salivary gland tissues. Transcripts for all of the genes specifically expressed in salivary glands were also detected in mammalian infective metacyclic trypomastigotes, suggesting a possible role for these putative proteins in invasion and/or establishment processes in the mammalian host. 
These results represent the first large-scale report of the differential expression of unknown genes encoding predicted T. brucei surface proteins during the complete developmental cycle. This knowledge may form the foundation for the development of future novel transmission blocking strategies against metacyclic parasites.

JA - PLoS Negl Trop Dis VL - 6 CP - 6 M3 - 10.1371/journal.pntd.0001708 ER - TY - JOUR T1 - Direct targeting of Sec23a by miR-200s influences cancer cell secretome and promotes metastatic colonization. JF - Nat Med Y1 - 2011 A1 - Korpal, Manav A1 - Ell, Brian J A1 - Buffa, Francesca M A1 - Ibrahim, Toni A1 - Blanco, Mario A A1 - Celià-Terrassa, Toni A1 - Mercatali, Laura A1 - Khan, Zia A1 - Goodarzi, Hani A1 - Hua, Yuling A1 - Wei, Yong A1 - Hu, Guohong A1 - Garcia, Benjamin A A1 - Ragoussis, Jiannis A1 - Amadori, Dino A1 - Harris, Adrian L A1 - Kang, Yibin KW - Animals KW - Cadherins KW - Cell Line, Tumor KW - Female KW - Gene Expression Profiling KW - Gene Expression Regulation, Neoplastic KW - HUMANS KW - Mass Spectrometry KW - Mice KW - Mice, Inbred BALB C KW - Microarray Analysis KW - MicroRNAs KW - Neoplasm Metastasis KW - Statistics, Nonparametric KW - Vesicular Transport Proteins AB -

Although the role of miR-200s in regulating E-cadherin expression and epithelial-to-mesenchymal transition is well established, their influence on metastatic colonization remains controversial. Here we have used clinical and experimental models of breast cancer metastasis to discover a pro-metastatic role of miR-200s that goes beyond their regulation of E-cadherin and epithelial phenotype. Overexpression of miR-200s is associated with increased risk of metastasis in breast cancer and promotes metastatic colonization in mouse models, phenotypes that cannot be recapitulated by E-cadherin expression alone. Genomic and proteomic analyses revealed global shifts in gene expression upon miR-200 overexpression toward that of highly metastatic cells. miR-200s promote metastatic colonization partly through direct targeting of Sec23a, which mediates secretion of metastasis-suppressive proteins, including Igfbp4 and Tinagl1, as validated by functional and clinical correlation studies. Overall, these findings suggest a pleiotropic role of miR-200s in promoting metastatic colonization by influencing E-cadherin-dependent epithelial traits and Sec23a-mediated tumor cell secretome.

VL - 17 CP - 9 M3 - 10.1038/nm.2401 ER - TY - JOUR T1 - Influence of host gene transcription level and orientation on HIV-1 latency in a primary-cell model JF - Journal of virology Y1 - 2011 A1 - Shan, Liang A1 - Yang, Hung-Chih A1 - Rabi, S. Alireza A1 - Héctor Corrada Bravo A1 - Shroff, Neeta S. A1 - Irizarry, Rafael A. A1 - Zhang, Hao A1 - Margolick, Joseph B. A1 - Siliciano, Janet D. A1 - Siliciano, Robert F. KW - CD4-Positive T-Lymphocytes KW - Cells, Cultured KW - Gene Expression Profiling KW - Gene Expression Regulation, Viral KW - HIV-1 KW - HUMANS KW - Transcription, Genetic KW - Virus Integration KW - Virus Latency AB - Human immunodeficiency virus type 1 (HIV-1) establishes a latent reservoir in resting memory CD4(+) T cells. This latent reservoir is a major barrier to the eradication of HIV-1 in infected individuals and is not affected by highly active antiretroviral therapy (HAART). Reactivation of latent HIV-1 is a possible strategy for elimination of this reservoir. The mechanisms by which latency is maintained are unclear. In the analysis of the regulation of HIV-1 gene expression, it is important to consider the nature of HIV-1 integration sites. In this study, we analyzed the integration and transcription of latent HIV-1 in a primary CD4(+) T cell model of latency. The majority of integration sites in latently infected cells were in introns of transcription units. Serial analysis of gene expression (SAGE) demonstrated that more than 90% of those host genes harboring a latent integrated provirus were transcriptionally active, mostly at high levels. For latently infected cells, we observed a modest preference for integration in the same transcriptional orientation as the host gene (63.8% versus 36.2%). In contrast, this orientation preference was not observed in acutely infected or persistently infected cells. 
These results suggest that transcriptional interference may be one of the important factors in the establishment and maintenance of HIV-1 latency. Our findings suggest that disrupting the negative control of HIV-1 transcription by upstream host promoters could facilitate the reactivation of latent HIV-1 in some resting CD4(+) T cells. VL - 85 N1 - http://www.ncbi.nlm.nih.gov/pubmed/21430059?dopt=Abstract ER - TY - JOUR T1 - Genome-wide analysis reveals novel genes essential for heme homeostasis in Caenorhabditis elegans. JF - PLoS Genet Y1 - 2010 A1 - Severance, Scott A1 - Rajagopal, Abbhirami A1 - Rao, Anita U A1 - Cerqueira, Gustavo C A1 - Mitreva, Makedonka A1 - El-Sayed, Najib M A1 - Krause, Michael A1 - Hamza, Iqbal KW - Animals KW - Caenorhabditis elegans KW - Dose-Response Relationship, Drug KW - Gene Expression Profiling KW - Gene Expression Regulation KW - genes KW - Genome-Wide Association Study KW - Heme KW - Homeostasis KW - HUMANS KW - Leishmania KW - Nematoda KW - Trypanosoma AB -

Heme is a cofactor in proteins that function in almost all sub-cellular compartments and in many diverse biological processes. Heme is produced by a conserved biosynthetic pathway that is highly regulated to prevent the accumulation of heme--a cytotoxic, hydrophobic tetrapyrrole. Caenorhabditis elegans and related parasitic nematodes do not synthesize heme, but instead require environmental heme to grow and develop. Heme homeostasis in these auxotrophs is, therefore, regulated in accordance with available dietary heme. We have capitalized on this auxotrophy in C. elegans to study gene expression changes associated with precisely controlled dietary heme concentrations. RNA was isolated from cultures containing 4, 20, or 500 microM heme; derived cDNA probes were hybridized to Affymetrix C. elegans expression arrays. We identified 288 heme-responsive genes (hrgs) that were differentially expressed under these conditions. Of these genes, 42% had putative homologs in humans, while genomes of medically relevant heme auxotrophs revealed homologs for 12% in both Trypanosoma and Leishmania and 24% in parasitic nematodes. Depletion of each of the 288 hrgs by RNA-mediated interference (RNAi) in a transgenic heme-sensor worm strain identified six genes that regulated heme homeostasis. In addition, seven membrane-spanning transporters involved in heme uptake were identified by RNAi knockdown studies using a toxic heme analog. Comparison of genes that were positive in both of the RNAi screens resulted in the identification of three genes in common that were vital for organismal heme homeostasis in C. elegans. Collectively, our results provide a catalog of genes that are essential for metazoan heme homeostasis and demonstrate the power of C. elegans as a genetic animal model to dissect the regulatory circuits which mediate heme trafficking in both vertebrate hosts and their parasites, which depend on environmental heme for survival.

VL - 6 CP - 7 M3 - 10.1371/journal.pgen.1001044 ER - TY - JOUR T1 - Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL): adapting the Partial Phylogenetic Profiling algorithm to scan sequences for signatures that predict protein function JF - BMC Bioinformatics Y1 - 2010 A1 - J. Selengut A1 - Rusch, Douglas B. A1 - Haft, Daniel H. KW - algorithms KW - Amino Acid Sequence KW - Gene Expression Profiling KW - Molecular Sequence Data KW - Phylogeny KW - Proteins KW - Sequence Analysis, Protein KW - Structure-Activity Relationship AB - BACKGROUND: Comparative genomics methods such as phylogenetic profiling can mine powerful inferences from inherently noisy biological data sets. We introduce Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL), a method that applies the Partial Phylogenetic Profiling (PPP) approach locally within a protein sequence to discover short sequence signatures associated with functional sites. The approach is based on the basic scoring mechanism employed by PPP, namely the use of binomial distribution statistics to optimize sequence similarity cutoffs during searches of partitioned training sets. RESULTS: Here we illustrate and validate the ability of the SIMBAL method to find functionally relevant short sequence signatures by application to two well-characterized protein families. In the first example, we partitioned a family of ABC permeases using a metabolic background property (urea utilization). Thus, the TRUE set for this family comprised members whose genome of origin encoded a urea utilization system. By moving a sliding window across the sequence of a permease, and searching each subsequence in turn against the full set of partitioned proteins, the method found which local sequence signatures best correlated with the urea utilization trait. 
Mapping of SIMBAL "hot spots" onto crystal structures of homologous permeases reveals that the significant sites are gating determinants on the cytosolic face rather than, say, docking sites for the substrate-binding protein on the extracellular face. In the second example, we partitioned a protein methyltransferase family using gene proximity as a criterion. In this case, the TRUE set comprised those methyltransferases encoded near the gene for the substrate RF-1. SIMBAL identifies sequence regions that map onto the substrate-binding interface while ignoring regions involved in the methyltransferase reaction mechanism in general. Neither method for training set construction requires any prior experimental characterization. CONCLUSIONS: SIMBAL shows that, in functionally divergent protein families, selected short sequences often significantly outperform their full-length parent sequence for making functional predictions by sequence similarity, suggesting avenues for improved functional classifiers. When combined with structural data, SIMBAL affords the ability to localize and model functional sites. VL - 11 N1 - http://www.ncbi.nlm.nih.gov/pubmed/20102603?dopt=Abstract ER - TY - JOUR T1 - Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL): adapting the Partial Phylogenetic Profiling algorithm to scan sequences for signatures that predict protein function. JF - BMC Bioinformatics Y1 - 2010 A1 - Selengut, Jeremy D A1 - Rusch, Douglas B A1 - Haft, Daniel H KW - algorithms KW - Amino Acid Sequence KW - Gene Expression Profiling KW - Molecular Sequence Data KW - Phylogeny KW - Proteins KW - Sequence Analysis, Protein KW - Structure-Activity Relationship AB -

BACKGROUND: Comparative genomics methods such as phylogenetic profiling can mine powerful inferences from inherently noisy biological data sets. We introduce Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL), a method that applies the Partial Phylogenetic Profiling (PPP) approach locally within a protein sequence to discover short sequence signatures associated with functional sites. The approach is based on the basic scoring mechanism employed by PPP, namely the use of binomial distribution statistics to optimize sequence similarity cutoffs during searches of partitioned training sets.

RESULTS: Here we illustrate and validate the ability of the SIMBAL method to find functionally relevant short sequence signatures by application to two well-characterized protein families. In the first example, we partitioned a family of ABC permeases using a metabolic background property (urea utilization). Thus, the TRUE set for this family comprised members whose genome of origin encoded a urea utilization system. By moving a sliding window across the sequence of a permease, and searching each subsequence in turn against the full set of partitioned proteins, the method found which local sequence signatures best correlated with the urea utilization trait. Mapping of SIMBAL "hot spots" onto crystal structures of homologous permeases reveals that the significant sites are gating determinants on the cytosolic face rather than, say, docking sites for the substrate-binding protein on the extracellular face. In the second example, we partitioned a protein methyltransferase family using gene proximity as a criterion. In this case, the TRUE set comprised those methyltransferases encoded near the gene for the substrate RF-1. SIMBAL identifies sequence regions that map onto the substrate-binding interface while ignoring regions involved in the methyltransferase reaction mechanism in general. Neither method for training set construction requires any prior experimental characterization.

CONCLUSIONS: SIMBAL shows that, in functionally divergent protein families, selected short sequences often significantly outperform their full-length parent sequence for making functional predictions by sequence similarity, suggesting avenues for improved functional classifiers. When combined with structural data, SIMBAL affords the ability to localize and model functional sites.

VL - 11 M3 - 10.1186/1471-2105-11-52 ER - TY - JOUR T1 - Unexpected abundance of coenzyme F(420)-dependent enzymes in Mycobacterium tuberculosis and other actinobacteria. JF - J Bacteriol Y1 - 2010 A1 - Selengut, Jeremy D A1 - Haft, Daniel H KW - Actinobacteria KW - Amino Acid Sequence KW - Binding Sites KW - Coenzymes KW - Flavonoids KW - Gene Expression Profiling KW - Gene Expression Regulation, Bacterial KW - Genome, Bacterial KW - molecular biology KW - Molecular Sequence Data KW - Molecular Structure KW - Mycobacterium tuberculosis KW - Phylogeny KW - Protein Conformation KW - Riboflavin AB -

Regimens targeting Mycobacterium tuberculosis, the causative agent of tuberculosis (TB), require long courses of treatment and a combination of three or more drugs. An increase in drug-resistant strains of M. tuberculosis demonstrates the need for additional TB-specific drugs. A notable feature of M. tuberculosis is coenzyme F(420), which is distributed sporadically and sparsely among prokaryotes. This distribution allows for comparative genomics-based investigations. Phylogenetic profiling (comparison of differential gene content) based on F(420) biosynthesis nominated many actinobacterial proteins as candidate F(420)-dependent enzymes. Three such families dominated the results: the luciferase-like monooxygenase (LLM), pyridoxamine 5'-phosphate oxidase (PPOX), and deazaflavin-dependent nitroreductase (DDN) families. The DDN family was determined to be limited to F(420)-producing species. The LLM and PPOX families were observed in F(420)-producing species as well as species lacking F(420) but were particularly numerous in many actinobacterial species, including M. tuberculosis. Partitioning the LLM and PPOX families based on an organism's ability to make F(420) allowed the application of the SIMBAL (sites inferred by metabolic background assertion labeling) profiling method to identify F(420)-correlated subsequences. These regions were found to correspond to flavonoid cofactor binding sites. Significantly, these results showed that M. tuberculosis carries at least 28 separate F(420)-dependent enzymes, most of unknown function, and a paucity of flavin mononucleotide (FMN)-dependent proteins in these families. While prevalent in mycobacteria, markers of F(420) biosynthesis appeared to be absent from the normal human gut flora. These findings suggest that M. tuberculosis relies heavily on coenzyme F(420) for its redox reactions. This dependence and the cofactor's rarity may make F(420)-related proteins promising drug targets.

VL - 192 CP - 21 M3 - 10.1128/JB.00425-10 ER - TY - JOUR T1 - Unexpected abundance of coenzyme F(420)-dependent enzymes in Mycobacterium tuberculosis and other actinobacteria JF - Journal of bacteriology Y1 - 2010 A1 - J. Selengut A1 - Haft, Daniel H. KW - Actinobacteria KW - Amino Acid Sequence KW - Binding Sites KW - Coenzymes KW - Flavonoids KW - Gene Expression Profiling KW - Gene Expression Regulation, Bacterial KW - Genome, Bacterial KW - molecular biology KW - Molecular Sequence Data KW - Molecular Structure KW - Mycobacterium tuberculosis KW - Phylogeny KW - Protein Conformation KW - Riboflavin AB - Regimens targeting Mycobacterium tuberculosis, the causative agent of tuberculosis (TB), require long courses of treatment and a combination of three or more drugs. An increase in drug-resistant strains of M. tuberculosis demonstrates the need for additional TB-specific drugs. A notable feature of M. tuberculosis is coenzyme F(420), which is distributed sporadically and sparsely among prokaryotes. This distribution allows for comparative genomics-based investigations. Phylogenetic profiling (comparison of differential gene content) based on F(420) biosynthesis nominated many actinobacterial proteins as candidate F(420)-dependent enzymes. Three such families dominated the results: the luciferase-like monooxygenase (LLM), pyridoxamine 5'-phosphate oxidase (PPOX), and deazaflavin-dependent nitroreductase (DDN) families. The DDN family was determined to be limited to F(420)-producing species. The LLM and PPOX families were observed in F(420)-producing species as well as species lacking F(420) but were particularly numerous in many actinobacterial species, including M. tuberculosis. Partitioning the LLM and PPOX families based on an organism's ability to make F(420) allowed the application of the SIMBAL (sites inferred by metabolic background assertion labeling) profiling method to identify F(420)-correlated subsequences. 
These regions were found to correspond to flavonoid cofactor binding sites. Significantly, these results showed that M. tuberculosis carries at least 28 separate F(420)-dependent enzymes, most of unknown function, and a paucity of flavin mononucleotide (FMN)-dependent proteins in these families. While prevalent in mycobacteria, markers of F(420) biosynthesis appeared to be absent from the normal human gut flora. These findings suggest that M. tuberculosis relies heavily on coenzyme F(420) for its redox reactions. This dependence and the cofactor's rarity may make F(420)-related proteins promising drug targets. VL - 192 N1 - http://www.ncbi.nlm.nih.gov/pubmed/20675471?dopt=Abstract ER - TY - JOUR T1 - Genomic organization and expression profile of the mucin-associated surface protein (masp) family of the human pathogen Trypanosoma cruzi. JF - Nucleic Acids Res Y1 - 2009 A1 - Bartholomeu, Daniella C A1 - Cerqueira, Gustavo C A1 - Leão, Ana Carolina A A1 - daRocha, Wanderson D A1 - Pais, Fabiano S A1 - Macedo, Camila A1 - Djikeng, Appolinaire A1 - Teixeira, Santuza M R A1 - El-Sayed, Najib M KW - 3' Flanking Region KW - 5' Flanking Region KW - Amino Acid Sequence KW - Animals KW - Base Sequence KW - Conserved Sequence KW - Gene Expression Profiling KW - Genes, Protozoan KW - Genome, Protozoan KW - Membrane Proteins KW - Molecular Sequence Data KW - Mucins KW - Multigene Family KW - Protozoan Proteins KW - RNA, Messenger KW - Trypanosoma cruzi AB -

A novel large multigene family was recently identified in the human pathogen Trypanosoma cruzi, causative agent of Chagas disease, and corresponds to approximately 6% of the parasite diploid genome. The predicted gene products, mucin-associated surface proteins (MASPs), are characterized by highly conserved N- and C-terminal domains and a strikingly variable and repetitive central region. We report here an analysis of the genomic organization and expression profile of masp genes. Masps are not randomly distributed throughout the genome but instead are clustered with genes encoding mucin and other surface protein families. Masp transcripts vary in size, are preferentially expressed during the trypomastigote stage and contain highly conserved 5' and 3' untranslated regions. A sequence analysis of a trypomastigote cDNA library reveals the expression of multiple masp variants with a bias towards a particular masp subgroup. Immunofluorescence assays using antibodies generated against a MASP peptide reveal that the expression of particular MASPs at the cell membrane is limited to subsets of the parasite population. Western blots of phosphatidylinositol-specific phospholipase C (PI-PLC)-treated parasites suggest that MASP may be GPI-anchored and shed into the culture medium, thus contributing to the large repertoire of parasite polypeptides that are exposed to the host immune system.

VL - 37 CP - 10 M3 - 10.1093/nar/gkp172 ER - TY - JOUR T1 - Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. JF - BMC Genomics Y1 - 2009 A1 - Bloom, Joshua S A1 - Khan, Zia A1 - Kruglyak, Leonid A1 - Singh, Mona A1 - Caudy, Amy A KW - algorithms KW - DNA, Complementary KW - DNA, Fungal KW - Gene Expression Profiling KW - Oligonucleotide Array Sequence Analysis KW - Saccharomyces cerevisiae KW - sequence alignment KW - Sequence Analysis, DNA AB -

BACKGROUND: High-throughput cDNA synthesis and sequencing of poly(A)-enriched RNA is rapidly emerging as a technology competing to replace microarrays as a quantitative platform for measuring gene expression.

RESULTS: Consequently, we compared full length cDNA sequencing to 2-channel gene expression microarrays in the context of measuring differential gene expression. Because of its comparable cost to a gene expression microarray, our study focused on the data obtainable from a single lane of an Illumina 1G sequencer. We compared sequencing data to a highly replicated microarray experiment profiling two divergent strains of S. cerevisiae.

CONCLUSION: Using a large number of quantitative PCR (qPCR) assays, more than previous studies, we found that neither technology is decisively better at measuring differential gene expression. Further, we report sequencing results from a diploid hybrid of two strains of S. cerevisiae that indicate full length cDNA sequencing can discover heterozygosity and measure quantitative allele-specific expression simultaneously.

VL - 10 M3 - 10.1186/1471-2164-10-221 ER - TY - JOUR T1 - Analysis of fat body transcriptome from the adult tsetse fly, Glossina morsitans morsitans. JF - Insect Mol Biol Y1 - 2006 A1 - Attardo, G M A1 - Strickler-Dinglasan, P A1 - Perkin, S A H A1 - Caler, E A1 - Bonaldo, M F A1 - Soares, M B A1 - El-Sayeed, N A1 - Aksoy, S KW - Adipose Tissue KW - Animals KW - Base Sequence KW - Computational Biology KW - DNA Primers KW - Egg Proteins KW - Expressed Sequence Tags KW - Female KW - Gene Expression Profiling KW - Insect Vectors KW - Male KW - Molecular Sequence Data KW - Reverse Transcriptase Polymerase Chain Reaction KW - Sequence Analysis, DNA KW - Sex Factors KW - Tsetse Flies AB -

Tsetse flies (Diptera: Glossinidae) are vectors of pathogenic African trypanosomes. To develop a foundation for tsetse physiology, a normalized expressed sequence tag (EST) library was constructed from fat body tissue of immune-stimulated Glossina morsitans morsitans. Analysis of 20,257 high-quality ESTs yielded 6372 unique genes comprising 3059 tentative consensus (TC) sequences and 3313 singletons (available at http://aksoylab.yale.edu). We analysed the putative fat body transcriptome based on homology to other gene products with known functions available in the public domain. In particular, we describe the immune-related products, the reproductive function-related yolk proteins and milk-gland protein, the iron metabolism-regulating ferritins and transferrin, and the biosynthesis of proline, tsetse's major energy source. Expression analysis of the three yolk proteins indicates that all are detected in females, while only the yolk protein with similarity to lipases is expressed in males. Milk gland protein, apparently important for larval nutrition, is, however, primarily synthesized by accessory milk gland tissue.

VL - 15 CP - 4 M3 - 10.1111/j.1365-2583.2006.00649.x ER - TY - JOUR T1 - Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics JF - Bioinformatics (Oxford, England) Y1 - 2005 A1 - Haft, Daniel H. A1 - J. Selengut A1 - Brinkac, Lauren M. A1 - Zafar, Nikhat A1 - White, Owen KW - Chromosome mapping KW - database management systems KW - Databases, Genetic KW - documentation KW - Gene Expression Profiling KW - Gene Expression Regulation KW - Genomics KW - Information Storage and Retrieval KW - Microbiological Techniques KW - natural language processing KW - Prokaryotic Cells KW - Proteome KW - signal transduction KW - software KW - User-Computer Interface KW - Vocabulary, Controlled AB - MOTIVATION: The presence or absence of metabolic pathways and structures provides a context that makes protein annotation far more reliable. Compiling such information across microbial genomes improves the functional classification of proteins and provides a valuable resource for comparative genomics. RESULTS: We have created a Genome Properties system to present key aspects of prokaryotic biology using standardized computational methods and controlled vocabularies. Properties reflect gene content, phenotype, phylogeny and computational analyses. The results of searches using hidden Markov models allow many properties to be deduced automatically, especially for families of proteins (equivalogs) conserved in function since their last common ancestor. Additional properties are derived from curation, published reports and other forms of evidence. The Genome Properties system was applied to 156 complete prokaryotic genomes, and is easily mined to find differences between species, correlations between metabolic features and families of uncharacterized proteins, or relationships among properties. 
AVAILABILITY: Genome Properties can be found at http://www.tigr.org/Genome_Properties SUPPLEMENTARY INFORMATION: http://www.tigr.org/tigr-scripts/CMR2/genome_properties_references.spl. VL - 21 N1 - http://www.ncbi.nlm.nih.gov/pubmed/15347579?dopt=Abstract ER - TY - JOUR T1 - Transcriptional profiling of the hyperthermophilic methanarchaeon Methanococcus jannaschii in response to lethal heat and non-lethal cold shock. JF - Environ Microbiol Y1 - 2005 A1 - Boonyaratanakornkit, Boonchai B A1 - Simpson, Anjana J A1 - Whitehead, Timothy A A1 - Fraser, Claire M A1 - el-Sayed, Najib M A A1 - Clark, Douglas S KW - Adaptation, Physiological KW - Archaeal Proteins KW - Cold Temperature KW - Gene Expression Profiling KW - Gene Expression Regulation, Archaeal KW - Heat-Shock Proteins KW - Hot Temperature KW - Methanococcus KW - Temperature KW - Transcription, Genetic AB -

Temperature shock of the hyperthermophilic methanarchaeon Methanococcus jannaschii from its optimal growth temperature of 85 °C to 65 °C and 95 °C resulted in different transcriptional responses characteristic of both the direction of shock (heat or cold shock) and whether the shock was lethal. Specific outcomes of lethal heat shock to 95 °C included upregulation of genes encoding chaperones, and downregulation of genes encoding subunits of the H+-transporting ATP synthase. A gene encoding an alpha subunit of a putative prefoldin was also upregulated, which may comprise a novel element in the protein processing pathway in M. jannaschii. Very different responses were observed upon cold shock to 65 °C. These included upregulation of a gene encoding an RNA helicase and other genes involved in transcription and translation, and upregulation of genes coding for proteases and transport proteins. Also upregulated was a gene that codes for an 18 kDa FKBP-type PPIase, which may facilitate protein folding at low temperatures. Transcriptional profiling also revealed several hypothetical proteins that respond to temperature stress conditions.

VL - 7 CP - 6 M3 - 10.1111/j.1462-2920.2005.00751.x ER - TY - JOUR T1 - Analysis of stage-specific gene expression in the bloodstream and the procyclic form of Trypanosoma brucei using a genomic DNA-microarray. JF - Mol Biochem Parasitol Y1 - 2002 A1 - Diehl, Susanne A1 - Diehl, Frank A1 - El-Sayed, Najib M A1 - Clayton, Christine A1 - Hoheisel, Jörg D KW - Animals KW - Blotting, Northern KW - Escherichia coli KW - Gene expression KW - Gene Expression Profiling KW - Genes, Protozoan KW - HUMANS KW - Life Cycle Stages KW - Molecular Sequence Data KW - Oligonucleotide Array Sequence Analysis KW - Polymerase Chain Reaction KW - Transcription, Genetic KW - Trypanosoma brucei brucei AB -

A microarray comprising 21,024 different PCR products spotted on glass slides was constructed for gene expression studies on Trypanosoma brucei. The arrayed fragments were generated from a T. brucei shotgun clone library, which had been prepared from randomly sheared and size-fractionated genomic DNA. For the identification of stage-specific gene activity, total RNA from in vitro cultures of the human, long slender form and the insect, procyclic form of the parasite was labelled and hybridised to the microarray. Approximately 75% of the genomic fragments produced a signal and about 2% exhibited significant differences between the transcript levels in the bloodstream and procyclic forms. A few results were confirmed by Northern blot analysis or reverse-transcription and PCR. Three hundred differentially regulated clones have been selected for sequencing. So far, of 33 clones that showed about 2-fold or more over-expression in bloodstream forms, 15 contained sequences similar to those of VSG expression sites and at least six others appeared non-protein-coding. Of 29 procyclic-specific clones, at least eight appeared not to be protein-coding. A surprisingly large proportion of known regulated genes was already identified in this small sample, and some new ones were found, illustrating the utility of genomic arrays.

VL - 123 CP - 2 ER -