Reducing storage requirements for biological sequence comparison.
Title | Reducing storage requirements for biological sequence comparison. |
Publication Type | Journal Articles |
Year of Publication | 2004 |
Authors | Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA |
Journal | Bioinformatics |
Volume | 20 |
Issue | 18 |
Pagination | 3363-9 |
Date Published | 2004 Dec 12 |
ISSN | 1367-4803 |
Keywords | Algorithms; Databases, Genetic; Information Storage and Retrieval; Numerical Analysis, Computer-Assisted; Sequence Alignment; Sequence Analysis |
Abstract | MOTIVATION: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the 'seed-and-extend' approach, in which occurrences of short subsequences called 'seeds' are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process. RESULTS: We present a simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds. |
DOI | 10.1093/bioinformatics/bth408 |
Alternate Journal | Bioinformatics |
PubMed ID | 15256412 |
Grant List | 1R01HG0294501 / HG / NHGRI NIH HHS / United States |
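
The abstract says that only a small fraction of seeds, the 'minimizers', needs to be stored, but it does not spell out how they are chosen. The following is a minimal sketch, not the authors' implementation. It assumes the common (w, k) formulation: slide a window over w consecutive k-mers and keep the lexicographically smallest k-mer in each window. The function name `minimizers`, the parameters `k` and `w`, the plain lexicographic ordering, and the leftmost tie-break are all assumptions made for illustration.

```python
def minimizers(seq, k, w):
    """Return a sorted list of (position, k-mer) pairs chosen as minimizers.

    Assumed rule (not stated in the abstract): in every window of w
    consecutive k-mers, keep the lexicographically smallest k-mer,
    breaking ties by the leftmost position.
    """
    n_kmers = len(seq) - k + 1
    if k <= 0 or w <= 0 or n_kmers < w:
        return []  # sequence too short to hold a full window

    selected = set()
    for start in range(n_kmers - w + 1):
        # Tuples compare by k-mer first, then by position, so min()
        # picks the smallest k-mer and the leftmost copy on ties.
        kmer, pos = min((seq[i:i + k], i) for i in range(start, start + w))
        selected.add((pos, kmer))
    return sorted(selected)


if __name__ == "__main__":
    picks = minimizers("ACGTACGTTGCA", k=3, w=4)
    print(picks)  # [(0, 'ACG'), (4, 'ACG'), (5, 'CGT'), (9, 'GCA')]
```

In this toy input, 4 of the 10 k-mers are kept. Under the assumed rule, two sequences that share an exact substring of length at least w + k - 1 contain the same full window of k-mers and therefore select the same minimizer from it, which is why a seed index built only from minimizers can still find most of the matches found using all seeds.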