CMSC 828N Laboratory 3, due Apr 12. This lab will involve writing your own, simple, bacterial gene finding program. Submit your answers as a SINGLE tarfile, attached to your email. Please include a file named README explaining what the files are. INPUT Use the genome of Staphylococcus aureus strain COL from the directory: /fs/szdata/ncbi/ftp.ncbi.nih.gov/genomes/Bacteria/Staphylococcus_aureus_COL This is also available directly from NCBI at: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Staphylococcus_aureus_COL The first file you need is NC_002951.fna, which is the fasta file of the genome itself. You will also need NC_002951.ptt, which is the "protein translation table," a file listing all the annotated protein-coding genes. This is a relatively small bacterial genome with low GC-content. S. aureus COL is one of the early methocillin-resistant (MRSA) strains, a deadly variety of S. aureus that is endemic in hospitals and many communities around the world. 1. Write a program to find all open reading frames (ORFs) in the genome. Remember that this is a CIRCULAR chromosome, so ORFs can span the endpoint of the string. Also note that you need to find ORFs on both the forward and the reverse strand. The program should accept a minimum ORF length as a parameter; thus a minimum of 200 means that it will only identify ORFs that are 200bp or longer. Output should include the ORF as a 4-tuple: a unique ID, two coordinates (start codon position coordinate first), and either "+" or "-" to indicate whether the ORF is on the forward or reverse strand. For this assignment, an ORF is a maximal-length substring that begins at a START codon and ends at a STOP, with no STOPs occurring in-frame prior to the final STOP. Maximal here means that no ORF is properly contained in another on the same DNA strand: in other words, use the start codon that makes the longest ORF. START codons must be one of (ATG,GTG) and STOP codons must be one of (TAA,TAG,TGA). NOTE: coordinates of the ORFs should INCLUDE the stop codon. Thus if the start codon (ATG) occurs at positions 101,102,103, and the STOP is at 161,162,163, then the ORF is the closed interval [101,163], and has length 63. What to turn in: (a) Your program (b) A file showing all ORFs of 300bp or longer in S. aureus. Each line should look like this: ORF0001 45123 46123 + This file should be sorted by column 2, with one ORF per line. Note that because column 2 indicates the START codon and column 3 the STOP, column 2 will be larger than column 3 for genes on the reverse strand. 2. Codon usage is a measurement of how frequently each of the 61 possible codons are used. Determine the "expected" frequencies for all 61 codons, using ORFs that are 750 base pairs in length or longer. (Ideally we would use a set of "known" genes to compute these values, but instead we will use these "long" ORFs as a surrogate.) To compute expected frequencies, you count how many times each codon is used in the entire set of long ORFs, and then divide by the total number of codons in these ORFs. As a result you will have 61 percentages, one for each codon. Next, using the ORF list from part 1, count the occurrences of the 61 codons for each ORF. Thus if a codon occurs one time in an ORF with 200 codons, the count will be 1. Then compare each ORF's codon usage to the expected frequencies as follows, by computing the chi-squared statistic of the difference between the two distributions. The chi-squared statistic is: SUM[(observed-expected)^2/expected] where the SUM is computed over all 61 codons. The "expected" number of occurrences for any codon in an ORF is the expected frequency of that codon (computed above) times the number of codons in the ORF. For example, if 2% of all codons are CCC, then for an ORF with 200 codons you would expect 4 of those codons to be CCC. Finally, note that a "typical" ORF will have a low chi-squared statistic. What to turn in: (a) your program (b) a file with the 61 expected frequencies for S. aureus codons (c) A file with all the chi-squared scores for the ORFs output in part 1. For (c), print out a list of all ORFs as in part 1, but this time output five columns: the ORF ID, coordinates, orientation, and the chi-square statistic for that ORF. Note that the first four columns are identical to those for part 1. 3. (Simple gene finder). For each ORF in the ORF list from parts 1 and 2, determine whether or not it is a "gene" by the following very simple test: (a) if the ORF doesn't overlap another ORF on either strand, then call it a gene. (b) if the ORF overlaps other ORFs, and if it has a lower chi-squared statistic than any of its overlapping ORFs, then call it a gene. Note that you should only need the output of part 2 as input to this program. Finally, compare these predictions to the annotation in NC_002951.ptt. For each ORF in your list, consider it a match if it has the same STOP codon as the gene in the ptt file. Don't worry about the start codons. Compute the number of genes from your list that match the annotation. What to turn in: (a) your program (b) A file with all the "genes", in the same format as the output from part 2. (c) Report the number of genes in your predictions that match annotation as three values: the total count, the sensitivity (total count divided by number of genes listed in the ptt file), and the specificity (total count divided by the number of genes you predicted in 3(b).