Let c_a denote the number of times the amino acid a is observed to occur in the column of the multiple alignment modeled by HMM state M. We call c_a the "count" for amino acid a. Let c_total = sum_a c_a, the total counts for all amino acids in this column. The maximum likelihood method, discussed in section 5.6 and in more detail on page 311, and also discussed in class, is not appropriate in this case, because it just estimates e_M(a) = c_a/c_total, which is just the frequency that the amino acid a occurs in the column of the (small) multiple alignment we are given. We need to use a Bayesian parameter estimation method.
The simplest Bayesian method is the pseudocount method. The pseudocount for amino acid a is denoted by alpha_a in this section of the text. Let alpha_total = sum_a alpha_a be the total of all pseudocounts over all amino acids. Then the pseudocount method estimates e_M(a) = (c_a + alpha_a)/(c_total + alpha_total). It is shown on page 320 that this corresponds to a Bayesian estimation method known as Posterior Mean Estimation (PME) using a Dirichlet probability density function as a prior density function on the parameters e_M(a). The quantity e_M(a) is called the posterior mean estimate for the probability of the amino acid a. In class, we discussed how, with a slight change in parameters (from alpha_a to alpha_a + 1), this becomes the formula for Maximum a posteriori (MAP) estimation.
Suppose that in the column of the multiple alignment there are 10 valines, 3 isoleucines and no other amino acids. Consider two cases. In the first case, alpha_a = 1 for every amino acid a. In the second case alpha_a = 100 for every amino acid a. In each of these cases, what is the posterior mean estimate for the probability of each amino acid? Why might this estimation method not be suitable, for these or any other choice of the pseudocounts alpha_a, in this case?
e_M(a) = sum_k P(k|C) (c_a + alpha_a^k)/(c_total + alpha_total^k)
as in the first equation on page 117. P(k|C) is the posterior probability of environment k given the counts C for all the amino acids in the column of the multiple alignment. Suppose that K = 2, i.e. there are only 2 environments. Assume the first environment, corresponding to k=1, is a special kind of buried environment that prefers valines, leucine and isoleucines, and thus has pseudocounts alpha_V^1 = alpha_L^1 = alpha_I^1 = 10 for these amino acids, and has pseudocounts alpha_a^1 = 1 for any other amino acid a. Assume that the second environment has pseudocounts alpha_a^2 = 1 for each amino acid a. The formula for P(k|C), given on page 117, is a bit difficult to calculate, so you may do this for extra credit if you like, assuming the priors p_1 and p_2 for each of the two environments are 1/2. (NOTE: in the edition I have, the first, dated 1998, the equation for P(k|C) in the middle of page 117 has an error in it. You can view the right hand side as a product of 3 ratios. The middle ratio is inverted. The correct version is on page 320, equation 11.23, where Z is defined in equation 11.6 and M is defined in equation 11.4.)
I don't expect you to mess a lot with that formula. But at least do the following: Calculate the posterior mean estimate using the Dirichlet mixture prior, as given by the formula for e_M(a) immediately above, for every amino acid a, assuming that the counts of the observed amino acids are the same as in the previous question, and assuming that the posterior probability of environment 1 is P(1|C) = 0.95. (This probability is actually much closer to 1). What effect does the use of the Dirichlet mixture prior have on the estimate e_M(a), as compared to the estimates you obtained in the previous question?
S_i(X) = ln (P(X|null_i)/P(X|theta_i)) = - ln (P(X|theta_i)/P(X|null_i)).
The natural logarithm is used. How does this explain why it is with SAM scores that the more negative the score, the better?
The scores you got from your search of this library with X = t31 were
1agjA -328.83 1bit -95.28 5ptp -89.66 3gctA -88.2 1ton -87.86 1pytC -78.26 1autC -75.11 1rtfB -73.3 1ay6H -65.88 1sgt -58.68 2pkaB -55.87 2pkaA -33.45 2sga -32.86 1p04A -30.17 2alp -27.42 2hntC -25.07 2hntE -21.89 1arb -21.81 1jxpA -18.37 3rnt -9.86 9rnt -9.86 1svpA -9.29 1aw8B -9.04 4blmA -8.98 1tie -8.85 1hmy -7.47 1pioA -7.42 1btl -7.38 1mugA -6.87 1havA -6.68 1pmd -6.4 1bmv2 -6.25 1cknA -6.23 1pyaA -6.12 3blm -5.63 1awcA -5.56 1ecpA -5.47 3kar -5.46 1bjmA -5.44 1fus -5.43 1pamA -5.32 1ospO -5.26 1whf -5.07 1pueE -5.01No scores greater than -5 were reported because that is the default score cutoff. Suppose that the log prior odds, ln (P(theta_i)/P(null_i)), are assumed to be -7. This would mean that, a priori, the prior probability of a genuine match to the model theta_i is only about e^{-7}. Using Bayes rule, write a formula for the log posterior odds, ln (P(theta_i|X)/P(null_i|X)), in terms of the score S_i(X) and the log priors odds, under these assumptions. This would be the log posterior odds of a genuine match, under these assumptions. What PDB hits from the above list would have positive log posterior odds under these assumptions? These would be hits for which the posterior probability of a genuine match is greater than 1/2. Why?
ggugauaaacguacuauauugugaaagauuauuuacuaacgacacauugaag aaaaguucucauuacuaacaagcaaaauguuuuguuucuccuuuuaaaauag cguugauacuuaugcguauacauucauauacguucuucguguuuauuuuuag agucucaacuaccgaagagaaauaaacuacuaacguacuuuaauauuuauag aaaagucaaguacuaacguuuguuuaccccuguuuauuguguuuccacucag aacuuuacuaacaucaauaugcaaauaaaucugcaaaaacuuuuuuucacag cgaacccaaucuuacuaacagagaaagggcuuuuucccgaccaucaagacag ggcaaauuacuaacauugaaauacgggaauugauauuucccguuguguuaag cuucggccucauacucaaagaacacguuuacuaacauaacuuauuuacauag uuuacuaacaaaauuauucucacuccccgauauugagaucuuuuuuaacuag ggcucucaaaaauaauugacgagcuuacggugauacgcuuaccguuauccag caugauuauauuuuagaucaaccgguuuacuaacaugcuauuuuucauacag acauguauaaaucgaucgggaauacuaacacuacuuuucuuuaucuaagcag uuuaaaacuuaugcucuuauuuacuaacaaaaucaacaugcuauugaacuag aauagacuuuuuccuucuuacagaacgauaauaacuaacaugacuuuaacag ccuuccuuuguuuaucauagcgcacgacaagaguacuaacuaauuaacuuag uucuuucuacuaacguuuucauuauucuauacucuaugaccaauaaaaacag gucgauauuggcuucggacaacuuuacuaacaaaugguauuauuuauaacag uccgggaaaacuuuugacuaacaagaauccaauuuuacuuuuuuuuuuuuag cuaacaugauuucuuauaaaucuuaauuuugauuuauuuccuuauaauacag gcaaaggcacauaugacuaaccuuuuuuuuuucgcaauuaauuuucuaauag gguucucuuuugacuaacguauaucaucuaugaccaucuuucuuuuccccag uuacaauuacgcguuauaauuuacuaacauuuccuuuacauaauuuauucag uaacaacauuacguuugauaucguccgauaucgauuuacuauuuccauuuag auguuuucauuucaaggauagccuuugaaucaauuuacuaacaauacuucag auguuuucauuucaaggauagccuuugaaucaauuuacuaacaauacuucag guauguaauaugagaaucaaacuuaaauauauccuauacuaacaauuuguag ucacaucguacuaacgagcccuuguauuuucauuacaucacuuucgauucag gcauuuuaguucuuuuaccccuauauuauuaauacuaaccaucucgauauag cagcgguauacuaacaaaucgaugaacuuaacuuguuuuacuugaauaacag uuaaaaugaacaaucguuacuaacaaaaaauuuaccauuuuauuuuuaauag caucaacuaccauuuuuuaugggaaucacauuauauacuaacauagguuuag uucuguauacuaacaauuuuuaaaaugggaucuaccaauugaauuauaucag uguuuuacuaacaugugacacgaaauguuuuucauuuuugguuuuauaacag cuuuuaacauucaauaaugauacuaacguugauuaaugaugauucaucgcag guaguacauauacauaaagcagaauacuaacaaucgauccgcuaugcaacag caacgaauaacgacucuuccuauggagcuuguuauauguaucauugaauuag aggaacuuuuacuaacaguuaugauuuuuuguucccauuuucuuaucaauag uucaagucucuucaaguauauaaggaguuagggcagaaguaggugaugacag cuaacguuaauaaaaauuuggaauauuauuucauuuuuuauccuauuaauag auucgauaaauuuuuuggaauauccaauuccacuaaauauuacuuuaaacag agaacaaacuacuagggcagaaaacgcaaagucuuugugaacuuucauauag cuuuuagguuguaauaaacguuucauauacuaacaaacucuucugcgauuag uuaucuuuuaaguuuauuuacguuuuugagaccuuacuaacgaccaggauag accauugcguuacuaacuauauggagaccuuauaaucucugcuucuuuacag aagcaaagcacaauuuuacuaaccuguucucaaacauauuuuuuuuucauag ugcggguacugagcgguuauacuaacuuagagaaaacacugaaugaucuuag auuuuucuuuacuaacucgaggaagagugagguuuucuuccaugaauugcag uuuguguguguacauuugaauauauauauuuacuaacaaggagaaaacuuag agagguauuuacuaacuuaaguugucucauuugauuauugcuauuuuuauag
Are there any "hidden motifs" in this data? You might notice that there is a definite motif that occurs at the end of each sequence. This motif is roughly "YAG", where Y stands for a pyrimidine (C or U), although there are some exceptions. There are also preferences for certain nucleic acids to occur at preceding positions, so this motif might be extended somewhat. This motif is recognized by the spliceosome, the complex of proteins and RNA that is responsible for splicing out introns. We would like to be able to automatically discover this and other motifs in this data.
Go to Jim's Improbizer page. Click on "Improbizer" and read the documentation. This program is something like the MEME program. Improbizer will discover motifs in the data you submit. It scans the first few sequences you submit (default: the first 20 sequences, adjustable by the parameter "Number of Sequences in Initial Scan") and for every window x_1 ... x_6 of 6 residues in each of these sequences, it creates an initial motif that has 6 states, and has its probabilities set so that state i prefers residue x_i. The initial motif size of 6 is an adjustable parameter. Then for each one of these initial motifs, Improbizer does the following.
First it does what is essentially a Viterbi alignment of the motif to each of the sequences in the data you supplied (not just the first 20 sequences). This alignment finds the best single match for that motif in each sequence. (There is a parameter you can set to find more than one occurrence per sequence.) Then each of these matches is placed in a multiple alignment of all matches, and the parameters of the initial motif are reestimated from these aligned matches. Each match is "weighted" so that it contributes counts to this reestimation in proportion to the posterior probability that it is a real match, when you use the option "training weights = classical", which I would like you to select. Then the process is repeated, i.e. new matches are found to the revised motif, and the parameters are reestimated. This continues until the motif stops changing. In this case, Improbizer is implementing a close approximation to the classical EM algorithm. (The full EM algorithm would use every possible window of 6 residues in every sequence as a possible match, weighting the counts from each one according to the posterior probability that it is a real match.) Improbizer also decides whether to grow or shrink the motif size after each iteration. The tendency to grow the motif is governed by the parameter "Restrain Expansionist Tendencies". This whole process may be called "training a motif".
After training all the initial motifs, Improbizer picks the best one and reports it. The quality of the motif is measured by the average of the log likelihood scores for all its matches. The log likelihood score for a given match X is calculated as log (P(X|motif)/P(X|null)) where "motif" is the motif model and "null" is a null model. You can select the null model to be a Markov model of order 0, 1 or 2, or a special model that includes codon structure (useful for finding motifs in coding regions). You can also select the data you want to use to estimate the parameters of the null model, which is a very nice feature, since the quality of the null model can really make a difference in motif discovery. The probability P(X|motif) is like the usual product of the probabilities of the individual residues of X for the corresponding states of the motifs (no indels are allowed, as in a profile HMM), but it also includes a distribution of the location of the match within the data sequence, so Improbizer can prefer motifs that all occur in roughly the same location in each input sequence. (This location can be specified with respect to either the left end or the right end of each sequence, in case the input sequences have different lengths.) If you want to turn this off, select "Ignore location".
Finally, you can ask Improbizer to "probabilistically erase" from the data the motif you just found, and then find the second best motif. You can find up to 6 motifs this way. This is accomplished by setting the "Number of motifs to find" parameter. The method used for this is like that used by MEME. If you don't erase, it just keeps finding the same motif again and again.
Set the training weight to classical, paste in the 50 sequences above and run Improbizer. It should report 2 motifs. Interestingly, the second motif it reports is the 3' splice site, but the first motif is something else. What do you think it is? Note that it has a considerably higher average log likelihood ratio than the 3' splice site. Improbizer gives you some graphical displays showing where the hits to these motifs occur, and how strong they are. It also reports the full motifs. Look these over. They may be helpful. You may need to find out more about how introns are spliced to answer this question. When you figure out what the first motif is, comment on why some positions in the motif show very high conservation.
Now try the same thing, but set the number of motifs to find to 3. Examine the third motif.
Which motifs significant? One way to test this is to scramble the sequences you provided, and repeat this procedure several times, and see how good the motifs are that you discover "just by chance". There is a button "control run" at the bottom of the Improbizer page that does just this. Try it a few times, and look at the log odds scores and other details of the motifs you get. Do you think the two motifs that Improbizer found in the unscrambled data are real?
For extra credit, run Improbizer on other data sets, and with other setting of the parameters, and report the results.