CMP 243 Homework 4, part 2

Due: Tues. November 16 (Please turn in both parts of HW4 this day.)

Dirichlet and Dirichlet mixture priors, profile HMMs, and motif discovery.

Reading

Review the material on Bayes methods on pages 4-11 of the text, then read Chapter 5 in the text. (Chapter 11 contains more mathematical detail on this topic, and is suggested, but optional reading.)

Written questions

  1. The estimation of parameters for a "match" state M in a profile HMM is discussed in section 5.6 or the text. Here it is assumed that a multiple protein alignment is given. The match state M is associated with a particular column in that multiple alignment. The parameters of the match state M are the emission probabilities e_M(a) for each amino acid a. Here, for simplicity, I am dropping the subscript j used in the book to identify the particular match state (or alignment column) in question. The column in the multiple alignment (and hence also the match state M) represents a position that is occupied by an amino acid in the three dimensional structure of a typical protein in the family of proteins being modeled by the HMM. That structure defines an environment for the amino acid, e.g. buried in the core, or exposed to solvent, which will cause certain amino acids to be preferred over other in that position. In many cases, the structure is not known, so it is not known what this environment is. The given multiple alignment usually only provides a small sample of amino acids that have been observed to occur in this position in different proteins in the family. The parameter e_M(a) should represent the frequency that the amino acid a will occur in this position in general, for a typical member of this protein family, not just the frequency that it is observed to occur in this position in the small sample given in the multiple alignment. Therefore, we need to use a statistical method to estimate the parameter e_M(a).

    Let c_a denote the number of times the amino acid a is observed to occur in the column of the multiple alignment modeled by HMM state M. We call c_a the "count" for amino acid a. Let c_total = sum_a c_a, the total counts for all amino acids in this column. The maximum likelihood method, discussed in section 5.6 and in more detail on page 311, and also discussed in class, is not appropriate in this case, because it just estimates e_M(a) = c_a/c_total, which is just the frequency that the amino acid a occurs in the column of the (small) multiple alignment we are given. We need to use a Bayesian parameter estimation method.

    The simplest Bayesian method is the pseudocount method. The pseudocount for amino acid a is denoted by alpha_a in this section of the text. Let alpha_total = sum_a alpha_a be the total of all pseudocounts over all amino acids. Then the pseudocount method estimates e_M(a) = (c_a + alpha_a)/(c_total + alpha_total). It is shown on page 320 that this corresponds to a Bayesian estimation method known as Posterior Mean Estimation (PME) using a Dirichlet probability density function as a prior density function on the parameters e_M(a). The quantity e_M(a) is called the posterior mean estimate for the probability of the amino acid a. In class, we discussed how, with a slight change in parameters (from alpha_a to alpha_a + 1), this becomes the formula for Maximum a posteriori (MAP) estimation.

    Suppose that in the column of the multiple alignment there are 10 valines, 3 isoleucines and no other amino acids. Consider two cases. In the first case, alpha_a = 1 for every amino acid a. In the second case alpha_a = 100 for every amino acid a. In each of these cases, what is the posterior mean estimate for the probability of each amino acid? Why might this estimation method not be suitable, for these or any other choice of the pseudocounts alpha_a, in this case?


  2. A better prior probability density to use in the case above is a mixture of Dirichlet densities, as discussed on page 117, and in more detail on page 321. In this case we assume that there are K different amino acid environments that occur in protein structures. Each of these environments has its own set of 20 pseudocounts. The pseudocount for amino acid a in environment k is denoted alpha_a^k. Let alpha_total^k = sum_a alpha_a^k. Let C be the vector of all amino acid counts c_a. Then the posterior mean estimate using the Dirichlet mixture prior is given by

    e_M(a) = sum_k P(k|C) (c_a + alpha_a^k)/(c_total + alpha_total^k)

    as in the first equation on page 117. P(k|C) is the posterior probability of environment k given the counts C for all the amino acids in the column of the multiple alignment. Suppose that K = 2, i.e. there are only 2 environments. Assume the first environment, corresponding to k=1, is a special kind of buried environment that prefers valines, leucine and isoleucines, and thus has pseudocounts alpha_V^1 = alpha_L^1 = alpha_I^1 = 10 for these amino acids, and has pseudocounts alpha_a^1 = 1 for any other amino acid a. Assume that the second environment has pseudocounts alpha_a^2 = 1 for each amino acid a. The formula for P(k|C), given on page 117, is a bit difficult to calculate, so you may do this for extra credit if you like, assuming the priors p_1 and p_2 for each of the two environments are 1/2. (NOTE: in the edition I have, the first, dated 1998, the equation for P(k|C) in the middle of page 117 has an error in it. You can view the right hand side as a product of 3 ratios. The middle ratio is inverted. The correct version is on page 320, equation 11.23, where Z is defined in equation 11.6 and M is defined in equation 11.4.)

    I don't expect you to mess a lot with that formula. But at least do the following: Calculate the posterior mean estimate using the Dirichlet mixture prior, as given by the formula for e_M(a) immediately above, for every amino acid a, assuming that the counts of the observed amino acids are the same as in the previous question, and assuming that the posterior probability of environment 1 is P(1|C) = 0.95. (This probability is actually much closer to 1). What effect does the use of the Dirichlet mixture prior have on the estimate e_M(a), as compared to the estimates you obtained in the previous question?


  3. This question refers back to Homework 2, Question 9. There you did a search on the CASP protein t31, and got back some predicted remote homologs for this protein. This search took the t31 sequence you pasted in, and evaluated its probability for many different HMMs, using the forward algorithm we discussed in class. These HMMs are a library of HMMs constructed at UCSC, with one HMM for every representative member of the protein data bank (PDB), a collection of proteins with known 3d structure, and their structures (see HW2). Suppose we number from 1 to N these HMMs and the PDB proteins they are made from. Let X_i denote the PDB protein sequence used to create HMM_i. Let us denote the parameters for HMM_i by theta_i. Then the probability for any protein X under HMM_i is P(X|theta_i). It turns out that each HMM in this collection has its own null model associated with it. This null model is itself and HMM, made from the reverse of the sequence X_i. Let us denote the probability of any sequence X under this null model by P(X|null_i). Then the score returned by SAM-T98 for the query sequence X against HMM_i is

    S_i(X) = ln (P(X|null_i)/P(X|theta_i)) = - ln (P(X|theta_i)/P(X|null_i)).

    The natural logarithm is used. How does this explain why it is with SAM scores that the more negative the score, the better?

    The scores you got from your search of this library with X = t31 were

    1agjA -328.83
    1bit  -95.28
    5ptp  -89.66
    3gctA -88.2
    1ton  -87.86
    1pytC -78.26
    1autC -75.11
    1rtfB -73.3
    1ay6H -65.88
    1sgt  -58.68
    2pkaB -55.87
    2pkaA -33.45
    2sga  -32.86
    1p04A -30.17
    2alp  -27.42
    2hntC -25.07
    2hntE -21.89
    1arb  -21.81
    1jxpA -18.37
    3rnt  -9.86
    9rnt  -9.86
    1svpA -9.29
    1aw8B -9.04
    4blmA -8.98
    1tie  -8.85
    1hmy  -7.47
    1pioA -7.42
    1btl  -7.38
    1mugA -6.87
    1havA -6.68
    1pmd  -6.4
    1bmv2 -6.25
    1cknA -6.23
    1pyaA -6.12
    3blm  -5.63
    1awcA -5.56
    1ecpA -5.47
    3kar  -5.46
    1bjmA -5.44
    1fus  -5.43
    1pamA -5.32
    1ospO -5.26
    1whf  -5.07
    1pueE -5.01
    
    
    No scores greater than -5 were reported because that is the default score cutoff. Suppose that the log prior odds, ln (P(theta_i)/P(null_i)), are assumed to be -7. This would mean that, a priori, the prior probability of a genuine match to the model theta_i is only about e^{-7}. Using Bayes rule, write a formula for the log posterior odds, ln (P(theta_i|X)/P(null_i|X)), in terms of the score S_i(X) and the log priors odds, under these assumptions. This would be the log posterior odds of a genuine match, under these assumptions. What PDB hits from the above list would have positive log posterior odds under these assumptions? These would be hits for which the posterior probability of a genuine match is greater than 1/2. Why?

Experimental Part

We discussed the Expectation Maximization (EM) algorithm in class. This algorithm is used to estimate the parameters of HMMs. It can also be used to estimate the parameters of a simple motif in cases where there is no alignment given. Suppose, for example, that you have a collection of introns from different genes in an organism such as yeast, and you want to discover one or more motifs in this data. Here, as an example, are 50 sequences taken from the 3' ends of yeast introns by Jim Kent.
ggugauaaacguacuauauugugaaagauuauuuacuaacgacacauugaag
aaaaguucucauuacuaacaagcaaaauguuuuguuucuccuuuuaaaauag
cguugauacuuaugcguauacauucauauacguucuucguguuuauuuuuag
agucucaacuaccgaagagaaauaaacuacuaacguacuuuaauauuuauag
aaaagucaaguacuaacguuuguuuaccccuguuuauuguguuuccacucag
aacuuuacuaacaucaauaugcaaauaaaucugcaaaaacuuuuuuucacag
cgaacccaaucuuacuaacagagaaagggcuuuuucccgaccaucaagacag
ggcaaauuacuaacauugaaauacgggaauugauauuucccguuguguuaag
cuucggccucauacucaaagaacacguuuacuaacauaacuuauuuacauag
uuuacuaacaaaauuauucucacuccccgauauugagaucuuuuuuaacuag
ggcucucaaaaauaauugacgagcuuacggugauacgcuuaccguuauccag
caugauuauauuuuagaucaaccgguuuacuaacaugcuauuuuucauacag
acauguauaaaucgaucgggaauacuaacacuacuuuucuuuaucuaagcag
uuuaaaacuuaugcucuuauuuacuaacaaaaucaacaugcuauugaacuag
aauagacuuuuuccuucuuacagaacgauaauaacuaacaugacuuuaacag
ccuuccuuuguuuaucauagcgcacgacaagaguacuaacuaauuaacuuag
uucuuucuacuaacguuuucauuauucuauacucuaugaccaauaaaaacag
gucgauauuggcuucggacaacuuuacuaacaaaugguauuauuuauaacag
uccgggaaaacuuuugacuaacaagaauccaauuuuacuuuuuuuuuuuuag
cuaacaugauuucuuauaaaucuuaauuuugauuuauuuccuuauaauacag
gcaaaggcacauaugacuaaccuuuuuuuuuucgcaauuaauuuucuaauag
gguucucuuuugacuaacguauaucaucuaugaccaucuuucuuuuccccag
uuacaauuacgcguuauaauuuacuaacauuuccuuuacauaauuuauucag
uaacaacauuacguuugauaucguccgauaucgauuuacuauuuccauuuag
auguuuucauuucaaggauagccuuugaaucaauuuacuaacaauacuucag
auguuuucauuucaaggauagccuuugaaucaauuuacuaacaauacuucag
guauguaauaugagaaucaaacuuaaauauauccuauacuaacaauuuguag
ucacaucguacuaacgagcccuuguauuuucauuacaucacuuucgauucag
gcauuuuaguucuuuuaccccuauauuauuaauacuaaccaucucgauauag
cagcgguauacuaacaaaucgaugaacuuaacuuguuuuacuugaauaacag
uuaaaaugaacaaucguuacuaacaaaaaauuuaccauuuuauuuuuaauag
caucaacuaccauuuuuuaugggaaucacauuauauacuaacauagguuuag
uucuguauacuaacaauuuuuaaaaugggaucuaccaauugaauuauaucag
uguuuuacuaacaugugacacgaaauguuuuucauuuuugguuuuauaacag
cuuuuaacauucaauaaugauacuaacguugauuaaugaugauucaucgcag
guaguacauauacauaaagcagaauacuaacaaucgauccgcuaugcaacag
caacgaauaacgacucuuccuauggagcuuguuauauguaucauugaauuag
aggaacuuuuacuaacaguuaugauuuuuuguucccauuuucuuaucaauag
uucaagucucuucaaguauauaaggaguuagggcagaaguaggugaugacag
cuaacguuaauaaaaauuuggaauauuauuucauuuuuuauccuauuaauag
auucgauaaauuuuuuggaauauccaauuccacuaaauauuacuuuaaacag
agaacaaacuacuagggcagaaaacgcaaagucuuugugaacuuucauauag
cuuuuagguuguaauaaacguuucauauacuaacaaacucuucugcgauuag
uuaucuuuuaaguuuauuuacguuuuugagaccuuacuaacgaccaggauag
accauugcguuacuaacuauauggagaccuuauaaucucugcuucuuuacag
aagcaaagcacaauuuuacuaaccuguucucaaacauauuuuuuuuucauag
ugcggguacugagcgguuauacuaacuuagagaaaacacugaaugaucuuag
auuuuucuuuacuaacucgaggaagagugagguuuucuuccaugaauugcag
uuuguguguguacauuugaauauauauauuuacuaacaaggagaaaacuuag
agagguauuuacuaacuuaaguugucucauuugauuauugcuauuuuuauag

Are there any "hidden motifs" in this data? You might notice that there is a definite motif that occurs at the end of each sequence. This motif is roughly "YAG", where Y stands for a pyrimidine (C or U), although there are some exceptions. There are also preferences for certain nucleic acids to occur at preceding positions, so this motif might be extended somewhat. This motif is recognized by the spliceosome, the complex of proteins and RNA that is responsible for splicing out introns. We would like to be able to automatically discover this and other motifs in this data.

Go to Jim's Improbizer page. Click on "Improbizer" and read the documentation. This program is something like the MEME program. Improbizer will discover motifs in the data you submit. It scans the first few sequences you submit (default: the first 20 sequences, adjustable by the parameter "Number of Sequences in Initial Scan") and for every window x_1 ... x_6 of 6 residues in each of these sequences, it creates an initial motif that has 6 states, and has its probabilities set so that state i prefers residue x_i. The initial motif size of 6 is an adjustable parameter. Then for each one of these initial motifs, Improbizer does the following.

First it does what is essentially a Viterbi alignment of the motif to each of the sequences in the data you supplied (not just the first 20 sequences). This alignment finds the best single match for that motif in each sequence. (There is a parameter you can set to find more than one occurrence per sequence.) Then each of these matches is placed in a multiple alignment of all matches, and the parameters of the initial motif are reestimated from these aligned matches. Each match is "weighted" so that it contributes counts to this reestimation in proportion to the posterior probability that it is a real match, when you use the option "training weights = classical", which I would like you to select. Then the process is repeated, i.e. new matches are found to the revised motif, and the parameters are reestimated. This continues until the motif stops changing. In this case, Improbizer is implementing a close approximation to the classical EM algorithm. (The full EM algorithm would use every possible window of 6 residues in every sequence as a possible match, weighting the counts from each one according to the posterior probability that it is a real match.) Improbizer also decides whether to grow or shrink the motif size after each iteration. The tendency to grow the motif is governed by the parameter "Restrain Expansionist Tendencies". This whole process may be called "training a motif".

After training all the initial motifs, Improbizer picks the best one and reports it. The quality of the motif is measured by the average of the log likelihood scores for all its matches. The log likelihood score for a given match X is calculated as log (P(X|motif)/P(X|null)) where "motif" is the motif model and "null" is a null model. You can select the null model to be a Markov model of order 0, 1 or 2, or a special model that includes codon structure (useful for finding motifs in coding regions). You can also select the data you want to use to estimate the parameters of the null model, which is a very nice feature, since the quality of the null model can really make a difference in motif discovery. The probability P(X|motif) is like the usual product of the probabilities of the individual residues of X for the corresponding states of the motifs (no indels are allowed, as in a profile HMM), but it also includes a distribution of the location of the match within the data sequence, so Improbizer can prefer motifs that all occur in roughly the same location in each input sequence. (This location can be specified with respect to either the left end or the right end of each sequence, in case the input sequences have different lengths.) If you want to turn this off, select "Ignore location".

Finally, you can ask Improbizer to "probabilistically erase" from the data the motif you just found, and then find the second best motif. You can find up to 6 motifs this way. This is accomplished by setting the "Number of motifs to find" parameter. The method used for this is like that used by MEME. If you don't erase, it just keeps finding the same motif again and again.

Set the training weight to classical, paste in the 50 sequences above and run Improbizer. It should report 2 motifs. Interestingly, the second motif it reports is the 3' splice site, but the first motif is something else. What do you think it is? Note that it has a considerably higher average log likelihood ratio than the 3' splice site. Improbizer gives you some graphical displays showing where the hits to these motifs occur, and how strong they are. It also reports the full motifs. Look these over. They may be helpful. You may need to find out more about how introns are spliced to answer this question. When you figure out what the first motif is, comment on why some positions in the motif show very high conservation.

Now try the same thing, but set the number of motifs to find to 3. Examine the third motif.

Which motifs significant? One way to test this is to scramble the sequences you provided, and repeat this procedure several times, and see how good the motifs are that you discover "just by chance". There is a button "control run" at the bottom of the Improbizer page that does just this. Try it a few times, and look at the log odds scores and other details of the motifs you get. Do you think the two motifs that Improbizer found in the unscrambled data are real?

For extra credit, run Improbizer on other data sets, and with other setting of the parameters, and report the results.


Questions regarding about page content should be directed to haussler@cse.ucsc.edu.
Last modified Nov 5, 1999, 5PM.

Back to the CMP 243 Class Page.