CMP 243 Homework 4

Due: Tues. Feb. 24

Pairwise sequence alignment, Markov models, introduction to HMMs

Written part:

  1. Do problem 2.9 on page 22 of the text.
  2. Do problem 2.10 on page 32 of the text.
  3. Suppose you are given the DNA sequence X = ACTCCGATCG. Using the Markov models defined by the tables at the bottom of page 50 of the text, calculate the probability of the sequence X using the Markov model of CpG islands, call it M+, and also the probability of X using the Markov model of regular DNA, not in CpG islands, (call this model M-). You may assume that the probability distribution for the first nucleotide is uniform for each of these models. Then calculate the log odds ratio

    log_2 (P(X|M+)/P(X|M-)).

    This gives the log odds in total bits. What is the difference between this score and the scores plotted in the histogram on page 53 of the text? Where would this sequence go in that histogram?

  4. Using the hidden Markov model in the example on page 54 of the text, calculate the probability that in four die rolls, the first 3 tosses are from a fair die, and they are 3, 1 and 5, and the fourth toss is from a loaded die, and it is 6. Assume the probability of starting with a fair die is 0.99.
  5. Calculate the probability of getting a pair of sixes in two die rolls from the casino in this example on page 54. Note that here I am not specifying the hidden state, i.e. I am not telling you if the each die is fair or not. You must consider the possible cases. Again, assume the probability of starting with a fair die is 0.99.
  6. Do problem 3.4 on page 57 of the text.

Experimental part:

Anders Krogh has provided us with a set of HMM teaching/experimenting programs for use with the text. The programs are under the directory

   /cse/guests/krogh/book/ahmm
on the barnyard machines. The documentation is in
   /cse/guests/krogh/book/ahmm/AHMM.documentation
and the executable code for the DEC alphas is in
   /cse/guests/krogh/book/ahmm/bin/OSF1
You should add this directory to your UNIX path. The three programs in this directory are decodeahmm, testahmm and trainahmm.

Using these programs you can define the state structure for an HMM, train it on some training sequences, and use it to "decode" other sequences. Here by decoding, we mean assigning a label to each letter in the observation sequence, as described in Chapter 4 of the text. This label usually represents one of the states of the HMM, although in a more general case the same label may be used for more than one state. There are two ways to decode an observation sequence. The first is the Viterbi method, discussed on page 56 of the text. In this method, the most likely sequence of (hidden) states is computed for the entire observation sequence. The second way is posterior decoding, discussed on page 59. In this method, for each individual letter of the observation sequence, the most likely label (or state, if each state has a unique label) is computed separately. The formulas used in the Viterbi and posterior decoding cases are given in the text.

In this part of the assignment we are going to compare the Viterbi and posterior decoding methods. We'll do this using the "occasionally dishonest casino" example on page 54 of the text. I have designed an HMM to represent this hidden Markov model, under the additional assumption that when the HMM starts, the first die chosen is fair with probability 0.99. I have also changed the transition probabilities a bit. For use by Anders' program, this HMM is specified by the following file (see documentation for details):

# HMM based on example of page 54 of text 

########## ALPHABET ################################################
alphabet { alphabet AB123456; }
########## STATE begin #############################################
begin {  trans
      fair: 0.99
      loaded: 0.01;
   letter NULL;
   end 0;
   }
########## STATE fair ##############################################
fair {   trans
      fair: 0.99
      loaded: 0.01;
   only   1:0.166666666 2:0.166666666 3:0.16666666 4:0.16666666 5:0.16666666
       6:0.16666666;
   label F;
   }
########## STATE loaded ############################################
loaded { trans
      fair: 0.1
      loaded: 0.9;
   only   1:0.1 2:0.1 3:0.1 4:0.1 5:0.1 6:0.5;
   label L;
   }

Here is what you need to do.
  1. Read the documentation file and see if you can figure out what the above HMM specification means. Then copy it to a file called dice.mod. Then run the Viterbi algorithm to decode the following sequence using the HMM dice.mod
    1552433142516636666166646666425341323435241324435166366126666166643523425234142
    
    One way to do this is to make a file called data1 for this sequence. (Don't add any blank lines at the end of the file, the software is just quickly made classroom software, and is very brittle.) Then type
        decodeahmm -v dice.mod data1
    
    The "-v" option specifies Viterbi decoding. What is the output? Display the output above the observation sequence. Is it a reasonable decoding?
  2. Decode the same sequence using posterior decoding. All you need to do is leave off the "-v" option, because posterior decoding is the default. Is the result different from the Viterbi decoding?
  3. Now repeat the above two tests with the following alternate observation sequences:

    data2:

    11116161661666666611116111116111111616666166616166666616111166666661616161666666
    

    data3:

    11111111116666666666666666111111111166666666661111111
    
    In each case, compare the results of the decoding, and say which decoding you feel is more reasonable. Try to explain why they differ the way they do.
Optional (extra credit): Use Anders' HMM software to train an HMM for some sequence feature of biological interest and test your HMM on real data. One possibility would be implement the HMM for CpG islands discussed in the text and try it on the worm data at the Sanger site (see also /projects/compbio/data/worm.) Note: Before applying the model, you will have to retrain starting from the initial model given in the text, since the background frequencies are quite different in worm DNA from what they are in human DNA. Another option would be to try to build an HMM to recognize worm splice sites and/or (coding) exons. This is more ambitious, but at least you have good labeled training and test data.
Questions regarding about page content should be directed to haussler@cse.ucsc.edu.
Last modified Feb. 17, 1998.

Back to the CMP 243 Class Page.