CMP 243 Homework 4

Due: Mon. Nov. 4

Pairwise sequence alignment, multiple sequence alignment, Markov models, introduction to HMMs

Reading for this homework is Chapter 3 and beginning of Chapter 4 of the text, and the tutorial on pairwise sequence alignment by Robert Giegerich and David Wheeler (GW). ( postscript version)

  1. Do problem 2 on page 6 of GW.
  2. Using the unit cost model, give the entire dynamic programming matrix, as on page 8 of GW, for the global pairwise alignment of ACCCAT and GCCCACCATG. What is the minimum cost among all alignments? Show one such minimum cost alignment.
  3. Repeat the above using (semi) local alignment.
  4. Suppose you are given the DNA sequence X = ACTCCGATCG. Using the Markov models defined by the tables at the bottom of page 52 of the text, calculate the probability of the sequence X using the Markov model of CpG islands, call it M+, and also the probability of X using the Markov model of regular DNA, not in CpG islands, (call this model M-). You may assume that the probability distribution for the first nucleotide is uniform for each of these models. Then calculate the log odds ratio

    log_2 (P(X|M+)/P(X|M-)).

    This gives the log odds in total bits. (Sometimes this is called "bits saved" since

    log_2 (P(X|M+)/P(X|M-)) = [- log_2 P(X|M-)] - [- log_2 P(X|M+)],

    and the first number in brackets can be interpreted as the number of bits it takes to represent (encode) X using model M-, while the second number in brackets represents the number of bits to represent X using model M+. See Tom Schneider's primer on Information Theory written for Molecular Biologists for details. Hence the difference between these two quantities in brackets represents the number of bits saved by using M+ rather than the "Null Model" M-.) What is the difference between this kind of score and the scores plotted in the histogram on page 53 of the text? Where would this sequence go in that histogram?

  5. Using the hidden Markov model in the example on page 56 of the text, calculate the probability that in four die rolls, the first 3 tosses are from a fair die, and they are 3, 1 and 5, and the fourth toss is from a loaded die, and it is 6. Assume the probability of starting with a fair die is 0.99.
  6. Calculate the probability of getting a pair of sixes in two die rolls from the casino in this example on page 56. Note that here I am not specifying the hidden state, i.e. I am not telling you if the each die is fair or not. You must consider the possible cases. Again, assume the probability of starting with a fair die is 0.99.
  7. (Optional) Make up or find a set of related sequences and submit it to both the CLUSTALW server (find it on Pedro's page) and the SAM server. Compare the multiple alignments you get. (Note: don't wait until the last minute. It takes time to get your answer back by email. I am still waiting for an answer from CLUSTALW for a relatively small set of sequences I submitted.)

Questions regarding about page content should be directed to haussler@cse.ucsc.edu.
Last modified Nov. 17, 1996.

Back to the CMP 243 Class Page.