CMP 243 Homework 3, part 2

Due: Thurs. Oct 28

Pairwise sequence alignment, Markov models, introduction to HMMs

Experimental part:

Anders Krogh, one of the authors of our text and one of the original developers of HMM methods for biosequence analysis, has provided us with a set of HMM teaching/experimenting programs for use with the text. The programs are under the directory

   /cse/guests/krogh/book/ahmm
on the barnyard machines. The documentation is in
   /cse/guests/krogh/book/ahmm/AHMM.documentation
and the executable code for the DEC alphas is in
   /cse/guests/krogh/book/ahmm/bin/OSF1
You should add this directory to your UNIX path. To do this, in the file ".cshrc" in your home directory, after the line that begins "set path = ( . /usr/ucb ..." add the line "set path = ($path /cse/guests/krogh/book/ahmm/bin/OSF1)". Then use the command "source .cshrc" to get your new path to take effect. The three programs in Anders' OSF1 directory are decodeahmm, testahmm and trainahmm. You should now be able to run these. Type "testahmm". You should get the message "Give modelfile as argument" back. This means the program is working for you, but it terminated with this complaint because you didn't provide an appropriate input file. Note: the programs for this homework must be run on a DEC alpha machine, such as "eclipse" or "beta" or "oink" or some other barnyard machine. Be sure you have permission to use one of these machines.

Using these three programs you can define the state structure for an HMM, train it on some training sequences, and use it to "decode" other sequences. Here by decoding, we mean assigning a label to each letter in the observation sequence, as described in Chapter 3 of the text. This label usually represents one of the states of the HMM, although in a more general case the same label may be used for more than one state. There are two ways to decode an observation sequence. The first is the Viterbi method, discussed on page 55 of the text. In this method, the most likely sequence of (hidden) states is computed for the entire observation sequence. The second way is posterior decoding, discussed on page 59. In this method, for each individual letter of the observation sequence, the most likely label (or state, if each state has a unique label) is computed separately. The formulas used in the Viterbi and posterior decoding cases are given in the text.

In this part of the assignment we are going to compare the Viterbi and posterior decoding methods. We'll do this using the "occasionally dishonest casino" example on page 54 of the text. I have designed an HMM to represent this hidden Markov model, under the additional assumption that when the HMM starts, the first die chosen is fair with probability 0.99. I have also changed the transition probabilities a bit. For use by Anders' program, this HMM is specified by the following file (see his documentation for details):

# HMM based on example of page 54 of text 

########## ALPHABET ################################################
alphabet { alphabet AB123456; }
########## STATE begin #############################################
begin {  trans
      fair: 0.99
      loaded: 0.01;
   letter NULL;
   end 0;
   }
########## STATE fair ##############################################
fair {   trans
      fair: 0.99
      loaded: 0.01;
   only   1:0.166666666 2:0.166666666 3:0.16666666 4:0.16666666 5:0.16666666
       6:0.16666666;
   label F;
   }
########## STATE loaded ############################################
loaded { trans
      fair: 0.1
      loaded: 0.9;
   only   1:0.1 2:0.1 3:0.1 4:0.1 5:0.1 6:0.5;
   label L;
   }

Here is what you need to do.
  1. Read the documentation file and figure out what the above HMM specification means. Then copy it to a file called dice.mod. Then run the Viterbi algorithm to decode the following sequence using the HMM dice.mod
    1552433142516636666166646666425341323435241324435166366126666166643523425234142
    
    One way to do this is to make a file called data1 for this sequence. (Don't add any blank lines at the end of the file, the software is just quickly-made classroom software, and is very brittle.) Then type
        decodeahmm -v dice.mod data1
    
    The "-v" option specifies Viterbi decoding. What is the output? Display the output above the observation sequence. Is it a reasonable decoding?
  2. Decode the same sequence using posterior decoding. All you need to do is leave off the "-v" option, because posterior decoding is the default. Is the result different from the Viterbi decoding?
  3. Now repeat the above two tests with the following alternate observation sequences:

    data2:

    11116161661666666611116111116111111616666166616166666616111166666661616161666666
    

    data3:

    11111111116666666666666666111111111166666666661111111
    
    In each case, compare the results of the decoding, and say which decoding you feel is more reasonable. Explain why they differ the way they do.
Optional (extra credit, turn in anytime before end of qtr.): Use Anders' HMM software to train an HMM for some sequence feature of biological interest and test your HMM on real data. One possibility would be implement the HMM for CpG islands discussed in the text and try it on parts of the human genome. Anders has the CpG island HMMs from the text and some labeled human genome data in the directory /cse/guests/krogh/book/ahmm/cpg. Look at the "README" file there. Another option would be to try to build an HMM to recognize genes or some other interesting feature of genomic DNA. You might do this for the C. elegans genome. All the data for this genome is in the directory /projects/compbio/data/celegans/sanger. The worm has 6 chromosomes, I, II, III, IV, V, and X. The file, e.g. IV.fa, contains the genomic DNA sequence for chromosome IV in FASTA format. The file IV.gff.gz contains the annotation for chromosome IV in GFF format. (This file is compressed with "gzip" and need to be "gunzip"ed before it can be read.) These GFF files have a lot of stuff in them, and the chromosome files are very large. I suggest you start by looking into the subdirectory "clean". Look at the "README". Here David Kulp has broken up the chromosome files into smaller pieces, and provided a simplified and cleaned-up GFF annotation for each piece. This is much more ambitious, more than a quarter project, but at least you have good labeled training and test data.
Questions regarding about page content should be directed to haussler@cse.ucsc.edu.
Last modified October 22, 1999.

Back to the CMP 243 Class Page.