CMP 243 Homework 1

Due: Monday Oct. 7

Using Internet Tools to Research a CASP2 protein sequence.

D1, D2, etc. are things to do, and Q1, Q2, etc. are questions that you must answer. Turn in your answers by the due date.


You just heard about the CASP2 protein structure prediction contest that is being held this year and want to get in on it at the last minute. So you go to the WWW class page for your Bioinformatics class and see if there is a link to the home page for this contest. You find one under the subpage "www resources for biosequence analysis".

D1: Go to the CASP2 homepage. Find the link to the page describing the target sequences that you are supposed to predict structure for. Go there. Set the table display to show the top 45 targets.

Q1: How many targets are there in the contest?

Notice that there are different categories of prediction, as were described on the home page. You decide to look at one target in Comparative modeling (category "C"), and one target in the Fold recognition ("threading") and Ab initio structure prediction categories (category "F/A"). For targets in category C, there is a clear and known similarity ("homology") between the amino acid sequence of the target protein and the amino acid sequence of one or more proteins of known structure. For targets in category F/A, there is no known or easily determined such similarity.

Many targets are expired, so it is too late to make predictions for them. However, you notice that predictions for target 25 in category C are due Oct. 15, and predictions for target 37 in category F/A are due October 14. This gives you about 2 weeks. You decide to focus on these.

D2: Get further information about targets 25 and 37 by clicking on them. You should save these files as "t25.doc" and "t37.doc" or something similar, for later easy reference. Here and below, I suggest you save files in text format.

In the remainder of this assignment we will work with target 25, but you are strongly encouraged to try these things for the harder prediction problem, t37, as well. Target 25 is the protein adrenodoxin, taken from a cow (more precisely, the mitochondria of bovine adrenal cortex).

Q2: What are the accession numbers listed for t25? Note that the database identified in the record is SWISS-PROT. These accession numbers are unique identifiers for protein sequence records in the SWISS-PROT database.

Note them.

D3: Go into Entrez (remember, you played with Entrez in homework assignment number zero) and search in the protein database on the first accession number that was listed for t25. Display the record in report format.

Q3: What is the SWISS-PROT name for the sequence? Note it. Is target 25 all of this sequence or just part of it?

Students are encouraged to look at some of the papers cited on this protein, although they will be difficult to follow for those students not familiar with these areas of biology and chemistry. Abstracts can be retrieved from MEDLINE by going back to the previous Entrez page and clicking on MEDLINE links. Note also that significant sites in the protein are listed in the file. The amino acids are numbered sequentially from 1 to 186, and the positions of various features, including the amino acids that bind the iron-sulfur cluster associated with the protein, are given as positions within this range. Also indicated are cases where experimental changes of one or more amino acids to other amino acids have disrupted the functionality of the protein (indicated by "conflicts"). Thus the amino acids in these positions are somehow important in giving the protein its structure and/or function.

D4: Now go back to the search page in Entrez and try searching on the second accession number for target 25. (Note: if you just type this one in, it adds it to your previous search query, giving you a search for a record that has both the first accession number AND the second accession number you just typed. In this case you just get the same record back. You must click on "clear all" first, before starting a new search.)

Q4: In what amino acid positions does ADX2_BOVIN differ from ADX1_BOVIN?
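If you would rather answer a question like Q4 by program than by eye, one option is to compare the two sequences position by position. Here is a minimal Python sketch. It assumes the two sequences have the same length and are already in register (no insertions or deletions). The function name and the two short strings are made up for illustration; they are not the real adrenodoxin sequences.

```python
def differing_positions(seq_a, seq_b):
    """Return (position, residue_a, residue_b) for each mismatch, numbering from 1."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
            if a != b]

# Toy sequences (hypothetical, not real data).
toy_a = "MKVLAAGC"
toy_b = "MKILAASC"
for position, res_a, res_b in differing_positions(toy_a, toy_b):
    print(position, res_a, res_b)
```

To use it on the real records, paste in the two sequences from the Entrez reports in place of the toy strings.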

D5: Now go back to the Entrez document summary page and click on the related protein sequences for ADX1_BOVIN. Here you see 64 other protein sequences that are similar to ADX1_BOVIN. Note the number of other ferredoxins, including human ferredoxin. Go back to the Entrez main browser page, and search the 3D-structure database for ferredoxin.

Q5: How many ferredoxin structures are listed?

The names 1DOX, 1DOY, 4FXC are identifiers for the Protein Data Bank (PDB) of records of known protein structures. Note, however, that many of these are minor variants of the same structure, sometimes obtained by changing some of the amino acids, and other times by crystallizing the protein in different conditions, or complexed with other molecules. You may recall that the t25.doc file said that target 25 was known to be homologous to a protein with known structure, and gave the example 1PUT. Notice that 1PUT is on this list. It is the protein putidaredoxin. You might save this list.

We'd like to know if any of these or other known structures have a strong sequence similarity with target 25. One simple thing we can do is to use the program BLAST to search the database of sequences that have known structures for sequences that are similar to target 25.

D6: Go to the Entrez BLAST database search page, read as much of the BLAST help page as you can stand to read, and then click on Basic BLAST. Choose the database to be PDB, to search the database of protein sequences with known structures. Select the blastp program, in order to match the target protein sequence to the protein sequences in PDB. Now display the t25.doc file you saved earlier in another window and use the cut-and-paste facility on your machine to paste in the t25 amino acid sequence. Click "Submit query." Look at the BLAST search results. Lo and behold, 1PUT is at the top of the list. Save this file.

Notice that this "hit" has a probability P(N) of 7.7e-17. This value is related to the probability of finding, merely by chance, a hit of this quality when searching a database the size of PDB with a sequence like t25. We'll talk about this more in class. The other hits have substantially higher probabilities of occurring by chance, and hence are not statistically significant matches.
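To get a feel for numbers like this, it can help to connect a probability to an expected count. Under the Poisson approximation often used for database search statistics, if E is the expected number of chance hits at least this good, then the probability of seeing at least one is P = 1 - exp(-E). The sketch below assumes that simple single-hit relationship. The P(N) reported by BLAST may combine several matching segments, so treat this as an illustration and not as BLAST's exact calculation. The function names are made up.

```python
import math

def p_from_e(e_value):
    """Probability of at least one chance hit, given the expected number of hits."""
    return -math.expm1(-e_value)   # 1 - exp(-E), computed accurately for tiny E

def e_from_p(p_value):
    """Expected number of chance hits corresponding to a probability P < 1."""
    return -math.log1p(-p_value)   # -ln(1 - P)

for e_value in (7.7e-17, 0.01, 1.0, 10.0):
    print(e_value, p_from_e(e_value))
```

For very small values P and E are nearly equal, which is why a probability like 7.7e-17 can be read as "essentially never by chance". Once E reaches 1 or more, P approaches 1 and no longer separates good hits from bad ones.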

BLAST shows the relationship that it found between pieces of t25 (called the "query" sequence) and corresponding pieces of 1PUT (called the "subject" sequence) in the lines below the score summary. This is called a sequence alignment.

Q6: What pieces of t25 did it find similar to what pieces of 1PUT, i.e. where are these pieces in the respective sequences?

These sequence comparisons show places where both sequences have the same residue in the same position, and also places where both sequences have similar residues in the same position. The latter are marked with a "+".

Q7: For the first similarity that BLAST finds:

Query:    41 DGFGACEGTLACSTCHLIFEQHIFEKLEAITDEENDMLD 79
             D  G C G+ +C+TCH+   +   +K+ A  + E  ML+
Sbjct:    34 DIVGDCGGSASCATCHVYVNEAFTDKVPAANEREIGMLE 72

For each match marked with "+", use the summary of amino acid properties from the Tooze handout to say 5-6 words on why these two amino acids are similar. Look at the important amino acids in target 25, as indicated in the SWISS-PROT file you saved for t25 by an iron-sulfur binding position or a "conflict". How are they changed in 1PUT, according to this alignment given by BLAST? Get the SWISS-PROT file for 1PUT (SWISS-PROT name: PUTX_PSEPU) and check its information for consistency with this alignment. Do the iron-sulfur binding positions of the two proteins agree with the alignment given by BLAST?

Notice that the BLAST hits to the other structures, besides 1PUT, are smaller, but some of them look convincing. Which do you think look convincing and which look like they may just be due to random chance? (You don't need to hand in an answer to this question.)

Well, now you are hot on 1PUT. Let's get this structure and look at it.

D7: Go to the home page for PDB, following the link from the WWW resources subpage of the class home page. Click on the 3DB browser. Search for 1put. When it retrieves the 1put record click on the button to retrieve the PDB file for 1put "complete with coordinates". Save the file as 1put.pdb (text format).

Now you have a file, 1put.pdb, that contains the 3D coordinates for the atoms in the 1put protein. Of course, reading this file is not much fun. If you have a favorite 3D structure viewer, use it. If you want to retrieve and install the free RASMOL 3D structure viewer, follow the instructions on the help page (click on the "?" at the top of the 3DB browser page). It is claimed that you can set it up to display structures automatically when searching Entrez or PDB. (I haven't done this.) However, as a shortcut, if you have your account on oink, moo or whatever barnyard animal, you can use our installed version of RASMOL:

/projects/compbio/bin/alpha/rasmol

To do this remotely from another color terminal running X on campus (not connected by modem!), do

xhost oink
rlogin oink
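(then log in and get the 1put.pdb file there)
setenv DISPLAY yourmachine:0
(here "yourmachine" is a placeholder for the name of the terminal you are sitting at, so that the display comes back to your screen)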
/projects/compbio/bin/alpha/rasmol 1put.pdb

The RASMOL program has on-line help. Unfortunately our installation does not seem to have it wired in (any systems people in the class: please help!). However, there is a documentation file /projects/compbio/doc/rasmol/manual.doc which you can read. This is the documentation for a slightly earlier release of RASMOL, but should still be helpful.

Look at the 1put structure from various angles. Try the different display options, including spacefill and cartoon. (Cartoon is easiest to look at.)

Q8: Does 1put contain any alpha helices or beta sheets? Can you see where the iron-sulfur cluster is bound?
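Besides looking at the picture, you can peek at the text of a PDB file with a few lines of code. The sketch below counts HELIX and SHEET records and tallies the residue names on HETATM lines (non-protein groups such as bound clusters or ligands), using the fixed column layout of the PDB format: record name in columns 1-6 and residue name in columns 18-20. The function name and the toy lines are made up, and I have not checked which of these records the real 1put.pdb contains. Some files carry no HELIX or SHEET records at all, in which case a viewer has to work out the secondary structure itself.

```python
from collections import Counter

def summarize_pdb(lines):
    """Count HELIX and SHEET records and tally HETATM residue names."""
    records = Counter()
    het_groups = Counter()
    for line in lines:
        record = line[:6].strip()                  # columns 1-6: record name
        records[record] += 1
        if record == "HETATM":
            het_groups[line[17:20].strip()] += 1   # columns 18-20: residue name
    return records["HELIX"], records["SHEET"], dict(het_groups)

# Toy lines (hypothetical, only the columns read above are filled in).
toy_lines = [
    "HELIX".ljust(80),
    "SHEET".ljust(80),
    "SHEET".ljust(80),
    "HETATM" + " " * 11 + "FES",
    "ATOM  " + " " * 11 + "CYS",
]
print(summarize_pdb(toy_lines))

# With the file saved in D7 you could instead write:
# with open("1put.pdb") as handle:
#     print(summarize_pdb(handle))
```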

Finally, let's look at an alignment of the 1put sequence with other sequences of related structures, and homologs of these sequences with unknown structure. Go directly to the DALI/FSSP structural alignment server at

http://www.embl-heidelberg.de/dali/fssp/fssp.html

This is a system written by Liisa Holm, who is one of our collaborators at EBI (European Bioinformatics Institute). Where it says "Enter PDB code or protein name to search for:" type in "1put". It finds 1put and gives a one-line description. Click on "1put" in this description. This brings up the FSSP alignment view for 1put. You see a list of PDB structures. For each PDB structure you see its four-letter PDB identifier, a "Z" score indicating how similar its structure is to that of 1put (a high Z score indicates close structural relationship), the length of the part of both sequences that can be aligned ("LALI"), the total length of the second sequence ("LSEQ2"), the percentage of amino acids in one that are the same as the corresponding amino acids in the other sequence ("%IDE"), and the common protein name for the second sequence.

D8: Select 1put and 1frd. Choose the view "multiple alignment view (sequence identity <50%)" and click display.

Here you see a multiple alignment of 1put (SWISS-PROT name: putx_psepu) and its sequence homologs, along with 1frd (SWISS-PROT name: ferh_anasp) and its sequence homologs. We will talk in class about how such alignments are produced in general. Generally, symbols like "-" or "." or "~" are used in such alignments as spacers in cases where one sequence lacks a residue in a position where another sequence has one, in order to keep the corresponding positions of the two sequences in register. Notice that adx2_bovin, a very close relative of t25, is listed as one of the homologs of 1put. You might want to save this alignment.
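To make the role of the spacer symbols concrete, here is a minimal Python sketch that takes two rows of an alignment, skips every column where either row has a spacer, and reports how many columns are aligned and what percentage of those are identical. This is one common way to define such numbers and is only assumed to resemble the LALI and %IDE columns. The server's exact definitions may differ. The function name and the toy rows are made up.

```python
GAP_CHARS = set("-.~")

def aligned_identity(row_a, row_b):
    """Return (aligned_length, percent_identity) for two rows of an alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    aligned = identical = 0
    for a, b in zip(row_a, row_b):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue                    # a spacer: one sequence has no residue here
        aligned += 1
        if a.upper() == b.upper():      # ignore case, since lower case has a special meaning
            identical += 1
    percent = 100.0 * identical / aligned if aligned else 0.0
    return aligned, percent

# Toy rows (hypothetical, not taken from the real alignment).
print(aligned_identity("MK-LVAGC", "MKTL.SGC"))
```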

Q9: There are two lower case d's (dd) in the sequence for adx2_bovin in this alignment. The HSSP convention is to use lower case letters to indicate that there are extra amino acids in this particular sequence at this particular place that are not being shown. So there must be other amino acids in adx2_bovin between these two aspartic acids (amino acid letter D). What are they?

Q10: Does the alignment between adx2_bovin and 1put given here agree with BLAST in the places that BLAST aligned t25 and 1put? If not, where is the disagreement?

At this point you are convinced that 1put and t25 have strong sequence similarity as well as functional similarity, and you have a good idea of what parts of the t25 sequence correspond to what parts of the 1put sequence. You can predict that the structure of t25 is similar to the structure of 1put, but with the amino acids of t25 substituted for corresponding amino acids in 1put, as given in the alignment, with these new amino acids still in roughly the same locations in 3D as the amino acids were in 1put. This is called predicting structure by homology. However, the devil is in the details. Could the locations of the amino acids in t25 be shifted somewhat, so that the overall protein can accommodate the changed amino acids? Do the side chains of the amino acids stick out in directions that are different from those they have in 1put? To make a prediction for t25 in the comparative modeling category, you would have to dig further to answer these questions, and come up with predicted 3D coordinates for the atoms of t25. Those who can, if any, should feel free to make such predictions. We'll find out the answer in a few months. For now though, these questions will be considered beyond our scope.
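The bookkeeping behind this idea can be sketched in a few lines of Python: given one aligned segment, build a table saying which template residue each target residue would be placed on. Target residues that fall opposite a gap get no entry, and those are exactly the places where the details have to be worked out some other way. This is only the first, easy step and not a modeling method. The function name and the toy rows are made up.

```python
def residue_map(t_start, target_row, m_start, template_row):
    """Map target residue number -> (target residue, template number, template residue)."""
    if len(target_row) != len(template_row):
        raise ValueError("alignment rows must have equal length")
    mapping = {}
    t_pos, m_pos = t_start, m_start
    for t, m in zip(target_row, template_row):
        if t != "-" and m != "-":
            mapping[t_pos] = (t, m_pos, m)
        # Advance a counter only when that sequence has a residue in this column.
        if t != "-":
            t_pos += 1
        if m != "-":
            m_pos += 1
    return mapping

# Toy segment (hypothetical): target starts at residue 10, template at residue 20.
for t_pos, placement in sorted(residue_map(10, "AC-DE", 20, "ACFD-").items()):
    print(t_pos, placement)
```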

The fold recognition/ab initio category of the contest is easier: all you need to report is an alignment of the target sequence to a sequence of known structure. No exact 3D coordinates are required. But the targets are harder! More on this when we work on target 37.

Optional things to do:

1. See how the 1frd structure is similar to 1put by displaying it in RASMOL as well.

2. Try to do similar analysis for t37, where there are no known similarities to proteins with known structure. If you find a good structure for it, we'll enter your prediction in the contest.


Questions about page content should be directed to haussler@cse.ucsc.edu.
Last modified September 30, 1996.
