CMP 243 Homework 1

Due: Tues. Jan. 20

Using Internet Tools to Research a Protein Sequence.

D1, D2, etc. are things to do, and Q1, Q2, etc. are questions that you must answer. Turn in your answers by the due date.


A few years ago the tumor protein, p53, was elected molecule of the year. Recently a relative, p73, got a lot of attention. (Don't you love these names?) In this assignment we'll get some exposure to some of the key bioinformatics tools and databases on the web by exploring p53. In the next assignment we'll look for p73.

D1: Go into Entrez (remember, you played with Entrez in homework assignment number zero) and search in the protein database for p53. I suggest you specify protein name "p53" and organism "human" (i.e. p53 [Protein Name] AND human [Organism]). If you just specify protein name = p53, you get p53 from the crab-eating macaque and many other interesting creatures as well. Find the Entrez record for human p53 and display it in GenPept report format.
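If you would rather script the search than type it into the web form, the fielded query above can be assembled as follows. This is only a sketch: the E-utilities address and the helper names entrez_term and esearch_url are my assumptions, not part of the assignment, and the code builds the search URL without fetching anything.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities search endpoint (not mentioned in the assignment).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(protein, organism):
    """Join the two fielded terms with AND, in the form used in D1."""
    return f"{protein} [Protein Name] AND {organism} [Organism]"


def esearch_url(term, db="protein"):
    """Build (but do not fetch) a search URL for the given Entrez database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = entrez_term("p53", "human")
print(term)               # p53 [Protein Name] AND human [Organism]
print(esearch_url(term))  # the same query, URL-encoded
```

Dropping the organism term from the query is what lets the macaque and friends back in.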

Q1: How many amino acids are there in the human p53 protein, and what is its SWISSPROT ID? (I don't want the SWISSPROT Accession number.) Note: Here and below, the info I am asking for might not all be on the page where you are. Also, when there is a conflict, believe SWISSPROT. Find an mRNA (possibly a fragment) for p53 and give me its GenBank Accession number and length.

You are encouraged to look at some of the papers cited on this protein. Abstracts can be retrieved via the MEDLINE links, but there are a lot of them! Read the summary of p53's properties on its SWISSPROT record (by now you should have found its SWISSPROT record). Read about it in any recent molecular biology text. I'll try to put a paper about p53 in the file drawer as well.

D2: Go to the bottom of the SWISSPROT record for the human p53. Find the "DOMAIN" fields. Proteins generally consist of many parts, each of which more-or-less would fold the same way by itself, without the rest of the protein, and which often have different functions. These are called "domains" within the protein. The residues (amino acids) in a protein are numbered from 1 to n, and domains of the protein can be identified by subranges. (Sometimes a single domain can consist of more than one subrange, but that does not happen with p53. Also: different databases may use somewhat different definitions of the domains. The domains listed in this SWISSPROT record are out-of-date; since this record was made we have learned a great deal about the structure and function of p53, and the accepted definitions of the domains have changed.)

Q2: Find the substitutions listed at the bottom of the SWISSPROT page. One such substitution is R -> L in position 110. This means that a variant version of human p53 is known with leucine replacing arginine in position 110. It is associated with a tumor. What kind?

Find the entry on the SWISSPROT page for the PRODOM database and click on "domain structure". PRODOM uses an automated computational method to discover domains within protein sequences, and then shows a linear coordinate map of the protein sequence with the domains it has found colored in different patterns. Note that PRODOM's domains are quite different from those identified in the SWISSPROT record. To explore the PRODOM domain including position 110 of p53, click on the row of green boxes (these are the coloring for a PRODOM domain from position 94 to 363). This domain includes a core region of p53 that binds to a specific target signal in DNA.

The detailed PRODOM record for this domain gives a multiple alignment of this piece of p53 (from position 94 to 363) with corresponding pieces of p53 proteins from other organisms. We will talk in class about how such alignments are produced; there are many methods. Generally, symbols like "-" or "." or "~" are used in such alignments as spacers in cases where one sequence lacks a residue in a position where another sequence has one, in order to keep the corresponding positions of the two sequences in register. PRODOM uses "-". Scan across the alignment and you will see instances of this. Find the arginine at position 110 of human p53. What other amino acids occur in this position in the other organisms listed in this multiple alignment? (list them). These amino acid substitutions probably do not disrupt p53's function, since they are tolerated in these other organisms. However the SWISSPROT file for human p53 lists 3 tumor-associated substitutions for position 110. Presumably these are disruptive. Is there an amino acid property that distinguishes the (presumably) disruptive from the (presumably) non-disruptive substitutions? Which property?
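Because of the spacers, a column number in the alignment is not simply an offset into the protein. A minimal sketch of the bookkeeping, assuming 1-based columns and the spacer symbols mentioned above (the function name residue_number and the toy row are made up for illustration, not taken from the PRODOM record):

```python
GAP_CHARS = "-.~"  # spacer symbols commonly used in alignments


def residue_number(aligned_row, column, first_residue=1):
    """Return the residue number found at a 1-based alignment column.

    first_residue is the number of the first residue in this row
    (94 for the piece of human p53 in this alignment).
    Returns None if the column holds a spacer in this row.
    """
    if not 1 <= column <= len(aligned_row):
        raise IndexError("column outside the alignment")
    if aligned_row[column - 1] in GAP_CHARS:
        return None
    gaps_before = sum(ch in GAP_CHARS for ch in aligned_row[:column - 1])
    return first_residue + (column - 1) - gaps_before


row = "SQ--KTYQ"  # hypothetical aligned fragment, not real p53 data
print(residue_number(row, 2, first_residue=94))  # 95
print(residue_number(row, 3, first_residue=94))  # None (a spacer)
print(residue_number(row, 5, first_residue=94))  # 96, two spacers skipped
```

Every spacer to the left of a column shifts the residue numbering back by one, which is why column and position drift apart as you scan across.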

Now scan over to near position (column) 150 in this multiple alignment. (This corresponds to about position 250 in human p53, since this piece of human p53 we are looking at in this alignment starts at position 94 in p53.) Find the sequence PPEVGSDY in the first protein sequence of the alignment; it lies between columns 130 and 140. For each of these 8 columns, look at the variations you see in the amino acids in that column, and list the amino acid properties, if any, that the variants/substitutions seen in that column have in common. The idea is to see if you can determine why these substitutions might be tolerated without disrupting function, but you don't need to go into great depth here!
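If you want to check your column-by-column reasoning mechanically, here is a rough sketch. The property groups are one common coarse classification that I am assuming for illustration; textbooks differ on the details, and the names PROPERTY_GROUPS and shared_properties are hypothetical.

```python
GAP_CHARS = "-.~"

# Assumption: one coarse grouping of amino acids by property.
# Other classifications exist and are equally defensible.
PROPERTY_GROUPS = {
    "hydrophobic": set("AVLIMFW"),
    "aromatic": set("FWY"),
    "polar uncharged": set("STNQYC"),
    "positive": set("KRH"),
    "negative": set("DE"),
    "small": set("GAS"),
}


def shared_properties(column):
    """List the property groups that contain every residue in a column."""
    residues = {ch for ch in column.upper() if ch not in GAP_CHARS}
    if not residues:
        return []
    return sorted(name for name, members in PROPERTY_GROUPS.items()
                  if residues <= members)


# Toy columns, not taken from the p53 alignment.
print(shared_properties("DE-"))  # ['negative']
print(shared_properties("FW"))   # ['aromatic', 'hydrophobic']
print(shared_properties("LD"))   # []
```

An empty list means that, under this particular grouping, the residues seen in the column have nothing obvious in common.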

D3: Now note that in the region from column 145 to 155 the amino acids are all perfectly conserved: no substitutions are observed in the corresponding positions in proteins from related species. This indicates that this region may be very critical to the structure/function of p53. Indeed, you should go back to the SWISSPROT record and note the many disease-related variants that arise from substitutions in this region. On the last protein sequence, note that the short "motif" MCNSSCMGGMNRRP is a live web link. Click on it. You are now at a PROSITE record. PROSITE is a database of "signature patterns" that are sometimes used to search a protein database for a protein of a specific type. In this case, the PROSITE database suggests that you should search for the string "MCNSSCMGGMNRR" if you want to find proteins in the p53 family. For other families the PROSITE patterns are more complex, involving possible substitutions and variable spacings between different parts of the pattern. Amos Bairoch and his collaborators have put a lot of effort into finding patterns that retrieve as many of the actual members of the protein family as possible, without turning up any "false positives", i.e. proteins that match the pattern but don't belong to the family. (You can see on the PROSITE record that they have tested this pattern and claim it returns 20 members of the family with no false positives, but the database is growing all the time, and who is to say that it is not missing some distant relatives of p53?) Database searching is an important issue: this is how many discoveries are made. We'll return to it in future assignments, where we will introduce more powerful methods.
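To get a feel for how such a signature search works, here is a small sketch that turns a PROSITE-style pattern into a regular expression and scans a sequence with it. It handles only the basic notation (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, and the end anchors). The second pattern and both test sequences are made up for illustration; they are not real PROSITE entries or real proteins.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} = anything but P
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 repeats
        element = element.replace("<", "^").replace(">", "$")   # sequence-end anchors
        parts.append(element)
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return the 1-based start positions of all pattern matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() + 1 for m in regex.finditer(sequence)]


# The plain string quoted in D3, written with PROSITE's dash separators.
p53_like = "M-C-N-S-S-C-M-G-G-M-N-R-R"
print(prosite_to_regex(p53_like))                    # MCNSSCMGGMNRR
print(find_matches(p53_like, "AAMCNSSCMGGMNRRPAA"))  # [3]

# Hypothetical pattern with a substitution set and variable spacing.
print(find_matches("C-x(2,4)-[ST]-{P}-H", "GGCAASGHGG"))  # [3]
```

A pattern that is too permissive turns up false positives, and one that is too strict misses distant relatives, which is exactly the trade-off described above.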

D4: Now go back to the SWISSPROT record for human p53 and find the list of PDB entries. Click on "ENTRY" under 1TSR. PDB is the Protein Data Bank, a repository of protein structures solved by X-ray crystallography or by NMR. Each solved structure has a 4-character identifier. This is the PDB record for 1TSR. There is a lot to explore from here, but first note that there is a link to MMDB, Entrez's structure database. Follow this link to Entrez and on from there to retrieve the MEDLINE abstract for the paper in Science that describes the 1TSR structure. (I'll try to put a copy of the full paper in the file drawer for those who are interested.) Notice that 1TSR is the structure for the core DNA-binding domain of the protein (here defined as residues 102-292) bound to a piece of DNA. p53 is a DNA-binding protein that can influence another protein by binding in front of (i.e. on the 5' side of) its gene and thereby altering the way the gene for that other protein is transcribed, in this case by causing it to make more copies of the protein. In this structure, p53 is "caught in the act", so to speak.

Now go back to the PDB page for 1TSR. Under "data retrieval", you can click to download the full 3D coordinates of all the atoms in the 1TSR structure. If you have access to a program for viewing PDB structures, download this file and take a look, noting some of the areas discussed below. If you don't have access to a structure viewer, you can use the free RASMOL viewer we have installed on the barnyard machines in the CS department. (You'll need to make sure you have an account on these machines; see homework zero.) I have already downloaded the full coordinate file for 1TSR and put it in the file /projects/compbio5/data/pdb/1tsr.pdb. The actual RASMOL program is called /projects/compbio/bin/alpha/rasmol on the barnyard machines. To run RASMOL remotely from another color terminal running X on campus (not connected by modem!), do

xhost oink
rlogin oink
(then log in and do the following, replacing yourmachine with the name of the machine you are sitting at)
setenv DISPLAY yourmachine:0
/projects/compbio/bin/alpha/rasmol /projects/compbio5/data/pdb/1tsr.pdb

The RASMOL program has on-line help, but unfortunately our installation does not seem to have it wired in. Instead, open the file /projects/compbio/doc/rasmol/rasmol.html using your browser and you will get a RASMOL reference manual.

There are two windows when running RASMOL, a display window that contains the graphic image and a separate command window. To experiment, set the display in the display window to various modes (cartoon is a good place to start and spacefill is fun). Rotate the molecule by using the sliders on the edges of the display window.

The DNA double helix is visible in yellow, and you can see 3 separate protein chains, all of similar structure. Each of these is a p53 core domain. They are chains A, B, and C. Color chain B red by typing the following commands in the command window:


select *:B
color red

To see particular residues, say residue 110 (arginine) in chain B, type

select 110B
color yellow

This is the residue you were looking at previously in the PRODOM multiple alignment and in the SWISSPROT file. Another good one is residue 248 in chain B. Select that one and color it green. Under the "options" menu on the display window, choose "label". This shows the amino acid name for the residue(s) you just selected. I suggest you choose "spacefill" under the display menu as well, so you can see the whole shape of this amino acid. To zoom in to get a closer look, rotate the structure to center this residue, and type

zoom 400
This magnifies 400%.
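The chain letters and residue numbers you type into RASMOL come straight from fixed columns of the ATOM records in the coordinate file. Here is a minimal sketch of reading them yourself, assuming the standard PDB column layout; the helper toy_atom_line and the two toy records are made up so the example runs without the real file (only ARG at position 110 of chain B comes from the text above).

```python
def toy_atom_line(serial, atom, residue, chain, number):
    """Build a shortened ATOM record (hypothetical helper for this demo only)."""
    return f"ATOM  {serial:5d}  {atom:<3s} {residue:>3s} {chain}{number:4d}"


def parse_atom_line(line):
    """Return (chain ID, residue number, residue name) from one ATOM record."""
    # 0-based slices for PDB columns 22, 23-26 and 18-20 respectively.
    return line[21], int(line[22:26]), line[17:20].strip()


def residues_by_chain(lines):
    """Map each chain ID to a {residue number: residue name} dictionary."""
    chains = {}
    for line in lines:
        if line.startswith("ATOM"):
            chain, number, name = parse_atom_line(line)
            chains.setdefault(chain, {})[number] = name
    return chains


toy_records = [
    toy_atom_line(1, "CA", "GLY", "A", 109),  # made-up residue
    toy_atom_line(2, "CA", "ARG", "B", 110),  # residue 110 of chain B, as above
]
chains = residues_by_chain(toy_records)
print(sorted(chains))    # ['A', 'B']
print(chains["B"][110])  # ARG
```

To try it on the real structure, pass the lines of /projects/compbio5/data/pdb/1tsr.pdb to residues_by_chain instead of the toy records.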

Q3: What residue is at position 248? Where is it in the structure relative to the DNA sequence? This residue was in the critical, totally conserved region of p53, used by the PROSITE pattern. Why do you think it is so important? (Open-ended, extra credit: investigate this or any other significant tumor-associated locus and use what you know to explain why it might disrupt the function or structure of the protein. There is a web page of p53 variants. Also, explore the SCOP and DALI databases, starting from the 1TSR page.)


Questions about page content should be directed to haussler@cse.ucsc.edu.
Last modified Jan 15, 1998.
