CMP 243 Homework 1, Part 2

Due: Tues. Oct 5.

Introduction to protein database searching

In part 1 of this assignment we looked at the most famous tumor suppressor protein, p53. In 1997, a new human tumor suppressor protein, p73, was discovered whose sequence shows a relationship with p53 (Science, 12 Sept. 1997, vol. 277, pg. 1605). Details were published in the 22 August issue of Cell by Daniel Caput and friends. We want to see if, hypothetically, we could have found this new protein if it had just been deposited in one of the protein databases, by using bioinformatics tools to search these databases for sequences similar to p53.

The general task we are studying is referred to as searching for homologous proteins, that is, searching for proteins that may have similar structure or similar function to a given protein, perhaps because they are evolutionarily related. We will use the most popular tool for this: a basic BLAST search using the target sequence. Later we'll explore this and other methods in detail and talk about how they work.

p53 is our "query" or "target" sequence. We hope to use it to find the related protein p73. First let's get some information about p73, so we will know if we succeed. This makes it a better assignment, but of course, in practice, you don't usually already know what you are looking for.

  1. Go to Entrez
  2. Click on "Literature - PubMed"
  3. Change the "Search Field" to "Author Name". Enter "Caput D" and press search.
  4. Narrow the search by adding "Journal Name" "Cell" to the query.
  5. Retrieve the 2 documents.
  6. Click "Kaghad M, et al.", as this corresponds to the correct Cell issue.
  7. This gives the abstract to the Cell article. We're interested in the amino acid sequences, so click on the button "protein" at the top of the page.
  8. Click on "Display" to see the GenPept report for each relevant sequence.
  9. Check them out. These are the sequences that we want to try to find using computational methods, starting from the p53 family of tumor suppressors. You may wish to save these for later use.
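
The author-plus-journal search in the steps above can also be written as a single query URL. The sketch below only builds the URL and does not contact any server. It assumes NCBI's E-utilities `esearch` endpoint and the PubMed field tags `[Author]` and `[Journal]`, which are not part of this assignment; the function name `build_pubmed_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities "esearch" (not part of the original assignment).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query_url(author, journal):
    """Build (but do not fetch) a PubMed search URL for one author and one journal."""
    # Field tags in square brackets restrict each term to a single search field.
    term = f"{author}[Author] AND {journal}[Journal]"
    params = {"db": "pubmed", "term": term}
    return ESEARCH_URL + "?" + urlencode(params)


# Same search as steps 3 and 4: author "Caput D", journal "Cell".
url = build_pubmed_query_url("Caput D", "Cell")
print(url)
```

Pasting the printed URL into a browser should return a list of PubMed identifiers rather than the formatted pages described in the steps, so it complements the web interface rather than replacing it.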

The database we will search is the nonredundant protein database "nr", maintained by NCBI. Retrieve the p53_human sequence in FASTA format using Entrez and save it in a file or in a window. Go to the BLAST database search page, read as much of the BLAST documentation as you can stand to read, and then click on Advanced BLAST. Choose nr as the database. Select the blastp program, which searches a protein library with a protein sequence. Set the organism name to "Homo sapiens". This restricts BLAST to reporting only matches of your query sequence to human proteins. We'll look at the other advanced options later; just leave them at the defaults for now. Now paste in the p53 sequence as the query and click "Submit query." Look at the BLAST search results. You may want to save them, say as the file "p53.blast.hits".
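
If you save the FASTA file, it can help to check that it contains what you expect before pasting it into BLAST. The following is a minimal sketch of a FASTA reader, not a full parser. The function name `read_fasta` and the header line in the example are made up for illustration; the residues are only the 60-residue p53 fragment (positions 94-153) that appears in the alignment later in this assignment, not the full p53 sequence.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequence lines are joined without separators
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical header; the residues are a fragment of p53, split over two lines.
example = """>p53_fragment illustrative header, residues 94-153 only
SSSVPSQKTYQGSYGFRLGFLHSGTAKSVT
CTYSPALNKMFCQLAKTCPVQLWVDSTPPP
"""

records = read_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

A quick length check like this catches the most common copy-and-paste problem, which is losing part of the sequence.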

Q1: What does BLAST say was your query sequence (it should be p53_human), and how large does BLAST say the database is that it searched with your query sequence (number of sequences and total number of letters)? Which of the p73-like sequences that you found in the literature search above also appear on the list of possible homologs of p53 found by BLAST? (These are called "hits" to the database for your query sequence.)

There are three important criteria for evaluating BLAST "hits". The first is how much of the query sequence they match. Some hits only match a small portion of the query sequence. A graphic in the BLAST output shows what portion of the query sequence has a good match to one of the hits. You see that some hits match the whole query sequence, while some match only a small part. The second criterion is the E-value. Each hit is allocated a score. The E-value of a hit with score S is the number of times you would expect a hit with score >= S to occur simply by chance, when searching a database as large as the one you just searched. We will discuss the theory of such scores very soon. An overview is given in the "BLAST course" on the main BLAST web page. You are not required to read this yet; we will go over this later. The third criterion is how good the alignment between the hit reported by BLAST and the query sequence looks in light of your knowledge of proteins. The alignments for all the hits are reported at the bottom of the output. Here is the alignment and the E-value reported for the first p73 hit:

 dbj|BAA32433| (AB010153) p73H [Homo sapiens]
           Length = 586
           
 Score =  269 bits (681), Expect = 5e-72
 Identities = 130/262 (49%), Positives = 177/262 (66%), Gaps = 5/262 (1%)

Query: 94  SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPP 153
           S ++PS   Y G +   + F  S TAKS T TYS  L K++CQ+AKTCP+Q+ V + PP 
Sbjct: 68  SPAIPSNTDYPGPHSSDVSFQQSSTAKSATWTYSTELKKLYCQIAKTCPIQIKVMTPPPQ 127

Query: 154 GTRVRAMAIYKQSQHMTEVVRRCPHHE--RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNT 211
           G  +RAM +YK+++H+TEVV+RCP+HE  R  +   +APP HLIRVEGN   +Y++D  T
Sbjct: 128 GAVIRAMPVYKKAEHVTEVVKRCPNHELSREFNEGQIAPPSHLIRVEGNSHAQYVEDPIT 187

Query: 212 FRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFE 271
            R SV+VPYEPP+VG++ TT+ YN+MCNSSC+GGMNRRPIL I+TLE   G +LGR  FE
Sbjct: 188 GRQSVLVPYEPPQVGTEFTTVLYNFMCNSSCVGGMNRRPILIIVTLETRDGQVLGRRCFE 247

Query: 272 VRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNT---SSSPQPKKKPLDGEYF 328
            R+CACPGRDR+ +E+++RK+           TKR    NT     +   K++  D E  
Sbjct: 248 ARICACPGRDRKADEDSIRKQQVSDSTKNGDGTKRPFRQNTHGIQMTSIKKRRSPDDELL 307

Query: 329 TLQIRGRERFEMFRELNEALEL 350
            L +RGRE +EM  ++ E+LEL
Sbjct: 308 YLPVRGRETYEMLLKIKESLEL 329

The E-value is the quantity labeled "Expect". The "Query" sequence is p53 and the "Sbjct" sequence is p73. At every position in the alignment, the middle character is (1) an amino acid residue if there is a perfect match between query and subject at this position, (2) a "+" if there is not a match but the substitution is a common one, or (3) blank if the substitution is not common. The dash "-" is used to keep the two sequences in register. It indicates that amino acids may have been inserted or deleted at this point in one sequence relative to the other during evolution.
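
The "Identities", "Positives", and "Gaps" counts in the report follow directly from these three rows. The sketch below tallies them for the last block of the alignment above (query positions 329-350), assuming the three rows have been copied out as equal-length strings. The function name `summarize_alignment` is hypothetical, and this is an illustration of how to read the middle row, not BLAST's own code.

```python
def summarize_alignment(query, midline, subject):
    """Count identities, positives, and gaps from the three rows of an alignment block."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("all three rows must have the same length")
    # A letter in the middle row marks an exact match.
    identities = sum(1 for m in midline if m.isalpha())
    # A "+" marks a mismatch that is a common substitution.
    common_substitutions = midline.count("+")
    # A dash in either sequence marks an inserted or deleted position.
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return {
        "length": len(query),
        "identities": identities,
        # BLAST's "Positives" count includes the identities.
        "positives": identities + common_substitutions,
        "gaps": gaps,
    }


# Last block of the p53/p73 alignment shown above.
query   = "TLQIRGRERFEMFRELNEALEL"
midline = " L +RGRE +EM  ++ E+LEL"
subject = "YLPVRGRETYEMLLKIKESLEL"

summary = summarize_alignment(query, midline, subject)
print(summary)
# {'length': 22, 'identities': 11, 'positives': 16, 'gaps': 0}
```

Summing these counts over all five blocks of the alignment should reproduce the totals on the "Identities" line of the report.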

Q2: What properties of amino acids might explain why BLAST puts a "+" in each of the first 6 places where it puts a plus in this alignment? (Note: this is six different questions, and requires 6 different answers.) Does the overall alignment suggest to you that these two sequences might be evolutionarily related?

Q3: What is the E-value for this hit? In light of this E-value, do you worry that this hit may have happened merely by chance?
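
To get a feel for the scale of such numbers, the sketch below uses the standard relation between a bit score S' and an E-value, E = m * n * 2^(-S'), where m is the query length and n is the database length in letters. This is an approximation stated here as an assumption: BLAST itself uses "effective" lengths that are somewhat shorter than the raw ones, so its reported values will not match exactly. The function names are hypothetical.

```python
def expected_hits(bit_score, query_length, database_letters):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_letters * 2.0 ** (-bit_score)


def implied_search_space(bit_score, e_value):
    """Invert the same relation to get the search space m * n implied by a report."""
    return e_value * 2.0 ** bit_score


# Values taken from the p73 hit shown above: 269 bits, Expect = 5e-72.
space = implied_search_space(269, 5e-72)
print(f"implied search space m*n is about {space:.1e}")

# For a fixed score, doubling the database size doubles the expected number
# of chance hits.
print(expected_hits(50, 100, 2000) / expected_hits(50, 100, 1000))
```

The second print illustrates why the definition of the E-value refers to "a database as large as the one you just searched": the same score is less surprising in a larger database.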

Note that BLAST finds hits to many other human proteins as well.

Q4: Which of these hits do you believe reflect a genuine relationship, and which do you believe may have occurred merely by chance? Of the hits you believe are real, are all of them hits to what look like alleles of or pieces of p53 and p73 themselves, or does there appear to be another family of human proteins related to p53? (This last question is fairly subjective, and will be graded accordingly.)


Questions about page content should be directed to haussler@cse.ucsc.edu.
Last modified Oct 1, 1999.
