The main purpose of this homework assignment is to get everyone in the class to go through the basic steps of structure prediction, as practiced here at UCSC. The protocol used is based on the one we used in the CASP8 community-wide experiment on protein structure prediction last summer.
In past years I have collected sequences of proteins and assigned them (usually) randomly to students. This year, I want to try a somewhat different, and more realistic approach. When I ask researchers what proteins they want predictions for, they rarely give me a sequence. Most often they give me a name for a protein (or, even more frequently, a family of proteins). For example, this year, two researchers responded to my request for proteins of interest.
One query:
We are
working on a project in our lab that involves a new class of signaling
proteins called the beta-defensins. Many of these proteins primary
sequences have been predicted via genomic searches but only a
fraction of them have assigned tertiary structures. If you would like
we can provide you some of these sequences for your class and perhaps
our lab can learn something from the work your students will perform
on them, let me know if you are interested and I can send you more
details, thanks!
According to Wikipedia there are thirteen human proteins containing the beta-defensin domain, though the authors of that article do not say how they determined that. They do list PDB files with the structure and give a link to Pfam, which claims 22 human sequences with the PF00711 domain. One can go to the Pfam tree view, click on the Homo sapiens link, and download the sequence accessions (Uniprot ids) for the sequences. This does not give the full-length sequences, though, just the match to the PF00711 hidden Markov model, so you need to go to Uniprot to download the full-length sequences. There are undoubtedly many other ways to get the sequences, but the Uniprot ids provide good short identifiers for the proteins.
A different query:
1 converts 1-phosphoglucose to 6-phosphoglucose and has
relevance to diabetes. We plan screening for inhibitors
2 converts 1-phosphoribose to 5-phosphoribose also may have
relevance to diabetes.
1. phosphoglucomutase mouse and human
2. phosphopentomutase mouse and human
Again, Wikipedia gives information for phosphoglucomutase and phosphopentmutase, those these articles are less well developed than the beta-defensin article. There are structures for both families, so homology modeling would be fairly straightforward. The proteins are fairly large, though, so I decided to assign the beta-defensins for homework, and not the phosphoglucomutase and phosphopentomutase.
I created a directory /projects/compbio/experiments/protein-predict/beta-defensin and downloaded the Pfam list of Uniprot ids. I also created a subdirectory for each uniprot id. I documented this in the README file for the beta-defensin directory. There is a symbolic link, so that this directory is available on the web also.
Since there are 22 beta-defensins and 4 students, I will assign each of you 5 or 6 beta defensins to work on. You only need to do one or two for the assignment, but if you have time, it would be good to get them all done. Once you have done one, repreating the procedure for the others can be pretty quick, as these are short proteins.
| Kwok | Rachel | Jeff | Michael |
|---|---|---|---|
| P60022 | Q9H4P9 | Q5U7J2 | Q86SQ8 |
| Q30KQ9 | O15263 | A1A581 | A6NNS8 |
| A6NEH9 | Q30KQ2 | Q8NET1 | Q30KR0 |
| Q8NG35 | Q30KQ6 | A6NGW9 | A6NDD2 |
| Q546A0 | Q6ZYB2 | A6NIY0 | Q30KP8 |
| P81534 | Q30KR1 |
All your work for the protein should be done in the directory created for it. I have asked the sysadmins (cluster-admin) to add each of you to the "protein" group, so that you should be able to read and write files in the directories.
The first thing to do is to look up the amino-acid sequence for the protein you are assigned, and save it in fasta format in a file with the name XYZ1234.a2m (or whatever the id is). The TARGET name in the Makefile has to match the id of the sequence in the .a2m file used as a seed, as well as the file name. In past years, at least one student ran into trouble because the sequence name did not match the target name. For this assignment, if the directory is XYZ1234, the TARGET macro should be XYZ1234, the seed file should be XYZ1234.a2m, and the id of the sole sequence in the seed should be XYZ1234.
Run "make -k" in your directory, to attempt to predict the whole protein. The first thing this does is to create a file called "README" as a plain-text file. Add your name and the date to this file! All your notes on the prediction should be kept in this file. It should be updated frequently as you observe things, and past entries should not be edited (except for minor typo correction) or removed. If you find you have done something wrong, comment on it at the end of the file and go on from there. Treat the README file as you would a lab notebook, as an ongoing record of what was done, not a polished report.
Note: all your work will be visible on the web, so be aware that your README "lab notebook" is a public document. Avoid obscenities and use full sentences.
Find any easily annotated domains in your protein. This can be done by running your sequence through Pfam and RPSblast. (There are other tools for domain identification also.) Since the protein list was selected by Pfam this year, you should get some obvious hits.
This year, none of the proteins are membrane proteins, so the usual advice to submit the sequence to TMHMM to get predictions of the transmembrane helices is probably irrelevant. Claims have been made that Phobius and HMM_RA perform better than TMHMM. Note: that HMM_RA does not appear to be available either as a download or as a web service, though.
However, the proteins are almost certainly secreted proteins, so you want to run them through SignalP to look for signal peptides.
Our tools do not handle signal peptides transmembrane regions well, so doing a full-protein prediction on these proteins is probably a waste of time. If you have a transmembrane domain, split your protein into separate domains and predict the inside and outside structure separately, in subdirectories. If you have signal peptide, do a subdomain prediction for the part of the protein after the cleavage site.
Document what regions of the protein you plan to predict and why you selected them in your README file.
If you want a subdomain, edit the -first and -length fields of the split-into-domains command to select out the region you are interested in. Then running "make subdomain" should create a subdirectory for the domain of interest. Check that the domain is named consistent with what you thought you were trying to predict, to avoid wasting time on typos. I usually make a separate target with a new name in the Makefile for each subdomain I'm interested in doing predictions for.
Note: there will be a README file automatically created in the subdomain directory, but I want you to keep your notes in the whole-chain directory's README file.
Connect to the subdirectory and run
make -k >& make.log &on one of the School of Engineering Linux machines (the Suns don't have all the tools installed or may have ancient versions). If your paths are set up correctly, this should go through the current automatic prediction process for your protein, which may take many hours, depending which machine you run on. It is probably best to use one of the bioinformatics lab computers, since we use them frequently and make sure that the software we need is installed on them. Other SoE machines may be missing some shared libraries or other essential software.
Things may fail on your first run, with the most likely reason being calling for a program that is not on your default path. You can look at ~karplus/.cshrc for one way to make sure that your path includes /projects/compbio/bin, /projects/compbio/bin/scripts, /projects/compbio/experiments/models.97/scripts, and the appropriate machine-dependent subdirectory of /projects/compbio/bin. There are simpler ways to achieve this (my .cshrc file has accumulated some trash over the years), and you may want to look at some other .cshrc files of grad students in the lab before editing yours.
If something breaks, you can rerun make without having to redo all the things that did work correctly. You may have to remove files that were incorrect outputs of a failing process, to force them to get remade, but other than that, the Makefile should be safely re-runnable.
I strongly recommend looking over the casp8/starter-directory/Make.main file, so that you can get an understanding of the different steps of the protocol and what tools are run for each step. It is probably easiest to do this by following along as the summary.html file is created and reading the make.log file. The Make.main file may well be the most complicated makefile you'll ever encounter, but we have found it more useful to script complicated computer protocols as makefiles than as perl or csh scripts, because of the modularity and the ease of rerunning only the parts that need to be changed after a change in the protocol (or failure of some tool).
When the prediction is done, look through the summary.html file at all the things that were predicted. Summarize any interesting observations in the README file. Do you have a strong prediction (low E-value)? Are the secondary structure predictions consistent with tertiary prediction? Do you have any residue-residue predictions? Are they consistent with the tertiary prediction?
Use Rasmol, PyMol, or other visualization tool to look at both the XYZ1234.undertaker-align.pdb.gz file, which has sidechain substitutions done for the top 5 alignments, and the decoys/XYZ1234.try1-opt2.pdb.gz file, which is the fully automatic tertiary prediction. If you use Rasmol, there is no need to ungzip the files, and there are several rasmol scripts to make viewing easier. For example,
rasmol -script ehl2 decoys/XYZ1234.try1-opt2.pdb.gzwill switch to cartoon view and color by predicted secondary structure. You can use this to do a quick check of consistency between the secondary and tertiary predictions. (Where they disagree, you may want to look at the sequence logos for the secondary structure prediction, and see how strong the predictions were there.)
I generally use several of the scripts in a single rasmol session to examine the prediction and look for flaws that need fixing.
One thing to look for in particular is whether the automatic structure prediction ended up looking like one of the top alignments. If not, it could mean that undertaker is drifting away from the best templates. There are various ways to try to fix this after it occurs, or to prevent it from happening, which we will discuss in class.
There is a lot of advice about what to do next for a prediction in the CASP7 README file (/projects/compbio/experiments/protein-predict/casp7/README or on the web at http://www.soe.ucsc.edu/~karplus/casp7/README). The CASP8 README file is not as complete, but may have some more up-to-date ideas.
Even if you don't get a complete fold for a protein, you may get some subdomains that look pretty good, with a compact structure and a good match in secondary structure. If you have anything that looks promising, you can submit the structure to a structure-based search engine (I usually use VAST), to find out what sorts of proteins the pieces resemble. This may provide some hints about possible functions of the target proteins.
The beta-defensins are held together by disulfide bonds, so it is
probably worthwhile, after the initial run on the secreted region, to
make a try2.costfcn by copying the try1.costfcn and adding
ConstraintSet known_ssbond SSBond C46 C49to the beginning of the file, and
known_ssbond 5to the SetCost command. Of course, the cysteine identifiers should be replaced by the correct pairs for the protein you are working on (which you probably can figure out by looking at the automatically produced try1 model). You can have multiple SSBond commands in the constraint set, since defensins usually have multiple disulfides.
To run a prediction (after the automatic try1 prediction) using the
try2.costfcn cost function file, use the command
(make XYZ1234.do2 >& do2.log; gzip -9f do2.log)&
Make sure that all files in the directory and subdirectories are publicly readable, so that we can look at everything on the web. Run
find . -exec chmod og+rX '{}' \;
on each of the directories you started from to make everything
readable, including files in the subdirectories.
Send e-mail to me telling me which proteins you have finished, so that I can look at the reports, the README files, and the predictions themselves.
|
|
| BME 220 home page | Karplus's lab page | UCSC Bioinformatics research |
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building