Each student will be assigned 2 proteins to work on---a short one and a medium-length one. The proteins may not be equally hard to do predictions on, and I don't have preliminary results on them, so proteins have been assigned randomly and some people may be luckier than others. The sequences for the proteins can be found in hw2.fasta in (naturally) FASTA format.
Student | Proteins
John Archie | PFI1610c and PF07_0073
Sylvia Do | PFE1350c and PF10_0077
Murat Iskar | PF07_0043 and PF13_0232
John Kim | PFC0775w and PF07_0081
Jonathan Magasin | MAL7P1_170 and PFC0365w
student who dropped, taken over by Kevin Karplus | PF13_0036 and PF07_0057
Daniel Sam | MAL7P1_81 and PF10_0084
Eric Scott | PFE0185c and PF10_0272
Chris Szeto | PF07_0084 and PF11_0142
Shyamini Vasili | PFF0090w and PF11_0098
Marcos Woehrmann | PF07_0080 and PFE0095c
All your work for a protein should be done in the directory created for it. Because not everyone in the class is a member of the unix group "protein", I made the directories publicly readable and writable. Such an open environment is a bit of a security risk: anyone in the School of Engineering can mess up your stuff. Those of you who are members of the "protein" user group can request that I remove public writability from the directory.
The first thing to do is to look up the amino-acid sequence for the protein you are assigned, and save it in FASTA format in a file with the name PFC0775w.a2m (or whatever the id is).
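One way to pull a single sequence out of hw2.fasta is sketched below in Python. It is only a sketch: it assumes that each header line starts with ">" and that the protein id is the first whitespace-delimited token on that line, which you should check against the actual file. The function names are made up for this example.

```python
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence id to sequence string."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the id is the first token after ">".
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def format_record(seq_id, sequence, width=60):
    """Format one sequence as FASTA text, wrapping residues at `width` columns."""
    lines = [">" + seq_id]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def save_protein(fasta_path, seq_id, out_dir="."):
    """Write the record for `seq_id` to <out_dir>/<seq_id>.a2m and return the path."""
    records = read_fasta(Path(fasta_path).read_text())
    if seq_id not in records:
        raise KeyError("no sequence named %s in %s" % (seq_id, fasta_path))
    out_path = Path(out_dir) / (seq_id + ".a2m")
    out_path.write_text(format_record(seq_id, records[seq_id]))
    return out_path
```

For example, save_protein("hw2.fasta", "PFC0775w") would write PFC0775w.a2m in the current directory, provided that id appears in the file.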
Create a file called "README" as a plain-text file that begins with the date and your name. All your notes on the prediction should be kept in this file. It should be updated frequently as you observe things, and past entries should not be edited (except for minor typo correction) or removed. If you find you have done something wrong, comment on it at the end of the file and go on from there. Treat the README file as you would a lab notebook, as an ongoing record of what was done, not a polished report.
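If you like to script your note-taking, the append-only habit can be enforced with a small helper like the one below. This is a sketch, not a required tool: the function name and the entry layout (date and name on one line, the note below it) are assumptions made for the example.

```python
import datetime


def append_entry(readme_path, author, note, today=None):
    """Append a dated entry to the end of the README without touching old entries."""
    today = today or datetime.date.today()
    entry = "{} {}\n{}\n\n".format(today.isoformat(), author, note.strip())
    # Mode "a" only adds to the end of the file (and creates it if missing),
    # so earlier notebook entries are never rewritten.
    with open(readme_path, "a") as handle:
        handle.write(entry)
    return entry
```

Calling append_entry("README", "Your Name", "Submitted the sequence to TMHMM.") adds one entry at the end of the file.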
Note: all your work will be visible on the web at http://www.soe.ucsc.edu/~karplus/malaria/ so be aware that your README "lab notebook" is a public document. Avoid obscenities and use full sentences.
Some of the proteins may be membrane proteins. Submit the sequence to TMHMM to get predictions of the transmembrane helices. Record in your README file where the transmembrane helices are, if any.
Find any easily annotated domains in your protein. This can be done by running your sequence through Pfam or RPSblast.
Our tools do not handle transmembrane regions well, so doing a full-protein prediction on these proteins is a waste of time. If you have a transmembrane domain, split your protein into separate domains and predict the inside and outside structure separately, in subdirectories.
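The bookkeeping for splitting around transmembrane helices can be sketched as follows. The helix positions are assumed to be typed in by hand from the TMHMM report as 1-based, inclusive (start, end) pairs, and the 30-residue cutoff for dropping short loops is an arbitrary choice for illustration, not a recommendation. Each region comes back as a (first, length) pair; whether the domain-splitting command counts residues from 1 is an assumption you should verify.

```python
def soluble_regions(protein_length, tm_helices, min_length=30):
    """Return (first, length) pairs for the regions outside transmembrane helices.

    Positions are 1-based and inclusive. `tm_helices` is a list of
    (start, end) pairs; regions shorter than `min_length` are dropped.
    """
    regions = []
    next_start = 1
    for start, end in sorted(tm_helices):
        length = start - next_start
        if length >= min_length:
            regions.append((next_start, length))
        # Skip past this helix (max() guards against overlapping helices).
        next_start = max(next_start, end + 1)
    tail = protein_length - next_start + 1
    if tail >= min_length:
        regions.append((next_start, tail))
    return regions
```

For a hypothetical 300-residue protein with helices at 101-123 and 140-162, this gives the regions starting at residue 1 (100 residues) and at residue 163 (138 residues); the 16-residue loop between the helices is dropped.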
Document what region of the protein you plan to predict and why you selected it in your README file.
Copy /cse/classes/bme220/Spring07/starter-directory/Makefile to your directory and change the TARGET macro from XXX0000 to PFC0775w (or whatever your protein is). If you run "make -k" in your directory, that would attempt to predict the whole protein. If you want a subdomain, edit the -first and -length fields of the split-into-domains command to select out the region you are interested in. Then running "make subdomain" should create a subdirectory for the domain of interest. Check that the domain is named consistently with what you thought you were trying to predict, to avoid wasting time on typos.
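Editing the TARGET macro by hand is easy, but if you prefer to script it, something like the following would do. It assumes the macro is defined on a single line of the form "TARGET = XXX0000" or "TARGET := XXX0000"; I have not reproduced the real Makefile here, so check that assumption before relying on the sketch. The function name is made up.

```python
import re


def set_target(makefile_text, protein_id, placeholder="XXX0000"):
    """Return the Makefile text with the TARGET placeholder replaced by protein_id."""
    # Assumption: the macro is defined as "TARGET = XXX0000" or "TARGET := XXX0000".
    pattern = re.compile(
        r"^([ \t]*TARGET[ \t]*:?=[ \t]*)" + re.escape(placeholder) + r"[ \t]*$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + protein_id, makefile_text)
    if count != 1:
        # Refuse to guess if the line is missing or was already edited.
        raise ValueError("expected exactly one TARGET line, found %d" % count)
    return new_text
```

Raising an error when the placeholder line is not found keeps the script from silently leaving an already-edited Makefile in an unknown state.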
Note: there will be a README file automatically created in the subdomain directory, but I want you to keep your notes in the whole-chain directory's README file.
Connect to the subdirectory and run
make -k >& make.log &
on one of the School of Engineering Linux machines (the Suns don't have all the tools installed or may have ancient versions). If your paths are set up correctly, this should go through the current automatic prediction process for your protein, which may take many hours, depending on which machine you run on.
Things may fail on your first run, with the most likely reason being calling for a program that is not on your default path. You can look at ~karplus/.cshrc for one way to make sure that your path includes /projects/compbio/bin, /projects/compbio/bin/scripts, /projects/compbio/experiments/models.97/scripts, and the appropriate machine-dependent subdirectory of /projects/compbio/bin. There are simpler ways to achieve this (my .cshrc file has accumulated some trash over the years), and you may want to look at some other .cshrc files of grad students in the lab before editing yours.
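Before starting a long run, you can check your path with a few lines of Python like the sketch below. It only tests the three directories named above; the machine-dependent subdirectory of /projects/compbio/bin is left out because its name depends on the machine, so add it to the list yourself. The function name is an invention for this example.

```python
import os

# Directories named in the instructions; add the machine-dependent
# subdirectory of /projects/compbio/bin for the machine you are on.
REQUIRED_DIRS = [
    "/projects/compbio/bin",
    "/projects/compbio/bin/scripts",
    "/projects/compbio/experiments/models.97/scripts",
]


def missing_from_path(path_value, required=REQUIRED_DIRS):
    """Return the required directories that do not appear in a PATH string."""
    entries = {entry.rstrip("/") for entry in path_value.split(os.pathsep) if entry}
    return [d for d in required if d.rstrip("/") not in entries]


if __name__ == "__main__":
    for directory in missing_from_path(os.environ.get("PATH", "")):
        print("not on PATH:", directory)
```

An empty result means all of the listed directories are on your path; it does not guarantee that every tool the Makefile calls is actually installed there.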
If something breaks, you can rerun make without having to redo all the things that did work correctly. You may have to remove files that were incorrect outputs of a failing process, to force them to get remade, but other than that, the Makefile should be safely re-runnable.
I strongly recommend looking over the starter-directory/Make.main file, so that you can get an understanding of the different steps of the protocol and what tools are run for each step. It is probably easiest to do this by following along as the summary.html file is created and reading the make.log file. The Make.main file may well be the most complicated makefile you'll ever encounter, but we have found it more useful to script complicated computer protocols as makefiles than as perl or csh scripts, because of the modularity and the ease of rerunning only the parts that need to be changed after a change in the protocol (or failure of some tool).
When the prediction is done, look through the summary.html file at all the things that were predicted. Summarize any interesting observations in the README file. Do you have a strong prediction (low E-value)? Are the secondary structure predictions consistent with the tertiary prediction? Do you have any residue-residue predictions? Are they consistent with the tertiary prediction?
Use Rasmol, PyMol, or another visualization tool to look at both the PFC0775w.undertaker-align.pdb.gz file, which has sidechain substitutions done for the top 5 alignments, and the decoys/PFC0775w.try1-opt2.pdb.gz file, which is the fully automatic tertiary prediction. If you use Rasmol, there is no need to ungzip the files, and there are several rasmol scripts to make viewing easier. For example,
rasmol decoys/PFC0775w.try1-opt2.pdb.gz script ehl2
will switch to cartoon view and color by predicted secondary structure. You can use this to do a quick check of consistency between the secondary and tertiary predictions. (Where they disagree, you may want to look at the sequence logos for the secondary structure prediction, and see how strong the predictions were there.)
I generally use several of the scripts in a single rasmol session to examine the prediction and look for flaws that need fixing.
One thing to look for in particular is whether the automatic structure prediction ended up looking like one of the top alignments. If not, it could mean that undertaker is drifting away from the best templates. There are various ways to try to fix this after it occurs, or to prevent it from happening, which we will discuss in class.
Make sure that all files in the directory and subdirectories are publicly readable, so that we can look at everything on the web. Run
find . -exec chmod og+rX '{}' \;
on each of the directories you started from to make everything
readable, including files in the subdirectories.
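To confirm that the chmod did what you wanted, a check along the following lines can list anything that is still not public. This is a sketch with a made-up function name: it reports files lacking group or other read permission and directories lacking group or other read and search permission, and it does not look at execute bits on ordinary files.

```python
import os
import stat


def not_public(root):
    """Return sorted paths under `root` that group or others cannot read.

    Directories must also be searchable (execute bit) by group and others.
    """
    wanted_file = stat.S_IRGRP | stat.S_IROTH
    wanted_dir = wanted_file | stat.S_IXGRP | stat.S_IXOTH
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    problems = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            continue  # for example a dangling symbolic link; nothing to check
        wanted = wanted_dir if stat.S_ISDIR(mode) else wanted_file
        if mode & wanted != wanted:
            problems.append(path)
    return sorted(problems)
```

Running not_public(".") in a protein directory after the find command should return an empty list.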
Send e-mail to gerloff and karplus telling us which proteins you have finished, so that we can look at the reports, the README files, and the predictions themselves.
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250