UCSC BME 220

Protein Bioinformatics, Spring 2007
Homework 2: structure prediction
Due Friday 4 May 2007, before class.

(Last Update: 11:58 PDT, 6 May 2007 )
The main purpose of this homework assignment is to get everyone in the class to go through the basic steps of structure prediction, as practiced here at UCSC. The protocol used is based on the one we used in the CASP7 community-wide experiment on protein structure prediction last summer.

Each student will be assigned 2 proteins to work on---a short one and a medium-length one. The proteins may not be equally hard to do predictions on, and I don't have preliminary results on them, so proteins have been assigned randomly and some people may be luckier than others. The sequences for the proteins can be found in hw2.fasta in (naturally) FASTA format.

Assignment of proteins to students

John Archie: PFI1610c and PF07_0073
Sylvia Do: PFE1350c and PF10_0077
Murat Iskar: PF07_0043 and PF13_0232
John Kim: PFC0775w and PF07_0081
Jonathan Magasin: MAL7P1_170 and PFC0365w
Student who dropped (taken over by Kevin Karplus): PF13_0036 and PF07_0057
Daniel Sam: MAL7P1_81 and PF10_0084
Eric Scott: PFE0185c and PF10_0272
Chris Szeto: PF07_0084 and PF11_0142
Shyamini Vasili: PFF0090w and PF11_0098
Marcos Woehrmann: PF07_0080 and PFE0095c

What to do

You will be assigned a number of proteins from Plasmodium falciparum to work with, each with an identifier like PFC0775w. I will create a directory for each prediction, with names like /projects/compbio/experiments/protein-predict/malaria/PFC0775w/

All your work for the protein should be done in the directory created for it. Because not everyone in the class is a member of the unix group "protein", I made the directories publicly readable and writable. An environment this open is a bit of a security risk---anyone in the School of Engineering can mess up your stuff. Those of you who are members of the "protein" user group can request that I remove public writability from the directory.

The first thing to do is to look up the amino-acid sequence for the protein you are assigned, and save it in FASTA format in a file with the name PFC0775w.a2m (or whatever the id is).
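One way to pull a single sequence out of hw2.fasta is sketched below. The function name extract_fasta is hypothetical, and the sketch assumes that the id is the first word after the ">" on each header line, which may not hold for every FASTA file. It is written for sh rather than csh.

```shell
# Hypothetical helper: print the one record whose header id matches $1
# from the multi-sequence FASTA file $2.
# Assumption: the id is the first whitespace-delimited word after '>'.
extract_fasta() {
    awk -v id="$1" '
        /^>/ { split(substr($0, 2), words, " "); keep = (words[1] == id) }
        keep { print }
    ' "$2"
}

# Example, run in the directory created for the protein:
#   extract_fasta PFC0775w hw2.fasta > PFC0775w.a2m
```

Keeping the header id identical to the file name matters, because the target name has to match the id of the sequence in the .a2m file used as a seed.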

Create a file called "README" as a plain-text file that begins with the date and your name. All your notes on the prediction should be kept in this file. It should be updated frequently as you observe things, and past entries should not be edited (except for minor typo correction) or removed. If you find you have done something wrong, comment on it at the end of the file and go on from there. Treat the README file as you would a lab notebook, as an ongoing record of what was done, not a polished report.

Note: all your work will be visible on the web at http://www.soe.ucsc.edu/~karplus/malaria/ so be aware that your README "lab notebook" is a public document. Avoid obscenities and use full sentences.

Some of the proteins may be membrane proteins. Submit the sequence to TMHMM to get predictions of the transmembrane helices. Record in your README file where the transmembrane helices are, if any.

Find any easily annotated domains in your protein. This can be done by running your sequence through Pfam or RPSblast.

Our tools do not handle transmembrane regions well, so doing a full-protein prediction on these proteins is a waste of time. If you have a transmembrane domain, split your protein into separate domains and predict the inside and outside structure separately, in subdirectories.

Document what region of the protein you plan to predict and why you selected it in your README file.
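Working out the stretches between transmembrane helices is simple arithmetic, sketched below for sh. The function name soluble_regions and its input format are assumptions for illustration: it takes the sequence length as an argument and reads one "start end" pair per helix from standard input, with residues numbered from 1, sorted and non-overlapping. You would copy those pairs by hand from the TMHMM results; this does not parse TMHMM output.

```shell
# Hypothetical helper: print "first length" for each non-membrane stretch.
# $1    = sequence length
# stdin = one "start end" pair per predicted helix (1-based, sorted)
soluble_regions() {
    awk -v seqlen="$1" '
        BEGIN { next_start = 1 }
        NF >= 2 {
            # Residues before this helix that are not yet accounted for
            if ($1 > next_start) print next_start, $1 - next_start
            next_start = $2 + 1
        }
        END {
            # Whatever follows the last helix
            if (next_start <= seqlen) print next_start, seqlen - next_start + 1
        }
    '
}

# Example: a 300-residue protein with helices at 50-72 and 120-142
#   printf '50 72\n120 142\n' | soluble_regions 300
```

Very short stretches are unlikely to be worth predicting separately, so treat the output as a list of candidates and record your actual choice and the reason for it in the README.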

Copy /cse/classes/bme220/Spring07/starter-directory/Makefile to your directory and change the TARGET macro from XXX0000 to PFC0775w (or whatever your protein is). If you run "make -k" in your directory, that would attempt to predict the whole protein. If you want a subdomain, edit the -first and -length fields of the split-into-domains command to select out the region you are interested in. Then running "make subdomain" should create a subdirectory for the domain of interest. Check that the domain is named consistently with what you thought you were trying to predict, to avoid wasting time on typos.
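If you would rather not edit the Makefile by hand, the placeholder can be replaced with sed. This is a sketch only: the function name set_target is hypothetical, and it assumes that the TARGET macro is defined on a line beginning with "TARGET", which you should confirm by looking at your copy of the Makefile. It is written for sh rather than csh.

```shell
# Hypothetical helper: replace the XXX0000 placeholder on the TARGET line.
# $1 = protein id (letters, digits, and underscores only)
# $2 = path to your copy of the Makefile
# Assumption: the macro definition starts at the beginning of a line.
set_target() {
    sed "/^TARGET/s/XXX0000/$1/" "$2" > "$2.new" && mv "$2.new" "$2"
}

# Example:
#   cp /cse/classes/bme220/Spring07/starter-directory/Makefile .
#   set_target PFC0775w Makefile
#   grep '^TARGET' Makefile     # check the result before running make
```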

Note: there will be a README file automatically created in the subdomain directory, but I want you to keep your notes in the whole-chain directory's README file.

Connect to the subdirectory and run

make -k >& make.log &
on one of the School of Engineering Linux machines (the Suns don't have all the tools installed or may have ancient versions). If your paths are set up correctly, this should go through the current automatic prediction process for your protein, which may take many hours, depending on which machine you run on.

Things may fail on your first run, with the most likely reason being calling for a program that is not on your default path. You can look at ~karplus/.cshrc for one way to make sure that your path includes /projects/compbio/bin, /projects/compbio/bin/scripts, /projects/compbio/experiments/models.97/scripts, and the appropriate machine-dependent subdirectory of /projects/compbio/bin. There are simpler ways to achieve this (my .cshrc file has accumulated some trash over the years), and you may want to look at some other .cshrc files of grad students in the lab before editing yours.
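Before starting a long run, it may be worth checking which of these directories are missing from your path. The sketch below is for sh, not csh, and the function name missing_path_dirs is hypothetical. It only reports what is missing; adding the directories is still done in your own startup file. The machine-dependent subdirectory is left out because its name depends on the machine you are using.

```shell
# Hypothetical helper: print each directory in $2.. that is absent from
# the colon-separated search path given as $1.
missing_path_dirs() {
    search_path=$1
    shift
    for dir in "$@"; do
        case ":$search_path:" in
            *":$dir:"*) ;;                 # already on the path
            *) printf '%s\n' "$dir" ;;     # missing: report it
        esac
    done
}

# Example:
#   missing_path_dirs "$PATH" \
#       /projects/compbio/bin \
#       /projects/compbio/bin/scripts \
#       /projects/compbio/experiments/models.97/scripts
```

No output means all of the listed directories are already on the path.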

If something breaks, you can rerun make without having to redo all the things that did work correctly. You may have to remove files that were incorrect outputs of a failing process, to force them to get remade, but other than that, the Makefile should be safely re-runnable.
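One symptom of a failed step that is easy to check for is a zero-length output file. The sketch below (sh, with the hypothetical function name list_empty_outputs) only lists such files. It is an assumption that an empty file indicates a failure: some steps may legitimately produce empty output, so compare the list against make.log before removing anything.

```shell
# Hypothetical helper: list zero-length regular files under directory $1
# (default: the current directory). Nothing is deleted.
list_empty_outputs() {
    find "${1:-.}" -type f -size 0 -print
}

# Example:
#   list_empty_outputs .
# then remove only the files you have confirmed are bad and rerun
#   make -k >& make.log &
```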

I strongly recommend looking over the starter-directory/Make.main file, so that you can get an understanding of the different steps of the protocol and what tools are run for each step. It is probably easiest to do this by following along as the summary.html file is created and reading the make.log file. The Make.main file may well be the most complicated makefile you'll ever encounter, but we have found it more useful to script complicated computer protocols as makefiles than as perl or csh scripts, because of the modularity and the ease of rerunning only the parts that need to be changed after a change in the protocol (or failure of some tool).

When the prediction is done, look through the summary.html file at all the things that were predicted. Summarize any interesting observations in the README file. Do you have a strong prediction (low E-value)? Are the secondary structure predictions consistent with the tertiary prediction? Do you have any residue-residue predictions? Are they consistent with the tertiary prediction?

Use Rasmol, PyMol, or another visualization tool to look at both the PFC0775w.undertaker-align.pdb.gz file, which has sidechain substitutions done for the top 5 alignments, and the decoys/PFC0775w.try1-opt2.pdb.gz file, which is the fully automatic tertiary prediction. If you use Rasmol, there is no need to ungzip the files, and there are several rasmol scripts to make viewing easier. For example,

rasmol decoys/PFC0775w.try1-opt2.pdb.gz
script ehl2
will switch to cartoon view and color by predicted secondary structure. You can use this to do a quick check of consistency between the secondary and tertiary predictions. (Where they disagree, you may want to look at the sequence logos for the secondary structure prediction, and see how strong the predictions were there.)

I generally use several of the scripts in a single rasmol session to examine the prediction and look for flaws that need fixing.

One thing to look for in particular is whether the automatic structure prediction ended up looking like one of the top alignments. If not, it could mean that undertaker is drifting away from the best templates. There are various ways to try to fix this after it occurs, or to prevent it from happening, which we will discuss in class.

What to turn in

Create a file PFC0775w-report.pdf (or whatever your protein id is) summarizing what you found and put it in the directory. There may be pictures or citations in the report that aren't in the README file, but there should be no new observations---make all your notes in the README file first!

Make sure that all files in the directory and subdirectories are publicly readable, so that we can look at everything on the web. Run

	find . -exec chmod og+rX '{}' \;
on each of the directories you started from to make everything readable, including files in the subdirectories.
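To confirm that the chmod worked, you can list anything that is still not readable by group and others. This is a sketch for sh with a hypothetical function name, unreadable_entries. It tests the permission bits with octal masks: 044 for group and other read, and 011 for the execute (search) bits that directories also need.

```shell
# Hypothetical helper: list entries under $1 (default: current directory)
# that lack group/other read permission, plus directories that lack
# group/other execute permission. No output means everything is readable.
unreadable_entries() {
    find "${1:-.}" \( ! -perm -044 -o \( -type d ! -perm -011 \) \) -print
}

# Example:
#   find . -exec chmod og+rX '{}' \;
#   unreadable_entries .
```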

Send e-mail to gerloff and karplus telling us which proteins you have finished, so that we can look at the reports, the README files, and the predictions themselves.


Notes added after assignment due

It looks like I forgot to warn students that the target name had to match the id of the sequence in the .a2m file used as a seed. At least one student ran into trouble because the sequence name did not match the target name.

Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250