Project Ideas

Latest additions


Protein Structural Similarity:

What's the significance of detecting similar substructures in two different proteins? In combination with sequence or functional similarities, such similarity is generally accepted as evidence of homology. In the absence of such supporting evidence, however, it's uncertain whether such similarity is the result of convergent or divergent evolution. Such significance measures are also important in assessing predicted models of protein structure. If two models predict different portions of a structure correctly, which is the better prediction? If some parts of a structure are more difficult to predict, then a model that gets these portions correct might be considered the better prediction.

Several projects are possible relating to this problem:

Protein design
Consider two proteins with considerable structural similarity, but no significant sequence similarity. The significance of this structural similarity is uncertain. Did the similar structural motifs arise independently, or are these proteins homologs that have diverged so far from their last common ancestor that all significant sequence similarity has been lost? Using protein design, can we identify a pair of sequences that are specific for the structures of these two folds and that do have significant sequence similarity? The existence of such a pair of sequences is evidence for homology between the two folds: if these proteins had occurred naturally, one would have concluded that they were related by divergent evolution.

Null models for protein structure comparison
In the structure-structure superposition alignment algorithm Mammoth, the statistical significance of a structural match is estimated by comparison to the distribution of scores seen for random pairs of proteins. An open question, however, is how to construct this distribution of random scores. In random comparisons, a small protein is more likely to be matched over a large fraction of its length than a large protein, and Mammoth conditions its reported p-values on the length of the proteins being compared. The likelihood of a particular match occurring by chance, however, also depends on both secondary structure type and supersecondary structure content: certain arrangements of secondary structure elements are very common in protein structures while others are rare. A possible project in this area would be to determine whether p-values estimated from a protein-class-specific null model are an improvement over those estimated from a distribution of scores including all protein classes.
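
As a starting point for such a comparison, the sketch below computes an empirical p-value for one match score against both a pooled and a class-specific sample of random-pair scores. It is a minimal illustration, not Mammoth's actual procedure: the function names, the class labels and the example scores are all made up, and the length conditioning described above is left out.

```python
from bisect import bisect_left


def empirical_p_value(score, null_scores):
    """Estimate P(null score >= score) from a sample of random-pair scores.

    The +1 terms keep the estimate from ever being exactly zero.
    """
    ordered = sorted(null_scores)
    n_at_least = len(ordered) - bisect_left(ordered, score)
    return (n_at_least + 1) / (len(ordered) + 1)


def compare_null_models(score, protein_class, null_by_class):
    """Return (pooled p-value, class-specific p-value) for one match score."""
    pooled = [s for scores in null_by_class.values() for s in scores]
    return (
        empirical_p_value(score, pooled),
        empirical_p_value(score, null_by_class[protein_class]),
    )


# Made-up scores for random pairs, grouped by a hypothetical class label.
EXAMPLE_NULL = {
    "all-alpha": [1.0, 2.0, 3.0, 4.0],
    "all-beta": [0.5, 1.0, 1.5, 2.0],
}

pooled_p, class_p = compare_null_models(3.0, "all-beta", EXAMPLE_NULL)
```

With these toy numbers the same score of 3.0 looks less surprising under the pooled null than under the all-beta null, which is the kind of discrepancy the project would try to measure on real data.
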

Suboptimal structure matches
Identifying significant structural matches can be a useful method for identifying accurate structure predictions (Bonneau Pfam). Some models have two regions that each match a known structure, but not simultaneously. Can such models be fixed, perhaps by combining the good portions of different models?

Protein Structure Prediction

The 6th round of the biennial CASP experiment starts this coming summer and is an opportunity for double-blind assessment of progress in structure prediction methods. While significant improvement in de novo prediction methods has been seen over the last several CASPs, little or no progress has been made in comparative modeling. More and more structures are being experimentally solved, with the goal of obtaining representative structures for all protein sequences. This means that structural information for the vast majority of proteins will be obtained from comparative models, and the quality of this structural information is limited by our ability to model homologous structures. The quality of comparative models depends on 1) aligning the target accurately, 2) modeling insertions well, and 3) identifying regions of structural divergence between two homologous structures and predicting these differences.

Several projects related to protein structure prediction for CASP 6 are possible.

Friesner, Honig and co-workers have recently described a hierarchical approach to loop modeling that uses clustering to select conformations for further refinement. Incorporating a similar strategy into Rosetta is likely to improve the loop modeling methods currently used.
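
To make the clustering step concrete, here is a minimal sketch of one way to pick representative loop conformations for further refinement. It is not the published method and not Rosetta code: the greedy neighbor-count scheme, the function names and the RMSD cutoff are assumptions chosen for illustration, and conformations are taken to be lists of (x, y, z) tuples already expressed in the frame of the fixed stem residues.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length coordinate lists.

    No superposition is done: loops are assumed to share the frame of the
    fixed stem residues they are attached to.
    """
    if len(a) != len(b):
        raise ValueError("conformations must have the same number of atoms")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))


def cluster_conformations(conformations, cutoff):
    """Greedy clustering: repeatedly take the conformation with most neighbors.

    Returns a list of (center_index, member_indices), largest cluster first.
    """
    remaining = set(range(len(conformations)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for i in sorted(remaining):
            members = [j for j in sorted(remaining)
                       if rmsd(conformations[i], conformations[j]) <= cutoff]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters
```

The cluster centers would then be the candidates passed on to a more expensive refinement stage. The all-against-all comparison is quadratic per round, so a real implementation would likely cache the distance matrix.
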

A second possibility would be to try to incorporate structural information from multiple parent structures into the Rosetta comparative modeling protocol. Currently, the method uses only a single parent structure.

Other possibilities include refining comparative models so that they end up closer to the target protein structure than the closest parent structure is, and developing good energy functions for refinement and/or model selection.

With respect to improving alignments, one could try using designed sequences to improve the PSSMs for parent structures, with the hope that this might result in an improvement in alignment quality.
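
For orientation, a PSSM can be built from a set of designed sequences roughly as sketched below. This is a simplified illustration under stated assumptions: the sequences are ungapped and of equal length (as they would be when designed onto one fixed backbone), the background is uniform, a flat pseudocount is used, and the function names are hypothetical. Real profile tools use sequence weighting and empirical background frequencies.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(sequences, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, ungapped sequences.

    Returns one dict per position mapping residue -> log2(frequency / background),
    with a uniform background of 1/20 (a simplifying assumption).
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must all have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(sequences) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in sequences:
            counts[seq[pos]] += 1.0  # KeyError on non-standard residues
        pssm.append({aa: math.log2(counts[aa] / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_sequence(pssm, sequence):
    """Sum the PSSM scores of a sequence aligned without gaps to the profile."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))
```

Designed sequences could be passed in on their own or appended to the natural homologs of the parent structure before calling `build_pssm`; whether the resulting profile actually gives better alignments is the open question of the project.
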

A more practical problem is automating the process of generating structure databases from which protein fragments can be selected. Currently, this process requires manually running an assortment of tools.


Function Prediction

Incorporating structural information into function prediction methods is a hot topic in bioinformatics, and a great area for possible projects. For example, one could try predicting protein-protein interactions by homology modeling using complexes as template structures. Aloy and Russell have reported such an approach, which could be extended by attempting to include backbone flexibility in the modeling process.


Proteome Browser

Lots of development is currently underway in both the family browser and the proteome browser. Talk to Jim Kent or Fan Hsu for a current list of possible projects related to the browsers.


Rosetta Software engineering

For those interested in a software engineering problem as opposed to a scientific research problem, many possibilities exist, either refactoring portions of Rosetta to improve speed or adding features to improve or build a user interface.


More Ideas

Reviews/Tutorials

For those of you who aren't interested in or able to take on a programming-intensive project, consider a literature review or tutorial. The areas of function evolution, function prediction from structure, and the relationship between structural and functional evolution are generating a lot of interest at the moment. Any of these would be a really interesting topic to research, and a review or tutorial on it would be of broad interest.

A second interesting topic for a review/tutorial would be protein domains. How are domains identified and classified? What are the major domain collections? What tools are available? What's the domain distribution over various genomes?