Protein Structural Similarity
What's the significance of detecting similar substructures in two different proteins? In combination with sequence or functional similarities, such similarity is generally accepted as evidence of homology. In the absence of such supporting evidence, however, it's uncertain whether the similarity is the result of convergent or divergent evolution. Significance measures of this kind are also important in assessing predicted models of protein structure. If two models predict different portions of a structure correctly, which is the better prediction? If some parts of a structure are more difficult to predict, then a model that gets these portions correct might be considered the better prediction.
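To make the last question concrete, here is a minimal Python sketch of one way to compare two models that get different regions right. It is only an illustration under stated assumptions: both models are taken to be already superimposed on the native structure, only C-alpha coordinates are used, and the function names, the 2.0 angstrom cutoff, and the per-residue difficulty weights are all hypothetical choices, not an established assessment measure.

```python
import math

def per_residue_correct(model, native, cutoff=2.0):
    """Flag each residue whose C-alpha lies within `cutoff` angstroms of the native.

    Assumes the model has already been superimposed on the native structure
    and that both are lists of (x, y, z) tuples of equal length.
    """
    if len(model) != len(native):
        raise ValueError("model and native must have the same number of residues")
    return [math.dist(m, n) <= cutoff for m, n in zip(model, native)]

def weighted_score(correct, difficulty):
    """Fraction of the total difficulty weight carried by correctly placed residues.

    `difficulty` is a hypothetical per-residue weight: larger values mean the
    residue is considered harder to predict.
    """
    total = sum(difficulty)
    return sum(w for ok, w in zip(correct, difficulty) if ok) / total

# Toy four-residue example (coordinates are made up for illustration).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 6.0, 0.0)]
model_b = [(0.0, 5.0, 0.0), (3.8, 5.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

uniform = [1.0, 1.0, 1.0, 1.0]
hard_last = [1.0, 1.0, 1.0, 5.0]  # assume the last residue is the hard one

for name, model in (("A", model_a), ("B", model_b)):
    ok = per_residue_correct(model, native)
    print(name, weighted_score(ok, uniform), weighted_score(ok, hard_last))
```

With uniform weights model A scores higher because it places more residues correctly. Once the last residue is weighted as difficult, model B comes out ahead, which is exactly the ambiguity the question above points to.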
Several projects are possible relating to this problem:
The 6th round of the biennial CASP experiment starts this coming summer and is an opportunity for double-blind assessment of progress in structure prediction methods. While significant improvement in de novo prediction methods has been seen over the last several CASPs, little or no progress has been made in comparative modeling. More and more structures are being experimentally solved, with the goal of obtaining representative structures for all protein sequences. This means that structural information for the vast majority of proteins will be obtained from comparative models, and the quality of this structural information is limited by our ability to model homologous structures. The quality of comparative models depends on 1) accurate alignments, 2) modeling insertions well, and 3) identifying regions of structural divergence between two homologous structures and predicting these differences.
Several projects related to protein structure prediction for CASP 6 are possible.
Friesner, Honig and co-workers have recently described a hierarchical approach to loop modeling that uses clustering to select conformations for further refinement. Incorporating a similar strategy into Rosetta is likely to improve the loop modeling methods currently used.
A second possibility would be to try to incorporate structural information from multiple parent structures into the Rosetta comparative modeling protocol. Currently, the method uses only a single parent structure.
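The idea of using clustering to pick conformations for refinement can be sketched roughly as follows. This is not the published Friesner/Honig procedure and not Rosetta code; it is a simple greedy "leader" clustering chosen for brevity. It assumes all candidate loops share the same fixed stem residues, so coordinates can be compared directly without superposition, and the function names, the energy values, and the 1.0 angstrom radius are hypothetical.

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equal-length coordinate lists.

    No superposition is done: the loops are assumed to sit in a common frame.
    """
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def cluster_loops(conformations, energies, radius=1.0):
    """Greedy leader clustering of loop conformations.

    Conformations are visited from lowest to highest energy. Each one joins
    the first cluster whose leader is within `radius` angstroms RMSD,
    otherwise it starts a new cluster. Returns lists of indices; the first
    index in each list is the cluster leader (its lowest-energy member).
    """
    order = sorted(range(len(conformations)), key=lambda i: energies[i])
    clusters = []
    for i in order:
        for members in clusters:
            if rmsd(conformations[i], conformations[members[0]]) <= radius:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def select_for_refinement(clusters):
    """Pick one representative (the leader) from each cluster."""
    return [members[0] for members in clusters]

# Toy example: four two-atom "loops" forming two well-separated groups.
loops = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)],
    [(0.0, 5.0, 0.0), (1.0, 5.0, 0.0)],
    [(0.0, 5.2, 0.0), (1.0, 5.2, 0.0)],
]
energies = [-1.0, -2.0, -0.5, -3.0]  # made-up scores, lower is better

clusters = cluster_loops(loops, energies)
print(select_for_refinement(clusters))
```

Only the selected representatives would be passed on to a more expensive refinement stage, so the refinement cost grows with the number of distinct loop shapes rather than with the number of sampled conformations.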
Other possibilities include refining comparative models so that they end up closer to the target protein structure than the closest parent structure is, and developing good energy functions for refinement and/or model selection.
With respect to improving alignments, one could try using designed sequences to improve the PSSMs for parent structures, with the hope that this might result in an improvement in alignment quality.
A more practical problem is automating the process of generating structure databases from which protein fragments can be selected. Currently, this process requires manually running an assortment of tools.
Function Prediction
Incorporating structural information into function prediction methods is a hot topic in bioinformatics, and a great area for possible projects. For example, one could try predicting protein-protein interactions by homology modeling using complexes as template structures. Aloy and Russell have reported such an approach, which could be extended by attempting to include backbone flexibility in the modeling process.
Proteome Browser
Lots of development is currently underway in both the family browser and the proteome browser. Talk to Jim Kent or Fan Hsu for a current list of possible projects related to the browsers.
Rosetta Software Engineering
For those interested in a software engineering problem as opposed to a scientific research problem, many possibilities exist: refactoring portions of Rosetta to improve speed, adding features, or building or improving a user interface.
Reviews/Tutorials
For those of you who aren't interested in or able to take on a programming-intensive project, consider a literature review or tutorial. Function evolution, function prediction from structure, and the relationship between structural and functional evolution are generating a lot of interest at the moment. Any of these would be a really interesting topic to research, and a review or tutorial on it would be of broad interest.
A second interesting topic for a review/tutorial would be protein domains. How are domains identified and classified? What are the major domain collections? What tools are available? What's the domain distribution over various genomes?