The C. elegans genome is represented by the sequences of nucleotides, denoted by the letters A, C, G and T, that make up each of the six chromosomes of this animal. Altogether there are about 100 million letters in this database. Within this sequence of letters lie all of the approximately 20,000 genes of this organism. It is of great scientific importance to find all of the C. elegans genes. This task is too difficult to do by hand, so computational methods must be devised to do it. However, there is no easy way of recognizing where the genes are in the genome sequence. Genes are characterized by subtle sequence patterns that we do not entirely understand. Our project will be to apply machine learning methods from AI to build a system that "learns" to find genes in C. elegans genomic DNA.
Most of the genes in the C. elegans genome have already been found. The system we build will use these genes as training examples to learn how to recognize the other genes. We will test our system on parts of the C. elegans genome that were not included in the training set. In some of these test parts, we will know where the genes are, and so we can check if our system has indeed learned to find genes. In other parts, we won't know for sure where the genes are, and thus we will have the opportunity to discover new genes.
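To make the training and testing setup concrete, here is a rough sketch in Python. It is not one of the MATLAB programs you will be given, and everything in it is an assumption made for illustration: the function names, the window size, the toy sequence, and the rule that a window counts as "gene" when its centre falls inside a known gene. It shows one common way to turn DNA letters into numbers a neural network can read (four 0/1 values per letter) and how a region of the sequence can be held out for testing.

```python
from typing import List, Tuple

BASES = "ACGT"
Example = Tuple[List[int], int]  # (encoded window, label)


def one_hot(window: str) -> List[int]:
    """Encode each letter as four 0/1 values, in the order A, C, G, T.

    Any other letter (for example N) becomes four zeros.
    """
    vec: List[int] = []
    for base in window:
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def split_by_region(
    seq: str,
    gene_spans: List[Tuple[int, int]],
    size: int,
    cut: int,
) -> Tuple[List[Example], List[Example]]:
    """Slide a window of `size` letters along `seq` and label each window.

    A window is labeled 1 if its centre lies inside a known gene span
    (start inclusive, end exclusive), otherwise 0. Windows that end at or
    before `cut` go to the training set, windows that start at or after
    `cut` go to the test set, and windows straddling `cut` are dropped so
    the two sets share no letters.
    """
    train: List[Example] = []
    test: List[Example] = []
    for start in range(len(seq) - size + 1):
        centre = start + size // 2
        label = int(any(lo <= centre < hi for lo, hi in gene_spans))
        example = (one_hot(seq[start:start + size]), label)
        if start + size <= cut:
            train.append(example)
        elif start >= cut:
            test.append(example)
    return train, test


# Toy data, invented for this sketch: 20 letters with one "gene" at 4..9.
toy_seq = "ACGTACGTACGTACGTACGT"
toy_genes = [(4, 10)]
train_set, test_set = split_by_region(toy_seq, toy_genes, size=4, cut=12)
print(len(train_set), "training windows,", len(test_set), "test windows")
```

Holding out a whole region, rather than a random sample of windows, keeps overlapping windows from appearing in both sets, which would make the test results look better than they should.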
In order to get ready for the project, please read pages 525-587 in the text, paying special attention to the material on neural networks. We will skip ahead in the planned lectures and go over this material next in class, so you will have a good foundation to get started on your project. It would also be useful to read some introductory material on genes. We will provide an introductory article. (An advanced article, for those who are interested, is also available. It is the first article listed on my publications page.) No previous knowledge of molecular biology is required for this project.
The project will be given in a series of weekly or bi-weekly assignments. Melissa will provide the data sets, along with neural network training and testing programs. She has written these programs using the MATLAB software, which contains a complete set of neural network tools. You will use this software for your project. At some point we might break up into teams to handle different parts of the gene-finding problem. You can stay in the same groups throughout the project. You can also work alone if you choose.
We are assuming that all students will be doing this project. However, if you would rather do a different AI project that you are already involved in, and another professor agrees to join us in supervising the work on it, then you can write a project proposal (3 pages) and turn it in by February 4. We will let you know shortly thereafter whether the project is approved. The same applies to those who want to write a research paper on a topic in AI instead of doing the gene-finding project. Those who are doing the gene-finding project do not need to write a project proposal.
Keep your eye out for the first project assignment!