CMP140 Projects
Winter 1999

We have put together a project plan for this course in which you will apply neural networks to try to find features of genes in DNA sequences. Now is a very exciting time in science: just a month ago, researchers completed the first genome sequence for an animal, the worm C. elegans. (See the C. elegans genome project.) In a few years they will obtain the sequence for the entire human genome, achieving a huge scientific milestone. The new science of "genomics", the study of entire genomes, is one of the important new application areas for AI.

The C. elegans genome is represented by the sequences of nucleotides, denoted by the letters A, C, G and T, that make up each of the six chromosomes of this animal. Altogether there are about 100 million letters in this database. Somewhere in this sequence of letters lie all of the approximately 20,000 genes of this organism. It is of great scientific importance to find all of the C. elegans genes. This task is too difficult to do by hand, so computational methods must be devised to do it. However, there is no easy way of recognizing where the genes are in the genome sequence. Genes are characterized by subtle sequence patterns that we do not entirely understand. Our project will be to apply machine learning methods from AI to build a system that "learns" to find genes in C. elegans genomic DNA.
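To make the representation concrete, here is a minimal sketch of one common way to turn a stretch of A, C, G and T letters into numbers that a neural network can take as input. It is written in Python for illustration only and is not the course software. The names `one_hot` and `sliding_windows`, the four-values-per-letter encoding, and the window width are all assumptions made for this example.

```python
from typing import Iterator, List, Tuple

# Assumed encoding: each nucleotide becomes four 0/1 values, one per letter.
ALPHABET = "ACGT"


def one_hot(window: str) -> List[int]:
    """Flatten a DNA window into a list of 0/1 values, four per letter."""
    encoded: List[int] = []
    for base in window.upper():
        if base not in ALPHABET:
            raise ValueError(f"unexpected symbol {base!r}")
        encoded.extend(1 if base == letter else 0 for letter in ALPHABET)
    return encoded


def sliding_windows(sequence: str, width: int) -> Iterator[Tuple[int, str]]:
    """Yield (start position, window) for every full-width window."""
    for start in range(len(sequence) - width + 1):
        yield start, sequence[start:start + width]


if __name__ == "__main__":
    # Toy sequence, not real C. elegans data.
    for start, window in sliding_windows("ACGTA", 3):
        print(start, window, one_hot(window))
```

Each window of `width` letters becomes a vector of `4 * width` inputs, so a network can be asked one question per position, for example whether the window is centered on part of a gene.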

Most of the genes in the C. elegans genome have already been found. The system we build will use these genes as training examples to learn how to recognize the other genes. We will test our system on parts of the C. elegans genome that were not included in the training set. In some of these test parts, we will know where the genes are, and so we can check if our system has indeed learned to find genes. In other parts, we won't know for sure where the genes are, and thus we will have the opportunity to discover new genes.
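The train-then-test procedure described above can be sketched as follows. This is only an illustration in Python, under the assumption that the genome has been cut into named regions and that each position carries a 0/1 label (1 meaning "inside a known gene"). The function names, the every-Nth-region holdout rule, and the two scores reported are hypothetical choices for this example, not the course's actual setup.

```python
from typing import Dict, List, Sequence, Tuple


def split_regions(region_ids: Sequence[str],
                  test_every: int = 5) -> Tuple[List[str], List[str]]:
    """Hold out every `test_every`-th region for testing; train on the rest."""
    train: List[str] = []
    test: List[str] = []
    for index, region in enumerate(region_ids):
        (test if index % test_every == 0 else train).append(region)
    return train, test


def score(predicted: Sequence[int], known: Sequence[int]) -> Dict[str, float]:
    """Compare per-position 0/1 predictions against known gene labels."""
    if len(predicted) != len(known):
        raise ValueError("predicted and known must have the same length")
    tp = sum(1 for p, k in zip(predicted, known) if p == 1 and k == 1)
    fp = sum(1 for p, k in zip(predicted, known) if p == 1 and k == 0)
    fn = sum(1 for p, k in zip(predicted, known) if p == 0 and k == 1)
    return {
        # Fraction of known gene positions that the system found.
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Fraction of predicted gene positions that are known to be correct.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


if __name__ == "__main__":
    train, test = split_regions(["r0", "r1", "r2", "r3", "r4", "r5"], test_every=3)
    print("train:", train, "test:", test)
    print(score([1, 1, 0, 0], [1, 0, 1, 0]))
```

On held-out regions where the genes are known, these scores show whether the system has learned anything. On regions where the genes are not known, a confident prediction is a candidate new gene rather than an error.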

In order to get ready for the project, please read pages 525-587 in the text, paying special attention to the material on neural networks. We will skip ahead in the planned lectures and go over this material next in class, so you will have a good foundation to get started on your project. It would also be useful to read some introductory material on genes. We will provide an introductory article. (An advanced article, for those who are interested, is also available. It is the first article listed on my publications page.) No previous knowledge of molecular biology is required for this project.

The project will be given in a series of weekly or bi-weekly assignments. Melissa will provide the data sets, along with neural network training and testing programs for you to use. She has written these programs using the MATLAB software, which contains a complete set of neural network tools. You will use this software for your project. At some point we might break up into teams to handle different parts of the gene-finding problem. You can stay in the same groups throughout the project. You can also work alone if you choose.

We are assuming that all students will be doing this project. However, you may do a different AI project instead if it is one that you are already involved in and another professor agrees to join us in supervising the work. In that case, write a project proposal (3 pages) and turn it in by February 4. We will let you know shortly thereafter whether the project is approved. The same applies to those who want to write a research paper on a topic in AI instead of doing the gene-finding project. Those who are doing the gene-finding project do not need to write a project proposal.

Keep your eye out for the first project assignment!


Questions about page content should be directed to cline@cse.ucsc.edu.
Last modified January 21, 1999.
