Application and Analysis of Microarrays
BME 210, Winter 2004

Lab 7: Supervised learning to predict Archaeal stress


Objective

 

In this lab we want to find out if we can tell a cold shock experiment from a salt shock experiment from a pH shock, etc performed on P. aerophilum.(Pae)  If we can, this is evidence that there is a combination of genes that can be used as a molecular fingerprint for the Pae stress response.

 

One could use a similar approach to predict the internal molecular state of a cell or an organism.  It might be useful, for example, to be able to monitor whether a virus is sensing stressful conditions, or predict whether a microorganism is about to sporulate.  Supervised approaches are actively being researched for different applications in medicine.  For example, it would greatly aid cancer treatment if we could better characterize the tissue of origin for many types of cancer.  Using microarrays, we might be able to find a combination of gene expression levels that provide clues about which tissue and how de-differentiated cells from a tumor have become.  Supervised learning provides a formal framework for developing such diagnostic tools.

 

ANNOUNCEMENTS

 

You only have to do either section II or both sections III and IV to get full credit.

 

Extra credit will be given if you do the entire lab.

 

Due date

 

Answers to the questions (either in hardcopy form or emailed to Josh) are due in class on Thursday, March 11.  In addition to answering the questions, please hand in all of the requested plots and the code you wrote.

 

As always, you need to turn in your own work, but you are encouraged to work in groups (for example, you may elect to work with the same people as those in your project team).

 

 

I. Setup

 

Download the saved R binary file lab7.RData and save it to your R working directory.  Once you’ve done that, start R and then issue the command: load(“lab7.RData”).  This should give you the following objects in your R workspace:

 

  • stress.submatrix – An expression matrix, containing only genes with significant expression profiles as judged by the ANOVA analysis from lab 4.
  • stress.factor – A factor that lists which type of shock was done in each hybridization (cold-shock, pH-shock, etc.).
  • kegg – a matrix containing 0’s and 1’s describing which Pae genes in the stress.submatrix belong to a particular KEGG category (columns).
  • kmeans.assign – cluster assignments to the genes in the stress.submatrix after running k-means clustering from last lab using k=15.
  • fake.train, fake.test –matrices of fake expression profiles.  Type “rownames(fake.train)” to see to which fake class each gene is assigned in the training set.  There are 3 classes: up, down, and hi, corresponding to 3 different idealized expression profiles.  The first 10 genes belong to the up class.  Their profiles rise and have the idealized profile c(1,2,3,4,5).  The next 5 genes belong to the down class and have the idealized profile c(5,4,3,2,1).   The last 8 genes belong to the hi class and have the idealize profile c(6,6,6,6,6).  The gene profiles do not match the ideal profiles exactly since some white noise (Normal with mean=0, variance=1) was added to each measurement.  The fake test data is identical except that everything has been multiplied by a factor of 2 (so the up’s rise faster and the down’s drop faster, and the hi’s are twice as strong).  The standard error for the test data is also twice as large.

 

Download and source the file knn-error.R.  This script contains a function called knn.error()that runs k-nearest-neighbors and computes the error between the predicted classes and the known classes for the test data.  It’s arguments are:

    1. train.data – a matrix containing examples in the rows.  The row names are treated as the class label
    2. test.data=NULL – another matrix.  The rows are treated as the known class labels.  If known, it uses the training data
    3. k=1 – the number of neighbors to use in the classification (default is 1: use the closest neighbor)

 

II. Measure the cluster enrichment to KEGG functional groups

 

We would like to find out if the clusters we obtained in the last lab correspond at all to known biology.  To do that, we can compare each cluster we obtained from k-means, with curated categories maintained by the KEGG database.  The matrix kegg already contains assignments of the Pae stress genes to their KEGG categories where possible.

 

To test whether a particular cluster derived from k-means is enriched for a KEGG functional group, we can test whether its overlap with some group is much more than we’d expect if we were to draw a group of genes of the same size and get the observed overlap or better by chance.  The hypergeometric distribution gives us a formal way of measuring this significance.

 

a. Download the partially completed function hyper.overlap().  Write the rest of it so that the –log10 of the hypergeometric p-values are returned as a matrix (hint: type ?phyper).  The function takes in a gold standard matrix of 0’s and 1’s where the gene functional categories are listed across the top and the genes are listed down the rows.  A 1 in cell (i,j) indicates gene i is in category j.

 

b. Call your function with: O <- hyper.overlap(kegg, kmeans.assign).  Make an image map of the result.

 

c. Extra credit.  Which KEGG category appears to be the most significant?  Does this enrichment make sense with the gene expression profiles of the cluster(s) it overlaps with?  Can you argue for or against the presence of significant enrichment in this clustering solution?

 

III. Measure prediction performance

 

a. Training error on fake data.  Run knn.error() on the fake data contained in fake.train.  Pass in the fake.train matrix as both the training and the testing set.  Get the training errors for k=1..23 neighbors.  Make a plot of the training error as a function of k.  Please explain the following: What is the k-NN procedure doing here for k=1?  Why do you get perfect prediction (0 error) for small values of k?  What’s the maximum error you can obtain on this problem and why?  What is the k-NN procedure doing when k=23 (and can you think of a faster way to achieve the exact same accuracy)?

 

b. Test error on fake data.  Now pass in the fake.test matrix as the test.data argument to the function and obtain the errors for k=1..23 neighbors.  Make a plot of the test error as a function of k.  What’s the lowest error you can achieve on the test data?  What value of k gives you the lowest error?

 

c. Leave-one-out cross validation on fake data.  Construct a fake data matrix by combining the rows from the fake.train and fake.test matrices.  You can do this using the rbind() function.  For each value of k=1..23, perform leave-one-out cross-validation and obtain the average leave-one-out error for each value of k.  Make a plot of these average errors as a function of k.  What is the best k to use for this problem as determined by leave-one-out cross-validation?  What value do you think you should use?

 

IV. k-nearest neighbors on the stress data

 

Now its time to predict the stress condition from a hybridization profile. 

 

a. Training error on the stress data.  Since the k-nearest neighbors function knn.error() assumes the predictors are contained in the rows of a matrix, we need to transpose the stress.submatrix.  To make a matrix that you can use, do the following in R:

 

rownames(X) <- stress.factor

X <- t(stress.submatrix)

 

This should create a new matrix X that you can now pass to the knn.error() function.  Run knn.error() on the stress data.  Repeat what you did in part II.a for the stress data, obtaining resubstitution errors for k=1..30 (since there are 30 hybridizations).  Make a plot of the errors you obtain.

 

b. Leave-one-out cross-validation on stress data.  Perform leave-one-out cross-validation as you did in part II.c on the stress data.  Make a plot of the leave-one-out errors.  What is the best accuracy you achieve and for what value of k did you achieve it?  Based on this finding, do you think you can you use k-NN on this data to accurately predict one stress condition from another (why or why not)?

 

c. Extra credit.  Write a function called knn.error.matrix() that returns a matrix of errors such that cell (i,j) in the matrix counts the number of times class i in the test set is classified as class j by k-nearest neighbors.  Obtain the error matrix from leave-one-out cross validation using k=1.  Plot the image and describe what you see. 

 

Yeah, you’re done, no more labs!

 

Comments to Josh Stuart