|
|
Application and Analysis of
Microarrays Lab 7: Supervised learning to
predict Archaeal stress |
|
Objective |
In this
lab we want to find out if we can tell a cold shock experiment from a salt
shock experiment from a pH shock, etc performed on P. aerophilum.(Pae) If we can,
this is evidence that there is a combination of genes that can be used as a
molecular fingerprint for the Pae
stress response.
One could use a similar approach to predict the
internal molecular state of a cell or an organism. It might be useful, for example, to be able
to monitor whether a virus is sensing stressful conditions, or predict whether
a microorganism is about to sporulate.
Supervised approaches are actively being researched for different
applications in medicine. For example,
it would greatly aid cancer treatment if we could better characterize the
tissue of origin for many types of cancer.
Using microarrays, we might be able to find a combination of gene
expression levels that provide clues about which tissue and how
de-differentiated cells from a tumor have become. Supervised learning provides a formal
framework for developing such diagnostic tools.
|
ANNOUNCEMENTS |
You only have to do either section
II or both sections III and IV to get full credit.
Extra credit will be given if you do the entire lab.
|
Due date |
Answers to the questions (either in hardcopy form or
emailed to Josh) are due in class on Thursday,
March 11. In addition to
answering the questions, please hand in all of the requested plots and the code
you wrote.
As always, you need to turn in your own work, but you
are encouraged to work in groups (for example, you may elect to work with the
same people as those in your project team).
|
I. Setup |
Download the saved R binary file lab7.RData and save it to your R working directory. Once you’ve done that, start R and then issue
the command: load(“lab7.RData”). This should give you the following objects in
your R workspace:
Download and source the file knn-error.R. This script
contains a function called knn.error()that runs k-nearest-neighbors and computes the error between the predicted
classes and the known classes for the test data. It’s arguments are:
|
II. Measure the cluster enrichment to KEGG functional
groups |
We would like to find out
if the clusters we obtained in the last lab correspond at all to known biology. To do that, we can compare each cluster we obtained
from k-means, with curated categories maintained by the KEGG database. The matrix kegg already contains
assignments of the Pae stress genes to their KEGG categories where
possible.
To test whether a
particular cluster derived from k-means is enriched for a KEGG
functional group, we can test whether its overlap with some group is much more
than we’d expect if we were to draw a group of genes of the same size and get
the observed overlap or better by chance.
The hypergeometric distribution gives us a formal way of measuring this
significance.
a. Download the
partially completed function hyper.overlap(). Write the rest of it so that the –log10
of the hypergeometric p-values are returned as a matrix (hint: type ?phyper).
The function takes in a gold standard matrix of 0’s and 1’s where the
gene functional categories are listed across the top and the genes are listed
down the rows. A 1 in cell (i,j) indicates
gene i is in category j.
b. Call your
function with: O <- hyper.overlap(kegg,
kmeans.assign). Make an image map of the
result.
c. Extra credit. Which KEGG category appears to be the most significant? Does this enrichment make sense with the gene
expression profiles of the cluster(s) it overlaps with? Can you argue for or against the presence of significant
enrichment in this clustering solution?
|
III. Measure prediction performance |
a. Training error on fake data. Run knn.error() on the
fake data contained in fake.train. Pass in the fake.train matrix as both the training and the testing set. Get the training errors for k=1..23
neighbors. Make a plot of the training
error as a function of k. Please explain the following: What is the k-NN procedure doing here for k=1?
Why do you get perfect prediction (0 error) for
small values of k? What’s the maximum error you can obtain on
this problem and why? What is the k-NN procedure doing when k=23 (and can you think of a faster way
to achieve the exact same accuracy)?
b. Test error on fake data. Now pass in the fake.test matrix as the test.data
argument to the function and obtain the errors for k=1..23 neighbors. Make a plot of the test error as a function
of k.
What’s the lowest error you can achieve on the test data? What value of k gives you the lowest error?
c. Leave-one-out cross validation on
fake data. Construct a
fake data matrix by combining the rows from the fake.train and fake.test
matrices. You can do this using the rbind() function. For each value of k=1..23, perform leave-one-out
cross-validation and obtain the average leave-one-out error for each value of k.
Make a plot of these average errors as a function of k.
What is the best k to use for
this problem as determined by leave-one-out cross-validation? What value do you think you should use?
|
IV. k-nearest
neighbors on the stress data |
Now its time to predict the stress condition from a
hybridization profile.
a. Training error on the stress data. Since the k-nearest
neighbors function knn.error() assumes the predictors are contained
in the rows of a matrix, we need to transpose the stress.submatrix. To make a
matrix that you can use, do the following in R:
rownames(X) <-
stress.factor
X <- t(stress.submatrix)
This should create a new matrix X that you can now pass to the knn.error() function. Run knn.error() on the stress data. Repeat what you did in part II.a for the stress data, obtaining
resubstitution errors for k=1..30 (since there are 30 hybridizations). Make a plot of the errors you obtain.
b. Leave-one-out cross-validation on
stress data. Perform
leave-one-out cross-validation as you did in part II.c on the stress data.
Make a plot of the leave-one-out errors.
What is the best accuracy you achieve and for what value of k did you achieve it? Based on this finding, do you think you can
you use k-NN on this data to accurately
predict one stress condition from another (why or why not)?
c. Extra credit. Write a function called knn.error.matrix() that returns a matrix of errors such
that cell (i,j) in the matrix counts
the number of times class i in the
test set is classified as class j by k-nearest neighbors. Obtain the error matrix from leave-one-out
cross validation using k=1. Plot the image and describe what you
see.
|
Yeah, you’re done, no
more labs! |
Comments to Josh Stuart