BME 110 Computational Biology Tools
Study Section Practice Questions
Using NCBI Resources (BLAST), UCSC Genome Browser, and the UCSC Archaeal Browser
Instructions: Use the sequences at the bottom of the page for the Study
Section Practice Questions.
Using Various Flavors of BLAST
1. Use mystery-seq-A to search the non-redundant (NR) database, with no
filters, using the programs indicated below:
(a) What species and domain of life is this sequence derived
from? What program is best-suited to this purpose, assuming you
will get a 100% identity match? Why?
(b) What is the most distantly related sequence/species found using a
BlastN search (wordsize = 15, e-value 1e-5)? How many total
hits? (Hint: you may have to change the program selection between
MegaBlast, Discontiguous megablast, or blastn to get this wordsize
option).
(c) What is the most distantly related sequence/species found using a
BlastN (wordsize = 7, e-value 1e-5)? Is this a eukaryote, bacterium, or
archaeon?
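To see why the word size matters in parts (b) and (c), here is a minimal Python sketch that counts the exact words two sequences share. It is only an illustration of the seeding idea: the function name shared_words and the two toy sequences are made up for this example, and the real BLAST seeding and extension steps are more elaborate than this.

```python
def shared_words(query: str, subject: str, word_size: int) -> int:
    """Count the distinct exact words of length word_size found in both sequences."""
    def words(seq: str) -> set:
        seq = seq.upper()
        # Every overlapping window of length word_size (empty if seq is too short).
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}

    return len(words(query) & words(subject))


# Toy sequences (hypothetical, not taken from the assignment).
query = "AAAACCCC"
subject = "AAAAGGGG"
for w in (4, 5):
    print(w, shared_words(query, subject, w))
# prints:
# 4 1
# 5 0
```

A shorter word gives a diverged pair more chances to share a seed, which is why lowering the word size tends to pull in more distant hits, at the cost of a slower search.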
2. Using a more sensitive version of Blast than BlastN, tell me if a
homolog of the same gene as in question 1 probably occurs in the human
genome (remember: you can limit the search to only human sequences!).
Tell me which program you used, any non-default search parameters you
used, and the e-value and % identity of the top hit you find in
the human genome. What is your evidence / criteria for believing
this is a true homolog?
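If you choose one of the translated BLAST programs here, it may help to see what "translated" means. The sketch below translates a DNA sequence in all six reading frames using the standard genetic code. The function names are my own for illustration, and this is not how BLAST itself is implemented; it only shows the kind of protein-level sequences such a search compares.

```python
BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))


def translate(seq: str, frame: int = 0) -> str:
    """Translate starting at offset frame (0, 1, or 2); unknown codons become X."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq: str) -> list:
    """Three forward frames followed by three reverse-complement frames."""
    reverse = reverse_complement(seq)
    return [translate(seq, f) for f in range(3)] + [translate(reverse, f) for f in range(3)]


print(translate("ATGGCCTAA"))  # prints MA*
```

Because several codons encode the same amino acid, two genes can drift apart at the DNA level while their protein translations stay recognizably similar.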
3. Grab the full-length protein sequence (not just the part with a BLAST
hit) of your best hit to the human genome in question 2, and use it to
find the coordinates of the gene in the March 2006 (hg18) human genome
assembly in the UCSC Genome Browser (if there is more than one hit
that is >95% identical, just give me the top hit).
(a) Give the coordinates, chromosome, gene name, and number of exons
according to the RefSeq track.
(b) Are there any regions of the introns that are more conserved than
the exons, suggesting possible RNA genes or missed exons? (Use the
conservation track to answer this question.) If so, which intron(s)
are they in? Turn on the sno/miRNA track to "full" (in the
Genes and Gene Predictions group of tracks). Do you see any that
overlap with any highly conserved regions (like we saw in class)?
(c) Look at the prediction for the UTRs (untranslated regions) for this
gene. Focusing on the "RefSeq Genes" track, is the UTR larger on
the 5' or 3' end of this gene?
(d) You suspect that known classes of repetitive elements have been
involved in the evolution or regulation of this gene. Turn on the
"RepeatMasker" track (one of the last tracks at the bottom of the
page). What is the most numerous class of repetitive element
(e.g., SINE, LINE, LTR, DNA, Simple, etc.)? Do any of these
repetitive elements overlap any of the predicted protein-coding exons?
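For part (c), you can read the UTR sizes off the browser display, but the same comparison can be sketched from the numbers in a gene record. The Python below assumes UCSC-style table coordinates (0-based start, end not included) and a strand of "+" or "-". The function name and the example exon coordinates are hypothetical and are not the answer to this question.

```python
def utr_lengths(strand, cds_start, cds_end, exons):
    """Return (five_prime, three_prime) exonic UTR lengths in bases.

    exons is a list of (start, end) pairs in 0-based, end-exclusive
    coordinates; cds_start and cds_end bound the coding region.
    """
    # Exonic bases lying to the left of the coding region on the chromosome.
    left = sum(max(0, min(end, cds_start) - start) for start, end in exons)
    # Exonic bases lying to the right of the coding region on the chromosome.
    right = sum(max(0, end - max(start, cds_end)) for start, end in exons)
    # On the minus strand the gene reads right to left, so the ends swap.
    return (left, right) if strand == "+" else (right, left)


# Hypothetical three-exon gene, for illustration only.
toy_exons = [(100, 200), (300, 400), (500, 700)]
print(utr_lengths("+", 150, 550, toy_exons))  # prints (50, 150)
print(utr_lengths("-", 150, 550, toy_exons))  # prints (150, 50)
```

Note the strand check: for a gene on the minus strand, the UTR drawn on the left side of the browser window is the 3' UTR, not the 5' UTR.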
4. You notice that there are not many strong hits to a particular
protein of interest (mystery-seq-B). Use PSI-BLAST to find
more hits than is possible with BlastP alone (all default parameters).
(a) How many have an initial e-value of 1e-5 or smaller?
(b) Use the default inclusion cutoff, and run iteration 2. How
many new proteins are found with an e-value better than 1e-5 on
this iteration? How many iterations do you have to run until
no new sequences are found?
(c) Click on the "Distance tree of results" at the bottom of the page
to see a phylogenetic representation of all the hits found. Is
this protein unique to this species or just Crenarchaea, just Archaea,
just Archaea & Bacteria, or is it found in all three domains of
life?
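The bookkeeping in parts (a) and (b) can be sketched in a few lines of Python. This assumes you have copied the hits from each PSI-BLAST iteration into a dictionary of accession to e-value. The function name and the accessions P1 to P3 are placeholders invented for the example, not real results.

```python
def new_hits_per_iteration(iterations, cutoff=1e-5):
    """Count, per iteration, hits at or below the cutoff that were not seen before.

    iterations is a list of dicts mapping a hit accession to its e-value,
    one dict per PSI-BLAST iteration, in order.
    """
    seen = set()
    counts = []
    for hits in iterations:
        passing = {acc for acc, evalue in hits.items() if evalue <= cutoff}
        counts.append(len(passing - seen))
        seen |= passing
    return counts


# Placeholder data, for illustration only.
toy_iterations = [
    {"P1": 1e-30, "P2": 1e-3},
    {"P1": 1e-40, "P2": 1e-8, "P3": 1e-6},
    {"P1": 1e-40, "P2": 1e-9, "P3": 1e-7},
]
print(new_hits_per_iteration(toy_iterations))  # prints [1, 2, 0]
```

The first iteration that adds no new hits is the point where the search has stopped finding anything new.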
Use these sequences for the Study Section:
>mystery-seq-A
GTGTTTAGGACACATCTAGTCTCAGAATTAAATCCTAAATTAGATGGATC
AGAAGTAAAGGTAGCAGGATGGGTTCATAATGTAAGGAATTTAGGTGGAA
AGATATTTATTTTATTAAGAGACAAGAGTGGAATAGGACAAATAGTAGTT
GAAAAAGGTAATAATGCATATGATAAAGTCATAAATATAGGATTGGAATC
GACTATCGTTGTAAATGGTGTAGTTAAAGCTGATGCGAGAGCCCCTAATG
GGGTTGAAGTACACGCAAAAGATATAGAAATACTGTCGTATGCAAGGTCT
CCATTACCGTTAGATGTGACGGGCAAGGTTAAGGCTGATATAGATACTAG
ACTTAGGGAAAGATTACTAGATTTAAGAAGATTGGAGATGCAAGCAGTGT
TAAAAATACAATCGGTAGCTGTGAAATCATTTAGGGAAACATTATATAAA
CATGGATTTGTAGAAGTCTTTACTCCAAAGATAATTGCTAGTGCAACGGA
AGGAGGAGCCCAATTATTTCCAGTATTATACTTTGGAAAAGAGGCATTTT
TAGCTCAGAGTCCGCAATTATACAAGGAATTATTAGCAGGTGCTATAGAA
AGAGTATTTGAAATAGCTCCTGCATGGAGAGCAGAAGAGTCAGACACACC
ATATCATCTCTCAGAGTTCATTAGCATGGACGTAGAAATGGCCTTTGCCG
ATTACAACGATATAATGGCTTTAATAGAACAAATAATTTATAACATGATA
AATGATGTAAAGAGAGAATGTGAAAATGAATTAAAGATATTGAATTATAC
TCCACCTAATGTTAGAATACCTATAAAGAAAGTCTCTTACTCAGATGCAA
TAGAGCTTCTGAAAAGTAAAGGTGTTAATATTAAATTTGGCGATGATATA
GGAACGCCTGAACTGAGGGTATTATATAATGAATTAAAGGAAGATCTTTA
CTTCGTAACTGATTGGCCTTGGCTAAGTAGACCATTTTATACAAAGCAGA
AAAAAGATAATCCGCAGCTAAGCGAGAGCTTTGATTTAATTTTCAGATGG
TTAGAGATTGTTTCTGGAAGTTCAAGAAATCACGTTAAAGAAGTCCTAGA
GAACTCACTTAAAGTAAGAGGACTAAATCCAGAAAGTTTTGAATTCTTCC
TAAAATGGTTTGACTATGGGATGCCACCACACGCCGGTTTTGGAATGGGA
TTAGCAAGAGTAATGTTAATGTTAACTGGTCTTCAGAGCGTGAAGGAAGT
AGTACCATTCCCTAGAGATAAGAAGAGACTAACACCATAG
>mystery-seq-B
MGVEICRSLLECLGALGRSQRLYAAAGLVDEEGLEAASRAAGELRVLVGD
SGPVPRPVYERWREVVRVYPSLHAKFYIFAEDAGPSAALVGSADLTAGGL
RGNLEAVVLIRGEAARPLADMFNRLWARALPLTEDYVADWEGPEEALRKP
WGEAVKRANERLAEILGVSAHCLSRHDPLNCARLVARAVRSRFEGCGDLP
ENCAARATGVSAKALLSAPPSAVLAGHYVCWARALAARLLEGKVGRLDSG
MEAYEAAVQAGAESCWGEAKRAAEEELERLEDSNYRDNYVRWPIPYRLLF
LAMTLPATGCRILGREVRTKKRGVARVERELYC
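If you would rather handle these sequences in a script than paste them by hand, here is a minimal sketch of a FASTA reader in Python. The function names are my own, and the nucleotide check is a rough heuristic (it assumes a sequence made only of A, C, G, T, U, or N is DNA or RNA), so treat it as a starting point only.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after the ">" character.
            fields = line[1:].split()
            name = fields[0] if fields else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def looks_like_nucleotide(seq):
    """Rough check: True if the sequence uses only nucleotide letters."""
    return set(seq.upper()) <= set("ACGTUN")


example = ">a\nACGT\nAC\n>b some description\nMKV\n"
for record_name, sequence in read_fasta(example).items():
    print(record_name, len(sequence), looks_like_nucleotide(sequence))
# prints:
# a 6 True
# b 3 False
```

Knowing whether a query is nucleotide or protein tells you which BLAST programs can accept it as input.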