Homework 2 BLAST Practice
Questions
Use NCBI resources (including BLAST and ORF Finder), the UCSC Genome Browser, the UCSC Archaeal Browser, ExPASy tools, Primer3, and the Biology Workbench at SDSC to analyze these sequences.
Instructions: Use the two groups of sequences at the bottom of the page for the Study Section Practice Questions (seq-A, B, C, etc.) and for the homework problems (seq-1, seq-2, seq-3, etc.). You will not get homework credit if you turn in answers for the study section sequences.
Using Various Flavors of BLAST
1. Search the non-redundant (NR) database with mystery-seq-A/seq-1, using no filters and the programs indicated below.
(a) What species and domain of life is this sequence derived from?
For each of the following programs, report the number of hits, the most distantly related species (eukaryote, bacterium, or archaeon), and the max identity of that hit. Use Expect threshold = 1e-5, database NR, and defaults for all other parameters:
(b) Megablast (word size = 28)
(c) BlastN (word size = 15)
(d) BlastN (word size = 7)
Please show the hit information for the most distantly related sequence
for
each of these searches.
Hint: Remember, you can click on "Edit and resubmit" at the top of your BLAST search results page to go back and repeat a search on the same sequence with modified parameters.
2. Using a more sensitive version of Blast
than
BlastN, tell me if a homolog of the same gene as in question 1 probably
occurs
in the human genome (remember: you can limit the search to only human
sequences!).
(a) Tell me which program you used, the best scoring matrix to use, and any non-default search parameters you used.
(b) If you find a hit, what are the e-value and % identity of the top hit in the human genome? What is your evidence / criteria for believing this is a true homolog? Please show the top hit score info.
3. Grab the full-length protein sequence (not just the part with a
BLAST hit)
of your best hit to the human genome in question 2, and use it to find
the
coordinates of the gene in the March 2006 human genome assembly in the
UCSC
Genome Browser (if there is more than one that is >95% identical,
just give
me the top hit).
(a) Give the coordinates, chromosome, gene name, and number of exons
according
to the RefSeq track.
(b) Look at the prediction for the UTRs (untranslated regions) for this
gene. Focussing on the "RefSeq Genes" track,
is the UTR larger on the 5' or 3' end of this gene?
(c) You suspect that known classes of repetitive elements have been involved in the evolution or regulation of this gene. Turn on the "Repeat Masker" track (one of the last tracks at the bottom of the page). What is the most numerous class and family of repetitive element? (Class: SINE, LINE, LTR, DNA, Simple, etc. Family: click on a few of the repeat elements in the genome browser to find out the family classification.) Do any of these repetitive elements overlap any of the predicted protein-coding exons?
4. You notice that there are not many strong hits to a particular protein of interest (mystery-seq-B/seq-2). Use PSI-BLAST to find more hits than is possible with BlastP alone (all default parameters). For your PSI-BLAST search, use Expect threshold = 0.1, word size = 2, database = NR, and defaults for all other parameters.
(a) How many have an initial e-value of 1e-5 or smaller in the first
(blastp)
iteration?
(b) Use the default inclusion cutoff, and run iteration 2. How
many new
proteins could be found with an evalue better than 1e-5 on this
iteration? How many iterations do you
have to
repeat until no new sequences can be found?
If it does not converge after 6
iterations,
stop (answer no more than 6).
(c) Click on the "Distance tree of results" at the bottom of the page
to see a phylogenetic representation of all the hits found. Is
this
protein unique to this species or just Crenarchaea, just Archaea, just
Archaea
& Bacteria, or is it found in all three domains of life?
(d) Based on the evalue and percent identity against the top scoring hit of newly-found sequences,
do you think these hit sequences are orthologs (same function), paralogs (related function), or have unrelated
function?
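Counting hits under an E-value cutoff and picking out the weakest one, as questions 1 and 4(a) ask, can also be done from a saved hit table rather than by eye. The sketch below assumes the hits were exported in BLAST+ tabular form (`-outfmt 6` with its default twelve columns). The function name `summarize_hits` and the example rows are made up for illustration and are not real search results. It uses lowest percent identity as a rough stand-in for "most distantly related", which still needs to be confirmed by checking the taxonomy of the hit.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def summarize_hits(tabular_text, max_evalue=1e-5):
    """Count distinct subjects passing the cutoff; return the least identical one."""
    best = {}  # sseqid -> (evalue, pident) of that subject's lowest-E-value HSP
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        rec = dict(zip(FIELDS, row))
        evalue, pident = float(rec["evalue"]), float(rec["pident"])
        if evalue > max_evalue:
            continue
        if rec["sseqid"] not in best or evalue < best[rec["sseqid"]][0]:
            best[rec["sseqid"]] = (evalue, pident)
    if not best:
        return 0, None
    weakest = min(best, key=lambda s: best[s][1])
    return len(best), (weakest, best[weakest][1])


# Hypothetical rows for illustration only.
example = (
    "query1\tsubjA\t98.0\t1200\t24\t0\t1\t1200\t1\t1200\t0.0\t2000\n"
    "query1\tsubjB\t73.0\t105\t28\t0\t472\t576\t100\t204\t1e-06\t64.4\n"
    "query1\tsubjC\t70.0\t149\t42\t3\t493\t638\t50\t198\t2e-03\t40.1\n"
)
print(summarize_hits(example))  # (2, ('subjB', 73.0))
```

Here subjC is dropped because its E-value (2e-03) is above the 1e-5 threshold, so only two subjects are counted.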
>mystery-seq-1
GTGTTTAGGACACATCTAGTCTCAGAATTAAATCCTAAATTAGATGGATC
AGAAGTAAAGGTAGCAGGATGGGTTCATAATGTAAGGAATTTAGGTGGAA
AGATATTTATTTTATTAAGAGACAAGAGTGGAATAGGACAAATAGTAGTT
GAAAAAGGTAATAATGCATATGATAAAGTCATAAATATAGGATTGGAATC
GACTATCGTTGTAAATGGTGTAGTTAAAGCTGATGCGAGAGCCCCTAATG
GGGTTGAAGTACACGCAAAAGATATAGAAATACTGTCGTATGCAAGGTCT
CCATTACCGTTAGATGTGACGGGCAAGGTTAAGGCTGATATAGATACTAG
ACTTAGGGAAAGATTACTAGATTTAAGAAGATTGGAGATGCAAGCAGTGT
TAAAAATACAATCGGTAGCTGTGAAATCATTTAGGGAAACATTATATAAA
CATGGATTTGTAGAAGTCTTTACTCCAAAGATAATTGCTAGTGCAACGGA
AGGAGGAGCCCAATTATTTCCAGTATTATACTTTGGAAAAGAGGCATTTT
TAGCTCAGAGTCCGCAATTATACAAGGAATTATTAGCAGGTGCTATAGAA
AGAGTATTTGAAATAGCTCCTGCATGGAGAGCAGAAGAGTCAGACACACC
ATATCATCTCTCAGAGTTCATTAGCATGGACGTAGAAATGGCCTTTGCCG
ATTACAACGATATAATGGCTTTAATAGAACAAATAATTTATAACATGATA
AATGATGTAAAGAGAGAATGTGAAAATGAATTAAAGATATTGAATTATAC
TCCACCTAATGTTAGAATACCTATAAAGAAAGTCTCTTACTCAGATGCAA
TAGAGCTTCTGAAAAGTAAAGGTGTTAATATTAAATTTGGCGATGATATA
GGAACGCCTGAACTGAGGGTATTATATAATGAATTAAAGGAAGATCTTTA
CTTCGTAACTGATTGGCCTTGGCTAAGTAGACCATTTTATACAAAGCAGA
AAAAAGATAATCCGCAGCTAAGCGAGAGCTTTGATTTAATTTTCAGATGG
TTAGAGATTGTTTCTGGAAGTTCAAGAAATCACGTTAAAGAAGTCCTAGA
GAACTCACTTAAAGTAAGAGGACTAAATCCAGAAAGTTTTGAATTCTTCC
TAAAATGGTTTGACTATGGGATGCCACCACACGCCGGTTTTGGAATGGGA
TTAGCAAGAGTAATGTTAATGTTAACTGGTCTTCAGAGCGTGAAGGAAGT
AGTACCATTCCCTAGAGATAAGAAGAGACTAACACCATAG
>mystery-seq-2
MGVEICRSLLECLGALGRSQRLYAAAGLVDEEGLEAASRAAGELRVLVGD
SGPVPRPVYERWREVVRVYPSLHAKFYIFAEDAGPSAALVGSADLTAGGL
RGNLEAVVLIRGEAARPLADMFNRLWARALPLTEDYVADWEGPEEALRKP
WGEAVKRANERLAEILGVSAHCLSRHDPLNCARLVARAVRSRFEGCGDLP
ENCAARATGVSAKALLSAPPSAVLAGHYVCWARALAARLLEGKVGRLDSG
MEAYEAAVQAGAESCWGEAKRAAEEELERLEDSNYRDNYVRWPIPYRLLF
LAMTLPATGCRILGREVRTKKRGVARVERELYC
[2pts] (d) BlastN W=7 search: 26 total hits; the most distant hit is to Methanosphaera stadtmanae, max seq identity 70%
>emb|AJ248286.2|CNSPAX04 Pyrococcus abyssi complete genome; segment 4/6
Length=287130
Features in this part of subject sequence:
aspS aspartyl-tRNA synthetase
Score = 64.4 bits (70), Expect = 1e-06
Identities = 77/105 (73%), Gaps = 0/105 (0%)
Strand=Plus/Plus
Query 472 ACTCCAAAGATAATTGCTAGTGCAACGGAAGGAGGAGCCCAATTATTTCCAGTATTATAC 531
|| ||||||||||| || | ||||||||||||||| || | | |||||| | |||
Sbjct 16861 ACGCCAAAGATAATAGCGACGGCAACGGAAGGAGGAACCGAGCTGTTTCCATTGAAGTAC 16920
Query 532 TTTGGAAAAGAGGCATTTTTAGCTCAGAGTCCGCAATTATACAAG 576
|||| || || || || |||||||| || || || ||||||
Sbjct 16921 TTTGAGAACGATGCCTTCCTAGCTCAGTCACCACAGTTGTACAAG 16965
This species is from the domain Archaea; however, if you click on "Distance tree of results", you find the most distantly related species is Clostridium botulinum B str. Eklund 17B, which is a Firmicute in the domain Bacteria.
2. 5 pts total
>gb|CP000102.1| Methanosphaera stadtmanae DSM 3091, complete genome
Length=1767403
Features in this part of subject sequence:
AspS
Score = 64.4 bits (70), Expect = 1e-06
Identities = 105/149 (70%), Gaps = 3/149 (2%)
Strand=Plus/Plus
Query 493 GCAACGGAAGGAGGAGCCCAATTATTTCCAGTATTATACTTTGGAAAAGAGGCATTTTTA 552
||||| ||||| ||| | ||||||| ||| || ||||||| |||||| |||||| |
Sbjct 155328 GCAACTGAAGGTGGAACAGAATTATTCCCAATAACCTACTTTGAAAAAGAAGCATTTCTT 155387
Query 553 GCTCAGAGTCCGCAATTATA-CAAGGAATTAT--TAGCAGGTGCTATAGAAAGAGTATTT 609
| || ||||| ||| |||| || ||| || | ||| | | | || | ||||||
Sbjct 155388 GGACAAAGTCCTCAACTATATAAACAAATGATGATGGCAACAGGTCTTGACAATGTATTT 155447
Query 610 GAAATAGCTCCTGCATGGAGAGCAGAAGA 638
||||||| | || |||||||||||
Sbjct 155448 GAAATAGGACAAATATTCAGAGCAGAAGA 155476
This species is also in the domain Archaea, although if you click on "Distance tree of results",
you find the most distantly related species is still
Clostridium botulinum B str. Eklund 17B, which is a Firmicute in the domain Bacteria.
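The "max identity" figures quoted for these two hits come straight from the `Identities = matches/length (percent)` line of each alignment. As a minimal sketch, the helper below (names `parse_identities` and `percent_identity` are illustrative, not part of any BLAST tool) pulls those numbers out of the two lines shown above and recomputes the percentage to one decimal place.

```python
import re

# Matches e.g. "Identities = 77/105 (73%)"
IDENT_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)")


def parse_identities(line):
    """Return (matches, alignment_length, reported_percent) from a BLAST stats line."""
    m = IDENT_RE.search(line)
    if m is None:
        raise ValueError("no Identities field found")
    matches, length, reported = (int(g) for g in m.groups())
    return matches, length, reported


def percent_identity(matches, length):
    """Identical positions as a percentage of the alignment length."""
    return 100.0 * matches / length


for line in [
    "Identities = 77/105 (73%), Gaps = 0/105 (0%)",   # Pyrococcus abyssi hit
    "Identities = 105/149 (70%), Gaps = 3/149 (2%)",  # Methanosphaera stadtmanae hit
]:
    matches, length, reported = parse_identities(line)
    print(f"{matches}/{length} = {percent_identity(matches, length):.1f}% "
          f"(reported {reported}%)")
```

Note that percent identity is computed over the aligned region only (105 and 149 columns here), not over the full 1290-nt query.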
>gb|AAX07827.1|cell proliferation-inducing protein 40 [Homo sapiens]
Length=501
GENE ID: 1615 DARS | aspartyl-tRNA synthetase [Homo sapiens]
(Over 10 PubMed links)
Score = 238 bits (797), Expect = 3e-62
Identities = 166/454 (36%), Positives = 251/454 (55%), Gaps = 31/454 (6%)
Frame = +1
Query 19 VSELNPKLDGSEVKVAGWVHNVRNLGGKIFILLRDKSGIGQIVVEKGNNAYDKVI----N 186
V +L + V V VH R G + F++LR + Q +V G++A +++ N
Sbjct 15 VRDLTIQKADEVVWVRARVHTSRAKGKQCFLVLRQQQFNVQALVAVGDHASKQMVKFAAN 74
Query 187 IGLESTIVVNGVV-----KADARAPNGVEVHAKDIEILSYARSPLPLDVT---------- 321
I ES + V GVV K + VE+H + I ++S A LPL +
Sbjct 75 INKESIVDVEGVVRKVNQKIGSCTQQDVELHVQKIYVISLAEPRLPLQLDDAVRPEAEGE 134
Query 322 --GKVKADIDTrlrerlldlrrleMQAVLKIQSVAVKSFRETLYKHGFVEVFTPKIIASA 495
G+ + DTRL R++DLR QAV ++QS FRETL GFVE+ TPKII++A
Sbjct 135 EEGRATVNQDTRLDNRVIDLRTSTSQAVFRLQSGICHLFRETLINKGFVEIQTPKIISAA 194
Query 496 TEGGAQLFPVLYFGKEAFLAQSPQLYKELLAGA-IERVFEIAPAWRAEESDTPYHLSEFI 672
+EGGA +F V YF A+LAQSPQLYK++ A E+VF I P +RAE+S+T HL+EF+
Sbjct 195 SEGGANVFTVSYFKNNAYLAQSPQLYKQMCICADFEKVFSIGPVFRAEDSNTHRHLTEFV 254
Query 673 SMDVEMAFA-DYNDIMALIEQIIYNMINDVKRECENELKILNYTPP----NVRIPIKKVS 837
+D+EMAF Y+++M I + + ++ + E++ +N P P ++
Sbjct 255 GLDIEMAFNYHYHEVMEEIADTMVQIFKGLQERFQTEIQTVNKQFPCEPFKFLEPTLRLE 314
Query 838 YSDAIELLKSKGVNIKFGDDIGTPELRVLYNELKE----DLYFVTDWPWLSRPFYTKQKK 1005
Y +A+ +L+ GV + DD+ TP ++L + +KE D Y + +P RPFYT
Sbjct 315 YCEALAMLREAGVEMGDEDDLSTPNEKLLGHLVKEKYDTDFYILDKYPLAVRPFYTMPDP 374
Query 1006 DNPQLSESFDLIFRWLEIVSGSSRNHVKEVLENSLKVRGLNPESFEFFLKWFDYGMPPHA 1185
NP+ S S+D+ R EI+SG+ R H ++L G++ E + ++ F +G PPHA
Sbjct 375 RNPKQSNSYDMFMRGEEILSGAQRIHDPQLLTERALHHGIDLEKIKAYIDSFRFGAPPHA 434
Query 1186 GFGMGLARVMLMLTGLQSVKEVVPFPRDKKRLTP 1287
G G+GL RV ++ GL +V++ FPRD KRLTP
Sbjct 435 GGGIGLERVTMLFLGLHNVRQTSMFPRDPKRLTP 468
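The `Frame = +1` line and the nucleotide query coordinates (19 to 1287) show that the DNA query was translated before being compared with the protein. A minimal sketch of that translation step, assuming the standard genetic code and forward frames only (the function name `translate` is illustrative), reproduces the first residues of the query line from the first line of mystery-seq-1:

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=1):
    """Translate a DNA string in forward frame 1, 2, or 3 ('X' for unknown codons)."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    seq = dna.upper()
    codons = (seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# First line of mystery-seq-1 from this page.
first_line = "GTGTTTAGGACACATCTAGTCTCAGAATTAAATCCTAAATTAGATGGATC"
protein = translate(first_line, frame=1)
print(protein)      # VFRTHLVSELNPKLDG
print(protein[6:])  # VSELNPKLDG
```

Nucleotide 19 is the first base of the seventh codon, so `protein[6:]` lines up with the start of the aligned query (`VSELNPKLDG...`). Reverse-strand frames and alternative start codons are not handled in this sketch.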
>gi|59803475|gb|AAX07827.1| cell proliferation-inducing protein 40 [Homo sapiens]
MPSASASRKSQEKPREIMDAAEDYAKERYGISSMIQSQEKPDRVLVRVRDLTIQKADEVVWVRARVHTSR
AKGKQCFLVLRQQQFNVQALVAVGDHASKQMVKFAANINKESIVDVEGVVRKVNQKIGSCTQQDVELHVQ
KIYVISLAEPRLPLQLDDAVRPEAEGEEEGRATVNQDTRLDNRVIDLRTSTSQAVFRLQSGICHLFRETL
INKGFVEIQTPKIISAASEGGANVFTVSYFKNNAYLAQSPQLYKQMCICADFEKVFSIGPVFRAEDSNTH
RHLTEFVGLDIEMAFNYHYHEVMEEIADTMVHIFKGLQERFQTEIQTVNKQFPCEPFKFLEPTLRLEYCE
ALAMLREAGVEMGDEDDLSTPNEKLLGHLVKEKYDTDFYILDKYPLAVRPFYTMPDPRNPKQSNSYDMFM
RGEEILSGAQRIHDPQLLTERALHHGIDLEKIKAYIDSFRFGAPPHAGGGIGLERVTMLFLGLHNVRQTS
MFPRDPKRLTP
Use the full-length protein sequence above to BLAT against the human genome:
[3pts] (a) Coordinates: 136381359-136459508, chromosome 2, gene name: DARS (or Aspartyl-tRNA synthetase),
number of exons: 16 (lose one point for each item you get wrong)
[2pts](b) The 3' (left) end of the gene has a longer UTR, according to the RefSeq Genes track
[2pts] (c) The most numerous class and family of repetitive element for this gene is: SINE, Alu.
There are also several long LINE elements in this gene, but they are not the most numerous repeats.
(Family: click on a few of the repeat elements in the genome browser to find out the family classification.)
Do any of these repetitive elements overlap any of the predicted protein-coding exons?
[1 pt] No, none of the repeats directly overlap any of the protein-coding exons in this gene.
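The exon-overlap check in 3(c) is done by eye in the browser, but the underlying test is a simple interval comparison. The sketch below assumes exon and repeat coordinates have been copied out as half-open (start, end) pairs, as in BED-style tables. All coordinates shown are hypothetical and are not the real DARS exons or RepeatMasker hits.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def repeats_in_exons(exons, repeats):
    """Return every (repeat, exon) pair that overlaps."""
    return [(r, e) for r in repeats for e in exons if overlaps(r, e)]


# Hypothetical coordinates for illustration only.
exons = [(1000, 1200), (5000, 5150)]
repeats = [(1300, 1600), (4800, 4990), (5100, 5400)]
print(repeats_in_exons(exons, repeats))  # [((5100, 5400), (5000, 5150))]
```

An empty list would correspond to the answer given above, where no repeat touches a protein-coding exon.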
4. 7 points total
(a) How many have an initial e-value of 1e-5 or smaller in the first
(blastp) iteration?
[1 pt] Only one
(b) Use the default inclusion cutoff, and run iteration 2. How
many new proteins could be found with an evalue better than 1e-5 on this
iteration?
[1 pt] 10 sequences identified in the last round now have an evalue less than 1e-5
More than 50 other completely new sequences were found in this iteration!
How many iterations do you have to repeat until no new sequences can be found?
[1 pt] More than six iterations are required for PSI-Blast to converge on this very large
extended protein family.
(c) Click on the "Distance tree of results" at the
bottom of the page
to see a phylogenetic representation of all the hits found. Is
this
protein unique to this species or just Crenarchaea, just Archaea, just
Archaea
& Bacteria, or is it found in all three domains of life?
[2 pts] This protein is found in Archaea (Crenarchaea) and Bacteria,
but not in Eukarya
(d) Based on the evalue and
percent identity
against the top scoring hit of newly-found sequences,
do you think these hit
sequences
are orthologs (same function), paralogs (related function), or have
unrelated function?
[2 pts] Based on the significant E-value (1e-77), you might expect the top hits to be orthologs. However, the percent identity for these top hits, 12-13%, is extremely low, so you might say this protein is a paralog or a very distant homolog with unrelated function. Either of these answers is acceptable if justified properly.
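The bookkeeping behind 4(b), counting new sequences per PSI-BLAST iteration and deciding when the search has converged, can be sketched with set arithmetic. This assumes the hit accessions from each iteration have been collected into sets by hand. The accession names and function names below are hypothetical, and the cap of six iterations follows the question's instructions.

```python
def new_hits_per_iteration(iterations):
    """For each iteration's hit set, count accessions not seen in earlier iterations."""
    seen = set()
    counts = []
    for hits in iterations:
        new = set(hits) - seen
        counts.append(len(new))
        seen |= new
    return counts


def converged_at(iterations, max_iterations=6):
    """Return the 1-based iteration with no new hits, or None if not within the cap."""
    counts = new_hits_per_iteration(iterations[:max_iterations])
    for number, count in enumerate(counts, start=1):
        if number > 1 and count == 0:
            return number
    return None


# Hypothetical accessions for illustration only.
rounds = [
    {"p1"},
    {"p1", "p2", "p3"},
    {"p1", "p2", "p3", "p4"},
    {"p1", "p2", "p3", "p4"},
]
print(new_hits_per_iteration(rounds))  # [1, 2, 1, 0]
print(converged_at(rounds))            # 4
```

For a family like the one in this question, where new sequences keep appearing past iteration 6, `converged_at` would return `None`.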