BME 110 Computational Biology Tools

Homework 2 BLAST Practice Questions

Use NCBI resources (including BLAST and ORF Finder), the UCSC Genome Browser, the UCSC Archaeal Browser, ExPASy Tools, Primer3, and the Biology Workbench at SDSC to analyze these sequences.

Instructions: Use the two groups of sequences at the bottom of the page for the Study Section Practice Questions (seq-A, B, C, etc) and for the homework problems (seq-1, seq-2, seq-3, etc).  You will not get homework credit if you turn in answers for the study section sequences.

Using Various Flavors of BLAST

1. Use mystery-seq-A/seq-1 to search the non-redundant (NR) database, with no filters, using the programs indicated below:
(a) What species and domain of life is this sequence derived from? 

For each of the following programs, how many hits are found, what is the most distantly related species (eukaryote, bacterium, or archaeon), and what is the max identity of that hit? Use Expect threshold = 1e-5, database NR, and defaults for all other parameters:
(b) Megablast (word size = 28)?
(c) BlastN search (word size = 15)?
(d) BlastN search (word size = 7)?
Please show the hit information for the most distantly related sequence for each of these searches.
Hint:  Remember, you can click on "Edit and resubmit" at the top of your BLAST search results page to go back and repeat a search on the same sequence with modified parameters.
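
For anyone who would rather script the three searches listed in question 1 than use the web form, the sketch below assembles command lines for the standalone NCBI BLAST+ program blastn. It is only an illustration: the homework itself expects the web interface, the file name "mystery-seq-1.fasta" is hypothetical, and using "nt" as the command-line name for the nucleotide NR database is an assumption. The code builds the argument lists and prints them; it does not run any search.

```python
def blastn_command(query_fasta, task, word_size, evalue=1e-5, db="nt"):
    """Build an argument list for the BLAST+ blastn program (not executed here)."""
    return [
        "blastn",
        "-task", task,                 # "megablast" or "blastn"
        "-query", query_fasta,         # hypothetical FASTA file holding the query
        "-db", db,                     # assumed name of the nucleotide NR database
        "-remote",                     # search at NCBI instead of a local database
        "-word_size", str(word_size),
        "-evalue", str(evalue),        # Expect threshold
        "-dust", "no",                 # turn off the low-complexity filter
    ]

# The three searches from question 1 as (task, word size) pairs
SEARCHES = [("megablast", 28), ("blastn", 15), ("blastn", 7)]
commands = [blastn_command("mystery-seq-1.fasta", task, w) for task, w in SEARCHES]

for cmd in commands:
    print(" ".join(cmd))
```

Only the task and word size change between the three runs, which mirrors using "Edit and resubmit" on the web page.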

2. Using a more sensitive version of Blast than BlastN, tell me if a homolog of the same gene as in question 1 probably occurs in the human genome (remember: you can limit the search to only human sequences!). 

(a) Tell me which program you used, the best scoring matrix to use, and any non-default search parameters you used.
(b) If you find a hit, what are the E-value and % identity of the top hit in the human genome?  What is your evidence / criteria for believing this is a true homolog?  Please show the top hit score info.

3. Grab the full-length protein sequence (not just the part with a BLAST hit) of your best hit to the human genome in question 2, and use it to find the coordinates of the gene in the March 2006 human genome assembly in the UCSC Genome Browser (if there is more than one that is >95% identical, just give me the top hit).
(a) Give the coordinates, chromosome, gene name, and number of exons according to the RefSeq track. 
(b) Look at the prediction for the UTRs (untranslated regions) for this gene.  Focussing on the "RefSeq Genes" track, is the UTR larger on the 5' or 3' end of this gene?
(c) You suspect that known classes of repetitive elements have been involved in the evolution or regulation of this gene.  Turn on the "Repeat Masker" track (one of the last tracks at the bottom of the page).  What is the most numerous class and family of repetitive element? (Class: SINE, LINE, LTR, DNA, Simple, etc.  Family: click on a few of the repeat elements in the genome browser to find out the family classification.)  Do any of these repetitive elements overlap any of the predicted protein-coding exons?

4.  You notice that there are not many strong hits to a particular protein of interest (mystery-seq-B/seq-2).  Use PSI-BLAST to find more hits than is possible with BlastP alone (all default parameters).  For your search, use Expect threshold = 0.1, word size = 2, database = NR, and defaults for all other parameters.
(a) How many have an initial e-value of 1e-5 or smaller in the first (blastp) iteration? 
(b) Use the default inclusion cutoff, and run iteration 2.  How many new proteins could be found with an E-value better than 1e-5 on this iteration?  How many iterations do you have to repeat until no new sequences can be found?  If it does not converge after 6 iterations, stop (answer no more than 6).
(c) Click on the "Distance tree of results" at the bottom of the page to see a phylogenetic representation of all the hits found.  Is this protein unique to this species or just Crenarchaea, just Archaea, just Archaea & Bacteria, or is it found in all three domains of life?

(d) Based on the E-value and percent identity against the top scoring hit of newly-found sequences, do you think these hit sequences are orthologs (same function), paralogs (related function), or have unrelated function?


>mystery-seq-1

GTGTTTAGGACACATCTAGTCTCAGAATTAAATCCTAAATTAGATGGATC
AGAAGTAAAGGTAGCAGGATGGGTTCATAATGTAAGGAATTTAGGTGGAA
AGATATTTATTTTATTAAGAGACAAGAGTGGAATAGGACAAATAGTAGTT
GAAAAAGGTAATAATGCATATGATAAAGTCATAAATATAGGATTGGAATC
GACTATCGTTGTAAATGGTGTAGTTAAAGCTGATGCGAGAGCCCCTAATG
GGGTTGAAGTACACGCAAAAGATATAGAAATACTGTCGTATGCAAGGTCT
CCATTACCGTTAGATGTGACGGGCAAGGTTAAGGCTGATATAGATACTAG
ACTTAGGGAAAGATTACTAGATTTAAGAAGATTGGAGATGCAAGCAGTGT
TAAAAATACAATCGGTAGCTGTGAAATCATTTAGGGAAACATTATATAAA
CATGGATTTGTAGAAGTCTTTACTCCAAAGATAATTGCTAGTGCAACGGA
AGGAGGAGCCCAATTATTTCCAGTATTATACTTTGGAAAAGAGGCATTTT
TAGCTCAGAGTCCGCAATTATACAAGGAATTATTAGCAGGTGCTATAGAA
AGAGTATTTGAAATAGCTCCTGCATGGAGAGCAGAAGAGTCAGACACACC
ATATCATCTCTCAGAGTTCATTAGCATGGACGTAGAAATGGCCTTTGCCG
ATTACAACGATATAATGGCTTTAATAGAACAAATAATTTATAACATGATA
AATGATGTAAAGAGAGAATGTGAAAATGAATTAAAGATATTGAATTATAC
TCCACCTAATGTTAGAATACCTATAAAGAAAGTCTCTTACTCAGATGCAA
TAGAGCTTCTGAAAAGTAAAGGTGTTAATATTAAATTTGGCGATGATATA
GGAACGCCTGAACTGAGGGTATTATATAATGAATTAAAGGAAGATCTTTA
CTTCGTAACTGATTGGCCTTGGCTAAGTAGACCATTTTATACAAAGCAGA
AAAAAGATAATCCGCAGCTAAGCGAGAGCTTTGATTTAATTTTCAGATGG
TTAGAGATTGTTTCTGGAAGTTCAAGAAATCACGTTAAAGAAGTCCTAGA
GAACTCACTTAAAGTAAGAGGACTAAATCCAGAAAGTTTTGAATTCTTCC
TAAAATGGTTTGACTATGGGATGCCACCACACGCCGGTTTTGGAATGGGA
TTAGCAAGAGTAATGTTAATGTTAACTGGTCTTCAGAGCGTGAAGGAAGT
AGTACCATTCCCTAGAGATAAGAAGAGACTAACACCATAG

>mystery-seq-2
MGVEICRSLLECLGALGRSQRLYAAAGLVDEEGLEAASRAAGELRVLVGD
SGPVPRPVYERWREVVRVYPSLHAKFYIFAEDAGPSAALVGSADLTAGGL
RGNLEAVVLIRGEAARPLADMFNRLWARALPLTEDYVADWEGPEEALRKP
WGEAVKRANERLAEILGVSAHCLSRHDPLNCARLVARAVRSRFEGCGDLP
ENCAARATGVSAKALLSAPPSAVLAGHYVCWARALAARLLEGKVGRLDSG
MEAYEAAVQAGAESCWGEAKRAAEEELERLEDSNYRDNYVRWPIPYRLLF
LAMTLPATGCRILGREVRTKKRGVARVERELYC


 

Problem Set 2 BLAST practice questions - Answer key

60 points total

1.  10pts total

a) Sulfolobus solfataricus P2, Domain of life: Archaea  [2pts]

Do a BlastN search with any parameters against the nucleotide NR database; the top hit is a 100% match (E-value = 0, score = 2379), so this is the source organism.

For 100% matches of long (>100nt) sequences


[2pts]  b) Megablast search: Only the identity hit to Sulfolobus solfataricus (above) is found with Megablast W=28.
[2pts]  c) BlastN W=15 Search:  13 total hits, the most distant hit is to Pyrococcus abyssi, max identity 73%.
>emb|AJ248286.2|CNSPAX04 Pyrococcus abyssi complete genome; segment 4/6
Length=287130

Features in this part of subject sequence:
aspS aspartyl-tRNA synthetase

Score = 64.4 bits (70), Expect = 1e-06
Identities = 77/105 (73%), Gaps = 0/105 (0%)
Strand=Plus/Plus

Query  472    ACTCCAAAGATAATTGCTAGTGCAACGGAAGGAGGAGCCCAATTATTTCCAGTATTATAC  531
              || ||||||||||| || |  ||||||||||||||| || |  | |||||| |    |||
Sbjct  16861  ACGCCAAAGATAATAGCGACGGCAACGGAAGGAGGAACCGAGCTGTTTCCATTGAAGTAC  16920

Query  532    TTTGGAAAAGAGGCATTTTTAGCTCAGAGTCCGCAATTATACAAG  576
              ||||  || || || ||  ||||||||   || || || ||||||
Sbjct  16921  TTTGAGAACGATGCCTTCCTAGCTCAGTCACCACAGTTGTACAAG  16965

This species is from the domain Archaea; however, if you click on "Distance tree of results", you find the most distantly related species is Clostridium botulinum B str. Eklund 17B, which is a Firmicute in the domain Bacteria.
[2pts] d) BlastN W=7 Search: 26 total hits, the most distant hit is to Methanosphaera stadtmanae, max seq identity 70%
>gb|CP000102.1| Methanosphaera stadtmanae DSM 3091, complete genome
Length=1767403

Features in this part of subject sequence:
AspS

Score = 64.4 bits (70), Expect = 1e-06
Identities = 105/149 (70%), Gaps = 3/149 (2%)
Strand=Plus/Plus

Query  493     GCAACGGAAGGAGGAGCCCAATTATTTCCAGTATTATACTTTGGAAAAGAGGCATTTTTA  552
               ||||| ||||| ||| |  ||||||| ||| ||   ||||||| |||||| |||||| |
Sbjct  155328  GCAACTGAAGGTGGAACAGAATTATTCCCAATAACCTACTTTGAAAAAGAAGCATTTCTT  155387

Query  553     GCTCAGAGTCCGCAATTATA-CAAGGAATTAT--TAGCAGGTGCTATAGAAAGAGTATTT  609
               |  || ||||| ||| ||||  ||  ||| ||  | |||   | | | || |  ||||||
Sbjct  155388  GGACAAAGTCCTCAACTATATAAACAAATGATGATGGCAACAGGTCTTGACAATGTATTT  155447

Query  610     GAAATAGCTCCTGCATGGAGAGCAGAAGA  638
               |||||||  |    ||  |||||||||||
Sbjct  155448  GAAATAGGACAAATATTCAGAGCAGAAGA  155476

This species is also in the domain Archaea, although if you click on "Distance tree of results", you find the most distantly related species is still Clostridium botulinum B str. Eklund 17B, which is a Firmicute in the domain Bacteria.
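
The "Identities" lines in the hit reports above are simple ratios of matching columns to alignment length. A minimal Python sketch of that arithmetic follows; the function names are made up for illustration, and the two strings are the final block (query 610-638) of the Methanosphaera stadtmanae alignment.

```python
def count_identities(query, subject):
    """Return (identical columns, total columns) for two aligned rows of equal length."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    # A gap character never counts as an identity.
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return matches, len(query)

def percent_identity(identities, alignment_length):
    """Percent identity rounded to a whole number, as shown in the hit reports."""
    return round(100 * identities / alignment_length)

# Final block of the W=7 hit (query positions 610-638)
query_block = "GAAATAGCTCCTGCATGGAGAGCAGAAGA"
sbjct_block = "GAAATAGGACAAATATTCAGAGCAGAAGA"
print(count_identities(query_block, sbjct_block))  # (21, 29)

# Totals reported for the two hits
print(percent_identity(77, 105))   # 73  (Pyrococcus abyssi, W=15)
print(percent_identity(105, 149))  # 70  (Methanosphaera stadtmanae, W=7)
```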

2. 5 pts total
(a)
[1pt]  Use BlastX to get the most sensitive search across domains of life. 

(b) [1 pt] Using these parameters, the best hit to the human genome is significant, with an E-value of 3e-62 and 36% identity.
[2 pt] We believe this is a true homolog because the E-value is much better than 1e-5, and the percent identity is over 25-30%.
>gb|AAX07827.1| cell proliferation-inducing protein 40 [Homo sapiens]
Length=501

GENE ID: 1615 DARS | aspartyl-tRNA synthetase [Homo sapiens]

Score = 238 bits (797), Expect = 3e-62
Identities = 166/454 (36%), Positives = 251/454 (55%), Gaps = 31/454 (6%)
Frame = +1

Query 19 VSELNPKLDGSEVKVAGWVHNVRNLGGKIFILLRDKSGIGQIVVEKGNNAYDKVI----N 186
V +L + V V VH R G + F++LR + Q +V G++A +++ N
Sbjct 15 VRDLTIQKADEVVWVRARVHTSRAKGKQCFLVLRQQQFNVQALVAVGDHASKQMVKFAAN 74

Query 187 IGLESTIVVNGVV-----KADARAPNGVEVHAKDIEILSYARSPLPLDVT---------- 321
I ES + V GVV K + VE+H + I ++S A LPL +
Sbjct 75 INKESIVDVEGVVRKVNQKIGSCTQQDVELHVQKIYVISLAEPRLPLQLDDAVRPEAEGE 134

Query 322 --GKVKADIDTrlrerlldlrrleMQAVLKIQSVAVKSFRETLYKHGFVEVFTPKIIASA 495
G+ + DTRL R++DLR QAV ++QS FRETL GFVE+ TPKII++A
Sbjct 135 EEGRATVNQDTRLDNRVIDLRTSTSQAVFRLQSGICHLFRETLINKGFVEIQTPKIISAA 194

Query 496 TEGGAQLFPVLYFGKEAFLAQSPQLYKELLAGA-IERVFEIAPAWRAEESDTPYHLSEFI 672
+EGGA +F V YF A+LAQSPQLYK++ A E+VF I P +RAE+S+T HL+EF+
Sbjct 195 SEGGANVFTVSYFKNNAYLAQSPQLYKQMCICADFEKVFSIGPVFRAEDSNTHRHLTEFV 254

Query 673 SMDVEMAFA-DYNDIMALIEQIIYNMINDVKRECENELKILNYTPP----NVRIPIKKVS 837
+D+EMAF Y+++M I + + ++ + E++ +N P P ++
Sbjct 255 GLDIEMAFNYHYHEVMEEIADTMVQIFKGLQERFQTEIQTVNKQFPCEPFKFLEPTLRLE 314

Query 838 YSDAIELLKSKGVNIKFGDDIGTPELRVLYNELKE----DLYFVTDWPWLSRPFYTKQKK 1005
Y +A+ +L+ GV + DD+ TP ++L + +KE D Y + +P RPFYT
Sbjct 315 YCEALAMLREAGVEMGDEDDLSTPNEKLLGHLVKEKYDTDFYILDKYPLAVRPFYTMPDP 374

Query 1006 DNPQLSESFDLIFRWLEIVSGSSRNHVKEVLENSLKVRGLNPESFEFFLKWFDYGMPPHA 1185
NP+ S S+D+ R EI+SG+ R H ++L G++ E + ++ F +G PPHA
Sbjct 375 RNPKQSNSYDMFMRGEEILSGAQRIHDPQLLTERALHHGIDLEKIKAYIDSFRFGAPPHA 434

Query 1186 GFGMGLARVMLMLTGLQSVKEVVPFPRDKKRLTP 1287
G G+GL RV ++ GL +V++ FPRD KRLTP
Sbjct 435 GGGIGLERVTMLFLGLHNVRQTSMFPRDPKRLTP 468


3.  8 points total
Grabbed this full-length protein from NCBI by clicking on the accession number link, and saved it as a FASTA file:
>gi|59803475|gb|AAX07827.1| cell proliferation-inducing protein 40 [Homo sapiens]
MPSASASRKSQEKPREIMDAAEDYAKERYGISSMIQSQEKPDRVLVRVRDLTIQKADEVVWVRARVHTSR
AKGKQCFLVLRQQQFNVQALVAVGDHASKQMVKFAANINKESIVDVEGVVRKVNQKIGSCTQQDVELHVQ
KIYVISLAEPRLPLQLDDAVRPEAEGEEEGRATVNQDTRLDNRVIDLRTSTSQAVFRLQSGICHLFRETL
INKGFVEIQTPKIISAASEGGANVFTVSYFKNNAYLAQSPQLYKQMCICADFEKVFSIGPVFRAEDSNTH
RHLTEFVGLDIEMAFNYHYHEVMEEIADTMVHIFKGLQERFQTEIQTVNKQFPCEPFKFLEPTLRLEYCE
ALAMLREAGVEMGDEDDLSTPNEKLLGHLVKEKYDTDFYILDKYPLAVRPFYTMPDPRNPKQSNSYDMFM
RGEEILSGAQRIHDPQLLTERALHHGIDLEKIKAYIDSFRFGAPPHAGGGIGLERVTMLFLGLHNVRQTS
MFPRDPKRLTP

Then used this sequence as a BLAT query against the human genome:

[3pts] (a) Coordinates: 136381359-136459508, chromosome 2, gene name: DARS (or Aspartyl-tRNA synthetase),
number of exons: 16 (lose one point for each item you get wrong)
 
[2pts](b) The 3' (left) end of the gene has a longer UTR, according to the RefSeq Genes track
[2pts](c) The most numerous class and family of repetitive element for this gene is: SINE, Alu.
There are also several long LINE elements in this gene, but they are not the most numerous repeats.

Do any of these repetitive elements overlap any of the predicted protein-coding exons?
[1 pt] No, none of the repeats directly overlap any of the protein-coding exons in this gene.

4. 7 points total
(a) How many have an initial E-value of 1e-5 or smaller in the first (blastp) iteration?
[1 pt] Only one

(b) Use the default inclusion cutoff, and run iteration 2.  How many new proteins could be found with an E-value better than 1e-5 on this iteration?
[1 pt] 10 sequences identified in the last round now have an evalue less than 1e-5
More than 50 other completely new sequences were found in this iteration!

How many iterations do you have to repeat until no new sequences can be found?
[1 pt] More than six iterations are required for PSI-Blast to converge on this very large extended protein family.
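
The stopping rule in this question (repeat until an iteration adds no new sequences, but give up after 6) can be sketched in Python as below. This is not PSI-BLAST itself: the search step is a toy stand-in, and both function names are hypothetical.

```python
def iterations_to_converge(search, max_iterations=6):
    """Run 'search' repeatedly; return (iterations run, whether it converged).

    'search' takes the set of sequences included so far and returns the
    set of hits for the next round.
    """
    included = set()
    for iteration in range(1, max_iterations + 1):
        new_hits = search(included) - included
        if not new_hits:
            return iteration, True      # nothing new: converged
        included |= new_hits
    return max_iterations, False        # hit the cap without converging

def make_toy_search(family):
    """Toy stand-in for one round: each round reaches one more member of 'family'."""
    def search(included):
        return set(family[: len(included) + 1])
    return search

small_family = ["p1", "p2", "p3"]
large_family = ["p%d" % i for i in range(1, 21)]
print(iterations_to_converge(make_toy_search(small_family)))  # (4, True)
print(iterations_to_converge(make_toy_search(large_family)))  # (6, False)
```

The second call behaves like the answer above: the cap of 6 is reached while new sequences are still turning up.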

(c) Click on the "Distance tree of results" at the bottom of the page to see a phylogenetic representation of all the hits found.  Is this protein unique to this species or just Crenarchaea, just Archaea, just Archaea & Bacteria, or is it found in all three domains of life?
[2 pts] This protein is found in Archaea (Crenarchaea) and Bacteria, but not in Eukarya.

(d) Based on the evalue and percent identity against the top scoring hit of newly-found sequences,
do you think these hit sequences are orthologs (same function), paralogs (related function), or have unrelated function?

[2 pts] Based on the significant E-value (1e-77), you might expect the top hits to be orthologs. However, the percent identity for these top hits, 12-13%, is extremely low, so you might say this protein is a paralog or a very distant homolog with unrelated function.  Either of these answers is acceptable if justified properly.
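
The two rules of thumb used in this answer key (E-value better than 1e-5, percent identity over roughly 25%) can be written as a small Python check. This is only a sketch: the function name, the labels it returns, and the choice of 25% as the default identity cutoff are assumptions, and real homology calls need more evidence than two numbers.

```python
def homology_evidence(evalue, percent_identity, evalue_cutoff=1e-5, identity_cutoff=25.0):
    """Summarize which of the two rule-of-thumb criteria a hit satisfies."""
    significant = evalue < evalue_cutoff
    similar = percent_identity >= identity_cutoff
    if significant and similar:
        return "both criteria met"
    if significant:
        return "significant E-value, low identity"
    if similar:
        return "identity above cutoff, E-value not significant"
    return "neither criterion met"

# Question 2: human DARS hit (E-value 3e-62, 36% identity)
print(homology_evidence(3e-62, 36))    # both criteria met
# Question 4(d): top PSI-BLAST hits (E-value 1e-77, 12-13% identity)
print(homology_evidence(1e-77, 12.5))  # significant E-value, low identity
```

The second case is why either "paralog" or "very distant homolog" is accepted for 4(d): the two criteria disagree.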