BME 110 Computational Biology Tools

Homework 3 / Study Section Practice Questions

Use resources discussed in class, such as Rfam, ClustalW, Jalview, mfold, tRNAscan-SE, and snoscan, to analyze these sequences.
Use the two groups of sequences at the bottom of the page for the Study Section Practice Questions (SeqA, SeqB, SeqC, etc.) and for the homework problems (Seq1, Seq2, Seq3, etc.).  You will not get homework credit if you turn in answers for the study section sequences.

Using ClustalW and Jalview to Align Proteins

1.  Use BlastP with SeqA/Seq1 to collect sequences for an alignment.  Search the NR database with Expect threshold = 1, Word size = 2, and the BLOSUM62 scoring matrix, leaving all other settings at their defaults.  Select the top 12 hits (including the identity self-hit) by checking their boxes, click "Get selected Sequences" just below the summary list of hits, and save them to a FASTA file.  Open the file in a text editor and change the names to something helpful: use the first letter of the genus, the first four letters of the species, and any other identifying gene name information (e.g., "Pcali-Cas1" or "Paero-PAE0200").

Go to the ClustalW Web site and enter your 12 sequences.  Use the defaults, except make BLOSUM your scoring matrix (Interactive, Alignment=full, score-type=percent, matrix=blosum).  Leave the phylogenetic tree options off.  Submit.

(a) From the score table, which two sequences were most similar among the 12 (you can sort by score/identity; give the species name/abbreviation)?
Which were most distant?
(b) Save your alignment file (*.aln) so you can load it in the stand-alone version of Jalview.  Give the "Alignment score" (at the top of the ClustalW page) and paste in your Guide Tree file (*.dnd) so we know how your tree construction looks.  Remember, you should have 12 sequences.
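
The renaming step in problem 1 can also be scripted rather than done by hand.  The following is a minimal Python sketch, assuming each FASTA header ends with a single organism name in square brackets, as NCBI protein headers usually do.  The function names short_name and rename_fasta, the accessions, the organism assignments, and the sequences in the example are all made-up placeholders, not real BlastP output.  Headers that list several organisms would need extra handling.

```python
import re

def short_name(header, gene=""):
    """Build a label such as 'Pcali' or 'Pcali-Cas1' from one FASTA header.

    Assumption: the header ends with '[Genus species ...]'.
    """
    match = re.search(r"\[([A-Za-z]+)\s+([A-Za-z]+)[^\]]*\]\s*$", header)
    if match is None:
        raise ValueError(f"no [Genus species] found in: {header}")
    genus, species = match.group(1), match.group(2)
    # First letter of the genus plus the first four letters of the species.
    label = genus[0].upper() + species[:4].lower()
    return f"{label}-{gene}" if gene else label

def rename_fasta(text, genes=None):
    """Return FASTA text with every header replaced by its short label.

    genes is an optional list of gene names, one per record, in file order.
    """
    out = []
    index = 0
    for line in text.splitlines():
        if line.startswith(">"):
            gene = genes[index] if genes else ""
            out.append(">" + short_name(line[1:], gene))
            index += 1
        else:
            out.append(line)
    return "\n".join(out)

# Placeholder records for illustration only.
example = """>XXX00001.1 hypothetical protein [Pyrobaculum calidifontis JCM 11548]
MKTAYIAKQR
>XXX00002.1 hypothetical protein [Pyrobaculum aerophilum str. IM2]
MKTAYLAKQR"""

print(rename_fasta(example, genes=["Cas1", "PAE0200"]))
```

Short, unique names matter because they are what you will see as row labels in the ClustalW score table and in the Jalview trees.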

Using Phylogenetic Trees


2. Load your alignment file (*.aln) from problem 1 into Jalview.  Color the alignment by ClustalX colors.  The sequences are already aligned, so there is no need to recalculate the alignment.  Normally, at this point we would remove all the columns that have gaps in the majority of sequences, but to make sure everyone gets the same final answer, we will skip this step.
(a) There are two ways of creating trees for proteins in Jalview:  Under Calculate->Calculate Tree, you can use either Average Distance using BLOSUM62 or Neighbor Joining Using BLOSUM62.  Run both tree building programs.  Did they give you the same tree?  If not, which species moved relative to the others?  Use "File->Save As->PNG" in the tree window to save each of your two trees, and paste them into your document. 
(b) Now, you'd like to compare this single protein family tree against the "species tree", which is based on ribosomal RNA.  You have a ribosomal RNA-based species tree for the archaea at archaea.ucs.edu and here.  Relative to your original sequence, label the other 11 members of the tree as either "Ortholog", "Paralog", or "Xenolog".
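
To see why the two methods in part (a) can disagree, it may help to look at what average-distance clustering does.  Below is a minimal Python sketch of UPGMA, which is how Jalview's Average Distance option is generally described.  It runs on a made-up distance matrix rather than on BLOSUM62 scores from your alignment, and the function name upgma and the labels A through D are illustrative assumptions.  UPGMA assumes all lineages change at the same rate, whereas Neighbor Joining does not, so a fast-evolving sequence may be placed differently by the two methods.

```python
def upgma(labels, dist):
    """Cluster by average distance and return the tree as a nested string.

    labels: list of sequence names.
    dist: dict mapping a pair (a, b) to the distance between a and b.
    """
    sizes = {name: 1 for name in labels}            # leaves per cluster
    d = {frozenset(k): v for k, v in dist.items()}  # unordered pairs
    while len(sizes) > 1:
        # Closest pair of current clusters (ties broken by name).
        a, b = min(
            (tuple(sorted(pair)) for pair in d),
            key=lambda p: (d[frozenset(p)], p),
        )
        merged = f"({a},{b})"
        for other in sizes:
            if other in (a, b):
                continue
            # Size-weighted mean of the distances from the two merged clusters.
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        # Drop every distance that still involves the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))

# Toy distances, invented for illustration.
toy = {
    ("A", "B"): 2,
    ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], toy))  # (((A,B),C),D)
```

The sketch reports only the branching order, not branch lengths, which is enough to check which sequences group together.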

Practice with RNA Analysis


3.  For SeqB/Seq2, use the tools discussed in class (Rfam, tRNAscan-SE, mfold, snoscan, BlastX) to:

(a) identify what type of RNA gene it is (show the output of the program that identified it),

(b) if this RNA interacts with other RNAs in the cell (either by base pairing or catalytically), give the name of those RNA(s).  If it is a tRNA, give the type of amino acid it carries and its anticodon (e.g., Proline UGG).

(c) run "mfold" on the sequence: for the "Choose structure annotation" option turn on "p-num", and for "Enter the percent suboptimality number" enter 10.  Give the free energy score and paste the most stable structure into your homework.  Based on the p-num color scheme and the coloring of the majority of base pairings, do you think this is a highly probable secondary structure (where highly probable means >98% probability for more than half the base pairs)?

(d) Use the "Compare Structures" function to compare the differences between the optimal structure and the second most optimal structure (if it exists).  How many base pairings are unique (not shared) between the structures?  Does this outnumber the base pairings that are shared?
If this is a tRNA, does it match the structure predicted by tRNAscan-SE?

(e) Use the Biology Workbench to randomly shuffle the sequence once (RANDSEQ; use "Uniform" for the Randomization Method), then run "mfold" on the shuffled sequence and give the top 3 free energy scores [kcal/mol].  By how much are these shuffled sequence scores more stable (more negative) or less stable (less negative) than the optimal free energy score of the original sequence?  Give the shuffled sequence and the optimal shuffled sequence structure in your write-up.
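
The shuffling step in part (e) can be reproduced locally if you want to try more than one shuffle.  This is a minimal Python sketch, assuming a shuffle means a random permutation of the bases that keeps the original base composition; whether the "Uniform" method in RANDSEQ does exactly this is an assumption.  The function name shuffle_sequence and the seed value are illustrative, and the sequence used is SeqB from the study section.

```python
import random
from collections import Counter

def shuffle_sequence(seq, seed=None):
    """Return a random permutation of seq with the same base composition."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

seq_b = (
    "GCCAGGTTGGCCGAGCGGTCTAAGGCGCCAGATTTAAGCTCTGGTTCTCG"
    "AGAGAGAGCGTGGGTTCGAACCCCACACCTGGCATCCGGACCAGACCGAG"
)

shuffled = shuffle_sequence(seq_b, seed=110)
print(shuffled)
# Same bases in the same amounts, only the order differs.
print(Counter(shuffled) == Counter(seq_b))
```

Because the composition is preserved, any difference in free energy between the original and the shuffled sequence reflects the order of the bases rather than GC content.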

4. Answer the same questions as in problem #3, but for SeqC/Seq3.

5. Answer the same questions as in problem #3, but for SeqD/Seq4.
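
For part (d) of problems 3 through 5, the shared and unique base pairs can be counted in code as a check on what you read off the "Compare Structures" output.  This is a minimal Python sketch, assuming each structure has been written out in dot-bracket notation (converting from whatever format mfold gives you is left to you).  The function names and the two toy structures are made up for illustration.

```python
def base_pairs(dot_bracket):
    """Return the set of paired positions (i, j), 1-based, in a dot-bracket string."""
    stack = []
    pairs = set()
    for pos, char in enumerate(dot_bracket, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.add((stack.pop(), pos))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs

def compare_structures(first, second):
    """Return (number of shared pairs, number of pairs found in only one structure)."""
    a, b = base_pairs(first), base_pairs(second)
    return len(a & b), len(a ^ b)

# Toy structures, invented for illustration.
optimal = "((((....))))"
suboptimal = "(((.(...))))"

shared, unique = compare_structures(optimal, suboptimal)
print(shared, unique)  # 3 2
```

Here the two structures share three pairs and differ in two, so the shared pairs outnumber the unique ones.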


Study Section Practice Question Sequences:
>SeqA (study section)
MQIAVASYGTRIRTRKGLLVVERGGERREYPLHQVDEVFILTGGVSITSRALRALLRAGAVVAVFDQRGE
PLGIFMRPVGDATGEKRRCQYAAAAGGRGLQWAREWVWKKMRGQLQNVKAWRRRLAHYGDYVEQIGRALE
ALRAAASPGEVMEAEAAAAEAYWRAYGEVTGFPGRDQEGGDPVNAALNYGYGVLKALCFKSILLAGLDPY
VGFLHVDKSGRPSLVLDFMEQWRPRVDAVVAKVAGELATENGLLDHKSRLRVAAAVLEELGAGARPVSAE
IHREARALARAICT
>SeqB (study section)
GCCAGGTTGGCCGAGCGGTCTAAGGCGCCAGATTTAAGCTCTGGTTCTCG
AGAGAGAGCGTGGGTTCGAACCCCACACCTGGCATCCGGACCAGACCGAG
>SeqC (study section)
CCTGGCGGCCTTAGCGCGGTGGTCCCACCTGACCCCATGCCGAACTCAGA
AGTGAAACGCCGTAGCGCCGATGGTAGTGTGGGGTCTCCCCATGCGAGAG
TAGGGAACTGCCAGGC
>SeqD (study section)
CACACTCTCTACGTCTCGTTTTAAACCGTCGTTGCCACGCAGAGGCATGT
ATCTGCGCTCCGAGGAGTTGGCCAAATTTTCCGTGAAGAGAAAGAAGCTA
GCTACAAAACGTGCGTCCACGCGCTCTAGGGGGAAACTCATTGGGGGAGG
GGATGATGAACGACTTGGGAGGGGTCCCTGCCCGAAACCAGGCACGCTGA
CCCACCCCCTCCCCATGTGAACTACTTATATGTTTAGAAAAACTTTTCTC
CGCCGGGCCGCAATTTCACAAACATTGGGACAGGTATTTCCCACCACAAT
CGCTTTAAGCAGAGCCGTGCGCTAATTCCAGCCTTAGCGCGCCTCCTTTA
AC

Homework Problem Sequences:

>Seq1
MNWFLKSFALLHDPPHKALWFEDADYRVYDKNSHEEEAAKLFDEIFKGMG
FGGVPGSEESAAVAEADRLAASFDRWALPPSPGNYWVKAKYMINPFGRER
REIKKPTPQEFRERLGKFVGEVRRVLEAAKGDERWLYFAFYAAYELAWIK
AGLPTLPADTRVPTHSIFDHLYATASMLNWTRGGGCLMVVDLPGIQSILK
SARKAGDYRAGSLLVSLVAWRVAWRFMERHGPDILLSPTPRFNPLFYAQL
DRRVKGVWGLYAGAMSYYVGKKFGVSQQLPVDEWLHGLIRRSSLFPGTLY
LALPECKGEDEAYSYFDEVLKEVLEAARGGEAPNLPFDLGGLDDGVKRVV
EAGLKGLESVRYLPIRVAVAYVEEGLRNLEKWTTERFGIDDEVKKLLGGD
LRRFLFARLLEMVNERKRRALPKYPSWFDKEGRPRFSQTYKGTWMHSSLD
PSQPAVVKFGGVFKDGNLTYDDETHSWLKSLGIEEKKDANLTKVFKPKEA
LGPVDLIKRALYLRSAKRAGIDSVEVVALNYYYLKHYFDSERCREIKGLV
ERVLDGEDVEDVFGSSEAADRRLAECKGAGEEPWTPGLEYVVIRADGDNV
GKLLRGCLPKEPQMPQDVEVVRDREQFEKDMKHALRVLGAMRETAKHICG
GGYLVVPSPAYYAAVSAALMVTAIGDAAIVEKSQGELVFAGGDDLLAFSA
KPPSFDIVKETRENYWGEGGFHSLSESYFLPALTAYGRSYSLRVAHAVTD
MMSVEVDKAAELLDEAKDRVPGKDALAISTSTGHVGFTKVSAVGSVKAIA
EAYARRTLGRNLPYDVEAWGEAAEVEYVLRYLVGRNTDKKELADKVVEAA
CYVDGRGEKWKNAVELLKALRAWI
>Seq2
GGGGATGTTTTGGATTTGCCAGGTAACGAAATTTATATAGCATATCTTGGTTACGCAAGT
AAAAACGTCGAGGTTATAATAACCGGAAATAACAAACAAACTGTAACAAATACACAAGAT
TTTGCTGGACAAACTCCAGTTTATCAAATGAATTTTGCAAACAGTTTTTCTTCACAATTA
GCTTTTGCTTAAATAAATAAGCAAAAACCGCACTTAAATTAACTTACGATTTAAGGAAGG
CAATAAGGTAAGTTTTGCTAATTGGGTTTGTCTTTGCTCTTTTAGCAAATATCAAAAAGA
CTAGTAATTTATGGATTTTGTTTGTTTTTTCATAAATTGCGAACTTACAAAATAAACTAA
ATATGTAGAATATATAAATTAAAGTGATTTTGTACATGGGTTCGACTCCCATCATCTCCA
CCA
>Seq3
GCCCAGGTGGCTCAGCGGTTTAGCGCCGCCTTCAGCCCAGGGTGTGATCCTGGAGACCTG
GGATCGAGTCCCACGTCGGGCT 
>Seq4
GGCGACAAGATTGCTAAGGAAGAAAACAAATAAAGCAATTAGATATAAGC
GCTTCCGCTCAATCTTCGCAGTCAATGCCAAAAGCAGCGGCCCAGATACT
GCATATCCCAGCGCAAACACACTGATCAGCTGTCCGGCTGAAACAATGGA
AATGTCTAAATCATTTGCGATCTGAGGAAGAATTCCCCCCACAATTAACT
CAACCAATCCGACTGCAATCGTAGAAGCTGCAAGCAGGAAAACTTTGAAA
TTCATAACAAACTCCTTTACTTAAATGTTTTGATAAATAAAAAAAATCCT
GATTACAAAAAATGTCATAAACAAATTTTGTAATCAGGATTTTACGGTTC
CTGGTAGACACCCTCAAACCATATTATTGAGGTTATACAAGTGATAATAG
CTATTTAATTGAT