BME 110 Computational Biology Tools
Homework 3 / Study Section Practice Questions
Use resources discussed in class, such as Rfam, ClustalW,
Jalview, mfold, tRNAscan-SE, and snoscan, to analyze these sequences.
Two groups of sequences appear at the bottom of the page: use SeqA, SeqB,
SeqC, etc. for the Study Section Practice Questions, and Seq1, Seq2, Seq3,
etc. for the homework problems. You will not get
homework credit if you turn in answers for the study section sequences.
Using ClustalW to Align Proteins & Jalview
1. Use BlastP with SeqA/Seq1 to collect sequences for an
alignment. Search the NR database with Expect threshold = 1, Word size = 2,
and the BLOSUM62 scoring matrix, leaving all other settings at their
defaults. Select the top 12 hits (including the identity self-hit) by
checking their boxes, click "Get selected Sequences" just
below the summary list of hits, and save them to a FASTA file.
Open the file in a text editor and change the names to something
helpful: use the first letter of the genus, the first four letters of the
species, and any other identifying gene name information (e.g. "Pcali-Cas1"
or "Paero-PAE0200").
Go to the ClustalW Web site and enter your 12 sequences. Use the defaults,
but make BLOSUM your scoring matrix (Interactive,
Alignment=full, score-type=percent, matrix=blosum). Leave the
phylogenetic tree options off. Submit.
(a) From the score table, which two sequences were most similar among
the 12 (you can sort by score/identity; give the species
name/abbreviation)? Which were most distant?
(b) Save your alignment file (*.aln) so you can load it in the
stand-alone version of Jalview. Give the "Alignment score" (at
the top of the ClustalW page) and paste in your Guide Tree file (*.dnd)
so we know how your tree construction looks. Remember, you
should have 12 sequences.
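The renaming step in problem 1 can also be scripted instead of done by hand. The following is a minimal Python sketch, assuming each FASTA header ends with the organism name in square brackets (a common NCBI layout, but check your own file). The function names, the accession numbers, and the two example records are hypothetical and made up for illustration; they are not real BLAST hits.

```python
import re

def short_name(header, gene=""):
    """Build a label like 'Pcali-Cas1' from a FASTA header.

    Assumption: the organism appears in square brackets, e.g.
    '>ID some description [Genus species strain]'.
    """
    match = re.search(r"\[([A-Za-z]+) ([a-z]+)[^\]]*\]", header)
    if match is None:
        # No organism found: fall back to the first word (the accession).
        return header.lstrip(">").split()[0]
    genus, species = match.group(1), match.group(2)
    label = genus[0].upper() + species[:4]
    return f"{label}-{gene}" if gene else label

def rename_fasta(text, genes=None):
    """Return FASTA text with every header replaced by a short label.

    genes: optional dict mapping accession -> gene name to append.
    """
    genes = genes or {}
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            accession = line[1:].split()[0]
            out.append(">" + short_name(line, genes.get(accession, "")))
        else:
            out.append(line)
    return "\n".join(out)

# Made-up records for illustration only.
example = """>HYPO_0001 Cas1 protein [Pyrobaculum calidifontis JCM 11548]
MKTAYIAKQR
>HYPO_0002 hypothetical protein PAE0200 [Pyrobaculum aerophilum str. IM2]
MSLNWFAKQT"""

print(rename_fasta(example, {"HYPO_0001": "Cas1", "HYPO_0002": "PAE0200"}))
```

Short, unique names matter because ClustalW and Jalview display only the start of each sequence name, so check that no two of your 12 labels collide.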
Using Phylogenetic Trees
2. Load your alignment file (*.aln) from problem 1 into
Jalview. Color the alignment by ClustalX colors. The
sequences are already aligned, so there is no need to recalculate the
alignment. Normally, at this point we would remove all the
columns that have gaps in the majority of sequences, but to make sure
everyone gets the same final answer, we will skip this step.
(a) There are two ways of creating trees for proteins in Jalview:
Under Calculate->Calculate Tree, you can use either Average Distance
using BLOSUM62 or Neighbor Joining Using BLOSUM62. Run both tree
building programs. Did they give you the same tree? If not,
which species moved relative to the others? Use "File->Save
As->PNG" in the tree window to save each of your two trees, and
paste them into your document.
(b) Now, you'd like to compare this single protein family tree against the
"species tree" which is based on ribosomal RNA. You have a
ribosomal RNA-based species tree for the archaea at archaea.ucs.edu and
here. Relative to your original sequence,
label the other 11 members of the tree as either "Ortholog", "Paralog",
or "Xenolog".
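To see why the two tree methods in problem 2(a) can disagree, it may help to look at how average-distance clustering works. The sketch below is a toy UPGMA implementation in Python, on the assumption that Jalview's "Average Distance" option corresponds to UPGMA. The four leaf names and their distances are invented for illustration; this is not Jalview's code and it does not compute BLOSUM62 distances.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster leaves by average distance and return a Newick-style string.

    names: list of leaf labels
    dist:  dict mapping frozenset({a, b}) -> distance between leaves a and b
    """
    # Each cluster is (newick_text, tuple_of_leaf_names).
    clusters = [(n, (n,)) for n in names]

    def avg(c1, c2):
        # Mean of all leaf-to-leaf distances between the two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append((f"({c1[0]},{c2[0]})", c1[1] + c2[1]))
    return clusters[0][0] + ";"

# Invented distances for four hypothetical sequences.
names = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
    frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 10,
}
print(upgma(names, dist))
```

Average-distance clustering always joins the closest pair first, which implicitly assumes all lineages evolve at the same rate. Neighbor joining corrects for unequal rates, so a fast-evolving sequence is the most likely one to change position between your two trees.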
Practice with RNA Analysis
3. For SeqB/Seq2, use the tools discussed in class (Rfam, tRNAscan-SE,
mfold, snoscan, BlastX) to:
(a) identify what type of RNA gene it is (show the output
of the program that identified it),
(b) if this RNA interacts with other RNAs in the cell (either by base
pairing or catalytically), give the name of those RNA(s). If it is
a tRNA, give the type of amino acid it carries and its anticodon (e.g.
Proline UGG).
(c) run "mfold" on the sequence: turn on "p-num" for the "Choose structure
annotation" option, and for "Enter the percent suboptimality number" enter
10. Give the free energy score and paste the most stable
structure into your homework. Based on the p-num color
scheme and the coloring of the majority of base pairings, do you
think this is a highly probable secondary structure (where highly probable
means >98% probability for more than half the base pairs)?
(d) Use the "Compare Structures" function to compare the differences
between the optimal structure and the second most optimal structure (if
it exists). How many base pairings are not shared (unique)
between the structures? Does this outnumber the
base pairings that are shared?
If this is a tRNA, does it match the structure predicted by tRNAscan-SE?
(e) Use the Biology Workbench to randomly shuffle the sequence once
(RANDSEQ; use "Uniform" for the Randomization Method), and
then run "mfold" on the shuffled sequence, and give the top 3 free
energy scores [kcal/mol]. By how much are these shuffled sequence scores
more stable (more negative) or less stable (less negative) relative
to the optimal original sequence free energy score? Give the
shuffled sequence and the optimal shuffled sequence structure in your
write-up.
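The idea behind part (e) is that a shuffled sequence keeps the same base composition but loses any evolved structure, so its folding energy serves as a baseline. The Python sketch below illustrates that idea, assuming the shuffle is a simple random permutation of the bases; whether RANDSEQ's "Uniform" method works exactly this way is an assumption. The energy values are placeholders, not mfold results, and the function names are hypothetical.

```python
import random
from collections import Counter

def shuffle_sequence(seq, seed=None):
    """Return a random permutation of seq with the same base composition."""
    rng = random.Random(seed)  # a seed makes the shuffle repeatable
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def energy_differences(original_dg, shuffled_dgs):
    """Shuffled minus original free energy, in kcal/mol.

    A positive difference means the shuffled fold is less stable
    (less negative) than the original; negative means more stable.
    """
    return [round(dg - original_dg, 2) for dg in shuffled_dgs]

# First line of SeqB from the study section.
seq = "GCCAGGTTGGCCGAGCGGTCTAAGGCGCCAGATTTAAGCTCTGGTTCTCG"
shuffled = shuffle_sequence(seq, seed=1)
print(shuffled)
print(Counter(seq) == Counter(shuffled))  # True: composition is preserved

# Placeholder energies for illustration only, not mfold output.
print(energy_differences(-30.0, [-18.5, -17.9, -17.0]))
```

For the homework itself, use the Biology Workbench as instructed and report the energies mfold actually gives you.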
4. Answer the same questions as in problem #3, but for SeqC/Seq3.
5. Answer the same questions as in problem #3, but for SeqD/Seq4.
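Part (d) of problems 3 through 5 asks for the number of shared and unique base pairings between two structures. The counting can be sketched in Python as below, assuming each structure has been written in dot-bracket notation (matching parentheses for paired bases, dots for unpaired ones). That input format is an assumption, since mfold may give you the structures in a different form, and the two example structures are invented.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) paired positions in a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return pairs

def compare_structures(a, b):
    """Return (shared, unique) base-pair counts for two structures."""
    pairs_a, pairs_b = base_pairs(a), base_pairs(b)
    shared = pairs_a & pairs_b   # pairs present in both structures
    unique = pairs_a ^ pairs_b   # pairs present in exactly one structure
    return len(shared), len(unique)

# Invented structures of equal length, for illustration only.
optimal    = "((((....))))..((...))"
suboptimal = "((((....)))).(.....)."

shared, unique = compare_structures(optimal, suboptimal)
print(f"shared: {shared}, unique: {unique}")  # shared: 4, unique: 3
```

Here the first stem is common to both structures, while the second stem differs, so the shared pairings outnumber the unique ones.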
Study Section Practice Question Sequences:
>SeqA (study section)
MQIAVASYGTRIRTRKGLLVVERGGERREYPLHQVDEVFILTGGVSITSRALRALLRAGAVVAVFDQRGE
PLGIFMRPVGDATGEKRRCQYAAAAGGRGLQWAREWVWKKMRGQLQNVKAWRRRLAHYGDYVEQIGRALE
ALRAAASPGEVMEAEAAAAEAYWRAYGEVTGFPGRDQEGGDPVNAALNYGYGVLKALCFKSILLAGLDPY
VGFLHVDKSGRPSLVLDFMEQWRPRVDAVVAKVAGELATENGLLDHKSRLRVAAAVLEELGAGARPVSAE
IHREARALARAICT
>SeqB (study section)
GCCAGGTTGGCCGAGCGGTCTAAGGCGCCAGATTTAAGCTCTGGTTCTCG
AGAGAGAGCGTGGGTTCGAACCCCACACCTGGCATCCGGACCAGACCGAG
>SeqC (study section)
CCTGGCGGCCTTAGCGCGGTGGTCCCACCTGACCCCATGCCGAACTCAGA
AGTGAAACGCCGTAGCGCCGATGGTAGTGTGGGGTCTCCCCATGCGAGAG
TAGGGAACTGCCAGGC
>SeqD (study section)
CACACTCTCTACGTCTCGTTTTAAACCGTCGTTGCCACGCAGAGGCATGT
ATCTGCGCTCCGAGGAGTTGGCCAAATTTTCCGTGAAGAGAAAGAAGCTA
GCTACAAAACGTGCGTCCACGCGCTCTAGGGGGAAACTCATTGGGGGAGG
GGATGATGAACGACTTGGGAGGGGTCCCTGCCCGAAACCAGGCACGCTGA
CCCACCCCCTCCCCATGTGAACTACTTATATGTTTAGAAAAACTTTTCTC
CGCCGGGCCGCAATTTCACAAACATTGGGACAGGTATTTCCCACCACAAT
CGCTTTAAGCAGAGCCGTGCGCTAATTCCAGCCTTAGCGCGCCTCCTTTA
AC
Homework Problem Sequences:
>Seq1
MNWFLKSFALLHDPPHKALWFEDADYRVYDKNSHEEEAAKLFDEIFKGMG
FGGVPGSEESAAVAEADRLAASFDRWALPPSPGNYWVKAKYMINPFGRER
REIKKPTPQEFRERLGKFVGEVRRVLEAAKGDERWLYFAFYAAYELAWIK
AGLPTLPADTRVPTHSIFDHLYATASMLNWTRGGGCLMVVDLPGIQSILK
SARKAGDYRAGSLLVSLVAWRVAWRFMERHGPDILLSPTPRFNPLFYAQL
DRRVKGVWGLYAGAMSYYVGKKFGVSQQLPVDEWLHGLIRRSSLFPGTLY
LALPECKGEDEAYSYFDEVLKEVLEAARGGEAPNLPFDLGGLDDGVKRVV
EAGLKGLESVRYLPIRVAVAYVEEGLRNLEKWTTERFGIDDEVKKLLGGD
LRRFLFARLLEMVNERKRRALPKYPSWFDKEGRPRFSQTYKGTWMHSSLD
PSQPAVVKFGGVFKDGNLTYDDETHSWLKSLGIEEKKDANLTKVFKPKEA
LGPVDLIKRALYLRSAKRAGIDSVEVVALNYYYLKHYFDSERCREIKGLV
ERVLDGEDVEDVFGSSEAADRRLAECKGAGEEPWTPGLEYVVIRADGDNV
GKLLRGCLPKEPQMPQDVEVVRDREQFEKDMKHALRVLGAMRETAKHICG
GGYLVVPSPAYYAAVSAALMVTAIGDAAIVEKSQGELVFAGGDDLLAFSA
KPPSFDIVKETRENYWGEGGFHSLSESYFLPALTAYGRSYSLRVAHAVTD
MMSVEVDKAAELLDEAKDRVPGKDALAISTSTGHVGFTKVSAVGSVKAIA
EAYARRTLGRNLPYDVEAWGEAAEVEYVLRYLVGRNTDKKELADKVVEAA
CYVDGRGEKWKNAVELLKALRAWI
>Seq2
GGGGATGTTTTGGATTTGCCAGGTAACGAAATTTATATAGCATATCTTGGTTACGCAAGT
AAAAACGTCGAGGTTATAATAACCGGAAATAACAAACAAACTGTAACAAATACACAAGAT
TTTGCTGGACAAACTCCAGTTTATCAAATGAATTTTGCAAACAGTTTTTCTTCACAATTA
GCTTTTGCTTAAATAAATAAGCAAAAACCGCACTTAAATTAACTTACGATTTAAGGAAGG
CAATAAGGTAAGTTTTGCTAATTGGGTTTGTCTTTGCTCTTTTAGCAAATATCAAAAAGA
CTAGTAATTTATGGATTTTGTTTGTTTTTTCATAAATTGCGAACTTACAAAATAAACTAA
ATATGTAGAATATATAAATTAAAGTGATTTTGTACATGGGTTCGACTCCCATCATCTCCA
CCA
>Seq3
GCCCAGGTGGCTCAGCGGTTTAGCGCCGCCTTCAGCCCAGGGTGTGATCCTGGAGACCTG
GGATCGAGTCCCACGTCGGGCT
>Seq4
GGCGACAAGATTGCTAAGGAAGAAAACAAATAAAGCAATTAGATATAAGC
GCTTCCGCTCAATCTTCGCAGTCAATGCCAAAAGCAGCGGCCCAGATACT
GCATATCCCAGCGCAAACACACTGATCAGCTGTCCGGCTGAAACAATGGA
AATGTCTAAATCATTTGCGATCTGAGGAAGAATTCCCCCCACAATTAACT
CAACCAATCCGACTGCAATCGTAGAAGCTGCAAGCAGGAAAACTTTGAAA
TTCATAACAAACTCCTTTACTTAAATGTTTTGATAAATAAAAAAAATCCT
GATTACAAAAAATGTCATAAACAAATTTTGTAATCAGGATTTTACGGTTC
CTGGTAGACACCCTCAAACCATATTATTGAGGTTATACAAGTGATAATAG
CTATTTAATTGAT