CMP243 Lecture 1, Lydia Gregoret
Cmp243 Lecture 1 Sept. 30, 1996
Spiel about why to learn the background
Our ultimate goal is for you guys to be well-versed in not only sequence
analysis and the theory behind that, but have some inkling of what the
sequences you are looking at actually represent. Especially if you are
coming from a computer science background, you will want to be able to
converse with biologists, understand what kinds of problems are faced
by biologists, and ultimately, make intelligent suggestions of what
experiments to try. On the one hand, it's sort-of cute if the computer geek
can't remember that the different letters of the alphabet represent amino
acids, but in the end it undermines your credibility. I hope you will take
away from the next two lectures some of the reasons biologists and
biochemists are interested in studying macromolecules in the first place.
For those of you with a bio background, some of today's lecture will be
review, but I hope it will put you in a frame of mind to look at
macromolecules from the sequence perspective instead of the functional
"blobology" you may be more accustomed to.
Introduction to biological macromolecules
In this course, you will be dealing with a lot of sequence data. There are 4
major classes of macromolecules in cells: carbohydrates, lipids,
proteins, and nucleic acids.
Lipids: (fat-based) biological membranes, energy storage
Carbohydrates: (sugar-based) energy storage, structural, cell-cell
communication
Nucleic acids:
    DNA: the genetic material
    RNA: intermediary between DNA and proteins, enzymatic function
Proteins: structural, enzymes, hormones, immune system, cell-cell
communication, signalling - the WORKHORSES of the CELL
In this course, we will focus on nucleic acids (DNA, RNA) and
proteins, because these are the macromolecules with sequence information
that gets deposited in databases.
Structure of Nucleic Acids
As you probably know, there are 4 building blocks for DNA and 4 for RNA
that are called nucleotides. The two sets are almost the same: they differ
in the sugar, and DNA uses T where RNA uses U.
(overhead of nucleotide structures)
RNA has a ribose sugar, therefore RiboNucleic Acid. DNA is missing an
OH group from the ribose sugar, therefore DeoxyriboNucleic Acid. Within
a group (ribo or deoxyribo) the sugar (ribose) and phosphate parts are the
same, but the bases differ in the four kinds of nucleotides. The word
"base" is the opposite of "acid". The base parts have lots of nitrogens and
are derivatives of ammonia (NH3), a basic (alkaline) substance.
Purines and pyrimidines: refer to the chemical group that the bases belong
to. Purines (A, G) have two rings and pyrimidines (C, T, U) have one.
In RNA or DNA, a purine always base pairs with a pyrimidine and vice
versa. More specifically, A pairs with T (or U in RNA) and C pairs with
G. Why is this so? Because A and T (U) make two hydrogen bonds and G
and C make three.
(overhead of H-bonding in nucleotides)
These are the so-called Watson-Crick base pairs. Other, non-Watson-Crick
pairings are also possible (other combinations, other orientations).
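The pairing rules above are simple enough to write down as a lookup table. The following is only a minimal sketch in Python; the names DNA_PARTNER, H_BONDS, reverse_complement and is_watson_crick are made up for illustration. It assumes the sequence contains only the letters A, C, G, T (or U for the pair check).

```python
# Watson-Crick partners for DNA; in RNA, U takes the place of T.
DNA_PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hydrogen bonds per Watson-Crick pair, as stated in the lecture.
H_BONDS = {frozenset("AT"): 2, frozenset("AU"): 2, frozenset("GC"): 3}


def reverse_complement(dna):
    """Return the strand that would base pair with `dna`.

    The two strands of a double helix run in opposite directions,
    so the partner strand is the complement read backwards.
    """
    return "".join(DNA_PARTNER[base] for base in reversed(dna.upper()))


def is_watson_crick(x, y):
    """True if bases x and y form a Watson-Crick pair (DNA or RNA)."""
    return frozenset((x.upper(), y.upper())) in H_BONDS


print(reverse_complement("GATTACA"))
print(is_watson_crick("A", "U"), is_watson_crick("G", "U"))
```

Note that a G paired with a U is reported as not Watson-Crick here; it is one of the "other combinations" mentioned above.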
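What is a hydrogen bond anyway? It is a weak attraction, much weaker than a
covalent bond. By analogy: separating a hydrogen bond is like pulling two
magnets apart, while breaking a covalent bond (e.g. C-C) is like breaking
a magnet in half.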
Why is it important?
It is important to remember who base pairs with whom. It can be used
as the basis for prediction. For example, observation of covariation
in RNA molecules implies conservation of secondary structure. It should
make intuitive sense that G:C base pairs are stronger than A:T base pairs
because they have more H-bonds. Organisms that live in warm
environments tend to be G:C rich, which gives greater DNA and RNA stability.
This is another basis for prediction of RNA structure: to a first
approximation, the most stable structure will have the most G:C pairs.
Programs try to optimize this parameter.
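To make the "more H-bonds means more stable" argument concrete, here is a rough sketch that computes the G:C fraction of a sequence and counts the hydrogen bonds in a fully paired duplex. The function names are hypothetical, and counting H-bonds is only a crude stand-in for stability, not what real structure prediction programs compute. It assumes the input contains only A, C, G, T or U.

```python
def gc_fraction(seq):
    """Fraction of bases in `seq` that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def duplex_h_bonds(strand):
    """Hydrogen bonds in a fully Watson-Crick paired duplex, given one strand.

    Each G or C contributes a G:C pair (3 bonds); every other base
    is assumed to be A, T or U and contributes a 2-bond pair.
    """
    return sum(3 if base in "GC" else 2 for base in strand.upper())


for s in ("GGCCGC", "ATATAT"):
    print(s, gc_fraction(s), duplex_h_bonds(s))
```

Two duplexes of the same length can differ quite a bit by this count: the all-G:C hexamer has 18 hydrogen bonds and the all-A:T hexamer has 12.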
DNA structure
DNA is double-helical (Watson and Crick). It has higher-order
structure in chromosomes, but base pairing leads to the double helix. Like
David said, it is fairly regular and people do not worry too much about
predicting the 3D structure of DNA. There certainly are some contexts in
which it is important, like DNA repair.
(overhead of DNA structure)
RNA structure
RNA has helices and other kinds of structure. You still have base-pairing,
but it is intramolecular (within the molecule).
RNA primary structure is the nucleotide sequence.
RNA secondary structure is the pattern of intramolecular base pairing.
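One simple way to hold a secondary structure in a program is as a list of position pairs (i, j) saying which bases of the single RNA chain pair with each other. The sketch below is an illustration only: the representation, the 0-based positions, the name non_watson_crick_pairs, and the toy hairpin sequence are all assumptions of this example, not something from the overheads.

```python
# Allowed Watson-Crick pairs in RNA (U in place of T).
RNA_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def non_watson_crick_pairs(rna, pairs):
    """Return the (i, j) pairs in `pairs` that are NOT Watson-Crick.

    `rna` is the primary structure (the sequence); `pairs` is the
    proposed secondary structure as 0-based position pairs.
    """
    rna = rna.upper()
    return [(i, j) for i, j in pairs if (rna[i], rna[j]) not in RNA_PAIRS]


# A made-up hairpin: three G:C pairs closing a four-base loop.
hairpin = "GGGAAAUCCC"
stem = [(0, 9), (1, 8), (2, 7)]
print(non_watson_crick_pairs(hairpin, stem))            # every pair is fine
print(non_watson_crick_pairs(hairpin, stem + [(4, 5)]))  # A with A is flagged
```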
(overhead of tRNA structure)
Apart from tRNA, we still don't know that much about what RNA looks
like because it is technically difficult to determine its three-dimensional
structure (especially compared to proteins). David said last week that there
are probably 1000 different proteins in the PDB. Well, there are probably
about 10 different RNAs. Two major methods of solving RNA 3D
structure are x-ray crystallography and NMR (nuclear magnetic resonance
spectroscopy). With the former, RNA is hard to crystallize. With the latter,
there is a size limitation and you can't solve the structure of very large
RNAs. There is a lot of work in this area here at UCSC, primarily in the
labs of Chuck Wilson, Jody Puglisi, and Harry Noller.
(display recent Science issue and mention that Jennifer Doudna
will be here to talk about the work in November)
Nucleic acid structure summary
What do you need to know about nucleic acid structure:
* the terms "purine" and "pyrimidine" (the bases; recall the analogy to ammonia)
* what base-pairs with what
* how many hydrogen bonds each type of pair makes
We will talk more about how information is encoded in DNA and about
gene structure.
Structure of Proteins
Proteins are much more diverse than DNA and RNA in terms of both
structure and function (although I probably should not be saying that
on the campus of the Markey RNA center). I think they are the most
interesting because they are so diverse.
Two major classes of proteins:
* membrane proteins (can stick out either side of the membrane or just be
tethered to it): serve as receptors, channels
(draw a picture of various kinds of membrane proteins and give an example
of what they might do)
* globular or soluble proteins: most enzymes fall into this class. We
know the most about their structures because they are easy to get into a
form suitable for structure determination (crystallized, or at high
concentration in an NMR tube)
Proteins are made of amino acids. We will get to the structure of amino
acids shortly, but first I want to say a little bit more about protein
structure, because you need some appreciation of it before the structures
of the amino acids will mean much.
Organization of protein structure
Primary structure of proteins (what you find in the database). Proteins
are built of 20 amino acid building blocks (a lot more than DNA and
RNA!). More about these shortly.
(overhead of primary structure)
Secondary structure of proteins
(overheads of helices and of sheets)
Secondary structure is stabilized by hydrogen bonds.
It is not currently possible to reliably predict the secondary structure of
proteins from their amino acid sequences. Many methods have been developed,
but none is accurate. This probably has to do with the fact that higher order
structure influences secondary structure.
Tertiary structure of proteins
Tertiary structure is the overall assembly of pieces of
secondary structure. While it is known that sequence codes for structure,
how it does so is still a big mystery that maybe one of you will solve. The
tertiary structure of a protein is held together by many forces:
hydrogen bonds
van der Waals attractions
electrostatic (opposite charge) attractions
sequestration of hydrophobic (water-fearing) amino acids on the
inside of the protein
(overhead of sequence & structure of flavodoxin)
In tertiary structure, amino acids that are distant from one another along the
polypeptide chain come together and can be close in space. This is
important for determining the overall fold of a protein as well as defining
binding pockets for small molecules, other proteins, hormones, etc.
(overhead of thyroid hormone receptor and binding pocket)
Quaternary structure of proteins
Quaternary structure is the assembly of individual molecules into dimers,
trimers, etc. It can be homomeric (same subunits) or heteromeric (different
subunits). For example, hemoglobin is a tetramer of two alpha and two beta
chains.
(overhead summarizing protein structure)
Amino acid structure
Now that we know something about the overall organization of proteins,
let's take a closer look at the building blocks. Like I said, there are 20
different amino acids, all with different properties. I do think it is
important for you to know the properties and names of the amino acids and
here is
why. Let's say you obtained an alignment between two sequences that
looked like:
DAVKAGIKALQEASGFIR
EAKASTMAGLHSAAPFVR
Would you be able to say, qualitatively, whether this alignment makes
sense? If this alignment was obtained as a result of a program you wrote,
apart from the statistical information you might get, can you just say, "yeah,
that's OK" or perhaps, "wait, that's preposterous - there must be a bug
somewhere!" At this point, you might even be wondering what it is that is up
on the board anyway. In the databases, protein sequences are stored as
one-letter codes.
(overhead of amino acid structures)
The amino acids are grouped roughly into nonpolar, polar and charged (also
polar). Polar groups are those with N or O in them. These are involved in
hydrogen bonding and interactions with water. Nonpolar ones are CH2's
and CH3's. They like each other better than they like water and will be
found on the interior of a protein (or in a membrane-spanning region).
Groupings are varied and to an extent, arbitrary. One could come up with
many ways to organize the amino acids. There is also overlap between
groups. For example, lysine is a positively-charged amino acid. But, it has
such a long chain leading up to that charge, that that portion is nonpolar or
hydrophobic.
small     weird     aromatic  nonpolar  polar
G, Gly    P, Pro    Y, tYr    V, Val    S, Ser
A, Ala    G, Gly    W, Trp    A, Ala    T, Thr
V, Val    C, Cys    F, Phe    F, Phe    N, asN
S, Ser              H, His    L, Leu    Q, Gln
D, Asp                        I, Ile    D, Asp
N, asN                        M, Met    E, Glu
C, Cys                        W, Trp    Y, tYr
                              Y, tYr    K, Lys
                              P, Pro    R, aRg
                              G, Gly    H, His
You should have some sense of the properties of all amino acids and I will
expect you to know the one letter codes as well.
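If you want something to check yourself against while learning the one-letter codes, they fit in a small lookup table. The sketch below lists the standard codes for the 20 amino acids; the names ONE_TO_THREE and to_three_letter are made up for this example.

```python
# Standard one-letter -> three-letter codes for the 20 amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(protein):
    """Spell out a one-letter protein sequence in three-letter codes.

    Raises KeyError on any letter that is not one of the 20 standard codes.
    """
    return "-".join(ONE_TO_THREE[aa] for aa in protein.upper())


print(to_three_letter("DAVKAG"))
```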
Back to the Alignment:
DAVKAGIKALQEASGFIR
EAKASTMAGLHSAAPFVR
Is it OK? D/E is conservative: both are negatively charged. A/A is identical.
V/K? etc. You will have the opportunity to compare alignments on your
own very shortly when you do the homework for this week.
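The column-by-column reasoning can also be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the six groups are just one of the many possible groupings (each amino acid is forced into exactly one group, which ignores the overlaps discussed above), and the names GROUPS, GROUP_OF and annotate are hypothetical. This is an eyeball aid, not a scoring method.

```python
# One possible grouping, chosen for this sketch only.
GROUPS = {
    "negative": "DE",
    "positive": "KRH",
    "polar": "STNQ",
    "aromatic": "FWY",
    "aliphatic": "AVLIM",
    "weird": "GPC",
}
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def annotate(seq1, seq2):
    """Mark each aligned column: '|' identical, ':' same group, '.' otherwise."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    marks = []
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            marks.append("|")
        elif GROUP_OF.get(a) is not None and GROUP_OF.get(a) == GROUP_OF.get(b):
            marks.append(":")
        else:
            marks.append(".")
    return "".join(marks)


top = "DAVKAGIKALQEASGFIR"
bottom = "EAKASTMAGLHSAAPFVR"
print(top)
print(annotate(top, bottom))
print(bottom)
```

With this particular grouping, the 18 columns come out as 5 identical, 4 conservative and 9 neither. A different grouping would move some columns between the last two categories, which is exactly why you need your own feel for the amino acids.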