CMP243 Lecture 1, Lydia Gregoret

CMP243 Lecture 1, Sept. 30, 1996: Spiel about why you should learn the background

Our ultimate goal is for you guys to be well-versed not only in sequence analysis and the theory behind it, but also to have some inkling of what the sequences you are looking at actually represent. Especially if you are coming from a computer science background, you will want to be able to converse with biologists, understand what kinds of problems biologists face, and ultimately make intelligent suggestions about what experiments to try. On the one hand, it's sort of cute if the computer geek can't remember that the different letters of the alphabet represent amino acids, but in the end it undermines your credibility. I hope you will take away from the next two lectures some of the reasons biologists and biochemists are interested in studying macromolecules in the first place. For those of you with a bio background, some of today's lecture will be review, but I hope it will put you in a frame of mind to look at macromolecules from the sequence perspective instead of the functional "blobology" you may be more accustomed to.

Introduction to biological macromolecules

In this course, you will be dealing with a lot of sequence data. There are 4 major classes of macromolecules in cells: carbohydrates, lipids, proteins, and nucleic acids.

* Lipids (fat-based): biological membranes, energy storage
* Carbohydrates (sugar-based): energy storage, structural, cell-cell communication
* Nucleic acids:
  DNA: the genetic material
  RNA: intermediary between DNA and proteins, enzymatic function
* Proteins: structural, enzymes, hormones, immune system, cell-cell communication, signalling - the WORKHORSES of the CELL

In this course, we will focus on nucleic acids (DNA, RNA) and proteins, because these are the macromolecules with sequence information that gets deposited in databases.

Structure of Nucleic Acids

As you probably know, there are 4 building blocks for DNA and 4 for RNA; these building blocks are called nucleotides. The two sets are almost the same but a little different. (overhead of nucleotide structures)

RNA has a ribose sugar, therefore RiboNucleic Acid. DNA is missing an OH group from the ribose sugar, therefore DeoxyriboNucleic Acid. Within a group (ribo or deoxyribo) the sugar and phosphate parts are the same, but the bases differ in the four kinds of nucleotides. The word "base" is the opposite of "acid". The base parts have lots of nitrogens and are derivatives of ammonia (NH3), a basic (alkaline) substance.

Purines and pyrimidines: these terms refer to the chemical group that a base belongs to. Purines (A and G) have two rings and pyrimidines (C, T, and U) have one. In RNA or DNA, a purine always base pairs with a pyrimidine and vice versa. More specifically, A pairs with T (or U in RNA) and C pairs with G. Why is this so? Because A and T (U) make two hydrogen bonds and G and C make 3. (overhead of H-bonding in nucleotides) These are the so-called Watson-Crick base pairs. Other, non-Watson-Crick pairs are also possible (other combinations, other orientations).

What is a hydrogen bond anyway? Think of magnets: breaking a hydrogen bond is like pulling apart two magnets that attract each other, whereas breaking a covalent bond (e.g. C-C) is like breaking a magnet in half.
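The pairing rules above are simple enough to write down as a lookup table. The following is a minimal Python sketch, not part of the original lecture: the function names (`complement`, `reverse_complement`, `duplex_hbonds`) are made up for illustration, and it only handles the four standard bases of each nucleic acid. The reverse complement is included because the two strands of a double helix run in opposite directions.

```python
# Watson-Crick partners and hydrogen-bond counts as stated in the lecture.
# All names here are illustrative assumptions, not a standard library API.
DNA_PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PARTNER = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Hydrogen bonds each base makes with its Watson-Crick partner:
# A:T and A:U pairs make 2, G:C pairs make 3.
HBONDS = {"A": 2, "T": 2, "U": 2, "G": 3, "C": 3}


def complement(seq, rna=False):
    """Return the base-by-base Watson-Crick partner of seq."""
    partner = RNA_PARTNER if rna else DNA_PARTNER
    return "".join(partner[base] for base in seq.upper())


def reverse_complement(seq, rna=False):
    """Return the partner strand read in its own direction (strands are antiparallel)."""
    return complement(seq, rna)[::-1]


def duplex_hbonds(seq):
    """Count the hydrogen bonds formed when seq is fully paired with its partner strand."""
    return sum(HBONDS[base] for base in seq.upper())


if __name__ == "__main__":
    strand = "GATTACA"
    print(complement(strand))          # partner bases, position by position
    print(reverse_complement(strand))  # partner strand read in its own direction
    print(duplex_hbonds(strand))       # total hydrogen bonds in the duplex
```

A base that is not in the table (for example N for "unknown") raises a KeyError, which is probably what you want in a first version.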

Why is it important?

It is important to remember who base pairs with whom, because it can be used as the basis for prediction. For example, observation of covariation in RNA molecules (when one base changes, its partner changes too so that they can still pair) implies conservation of secondary structure. It should make intuitive sense that G:C base pairs are stronger than A:T base pairs because they have more H-bonds. Organisms that live in warm environments tend to be G:C rich, which gives greater DNA and RNA stability. This is another basis for prediction of RNA structure: roughly speaking, the most stable structure will have the most G:C pairs. Programs try to optimize this parameter.
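To give a feel for what "optimizing this parameter" could look like, here is a toy Python sketch. It is an assumption-laden illustration, not the method used by any particular program: it finds the largest total number of hydrogen bonds obtainable from nested Watson-Crick pairs in a single RNA strand, using a simple dynamic program. The name `max_hbonds` and the `min_loop` parameter (the minimum number of unpaired bases in a hairpin turn) are invented here, and real structure-prediction programs use much more detailed energy models.

```python
from functools import lru_cache

# Hydrogen bonds per Watson-Crick pair, as given in the lecture.
PAIR_HBONDS = {("G", "C"): 3, ("C", "G"): 3, ("A", "U"): 2, ("U", "A"): 2}


def max_hbonds(seq, min_loop=3):
    """Toy score: the most hydrogen bonds obtainable from nested base pairs.

    min_loop is an assumed minimum number of unpaired bases between
    the two partners of any pair (a hairpin cannot turn too sharply).
    """
    seq = seq.upper()

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best score for the stretch seq[i..j], inclusive.
        if j - i <= min_loop:
            return 0  # too short to contain any pair
        # Case 1: base j is left unpaired.
        score = best(i, j - 1)
        # Case 2: base j pairs with some earlier base k.
        for k in range(i, j - min_loop):
            bonds = PAIR_HBONDS.get((seq[k], seq[j]), 0)
            if bonds:
                score = max(score, best(i, k - 1) + bonds + best(k + 1, j - 1))
        return score

    return best(0, len(seq) - 1)


if __name__ == "__main__":
    # A small hairpin: three G:C pairs closing a loop of four A's.
    print(max_hbonds("GGGAAAACCC"))
```

Because the score counts hydrogen bonds, a G:C pair is worth more than an A:U pair, so the sketch prefers G:C-rich helices in the way the paragraph above describes. It uses recursion, so it is only suitable for short sequences.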

DNA structure

DNA is double-helical (Watson and Crick). It has higher-order structure in chromosomes, but base pairing leads to the double helix. Like David said, it is fairly regular and people do not worry too much about predicting the 3D structure of DNA. There certainly are some contexts in which it is important, like DNA repair. (overhead of DNA structure)

RNA structure

RNA has helices and other kinds of structure. You still have base-pairing, but it is intramolecular (within the molecule). RNA primary structure is the nucleotide sequence. RNA secondary structure is the intramolecular base pairing. (overhead of tRNA structure)

Apart from tRNA, we still don't know that much about what RNA looks like because it is technically difficult to determine its three-dimensional structure (especially compared to proteins). David said last week that there are probably 1000 different proteins in the PDB. Well, there are probably about 10 different RNAs. Two major methods of solving RNA 3D structure are x-ray crystallography and NMR (nuclear magnetic resonance spectroscopy). With the former, the problem is that RNA is hard to crystallize. With the latter, there is a size limitation and you can't solve the structure of very large RNAs. There is a lot of work in this area here at UCSC, primarily in the labs of Chuck Wilson, Jody Puglisi, and Harry Noller. (display recent Science issue and mention that Jennifer Doudna will be here to talk about the work in November)

Nucleic acid structure summary

What do you need to know about nucleic acid structure:

* the terms "purine" and "pyrimidine" (bases. Analogy to ammonia)
* what base-pairs with what
* how many hydrogen bonds each type of pair makes

We will talk more about how information is encoded in DNA and about gene structure.

Structure of Proteins

Proteins are much more diverse than DNA and RNA in terms of both structure and function (although I probably should not be saying that on the campus of the Markey RNA center). I think they are the most interesting because they are so diverse.

Two major classes of proteins:

* membrane proteins (can stick out either end of the membrane or just be tethered to it): serve as receptors, channels. (Draw a picture of various kinds of membrane proteins and give an example of what they might do.)
* globular or soluble proteins: most enzymes fall into this class. We know most about them in terms of structure because they are easy to get in a form for visualization (crystallized or at high concentration in an NMR tube).

Proteins are made of amino acids. We will get to the structure of amino acids shortly, but first I want to say a little bit more about protein structure. We need some appreciation of protein structure before appreciating the structures of amino acids.

Organization of protein structure

Primary structure of proteins (what you find in the database). Proteins are built of 20 amino acid building blocks (a lot more than DNA and RNA!). More about these shortly. (overhead of primary structure)

Secondary structure of proteins

(overheads of helices and of sheets) Secondary structure is stabilized by hydrogen bonds. It is not currently possible to reliably predict the secondary structure of proteins from their amino acid sequences. Many methods have been developed, but none is fully accurate. This probably has to do with the fact that higher order structure influences secondary structure.

Tertiary structure

The tertiary structure of a protein is the overall assembly of its pieces of secondary structure. While it is known that sequence codes for structure, how it does so is still a big mystery that maybe one of you will solve. The tertiary structure of a protein is held together by many forces:

* hydrogen bonds
* van der Waals attractions
* electrostatic (opposite charge) attractions
* sequestration of hydrophobic (water-fearing) amino acids on the inside of the protein

(overhead of sequence & structure of flavodoxin)

In tertiary structure, amino acids that are distant from one another along the polypeptide chain come together and can be close in space. This is important for determining the overall fold of a protein as well as defining binding pockets for small molecules, other proteins, hormones, etc. (overhead of thyroid hormone receptor and binding pocket)

Quaternary structure

Quaternary structure is the assembly of individual molecules into dimers, trimers, etc. It can be homomeric (same subunits) or heteromeric (different subunits). For example, hemoglobin is a tetramer of two alpha and two beta chains. (overhead summarizing protein structure)

Amino acid structure

Now that we know something about the overall organization of proteins, let's take a closer look at the building blocks. Like I said, there are 20 different amino acids, all with different properties. I do think it is important for you to know the properties and names of the amino acids and here is why. Let's say you obtained an alignment between two sequences that looked like:

DAVKAGIKALQEASGFIR
EAKASTMAGLHSAAPFVR

Would you be able to say, qualitatively, whether this alignment makes sense? If this alignment was obtained as a result of a program you wrote, apart from the statistical information you might get, can you just say, "yeah, that's OK" or perhaps, "wait, that's preposterous - there must be a bug somewhere!" At this point, you might even be wondering what it is up on the board anyway. In the databases, protein sequences are stored as one letter codes. (overhead of amino acid structures)

The amino acids are grouped roughly into nonpolar, polar, and charged (also polar). Polar groups are those with N or O in them. These are involved in hydrogen bonding and interactions with water. Nonpolar ones are CH2's and CH3's. They like each other better than they like water and will be found on the interior of a protein (or in a membrane-spanning region). Groupings are varied and, to an extent, arbitrary. One could come up with many ways to organize the amino acids. There is also overlap between groups. For example, lysine is a positively-charged amino acid, but it has such a long chain leading up to that charge that that portion is nonpolar or hydrophobic.

small     weird     aromatic  nonpolar  polar

G, Gly    P, Pro    Y, tYr    V, Val    S, Ser
A, Ala    G, Gly    W, trp    A, Ala    T, Thr
V, Val    C, Cys    F, phe    F, phe    N, Asn
S, Ser              H, his    L, leu    Q, gln
D, asp                        I, Ile    D, asp
N, asN                        M, Met    E, glu
C, Cys                        W, trp    Y, tyr
                              Y, tYr    K, lys
                              P, Pro    R, aRg
                              G, Gly    H, His

You should have some sense of the properties of all amino acids and I will expect you to know the one letter codes as well.

Back to the alignment:

DAVKAGIKALQEASGFIR
EAKASTMAGLHSAAPFVR

Is it OK? D/E is conservative: both are negatively charged. A/A is identical. V/K? etc. You will have the opportunity to compare alignments on your own very shortly when you do the homework for this week.
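The column-by-column check of the alignment can be sketched in Python. This is only a rough illustration under stated assumptions: the group memberships are one reading of the table in these notes, the "negative" and "positive" charge groups are added here for illustration, and calling a substitution "conservative" whenever the two residues share at least one group is a crude rule invented for this sketch, not a real scoring scheme. The function names are hypothetical.

```python
from collections import Counter

# Group memberships by one-letter code. The first five follow the lecture's
# table; "negative" and "positive" are assumptions added for illustration.
GROUPS = {
    "small": set("GAVSDNC"),
    "weird": set("PGC"),
    "aromatic": set("YWFH"),
    "nonpolar": set("VAFLIMWYPG"),
    "polar": set("STNQDEYKRH"),
    "negative": set("DE"),
    "positive": set("KR"),
}


def shared_groups(a, b):
    """Return the names of the groups that contain both residues, sorted."""
    return sorted(name for name, members in GROUPS.items()
                  if a in members and b in members)


def classify_column(a, b):
    """Label one alignment column (crude rule: any shared group is 'conservative')."""
    if a == b:
        return "identical"
    return "conservative" if shared_groups(a, b) else "non-conservative"


def summarize(top, bottom):
    """Count how many columns of an ungapped alignment fall under each label."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return Counter(classify_column(a, b) for a, b in zip(top, bottom))


if __name__ == "__main__":
    top = "DAVKAGIKALQEASGFIR"
    bottom = "EAKASTMAGLHSAAPFVR"
    for a, b in zip(top, bottom):
        print(a, b, classify_column(a, b), shared_groups(a, b))
    print(summarize(top, bottom))
```

Under this rule the 18 columns come out as 5 identical, 9 conservative, and 4 non-conservative (V/K is one of the four). Keep in mind the caveat about overlapping groups: the rule knows nothing about, say, the long hydrophobic chain of lysine, so treat its labels as a starting point for your own judgment.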