CHAPTER 2 - PROTEIN STRUCTURE
G: PREDICTING PROTEIN PROPERTIES
- BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND PROTEOMICS
BIOCHEMISTRY - DR. JAKUBOWSKI
Last Update:
02/28/13
|
Learning Goals/Objectives for Chapter 2G:
After class and this reading, students will be able to:
- find web-based proteomics programs to analyze protein sequences
and structures
- describe the basis for methods used to predict the secondary
structure and hydrophobic structures of proteins
- analyze secondary structure and hydropathy plots from web-based
proteomics programs.
- describe differences between integral and peripheral membrane
proteins, and how each could be purified.
- explain how hydropathy and secondary structure plots can be used
to predict membrane spanning sequences of proteins
- describe in general the theoretical and empirically based
methods to predict protein tertiary structure from a primary
sequence
- describe possible early intermediates in protein folding as
determined by theoretical methods
|
Bioinformatics, Computational Biology and
Proteomics
With the sequencing of the human genome, intensive effort has
been devoted to its analysis to determine the number and
transcriptional regulation of the encoded genes. Much has been learned
from comparative genomics, as genomes from mice, rats, chimpanzees, and a
variety of prokaryotes are compared in an effort to help understand the nature
of genes and their transcriptional regulation. The vast amount of genomic
data that has to be "mined" has required the development of computational methods and
computer programs to enable the analysis. Two relatively new fields have
subsequently arisen: bioinformatics and computational biology. (On a
personal note, the words computational biology seem somewhat restrictive since
the field of computational chemistry, which has a longer history, has
significant overlap with "computational biology". I prefer computational
biochemistry.) These fields have significant overlap (as do physical
chemistry/chemical physics and biochemistry/molecular biology/chemical biology),
so I defer to others to define them.
The NIH Biomedical Information Science and Technology
Initiative Consortium: "This consortium has agreed on the following
definitions of bioinformatics and computational biology, recognizing that no
definition could completely eliminate overlap with other activities or preclude
variations in interpretation by different individuals and organizations.
Bioinformatics: Research, development, or
application of computational tools and approaches for expanding the use of
biological, medical, behavioral or health data, including those to acquire,
store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and
application of data-analytical and theoretical methods, mathematical modeling
and computational simulation techniques to the study of biological, behavioral,
and social systems."
The National Center for
Biotechnology Information (NCBI 2001) offers this definition of
bioinformatics:
bioinformatics: "Bioinformatics is the field of science in
which biology, computer science, and information technology merge into a
single discipline. There are three important sub-disciplines within
bioinformatics: the development of new algorithms and statistics with which
to assess relationships among members of large data sets; the analysis and
interpretation of various types of data including nucleotide and amino acid
sequences, protein domains, and protein structures; and the development and
implementation of tools that enable efficient access and management of
different types of information."
What comes after the sequencing of the genome? The
transcriptome, the complete set of transcribed RNA sequences and their
biological functions, and the immensely complex proteome, the
complete set of translated proteins, are obvious candidates.
Here are some definitions of proteomics:
Proteomics: "the qualitative and quantitative comparison of
proteomes (PROTEin complement to a genOME) under different conditions to
further unravel biological processes" from the
ExPASy (Expert Protein Analysis System) server of the Swiss Institute of Bioinformatics
Proteomics (Pasteur Institute): "Proteomics aims at quantifying the
expression levels of the complete protein complement (the proteome) in a
cell at any given time. While proteomics research was initially focused on
two-dimensional gel electrophoresis for protein separation and
identification, proteomics now refers to any procedure that characterizes
the function of large sets of proteins. It is thus often used as a synonym
for functional genomics."
Richard Burgess, UW Madison, includes the following activities in proteomics,
which will revolutionize our understanding of normal and disease processes in
cells (C&E News, July 31, 2000, pg 31):
- High-throughput expression and purification of
proteins
- Protein profiling, using 2D gel electrophoresis and
mass spectrometry to study proteins expressed in a cell
- Protein-protein interaction studies to see which
proteins function together using the yeast two hybrid method
- Pathway analysis to understand signal
transduction and
other complex cell processes
- Large scale protein folding and 3D structure studies
- Bioinformatics analysis of proteomic data
This web book has been developed as a first semester
biochemistry text. I have made a conscious choice to limit the scope
of the material to exclude content covered in detail in a molecular
biology/genetics class. Hence, this text will not discuss in
significant detail the genome and transcriptome, and mechanisms of
replication, transcription, or translation. However, with its
emphasis on protein structure and function, proteomics is a logical
candidate for inclusion.
In the last several years, computational
biology/chemistry and web-based programs have become available for the
systematic analysis of individual proteins, and for the comparative analysis
of many proteins, based on either their DNA or amino acid sequence.
Clearly the ultimate goal in the description of a protein would be to
determine, from the amino acid or nucleotide sequence, the three dimensional
structure of a protein and its biological function, including all its
binding partners. Here is a list of typical properties of a
protein that can be determined by input of an appropriate sequence (for a
protein with known or unknown 3D structure) into web-based programs:
- protein sequence from a DNA sequence, and the reverse
- isoelectric point
- Ramachandran plot
- glycosylation/phosphorylation sites
- secondary structure prediction
- hydrophobicity prediction
- 3D structure based on structures of homologous proteins (homology modeling)
- determination of evolutionary relationships among organisms.
Here is a list of proteome web resources and tutorials
Computational biochemistry programs (such as Insight II,
MOE, SwissPdb Viewer, VMD, NAMD, Autodock) are available to calculate
surface electrostatic potentials, minimize energy, dock ligand molecules,
and perform molecular dynamics simulations.
The ExPASy (Expert Protein Analysis System) proteomics
server of the Swiss Institute of Bioinformatics offers an incredible array of
tools to study protein structure and function. These include:
Voluminous databases of biomolecule sequence and
structural data, as well as analysis software packages, are available at a
variety of web sites, including:
- BioGrid: General Repository for Interaction (protein, NA) Datasets
- GenBank: DNA sequence database (over 100 billion bases as of 9/05), from the
National Center for Biotechnology Information - NCBI
- Swiss-Prot: protein sequence database with annotation (description of the function
of a protein, its domain structure, post-translational modifications,
variants, etc.), from the Swiss Institute of Bioinformatics
- ProSite: database of protein families and domains. It consists of biologically
significant sites, patterns and profiles that help to reliably identify
to which known protein family (if any) a new sequence belongs. From the Swiss Institute of
Bioinformatics
- Swiss-2D Gel Database: from the Swiss Institute of Bioinformatics
- RCSB Protein Data Bank: Protein and nucleic acid 3D structures from x-ray
crystallography and NMR spectroscopy (about 33,000 as of 9/15/05)
- SWISS-MODEL Repository: 3D comparative protein structure models (675,000)
generated by the fully automated homology-modeling pipeline SWISS-MODEL
(again from the Swiss Institute of Bioinformatics)
The NCBI has an extensive array of available tools
(free), including:
The table below (directly taken from
Wikipedia) shows some of the incredible information
available about the proteome and genome of each human chromosome.
Table: Human proteome and genome from Wikipedia
(Data source: Ensembl genome browser release 68, July 2012)
| Chromosome | Length (mm) | BP | Variations | Confirmed Proteins | Putative Proteins | Pseudogenes | miRNA | rRNA | snRNA | snoRNA | misc ncRNA | Links |
| 1 | 85 | 249,250,621 | 4,401,091 | 2,012 | 31 | 1,130 | 134 | 66 | 221 | 145 | 106 | EBI |
| 2 | 83 | 243,199,373 | 4,607,702 | 1,203 | 50 | 948 | 115 | 40 | 161 | 117 | 93 | EBI |
| 3 | 67 | 198,022,430 | 3,894,345 | 1,040 | 25 | 719 | 99 | 29 | 138 | 87 | 77 | EBI |
| 4 | 65 | 191,154,276 | 3,673,892 | 718 | 39 | 698 | 92 | 24 | 120 | 56 | 71 | EBI |
| 5 | 62 | 180,915,260 | 3,436,667 | 849 | 24 | 676 | 83 | 25 | 106 | 61 | 68 | EBI |
| 6 | 58 | 171,115,067 | 3,360,890 | 1,002 | 39 | 731 | 81 | 26 | 111 | 73 | 67 | EBI |
| 7 | 54 | 159,138,663 | 3,045,992 | 866 | 34 | 803 | 90 | 24 | 90 | 76 | 70 | EBI |
| 8 | 50 | 146,364,022 | 2,890,692 | 659 | 39 | 568 | 80 | 28 | 86 | 52 | 42 | EBI |
| 9 | 48 | 141,213,431 | 2,581,827 | 785 | 15 | 714 | 69 | 19 | 66 | 51 | 55 | EBI |
| 10 | 46 | 135,534,747 | 2,609,802 | 745 | 18 | 500 | 64 | 32 | 87 | 56 | 56 | EBI |
| 11 | 46 | 135,006,516 | 2,607,254 | 1,258 | 48 | 775 | 63 | 24 | 74 | 76 | 53 | EBI |
| 12 | 45 | 133,851,895 | 2,482,194 | 1,003 | 47 | 582 | 72 | 27 | 106 | 62 | 69 | EBI |
| 13 | 39 | 115,169,878 | 1,814,242 | 318 | 8 | 323 | 42 | 16 | 45 | 34 | 36 | EBI |
| 14 | 36 | 107,349,540 | 1,712,799 | 601 | 50 | 472 | 92 | 10 | 65 | 97 | 46 | EBI |
| 15 | 35 | 102,531,392 | 1,577,346 | 562 | 43 | 473 | 78 | 13 | 63 | 136 | 39 | EBI |
| 16 | 31 | 90,354,753 | 1,747,136 | 805 | 65 | 429 | 52 | 32 | 53 | 58 | 34 | EBI |
| 17 | 28 | 81,195,210 | 1,491,841 | 1,158 | 44 | 300 | 61 | 15 | 80 | 71 | 46 | EBI |
| 18 | 27 | 78,077,248 | 1,448,602 | 268 | 20 | 59 | 32 | 13 | 51 | 36 | 25 | EBI |
| 19 | 20 | 59,128,983 | 1,171,356 | 1,399 | 26 | 181 | 110 | 13 | 29 | 31 | 15 | EBI |
| 20 | 21 | 63,025,520 | 1,206,753 | 533 | 13 | 213 | 57 | 15 | 46 | 37 | 34 | EBI |
| 21 | 16 | 48,129,895 | 787,784 | 225 | 8 | 150 | 16 | 5 | 21 | 19 | 8 | EBI |
| 22 | 17 | 51,304,566 | 745,778 | 431 | 21 | 308 | 31 | 5 | 23 | 23 | 23 | EBI |
| X | 53 | 155,270,560 | 2,174,952 | 815 | 23 | 780 | 128 | 22 | 85 | 64 | 52 | EBI |
| Y | 20 | 59,373,566 | 286,812 | 45 | 8 | 327 | 15 | 7 | 17 | 3 | 2 | EBI |
| mtDNA | 0.0054 | 16,569 | 929 | 13 | 0 | 0 | 0 | 2 | 0 | 0 | 22 | EBI |
This chapter will focus on predictions of secondary and
tertiary structures of proteins based on computational biochemistry and
bioinformatics. Specific exercises (for those enrolled in the class)
using web-based bioinformatics programs will be found in Laboratory and Problem Sets.
Secondary Structure
As we have seen previously, amino acids vary in their propensity to
be found in alpha helices, beta strands, or reverse turns (beta bends, beta
turns).
These differences can be rationalized from the structure
of each amino acid, as described before.
Figure: Amino Acid Structure and propensity for secondary
structure

From the databases, propensities can be calculated to determine the likelihood that
a given amino acid will be in one of those structures. Glycine, for example, would have a high
propensity to be in reverse turns, while Pro, a helix breaker, would have a low propensity to be
in an alpha helix. A number is assigned to each amino acid for
each category of secondary structure. High numbers favor the likelihood that the amino
acid would be in that structure. One of the earliest propensity scales was
from Chou-Fasman, where H indicates a strong former of that secondary structure, h
a former, I a weak former, i an indifferent residue, b a breaker, and
B a strong breaker of that secondary structure.
Chou-Fasman Amino Acid Propensities
| A.A. | Helix Designation | Helix P | Sheet Designation | Sheet P |
| Ala | H | 1.42 | i | 0.83 |
| Cys | i | 0.70 | h | 1.19 |
| Asp | I | 1.01 | B | 0.54 |
| Glu | H | 1.51 | B | 0.37 |
| Phe | h | 1.13 | h | 1.38 |
| Gly | B | 0.57 | b | 0.75 |
| His | I | 1.00 | h | 0.87 |
| Ile | h | 1.08 | H | 1.60 |
| Lys | h | 1.16 | b | 0.74 |
| Leu | H | 1.21 | h | 1.30 |
| Met | H | 1.45 | h | 1.05 |
| Asn | b | 0.67 | b | 0.89 |
| Pro | B | 0.57 | B | 0.55 |
| Gln | h | 1.11 | h | 1.10 |
| Arg | i | 0.98 | i | 0.93 |
| Ser | i | 0.77 | b | 0.75 |
| Thr | i | 0.83 | h | 1.19 |
| Val | h | 1.06 | H | 1.70 |
| Trp | h | 1.08 | h | 1.37 |
| Tyr | b | 0.69 | H | 1.47 |
Next a stretch or "window" of about 7 amino acids is taken, starting
from the N-terminus of the protein. First the average alpha helical propensity for amino
acids 1-7 is determined and assigned, let's say, to the middle (4th) amino acid in that
sequence. Then the alpha helical propensities for amino acids 2-8 (the next
window) are averaged and assigned to
the middle (5th) amino acid in that range. The window slides down the protein
sequence until all but the first and last
few amino acids have an average value assigned to them. If a contiguous stretch of amino
acids has a high average propensity, those residues are probably in an alpha helix in the native
protein. This process is repeated using beta strand and reverse turn propensities, and the
final assignments of most probable secondary structure are made. Of course this system was
tested against proteins whose tertiary structure was known. See the results for secondary structure prediction for one protein. In this
example, the average propensity for four contiguous amino acids is calculated
(starting with amino acids 1-4, then amino acids 5-8, etc., and continuing to
the end of the polypeptide). Next this process is repeated for contiguous
stretches 2-5, 6-9, etc., and continuing to the end. The original
Chou-Fasman
propensities have been updated using known protein structures to give better
predictions.
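The sliding-window averaging described above can be sketched in a few lines of Python. This is only a minimal illustration, not the full Chou-Fasman procedure: the helix propensities are copied from the table above, while the function names, the demo sequence, and the cutoff of 1.0 are assumptions made for this example.

```python
# Helix propensities (P) copied from the Chou-Fasman table in this chapter.
HELIX_P = {
    "A": 1.42, "C": 0.70, "D": 1.01, "E": 1.51, "F": 1.13,
    "G": 0.57, "H": 1.00, "I": 1.08, "K": 1.16, "L": 1.21,
    "M": 1.45, "N": 0.67, "P": 0.57, "Q": 1.11, "R": 0.98,
    "S": 0.77, "T": 0.83, "V": 1.06, "W": 1.08, "Y": 0.69,
}


def window_average(sequence, scale, window=7):
    """Return (center_position, average) pairs, one per full window.

    Positions are 1-based and the average is assigned to the middle residue,
    so the first and last few residues receive no value.
    """
    half = window // 2
    results = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        avg = sum(scale[aa] for aa in segment) / window
        results.append((start + half + 1, avg))
    return results


def likely_helix_positions(sequence, cutoff=1.0, window=7):
    """Positions whose window-averaged helix propensity exceeds the cutoff.

    The cutoff of 1.0 is an illustrative assumption, not a published rule.
    """
    return [pos for pos, avg in window_average(sequence, HELIX_P, window)
            if avg > cutoff]


if __name__ == "__main__":
    demo = "GSPGEELLKKAEELLKKAGSPG"  # made-up sequence, for illustration only
    for pos, avg in window_average(demo, HELIX_P):
        print(pos, demo[pos - 1], round(avg, 2))
```

The same `window_average` function could be reused with a sheet or turn propensity table, and the three profiles compared residue by residue to make a final assignment.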
Amphiphilic Helices
Additional information about putative helices can be
obtained by determining if they are amphiphilic (one side of the helix
containing mostly hydrophobic side chains, with the opposite side containing
polar or charged side chains). A helical wheel projection can be made. In this projection a circle is drawn
representing a downward cross-sectional view of the helix axis.
Figure: Helical wheel projection

The side chains are placed on the outside
of the circle, staggered in a fashion determined by the fact that there are 3.6 amino acids
per turn of the helix. If one side of the wheel contains predominantly nonpolar side
chains while the other side has polar side chains, the helix is amphiphilic. Imagine how
such helices might be packed in a protein.
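The helical wheel placement can be sketched numerically: with 3.6 residues per turn, each residue is rotated 100 degrees from the previous one. The example below is a hedged sketch. The Kyte-Doolittle values are taken from the hydropathy table later in this chapter, and the "hydrophobic moment" (a vector sum of hydropathies around the wheel, a measure introduced by Eisenberg and not discussed in this text) is brought in here only as one possible way to put a number on amphiphilicity. The function names and test sequences are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (see the hydrophobicity table in this chapter).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def wheel_angles(sequence, degrees_per_residue=100.0):
    """Angle (0-360 degrees) of each residue on the wheel: 360 / 3.6 = 100 per residue."""
    return [(i * degrees_per_residue) % 360.0 for i in range(len(sequence))]


def hydrophobic_moment(sequence, scale=KD):
    """Length of the vector sum of hydropathies around the wheel, per residue.

    A value near zero means hydrophobic residues are spread evenly around the
    helix; a large value means they cluster on one face (amphiphilic).
    """
    x = y = 0.0
    for aa, angle in zip(sequence, wheel_angles(sequence)):
        rad = math.radians(angle)
        x += scale[aa] * math.cos(rad)
        y += scale[aa] * math.sin(rad)
    return math.hypot(x, y) / len(sequence)


if __name__ == "__main__":
    for name, seq in [("uniform", "L" * 18), ("alternating faces", "LKKLLKKLLKKLLKKLLK")]:
        print(name, round(hydrophobic_moment(seq), 2))
```

Eighteen residues make exactly five turns, so a uniform sequence of that length gives a moment of essentially zero, while a sequence with Leu on one face and Lys on the other gives a clearly larger value.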
Hydrophobic Structure
In a completely analogous fashion, a hydrophobic propensity or hydropathy can be calculated. In this system, empirical
measures of the hydrophobic nature of the side chains are used to assign a number to a
given amino acid. Many hydropathy scales are used. Several are based on the Δμ° of transfer of the side chains from
water to a nonpolar solvent. Two commonly used scales are the Kyte-Doolittle hydropathy scale
and the Hopp-Woods scale (used more like a hydrophilicity index to predict
surface or water accessible structures that might be recognized by the immune system).
Hydrophobicity Indices for Amino Acids
| Amino Acid | Kyte-Doolittle | Hopp-Woods |
| Alanine | 1.8 | -0.5 |
| Arginine | -4.5 | 3.0 |
| Asparagine | -3.5 | 0.2 |
| Aspartic acid | -3.5 | 3.0 |
| Cysteine | 2.5 | -1.0 |
| Glutamine | -3.5 | 0.2 |
| Glutamic acid | -3.5 | 3.0 |
| Glycine | -0.4 | 0.0 |
| Histidine | -3.2 | -0.5 |
| Isoleucine | 4.5 | -1.8 |
| Leucine | 3.8 | -1.8 |
| Lysine | -3.9 | 3.0 |
| Methionine | 1.9 | -1.3 |
| Phenylalanine | 2.8 | -2.5 |
| Proline | -1.6 | 0.0 |
| Serine | -0.8 | 0.3 |
| Threonine | -0.7 | -0.4 |
| Tryptophan | -0.9 | -3.4 |
| Tyrosine | -1.3 | -2.3 |
| Valine | 4.2 | -1.5 |
For a water-soluble protein, a continuous stretch
of amino acids found to have a high average hydropathy is probably buried in the interior
of the protein. Consider the example of bovine α-chymotrypsinogen, a 245
amino acid protein, whose sequence is shown below in single letter code.
1 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
61 TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
121 VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
181 ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
241 TLAAN
A hydropathy
plot for chymotrypsinogen (sum of hydropathies of seven consecutive
residues) shows many stretches that are presumably buried in the interior of the
protein.
Figure: hydropathy
plot for chymotrypsinogen
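A profile of this kind can be computed directly from the sequence and the Kyte-Doolittle column of the table above. The following is a minimal sketch, assuming a seven-residue window whose sum is assigned to the middle residue. The function name and the choice to print the five highest-scoring windows are illustrative only, and this is not the program that produced the figure.

```python
# Kyte-Doolittle hydropathy values from the table in this chapter.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Bovine chymotrypsinogen sequence as listed in this chapter.
CHYMOTRYPSINOGEN = (
    "CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV"
    "TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA"
    "VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM"
    "ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ"
    "TLAAN"
)


def hydropathy_profile(sequence, window=7):
    """Sum of hydropathies over each window, keyed by the 1-based center residue."""
    half = window // 2
    return [
        (start + half + 1, sum(KD[aa] for aa in sequence[start:start + window]))
        for start in range(len(sequence) - window + 1)
    ]


if __name__ == "__main__":
    profile = hydropathy_profile(CHYMOTRYPSINOGEN)
    # Show the five windows with the highest summed hydropathy.
    for center, total in sorted(profile, key=lambda item: item[1], reverse=True)[:5]:
        segment = CHYMOTRYPSINOGEN[center - 4:center + 3]
        print(center, segment, round(total, 1))
```

Plotting the second element of each pair against the first gives the usual hydropathy plot; peaks mark candidate buried stretches and troughs mark likely surface regions.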

Membrane Proteins
So far we have
discussed predominantly globular proteins that are soluble in water.
Proteins are also found associated with membranes. Two major classes of membrane proteins
are found in nature.
- peripheral membrane proteins: water soluble proteins
bound reversibly and non-covalently to the membrane through electrostatic attractions
between the charged polar head groups of the phospholipids and polar/charged
groups on the protein surface. Because the attraction is electrostatic, these
proteins can often be released from the membrane by addition of high salt.
- integral membrane proteins: actually insert into the
bilayer. These can be released from the membrane and effectively
solubilized by the addition of single chain amphiphiles (detergents), which
form a mixed micelle with the integral membrane protein. Nonionic
detergents (Triton X-100, octylglucoside, etc.) are often used in the
purification of membrane proteins. Ionic detergents (like SDS) not
only solubilize integral membrane proteins, but also denature them.
Figure: Types of membrane proteins

In some of these integral membrane proteins, large extracellular
and intracellular domains of the protein are present, connected by the intramembrane
regions. The intramembrane spanning region often consists of either a single alpha helix,
or 7 different helical regions which zig-zag through the membrane. These
transmembrane sequences can readily be determined through hydropathy
calculations. For example, consider the integral membrane bovine protein
rhodopsin. Its 348 amino acid sequence (in single letter code) is
shown below:
MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIP
EGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAV
YNPVIYIMMNKQFRNCMVTTLCCGKNPLGDDEASTTVSKTETSQVAPA
Rhodopsin hydropathy plot calculations
show that it contains seven transmembrane helices which
wind through the membrane in a serpentine fashion.
Figure: Rhodopsin hydropathy plot

Figure: seven transmembrane helices

Rhodopsin Hydropathy Results
| No. | N terminal | transmembrane region | C terminal | type | length |
| 1 | 40 | LAAYMFLLIMLGFPINFLTLYVT | 62 | PRIMARY | 23 |
| 2 | 71 | PLNYILLNLAVADLFMVFGGFTT | 93 | SECONDARY | 23 |
| 3 | 113 | EGFFATLGGEIALWSLVVLAIER | 135 | SECONDARY | 23 |
| 4 | 156 | GVAFTWVMALACAAPPLVGWSRY | 178 | SECONDARY | 23 |
| 5 | 207 | MFVVHFIIPLIVIFFCYGQLVFT | 229 | PRIMARY | 23 |
| 6 | 261 | FLICWLPYAGVAFYIFTHQGSDF | 283 | PRIMARY | 23 |
| 7 | 300 | VYNPVIYIMMNKQFRNCMVTTLC | 322 | SECONDARY | 23 |
In summary, hydropathy plots are hence useful in finding buried regions in
water soluble proteins, transmembrane helices in integral membrane proteins as
well as short stretches of polar/charged amino acids that might form surface
loops recognizable by immune system antibodies. The window size used in
hydropathy plots would obviously affect the calculated results. Windows of
20 amino acids are useful to determine transmembrane helices while windows of
5-7 amino acids are used to find surface-exposed hydrophilic sites.
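The effect of the longer window can be illustrated with a short sketch that scans rhodopsin for stretches whose mean Kyte-Doolittle hydropathy over 20 residues is high. The cutoff of 1.6 used here is a value often quoted for this scale with windows of about this size, but it should be treated as an assumption, as should the function name and the way overlapping windows are merged. The sketch is not the algorithm that generated the results table above and will not necessarily reproduce its boundaries.

```python
# Kyte-Doolittle hydropathy values from the table in this chapter.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Bovine rhodopsin sequence as listed in this chapter.
RHODOPSIN = (
    "MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLY"
    "VTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLG"
    "GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIP"
    "EGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEAAAQQQES"
    "ATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAV"
    "YNPVIYIMMNKQFRNCMVTTLCCGKNPLGDDEASTTVSKTETSQVAPA"
)


def find_tm_candidates(sequence, window=20, threshold=1.6):
    """Return 1-based (start, end) spans covered by windows with mean hydropathy > threshold.

    Overlapping or adjacent qualifying windows are merged into one span.
    """
    spans = []
    for start in range(len(sequence) - window + 1):
        mean = sum(KD[aa] for aa in sequence[start:start + window]) / window
        if mean > threshold:
            first, last = start + 1, start + window
            if spans and first <= spans[-1][1] + 1:
                spans[-1] = (spans[-1][0], last)  # extend the previous span
            else:
                spans.append((first, last))
    return spans


if __name__ == "__main__":
    for first, last in find_tm_candidates(RHODOPSIN):
        print(first, last, RHODOPSIN[first - 1:last])
```

Lowering the threshold or shrinking the window makes the scan more permissive, which is one way to see for yourself how strongly the window size affects the calculated result.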
Membrane proteins can be solubilized by addition of
single chain amphiphiles (detergents). The nonpolar tails of the
detergents interact with the hydrophobic transmembrane domain of the membrane
protein, forming a "mixed" micelle-like structure. Nonionic detergents like
Triton X-100 and octyl-glucoside are often used to solubilize membrane proteins
in their near native state. In contrast, ionic detergents like sodium
dodecyl sulfate (with a negatively charged head group) denature proteins during
the solubilization process. To study membrane proteins in a more
native-like environment, proteins solubilized by nonionic detergent can be
reconstituted into bilayer liposome structures using methods similar to those
from Lab 1 in which you prepared dye-encapsulated large unilamellar vesicles (LUVs).
However, it can be difficult to study the intra- and extracellular domains of
membrane proteins in liposomes, given that one of those domains is hidden inside
the liposome. A novel technique that removes this barrier was recently
developed by Sligar. He created an amphiphilic protein disc with an
opening in the center. The inner opening is lined with nonpolar residues,
while the outer surface of the disc is polar. When the discs were added to
phospholipids, small bilayers formed inside the disc. Membrane proteins
like the β2-adrenergic
receptor could be reconstituted in the nanodisc bilayers, allowing solvent
exposure of both the intracellular and extracellular domains of the receptor
protein.
Figure: Nanodisc with membrane
protein

Protein Tertiary Structure
We are getting closer to predicting the tertiary structure of a
protein, but as we have seen from molecular mechanics and dynamics calculations, it is a
huge computational task. There are two basic approaches which are often combined.
- calculations using energy minimization and statistical
mechanics: These "semi-empirical" techniques don't assume any
given secondary structure propensities or hydrophobicities. Such methods have
produced limited success with small proteins whose actual structure is known.
- homology modeling based on proteins of known structure:
The structures of about 87,979 (2/13) different
biological macromolecules are known. This can serve
as an empirical database of possible conformations. Instead of an infinite
number of prototypical structures, it is becoming clear that there may be a reasonably low
number (in the hundreds) of basic structural motifs that are used over and over in
nature. By aligning the amino acid sequences of different proteins, and
comparing their properties (such as secondary structure propensities, hydrophobicities,
etc.), probable low energy structures of the new protein can be determined. This
initial structure can be run through multiple minimization and dynamic simulations to
produce a tentative "lowest" energy structure. The structure should be
compact (checked through calculation of packing density) and experimental techniques (such
as spectroscopic methods) should be employed to validate the structure.
Many mechanisms of the actual folding process have been
postulated, most
of which have some experimental support. In one, a hydrophobic collapse of the
protein produces a seed structure upon which secondary structure forms, followed by
the final tertiary structure. Alternatively, initial formation of an alpha helix might serve as
the seed structure. A combination of the two is likely. In one scenario, two
small amphiphilic helices might form which interact through their nonpolar faces to
produce the initial seed structure.
Many studies have been done on a domain of the protein
villin. A distributed computing project at Stanford University (Folding@home)
actually allows you to process protein folding data on your own
computer when you're not using it.
The example below shows one simulation of length greater than 1 μs. In the
simulation, the domain collapses to a near native-like state and then unfolds again as it
iteratively probes conformational space while it "seeks" the global energy minimum.
Zhou and Karplus recently simulated
the folding of residues 10-55 of Staphylococcus aureus protein A which form a 3-helix bundle structure.
Figure: 3-helix bundle

Using molecular dynamics, they
carried out 100 folding simulations. Two types of folding trajectories were
noted.
- In the first type, helices form early
(70% within 10 ns), but the fraction of native interhelical contacts (indicating proper
packing of the helices together) and the overall packing density are not similar to the
native state. Then the helices diffuse and collide with each other (in the rate-limiting
step) until the native state is reached at about 19 μs. In this model,
non-obligatory intermediates can occur (due to collapse to non-native interhelical packing
in the rate-limiting step) which could slow down folding.
Figure: helices form early

- In another type, there is a simultaneous and
quick partial helix formation and collapse (90% at 200 ns), to a state which is
similar to the molten globule. At this point, only about 20% of the native contacts
are present. The final tertiary structure is achieved after a slow process of
forming native contacts within the compact state, which takes about 500 μs.
Figure: simultaneous and
quick partial helix formation and collapse

The Fersht lab has been combining experimental and theoretical approaches to
the folding/unfolding of another three helix bundle protein, Engrailed homeodomain.
Figure: Engrailed homeodomain

This protein is
among the fastest folding and unfolding proteins known (μs
time scale). This time frame is now also amenable to study through molecular
dynamics simulations. Both sets of data support a folding pathway in which
the unfolded state (U) collapses in a microsecond to an intermediate state (I)
characterized by significant native secondary structure and mobile side chains
that is less compact than the native state (N). The I state hence
resembles the molten globule state. To more clearly understand the
unfolded state, they generated a mutant (Leu16Ala) which was only marginally
stable at room temperature (2.5 kcal/mol). Spectroscopic measurements (CD,
NMR) showed this state to resemble the intermediate (I) state, with much native
secondary structure and a 33% greater radius of gyration than the N state.
In effect they could study the transient intermediate of the wild type protein
more easily by making that state more stable through mutagenesis. These
studies showed that the intermediate is on the folding pathway and not
inhibitory to the process. Using molecular dynamic simulations, the
intermediate to native state transition was shown to proceed via a transition
state (TS) in which the native secondary structure is almost all present and the
helices are engaged in the final packing process.
Figure: Complete Folding Pathway of Engrailed Homeodomain
by Experiment and Simulation
Bradley et al (2005) have taken another step forward in prediction of
tertiary structure for small proteins
(< 85 amino acids). They describe
the two biggest stumbling blocks to such predictions as the huge number of
conformations which must be explored (i.e. all of conformational space) and
accurate determination of the energy of the solvated structures.
Searching conformational space is difficult since the energy landscape
around the global energy minimum can be very steep and sharp, since
modest side chain displacements arising from subtle main chain movements
cause significant side chain packing and energy changes. The
narrowness of the energy well makes it difficult to find the global minimum
in stochastic conformational search processes. Energy calculations
also require better (more realistic) energy functions (force fields) which
show the native state to be clearly differentiated as the global minimum
from the denatured (non-native) states. They conducted energy
calculations on many different small proteins and produced for each protein
a low resolution model. To reach this low resolution model for a given
protein, they found many sequence homologs of the given target protein.
These homologs were naturally occurring sequence variants found by a
relatively conservative BLAST sequence search, with sequence identities of
30-60 percent. They also contained insertions and deletions compared
to the target sequence, which probably are involved in surface loop
structures. The target and homolog sequences were folded,
generating a more diverse population of low-resolution models as starting
points for all-atom refinement of the structure. Then, using a new force field that stressed
short range interactions (van der Waals, H-bonding), which would be expected to
be more important for final folding of the low resolution models than long
range electrostatic forces, they were able to refine the models and
condense them to a final low energy structure that was very close in main and side chain
packing to the experimental crystal structure (resolution < 1. angstroms).
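The sequence identity criterion mentioned above (homologs with 30-60 percent identity to the target) can be illustrated with a small sketch. It assumes the two sequences have already been aligned, with "-" marking gaps, and it counts identity only over columns where both sequences have a residue. Other conventions for the denominator exist, so the numbers from this sketch will not necessarily match those reported by BLAST. The function names and example strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns in which the two residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip insertions and deletions
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def in_homolog_range(identity, low=30.0, high=60.0):
    """True if the identity falls in the 30-60 percent range described in the text."""
    return low <= identity <= high


if __name__ == "__main__":
    target = "MKTAYIAKQR--ISFVKSHF"   # made-up aligned sequences
    homolog = "MKSAYLGKERAAITFVRS--"
    identity = percent_identity(target, homolog)
    print(round(identity, 1), in_homolog_range(identity))
```

Homologs in this intermediate range are different enough to sample new backbone and loop conformations but similar enough that they are still expected to share the fold of the target.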
The holy grail in protein folding research has always been to predict the
tertiary structure of a protein given its primary sequence. A similar
but conceptually easier problem is to design a protein which will fold to a
given structure with predicted secondary structure. Many possible
sequences could be designed to fold to the desired structure, which makes this
problem easier compared to the folding of a given sequence to just one native
state. Kuhlman et al. have recently accomplished such a feat for a
synthetic protein of 93 amino acids which they designed to fold to a unique
topology not yet observed in nature. This represents a significant
advance over earlier attempts in which mimics of known proteins were made.
Such structures would be expected to fold in analogous fashions to the parent
protein because of the necessary constraints placed by the need to fold to a
compact state.
Jmol: Top7 - A designed 93 amino acid protein with a novel fold
Several web sites exist that allow users to download
protein folding software onto their own PC. By distributing folding
calculations to many home PCs, their untapped computational power can be linked
to provide the vast computational time needed to perform these calculations.
Online Literature:
Recent References
- Bradley, P. et al. Toward high-resolution de novo structure prediction for
small proteins. Science 309, 1868 (2005)
- Boyle, J. A. Bioinformatics in Undergraduate Education.
Biochemistry and Molecular Biology Education 32, 236 (2004)
- Feig, A. L., & Jabri, E. Incorporation of Bioinformatics Exercises into
the Undergraduate Biochemistry. Biochemistry and Molecular Biology
Education 30, 224 (2002)
- Mayor et al. The complete folding pathway of a protein from nanoseconds to
microseconds. Nature 421, 863 (2003)
- Zhou and Karplus. Interpreting the folding kinetics of
helical proteins. Nature 401, 400 (1999)