The Evolution and Impact of Predictive Protein Folding Algorithms
Abstract
One of the most steadfast problems in computational biology, with some of the most
promising applications, is three-dimensional (3D) structural protein prediction. With
current protein folding optimization algorithms it is possible to predict the native
structure of relatively small proteins (<100 amino acids) to aid in drug design and
proteomic efforts. However, proteins larger in scale are much more complex and
difficult to decipher from an amino acid sequence. This paper introduces the protein
fold problem, examines the historic efforts and some recent methods for approaching it
at a high level, and concludes with a summary of the problem and a broad overview of
the current obstacles.
Introduction
With the conclusion of the human genome project, a great deal of information and data
about genomics and proteomics have come to light; including the spawning of new
peripheral fields like transcriptomics, peptidomics, epigenomics, etc. Within proteomics,
“computational structure-based protein design is one of the most promising tools for
engineering proteins with new functions”.1 The process at which a chain of amino
acids, a polypeptide being synthesized, folds upon itself to form a 3D protein structure,
is complex and governed by a series of sulfur-sulfur covalent bonds, the covalent
bonds holding the individual residues together,2 hydrophobic interactions, as well as the
weaker non-covalent bonds; hydrogen bonds, electrostatic attractions and van der
Waals attractions.3 If one were able to predict, with 100% certainty, the final folded
shape of a protein, based upon its amino acid sequence, a whole new field of custom
protein engineering could be formed. Imagine a software program showing a 3D protein
scaffold structure, where the researcher could turn and manipulate the protein to view it
at any angle, along its X, Y or Z axes. The program could have a menu of protein
substructures, features, domains, reactant and docking sites, etc., where the user could
select them and add them to the protein at the desired location by the researcher. Once
the researcher has designed the desired protein, the system could generate a
complementary amino acid sequence for synthesis in a laboratory. This is a very
simplistic vision of an extremely complex process as noted for how long the protein fold
problem has existed. Leading one to postulate that such protein engineering software
may still be years out.
Meta-Analysis
A broad, loose search of PubMed4 located 2841 published papers with a subject related
to protein folding algorithms, going back into the 1980’s, well before the initiation of the
human genome project. A direct topic query from Web of Science5 located 390 papers
and OVID MedLine6 located 242 papers using focused keywords. Looking more closely
at the OVID MedLine results, and charting year over year (Figure 1 ), shows the effort
given to the protein folding problem since the early 1990’s; the early years of the
massive data collection event of the human genome project. The human genome
project, on a downward trajectory toward completion in 2006, seems to coincide with
largest spike of published papers on protein folding algorithms, around 2003 and 2004.
Figure 1. OVID MedLine protein folding algorithm papers year over year.
A secondary spike is noted after the conclusion of the human genome project,
beginning in 2006 and going through 2008.
The complexity of the protein folding problem is the root of why it has not been
considered solved. “Long polypeptide chains are very flexible, as many of the peptide
bonds that link the carbon atoms in the polypeptide backbone allow free rotation of the
atoms they join. Thus, proteins can in principle fold in an enormous number of
ways.”3 The final folded protein structure is known as its conformation. It is known that
a protein folds into a final conformation in which its free energy is minimized. It is this
free energy minimization that is at the heart of the protein folding algorithms. The
empirical potential functions are a set of mathematical functions describing the various
energy models involved, where forces between the particles and their potential energies
are calculated using interatomic potentials or molecular mechanics force fields.7
Early Research
Locating the global minimum across the empirical potential functions is complicated due
to the multitude of localized minima over the multidimensional energy surface. The
larger the surface, or longer the peptide chain, the more complex. To deal with this
multi-minima problem, researchers at Cornell University in 1987 employed a Monte
Carlo minimization approach and thus were able to locate the lowest-energy minimum
for enkaphalin, a brain pentapeptide, in the absence of water.8 Monte Carlo methods
depend on continued, repetitious, random sampling in order to obtain numerical results
and rely on the optimization of this randomness to solve what some may consider to be
a deterministic problem. Their work was ground-breaking at the time and suggested the
process of protein folding may be a Markov process. That is, a state machine where
one can predict the next state based upon output parameters. A hidden Markov model,
which best represents the protein folding problem, is where the states are unknown or
hidden, but the input and output parameters are known.9
A few years later Seetharamulu and Crippen describe their persistent efforts for
“locating nearly the global minimum [energy of a polypeptide chain] as a means of
predicting globular protein conformation.”10 Their work mentions numerous pre-existing
programs for calculating the quantum mechanical and empirical energies. And
numerous programs for minimizing the energy as a function of conformation. Their
unrelenting efforts toward an optimization of the empirical potential functions did
advance as they were able to correctly predict the native conformation of the proteins
apamin and melittin, where previous programs failed. This proved to be a great
advance as they were able to predict the structure of actual, albeit small (<30 amino
acids), proteins.
Current Research
Fast forward to 2016 and the problem still exists and seems as though the cumulative
research is only asymptotically solving the problem. The early focus and methods,
around free energy minimization being the root of the algorithms, are still being
used. The issue now being the computational expense of the models. Even with the
advances in microprocessors and general progress in computing power from the past
two decades, the search space itself becomes astronomically large with larger amino
acid chains. This is mostly due to the free rotation of the peptide-peptide bonds and the
energy models themselves being excessively complex. As such, the success of fold
prediction algorithms has been limited to small scale proteins (<100 amino acids). To
deal with this computational expense issue, researchers are working to limit
unnecessary computations by working through an on-lattice model. The idea being to
reduce the search space and the complexity of the hydrophobic-polar energy
model. On a set of benchmark proteins, they claim the optimized genetic algorithm
“significantly outperforms state-of-the-art algorithms.”11
Researchers, Bošković and Brest, recently published a paper using a differential
evolution algorithm against a three-dimensional AB off-lattice model. Where the AB
model represents the amino acid sequence as a string, s = {s1, s2, …, sL}, si ∈ [A,B]. A
represents a hydrophobic amino acid and B represents a hydrophilic amino acid. The
three-dimensional structure of an AB sequence is defined by bond angles, torsional
angles and the unit-length chemical bond between the two adjacent amino acids. This
simplified AB off-lattice model, by taking into account the hydrophobic interactions,
utilizes the main driving forces of protein structure formation. “This model still imitates
the main features of protein structure realistically.”12
Other research seems to be going in the opposite direction. Instead of limiting the
search space, an intermediary step is utilized where more information is added. The
complexity and intense computational resources make protein fold prediction a difficult
problem. As a result, intermediary steps in three-dimensional protein structure
prediction, structural class predictions (SCP) and protein fold recognition (PFR) are
being used. Classifying a protein to its structure or fold can be used as precursor to
predicting the final conformation. SCP and PFR generally utilize syntactic-based
information, evolutionary-based information and physicochemical-based information to
extract features. Raicar, et al., focused their efforts on the physicochemical-based
information. “Features which are dependent on physiochemical attributes can reveal
global properties of proteins.”13 The same vein of adding additional information and
knowledge to the models, is shown in the work of Hartlmüller, Göbl and Madl. They are
utilizing surface accessibility data obtained from nuclear magnetic resonance
spectroscopy. “For several proteins, it is demonstrated that surface accessibility data is
an excellent measure of protein fold” and “ significantly improves accuracy and
convergence”.14
Conclusion
The protein fold problem represents one of the most challenging and yet promising
problems in computational biology. The problem attracts researchers from a variety of
disciplines including chemistry, biology, physics, mathematics and computer
science. Understanding that the final conformation of a protein is based upon the
minimization of free energy, has been the historic focus and remains so today. But
minimizing the free energy across the empirical potential functions is exceedingly
complex due to the free-rotation of the bonds holding the amino acids together within
the chain. Most approaches utilize an energy model and an algorithm to process the
model. It is clear that the two must evolve in tandem for future success. By improving
the energy models and adding new information as it is understood, we gain a more
thorough representation of the real world. But a more thorough representation within the
energy models does us no good if our algorithms are not able to utilize the information,
or may even be hindered by using it incorrectly. Some of the latest approaches, of
minimizing the search space and utilizing surface accessibility data gleamed from other
sources, offer improvements to both the models and the algorithms used to process
them. This, it seems, will be the focus of efforts in the near term.
Since the protein fold prediction problem has had substantial attention paid to it over the
past two decades, yet appears to only being asymptotically solved, the promise of large
custom proteins seems to be fleeting. The day when a researcher can sit a down at a
computer terminal and design a large, custom, protein, complete with the structures and
features desired, may not come for a while. But the implications for proteomics and
drug discovery are immense and each algorithmic advance in predicting final protein
conformation, gets us bit closer to the realization.
References
1. Gainza P, Nisonoff HM, Donald BR. Algorithms for protein design. Current Opinion in
Structural Biology. 2016;39:16–26.
2. Yeargers EK, Shonkwiler RW, Herod JV. An introduction to the mathematics of biology:
with computer algebra models. Boston: Birkhäuser; 1996.
3. Alberts B, Bray D, Hopkin K, Johnson A, Lewis J, Raff MC, et al. Essential cell biology.
New York, NY: Garland Science; 2014.
4. Home - PMC - NCBI [Internet]. National Center for Biotechnology Information. U.S.
National Library of Medicine; [cited 2016Dec4]. Available from:
https://www.ncbi.nlm.nih.gov/pmc/
5. Accessing library databases and electronic resources - Web of Knowledge [Internet].
Logging into the proxy (Rutgers University Libraries). [cited 2016Dec4]. Available from:
http://apps.webofknowledge.com.proxy.libraries.rutgers.edu/
6. Accessing library databases and electronic resources - OVID MedLine [Internet].
Logging into the proxy (Rutgers University Libraries). [cited 2016Dec4]. Available from:
http://ovidsp.tx.ovid.com.proxy.libraries.rutgers.edu/sp-3.22.1b/ovidweb.cgi
7. Molecular Dynamics [Internet]. Wikipedia. Wikimedia Foundation; [cited 2016Dec4].
Available from: https://en.wikipedia.org/wiki/Molecular_dynamics
8. Li Z, Scheraga HA. Monte Carlo-minimization approach to the multiple-minima problem
in protein folding. Proceedings of the National Academy of Sciences.
1987Jan;84(19):6611–5.
9. Markov model [Internet]. Wikipedia. Wikimedia Foundation; [cited 2016Dec4]. Available
from: https://en.wikipedia.org/wiki/Markov_model
10. Seetharamulu P, Crippen GM. A potential function for protein folding. Journal of
Mathematical Chemistry. 1991;6(1):91–110.
11. Rashid MA, Khatib F, Hoque MT, Sattar A. An Enhanced Genetic Algorithm for Ab Initio
Protein Structure Prediction. IEEE Transactions on Evolutionary Computation.
2016Aug;20(4):627–44.
12. Bošković B, Brest J. Differential evolution for protein folding optimization based on a
three-dimensional AB off-lattice model. Journal of Molecular Modeling. 2016;22(10).
13. Raicar G, Saini H, Dehzangi A, Lal S, Sharma A. Improving protein fold recognition and
structural class prediction accuracies using physicochemical properties of amino acids.
Journal of Theoretical Biology. 2016;402:117–28.
14. Hartlmüller C, Göbl C, Madl T. Prediction of Protein Structure Using Surface
Accessibility Data. Angewandte Chemie. 2016;128(39):12149–53.