The Evolution and Impact of Predictive Protein Folding Algorithms

Abstract

One of the most steadfast problems in computational biology, with some of the most

promising applications, is three-dimensional (3D) structural protein prediction. With

current protein folding optimization algorithms it is possible to predict the native

structure of relatively small proteins (<100 amino acids) to aid in drug design and

proteomic efforts. However, proteins larger in scale are much more complex and

difficult to decipher from an amino acid sequence. This paper introduces the protein

fold problem, examines the historic efforts and some recent methods for approaching it

at a high level, and concludes with a summary of the problem and a broad overview of

the current obstacles.

Introduction

With the conclusion of the human genome project, a great deal of information and data

about genomics and proteomics have come to light; including the spawning of new

peripheral fields like transcriptomics, peptidomics, epigenomics, etc. Within proteomics,

“computational structure-based protein design is one of the most promising tools for

engineering proteins with new functions”.1 The process at which a chain of amino

acids, a polypeptide being synthesized, folds upon itself to form a 3D protein structure,

is complex and governed by a series of sulfur-sulfur covalent bonds, the covalent

bonds holding the individual residues together,2 hydrophobic interactions, as well as the

weaker non-covalent bonds; hydrogen bonds, electrostatic attractions and van der

Waals attractions.3 If one were able to predict, with 100% certainty, the final folded

shape of a protein, based upon its amino acid sequence, a whole new field of custom

protein engineering could be formed. Imagine a software program showing a 3D protein

scaffold structure, where the researcher could turn and manipulate the protein to view it

at any angle, along its X, Y or Z axes. The program could have a menu of protein

substructures, features, domains, reactant and docking sites, etc., where the user could

select them and add them to the protein at the desired location by the researcher. Once

the researcher has designed the desired protein, the system could generate a

complementary amino acid sequence for synthesis in a laboratory. This is a very

simplistic vision of an extremely complex process as noted for how long the protein fold

problem has existed. Leading one to postulate that such protein engineering software

may still be years out.

Meta-Analysis

A broad, loose search of PubMed4 located 2841 published papers with a subject related

to protein folding algorithms, going back into the 1980’s, well before the initiation of the

human genome project. A direct topic query from Web of Science5 located 390 papers

and OVID MedLine6 located 242 papers using focused keywords. Looking more closely

at the OVID MedLine results, and charting year over year (Figure 1 ), shows the effort

given to the protein folding problem since the early 1990’s; the early years of the

massive data collection event of the human genome project. The human genome

project, on a downward trajectory toward completion in 2006, seems to coincide with

largest spike of published papers on protein folding algorithms, around 2003 and 2004.

Figure 1. OVID MedLine protein folding algorithm papers year over year.

A secondary spike is noted after the conclusion of the human genome project,

beginning in 2006 and going through 2008.

The complexity of the protein folding problem is the root of why it has not been

considered solved. “Long polypeptide chains are very flexible, as many of the peptide

bonds that link the carbon atoms in the polypeptide backbone allow free rotation of the

atoms they join. Thus, proteins can in principle fold in an enormous number of

ways.”3 The final folded protein structure is known as its conformation. It is known that

a protein folds into a final conformation in which its free energy is minimized. It is this

free energy minimization that is at the heart of the protein folding algorithms. The

empirical potential functions are a set of mathematical functions describing the various

energy models involved, where forces between the particles and their potential energies

are calculated using interatomic potentials or molecular mechanics force fields.7

Early Research

Locating the global minimum across the empirical potential functions is complicated due

to the multitude of localized minima over the multidimensional energy surface. The

larger the surface, or longer the peptide chain, the more complex. To deal with this

multi-minima problem, researchers at Cornell University in 1987 employed a Monte

Carlo minimization approach and thus were able to locate the lowest-energy minimum

for enkaphalin, a brain pentapeptide, in the absence of water.8 Monte Carlo methods

depend on continued, repetitious, random sampling in order to obtain numerical results

and rely on the optimization of this randomness to solve what some may consider to be

a deterministic problem. Their work was ground-breaking at the time and suggested the

process of protein folding may be a Markov process. That is, a state machine where

one can predict the next state based upon output parameters. A hidden Markov model,

which best represents the protein folding problem, is where the states are unknown or

hidden, but the input and output parameters are known.9

A few years later Seetharamulu and Crippen describe their persistent efforts for

“locating nearly the global minimum [energy of a polypeptide chain] as a means of

predicting globular protein conformation.”10 Their work mentions numerous pre-existing

programs for calculating the quantum mechanical and empirical energies. And

numerous programs for minimizing the energy as a function of conformation. Their

unrelenting efforts toward an optimization of the empirical potential functions did

advance as they were able to correctly predict the native conformation of the proteins

apamin and melittin, where previous programs failed. This proved to be a great

advance as they were able to predict the structure of actual, albeit small (<30 amino

acids), proteins.

Current Research

Fast forward to 2016 and the problem still exists and seems as though the cumulative

research is only asymptotically solving the problem. The early focus and methods,

around free energy minimization being the root of the algorithms, are still being

used. The issue now being the computational expense of the models. Even with the

advances in microprocessors and general progress in computing power from the past

two decades, the search space itself becomes astronomically large with larger amino

acid chains. This is mostly due to the free rotation of the peptide-peptide bonds and the

energy models themselves being excessively complex. As such, the success of fold

prediction algorithms has been limited to small scale proteins (<100 amino acids). To

deal with this computational expense issue, researchers are working to limit

unnecessary computations by working through an on-lattice model. The idea being to

reduce the search space and the complexity of the hydrophobic-polar energy

model. On a set of benchmark proteins, they claim the optimized genetic algorithm

“significantly outperforms state-of-the-art algorithms.”11

Researchers, Bošković and Brest, recently published a paper using a differential

evolution algorithm against a three-dimensional AB off-lattice model. Where the AB

model represents the amino acid sequence as a string, s = {s1, s2, …, sL}, si ∈ [A,B]. A

represents a hydrophobic amino acid and B represents a hydrophilic amino acid. The

three-dimensional structure of an AB sequence is defined by bond angles, torsional

angles and the unit-length chemical bond between the two adjacent amino acids. This

simplified AB off-lattice model, by taking into account the hydrophobic interactions,

utilizes the main driving forces of protein structure formation. “This model still imitates

the main features of protein structure realistically.”12

Other research seems to be going in the opposite direction. Instead of limiting the

search space, an intermediary step is utilized where more information is added. The

complexity and intense computational resources make protein fold prediction a difficult

problem. As a result, intermediary steps in three-dimensional protein structure

prediction, structural class predictions (SCP) and protein fold recognition (PFR) are

being used. Classifying a protein to its structure or fold can be used as precursor to

predicting the final conformation. SCP and PFR generally utilize syntactic-based

information, evolutionary-based information and physicochemical-based information to

extract features. Raicar, et al., focused their efforts on the physicochemical-based

information. “Features which are dependent on physiochemical attributes can reveal

global properties of proteins.”13 The same vein of adding additional information and

knowledge to the models, is shown in the work of Hartlmüller, Göbl and Madl. They are

utilizing surface accessibility data obtained from nuclear magnetic resonance

spectroscopy. “For several proteins, it is demonstrated that surface accessibility data is

an excellent measure of protein fold” and “ significantly improves accuracy and

convergence”.14

Conclusion

The protein fold problem represents one of the most challenging and yet promising

problems in computational biology. The problem attracts researchers from a variety of

disciplines including chemistry, biology, physics, mathematics and computer

science. Understanding that the final conformation of a protein is based upon the

minimization of free energy, has been the historic focus and remains so today. But

minimizing the free energy across the empirical potential functions is exceedingly

complex due to the free-rotation of the bonds holding the amino acids together within

the chain. Most approaches utilize an energy model and an algorithm to process the

model. It is clear that the two must evolve in tandem for future success. By improving

the energy models and adding new information as it is understood, we gain a more

thorough representation of the real world. But a more thorough representation within the

energy models does us no good if our algorithms are not able to utilize the information,

or may even be hindered by using it incorrectly. Some of the latest approaches, of

minimizing the search space and utilizing surface accessibility data gleamed from other

sources, offer improvements to both the models and the algorithms used to process

them. This, it seems, will be the focus of efforts in the near term.

Since the protein fold prediction problem has had substantial attention paid to it over the

past two decades, yet appears to only being asymptotically solved, the promise of large

custom proteins seems to be fleeting. The day when a researcher can sit a down at a

computer terminal and design a large, custom, protein, complete with the structures and

features desired, may not come for a while. But the implications for proteomics and

drug discovery are immense and each algorithmic advance in predicting final protein

conformation, gets us bit closer to the realization.

References

1. Gainza P, Nisonoff HM, Donald BR. Algorithms for protein design. Current Opinion in

Structural Biology. 2016;39:16–26.

2. Yeargers EK, Shonkwiler RW, Herod JV. An introduction to the mathematics of biology:

with computer algebra models. Boston: Birkhäuser; 1996.

3. Alberts B, Bray D, Hopkin K, Johnson A, Lewis J, Raff MC, et al. Essential cell biology.

New York, NY: Garland Science; 2014.

4. Home - PMC - NCBI [Internet]. National Center for Biotechnology Information. U.S.

National Library of Medicine; [cited 2016Dec4]. Available from:

https://www.ncbi.nlm.nih.gov/pmc/

5. Accessing library databases and electronic resources - Web of Knowledge [Internet].

Logging into the proxy (Rutgers University Libraries). [cited 2016Dec4]. Available from:

http://apps.webofknowledge.com.proxy.libraries.rutgers.edu/

6. Accessing library databases and electronic resources - OVID MedLine [Internet].

Logging into the proxy (Rutgers University Libraries). [cited 2016Dec4]. Available from:

http://ovidsp.tx.ovid.com.proxy.libraries.rutgers.edu/sp-3.22.1b/ovidweb.cgi

7. Molecular Dynamics [Internet]. Wikipedia. Wikimedia Foundation; [cited 2016Dec4].

Available from: https://en.wikipedia.org/wiki/Molecular_dynamics

8. Li Z, Scheraga HA. Monte Carlo-minimization approach to the multiple-minima problem

in protein folding. Proceedings of the National Academy of Sciences.

1987Jan;84(19):6611–5.

9. Markov model [Internet]. Wikipedia. Wikimedia Foundation; [cited 2016Dec4]. Available

from: https://en.wikipedia.org/wiki/Markov_model

10. Seetharamulu P, Crippen GM. A potential function for protein folding. Journal of

Mathematical Chemistry. 1991;6(1):91–110.

11. Rashid MA, Khatib F, Hoque MT, Sattar A. An Enhanced Genetic Algorithm for Ab Initio

Protein Structure Prediction. IEEE Transactions on Evolutionary Computation.

2016Aug;20(4):627–44.

12. Bošković B, Brest J. Differential evolution for protein folding optimization based on a

three-dimensional AB off-lattice model. Journal of Molecular Modeling. 2016;22(10).

13. Raicar G, Saini H, Dehzangi A, Lal S, Sharma A. Improving protein fold recognition and

structural class prediction accuracies using physicochemical properties of amino acids.

Journal of Theoretical Biology. 2016;402:117–28.

14. Hartlmüller C, Göbl C, Madl T. Prediction of Protein Structure Using Surface

Accessibility Data. Angewandte Chemie. 2016;128(39):12149–53.