Protein Structure Prediction (Part I)

Transcription

Protein structure prediction(Part I)CS/CME/BioE/Biophys/BMI 279Oct. 7, 2021Ron Dror1

Outline Why predict protein structure?Can we use (pure) physics-based methods?Knowledge-based methodsTwo major approaches to protein structureprediction– Template-based (“homology”) modeling (e.g., Phyre2)– Ab initio modeling (e.g., Rosetta) An additional approach: analysis of multiplesequences (coevolution) Structure prediction gamesI will include additional material about recent progress in proteinstructure prediction in a subsequent lecture2

Why predict protein structure?3

Problem definition Given the amino acid sequence of a protein,predict its three-dimensional structure Proteins sample many structures. We want theaverage structure, which is roughly what’smeasured experimentally.Average structure the one that the proteins spends most of it time YFKRLGNVSQGMANDKLRGHSITLMYALQNFIDQLDNPDSLDLVCS .4

How are predicted structures used? Drug development– Computational screening of candidate drug compounds– Figuring out how to optimize a promising candidatecompound– Figuring out which binding site to target Identifying the mechanism by which a protein functions– How do genetic mutations alter that function (e.g., causedisease)?– How one might alter that protein’s function (e.g., with adrug)? Interpreting experimental data– For example, a computationally predicted approximatestructure can help in determining an accurate structureexperimentally, as we’ll see later in this course5

Why not just solve the structuresexperimentally? Some structures are very difficult to solve experimentally– Sometimes many labs work for decades to solve the structure of one protein Sequence determination far outpaces experimental structuredetermination– We already have far more sequences than experimental structures, and thisgap will likely growExperimentally derived structures are moreaccurate but difficult to solve thancomputationally derived ds/2015/08/ProteinDBGrowthBar3.png

Can we use (pure) physics-basedmethods?7

Why not just simulate the foldingprocess by molecular dynamics?This ispossible forsome proteins.ChignolinWW domainTrp-cageNTL9BBABBLVillinProtein BExample:Simulation vs.experiment for 12fast-foldingproteins, up to 80residues eachLindorffLarsen et al.,Science, 2011HomeodomainProtein Gα3Dλ-repressor8

For most proteins, this doesn’t (yet) work1. Folding timescales are usually much longer thansimulation timescales.2. Current molecular mechanics force fields aren’talways sufficiently accurate.3. Disulfide bonds form during the real foldingprocess. This is hard to mimic in simulation.Simulating folding is important for understand how the foldingprocess works (that is, how a protein gets from its unfolded state toits folded state—the original “protein folding problem”), but is notnecessary to predict structure. Although many people refer tostructure prediction as “the protein folding problem,” structureprediction is an easier problem (easier, but still tough!).

Can we find simpler physics-basedrules that predict protein structure? For example, look at patterns of hydrophobic,hydrophilic, or charged amino acids? People have tried for a long time without muchsuccess10

Knowledge-based methods11Most methods used in practice utilize knowledge based methods.

Basic idea behind knowledge-based(data-driven) methods We have about 150,000 experimentallydetermined protein structures Can find in PDB (protein data bank) Can we use that information to help us predictnew structures? Yes!We can also usethe 50 millionprotein sequencesin the /startup-data-analytics-metric-

Proteins with similar sequences tend tohave similar structures Proteins with similar sequences tend to behomologs, meaning that they evolved from acommon ancestor The fold of the protein (i.e., its overall structure)tends to be conserved during evolution This tendency is very strong. Even proteins with15% sequence identity usually have similarstructures. Sequence identity refers to each amino acid residue– During evolution, sequence changes more quicklythan structure Also, there only appear to be 1,000–10,000naturally occurring protein folds13

For most human protein sequences, wecan find a homolog with known structureUnstructured(disordered)amino acidsThe plot shows thefraction of amino acids inhuman proteins that canbe mapped to similarsequences in PDBstructures. Differentcolors indicate %sequence identity.14As this graph stops at 2012, percentages are likely even higher in 2021Schwede, Structure 2013

What if we can’t identify a homolog inthe PDB? We can still use information based on knownstructures– We can construct databases of observed structures ofsmall fragments of a protein– We can use the PDB to build empirical, “knowledgebased” energy functions15

Two major approaches to proteinstructure prediction16

Two major approaches to proteinstructure prediction Template-based modeling (homology modeling)– Used when one can identify one or more likelyhomologs of known structure Ab initio structure prediction– Does not require any homologs– Even ab initio approaches usually take advantage ofavailable structural data, but in more subtle ways17

Two major approaches to proteinstructure predictionTemplate-based (“homology”) modeling(e.g., Phyre2)18

Template-based structure prediction:basic workflowQuery sequence seq of protein whose structure you are interested in User provides a query sequence with unknownstructure Search the PDB for proteins with similarsequence and known structure. Pick the bestmatch (the template). Build a model based on that template– One can also build a model based on multipletemplates, where different templates are used fordifferent parts of the protein.19

What does it mean for two sequencesto be similar? Basic measure: count minimum number of aminoacid residues one needs to change, add, ordelete to get from one sequence to another– Sequence identity: amino acids that match exactlybetween the two sequences– Not trivial to compute for long sequences, but thereare efficient dynamic programming algorithms to do so20

What does it mean for two sequencesto be similar? We can do better– Some amino acids are chemically similar to oneanother (example: glutamic acid and aspartic acid) Sequence similarity is like sequence identity, but doesnot count changes between similar amino acidsSimilar in terms of chemical properties (ie. Acidic, basic, nonpolar, aromaticity)Glutamic acidAspartic acid21

What does it mean for two sequencesto be similar? We can do even better– Once we’ve identified some homologs to a querysequence (i.e., similar sequences in the sequencedatabase), we can create a profile describing theprobability of mutation to each amino acid at each position– We can then use this profile to search for more homologs– Iterate between identification of homologs and profileconstruction– Measure similarity of two sequences by comparing theirprofiles– Often implemented using Hidden Markov Models (HMMs) For example, the HHBlits software toolYou are not responsible for knowing about HMMs22

We’ll use the Phyre2 template-basedmodeling server as an example Try it out: http://www.sbg.bio.ic.ac.uk/phyre2/ Why use Phyre2 as an example of templatebased modeling?– Among the better automated structureprediction servers– Among the most widely used, and arguablythe easiest to use– Approach is similar to that of other templatebased modeling methods– Great name!23

Phyre2 algorithmic pipelineLA Kelley et al.,Nature Protocols10:845 (2015)24

Phyre2 algorithmic pipelineIdentify similar sequences inprotein sequence database25

Phyre2 algorithmic pipelineChoose a templatestructure by:(1) comparing sequenceprofiles and(2) predicting secondarystructure for each residuein the query sequenceand comparing tocandidate templatestructures. Secondarystructure (alpha helix,beta sheet, or neither) ispredicted for segments ofquery sequence using aneural network trained onknown structures.26

Phyre2 algorithmic pipelineCompute optimalalignment of querysequence to templatestructure27

Phyre2 algorithmic pipelineBuild a crude backbone model (no side chains) by simply superimposingcorresponding amino acids. Some of the query residues will not be modeled,because they don’t have corresponding residues in the template (insertions).There will be some physical gaps in the modeled backbone, because sometemplate residues don’t have corresponding query residues (deletions).

Phyre2 algorithmic pipelineUse loop modeling to patch up defects in the crude model due to insertions anddeletions. For each insertion or deletion, search a large library of fragments (2-15residues) of PDB structures for ones that match local sequence and fit thegeometry best. Tweak backbone dihedrals within these fragments to make themfit better.29

Phyre2 algorithmic pipelineAdd side chains. Use adatabase of commonlyobserved structures for eachside chain (these structuresare called rotamers). Searchfor combinations ofrotamers that will avoidsteric clashes (i.e., atomsending up on top of oneanother).30

Modeling based on multiple templates In “intensive mode,” Phyre2 will use multiple templatesthat cover (i.e., match wellto) different parts of thequery sequence.– Build a crude backbonemodel for each template– Extract distances betweenresidues for “reliable” partsof each model– Perform a simplified proteinfolding simulation in whichthese distances are used asconstraints. Additionalconstraints enforce predictedsecondary structure– Fill in the side chains, as forsingle-template modelsYou’re not responsible for thisLA Kelley et al.,Nature Protocols10:845 (2015)31

Two major approaches to proteinstructure predictionAb initio modeling (e.g., Rosetta)32

Two major approaches to proteinstructure prediction Template-based modeling (homology modeling)– Used when one can identify one or more likelyhomologs of known structure Ab initio structure prediction– Does not require any homologs– Even ab initio approaches usually take advantage ofavailable structural data, but in more subtle ways33

Ab initio structure prediction Also known as “de novo structure prediction” Many approaches proposed over time Probably the most successful is fragmentassembly, as exemplified by the Rosettasoftware package34

We’ll use Rosetta as an example ofab initio structure prediction Software developed over the last 20–25 years by David Baker(U. Washington) and collaborators Software at: https://www.rosettacommons.org/software Structure prediction server: http://robetta.bakerlab.org/ Why use Rosetta as an example?– Among the better ab initio modeling packages (for someyears it was the best)– Approach is similar to that of many ab initio modelingpackages– Rosetta provides a common framework that has becomevery popular for a wide range of molecular prediction anddesign tasks, such as protein design and RNA structure35prediction

Key ideas behind Rosetta Knowledge-based energy function– In fact, two of them: Gives an approximation of side chainsThe “Rosetta energy function,” which is coarse-grained(i.e., does not represent all atoms in the protein), is usedin early stages of protein structure predictionThe “Rosetta all-atom energy function,” which dependson the position of every atom, is used in late stages A knowledge-based strategy for searchingconformational space (i.e., the space of possiblestructures for a protein)– Fragment assembly forms the core of this method36Called “knowledge-based” because these strategies are informed by structures in PDB and are not necessarily homologs ofthe protein you are predicting

Rosetta energy function At first this was the only energy function used byRosetta (hence the name) Based on a simplified representation of proteinstructure:– Do not explicitly represent solvent (e.g., water)– Assume all bond lengths and bond angles are fixed– Represent the protein backbone using torsion angles(three per amino acid: Φ, Ψ, ω)– Represent side chain position using a single “centroid,”located at the side chain’s center of mass Centroid position determined by averaging over allstructures of that side chain in the PDB37

Rosetta energy functionFrom Rohl et al., Methods in Enzymology 200438You’re not responsible for the details!

Rosetta energy functionFrom Rohl et al., Methods in Enzymology 2004Updated version with more terms: Alford et al., Journal ofChemical Theory and Computation, 201739You’re not responsible for the details!

Rosetta energy function: take-aways The (coarse-grained) Rosetta energy function isessentially entirely knowledge-based– Based on statistics computed from the PDB Many of the terms are of the form –loge[P(A)],where P(A) is the probability of some set A– This is essentially the free energy of set A. Recalldefinition of free energy:GA kBT log e ( P(A))P(A) exp( GAkBT)k b Boltzmann’s constantT temperature40

Rosetta all-atom energy function Still makes simplifying assumptions:– Do not explicitly represent solvent (e.g., water)– Assume all bond lengths and bond angles are fixed Functional forms are a hybrid between molecularmechanics force fields and the (coarse-grained) Rosettaenergy function– Partly physics-based, partly knowledge-based41

Are these potential energy functions orfree energy functions? The energy functions of previous lectures were potentialenergy functions One can also attempt to construct a free energy function,where the energy associated with a conformation is thefree energy of the set of “similar” conformations (for somedefinition of “similar”) The Rosetta energy functions are approximate free energyfunctions (despite sometimes being referred to as potentialenergy functions)– This means that searching for the “minimum” energy is more valid(as a way to determine structure)– Nevertheless, typical protocol is to repeat the search processmany times, cluster the results, and report the largest cluster as42the solution. This rewards wider and deeper wells.

How does Rosetta search theconformational space? Two steps:– Coarse search: fragment assembly– Refinement Perform coarse search many times, and thenperform refinement on each result43

Coarse search: fragment assembly Uses a large database of 3-residue and 9-residue fragments,taken from structures in the PDB Monte Carlo sampling algorithm proceeds as follows:1. Start with the protein in an extended conformation2. Randomly select a 3-residue or 9-residue section3. Find a fragment in the library whose sequence resembles it4. Consider a move in which the backbone dihedrals of theselected section are replaced by those of the fragment.Calculate the effect on the entire protein structure.5. Evaluate the Rosetta energy function before and after themove.6. Use the Metropolis criterion to accept or reject the move.7. Return to step 2 The real search algorithm adds some bells and whistles44

Refinement Refinement is performed using the Rosetta allatom energy function, after building in sidechains Refinement involves a combination of MonteCarlo moves and energy minimization The Monte Carlo moves are designed to perturbthe structure much more gently than those usedin the coarse search– Many still involve the use of fragments45

An additional approach: analysis ofmultiple sequences (coevolution)46

We’ve discussed two approaches toprotein structure prediction Template-based modeling (homology modeling)– Used when one can identify one or more likelyhomologs of known structure Ab initio structure prediction– Does not require any homologs– Even ab initio approaches usually take advantage ofavailable structural data, but in more subtle waysWhat if we know sequences of many homologs, butdon’t have structures for any of them?Tricky to decide if you want to use template based modeling or ab initio structure prediction 47

Amino acids in direct physical contact tend tocovary or “coevolve” across related proteinsFor example, amutation that causesone amino acid to getbigger is more likely topreserve proteinstructure and function(and thus survive) ifanother amino acidgets smaller to TGG.

Can we use this observation to predictstructure? Given many sequences of related proteins(whose structure is assumed to be similar), lookfor amino acids that coevolve. They are probablyin contact This idea has been around for some time, but ithas become practically useful recently, thanks to:– A dramatic increase in amount of sequence dataavailable Allows you to find more sequences of homologs’– Better computational methods49

Some key papers on this approachProtein 3D Structure Computed from EvolutionarySequence VariationPLoS ONE, 2011Debora S. Marks1*., Lucy J. Colwell2., Robert Sheridan3, Thomas A. Hopf1, Andrea Pagnani4, RiccardoZecchina4,5, Chris Sander31 Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America, 2 MRC Laboratory of Molecular Biology, Hills Road,Cambridge, United Kingdom, 3 Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America, 4 Human GeneticsFoundation, Torino, Italy, 5 Politecnico di Torino, Torino, ItalyDistance-basedprotein folding powered byAbstractThe evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequencedeeplearninghomologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to theseconstraints.Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineeringJinboXua,1 presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent cing.In this paper we ask whether we can infer evolutionary constraints fromToyotaTechnologicalInstitute at Chicago,Chicago,IL 60637a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set ofEditedby David correlations.Baker, UniversityWashington,Seattle, WA,usingand approvedJuly 15,2019 (receivedreviewDecember14, 2018)observedWeofaddressthis challengea maximumentropymodel forof theproteinsequence,constrained byArticlethe statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength ofDirect coupling analysis (DCA) for protein folding has made veryBoth our ResNet and DCA are global prediction methodsthese inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoringgood progress, but it is not effective for proteins that lack manybecause they predict the contact/distance score or probability ofresidue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. Wesequence homologs, even coupled with time-consuming conforone residuepair by consideringits correlationwithdifferentother residuequantify this observation by computing, from sequence alone, all-atom3D structuresof fifteen testproteins frommation sampling with fragments. We show that we can accuratelypairs at distantsequencewhichinferencesis the keyaretodethe sigfold classes, ranging in size from 50 to 260 residues., including a G-proteincoupledreceptor.positions,These blindedpredict interresidue distance distribution of a protein by deep principle,novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution wheneven for proteins with 60 sequence homologs. Using only the geomany convolutionallayers Åareused, errorit is possiblecapturesignals provide sufficient information to determine accurate 3D proteinstructure to 2.7–4.8Ca-RMSDrelative totothemetric constraints given by the resulting distance matrix we may concorrelationbetweentwo residue pairsThisacrossthe wholeobserved structure, over at least two-thirds of the protein (methodcalled EVfold,detailsanyat http://EVfold.org).discoverystruct 3D models without involving extensive conformation sampling.contact/distanceHowever,ResNet differsDCA inprovides insight into essential interactions constraining protein evolutionand willmatrix.facilitatea comprehensivesurveyfromof theOur method successfully folded 21 of the 37 CASP12 hard1,4 -1923-7AndrewW. Senior*, RichardEvans, JohnJumper, JamesKirkpatrick, onoffunctionalgeneticwith a median family size of 58 effectivesequence homologs11Žídek1,structureAlexander W.R. Nelson, AlexDCABridgland,Tim Green1, Chongli Qin1, Augustin(e.g.,motifs)whilemainlyfocuses on pairwisevariantsonina2019normaland disease genomes.Received:1within 4 2hAprilLinux computer of 20 centralprocessingunits.In 1, Karen Simonyan1, Steve Crossan1, Pushmeet Kohli1,, StigPetersenHugo al context of acontrast, 10DCA-predictedbe usedto2,3,foldof1, Koray Kavukcuoglu1 & Demis Hassabis1Accepted:December 2019contacts cannotDavidDavidanySilverT. Jonescontactmatrix,Computedand 3) existingDCA methodsroughlyCitation:DS, inColwellLJ, SheridanHopf TA, PagnaniA, et al. (2011)3D Structurefrom EvolutionarySequence areVariation.PLoS linearthesehard Markstargetsthe absenceof R,extensiveconformationsam- ProteinPublishedonline:15 imatedfrom ling, and the best CASP12 group folded only 11 of them by inteproteinfamily,ResNet is ashapenonlinearmodel withpredictioncansinglebe usedto dictedAndrej Sali, Universityof CaliforniaSan ProteinFrancisco,structureUnitedStatesof Americagratingcontactsinto of protein families. Deepsampling.validationin14CASP13showsa proteinfromPublisheditsaminothatacidsequence. This estimatedproblem is offundamentalimportanceReceivedRigorousNovember experimental10, 2011; AcceptedNovember, 2011;December7, 20112learning(DL)modelssuchasCMAPpro(20) and Deep Beliefour distance-based folding server successfullyfolded17of32hardas the structureof a proteinlargelyfunction; however,proteinCopyright: ! 2011 Marks et al. This is an open-accessarticle distributedunderthe determinesterms of theitsCreativeCommonsAttributionLicense, which withuse,a medianfamilysize of 36 tion,and reproductioninany ed.can be difficult to determine experimentally. Considerable progress has prediction before,obtained 70% precision on the top L/5 long-range predicted conbutgeneticResNetis a DL Itmethodthatgreatlyoutperforms shallowrecentlybeenby leveraginginformation.isPhysicalpossibleto inferwhich Center (NIH U54Funding: CS and RS have support from the Dana Farber CancermadeInstitute-MemorialSloan-KetteringCancer V(18).DifferentfromNationalResNet andCA143798). LC is supported by an Engineering aminoand PhysicalSciences goushas support from the Germanacid residuesare in contactanalysingcovariation inserverpredictedcorrectfoldsfor from2 membraneproteins grantwhile267915.all ofNo otherAcademicFoundation.RZ hassupportEuropean Communityfinancialsupportreceived3 for theareresearch.funders hadmethods,no roleDCA,DBNand dsthe predictionof protein structures . Here we show that westudy design,datacollectionand resultsanalysis, decisionto publish,or itpreparationof themanuscript.thein otherserversfailed.Thesedemonstratethatis innowtheypredict the label (i.e., contact or distance) of 1 residue pairImproved protein structure prediction usingpotentials from deep learningPNAS, 2019Nature, 2020This paper from DeepMind describes the original AlphaFold method, butthe current AlphaFold method is substantially different, as we’ll see later

Structure prediction games51

FoldIt: Protein-folding game https://fold.it/ Basic idea: allow players to optimize the Rosettaall-atom energy function– Game score is negative of the energy (plus a constant)52

53

EteRNA: RNA design game Similar idea, but:– For RNA rather than protein.– Goal is RNA design. Users collective design RNA sequences, which are testedexperimentally. From Rhiju Das (Stanford) and Adrien Treuille (CMU)54

– Some amino acids are chemically similar to one another (example: glutamic acid and aspartic acid) Sequence similarity is like sequence identity, but does not count changes between similar amino acids 21 Glutamic acid Aspartic acid Similar in terms of chemi