Bioinformatics For Biologists - Barc.wi.mit.edu

Transcription

Bioinformatics for BiologistsComparative Protein Analysis:Part III. Protein Structure Predictionand ComparisonRobert Latek, PhDSr. Bioinformatics ScientistWhitehead Institute for Biomedical ResearchProtein StructureWhy is protein structure information useful?WIBR Bioinformatics Course, Whitehead Institute, October 200321

Predicting Important AAsS265ATP Binding M244E236L266D276E275DomainE258 Y257V270F311F317 A269L301 L284E255 T315L248 K271E281E316 M278E292E286Y253 E279Q252E282L384Q346 M351V339 omain L451E450WIBR Bioinformatics Course, Whitehead Institute, October 20033Surface MappingWIBR Bioinformatics Course, Whitehead Institute, October 200342

Protein InterfacesWIBR Bioinformatics Course, Whitehead Institute, October 20035Property ComparisonsWIBR Bioinformatics Course, Whitehead Institute, October 200363

Syllabus Protein Structure Classification Structure Coordinate Files & Databases Comparing Protein Structures– Aligning 3D Structures Predicting Protein Structure– Specialized Structural Regions– Secondary Structure Prediction– Tertiary Structure Prediction Threading Modeling Structure VisualizationWIBR Bioinformatics Course, Whitehead Institute, October 20037Structure Classification* Proteins can adopt only a limited number ofpossible 3D conformations– Combinations of a helices, b sheets, loops, and coils Completely different sequences can fold intosimilar shapes Protein Structure Classes––––Class a: bundles of a helicesClass b: anti-parallel b sheets (sandwiches and barrels)Class a / b: parallel b sheets with intervening a helicesClass a b: segregated a helices and anti-parallel bsheets– Multi-domain– Membrane/Cell surface proteinsWIBR Bioinformatics Course, Whitehead Institute, October ruc/ProtStruc2.htm4

Structure Families Divide structures into the limited number ofpossible structure families– Homologous proteins can be identified byexamining their respective structures forconserved fold patterns– Representative members can be used formodeling sequences of unknown structureWIBR Bioinformatics Course, Whitehead Institute, October 20039Structure Family Databases SCOP: Structural Classification Of Proteins–– CATH: Classification by Class, Architecture, Topology, and Homology–– – classified first into hierarchical levels like SCOPhttp://www.biochem.ucl.ac.uk/bsm/cath/FSSP: Fold classification based on Structure-structure alignment of proteins– based on a definition of structural similarities. Hierarchical levels to reflect evolutionary andstructural ed on structural alignment of all pair-wise combinations of proteins in PDB by DALI (usedto id common folds and place into tmlMMDB–Aligns 3D structures based on similar arrangements of secondary structural elements B/mmdb.shtmlSARF––categorized on the basis of structural similarity, categories are similar to other dbshttp://123d.ncifcrf.gov/WIBR Bioinformatics Course, Whitehead Institute, October 2003105

Syllabus Protein Structure Classification Structure Coordinate Files & Databases Comparing Protein Structures– Aligning 3D Structures Predicting Protein Structure– Specialized Structural Regions– Secondary Structure Prediction– Tertiary Structure Prediction Threading Modeling Structure VisualizationWIBR Bioinformatics Course, Whitehead Institute, October 200311Coordinates Coordinate Data: location of a molecule’s atoms in space (XYZtriple) XYZ triple is labeled with an atom, residue, chain– Modified aa are labeled with X, H’s not usually listedAtom Residue Chain X Y Z54ALAC 35.4 -9.3 102.5 Data Representation– Chemistry Rules Approach: connect the dots utilizing a standard rules baseto specify bond distances (not consistent among applications)– Explicit Bonding Approach: explicit bonding information is specified in thefile (very consistent)WIBR Bioinformatics Course, Whitehead Institute, October 2003126

Coordinate File Formats MMDB “Molecular Modeling DataBank” Format– ASN.1 standard data description language (explicit bond information) mmCIF “Chemical Interchange Format”– (relational db format) PDB “Protein DataBank” FormatExample Atom#Atom typeChainResidue#tag– Column oriented, “flexible format” (chemistry 59259260260260260X 70017.601StructureZ 427.4825.3125.1925.3223.6121.0320.9419.84WIBR Bioinformatics Course, Whitehead Institute, October 200313Coordinate Databases RCSB (Research Collaboratory for StructuralBioinformatics) http://www.rcsb.org/– Formally know as the Protein Data Bank at Brookhaven NationalLaboratories– Structure Explorer PDB search engine Text and PDB ID (4 letter code) searching MMDB (Molecular Modeling Database @NCBI)– Compilation of structures represented in multiple formats– Provides structure summaries– BLAST sequences to search for available structuresWIBR Bioinformatics Course, Whitehead Institute, October 2003147

Syllabus Protein Structure Classification Structure Coordinate Files & Databases Comparing Protein Structures– Aligning 3D Structures Predicting Protein Structure– Specialized Structural Regions– Secondary Structure Prediction– Tertiary Structure Prediction Threading Modeling Structure VisualizationWIBR Bioinformatics Course, Whitehead Institute, October 200315Sequence & StructureHomology Sequence– Identify relationships between sets of linearprotein sequences Structure– Categorize related structures based on 3D shapes Structure families do not necessarily share sequencehomologyWIBR Bioinformatics Course, Whitehead Institute, October 2003168

Structure Comparison Compare Structures that are:– Identical Similarity/difference of independent structures, x-ray vs. nmr, apo vs.holo forms, wildtype vs. mutant– Similar Predict function, evolutionary history, important domains– Unrelated Identify commonalities between proteins with no apparent commonoverall structure - focus on active sites, ligand binding sites Superimpose Structures by 3D Alignment for ComparisonWIBR Bioinformatics Course, Whitehead Institute, October 200317Structural Alignment Structure alignment forms relationships in 3D space– similarity can be redundant for multiple sequences Considerations––––Which atoms/regions between two structure will be comparedWill the structures be compared as rigid or flexible bodiesCompare all atoms including side chains or just the backbone/CaTry to maximize the number of atoms to align or focus on onelocalized region (biggest differences usually in solvent-exposedloop structures)– How does the resolution of each structure affect comparisonWIBR Bioinformatics Course, Whitehead Institute, October 2003189

Translation and Rotation Alignment– Translate center of mass to acommon origin– Rotate to find a suitablesuperposition Algorithms– Identify equivalent pairs (3) of atomsbetween structures to seed alignment Iterate translation/rotation tomaximize the number of matchedatom pairs– Examine all possible combinationsof alignments and identify theoptimal solutionWIBR Bioinformatics Course, Whitehead Institute, October 200319Alignment Methods Initially examine secondary structuralelements and Ca-Cb distances to identifyfolds and the ability to align Gap penalties for structures that havediscontinuous regions that do not align(alignment-gap-alignment)– Anticipate that two different regions mayalign separately, but not in the samealignment Proceed with alignment method:– Fast, Secondary Structure-Based– Dynamic Programming– Distance MatrixWIBR Bioinformatics Course, Whitehead Institute, October 20032010

Fast Alignment by SS Secondary structure elements can be represented by avector starting at the beginning of the element– Position & length Compare the arrangement of clustered vectors between twostructures to identify common folds Sometimes supplement vectors with information about thearrangement of the side chains (burial/exposure) Significance of alignment– Likelihood that a cluster of secondary structural elements would beexpected between unrelated structuresWIBR Bioinformatics Course, Whitehead Institute, October 200321VAST and SARF Implement automaticmethods to assignsecondary structure vastsearch.html SARFhttp://123d.ncifcrf.gov/WIBR Bioinformatics Course, Whitehead Institute, October 20032211

Exhaustive Alignment Dynamic Programming– Local environment defined in terms of Interatomic distances, bondangles, side chain identity, side chain burial/exposure– Align structures by matching local environments - for example, drawvectors representing each Ca-Cb bond, superimpose vectors Distance Matrix– Graphic procedure similar to a dot matrix alignment of two sequencesto identify atoms that lie most closely together in a 3D structure (basedon Ca distances)– Similar structures have super-imposable graphsWIBR Bioinformatics Course, Whitehead Institute, October 200323DALI Distance Alignment DALI - http://www2.embl-ebi.ac.uk/dali/ Aligns two structures Determines if a new structure is similar toone already in database (for classification)WIBR Bioinformatics Course, Whitehead Institute, October 20032412

Alignment Quality Calculate deviation betweentwo aligned structures RMSD (Root Mean SquareDeviation)– Goodness of fit between two setsof coordinates– Best if 3 Å– Calculate Ca-Ca distances, sumsquare of distances, divide by thenumber of pairs, square rootRMSD N Di2/NWIBR Bioinformatics Course, Whitehead Institute, October 200325Syllabus Protein Structure Classification Structure Coordinate Files & Databases Comparing Protein Structures– Aligning 3D Structures Predicting Protein Structure– Specialized Structural Regions– Secondary Structure Prediction– Tertiary Structure Prediction Threading Modeling Structure VisualizationWIBR Bioinformatics Course, Whitehead Institute, October 20032613

Predicting SpecializedStructures Leucine Zippers– Antiparallel a helices held together by interactions between Lresidues spaced at ever 7th position Coiled Coils– 2 or three a helices coiled around each other in a left-handedsupercoil– Multicoil .pl– COILS2 http://www.ch.embnet.org/software/COILS form.html Transmembrane Regions– 20-30aa domains with strong hydrophobicity– PHDhtm, PHDtopology, TMpred (TMbase)– ctprotein.htmlWIBR Bioinformatics Course, Whitehead Institute, October 200327Predicting Secondary Structure Recognizing Potential Secondary Structure– 50% of a sequence is usually alpha helices and beta sheet structures– Helices: 3.6 residues/turn, N 4 bonding– Strands: extended conformation, interactions between strands, disruptedby beta bulges– Coils: A,G,S,T,P are predominant– Sequences with 45% sequence identity should have similar structures Databases of sequences and accompanying secondary structures(DSSP)WIBR Bioinformatics Course, Whitehead Institute, October 20032814

SS Prediction AlgorithmsChou-Fasman/GOR Analyze the frequency of each of the 20 aa in everysecondary structure (Chou, 1974) A,E,L,M prefer a helices; P,G break helices Use a 4-6aa examination window to predict probability ofa helix, 3-5aa window for beta strands– Extend regions by moving window along sequence 50-60% effective (Higgins, 2000) GOR method assumes that residues flanking the centralwindow/core also influence secondary structureWIBR Bioinformatics Course, Whitehead Institute, October 200329SS Prediction AlgorithmsNeural Networks Examine patterns in secondary structures bycomputationally learning to recognizecombinations of aa that are prevalent within aparticular secondary structure Program is trained to distinguish between patternslocated in a secondary structure from those thatare not usually located in it PHDsec (Profile network from HeiDelberg)– 70% correct otein/submit def.htmlWIBR Bioinformatics Course, Whitehead Institute, October 20033015

SS Prediction AlgorithmsNearest Neighbor Generate an iterated list of peptide fragments bysliding a fixed-size window along sequence Predict structure of aa in center of the window byexamining its k neighbors (Yi, 1993)– Propensity of center position to adopt a structure withinthe context of the neighbors Method relies on an initial training set to teach ithow neighbors influence secondary structure NNSSP simple.htmlWIBR Bioinformatics Course, Whitehead Institute, October 200331SS Prediction Tools NNpredict - 65 % effective*, outputs H,E,– http://www.cmpharm.ucsf.edu/ nomi/nnpredict.html PredictProtein - query sequence examined againstSWISS-PROT to find homologous sequences– MSA of results given to PHD for prediction– 72% effective*– t def.html Jpred - integrates multiple structure predictionapplications and returns a consensus, 73% effective*– http://www.compbio.dundee.ac.uk/ www-jpred/submit.htmlWIBR Bioinformatics Course, Whitehead Institute, October 20033216

Tertiary Structure Prediction Goal– Build a model to use for comparison with otherstructures, identify important residues/interactions,determine function Challenges– Reveal interactions that occur between residuesthat are distant from each other in a linearsequence– Slight changes in local structure can have largeeffects on global structure Methods– Sequence Homology - use a homologous sequenceas a template– Threading - search for structures that have similarfold configurations without any obvious sequencesimilarityWIBR Bioinformatics Course, Whitehead Institute, October 200333Threading - Approaches Sequence is compared for its compatibility(structural similarity) with existing structures Approaches to determine compatibility– Environmental Template: environment of ea. aa in astructure is classified into one of 18 types, evaluate ea.position in query sequence for how well it fits into aparticular type (Mount, 2001)– Contact Potential Method: analyze the closeness ofcontacts between aa in the structure, determine whetherpositions within query sequence could produce similarinteractions (find most energetically favorable) (Mount,2001)WIBR Bioinformatics Course, Whitehead Institute, October 20033417

Threading Process Sequence moved position-by-position through a structure Protein fold modeled by pair-wise inter-atomic calculations to align asequence with the backbone of the template– Comparisons between local and non-local atoms– Compare position i with every other position j and determine whetherinteractions are feasible Optimize model with pseudo energy minimizations - mostenergetically stable alignment assumed to be most favorable Thread the smallest segment reasonable! Computationally intensive. 123D http://123d.ncifcrf.gov/123D MGGNLLYLTGSVDKRTIEKYEREAKDAGRQGWYLSWVMDTNKEERWIBR Bioinformatics Course, Whitehead Institute, October 200335Model Building Perform automated model constructions– SWISS-MODEL Compare sequence to ExPdb to find a template (homology) Define your own templates (from threading) http://www.expasy.ch/swissmod/SWISS-MODEL.html– GENO3D PSI-BLAST to identify homologs possessing structures to be used astemplates http://geno3d-pbil.ibcp.frWIBR Bioinformatics Course, Whitehead Institute, October 20033618

Model Evaluation Manually examine model and alignments Find similar structures through databasesearches– DALI How does the model compare to otherstructures with the template family? Remember, it’s only a MODEL (but evenmodels can be useful)WIBR Bioinformatics Course, Whitehead Institute, October 200337Structure Visualization Different representations of molecule– wire, backbone, space-filling, ribbon NMR ensembles– Models showing dynamic variation of molecules in solution VIEWERS– RasMol (Chime is the Netscape plug-in) � Cn3D MMDB viewer (See in 3D) with explicit bonding http://www.ncbi.nlm.nih.gov/Structure– SwissPDB Viewer http://www.expasy.ch/spdbv/mainpage.html– iMol http://www.pirx.com/iMolWIBR Bioinformatics Course, Whitehead Institute, October 20033819

Pulling It All TogetherYFPWhat is it?Linear SequenceHomology3D StructuralHomologyBlast SearchIdentify Homologs andPerform MSAsSpecialized Structure SearchFunctional Domain HomologyDomain SearchIdentify Functional DomainsThreading/Model BuildingStructural SimilarityMotif SearchLocate ConservedSequence PatternsWIBR Bioinformatics Course, Whitehead Institute, October 200339Demonstration Thread sequence to identify template– Web-based: 123Dhttp://123d.ncifcrf.gov/123D .html Model sequence with template– http://www.expasy.ch/swissmod/SWISS-MODEL.html VisualizationWIBR Bioinformatics Course, Whitehead Institute, October 20034020

ReferencesBioinformatics: Sequence and genome Analysis. David W.Mount. CSHL Press, 2001.Bioinformatics: A Practical Guide to the Analysis of Genesand Proteins. Andreas D. Baxevanis and B.F. FrancisOuellete. Wiley Interscience, 2001.Bioinformatics: Sequence, structure, and databanks. DesHiggins and Willie Taylor. Oxford University Press, 2000.Chou, P.Y. and Fasman, G. D. (1974). Biochemistry, 13, 211.Yi, T-M. and Lander, E.S.(1993) J. Mol. Biol., 232,1117.WIBR Bioinformatics Course, Whitehead Institute, October 20034121

1 Bioinformatics for Biologists Comparative Protein Analysis: Part III. Protein Structure Prediction and Comparison Robert