Day 4: KNIME Practical - EBI

Transcription

Day 4: KNIME PracticalGeorge Papadatos, ChEMBL group, EMBL-EBIFrancis Atkinson, ChEMBL group, EMBL-EBI

Outline Introduction to KNIME Basic components Desktop, nodes, dialogs, workflows Demo Compound selection for focused screening Read chemical dataCalculate propertiesApply drug- and lead- likeness filtersRemove “nasty” compoundsPick diverse moleculesVisualize results and plot properties Exercises 1 & 2 (hands-on)212/12/2013Resources for Computational Drug Discovery

Are there KNIME users among us?312/12/2013Resources for Computational Drug Discovery

What is KNIME? KNIME Konstanz Information MinerDeveloped at University of Konstanz in GermanyDesktop version available free of charge (Open Source)Modular platform for building and executing workflows usingpredefined components, called nodes Core functionality available for tasks such as standard datamining, analysis and manipulation Extra features and functionality available in KNIME throughextensions from various groups and vendors Written in Java based on the Eclipse SDK platform412/12/2013Resources for Computational Drug Discovery

KNIME resources Web pages (documentation) www.knime.org tech.knime.org tech.knime.org/installation-0 Downloads knime.org/download-desktop Community forum tech.knime.org/forum Books and white papers knime.org/node/33079 Myself georgep@ebi.ac.uk512/12/2013Resources for Computational Drug Discovery

What can you do with KNIME? Data manipulation and analysis File & database I/O, sorting, filtering, grouping, joining, pivoting Data mining / machine learning R, WEKA, KNIME, interactive plotting Chemoinformatics Conversions, similarity, clustering, (Q)SAR analysis, MMPs, reactionenumeration Scripting integration R, Perl, Python, Matlab, Octave, Groovy Reporting So much more Bioinformatics, HTS & image analysis, network & text mining Marketing, bid data and business analytics612/12/2013Resources for Computational Drug Discovery

Community contribution nodes http://tech.knime.org/community Chemoinformatics ChEMBL and ChEBI (EBI) – SureChEMBL nodes coming soon! CDK (EBI), RDKit (Novartis), Indigo (GGA), ErlWood (Eli Lilly), Enalos(NovaMechanics) Bioinformatics HCS (MPI), NGS (Konstanz), Image analysis Text mining Palladian Integration Python, Perl, R, Groovy, Matlab (MPI), PDB web services client (Vernalis)712/12/2013Resources for Computational Drug Discovery

Installation & updates Download and unzip KNIME No further setup required Additional nodes after first launch knime.ini contains arguments & parameters for launch New software (nodes) from update sites ns/release Workflows and data are stored in a workspace /Users/georgep/knime/workspace mac new C:\knime 2.8.2\workspace Customization in: File Preferences KNIME812/12/2013Resources for Computational Drug Discovery

KNIME WorkbenchAuto-layout Execute Execute all nodesNode descriptiontabsworkflow projectsfavorite nodespublic serverworkflow editornode repository912/12/2013outlineResources for Computational Drug Discoveryconsole

KNIME nodes: OverviewNode basic processing unit of KNIME workflow which performs a particular taskTitleInput port(s) – on the left of iconOutput port(s) – on the right of iconIconStatus display (‘traffic lights’)Sequence number Red (not ready) Amber (ready) Green (executed) 10Blue bar during execution(with percentage or flashing)12/12/2013Resources for Computational Drug DiscoveryRight-click menuTo configure andexecute the node,display the outputviews, edit thenode, and displaydata for the ports

KNIME nodes: DialogsDouble click to configure Configuration menus forselected nodesExplicit column type1112/12/2013Resources for Computational Drug Discovery

An example completed workflow Workflows can be imported and exported as .zip files With or without the underlying data File Import KNIME workflow File Export KNIME workflow 1212/12/2013Resources for Computational Drug Discovery

Any questions so far?1312/12/2013Resources for Computational Drug Discovery

Compound selection for focused screening1.2.3.4.5.6.14Read chemical dataCalculate phys/chem propertiesApply drug- and lead-likeness filtersApply more filters (e.g. remove solubility liabilities)Apply substructural filters (PAINS subset)Pick diverse molecules12/12/2013Resources for Computational Drug Discovery

The objective1512/12/2013Resources for Computational Drug Discovery

First steps - I Locate the directory with today’smaterial12 Copy and paste it to your desktop You can take it with you too Open the presentation file Import theFocusedScreeningSelection.zip toKNIME Menu File Import workflowto KNIME31612/12/2013Resources for Computational Drug Discovery

First steps - II Open a new workflow Right click on the workflow projects area1231712/12/2013Resources for Computational Drug Discovery

Part 1: Reading chemical data1812/12/2013Resources for Computational Drug Discovery

SDF Reader.\data\SMDC cleaned nodups.sdf134219512/12/2013Resources for Computational Drug Discovery

Inspect the structures Right click on the node2012/12/2013Resources for Computational Drug Discovery

Molecule to RDKit2112/12/2013Resources for Computational Drug Discovery

Any questions so far?2212/12/2013Resources for Computational Drug Discovery

Part 2: Property-based filtering2312/12/2013Resources for Computational Drug Discovery

Descriptor Calculation1232412/12/2013Resources for Computational Drug Discovery

Java Snippet1.\code\Lipinski.txt322512/12/2013Resources for Computational Drug Discovery

Numeric Row Splitter2612/12/2013Resources for Computational Drug Discovery

Inspect the Lipinski fails Right click on the node2712/12/2013Resources for Computational Drug Discovery

Java Snippet1.\code\Oprea.txt322812/12/2013Resources for Computational Drug Discovery

Numeric Row Splitter2912/12/2013Resources for Computational Drug Discovery

Inspect the Oprea fails Right click on the node3012/12/2013Resources for Computational Drug Discovery

Numeric Row Splitter3112/12/2013Resources for Computational Drug Discovery

Inspect the Solubility fails Right click on the node3212/12/2013Resources for Computational Drug Discovery

Any questions so far?3312/12/2013Resources for Computational Drug Discovery

Part 3: Substructure-based filtering3412/12/2013Resources for Computational Drug Discovery

Molecule to Indigo3512/12/2013Resources for Computational Drug Discovery

File reader3612/12/2013.\data\PAINS clean half.sdfResources for Computational Drug Discovery

Query Molecule to Indigo3712/12/2013Resources for Computational Drug Discovery

Inspect the SMARTS rules3812/12/2013Resources for Computational Drug Discovery

Chunk Loop Start3912/12/2013Resources for Computational Drug Discovery

Substructure Matcher4012/12/2013Resources for Computational Drug Discovery

Loop End4112/12/2013Resources for Computational Drug Discovery

Inspect matched structures Right click on the node4212/12/2013Resources for Computational Drug Discovery

Reference Row Filter4312/12/2013Resources for Computational Drug Discovery

Any questions so far?4412/12/2013Resources for Computational Drug Discovery

Part 4: Diversity picking and plotting4512/12/2013Resources for Computational Drug Discovery

RDKit Fingerprint4612/12/2013Resources for Computational Drug Discovery

Inspect the fingerprints Right click on the node4712/12/2013Resources for Computational Drug Discovery

RDKit Diversity Picker4812/12/2013Resources for Computational Drug Discovery

2D/3D Scatterplot4912/12/2013Resources for Computational Drug Discovery

Inspect the plot Right click on the node5012/12/2013Resources for Computational Drug Discovery

Any questions so far?5112/12/2013Resources for Computational Drug Discovery

Exercise 1 Read an sd file with drug information from ChEMBLInspect the structures and their propertiesSelect only drugs that were released after 1990 (First Approval)Select only drugs that target human (Homo sapiens)How many drugs remain now?Save the workflowTips 52Open a new workflowUse the SDF Reader nodeUse the Numeric Row Splitter node to filter on First Approval 1990Use the Nominal Value Row filter node to filter on Organism Homosapiens12/12/2013Resources for Computational Drug Discovery

Exercise 2 Continue from your previous workflowCalculate MW and logP of the drug compoundsGenerate a scatter plot of MW and logPCan you see any compounds with high MW and logP? Tips Use the Molecule to RDKit node Use the RDKit Descriptor Calculator node Include the SlogP and ExactMW descriptors Use the 2D/3D Scatterplot node5312/12/2013Resources for Computational Drug Discovery

Any questions? Last chance!5412/12/2013Resources for Computational Drug Discovery

Conclusions Compound selection for focused screening Typical scenario KNIME Open and free Data analysis Chemoinformatics toolkits Erl Wood, RDKit, Indigo, CDK, etc. Lots of other functionality More advanced KNIME on Friday around lunch time5512/12/2013Resources for Computational Drug Discovery

Further reading Open data and tools1. Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G., ZINC:A free tool to discover chemistry for biology. Journal of Chemical Informationand Modeling 2012 ASAP.2. Saubern, S.; Guha, R.; Baell, J. B., KNIME workflow to assess PAINS filters inSMARTS format. Comparison of RDKit and Indigo cheminformatics libraries.Molecular Informatics 2011, 30, (10), 847-850.3. Barnes, M. R.; Harland, L.; Foord, S. M.; Hall, M. D.; Dix, I.; Thomas, S.;Williams-Jones, B. I.; Brouwer, C. R., Lowering industry firewalls: precompetitive informatics initiatives in drug discovery. Nature Reviews DrugDiscovery 2009, 8, (9), 701-708.4. Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.;Sieb, C.; Thiel, K.; Wiswedel, B., KNIME: The Konstanz Information Miner. InData Analysis, Machine Learning and Applications, Preisach, C.; Burkhardt, H.;Schmidt-Thieme, L.; Decker, R., Eds. Springer: Berlin, 2008; pp 319-326.5. Tiwari, A.; Sekhar, A. K. T., Workflow based framework for life scienceinformatics. Computational Biology and Chemistry 2007, 31, (5-6), 305-319.5612/12/2013Resources for Computational Drug Discovery

Further reading High throughput screening1. Bajorath, J., Integration of virtual and high-throughput screening. NatureReviews Drug Discovery 2002, 1, (11), 882-894.2. Harper, G.; Pickett, S. D.; Green, D. V. S., Design of a compoundscreening collection for use in High Throughput Screening. CombinatorialChemistry & High Throughput Screening 2004, 7, (1), 63-70. Lead- and drug-likeness1. Chuprina, A.; Lukin, O.; Demoiseaux, R.; Buzko, A.; Shivanyuk, A., Drug- andlead-likeness, target class, and molecular diversity analysis of 7.9 millioncommercially available organic compounds provided by 29 suppliers. Journal ofChemical Information and Modeling 2010, 50, (4), 470-479.2. Lipinski, C. A., Lead- and drug-like compounds: the rule-of-five revolution. DrugDiscovery Today: Technologies 2004, 1, (4), 337-341.3. Oprea, T. I.; Davis, A. M.; Teague, S. J.; Leeson, P. D., Is there a differencebetween leads and drugs? A historical perspective. Journal of ChemicalInformation and Computer Sciences 2001, 41, (5), 1308-1315.5712/12/2013Resources for Computational Drug Discovery

Further reading Physicochemical properties and drug discovery1. Brüstle, M.; Beck, B.; Schindler, T.; King, W.; Mitchell, T.; Clark, T., Descriptors,physical properties, and drug-likeness. Journal of Medicinal Chemistry 2002, 45,(16), 3345-3355.2. Hill, A. P.; Young, R. J., Getting physical in drug discovery: A contemporaryperspective on solubility and hydrophobicity. Drug Discovery Today 2010, 15,(15/16), 648-655.3. Leeson, P. D.; Springthorpe, B., The influence of drug-like concepts on decisionmaking in medicinal chemistry. Nature Reviews Drug Discovery 2007, 6, (11), 881890. Structural alerts in HTS1. Baell, J. B.; Holloway, G. A., New substructure filters for removal of Pan AssayInterference Compounds (PAINS) from screening libraries and for their exclusion inbioassays. Journal of Medicinal Chemistry 2010, 53, (7), 2719-2740.2. Rishton, G. M., Reactive compounds and in vitro false positives in HTS. DrugDiscovery Today 1997, 2, (9), 382-384.5812/12/2013Resources for Computational Drug Discovery

Further reading Similarity and diversity1. Ashton, M.; Barnard, J.; Casset, F.; Charlton, M.; Downs, G.; Gorse, D.; Holliday,J.; Lahana, R.; Willett, P., Identification of diverse database subsets usingproperty-based and fragment-based molecular descriptions. QuantitativeStructure-Activity Relationships 2002, 21, (6), 598-604.2. Bender, A.; Glen, R. C., Molecular similarity: a key technique in molecularinformatics. Organic and Biomolecular Chemistry 2004, 2, 3204-3218.3. Gorse, A.-D., Diversity in medicinal chemistry space. Current Topics in MedicinalChemistry 2006, 6, (1), 3-18.4. Maldonado, A.; Doucet, J.; Petitjean, M.; Fan, B.-T., Molecular similarity anddiversity in chemoinformatics: From theory to applications. Molecular Diversity2006, 10, (1), 39-79.5. Rogers, D.; Hahn, M., Extended-connectivity fingerprints. Journal of ChemicalInformation and Modeling 2010, 50, (5), 742-754.6. Schuffenhauer, A.; Brown, N., Chemical diversity and biological activity. DrugDiscovery Today: Technologies 2006, 3, (4), 387-395.7. Willett, P.; Barnard, J. M.; Downs, G. M., Chemical similarity searching. Journalof Chemical Information and Computer Sciences 1998, 38, (6), 983-996.5912/12/2013Resources for Computational Drug Discovery

Day 4: KNIME PracticalGeorge Papadatos, ChEMBL group, EMBL-EBIFrancis Atkinson, ChEMBL group, EMBL-EBI

KNIME Konstanz Information Miner Developed at University of Konstanz in Germany Desktop version available free of charge (Open Source) Modular platform for building and executing workflows using predefined components, called nodes Core functionality available for tasks such as standard data mining, analysis and manipulation