Using Ontologies For Data And Semantic Integration


Using Ontologies for Data and Semantic Integration
Monica Crubézy
Stanford Medical Informatics, Stanford University
November 4, 2003

Ontologies
Conceptualize a domain of discourse, an area of expertise
  – Concepts (drug, patient, gene, clinical trial)
  – Properties, or attributes (dosage, age, location)
  – Relationships (contra-indications, body parts)
Adhere to a modeling formalism, such as:
  – Frame-based representation
  – Description logics

Protégé
A general-purpose environment for ontology editing and knowledge-base construction
  – Open-source, freely available (protege.stanford.edu)
  – Interoperable with standards for knowledge representation (OKBC, RDF/S and more recently OWL)
  – Extensible in many ways (GUI, plugins, storage)
Main frame-based modeling constructs
  – Classes represent concepts, organized in a hierarchy
  – Slots represent properties of classes, with restriction facets on their values (e.g., type, cardinality, range)
  – Instances represent individual members of a class, with particular values for slots
  – Instance-valued slots hold relationships with other concepts
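
The frame-based constructs on this slide (classes in a hierarchy, slots with facets, instances, instance-valued slots) can be illustrated with a minimal Python sketch. This is not the Protégé API: the names `Slot`, `Frame`, `Instance`, and the example `Thing`/`Drug`/`Patient` classes are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Slot:
    """A property of a class, with type and cardinality facets."""
    name: str
    value_type: type
    min_card: int = 0
    max_card: Optional[int] = 1  # None means unbounded

    def check(self, values: list) -> None:
        if len(values) < self.min_card:
            raise ValueError(f"{self.name}: needs at least {self.min_card} value(s)")
        if self.max_card is not None and len(values) > self.max_card:
            raise ValueError(f"{self.name}: allows at most {self.max_card} value(s)")
        for value in values:
            if not isinstance(value, self.value_type):
                raise TypeError(f"{self.name}: expected {self.value_type.__name__}")


@dataclass
class Frame:
    """A class (concept); slots are inherited from the parent class."""
    name: str
    parent: Optional["Frame"] = None
    own_slots: List[Slot] = field(default_factory=list)

    def slots(self) -> Dict[str, Slot]:
        inherited = self.parent.slots() if self.parent else {}
        return {**inherited, **{s.name: s for s in self.own_slots}}


@dataclass
class Instance:
    """An individual member of a class, with particular slot values."""
    frame: Frame
    values: Dict[str, list]

    def __post_init__(self) -> None:
        slots = self.frame.slots()
        unknown = set(self.values) - set(slots)
        if unknown:
            raise KeyError(f"unknown slot(s): {sorted(unknown)}")
        for name, slot in slots.items():
            slot.check(self.values.get(name, []))


# Hypothetical example ontology
thing = Frame("Thing", own_slots=[Slot("name", str, min_card=1)])
drug = Frame("Drug", parent=thing, own_slots=[Slot("dosage_mg", float)])
patient = Frame("Patient", parent=thing, own_slots=[
    Slot("age", int),
    Slot("prescribed", Instance, max_card=None),  # instance-valued slot
])

aspirin = Instance(drug, {"name": ["aspirin"], "dosage_mg": [100.0]})
patient_1 = Instance(patient, {"name": ["patient-1"], "age": [54], "prescribed": [aspirin]})
```

Facets are checked when an instance is created, so a value of the wrong type or a missing required value is rejected at that point rather than later.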

GLIF: Ontology for Clinical Guidelines
[Figure callouts: Class hierarchy; List of slots for class Action Step]

GLIF: An instance of Action Step
[Figure callouts: Automatically-generated instance-knowledge entry form; Specific values fill slots]

Ontologies for Data Integration
1. Hold reference/standard models and data repositories (e.g., the GLIF ontology)
   Existing examples speak for themselves
2. Integrate data, metadata, and semantics of multiple data sources
   A template ontology approach
3. Enable reconciliation and translation of data between different models
   An ontology-mapping approach

1. (Standard) Ontologies in Biomedicine
Pervasive
  – From controlled terminologies to full-blown ontologies
  – Across the entire scope from biology to medicine
Many exemplars
  – Unified Medical Language System (UMLS)
  – Medical terminology and concept description (GALEN/OpenGALEN)
  – Foundational Model of Anatomy
  – Guideline models (GLIF, SAGE)
  – Gene Ontology (GO)
  – Pharmacogenomics ontology (PharmGKB)

2. Integrating Data and Semantics
Syntactic differences
  <sales> “Robitussin” 25 </sales>
  <sales> “Pepto-Bismol” 100 </sales>
  Differences are usually explicit, but may be hard to reconcile.
Semantic differences
  “Sales” means cases sold per week.
  “Robitussin” means all Robitussin-branded [...]
  versus
  “Sales” is average number of bottles sold per hour.
  “Robitussin” only refers to Robitussin DM.
  Differences can be subtle and implicit.
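
The semantic mismatch on this slide can be made concrete with a small Python sketch in which each record carries an explicit description of what its fields mean. The class and field names are hypothetical, and the count of 25 for the second source is an assumption made so that the two records look identical syntactically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Semantics:
    unit: str           # what is being counted
    period: str         # time window of the count
    product_scope: str  # which products the product name covers


@dataclass(frozen=True)
class SalesRecord:
    product: str
    sales: int
    semantics: Semantics


def directly_comparable(a: SalesRecord, b: SalesRecord) -> bool:
    """Two records can be compared as-is only if their semantics agree."""
    return a.product == b.product and a.semantics == b.semantics


# Same surface form ("Robitussin", 25), different meaning.
source_a = SalesRecord("Robitussin", 25, Semantics("cases", "week", "whole brand"))
source_b = SalesRecord("Robitussin", 25, Semantics("bottles", "hour", "Robitussin DM only"))

print(directly_comparable(source_a, source_b))
```

Without the `semantics` field the two records are indistinguishable, which is the sense in which semantic differences are implicit.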

Integrating Data for Epidemic Detection
The BioSTORM Project:
  – Biological Spatio-TempORal Module
  – Within the DARPA-funded BioALIRT program for epidemic surveillance based on non-traditional, pre-diagnostic data
Purpose:
  – To federate diverse non-traditional data sources (e.g., ER visits, 911 calls, absenteeism reports, pharmacy sales)
  – To enable space/time analysis of data by various computational methods, for early epidemic detection

Integrating Data for Epidemic Detection
[Figure: BioSTORM architecture. Labels: Data Sources; Heterogeneous Input Data; Data Broker; Data Sources Ontology; Semantically Uniform Data; Data Mapper; Mapping Ontology; Customized Output Data; Epidemic Detection Problem Solvers; Control Structure; Data Regularization Middleware]

Veterans Affairs Data
Several relational tables
Large space of data values
Semantics known to database creators

911 Emergency Call Data
One table in a relational database
Constrained space of data values
Arbitrary and unclear semantics

Data Integration Approaches
Integration of explicit local models of each source
  – Database schema matching and query distribution
  – Ontology merging, alignment & integration
Description of data sources using a single global model of entire domain of knowledge
  – SIMS (ISI): tie multiple DBs with rich semantics & construct complex queries
  – TAMBIS (U. Man., UK): represent, access & query multiple molecular biology DBs
  – caBIO (NCI): model cancer biology & provide methods to query remote DBs transparently

A Template Data Source Ontology
[Figure labels: Distributed Data Sources; Heterogeneous Data Input; Data Broker; Data Source Ontology; Semantically Uniform Data Objects]

A Template Data Sources Ontology
A template ontology for contextualizing diverse data sources
  – Hybrid of local and global approaches
  – Extensible & customizable framework for describing data and their context so that they can be compared and operated on homogeneously
Rationale
  – Require minimal ontological commitment of data sources
  – Preserve richness of data sources & flexibility in data use
  – Introduce no bias to data integration (left to analytical methods)
  – Ensure semantic uniformity of heterogeneous data

Template-based Approach
Data Source Template:
  – Location
  – Area of influence
  – Data groups recorded
Data Group Template:
  – Bundle of related data
  – Valid time
  – Spatial location
Datum Template:
  – Contents (format)
  – Specification (vocabulary-based)

SF 911 Data Source Ontology
SF 911 Dispatch Center
  – Located at Hunter’s Point
  – Receives Data from Greater SF
  – Receives “911 Call” Data
911 Call
  – Contains: “Call Urgency,” “Call Type,” “Call Disposition,” etc.
  – Valid on a specified date
  – Call occurred at a specified location
Call Type
  – Contents: string
  – Specification: Semantics of the string
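
The three template levels (data source, data group, datum) and the SF 911 example can be sketched as Python dataclasses. This is only an illustration of the structure, not the BioSTORM ontology itself: the class and field names are hypothetical, and the `contents` and `specification` values given for “Call Urgency” and “Call Disposition” are assumptions, since the slide states them only for “Call Type”.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Datum:
    name: str
    contents: str        # format of the value
    specification: str   # vocabulary-based description of its meaning


@dataclass
class DataGroup:
    name: str
    data: List[Datum]    # bundle of related data
    valid_time: str
    spatial_location: str


@dataclass
class DataSource:
    name: str
    location: str
    area_of_influence: str
    data_groups: List[DataGroup]


call_911 = DataGroup(
    name="911 Call",
    data=[
        Datum("Call Urgency", "string", "assumed: semantics of the string"),
        Datum("Call Type", "string", "Semantics of the string"),
        Datum("Call Disposition", "string", "assumed: semantics of the string"),
    ],
    valid_time="a specified date",
    spatial_location="a specified location",
)

sf_911 = DataSource(
    name="SF 911 Dispatch Center",
    location="Hunter's Point",
    area_of_influence="Greater SF",
    data_groups=[call_911],
)


def datum_names(source: DataSource) -> List[str]:
    """List every datum a source records, whatever the source is."""
    return [d.name for group in source.data_groups for d in group.data]


print(datum_names(sf_911))
```

A function such as `datum_names` works unchanged for any source described with the same templates, which is the kind of uniform access the template approach aims for.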

The Template Data Sources Ontology
[Figure callout: Classes of Data Sources]

An Instance of Data Source
[Figure callout: Associated set of Measurements (“data groups”)]

An Instance of a Data Group
[Figure callout: Associated LOINC-based vocabulary and specification of properties]

Providing Uniform Context to Data
Semantics
  – Common language for describing and comparing surveillance data sources, for which no standards currently exist
  – Extensible framework for incorporation of new data sources
Metadata
  – Shared repository for enumerating available data sources in machine-processable form
  – Explicit and extensible vocabulary consistent with LOINC standard for describing attributes of data and sources
Data
  – Storage as instances of the ontology, OR
  – Definition of how data can be accessed from data sources

3. Reconciling Diverse Ontologies
Many ontologies in biomedicine are federated models that fully or partially resemble standardization efforts
But:
  – It is hard to agree on reference ontologies
  – We cannot expect people to adopt them (in the course of defining the standard, and even after)
  – Various reference and proprietary models need to interact in component-based architectures
So, tools are needed to align different models and translate data represented in a given model to and from another model

Operating on Data in Multiple Ways
[Figure labels: Data Source Ontology; Semantically Uniform Data Objects; Mapping Ontology; Mapping Interpreter; Input–Output Ontology; Customized Data Objects; Epidemic Detection Problem Solver]

Conceptual and Syntactic Mismatch
[Figure callouts: Notion of a “Data Group”; Notion of an “Individual Event”]

Conceptual and Syntactic Mapping
Notion of a “Data Group”
  – filter out invalid events
  – extract & reformat source, date, location
  – abstract illness category
  – drop uid
Notion of an “Individual Event”
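
The four mapping steps listed on this slide can be sketched as one Python function. The slide does not give the actual field names, date formats, or illness categories, so everything below (the `valid`, `call_type`, and other keys, the `MM/DD/YYYY` input format, and the `ILLNESS_CATEGORY` table) is a hypothetical stand-in.

```python
from datetime import datetime
from typing import Optional

# Hypothetical abstraction table from call types to illness categories.
ILLNESS_CATEGORY = {
    "breathing problems": "respiratory",
    "abdominal pain": "gastrointestinal",
}


def to_individual_event(group: dict) -> Optional[dict]:
    """Map one "Data Group" record to an "Individual Event" record."""
    # 1. Filter out invalid events.
    if not group.get("valid", False):
        return None
    # 2. Extract and reformat source, date, and location.
    date = datetime.strptime(group["date"], "%m/%d/%Y").date().isoformat()
    location = group["location"].strip()
    # 3. Abstract the illness category from the more specific call type.
    category = ILLNESS_CATEGORY.get(group["call_type"].lower(), "other")
    # 4. Drop the uid: it is simply not copied to the target record.
    return {
        "source": group["source"],
        "date": date,
        "location": location,
        "illness_category": category,
    }


group = {
    "uid": "a1",
    "valid": True,
    "source": "SF 911",
    "date": "11/04/2003",
    "location": " Hunter's Point ",
    "call_type": "Breathing Problems",
}
event = to_individual_event(group)
print(event)
```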

Ontology Mapping for Data Exchange
Conceptual alignment
  – change in domain of discourse
  – difference in the level of knowledge granularity
  – split and join of concepts & attributes
Value transformation
  – abstraction, reduction
  – aggregation or dispatch
  – format change (unit change)
  – custom computation (functional transformation)
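
Each kind of value transformation named on this slide can be illustrated with a one-function Python example. The specific functions, thresholds, and field layouts are hypothetical; the slide names only the categories.

```python
from typing import Dict


def abstract_age(age: int) -> str:
    """Abstraction / reduction: replace a precise value by a coarser one.
    The age thresholds are arbitrary and for illustration only."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


def aggregate_location(street: str, city: str) -> str:
    """Aggregation: join several source values into one target value."""
    return f"{street}, {city}"


def dispatch_location(location: str) -> Dict[str, str]:
    """Dispatch: split one source value into several target values."""
    street, city = location.split(", ", 1)
    return {"street": street, "city": city}


def fahrenheit_to_celsius(degrees_f: float) -> float:
    """Format change (unit change)."""
    return (degrees_f - 32) * 5 / 9


def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Custom computation: an arbitrary function of source values."""
    return weight_kg / height_m ** 2
```

Aggregation and dispatch are inverses here, so `dispatch_location(aggregate_location(street, city))` returns the original pair as long as the street contains no `", "`.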

Explicit Mapping Relations
Isolate connections between ontologies
  – Each component ontology remains unchanged
  – Mapping relations express concept-level and attribute-level correspondences
  – Components focus and operate on their own view, format of knowledge & data
Define mediation of data between ontology-based components
  – Mapping relations include the specification of rules of transformation of values
  – Components do not have to handle knowledge transformation internally

An Ontology of Mapping
[Figure labels: Class S with slot s1, slot s2, slot s3, slot s4 (S’); Class T with slot tA, slot tB, slot tE (T’); Instance mapping; Slot mapping; Recursive slot mapping]
  – renaming: value(tA) ← value(s1)
  – constant: value(tD) ← constant
  – lexical: value(tB) ← “* s2 * / 20* s3 *”
  – functional: value(tE) ← function()
  – recursive: value(tC) ← instance (auxiliary instance-mapping)
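
The five kinds of slot mapping on this slide can be sketched as small Python classes, each computing one target slot from a source instance. This is an illustrative reading of the slide, not the actual mapping ontology: instances are plain dicts, all class names are hypothetical, and the `*slot*` placeholder syntax used for lexical mappings is an assumption that simplifies the expression shown on the slide.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Renaming:
    target: str
    source: str

    def apply(self, src: Dict[str, Any]) -> Any:
        return src[self.source]  # copy the source value as-is


@dataclass
class Constant:
    target: str
    value: Any

    def apply(self, src: Dict[str, Any]) -> Any:
        return self.value  # ignore the source instance


@dataclass
class Lexical:
    target: str
    template: str  # "*slot*" is replaced by the value of that source slot

    def apply(self, src: Dict[str, Any]) -> str:
        return re.sub(r"\*(\w+)\*", lambda m: str(src[m.group(1)]), self.template)


@dataclass
class Functional:
    target: str
    function: Callable[[Dict[str, Any]], Any]

    def apply(self, src: Dict[str, Any]) -> Any:
        return self.function(src)


@dataclass
class Recursive:
    target: str
    source: str                 # slot holding a nested source instance
    mapping: "InstanceMapping"  # auxiliary instance mapping

    def apply(self, src: Dict[str, Any]) -> Dict[str, Any]:
        return self.mapping.apply(src[self.source])


@dataclass
class InstanceMapping:
    source_class: str
    target_class: str
    slot_mappings: List[Any]

    def apply(self, src: Dict[str, Any]) -> Dict[str, Any]:
        target = {"class": self.target_class}
        for slot_mapping in self.slot_mappings:
            target[slot_mapping.target] = slot_mapping.apply(src)
        return target


aux = InstanceMapping("S'", "T'", [Renaming("y", "x")])
s_to_t = InstanceMapping("S", "T", [
    Renaming("tA", "s1"),
    Lexical("tB", "*s2* / *s3*"),
    Recursive("tC", "s4", aux),
    Constant("tD", "fixed"),
    Functional("tE", lambda src: src["s2"] + src["s3"]),
])

mapped = s_to_t.apply({"s1": "record-1", "s2": 3, "s3": 4, "s4": {"x": 1.5}})
print(mapped)
```

Because the mappings are data rather than code inside either component, the source and target classes stay unchanged and the correspondence between them can be inspected or edited on its own.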

Mapping Data Groups to Individual Events
[Figure callouts: instance mapping; constant slot mapping]

Mapping Data Groups to Individual Events
[Figure callouts: recursive slot mapping; on-demand instance mapping]

Mapping Interpreter
[Figure labels: Mapping Interpreter; Source ontology & instance data; Mapping ontology & mapping instances; Target ontology; Instance data]
  – Processes the mapping relations between one or more source ontologies and a target ontology
  – Produces a set of instances of the target ontology from the existing instances of the source ontology
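
The interpreter's role, taking mapping relations plus source instances and producing target instances, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual interpreter: instances are dicts with a `class` key, slot rules are plain callables, and the names `MappingRelation`, `interpret`, and the example slots are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Instance = Dict[str, Any]


@dataclass
class MappingRelation:
    source_class: str
    target_class: str
    slot_rules: Dict[str, Callable[[Instance], Any]]  # target slot -> rule
    condition: Optional[Callable[[Instance], bool]] = None  # optional filter


def interpret(mappings: List[MappingRelation],
              source_instances: List[Instance]) -> List[Instance]:
    """Produce target instances from source instances using the mappings."""
    targets: List[Instance] = []
    for inst in source_instances:
        for mapping in mappings:
            if inst["class"] != mapping.source_class:
                continue
            if mapping.condition is not None and not mapping.condition(inst):
                continue
            target: Instance = {"class": mapping.target_class}
            for slot, rule in mapping.slot_rules.items():
                target[slot] = rule(inst)
            targets.append(target)
    return targets


group_to_event = MappingRelation(
    source_class="Data Group",
    target_class="Individual Event",
    slot_rules={
        "source": lambda inst: inst["source"],
        "when": lambda inst: inst["date"],
    },
    condition=lambda inst: inst["valid"],
)

source_instances = [
    {"class": "Data Group", "source": "SF 911", "date": "2003-11-04", "valid": True},
    {"class": "Data Group", "source": "SF 911", "date": "2003-11-05", "valid": False},
    {"class": "Data Source", "name": "SF 911 Dispatch Center"},
]
events = interpret([group_to_event], source_instances)
print(events)
```

Source instances are never modified; instances with no applicable mapping, or that fail a mapping's condition, simply produce no target instance.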

Results of Mapping Interpretation
[Figure callouts: Source “Data Group” instance; Resulting target “Individual Event” instance]

Varying Problem Solvers
[Figure labels: Semantically Uniform Data Objects; Mapping Interpreter; Customized Data Objects; Problem Solvers]

Benefits of Ontology-based Data Integration
1. Modeling data with ontologies
  – Provides rich, machine-processable semantics to data
  – Facilitates knowledge communication and sharing
2. Integrating data with a template ontology
  – Enables software components to operate on data in a uniform way
  – Facilitates access to existing data sources for any new customer component
  – Eliminates the need for customer components to be reprogrammed when a new data source is added

Benefits of Ontology-based Data Integration
3. Integrating data models by ontology mapping
  – Isolates ontological connections and data-level transformations for instance migration
  – Enables flexible, interconnected, component-based architectures
      Each component relies on its own ontology
      Components remain independent
      Component coupling is explicit and maintainable

Perspectives
Data integration will always be needed!
  – Before standards are agreed upon and used
  – When information systems need to integrate and analyze multiple data sources
  – When system components need to access or rely on different ontologies
Adaptations to be made for richer ways of modeling ontologies (DLs in particular)
Combination with other data-integration approaches: matching, merging, alignment

Acknowledgements
At Stanford Medical Informatics
  – Zachary Pincus
  – Samson Tu, Mor Peleg
  – Natasha Noy
  – Prof. Mark Musen
Funding agencies
  – National Library of Medicine
  – National Institute of Standards and Technology
  – National Cancer Institute
  – Defense Advanced Research Projects Agency

Stanford Medical Informatics
  – http://smi.stanford.edu
The Protégé project
  – http://protege.stanford.edu
Monica Crubézy
  – http://smi.stanford.edu/people/crubezy
  – crubezy@smi.stanford.edu
