Visual Data Management System

Transcription

Vishakha Gupta-Cledat, Luis Remis, Christina Strong, Ragaad Altarawneh, Scott Hahn
{vishakha.s.gupta, luis.remis, christina.r.strong, ragaad.altarawneh, scott.hahn}@intel.com
Intel Labs
Diagram labels: "Find me cats" (user request), images, data and metadata

What is VDMS?
A novel Visual Data Management System
  For storing, accessing and transforming visual data
  Primarily geared towards visual analytics pipelines and data science queries
  With a goal of efficiently achieving cloud scale while maintaining ease-of-use
Also aims to
  Exploit Intel’s heterogeneous memory and storage hierarchy
  Be general purpose, e.g. a common core for medical imaging, sports, retail

Visual Data: Scale and Applications
Images, videos, and feature vectors / descriptors
Billions of sources
Large in size (an individual object could range from KB to GB)
Increasingly being used for visual understanding in a range of machine learning applications

The Unsustainable Current Solutions
Resolve visual computing challenges and frameworks first
  Improving accuracy of algorithms on more and more complex data
  Storage has not become a bottleneck yet!
Application-specific solutions, if data does become a problem
  Organize media files
  Manually gather and normalize relevant metadata
  Build custom scripts to tie together many stages of complex processing
Visual data management for scale and reuse is still an open problem.

VDMS Storage Architecture
Exploding amount of visual data
  For any request, access only the required subset of data: exploit metadata
Even individual objects could be large
  Speed up access to this desired data
  Preprocess while reading where possible, e.g. crop or detect edges before transferring
High performance as well as ease-of-use
  Suitable design choices for metadata and data, at scale
  Intel hardware optimizations, e.g. 3D XPoint, media hardware, disk offload
  Simple API and client libraries

VDMS Implementation
Efficient metadata access via Persistent Memory Graph Database (PMGD) for visual data
  Optimized for metadata storage and access patterns
  Easy to evolve schema with new vision research
Efficient data access via Visual Compute Library
  Enable alternate image/video analysis-friendly storage formats as compared to viewer-friendly ones
  Process data while accessing it
Ease-of-use via Request Server
  Implement a unified and simple client API
  Route query (or parts) to the right components for a coherent user response
Diagram labels: User, VDMS, PMGD (Metadata Database), Visual Compute Library, Visual Data Storage

Where We Are Now
User API v1.0 defined with internal feedback
Functional one-node server and client libraries
Three interesting proofs of concept at various stages of development with input from product groups
  Real data and concrete use case: medical imaging application
  Large scale, real time, intensive use case: FreeD sports storage architecture
  Integration with a larger analytic framework: Retail shopper insights application

Medical Imaging Proof of Concept on VDMS
The Cancer Imaging Archive: http://www.cancerimagingarchive.net/
  60TB of medical images (volumetric data)
  Metadata information for 1000 patients (very sparse)
For our PoC:
  Metadata for 457 patients, including drug and radiation treatments
  Scans for 384 patients (60K images)
  Replicated metadata x10 and x100, keeping the original distribution
Segmentation pipeline for demo

Segmentation Pipeline
Diagram labels: PyClient (Segmentation Algorithm for Brain Tumors, VDMS Client Python Module), VDMS Server
Pull: the client sends a constructed JSON query (Query - Pull Data) and the server answers with the images (Return Data)
Push: the client sends a constructed JSON query plus an image blob (Query - Push Data) and the server reports success (Return Successful)
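
The pull, segment, push loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not show the client API, so `FakeVDMSServer`, the `FindImage` / `AddImage` command names, and the property keys are hypothetical stand-ins, and the "segmentation" is a toy threshold rather than the brain tumor algorithm used in the demo.

```python
import json


class FakeVDMSServer:
    """In-memory stand-in for a VDMS server (hypothetical, illustration only)."""

    def __init__(self):
        self.images = {}  # image name -> (properties, pixel rows)

    def query(self, request_json, blobs=()):
        """Take a JSON query string plus optional blobs; return (response JSON, blobs)."""
        blobs = list(blobs)
        responses, out_blobs = [], []
        for command in json.loads(request_json):
            if "AddImage" in command:  # assumed command name
                props = command["AddImage"]["properties"]
                self.images[props["name"]] = (props, blobs.pop(0))
                responses.append({"AddImage": {"status": "ok"}})
            elif "FindImage" in command:  # assumed command name
                wanted = command["FindImage"]["constraints"]
                hits = [pixels for props, pixels in self.images.values()
                        if all(props.get(k) == v for k, v in wanted.items())]
                out_blobs.extend(hits)
                responses.append({"FindImage": {"status": "ok", "returned": len(hits)}})
        return json.dumps(responses), out_blobs


def segment(pixels, threshold=128):
    """Toy stand-in for the segmentation algorithm: a binary threshold mask."""
    return [[1 if value >= threshold else 0 for value in row] for row in pixels]


def run_pipeline(server, patient_id):
    # Pull: construct a JSON query and fetch the patient's scan images.
    pull = json.dumps([{"FindImage": {"constraints": {"patient": patient_id, "kind": "scan"}}}])
    _, scans = server.query(pull)
    # Process: segment every returned image on the client side.
    masks = [segment(scan) for scan in scans]
    # Push: send each mask back as a blob next to its JSON query.
    statuses = []
    for index, mask in enumerate(masks):
        push = json.dumps([{"AddImage": {"properties": {
            "name": f"{patient_id}-mask-{index}", "patient": patient_id, "kind": "mask"}}}])
        response, _ = server.query(push, blobs=[mask])
        statuses.append(json.loads(response)[0]["AddImage"]["status"])
    return statuses


server = FakeVDMSServer()
server.query(
    json.dumps([{"AddImage": {"properties": {"name": "p1-slice-0", "patient": "p1", "kind": "scan"}}}]),
    blobs=[[[10, 200], [130, 90]]],
)
statuses = run_pipeline(server, "p1")
```

Keeping the query as JSON and the pixels as a separate blob mirrors the "Constructed JSON Query" and "Image Blob" labels in the diagram.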

Domain Specific Queries - Some Examples
Query 1: Retrieve a single image (200x200), searching by its unique name.
  Retrieve single image
Query 2: Retrieve a complete brain scan (155 images) from a particular patient.
  Retrieve 155 images
Query 3: Retrieve all brain scans corresponding to people over 75 who had chemotherapy using the drug “Temodar”.
  Retrieve 1600 images after 3 neighbor hops
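
To make "3 neighbor hops" concrete, here is a minimal sketch of Query 3 as a hop-by-hop traversal over a tiny in-memory property graph. The schema (Treatment, Patient, Scan and Image nodes joined by `received`, `has_scan` and `contains` edges) and all sample data are assumptions made for illustration; the slides do not show the schema used in the proof of concept, and this is not the PMGD API.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory property graph (illustrative; not PMGD)."""

    def __init__(self):
        self.nodes = {}                 # node id -> (tag, properties)
        self.edges = defaultdict(list)  # node id -> [(edge tag, neighbor id)]

    def add_node(self, node_id, tag, **props):
        self.nodes[node_id] = (tag, props)

    def add_edge(self, a, edge_tag, b):
        # Stored in both directions so a hop can follow the edge either way.
        self.edges[a].append((edge_tag, b))
        self.edges[b].append((edge_tag, a))

    def find(self, tag, predicate):
        """Starting set: all nodes with the given tag whose properties match."""
        return {n for n, (t, p) in self.nodes.items() if t == tag and predicate(p)}

    def hop(self, frontier, edge_tag, predicate=lambda props: True):
        """One neighbor hop: follow edge_tag from every node in the frontier."""
        reached = set()
        for node_id in frontier:
            for tag, neighbor in self.edges[node_id]:
                if tag == edge_tag and predicate(self.nodes[neighbor][1]):
                    reached.add(neighbor)
        return reached


g = PropertyGraph()
g.add_node("t1", "Treatment", drug="Temodar")
g.add_node("t2", "Treatment", drug="OtherDrug")
for pid, age, treatment in [("p1", 80, "t1"), ("p2", 60, "t1"), ("p3", 78, "t2")]:
    g.add_node(pid, "Patient", age=age)
    g.add_edge(pid, "received", treatment)
    g.add_node(f"scan-{pid}", "Scan")
    g.add_edge(pid, "has_scan", f"scan-{pid}")
    for i in range(2):
        g.add_node(f"img-{pid}-{i}", "Image", name=f"{pid}_slice_{i:03d}.png")
        g.add_edge(f"scan-{pid}", "contains", f"img-{pid}-{i}")

treatments = g.find("Treatment", lambda p: p["drug"] == "Temodar")
patients = g.hop(treatments, "received", lambda p: p["age"] > 75)  # hop 1
scans = g.hop(patients, "has_scan")                                # hop 2
images = g.hop(scans, "contains")                                  # hop 3
names = sorted(g.nodes[i][1]["name"] for i in images)
```

Each hop only touches the neighbors of the current frontier, which is why the cost follows the size of the matching subgraph rather than the size of whole tables.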

Comparison Baseline
No single solution to compare
Create “likely” combination of well-known options
  MemSQL for storing metadata
  Apache HTTP server for requesting images via HTTP
  OpenCV for performing preprocessing

Performance Improvements - Metadata
Query 3: Retrieve 1600 image names after 3 neighbor hops
VDMS performs up to one order of magnitude better compared to MemSQL
A Graph Database is a logical choice for visual metadata.

Visual Compute Library: E.g. Transformation Operations
Images in analytics-friendly TDB format (uses TileDB)
Resize to 256x256
Crop to one-sixth the size
Images stored in the TDB format provide faster access and processing, thus making it a great format for visual analytics pipelines, especially for large images.

Overall Improvements
Query 1: Retrieve single image
Query 2: Retrieve 155 images for a patient
Query 3: Retrieve 1600 images after 3 neighbor hops
VDMS performs significantly better when dealing with more complex queries, without incurring overhead in simpler tasks

Hermes Peak: A Framework for Ad-hoc Video Analytics
Framework for processing visual data from the edge to cloud with four focus areas within the Intel Science and Technology Center for Visual Cloud Systems
In-line Processing
  Video processing with real time turnaround
  Support arbitrary number of streams
  Programmable events
  Optimized resource utilization
Optimized Storage and Retrieval
  Optimized metadata DB
  Analysis friendly media formats
  Distributed for cloud scale
  Tiered storage for hot and cold data
Offline Processing
  Query and analytic on historic (stored) data
  Processing of large (cloud) scale video or image libraries
  Optimized resource utilization
Query Processing and Configuration
  Tools to configure pipeline and answer queries
  Visual query compiler
  Visual kernel repository

Bigger Picture: Visual Cloud Inferencing Flow
Diagram labels: Cameras, Data Acquisition, Local Analytics, Inline analytics
With our academic partners, Intel Labs is looking at the entire flow of visual data and processing from edge to cloud

Hermes Peak: A Framework for Ad-hoc Video Analytics
In-line Processing (E.g. …)
  Video processing with real time turnaround
  Support arbitrary number of streams
  Programmable events
  Optimized resource utilization
Optimized Storage and Retrieval (E.g. VDMS)
  Optimized metadata DB
  Analysis friendly media formats
  Distributed for cloud scale
  Tiered storage for hot and cold data
Offline Processing (E.g. …r))
  Query and analytic on historic (stored) data
  Processing of large (cloud) scale video or image libraries
  Optimized resource utilization
Query Processing and Configuration (TBD)
  Tools to configure pipeline and answer queries
  Visual query compiler
  Visual kernel repository

Conclusions and Future Work
Room and need for novel storage methods in vision pipelines
Graph database, made efficient with new technology, a good option for metadata
Analysis friendly data storage a worthwhile research direction
Address feature vector and video storage and search
Scale out to sustain large amount of data and high rates
  Also integrate with pub/sub model (Kafka) and evaluate
Next version of the API and open source code
Hermes Peak integration to complete a visual pipeline

Backup

Extracting Value from Visual Data – Machine Learning

Scale - Ubiquitous Cameras, New Applications

Despite Computing Challenges, Data Access Can’t be Ignored
E.g. Image Classification using Deep Learning
As processing capabilities and algorithms improve, amount of data increases, and data reuse becomes a possibility, data access goes from an afterthought to a real challenge

Exploit Rich Visual Metadata
Media data easily leads to rich metadata computed in advance or on the fly
Metadata much smaller and can be used to zoom in on only the desired raw data
Search photos by faces, scenes, objects, and actions/events
Source: Yurong Chen, Intel Labs China

Representing Media Metadata
Example graph nodes:
  Person: Name: Jane Doe, DOB: 4/15/1974
  Person: Name: John Doe, DOB: 11/1/1975
  Person: Name: Alice Doe, DOB: 8/15/2000
  Photo: Name: Hawaii1.jpg, Date: 4/15/14, Size: 2MB
  Photo: Name: Hawaii2.jpg, Date: 4/16/14, Size: 2.5MB
  Location: Name: Maui, Type: Island, State: Hawaii, Population: 20000
Example graph edges: Contains, LocatedAt
While this metadata schema will be application-specific, it looks like a property graph:
  Nodes connected with Edges
  Properties on nodes/edges (optional)
  Group by tags
Support evolving schema
Variety of indexes
Example query: Find all photos of Alice from Hawaii
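
As a rough sketch of how the example query could be answered, the snippet below stores the slide's nodes as tagged property bags, keeps a simple property index, and intersects the `Contains` and `LocatedAt` edges. Which photo contains which person is not recoverable from the slide, so the edge assignments here are assumptions, and the helper names (`add_node`, `add_edge`, `photos_of`) are hypothetical rather than part of PMGD.

```python
from collections import defaultdict

nodes = {}                     # node id -> {"tag": ..., "props": {...}}
out_edges = defaultdict(list)  # source id -> [(edge tag, destination id)]
index = defaultdict(set)       # (tag, property, value) -> node ids


def add_node(node_id, tag, **props):
    nodes[node_id] = {"tag": tag, "props": props}
    # Index every property so lookups by value avoid a full scan.
    for key, value in props.items():
        index[(tag, key, value)].add(node_id)


def add_edge(src, edge_tag, dst):
    out_edges[src].append((edge_tag, dst))


add_node("jane", "Person", Name="Jane Doe", DOB="4/15/1974")
add_node("john", "Person", Name="John Doe", DOB="11/1/1975")
add_node("alice", "Person", Name="Alice Doe", DOB="8/15/2000")
add_node("photo1", "Photo", Name="Hawaii1.jpg", Date="4/15/14", Size="2MB")
add_node("photo2", "Photo", Name="Hawaii2.jpg", Date="4/16/14", Size="2.5MB")
add_node("maui", "Location", Name="Maui", Type="Island", State="Hawaii")

# Assumed edges: the slide names the edge tags but not their endpoints.
add_edge("photo1", "Contains", "jane")
add_edge("photo1", "Contains", "john")
add_edge("photo2", "Contains", "alice")
add_edge("photo1", "LocatedAt", "maui")
add_edge("photo2", "LocatedAt", "maui")


def photos_of(person_name, state):
    """Names of photos that contain the person and are located in the state."""
    people = index[("Person", "Name", person_name)]
    places = index[("Location", "State", state)]
    hits = []
    for node_id, node in nodes.items():
        if node["tag"] != "Photo":
            continue
        targets = out_edges[node_id]
        has_person = any(tag == "Contains" and dst in people for tag, dst in targets)
        in_place = any(tag == "LocatedAt" and dst in places for tag, dst in targets)
        if has_person and in_place:
            hits.append(node["props"]["Name"])
    return sorted(hits)


alice_in_hawaii = photos_of("Alice Doe", "Hawaii")
```

Adding a new node tag or property needs no schema migration here, which is the sense in which a property graph supports an evolving schema.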

Persistent Memory Graph Database (PMGD)
Traditional property graph databases plagued by disk latencies
New non-volatile memory technology (e.g. 3D XPoint) with performance close to DRAM
Opportunity to avoid a lot of legacy software
PMGD: Graph database implementation targeting persistent memory

PMGD Comparison to Neo4j
Queries taken from the LDBC social network benchmark
Bars show speedup over Neo4j
The more graph traversals, the better PMGD does

Speeding up Access to Desired Data
More and more machine consumption of data for processing
  Think beyond standard formats for visual data
  Create formats better suited for processing
Visual Compute Library (VCL)
  Explore alternate formats for images, videos and feature vectors
  Implement suitable processing on traditional and new formats

VCL::Image
Implement alternate image storage formats to use when beneficial
  TDB format, based on TileDB [1]
Higher level interaction with images in traditional or TDB format
  Perform processing such as crop, resize, threshold, ROI access, as data is read
[1] Stavros Papadopoulos, Kushal Datta, et al. 2016. The TileDB array data storage manager. VLDB 2016
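
One way to see why a tiled array format helps with crop and ROI access is the sketch below: the image is split into fixed-size tiles, and a crop loads only the tiles that overlap the requested region, applying an optional threshold as the data is read. `TiledImage` is a hypothetical pure-Python stand-in written for this illustration; it is neither the VCL nor the TileDB API, and real tile sizes and layouts will differ.

```python
class TiledImage:
    """Image stored as square tiles so a crop reads only overlapping tiles."""

    def __init__(self, pixels, tile):
        self.tile = tile
        self.tiles = {}      # (tile row, tile column) -> list of pixel rows
        self.tiles_read = 0  # counts tile loads, to show how little is touched
        for top in range(0, len(pixels), tile):
            for left in range(0, len(pixels[0]), tile):
                self.tiles[(top // tile, left // tile)] = [
                    row[left:left + tile] for row in pixels[top:top + tile]
                ]

    def _load(self, key):
        self.tiles_read += 1
        return self.tiles[key]

    def crop(self, top, left, height, width, threshold=None):
        """Return the requested region; values below threshold become 0."""
        out = [[0] * width for _ in range(height)]
        t = self.tile
        for tr in range(top // t, (top + height - 1) // t + 1):
            for tc in range(left // t, (left + width - 1) // t + 1):
                block = self._load((tr, tc))
                for r, row in enumerate(block):
                    y = tr * t + r - top  # row position inside the output
                    if not 0 <= y < height:
                        continue
                    for c, value in enumerate(row):
                        x = tc * t + c - left  # column position inside the output
                        if 0 <= x < width:
                            if threshold is not None and value < threshold:
                                value = 0
                            out[y][x] = value
        return out


# An 8x8 image stored as four 4x4 tiles; pixel value encodes its position.
image = TiledImage([[r * 8 + c for c in range(8)] for r in range(8)], tile=4)
patch = image.crop(top=4, left=4, height=2, width=3)
tiles_for_patch = image.tiles_read
masked = image.crop(top=0, left=0, height=2, width=2, threshold=8)
```

Here the 2x3 crop falls inside a single tile, so only one of the four tiles is loaded; a row-major file would instead need a seek per image row or a full decode.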

TDB Performance
Charts: Write Performance, Read Performance

Request Server
Unified and simple client API
Route query to the right component for a coherent user response
Diagram labels: Client API, Parse Request, Data, Metadata, Function Call, PMGD, VCL
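
The routing idea can be sketched as follows: parse one client request, send its metadata part to a metadata store and its data part to an image store, then merge both into a single response. Everything here is an assumption for illustration: `MetadataStore` and `ImageStore` merely stand in for PMGD and VCL, and the request fields (`constraints`, `operations`) are hypothetical rather than the actual VDMS API.

```python
import json


class MetadataStore:
    """Stand-in for the metadata component (PMGD in the slides)."""

    def __init__(self, records):
        self.records = records

    def find(self, constraints):
        return [r for r in self.records
                if all(r.get(k) == v for k, v in constraints.items())]


class ImageStore:
    """Stand-in for the data component (VCL in the slides)."""

    def __init__(self, images):
        self.images = images  # path -> list of pixel rows

    def read(self, path, operations):
        pixels = self.images[path]
        for op in operations:  # apply each operation as the data is read
            if op["type"] == "crop":
                rows = pixels[op["y"]:op["y"] + op["height"]]
                pixels = [row[op["x"]:op["x"] + op["width"]] for row in rows]
            elif op["type"] == "threshold":
                pixels = [[v if v >= op["value"] else 0 for v in row] for row in pixels]
            else:
                raise ValueError(f"unsupported operation: {op['type']}")
        return pixels


def handle_request(request_json, metadata, images):
    """Parse the request, route each part, and build one coherent response."""
    request = json.loads(request_json)
    # Metadata part: which entities match the constraints?
    matches = metadata.find(request.get("constraints", {}))
    # Data part: read (and preprocess) only the matching images.
    operations = request.get("operations", [])
    blobs = [images.read(m["path"], operations) for m in matches]
    response = {"status": "ok", "entities": [m["name"] for m in matches]}
    return json.dumps(response), blobs


metadata = MetadataStore([
    {"name": "scan-1", "patient": "p1", "path": "a"},
    {"name": "scan-2", "patient": "p2", "path": "b"},
])
images = ImageStore({"a": [[1, 2, 3], [4, 5, 6]], "b": [[7, 8, 9], [1, 1, 1]]})
request = json.dumps({
    "constraints": {"patient": "p1"},
    "operations": [{"type": "crop", "x": 1, "y": 0, "width": 2, "height": 1}],
})
response, blobs = handle_request(request, metadata, images)
```

Because the metadata lookup runs first, the image store is asked only for the paths that matched, which reflects the "access only the required subset of data" goal stated earlier in the deck.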

BraTS Challenge - Driving Application

VDMS Alternatives
No one solution to do it all
Intel automotive path
  HDFS for storing data
  HBase for organizing metadata
  Another layer to make querying using relationships easier
Initial CMU solution
  PostgreSQL database for metadata
  Write their own frame server and use OpenCV
  Still looking for an API
Facebook’s TAO and Haystack, Amazon’s Neptune and S3
  Large scale but still not optimized for visual data management
