Transcription
Amundsen: A Data Discovery Platform from Lyft
April 17th 2019
Jin Hyuk Chang @jinhyukchang, Engineer, Lyft
Tao Feng @feng-tao, Engineer, Lyft
Agenda: Data at Lyft; Challenges with Data Discovery; Data Discovery at Lyft; Demo; Architecture; Summary
Data platform users: Data Modelers, Analysts, Data Experimenters, Data Platform
Core Infra high-level architecture: Custom apps
Data Discovery
Hi! I am a n00b Data Scientist! My first project is to analyze and predict Data Council attendance. Where is the data? What does it mean?
Status quo: Option 1: Phone a friend! Option 2: GitHub search
Understand the context
What does this field mean?
‒ Does attendance data include employees?
‒ Does it include revenue?
Let me dig in and understand
Explore: SELECT * FROM default.my_table WHERE ds = '2018-01-01' LIMIT 100;
Exploring with SELECT * is EVIL: 1. Lack of productivity for data scientists 2. Increased load on the databases
Data Scientists spend up to one third of their time in Data Discovery.
Data discovery
‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
Audience for data discovery
Data Discovery - User personas: Data Modelers, Analysts, Data Experimenters, Data Platform
3 Data Scientist personas
Power user: All info in their head; get interrupted a lot due to questions
Noob user: Lost; ask “power users” a lot of questions
Manager: Dependencies landing on time; communicating with stakeholders
Data Discovery answers 3 kinds of questions
Search based:
‒ Where is the table/dashboard for X? What does it contain?
‒ Does this analysis already exist?
Lineage based:
‒ I am changing a data model, who are the owners and most common users?
‒ This table’s delivery was delayed today, I want to notify everyone downstream.
Network based:
‒ I want to follow a power user in my team.
‒ I want to bookmark tables of interest and get a feed of data delay, schema change, incidents.
Meet Amundsen: first person to discover the South Pole, Norwegian explorer Roald Amundsen
Landing page optimized for search
Search results ranked on relevance and query activity
How does search work?
Relevance - search for “apple” on Google: Low relevance vs. High relevance
Popularity - search for “apple” on Google: Low popularity vs. High popularity
Striking the balance
Relevance: Names, Descriptions, Tags, [owners, frequent users]
Popularity: Querying activity, Dashboarding
‒ Different weights for automated vs ad hoc querying
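A minimal sketch of how relevance and popularity could be blended into one ranking score. The weights, the `TableStats` fields, and the linear blend are all assumptions made for illustration; the slides do not state the actual formula used at Lyft.

```python
from dataclasses import dataclass

# Hypothetical weights: an ad hoc (human) query counts more than an automated one.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


@dataclass
class TableStats:
    name: str
    relevance: float         # text-match score in [0, 1], e.g. from the search backend
    adhoc_queries: int       # queries typed by people
    automated_queries: int   # queries issued by scheduled jobs


def popularity(table: TableStats) -> float:
    """Weighted query activity, discounting automated queries."""
    return (ADHOC_WEIGHT * table.adhoc_queries
            + AUTOMATED_WEIGHT * table.automated_queries)


def rank(tables, popularity_weight=0.5):
    """Sort tables by a linear blend of relevance and normalized popularity."""
    max_pop = max((popularity(t) for t in tables), default=0.0) or 1.0

    def score(t: TableStats) -> float:
        return ((1 - popularity_weight) * t.relevance
                + popularity_weight * popularity(t) / max_pop)

    return sorted(tables, key=score, reverse=True)
```

With `popularity_weight=0` the ordering falls back to pure relevance; raising it lets heavily queried tables outrank slightly better text matches.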
Back to mocks.
Search results ranked on relevance and query activity
Detailed description and metadata about data resources
Data Preview within the tool
Computed stats about column metadata. Disclaimer: these stats are arbitrary.
Built-in user feedback
Demo
Open source in mind
‒ Pluggable code for each microservice via Python entry point, etc.
‒ Pluggable API endpoint via Blueprint
‒ Build your ingestion pipeline like a Lego brick
Amundsen’s architecture
Architecture diagram components: Other Microservices, ML Feature Service, Frontend Service, Security Service, Metadata Service, Search Service, Elasticsearch, Neo4j, Databuilder Crawler, Metadata
1. Frontend Service
Amundsen table detail page
2. Metadata Service
2. Metadata Service
A thin proxy layer to interact with the graph database
‒ Currently Neo4j is the default option for the graph backend engine
‒ Working with the community to support Apache Atlas
Supports a REST API for other services pushing / pulling metadata directly
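The "thin proxy layer" idea can be sketched as an abstract interface with swappable backends. Everything below (`BaseProxy`, `InMemoryProxy`, `MetadataService`, and the method names) is hypothetical and is not Amundsen's actual API; an in-memory dictionary stands in for Neo4j or Apache Atlas so the example runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class BaseProxy(ABC):
    """Hypothetical backend-agnostic interface the service codes against."""

    @abstractmethod
    def get_table_description(self, table_key: str) -> Optional[str]:
        ...

    @abstractmethod
    def put_table_description(self, table_key: str, description: str) -> None:
        ...


class InMemoryProxy(BaseProxy):
    """Stand-in backend; a graph-database proxy would implement the same methods."""

    def __init__(self) -> None:
        self._descriptions: Dict[str, str] = {}

    def get_table_description(self, table_key: str) -> Optional[str]:
        return self._descriptions.get(table_key)

    def put_table_description(self, table_key: str, description: str) -> None:
        self._descriptions[table_key] = description


class MetadataService:
    """The service only sees BaseProxy, so the backend can be swapped freely."""

    def __init__(self, proxy: BaseProxy) -> None:
        self._proxy = proxy

    def describe(self, table_key: str) -> str:
        return self._proxy.get_table_description(table_key) or "(no description)"

    def update_description(self, table_key: str, description: str) -> None:
        self._proxy.put_table_description(table_key, description)
```

Keeping the service unaware of the concrete backend is what would allow a second graph engine to be added without touching the callers.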
Trade-off #1: Why choose a graph database?
Why Graph database?
Trade-off #2: Why not propagate the metadata back to source?
Why not propagate the metadata back to source?
3. Search Service
3. Search Service
A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the search backend.
Support different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
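A rough sketch of how the three search patterns might map onto Elasticsearch query bodies. The `multi_match`, `match`, and `wildcard` query types are standard Elasticsearch Query DSL, but the field names (`name`, `description`, `tags`), the boost value, and the `category: term` parsing rule are assumptions for illustration, not the Search Service's real implementation.

```python
def build_query(text: str) -> dict:
    """Pick a query shape from the user's input (hypothetical routing rules)."""
    text = text.strip()

    # Category search: "<field>: <term>" restricts the match to one field.
    if ":" in text:
        category, term = (part.strip() for part in text.split(":", 1))
        return {"query": {"match": {category: term}}}

    # Wildcard search: any "*" switches to a wildcard query on the table name.
    if "*" in text:
        return {"query": {"wildcard": {"name": {"value": text}}}}

    # Normal search: relevance match across several fields, boosting the name.
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["name^3", "description", "tags"],
            }
        }
    }
```

The returned dictionaries are only request bodies; sending them to a cluster is left out so the sketch stays free of client dependencies.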
Challenge #1: How to make the search result more relevant?
How to make the search result more relevant?
Define a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
Search behaviour instrumentation is key
Couple of improvements:
‒ Boost the exact table ranking
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
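One way to compute the search quality metric from instrumented search logs. The definition used here (the fraction of searches whose click landed on one of the top 5 results) and the log format are assumptions; the slides name the metric but not its exact formula.

```python
from typing import Iterable, Optional


def ctr_at_k(clicked_ranks: Iterable[Optional[int]], k: int = 5) -> float:
    """Fraction of searches with a click inside the top-k results.

    Each element is the 1-based rank of the clicked result for one search,
    or None when the user clicked nothing.
    """
    ranks = list(clicked_ranks)
    if not ranks:
        return 0.0
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)
```

For example, a log of four searches with clicks at ranks 1 and 3, one click at rank 7, and one search without a click gives `ctr_at_k([1, None, 7, 3]) == 0.5`.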
4. Data Builder
Architecture diagram components: Other Microservices, ML Feature Service, Frontend Service, Other Services, Metadata Service, Search Service, Elasticsearch, Neo4j, Databuilder Crawler, Metadata
Challenge #1: Various forms of metadata
Metadata Sources @ Lyft
Metadata - Challenges
No Standardization: No single data model that fits for all data resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
Different Extraction: Each data set metadata is stored and fetched differently
‒ Hive Table: Stored in Hive metastore
‒ RDBMS (Postgres etc.): Fetched through DBAPI interface
‒ GitHub source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
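To make the "different extraction" point concrete, here is a sketch of one extractor behind a common interface, using the DBAPI route mentioned for RDBMS sources. The `Extractor` base class and the record shape are hypothetical (not Databuilder's actual classes), and SQLite is used only because its driver ships with Python; a Hive metastore or Mode API extractor would implement the same `extract()` method differently.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class Extractor(ABC):
    """Hypothetical common interface: every source yields uniform records."""

    @abstractmethod
    def extract(self) -> Iterator[Dict[str, str]]:
        ...


class SqliteColumnExtractor(Extractor):
    """Reads table and column metadata through a DBAPI connection."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def extract(self) -> Iterator[Dict[str, str]]:
        # conn.execute is a sqlite3 convenience shortcut for cursor().execute().
        tables = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        for (table,) in tables:
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            for row in self.conn.execute(f'PRAGMA table_info("{table}")'):
                yield {"table": table, "column": row[1], "type": row[2]}
```

Because every extractor yields the same record shape, the downstream loading step does not need to know which source the metadata came from.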
Challenge #2: Pull model vs Push model
Pull model vs. Push model
Pull Model: Periodically update the index by pulling from the system (e.g. database) via crawlers. Diagram: Scheduler, Crawler, Database, Data graph.
Push Model: The system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Diagram: Database, Message queue, Data graph.
Pull model vs. push model
Pull Model:
‒ Onus of integration lies on data graph
‒ No interface to prescribe, hard to maintain crawlers
‒ Diagram: Scheduler, Crawler, Database, Data graph
‒ Preferred if: waiting for indexing is ok; working with “strapped” teams; there’s already an interface
Push Model:
‒ Onus of integration lies on database
‒ Message format serves as the interface
‒ Allows for near-real time indexing
‒ Diagram: Database, Message queue, Data graph
‒ Preferred if: near-real time indexing is important; clean interface doesn’t exist
Other tools like WhereHows are moving towards Push Model
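The two models can be contrasted in a few lines. This is a toy sketch: the `Database` class, the message format, and the dictionary used as the "data graph" are all made up for illustration, and `queue.Queue` stands in for a real message bus.

```python
import queue
from typing import Dict, List, Optional


class Database:
    """Toy source system; publishes to a bus only if one is attached."""

    def __init__(self, bus: Optional[queue.Queue] = None) -> None:
        self.tables: Dict[str, List[str]] = {}
        self.bus = bus

    def create_table(self, name: str, columns: List[str]) -> None:
        self.tables[name] = list(columns)
        if self.bus is not None:
            # Push model: the source emits a message; its format is the interface.
            self.bus.put({"table": name, "columns": list(columns)})


def crawl(database: Database, graph: Dict[str, List[str]]) -> None:
    """Pull model: a scheduler would call this periodically to re-read the source."""
    for name, columns in database.tables.items():
        graph[name] = list(columns)


def consume(bus: queue.Queue, graph: Dict[str, List[str]]) -> None:
    """Push model: the data graph side drains whatever messages have arrived."""
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            break
        graph[message["table"]] = message["columns"]
```

In the pull path the crawler has to know how to read the source, while in the push path the source has to know how to publish, which mirrors where the onus of integration falls in each model.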
4. Databuilder
Databuilder in action
How are we building data? Databuilder
How is Databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs
What’s next?
Amundsen seems to be more useful than what we thought
Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
Many organizations have similar problems
‒ Collaborating with ING, WeWork and more
‒ We plan to announce open source soon
Impact - Amundsen at Lyft (chart milestones): Alpha release, Beta release (internal), Generally Available (GA) release
Summary
Adding more kinds of data resources
Phase 1 (Complete): Data sets
Phase 2 (In development): Dashboards, People
Phase 3 (In scoping): Streams, Schemas, Workflows
Summary
Data Discovery adds 30% more productivity to Data Scientists
Metadata is key to the next wave of big data applications
Amundsen - Lyft’s metadata and data discovery platform
Blog post with more details: go.lyft.com/datadiscoveryblog
Jin Hyuk Chang @jinhyukchang
Tao Feng @feng-tao
Slides at go.lyft.com/amundsen datacouncil 2019
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
Backup