Amundsen: A Data Discovery Platform From Lyft - Data Council

Transcription

Amundsen: A Data Discovery Platform from Lyft
April 17th 2019
Jin Hyuk Chang @jinhyukchang, Engineer, Lyft
Tao Feng @feng-tao, Engineer, Lyft

Agenda
‒ Data at Lyft
‒ Challenges with Data Discovery
‒ Data Discovery at Lyft
‒ Demo
‒ Architecture
‒ Summary

Data platform users: Data Modelers, Analysts, Data Experimenters, Data Platform

Core Infra high-level architecture. Diagram label: Custom apps.

Data Discovery

Hi! I am a n00b Data Scientist! My first project is to analyze and predict Data Council attendance. Where is the data? What does it mean?

Status quo
‒ Option 1: Phone a friend!
‒ Option 2: GitHub search

Understand the context
‒ What does this field mean?
  ‒ Does attendance data include employees?
  ‒ Does it include revenue?
‒ Let me dig in and understand

Explore
SELECT * FROM default.my_table WHERE ds = '2018-01-01' LIMIT 100;

Exploring with SELECT * is EVIL
1. Lack of productivity for data scientists
2. Increased load on the databases

Data Scientists spend up to 1/3 of their time on data discovery. Data discovery: lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.

Audience for data discovery

Data Discovery - User personas: Data Modelers, Analysts, Data Experimenters, Data Platform

3 Data Scientist personas
‒ Power user: all info in their head; get interrupted a lot due to questions
‒ Noob user: lost; ask “power users” a lot of questions
‒ Manager: dependencies landing on time; communicating with stakeholders

Data Discovery answers 3 kinds of questions
Search based:
‒ Where is the table/dashboard for X? What does it contain?
‒ Does this analysis already exist?
Lineage based:
‒ I am changing a data model, who are the owners and most common users?
‒ This table’s delivery was delayed today, I want to notify everyone downstream.
Network based:
‒ I want to follow a power user in my team.
‒ I want to bookmark tables of interest and get a feed of data delay, schema change, incidents.

Meet Amundsen. First person to discover the South Pole: Norwegian explorer Roald Amundsen.

Landing page optimized for search

Search results ranked on relevance and query activity

How does search work?

Relevance - search for “apple” on Google: low relevance vs. high relevance

Popularity - search for “apple” on Google: low popularity vs. high popularity

Striking the balance
‒ Relevance: names, descriptions, tags, [owners, frequent users]
‒ Popularity: querying activity, dashboarding
  ‒ Different weights for automated vs. ad hoc querying
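The balance between relevance and popularity can be sketched as a single ranking score. This is only an illustration, not Amundsen's actual ranking: the weights, the log damping, and the names `TableStats`, `popularity`, and `rank` are all assumptions, and the slides do not say which kind of querying is weighted higher.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical weights: the slides only say that automated and ad hoc
# querying are weighted differently, not what the weights are.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


@dataclass
class TableStats:
    name: str
    relevance: float        # text-match score (names, descriptions, tags)
    adhoc_queries: int      # queries typed by people
    automated_queries: int  # queries issued by scheduled jobs


def popularity(stats: TableStats) -> float:
    """Weighted query count, log-damped so heavy hitters do not dominate."""
    raw = ADHOC_WEIGHT * stats.adhoc_queries + AUTOMATED_WEIGHT * stats.automated_queries
    return math.log1p(raw)


def rank(tables: Iterable[TableStats], popularity_weight: float = 0.5) -> List[TableStats]:
    """Order tables by relevance plus weighted popularity, best first."""
    return sorted(
        tables,
        key=lambda t: t.relevance + popularity_weight * popularity(t),
        reverse=True,
    )


tables = [
    TableStats("exact_match_unused", relevance=1.0, adhoc_queries=0, automated_queries=0),
    TableStats("popular_with_people", relevance=0.8, adhoc_queries=100, automated_queries=0),
    TableStats("popular_with_jobs", relevance=0.8, adhoc_queries=0, automated_queries=100),
]
for t in rank(tables):
    print(t.name)
```

With these assumed weights, a table that people query by hand outranks one with the same text match that only scheduled jobs touch.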

Back to mocks.

Search results ranked on relevance and query activity

Detailed description and metadata about data resources

Data Preview within the tool

Computed stats about column metadata. Disclaimer: these stats are arbitrary.

Built-in user feedback

Demo

Open source in mind
‒ Pluggable code to each microservice via Python entry point, etc.
‒ Pluggable API endpoint via Blueprint
‒ Build your ingestion pipeline like a Lego brick
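A rough sketch of the pluggability idea: entry points name an object with a "module:attribute" string, and the host resolves that string at runtime. The helper `load_plugin` and the `CONFIG` dictionary below are hypothetical and use only the standard library; they are not Amundsen's configuration mechanism.

```python
import importlib


def load_plugin(spec: str):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Hypothetical configuration: swapping the string swaps the implementation
# without touching the code that uses it.
CONFIG = {"serializer": "json:dumps"}

serialize = load_plugin(CONFIG["serializer"])
print(serialize({"table": "rides"}))
```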

Amundsen’s architecture

Architecture diagram components: Other Microservices, ML Feature Service, Frontend Service, Security Service, Metadata Service, Search Service, Elasticsearch, Neo4j, Databuilder Crawler, Metadata

1. Frontend Service


Amundsen table detail page

2. Metadata Service


2. Metadata Service
‒ A thin proxy layer to interact with the graph database
  ‒ Currently Neo4j is the default option for the graph backend engine
  ‒ Working with the community to support Apache Atlas
‒ Supports a REST API for other services pushing / pulling metadata directly
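The "thin proxy layer" idea can be sketched as an abstract interface that the service depends on, with the graph backend hidden behind it. All class and method names here are hypothetical, and the in-memory backend is only a stand-in for a Neo4j- or Apache Atlas-backed implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class MetadataProxy(ABC):
    """Interface the service codes against; one subclass per backend."""

    @abstractmethod
    def get_table_description(self, table_key: str) -> Optional[str]:
        ...

    @abstractmethod
    def put_table_description(self, table_key: str, description: str) -> None:
        ...


class InMemoryProxy(MetadataProxy):
    """Illustrative stand-in; a real proxy would issue graph queries instead."""

    def __init__(self) -> None:
        self._descriptions: Dict[str, str] = {}

    def get_table_description(self, table_key: str) -> Optional[str]:
        return self._descriptions.get(table_key)

    def put_table_description(self, table_key: str, description: str) -> None:
        self._descriptions[table_key] = description


class MetadataService:
    """Request handlers only see the proxy interface, never the backend."""

    def __init__(self, proxy: MetadataProxy) -> None:
        self._proxy = proxy

    def describe(self, table_key: str) -> str:
        return self._proxy.get_table_description(table_key) or "(no description)"

    def update_description(self, table_key: str, description: str) -> None:
        self._proxy.put_table_description(table_key, description)


service = MetadataService(InMemoryProxy())
service.update_description("hive://default/rides", "One row per ride")
print(service.describe("hive://default/rides"))
```

Swapping the backend then means passing a different proxy to the service, which is what makes supporting a second graph engine a contained change.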

Trade Off #1: Why choose a graph database?

Why Graph database?


Trade Off #2: Why not propagate the metadata back to source?

Why not propagate the metadata back to source?


3. Search Service


3. Search Service
‒ A thin proxy layer to interact with the search backend
  ‒ Currently it supports Elasticsearch as the search backend
‒ Supports different search patterns
  ‒ Normal Search: match records based on relevancy
  ‒ Category Search: match records first based on data type, then relevancy
  ‒ Wildcard Search
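One way to picture the three search patterns is a small function that turns a search term into an Elasticsearch query body. The sketch below only builds dictionaries in Elasticsearch's query DSL (`multi_match`, `match`, `wildcard`). The index field names, the `CATEGORY_FIELDS` mapping, and the reading of "category" as a `prefix:value` term are assumptions, not the Search Service's actual behavior.

```python
from typing import Any, Dict

# Hypothetical mapping from a category prefix to an index field name.
CATEGORY_FIELDS = {"column": "column_names", "tag": "tags", "table": "name"}


def build_query(term: str) -> Dict[str, Any]:
    """Pick a search pattern from the shape of the term and build a query body."""
    term = term.strip()
    category, sep, value = term.partition(":")
    field = CATEGORY_FIELDS.get(category.strip().lower())
    if sep and field and value.strip():
        # Category search: match only within the field named by the prefix.
        return {"query": {"match": {field: value.strip()}}}
    if "*" in term:
        # Wildcard search on the table name.
        return {"query": {"wildcard": {"name": {"value": term}}}}
    # Normal search: relevance across several fields, boosting the name.
    return {
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["name^3", "description", "tags"],
            }
        }
    }


for example in ("rides", "event_*", "column: is_line_ride"):
    print(example, "->", build_query(example))
```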

Challenge #1: How to make the search result more relevant?

How to make the search result more relevant?
‒ Define a search quality metric
  ‒ Click-Through-Rate (CTR) over top 5 results
‒ Search behaviour instrumentation is key
‒ A couple of improvements:
  ‒ Boost the exact table ranking
  ‒ Support wildcard search (e.g. event_*)
  ‒ Support category search (e.g. column: is_line_ride)
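The quality metric above can be computed from instrumented search logs roughly as follows. The `SearchEvent` record and its fields are hypothetical, and this sketch treats CTR over the top 5 as the share of searches that ended in a click on one of the first five results.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SearchEvent:
    query: str
    clicked_rank: Optional[int]  # 1-based rank of the clicked result, None if no click


def ctr_at_k(events: Iterable[SearchEvent], k: int = 5) -> float:
    """Fraction of searches with a click on one of the top-k results."""
    events = list(events)
    if not events:
        return 0.0
    hits = sum(1 for e in events if e.clicked_rank is not None and e.clicked_rank <= k)
    return hits / len(events)


log = [
    SearchEvent("rides", clicked_rank=1),
    SearchEvent("event_*", clicked_rank=None),
    SearchEvent("driver bonus", clicked_rank=7),
    SearchEvent("column: is_line_ride", clicked_rank=3),
]
print(ctr_at_k(log))  # 2 of 4 searches clicked within the top 5 -> 0.5
```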

4. Databuilder

Architecture diagram components: Other Microservices, ML Feature Service, Frontend Service, Other Services, Metadata Service, Search Service, Elasticsearch, Neo4j, Databuilder Crawler, Metadata

Challenge #1: Various forms of metadata

Metadata Sources @ Lyft

Metadata - Challenges
‒ No Standardization: No single data model that fits all data resources
  ‒ A data resource could be a table, an Airflow DAG or a dashboard
‒ Different Extraction: Each data set’s metadata is stored and fetched differently
  ‒ Hive Table: Stored in Hive metastore
  ‒ RDBMS (Postgres etc.): Fetched through DBAPI interface
  ‒ GitHub source code: Fetched through git hook
  ‒ Mode dashboard: Fetched through Mode API
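To make the "different extraction" point concrete, here is a minimal sketch in which each source gets its own extractor but all of them emit the same record type. It uses Python's built-in `sqlite3` module as the DB-API example (the table listing and `PRAGMA table_info` are SQLite-specific). `TableMetadata`, `Extractor`, `SqliteExtractor`, and `StaticExtractor` are hypothetical names, not Databuilder's actual classes.

```python
import sqlite3
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class TableMetadata:
    source: str
    name: str
    columns: List[str]


class Extractor:
    """Common interface: every source yields the same record type."""

    def extract(self) -> Iterator[TableMetadata]:
        raise NotImplementedError


class SqliteExtractor(Extractor):
    """Reads table and column names through a DB-API connection."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def extract(self) -> Iterator[TableMetadata]:
        cur = self.conn.cursor()
        cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
        for (table,) in cur.fetchall():
            col_cur = self.conn.cursor()
            # PRAGMA cannot take bound parameters; the name comes from sqlite_master.
            col_cur.execute(f'PRAGMA table_info("{table}")')
            columns = [row[1] for row in col_cur.fetchall()]  # row[1] is the column name
            yield TableMetadata("sqlite", table, columns)


class StaticExtractor(Extractor):
    """Stand-in for API-backed sources such as a dashboard service."""

    def __init__(self, records: Iterable[TableMetadata]) -> None:
        self.records = list(records)

    def extract(self) -> Iterator[TableMetadata]:
        yield from self.records


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INTEGER, is_line_ride INTEGER)")
extractors = [
    SqliteExtractor(conn),
    StaticExtractor([TableMetadata("dashboard", "weekly_rides", ["week", "rides"])]),
]
for extractor in extractors:
    for record in extractor.extract():
        print(record)
```

Downstream code then handles one record shape regardless of where the metadata came from.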

Challenge #2: Pull model vs. Push model

Pull model vs. Push model
‒ Pull Model: Periodically update the index by pulling from the system (e.g. database) via crawlers. Diagram labels: Scheduler, Crawler, Database, Data graph.
‒ Push Model: The system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Diagram labels: Database, Message queue, Data graph.


Pull model vs. push model
Pull Model
‒ Onus of integration lies on the data graph
‒ No interface to prescribe, hard to maintain crawlers
‒ Preferred if: waiting for indexing is OK; working with “strapped” teams; there’s already an interface
‒ Diagram labels: Crawler, Database, Data graph
Push Model
‒ Onus of integration lies on the database
‒ Message format serves as the interface
‒ Allows for near-real-time indexing
‒ Preferred if: near-real-time indexing is important; a clean interface doesn’t exist
‒ Other tools like WhereHows are moving towards the Push Model
‒ Diagram labels: Database, Message queue, Data graph
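The difference between the two models can be shown with a toy example. Everything here is hypothetical (the `Database` and `DataGraph` classes, the message shape, and the use of an in-process `queue.Queue` in place of a real message bus); the point is only where the integration work sits and how fresh the index is.

```python
import queue
from typing import Dict, List


class Database:
    """Toy metadata source."""

    def __init__(self) -> None:
        self.tables: Dict[str, str] = {}
        self.subscribers: List[queue.Queue] = []  # used only by the push model

    def create_table(self, name: str, schema: str) -> None:
        self.tables[name] = schema
        for q in self.subscribers:
            # Push model: the message format is the interface between systems.
            q.put({"table": name, "schema": schema})


class DataGraph:
    def __init__(self) -> None:
        self.index: Dict[str, str] = {}

    def crawl(self, db: Database) -> None:
        """Pull model: a scheduler calls this periodically; the index is stale in between."""
        self.index = dict(db.tables)

    def consume(self, messages: queue.Queue) -> None:
        """Push model: apply every message currently waiting on the queue."""
        while True:
            try:
                msg = messages.get_nowait()
            except queue.Empty:
                break
            self.index[msg["table"]] = msg["schema"]


# Pull: the new table is invisible until the next scheduled crawl.
db, graph = Database(), DataGraph()
db.create_table("rides", "ride_id INT")
print("before crawl:", graph.index)
graph.crawl(db)
print("after crawl:", graph.index)

# Push: the database publishes changes and the graph applies them on arrival.
db2, graph2, bus = Database(), DataGraph(), queue.Queue()
db2.subscribers.append(bus)
db2.create_table("events", "event_id INT")
graph2.consume(bus)
print("after push:", graph2.index)
```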

4. Databuilder

Databuilder in action

How are we building data? Databuilder

How is Databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs.

What’s next?

Amundsen seems to be more useful than we thought
‒ Tremendous success at Lyft
  ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
‒ Many organizations have similar problems
  ‒ Collaborating with ING, WeWork and more
  ‒ We plan to announce open source soon

Impact - Amundsen at Lyft. Chart milestones: Alpha release, Beta release (internal), Generally Available (GA) release.

Summary

Adding more kinds of data resources
‒ Phase 1 (Complete): Data sets
‒ Phase 2 (In development): Dashboards, People
‒ Phase 3 (In scoping): Streams, Schemas, Workflows

Summary
‒ Data Discovery adds 30% more productivity to Data Scientists
‒ Metadata is key to the next wave of big data applications
‒ Amundsen - Lyft’s metadata and data discovery platform
‒ Blog post with more details: go.lyft.com/datadiscoveryblog

Jin Hyuk Chang @jinhyukchang
Tao Feng @feng-tao
Slides at go.lyft.com/amundsen_datacouncil_2019
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/

Backup
