Fast and Expressive Big Data Analytics with Python (Matei Zaharia) - Apache Spark

Transcription

Fast and Expressive Big Data Analytics with Python
Matei Zaharia
UC Berkeley / MIT
spark-project.org

What is Spark?
Fast and expressive cluster computing system interoperable with Apache Hadoop.
Improves efficiency through:
» In-memory computing primitives (up to 100× faster)
» General computation graphs (2-10× faster on disk)
Improves usability through:
» Rich APIs in Scala, Java, Python (often 5× less code)
» Interactive shell

Project History
Started in 2009, open sourced in 2010.
17 companies now contributing code:
» Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, …
Entered Apache incubator in June.
Python API added in February.

An Expanding Stack
Spark is the basis for a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
» Shark (SQL)
» Spark Streaming (real-time)
» GraphX (graph)
» MLbase (machine learning)
More details: amplab.berkeley.edu

This Talk
» Spark programming model
» Examples
» Demo
» Implementation
» Trying it out

Why a New Programming Model?
MapReduce simplified big data processing, but users quickly found two problems:
Programmability: a tangle of map/reduce functions
Speed: MapReduce is inefficient for apps that share data across multiple steps
» Iterative algorithms, interactive queries

Data Sharing in MapReduce
[diagram: each iteration and each query reads its input from HDFS and writes its output back to HDFS]
Slow due to data replication and disk I/O

What We'd Like
[diagram: one-time processing loads the input into distributed memory, which later iterations and queries then share]
Memory is 10-100× faster than network and disk

Spark Model
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs):
» Collections of objects that can be stored in memory or on disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
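As a minimal sketch of this model (assuming the PySpark shell's built-in SparkContext `sc`; the data is made up):

words = sc.parallelize(["spark", "hadoop", "spark"])  # distributed dataset
pairs = words.map(lambda w: (w, 1))                   # lazy transformation
pairs.cache()                                         # keep in memory for reuse
counts = pairs.reduceByKey(lambda x, y: x + y)        # another transformation
print(counts.collect())  # action; e.g. [('hadoop', 1), ('spark', 2)] (order may vary)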

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()            # Action
messages.filter(lambda s: "bar" in s).count()

[diagram: the driver ships tasks to workers; each worker caches its block of messages (Cache 1-3) and serves later queries from memory]
Result: full-text search of Wikipedia in 2 sec (vs 30 s for on-disk data)
Result: scaled to 1 TB data in 7 sec (vs 180 sec for on-disk data)
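For readers following along without a cluster, here is a self-contained local-mode sketch of the same pipeline; the sample log lines and their tab-separated format are hypothetical stand-ins for the HDFS file:

from pyspark import SparkContext

sc = SparkContext("local", "LogMining")
# Hypothetical sample lines standing in for the hdfs://... input.
lines = sc.parallelize([
    "INFO\tphp\tstartup complete",
    "ERROR\tphp\tfoo failed",
    "ERROR\tmysql\tbar timed out",
])
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2]).cache()

print(messages.filter(lambda s: "foo" in s).count())  # 1
print(messages.filter(lambda s: "bar" in s).count())  # 1
sc.stop()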

Fault Tolerance
RDDs track the transformations used to build them (their lineage) to recompute lost data:

messages = textFile(...).filter(lambda s: "ERROR" in s) \
                        .map(lambda s: s.split("\t")[2])

Lineage: HadoopRDD (path = hdfs://...) → FilteredRDD (func = lambda s: ...) → MappedRDD (func = lambda s: ...)
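One way to inspect lineage from Python is the RDD's toDebugString() method; a small sketch (hypothetical path; the exact output format, and whether the result is bytes or str, varies across Spark versions):

msgs = sc.textFile("in.txt") \
         .filter(lambda s: "ERROR" in s) \
         .map(lambda s: s.split("\t")[2])
info = msgs.toDebugString()  # prints the chain of parent RDDs
print(info.decode() if isinstance(info, bytes) else info)

Because textFile is lazy, this works even before any job has run on the data.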

Example: Logistic Regression
Goal: find a line separating two sets of points.
[figure: a random initial line is iteratively adjusted until it reaches the target separator]

Example: Logistic Regression

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w
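A self-contained version of the same job, runnable in local mode; the two synthetic (x, y) tuples and the dimension constants are made-up stand-ins for readPoint's output:

from math import exp

import numpy
from pyspark import SparkContext

D, ITERATIONS = 2, 10
sc = SparkContext("local", "LogisticRegression")

# Synthetic points standing in for readPoint: x is a NumPy feature
# vector, y is the +1/-1 label. Accessed below as p[0] and p[1].
data = sc.parallelize([
    (numpy.array([1.0, 2.0]), 1.0),
    (numpy.array([-1.5, -0.5]), -1.0),
]).cache()

w = numpy.random.rand(D)
for i in range(ITERATIONS):
    # Each iteration ships the current w to the workers with the closure.
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p[1] * w.dot(p[0])))) * p[1] * p[0]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("Final w: %s" % w)
sc.stop()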

Logistic Regression Performance
[chart: running time (s) versus number of iterations (1-30), Hadoop vs. PySpark]
Hadoop: 110 s / iteration
PySpark: first iteration 80 s, further iterations 5 s

Demo

Supported Operators
map, filter, flatMap, reduce, reduceByKey, count, collect, cache, join, leftOuterJoin, rightOuterJoin, saveAsTextFile, …
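A few of these operators in action (a minimal sketch, assuming the shell's `sc`; the output order of joins may vary):

a = sc.parallelize([("a", 1), ("b", 2)])
b = sc.parallelize([("a", "x")])
print(a.join(b).collect())           # [('a', (1, 'x'))]
print(a.leftOuterJoin(b).collect())  # also includes ('b', (2, None))
print(sc.parallelize(["to be", "or not"]).flatMap(lambda s: s.split()).collect())
# ['to', 'be', 'or', 'not']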

Other Engine Features
General operator graphs (not just map-reduce)
Hash-based reduces (faster than Hadoop's sort)
Controlled data partitioning to save communication

PageRank performance (iteration time, s):
Hadoop: 171
Basic Spark: 72
Spark + controlled partitioning: 23
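In PySpark, controlled partitioning is exposed through partitionBy on pair RDDs. A sketch with made-up data of the PageRank-style trick: hash-partition the links and ranks identically so the per-iteration join does not re-shuffle either side:

links = sc.parallelize([(i, [(i + 1) % 100]) for i in range(100)])
ranks = sc.parallelize([(i, 1.0) for i in range(100)])

links = links.partitionBy(8).cache()  # partition once, keep in memory
ranks = ranks.partitionBy(8)          # same partitioner: join avoids a shuffle
print(links.join(ranks).getNumPartitions())  # 8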

Spark Community
1000+ meetup members
60+ contributors
17 companies contributing

This Talk
» Spark programming model
» Examples
» Demo
» Implementation
» Trying it out

Overview
Spark core is written in Scala.
PySpark calls the existing scheduler, cache, and networking layer (a 2K-line wrapper).
No changes to Python.
[diagram: your app runs on PySpark, which talks to the Spark client; the client and each Spark worker launch Python child processes]

Overview
Spark core is written in Scala.
PySpark calls the existing scheduler, cache, and networking layer (a 2K-line wrapper).
No changes to Python.
Credit: Josh Rosen (cs.berkeley.edu/~joshrosen)
[diagram: same architecture as above]

Object Marshaling
Uses the pickle library for both communication and cached data
» Much cheaper than Python objects in RAM
Lambda marshaling library by PiCloud
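A sketch of the idea with present-day libraries: ordinary pickle cannot serialize lambdas, which is why a separate lambda-marshaling library is needed; cloudpickle, the descendant of PiCloud's code, fills that gap (assumes pip install cloudpickle):

import pickle

import cloudpickle  # descendant of PiCloud's serialization code

double = lambda x: 2 * x
payload = cloudpickle.dumps(double)  # standard pickle.dumps would raise here
restored = pickle.loads(payload)     # but plain pickle can read it back
print(restored(21))                  # 42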

Job Scheduler
Supports general operator graphs
Automatically pipelines functions
Aware of data locality and partitioning
[diagram: a DAG of RDDs A-G cut into three stages at shuffle boundaries (groupBy, join); narrow operations such as map and union are pipelined within a stage, and cached data partitions are skipped]

Interoperability
Runs in standard CPython, on Linux / Mac
» Works fine with extensions, e.g. NumPy
Input from local file system, NFS, HDFS, S3
» Only text files for now
Works in IPython, including the notebook
Works in doctests – see our tests!
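For instance, NumPy arrays pass through RDD operations like any other Python objects (a minimal sketch, assuming the shell's `sc` and NumPy installed):

import numpy as np

vecs = sc.parallelize([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
print(vecs.reduce(lambda a, b: a + b))  # element-wise sum: [4. 6.]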

Getting Started
Visit spark-project.org for video tutorials, online exercises, and docs.
Easy to run in local mode (multicore), on standalone clusters, or on EC2.
Training camp at Berkeley in August (free video): ampcamp.berkeley.edu

Getting Started
The easiest way to learn is the shell:

$ ./pyspark
>>> nums = sc.parallelize([1, 2, 3])  # make RDD from array
>>> nums.count()
3
>>> nums.map(lambda x: 2 * x).collect()
[2, 4, 6]

Writing Standalone Jobs

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount")
    lines = sc.textFile("in.txt")

    counts = lines.flatMap(lambda s: s.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)

    counts.saveAsTextFile("out.txt")

Conclusion
PySpark provides a fast and simple way to analyze big datasets from Python.
Learn more or contribute at spark-project.org
Look for our training camp on August 29-30!
My email: matei@berkeley.edu

Behavior with Not Enough RAM
[chart: iteration time (s) versus % of the working set in memory (50%, 75%, fully cached); iteration time rises as less of the working set fits in RAM]

The Rest of the Stack
Spark is the foundation for a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
» Shark (SQL)
» Spark Streaming (real-time)
» GraphX (graph)
» MLbase (machine learning)
More details: amplab.berkeley.edu

Performance Comparison
[charts: SQL response time (s) for Shark (mem), Redshift, Impala (mem), Impala (disk); streaming throughput (MB/s/node) for Spark vs. Storm; graph response time (min) for Spark vs. Giraph]
