Hadoop File Formats And Data Ingestion - CERN

Transcription

Hadoop File Formats and Data Ingestion
Prasanth Kothuri, CERN

File Formats – not just CSV
- Key factor in Big Data processing and query performance
- Schema evolution
- Compression and splittability
- Data processing
  - Write performance
  - Partial read
  - Full read

Available File Formats
- Text / CSV
- JSON
- SequenceFile – binary key/value pair format
- Avro
- Parquet
- ORC – optimized row columnar format

Avro
- Language-neutral data serialization system
- Avro data is described using a language-independent schema
- Avro schemas are usually written in JSON and the data is encoded in binary format
- Supports schema evolution
  - write a file in Python and read it in C
  - producers and consumers can be at different versions of the schema
- Supports compression and is splittable

Avro – File structure and example
Sample Avro schema in JSON format:

{
  "type": "record",
  "name": "tweets",
  "fields": [
    {"name": "username",  "type": "string"},
    {"name": "tweet",     "type": "string"},
    {"name": "timestamp", "type": "long"}
  ],
  "doc": "schema for storing tweets"
}

[Figure: Avro file structure]
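As a minimal sketch of how this schema is used in practice (assuming the fastavro Python library, which is not part of the original slides; the file name and record values are made up):

from fastavro import writer, reader, parse_schema

# The tweets schema from the slide above
schema = parse_schema({
    "type": "record",
    "name": "tweets",
    "doc": "schema for storing tweets",
    "fields": [
        {"name": "username",  "type": "string"},
        {"name": "tweet",     "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
})

records = [
    {"username": "alice", "tweet": "hello hadoop", "timestamp": 1428951621},
    {"username": "bob",   "tweet": "avro is splittable", "timestamp": 1428951630},
]

# Write binary Avro; the JSON schema travels in the file header,
# so any consumer (Python, C, Java, ...) can decode the data blocks
with open("tweets.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")  # compressed, still splittable

# Read it back without supplying an external schema
with open("tweets.avro", "rb") as f:
    for record in reader(f):
        print(record)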

Parquet columnar storage format
- Key strength is the ability to store nested data in a truly columnar format using definition and repetition levels (1)
[Figure: a nested schema, its row-wise table representation, and the columnar layout with an encoded chunk per column X, Y, Z]
(1) Dremel made simple with Parquet
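To illustrate the nested-but-columnar point, here is a small sketch assuming the pyarrow library (not mentioned in the slides); the table, file name and column names are hypothetical:

import pyarrow as pa
import pyarrow.parquet as pq

# One flat column 'x' and one nested struct column 'point' with fields y and z
table = pa.table({
    "x": [1, 2, 3],
    "point": [{"y": 10, "z": 100}, {"y": 20, "z": 200}, {"y": 30, "z": 300}],
})
pq.write_table(table, "nested.parquet")

# Every leaf of the nested schema ends up as its own column chunk on disk
rg = pq.read_metadata("nested.parquet").row_group(0)
print([rg.column(i).path_in_schema for i in range(rg.num_columns)])
# expected output: ['x', 'point.y', 'point.z']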

Optimizations – CPU and I/O
- Statistics for filtering and query optimization
- Projection push down and predicate push down: read only the data you need
- Minimizes CPU cache misses: cache misses cost CPU cycles
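A sketch of what projection and predicate push down look like from client code, again assuming pyarrow; the file and column names are hypothetical:

import pyarrow.parquet as pq

# Projection push down: only the requested columns are read from disk.
# Predicate push down: row groups whose min/max statistics rule out
# userId == 1 are skipped entirely.
table = pq.read_table(
    "ratings.parquet",
    columns=["userId", "rating"],
    filters=[("userId", "==", 1)],
)
print(table.num_rows)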

Encoding
- Delta encoding: e.g. a timestamp column can be encoded by storing the first value and the delta between subsequent values, which tend to be small due to temporal validity
- Prefix encoding: delta encoding for strings
- Dictionary encoding: for a small set of distinct values, e.g. post codes, IP addresses etc.
- Run length encoding: for repeating data
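A toy Python illustration of two of these ideas (only a sketch of the concept, not Parquet's actual on-disk encoding; the values are made up):

# Delta encoding: store the first timestamp plus small differences
timestamps = [1428951621, 1428951622, 1428951624, 1428951627]
deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
print(deltas)  # [1428951621, 1, 2, 3] -- the deltas stay small

# Run length encoding: collapse repeating values into (value, count) pairs
def run_length_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

print(run_length_encode(["CH", "CH", "CH", "FR", "FR"]))  # [['CH', 3], ['FR', 2]]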

Parquet file structure & configuration
[Figure: internal structure of a Parquet file]
Configurable Parquet parameters:
- parquet.block.size (default 128 MB): the size in bytes of a block (row group)
- parquet.page.size (default 1 MB): the size in bytes of a page
- parquet.dictionary.page.size (default 1 MB): the maximum allowed size in bytes of a dictionary before falling back to plain encoding for a page
- parquet.enable.dictionary (default true): whether to use dictionary encoding
- parquet.compression (default UNCOMPRESSED): the type of compression: UNCOMPRESSED, SNAPPY, GZIP or LZO
In summary, Parquet is a state-of-the-art, open-source columnar format that is supported by most Hadoop processing frameworks and is optimized for high compression and high scan efficiency.
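For comparison, a hedged sketch of how similar knobs are typically set when writing Parquet from client code, again assuming pyarrow (the properties above are the Hadoop job configuration; the pyarrow keyword arguments are only rough counterparts and the data is made up):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"postcode": ["1211", "1211", "1218"], "hits": [10, 12, 7]})

# use_dictionary, data_page_size and compression roughly mirror
# parquet.enable.dictionary, parquet.page.size and parquet.compression;
# note that pyarrow's row_group_size is a row count, whereas
# parquet.block.size is expressed in bytes.
pq.write_table(
    table,
    "hits.parquet",
    row_group_size=100_000,
    use_dictionary=True,
    data_page_size=1024 * 1024,
    compression="snappy",
)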

Data Ingestion

Flume
- Flume is for high-volume ingestion of event-based data into Hadoop
- e.g. collect logfiles from a bank of web servers, then move the log events from those files into HDFS (clickstream)

Flume Example

Flume configuration

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /var/spooldir

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = ...
agent1.sinks.sink1.hdfs.fileType = DataStream

agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 100

flume-ng agent --name agent1 --conf $FLUME_HOME/conf

Source types include: Avro, JMS, Netcat, Sequence generator, Spooling directory
Sink types include: Avro, Elasticsearch, File roll, HBase, HDFS, IRC, Logger, Morphline (Solr), Null, Thrift
Channel types include: File, JDBC, Memory

Sqoop – SQL to Hadoop
- Open source tool to extract data from a structured data store into Hadoop
[Figure: Sqoop architecture]

Sqoop – contd.
- Sqoop schedules MapReduce jobs to effect imports and exports
- Sqoop always requires the connector and JDBC driver
- Sqoop requires the JDBC driver for the specific database server; it should be copied to /usr/lib/sqoop/lib
- The command line has the following structure:
  sqoop TOOL PROPERTY_ARGS SQOOP_ARGS
  TOOL indicates the operation that you want to perform, e.g. import, export etc.
  PROPERTY_ARGS is a set of parameters that are entered as Java properties in the format -Dname=value
  SQOOP_ARGS are all the various Sqoop parameters

Sqoop – How to run sqoop
Example:

sqoop import \
--connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
--username hadoop_tutorial \
-P \
--num-mappers 1 \
--target-dir visitcount_rfidlog \
--table VISITCOUNT.RFIDLOG

Sqoop – how to parallelize

--table table_name
--query 'select * from table_name where $CONDITIONS'

--table table_name
--split-by primary_key
--num-mappers n

--table table_name
--split-by primary_key
--boundary-query 'select range from dual'
--num-mappers n

Hands On – 1
Use Kite SDK to demonstrate copying of various file formats to Hadoop

Step 1) Download the MovieLens dataset

curl http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -o movies.zip
unzip movies.zip
cd ml-latest-small/

Step 2) Load the dataset into Hadoop in Avro format

# infer the schema
kite-dataset csv-schema ratings.csv --record-name ratings -o ratings.avsc
cat ratings.avsc

# create the dataset
kite-dataset create ratings --schema ratings.avsc

# load the data
kite-dataset csv-import ratings.csv --delimiter ',' ratings

Hands On – 1 contd.
Step 3) Load the dataset into Hadoop in Parquet format

# infer the schema
kite-dataset csv-schema ratings.csv --record-name ratingsp -o ratingsp.avsc
cat ratingsp.avsc

# create the dataset
kite-dataset create ratingsp --schema ratingsp.avsc --format parquet

# load the data
kite-dataset csv-import ratings.csv --delimiter ',' ratingsp

Step 4) Run a sample query to compare the elapsed time between Avro & Parquet

hive
select avg(rating) from ratings;
select avg(rating) from ratingsp;

Hands On – 2
Use Sqoop to copy an Oracle table to Hadoop

Step 1) Get the Oracle JDBC driver

sudo su -
cd /var/lib/sqoop
curl -L https://pkothuri.web.cern.ch/pkothuri/ojdbc6.jar -o ojdbc.jar
exit

Step 2) Run the sqoop job

sqoop import \
--connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
--username hadoop_tutorial \
-P \
--num-mappers 1 \
--target-dir visitcount_rfidlog \
--table VISITCOUNT.RFIDLOG

Hands On – 3
Use Sqoop to copy an Oracle table to Hadoop, with multiple mappers

sqoop import \
--connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
--username hadoop_tutorial \
-P \
--num-mappers 2 \
--split-by alarm_id \
--target-dir lemontest_alarms \
--table LEMONTEST.ALARMS \
--as-parquetfile

Check the size and number of files:

hdfs dfs -ls lemontest_alarms/

Hands On – 4
Use Sqoop to make an incremental copy of an Oracle table to Hadoop

Step 1) Create a sqoop job

sqoop job \
--create alarms \
-- \
import \
--connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
--username hadoop_tutorial \
-P \
--num-mappers 1 \
--target-dir lemontest_alarms_i \
--table LEMONTEST.ALARMS \
--incremental append \
--check-column alarm_id \
--last-value 0

Hands On – 4 contd.
Step 2) Run the sqoop job

sqoop job --exec alarms

Step 3) Run sqoop in incremental mode

sqoop import \
--connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
--username hadoop_tutorial \
-P \
--num-mappers 1 \
--table LEMONTEST.ALARMS \
--target-dir lemontest_alarms_i \
--incremental append \
--check-column alarm_id \
--last-value 47354

hdfs dfs -ls lemontest_alarms_i/

Q & A
E-mail: Prasanth.Kothuri@cern.ch
Blog: http://prasanthkothuri.wordpress.com
See also: https://db-blog.web.cern.ch/
