Data Processing Goes Big - Sysbus

Transcription

Test report: Talend Enterprise Data Integration – Big Data Edition
Data processing goes big
Dr. Götz Güttich

Talend Enterprise Data Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors, it can shift data from any number of different sources to just as many targets. The Big Data Edition of this solution is optimized to take advantage of Hadoop and its databases and technologies such as HBase, HCatalog, HDFS, Hive, Oozie and Pig. IAIT decided to put Talend Enterprise Data Integration through its paces and see how it stacks up in the real world.

When IT specialists talk about 'big data', they are usually referring to data sets that are so large and complex that they can no longer be processed with conventional data management tools. These huge volumes of data are produced for a variety of reasons. Streams of data can be generated automatically (reports, logs, camera footage, etc.). Or they could be the result of detailed analyses of customer behavior (consumption data), scientific investigations (the Large Hadron Collider is an apt example) or the consolidation of various data sources.

These data repositories – which typically run into petabytes and exabytes – are hard to analyze because conventional database systems simply lack the muscle. Big data has to be analyzed in massively parallel environments where computing power is distributed over thousands of computers and the results are transferred to a central location.

The Hadoop open source platform has emerged as the preferred framework for analyzing big data. Its distributed file system splits the information into several data blocks and distributes these blocks across multiple systems in the network (the Hadoop cluster). By distributing computing power, Hadoop also ensures a high degree of availability and redundancy. A "master node" handles file storage as well as requests.

Talend Enterprise Data Integration

Hadoop is a very powerful computing platform for working with big data. It can accept external requests, distribute them to individual computers in the cluster and execute them in parallel on the individual nodes. The results are fed back to a central location where they can then be analyzed.

However, to reap the benefits of Hadoop, data analysts need a way to load data into Hadoop and subsequently extract it from this open source system. This is where the Big Data Edition of Talend Enterprise Data Integration comes in. As previously mentioned, Talend Enterprise Data Integration can read data from just about any source, process the data as instructed by the user and then export it. Taking a simple job as an example, this tool can read CSV files, select certain fields like name or address, and export the results to an Excel file. But its capabilities don't end there: it also works with business intelligence solutions like Jaspersoft, SAP, Amazon RDS and Salesforce and supports a variety of databases such as DB2, Informix and – obviously – Hadoop.

[Figure: Users are presented with a welcome screen when they launch Talend Enterprise Data Integration]

How does it work?

Talend Enterprise Data Integration is a code generator. All the user needs to do is define a data source – a CSV file or a database, for instance – and then specify the operation to be performed on that data. In the case of a CSV file, for example, the user will specify details like the encoding method and field separator, although the options available will of course vary from source to source. As soon as the data source is set, the user can link it to their workspace as an icon.

Next, the user can define their preferred workflow. They can manipulate the data in a number of ways – filter, sort, replace, transform, split, merge and convert. There is also a map function for transforming data. This lets users select specific data fields, rearrange these fields, automatically insert additional data like numbering and much more. Each of these transformation features comes with its own icon which can be dragged and dropped to the user's workspace and configured there.

Once the data source has been defined and the user has specified how the tool should process the information, the next task is to define the export settings. For this, Talend offers various connectors to supported target systems such as Informix or Hadoop. Talend Enterprise Data Integration uses icons for the connectors, too. Again, the user can simply drag and drop the icons and configure them on their workspace. Here again, the configuration options depend on the export medium. In the case of an Excel spreadsheet, for example, the user just needs to enter the export path.

Lines are used to illustrate the flow of data between icons. Users can usually just drag these into place (sometimes they also need to select certain connection types from a menu). The job can be started once all of these steps have been completed. First, Talend Enterprise Data Integration creates the code required to execute the job; then it starts the job and transforms the data. Depending on the technology used, the generated code can be Java or SQL, or – for Hadoop – MapReduce, Pig Latin, HiveQL and others. Since each job step is symbolized by an icon, for which users only have to specify the general requirements in order to generate the code automatically, users don't need programming skills to run complex, code-intensive data processing jobs.
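To give an idea of what such a simple job boils down to once the code has been generated, here is a minimal hand-written Java sketch of a CSV job that keeps only two columns. It is an illustration only, not Talend's actual generated code; the file names, the ";" field separator and the column positions are assumptions made for the example.

    import java.io.*;

    public class SimpleCsvJob {
        public static void main(String[] args) throws IOException {
            // "Input connector": read the source CSV line by line
            BufferedReader in = new BufferedReader(new FileReader("customers.csv"));
            // "Output connector": a real job might write an Excel file instead
            PrintWriter out = new PrintWriter(new FileWriter("customers_out.csv"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(";");        // field separator from the metadata
                if (fields.length < 4) continue;          // skip malformed rows
                out.println(fields[1] + ";" + fields[3]); // keep only the name and city columns
            }
            out.close();
            in.close();
        }
    }

In the product itself this kind of boilerplate is produced automatically from the icon configuration, which is precisely why no programming skills are needed.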

The above example is a simple one to illustrate the process, but Talend Enterprise Data Integration can handle much more complex tasks, such as importing data and then mapping specific fields, transforming certain data types and sorting the modified output prior to export.

[Figure: When they use a CSV file as their source, users can configure all kinds of parameters]

Versions

Talend software is available in a number of editions, starting with the freely downloadable open source products Talend Open Studio for Data Integration and Talend Open Studio for Big Data. These are available on the vendor's website and are free to download and use. The subscription based product is called Talend Enterprise Data Integration and is available in four editions: Team, Professional, Cluster and Big Data. The main difference between the subscription and open source products is extra support options (including SLAs) and additional features such as shared repositories, wizards, shared jobs, version control and reference projects, to mention but a few. The subscription versions themselves offer different combinations of features, including support for load balancing, high availability and Hadoop. Talend is in fact one of the first vendors to provide Hadoop support.

The website provides a comparative matrix of the available software versions and their functions. For our test, we looked at the Big Data Edition of Talend Enterprise Data Integration, which requires a subscription. It is fair to point out, however, that the open source product Talend Open Studio for Big Data has quite an extensive set of functions and would be perfectly suitable for all internal data transformation jobs. Administrators looking to fast track import and export script writing should look at what the subscription products have to offer.

The test

For the test we used an environment with Hadoop 1.0.3 running in a vSphere installation based on IBM's X Architecture. After installing Talend Enterprise Data Integration on a workstation running the x64 edition of Windows 7 Ultimate, we started by importing data from a CSV file. We transformed the data and exported it as an Excel spreadsheet to become familiar with the solution's functionality. We then set up a connection to our Hadoop system, imported the same CSV data again and wrote it to Hadoop. Then, we re-exported the data to check that everything had worked properly. For the next step, we took company data sets of 100,000 and ten million records and analyzed this data using Pig. Finally, we used Hive to access the data in Hadoop and worked with the HBase database. We will cover these terms in greater detail later on. As for specifications, Talend Enterprise Data Integration runs on the latest version of Java 1.6. The vendor recommends the 64-bit version of Windows 7 as the operating system and a standard workstation with 4 GB of memory.

Installing Talend Enterprise Data Integration

Installing Talend Enterprise Data Integration is straightforward. The user just needs to make sure that their system runs a supported version of Java and then unzip the Talend files to a folder of their choice (e.g. c:\Talend).

The next step is to launch the executable file. The system will initially ask for a valid license (a license key is, of course, not required in the open source editions). Once the key has been validated, the user will see a screen with some general license terms. After the user clicks to confirm, they can get started by creating a repository (with workspace) and setting up their first project. When they open this project they will be presented with the development tool's welcome screen, which guides them through the first steps.

Working with Talend Enterprise Data Integration

The design environment is based on the Eclipse platform. A repository on the left allows the user to define items like jobs, joblets and metadata. The jobs outline the operations – represented by icons – to be performed on the data. The metadata can be used to set up file, database and SAP connections, schemas, etc. And the joblets create reusable data integration task logic that can be factored into new jobs on a modular basis.

The Code subfolder contains two interesting additional functionalities. Job scripts, on the one hand, are process descriptions – i.e. code generation instructions – in XML format. Job scripts can provide a full description of the processes, so users can implement functions for the import of table descriptions, for example. Routines, on the other hand, allow the user to define automated tasks like the splitting of fields.

The workspace is located in the top center of the screen. Here, the user can define jobs using icons. Underneath, the context-sensitive configuration options can be set for the selected icon. The options for starting and debugging jobs and lists of errors, messages and information are also found here. A panel on the right of the screen contains a Palette with the various components in icon form. These include the import and export connectors as well as the functions for editing data, executing commands and so on. Users can also integrate their own code into the system at any time. The Palette therefore holds all of the components that can be dragged to the workspace and dropped.

[Figure: Performing an analysis job with Pig, running here on the Hadoop server]

The first jobs

For the test, we carried out the first job at this point, importing a CSV file and then writing the data to an Excel spreadsheet. As we have already outlined this job in the introduction, we will move on to the job of writing the data from a CSV file to Hadoop. For this job, our source was the CSV file that was pre-defined under metadata with its configuration parameters like field separator and encoding. After dragging it to the workspace, we defined an export target. For this, we selected the tHDFSOutput component from the Big Data folder in the Palette and dragged the icon next to our source file. The HDFS in the name stands for Hadoop Distributed File System.

The next task was to configure the output icon. After clicking the icon, we were able to enter the required settings in the Component tab below the workspace. These settings included the Hadoop version, server name, user account, target folder and the name of the target file in Hadoop. For our test, we again used a CSV file as our target.

To finish, we had to create a connection between the two icons. We did this by right-clicking the source icon and dragging a line as far as the HDFS icon.
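Behind the icons, writing to HDFS comes down to a handful of calls to the Hadoop FileSystem API. The following hand-written Java sketch illustrates that step only; it is not Talend's generated code, and the NameNode address hdfs://hadoop-server:9000, the user name "talend" and the paths are assumptions.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CsvToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // NameNode URI and user name correspond to the server name and
            // user account fields of the tHDFSOutput component (assumed values)
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://hadoop-server:9000"), conf, "talend");
            // Copy the local source file into the HDFS target folder
            fs.copyFromLocalFile(new Path("customers.csv"),
                                 new Path("/user/talend/customers.csv"));
            fs.close();
        }
    }

The server name, user account and target path in the sketch correspond directly to the fields we filled in on the Component tab.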

Once the connection was made, we were able to start the job (on the relevant tab below the workspace). When the task was completed, the Talend software showed the throughput in rows per second and the number of data sets transferred. We checked that the new file had actually arrived at its target destination using the Hadoop function "Browse the file system" at http://{Name of Hadoop server}:50070. It took us less than five minutes to set up this job, and everything worked out of the box exactly as expected.

For the next part of our test, we wanted to read data from Hadoop. We started by selecting a component from the Palette called tHDFSInput, which was to be our source. We configured it using the same settings as for the target – server name, file name, etc. For the data output we added a tLogRow component – an easy way to export data from the data stream to the system console. As soon as we had created a connection between the two icons (as described above), we were able to start the job and look at the content of our original CSV file on the screen. Hadoop and the Talend Enterprise Data Integration solution made light work of the information import and export process.

[Figure: Users can check the status of their jobs at any time on the Hadoop web interface]

Working with the data

We carried out the two jobs described above to make sure that the Talend suite was able to communicate seamlessly with our Hadoop system. Satisfied that this was the case, we decided to analyze the data. The objective was to read a particular customer number from a customer file with ten million data sets. Hadoop functionality was extremely beneficial for this task. We used the Talend tool to create code which we transferred to the Hadoop system to carry out the data queries. We saved the query result as a file in Hadoop.

At this point we need to go into the technical details. Hadoop uses the MapReduce algorithm to perform computations on large volumes of data. This is a framework for the parallel processing of queries using computer clusters. MapReduce involves two steps, the first of which is the mapping: the master node takes the input, divides it into smaller sub-queries and distributes these to nodes in the cluster. The sub-nodes may then either split the queries among themselves – leading to a tree-like structure – or else query the database for the answer and send it back to the master node. In the second step (reduce), the master node collects the answers and combines them to form the output – the answer to the original query. This parallelization of queries across multiple processors greatly improves process execution time.
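To make the two steps concrete, here is a minimal hand-written MapReduce job in Java that counts how often each product name occurs in a customer file – conceptually the same kind of aggregation we ran later in the test through Talend's Pig components. It is a sketch only, not code generated by Talend; the HDFS paths, the ";" field separator and the column position of the product name are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ProductCount {

        // Map step: each node reads its share of the rows and emits (product, 1)
        public static class ProductMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            protected void map(LongWritable key, Text row, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = row.toString().split(";");  // assumed field layout
                if (fields.length > 4) {
                    ctx.write(new Text(fields[4]), ONE);      // column 4 = product name (assumption)
                }
            }
        }

        // Reduce step: the partial answers come back together and are summed per product
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text product, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(product, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "product count");
            job.setJarByClass(ProductCount.class);
            job.setMapperClass(ProductMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/talend/customers.csv"));
            FileOutputFormat.setOutputPath(job, new Path("/user/talend/product_counts"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Writing this kind of program by hand for every query is exactly the effort that Pig and the Talend code generator are meant to save.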

Pig Latin in action

The Pig platform is used to create MapReduce programs running on Hadoop. It is so named because its task is to find truffles in the data sets. The associated programming language is called Pig Latin. Special programs would otherwise need to be written to use MapReduce, and the Talend Enterprise Data Integration code generator makes this task much easier. It delivers various functions that allow users to define the data sources, queries and targets using the familiar icons in the design workspace, generate the code (e.g. MapReduce or Pig Latin), send it to the Hadoop environment and execute it there.

For the test, the first thing we did was to create a component called tPigLoad to load the data we wished to analyze. To this we assigned configuration parameters such as the Hadoop server name, Hadoop version, user account, file to be analyzed and the schema that we had previously configured under metadata. We then created a tPigRow filter component, specifying the value each field should have in order to execute the query.

In relation to the schema, we would like to point out that since the source file consists of data like name, number, etc., Talend Enterprise Data Integration has to know which data belongs to which fields. The user can define the relevant fields as a schema under metadata and make this information available to the system.

We defined the output with an icon called tPigStoreResult, to which we added the target folder and the name of the result file. To finish, we created connections between the individual icons – not with the right-click action described earlier, but by right-clicking the relevant component and selecting the command Pig Combine in the Row menu. We did this because we wanted to create a script to be executed on the Hadoop system. We then started the job, and a short time later we were able to see the result on the Hadoop server's web interface, which was as expected. Our testing of the Pig components therefore ran completely smoothly.

For our next job, we wanted to use the entries in our customer file to determine the popularity of certain products. We started by copying the query job and replacing the tPigRow component with an icon called tPigAggregate. We specified that we wanted an output column called number and that the system should count all the product names in the database (these were indicated alongside the respective customer entries) and then write the names to a file, specifying the frequency of each name. We were able to see the result on our Hadoop server shortly after starting the job.

Working with Hive

Hive provides a JDBC connection to Hadoop and lets developers query Hadoop systems with an SQL-like syntax. To test Hive, we first created a new database connection under metadata to the customer database that was already in our Hadoop test system. All we had to do was select Hive as the database type, specify the server and port and click Check. On successful completion of the database connection test, we were able to see the connection in our Data Integration system and use it as an icon.

[Figure: A connection to a Hive database can be set up quickly and easily]

One of the configuration options for the Hive database connection is a Query field, where the user can type SQL queries. For our first query, we asked the customer database how many customers lived in Hannover. We started by entering "select count(*) from {database} where city like '%Hannover%'" in the database connection's query field.
Again, we used a tLogRow component as output and created a connection between the two icons, over which the system delivered the count value. Not long afterwards we were able to see the number of Hannover-based customers on the system console. So just like Pig, the Hive testing went extremely smoothly.
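The same query can also be issued from plain Java over Hive's JDBC interface, which is the mechanism the Hive connection is built on. The sketch below is an illustration only, not Talend's generated code; the driver class and URL correspond to the Hive server generation of the Hadoop 1.x era (newer installations use org.apache.hive.jdbc.HiveDriver and a jdbc:hive2:// URL), and the host name and the table name customers are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveCount {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver of the Hadoop 1.x era (assumed setup)
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://hadoop-server:10000/default", "", "");
            Statement stmt = con.createStatement();
            // The same count query the test typed into the component's Query field
            ResultSet rs = stmt.executeQuery(
                    "select count(*) from customers where city like '%Hannover%'");
            if (rs.next()) {
                System.out.println("Customers in Hannover: " + rs.getLong(1));
            }
            con.close();
        }
    }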

For our second Hive job, we attempted to write the entire database to an Excel table. First, we adapted the query in our source connection. Instead of the tLogRow component, we selected a tFileOutputExcel icon and specified the target path and the name of the target file. After that we right-clicked to connect the two entries. Shortly after starting the job, we found all the required data in an Excel spreadsheet on our workstation. Hive is a very useful technology for SQL administrators, and it is extremely easy to use in conjunction with Talend Enterprise Data Integration.

[Figure: Data queries with an SQL-like language and Hive]

HBase

HBase is a relatively easy to use, scalable database suited to managing large volumes of data in a Hadoop environment. Users rarely change the data in their HBase databases but frequently keep adding data to them.

To finish our test, we exported various data sets from our original CSV file to the HBase database on our Hadoop system and then read the data back to our system console. We started by creating a new job, dragging the icon with the source CSV file to the workspace. We then used a tMap component to filter the data destined for the database from the file. Finally, we created a tHBaseOutput icon. To configure this icon we had to enter the Hadoop version, server name and table name, and assign data to the relevant fields. When all the required connections had been made, we started the job and the data arrived in the database.

To check that everything had worked correctly, we extracted the data in the HBase environment to our system console. We did this with a component called tHBaseInput, configuring it in the same way as the output component. We completed the job configuration by adding a tLogRow icon and creating a connection between the two components. After starting the job, the data appeared on our screen as expected. Users of HBase can thus be reassured that their database works perfectly with Talend Enterprise Data Integration.
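For readers who want to see what this round trip looks like outside of Talend, the following hand-written Java sketch writes one row to HBase and reads it back using the client API of the HBase generation that matched our Hadoop 1.x setup (current HBase versions use Connection/Table and Put.addColumn instead). It is an illustration only; the ZooKeeper host, the table name customers, the column family data and the sample values are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRoundTrip {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "hadoop-server");  // assumed ZooKeeper host

            HTable table = new HTable(conf, "customers");          // assumed table name

            // Write one row, as the tHBaseOutput component does for each data set
            Put put = new Put(Bytes.toBytes("row-1"));
            put.add(Bytes.toBytes("data"), Bytes.toBytes("name"), Bytes.toBytes("Example GmbH"));
            put.add(Bytes.toBytes("data"), Bytes.toBytes("city"), Bytes.toBytes("Hannover"));
            table.put(put);

            // Read it back, as the tHBaseInput check does
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("data"), Bytes.toBytes("name"))));
            table.close();
        }
    }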

Summary

The Big Data Edition of Talend Enterprise Data Integration is a bridge between the data management tools of the past and the technology of tomorrow. Even without big data functionality, this solution provides an impressive range of functions for data integration, synchronization and transformation. And the big data functionality takes it into another league altogether. Pig support makes it easy for users to run distributed data queries in the cluster. Hive and HBase support mean that Talend Enterprise Data Integration can be deployed in just about any environment. Extensive data quality features and project management with a scheduling and monitoring framework round off the package. Talend Enterprise Data Integration works not only with the Apache Foundation's Hadoop distribution but also with solutions from Hortonworks, Cloudera, MapR and Greenplum. Database administrators and service providers will be hard pressed to find a better product.

Talend Enterprise Data Integration – Big Data Edition

A powerful set of tools to access, transform, move and synchronize data, with Hadoop support.

Advantages:
- Lots of functions
- Easy to use
- Support for Hadoop, Pig, HBase and Hive

Further information: Talend, www.talend.com