Scientific Computing Environments In The Age Of Virtualization

Toward a universal platform for the Cloud

Karim Chine
Cloud Era Ltd
Cambridge, UK

Abstract: This paper describes Biocep-R, an open source platform for the virtualization of Scientific Computing Environments (SCEs) such as R and Scilab. To our knowledge, it is the first time that a software platform enables geographically distributed collaborators to view and analyze terabytes of data interactively and collaboratively, using standard computational tools. Those tools can be running on high performance machines or on a Cloud. This is also the first time that a full end-to-end solution is proposed for reproducible computational research in a Cloud and for virtual appliances-based education.

Keywords: HPC; cloud computing; distributed computing; application virtualization; SaaS; Web Services; workflows; cyberinfrastructure; large scale data mining; collaborative data analysis; reproducible research; open source

I. INTRODUCTION

R is a language and environment for statistical computing and graphics. It is becoming the lingua franca of data analysis. Repositories of contributed R packages related to a variety of problem domains in life sciences, social sciences, finance, econometrics, chemometrics, etc. are growing at an exponential rate. Scilab is a scientific software package for numerical computations providing a powerful open computing environment for engineering and scientific applications. Biocep-R is a GPL Java platform that makes it possible to use R, Scilab or any other computational environment with an API (for example Sage, Octave, Root, Matlab, SAS, etc.) interactively on Clusters, Grids or private/public Clouds, and that enables the interoperability, pluggability, sharing and reuse of computing artifacts.

FIG. 1 Biocep-R Computational Open Platform

II. LOWERING THE BARRIERS FOR ACCESSING CYBERINFRASTRUCTURES. LOCAL/REMOTE TRANSPARENCY

The same application (the virtual workbench) makes it easy to connect to various environments locally or on remote machines, whether they are nodes of a grid or virtual machines of a Cloud. Using Java remoting technologies and HTTP relays, Biocep-R makes it possible to uniquely identify a remote generic computational engine with a simple URL. Switching from one resource to another (EGEE to TeraGrid to Amazon’s Cloud) becomes as simple as replacing one URL with another.

FIG. 2 R Virtualization

The virtual workbench has several dockable built-in views including:
- Consoles for issuing commands to the SCEs or to the server-side scripting interpreters (Python, Groovy, Ruby, etc.).
- Remote working directory browser.
- Syntax-highlighting-enabled code editors.
- Help viewers.
- Viewers for PDF, SVG, HTML, etc. files.
- Highly interactive server-side graphic devices (with built-in zooming, scrolling, coordinate tracking, etc.).
- Data inspectors.
- Server-side linked plots.
- Server-side spreadsheets that are fully integrated with the SCEs' functions and data, etc.
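The local/remote transparency described in this section, where a computational engine is identified by a URL and switching resources means replacing that URL, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the `Engine` class, its `evaluate` method and the two URLs are hypothetical stand-ins, not the Biocep-R API, and no network call is made.

```python
class Engine:
    """Hypothetical stand-in for a remote computational engine handle."""

    def __init__(self, url):
        # The URL is the only thing that identifies where the engine runs.
        self.url = url

    def evaluate(self, expression):
        # A real engine would run the expression in R or Scilab.
        # This stand-in only records where the expression was sent.
        return {"engine": self.url, "expression": expression}


def run_analysis(engine_url):
    """The analysis code is identical whatever resource hosts the engine."""
    engine = Engine(engine_url)
    return engine.evaluate("summary(x)")


# Hypothetical URLs: one local engine, one engine on a cloud instance.
local_result = run_analysis("http://localhost:8080/engine")
cloud_result = run_analysis("http://cloud.example.org/engine")
```

The two results differ only in the recorded engine URL, which is the point of the design: the analysis code does not change when the hosting resource does.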

Biocep-R is currently available on Amazon’s Elastic Cloud. Here are the steps a user needs to follow to perform computing on EC2 (see www.biocep.net for details):
- Sign up for EC2.
- Use Amazon’s Elasticfox Graphic User Interface (GUI) to browse the Biocep-R AMIs (Amazon Machine Images) and choose the one corresponding to the computing environment and libraries he needs.
- Choose an instance type (memory size, number of cores), provide his email in the user data and run the AMI.
- When the AMI starts running (2 minutes wait time), he receives an email containing a URL.
- The user can then:
  - Click on the URL. This runs the Biocep-R virtual workbench, which connects automatically to a computational engine on the running machine instance.
  - Drag and drop his R scripts and his data files from his desktop to the virtual working directory view.
  - Execute R scripts using the R console.
  - Drag and drop result files to his desktop.
  - Shut down the AMI when the session is no longer needed.

FIG. 3 R virtualization on the National Grid Service

III. DEALING WITH THE DATA DELUGE

The data generated by modern science tools can become too large to move easily from one machine to another. This can be an issue for large collaborative projects. The analysis of such data can’t be performed the way it has been so far. The answer to this increasingly acute problem is to take the computation to the data, and this is what Biocep-R enables cyberinfrastructures’ users to do. The generic computational engine can run on any machine that has privileged connectivity with the data storage machine, or within the large scale database. The user can connect his Biocep-R virtual workbench (or his scripts using the Biocep-R SOAP clients) to the computational engine, set the working directory to the location of the data (e.g. via NFS) and view or analyze the data using R/Scilab packages.

IV. ENABLING COLLABORATION WITHIN COMPUTING ENVIRONMENTS

Users can connect to the same remote engine and work with large scale data collaboratively using broadcast commands/graphics and collaborative spreadsheets. For example, the Amazon EC2 user can forward the email received from the running Biocep-R AMI to any number of his collaborators. By clicking on the same URL, they all get connected to the same computing environment. Every command issued by one of them is seen by all the others. Synchronized R graphics panels allow them to see the same graphics and to annotate them collaboratively. Chatting is enabled. Views based on a refactored iplots package enable collaborative highlighting and color brushing on a variety of high interaction graphics (linked plots).

Besides the virtual workbench, the RESTful Biocep-R server enables users to compute and generate graphics on HPC resources using only a browser. Simple URLs allow them to execute any script or to evaluate any expression by workers from a back-end computational engines pool and to retrieve the results either as text files or as graphics in any format (pdf, svg, jpeg, png, etc.).

FIG. 4 Collaborative R

V. SCIENCE GATEWAYS MADE EASY

Web-based interfaces and portals allowing scientists to use a Grid to solve their domain specific problems have always been difficult to develop, upgrade and maintain. We should have front ends that are easy to create. Biocep-R proposes a different paradigm for the creation and distribution of such front-ends to HPC/Cloud environments.

1) The Biocep-R Plug-ins

The Biocep-R platform defines a contract for creating new cross-platform statistical/numerical interfaces in Java Swing, either programmatically or using visual composition tools like the NetBeans GUI designer. The views can be bundled into zip files and opened by anyone using the Biocep-R virtual workbench. The views receive a Java interface that allows them to use the R/Scilab engine to which the workbench is connected and that can be running at any location. Three-part URLs (Biocep-R’s Java Web Start trigger, the computational engine’s URL parameter and the plug-in’s zip file URL parameter) can be used to deliver those GUIs to the end user. He retrieves them in one click and the only software required to be preinstalled on his machine is a Java runtime. Instead of requiring a connection to a server-side Grid/Cloud-enabled engine, the distribution URLs can be written to transparently trigger the creation of a computational engine on the user’s machine: a zipped version of R is copied to the user’s machine (with or without administrative privileges) and is used transparently by the GUI.

FIG. 5 GUI Plug-ins

2) The Biocep-R Spreadsheets

The Biocep-R spreadsheets are Java-based and were originally built using the OSS jspreadsheet. Unlike jspreadsheet, Calc and Excel’s spreadsheets, they have their models on the server side, are HPC and collaboration enabled and are fully connected with the remote statistical/numerical engine's workspace. This enables, for example, R data import/export from/to cells and the use of R functions in formula cells. Dedicated R functions (cells.get, cells.put, cells.select, etc.) allow the R user to retrieve the content of cell ranges into the R workspace or to update them programmatically: an R script can reproduce the spreadsheet entirely. A macro system allows the user to define listeners on R variables and on cell ranges and to define corresponding actions as R/Java scripts. Specific macros called datalinks allow the user to bi-directionally mirror R variables with cell ranges. R graphics and User Interface components can be docked onto cell ranges. UI components can be for example:
- Sliders mirroring R variables.
- Graphic panels showing R graphics (in any format) produced by user-defined R scripts and automatically updated when user-defined R variables change value or when cells within a user-defined list of cell ranges are updated.
- Buttons executing any user-defined R script, etc.

This spreadsheet enables scientists without programming skills to create sophisticated Grid/Cloud-based analytical views and dashboards and lowers the barriers for creating science gateways and distributing them.

VI. BRIDGING THE GAP BETWEEN EXISTING SCES AND GRIDS/CLOUDS

Once the user’s workbench is connected to a remote R/Scilab engine, a RESTful embedded server (local HTTP relay) enables third-party applications such as Emacs, OpenOffice Calc or Excel to access and use the Grid/Cloud-enabled engine. For example, an Excel add-in is being built to use the full capabilities of the platform and reproduce the features of the Biocep-R spreadsheets from within Excel. The bidirectional mirroring of server-side spreadsheets’ models into Excel cell ranges will also be available. This will allow users to overcome some of Excel's flaws (limited capabilities in statistical analysis, inaccurate numerical calculations at the limits of double precision, inconsistent identification of missing observations). Excel becomes a front-end of choice to Grid/Cloud resources and can then become the universal workbench for different sciences.

FIG. 6 R On Amazon’s Elastic Cloud EC2

VII. A UNIVERSAL COMPUTING TOOLKIT FOR SCIENTIFIC APPLICATIONS

Biocep-R frameworks and tools make it possible to use R as a Java object-oriented toolkit or as an RMI server. All the standard R objects have been mapped to Java and user-defined R classes can be mapped to Java on demand. R functions can be called from Java as if they were Java functions. The input parameters are provided as Java objects and the result of a function call is retrieved as a Java object. Calls to R functions from Java, locally or remotely, cope with local and distributed R objects.
The full capabilities of the platform are exposed via SOAP and RESTful front-ends. Several tools and frameworks are provided to help build analytical desktop/web applications and scalable data analysis pipelines in any programming language (Java, C#, C++, Perl, etc.).
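The idea in this section, that R functions can be called from a host language as if they were native functions, can be sketched in a few lines of Python. This is a simplified illustration, not the Biocep-R mechanism: the paper describes a typed mapping of R objects to Java objects, whereas the sketch below swaps in plain string building of R expressions. The names `to_r_literal`, `build_call` and `RProxy` are hypothetical, and the `evaluate` callable stands in for whatever transport (SOAP, REST, RMI) would carry the expression to an engine.

```python
def to_r_literal(value):
    """Render a Python value as R source text (finite numbers only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # checked before int: bool is an int subclass
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    if isinstance(value, (list, tuple)):
        return "c(" + ", ".join(to_r_literal(v) for v in value) + ")"
    raise TypeError(f"no R mapping for {type(value).__name__}")


def build_call(function_name, *args, **kwargs):
    """Build the text of an R function call from Python arguments."""
    parts = [to_r_literal(a) for a in args]
    parts += [f"{name} = {to_r_literal(v)}" for name, v in kwargs.items()]
    return f"{function_name}({', '.join(parts)})"


class RProxy:
    """Expose R functions as attributes: proxy.mean([1, 2, 3])."""

    def __init__(self, evaluate):
        # `evaluate` is an assumed callable that sends an expression
        # to an engine and returns its result.
        self._evaluate = evaluate

    def __getattr__(self, name):
        def call(*args, **kwargs):
            return self._evaluate(build_call(name, *args, **kwargs))
        return call
```

For example, `RProxy(print).round(3.14159, digits=2)` prints the expression `round(3.14159, digits = 2)` instead of sending it anywhere. A typed object mapping like the one the paper describes avoids the quoting and precision pitfalls of this string-based shortcut.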

VIII. SCALABILITY FOR COMPUTATIONAL BACK-ENDS

Biocep-R provides a pooling framework for distributed resources (RPF) allowing pools of computational engines to be deployed on heterogeneous nodes/virtual machine instances. These engines are managed and used via a simple borrow/return API for multithreaded web applications and web services, for distributed and parallel computing, for on-the-fly generation of dynamic content (analytic results, tables and graphics in various formats for thin web clients) and for the virtualization of computational engines in a shared computational resources context. The engines become agnostic to the hosting operating system. Several tools are provided to monitor and manage the pools programmatically or interactively (Supervisor UI). The pooling framework enables transparent cloudbursting: Amazon EC2 virtual machine instances hosting one or many computational engines can be fired up or shut down to scale up or down according to the load, in a highly scalable web application deployment for example.

FIG. 7 R engine pools deployment: Cloudbursting

IX. DISTRIBUTED COMPUTING MADE EASY

To solve computationally heavy problems, there is a need to use many engines in parallel. Several tools are available but they are difficult to install and beyond the technical skills of most scientists. Biocep-R solves this problem. From within a main R session and without installing any extra toolkits/packages, it becomes possible to create logical links to remote R/Scilab engines either by creating new processes or by connecting to existing ones on Grids/Clouds. Logical links are variables that allow the R/Scilab user to interact with the remote engines. rlink.console, rlink.get and rlink.put allow the user to respectively submit R commands to the R/Scilab worker referenced by the rlink, retrieve a variable from the R/Scilab worker’s workspace into the main R workspace and push a variable from the main R workspace to the worker’s workspace. All the functions can be called in synchronous or asynchronous mode. Several rlinks referencing R/Scilab engines running at any location can be used to create a logical cluster, which makes it possible to use several R/Scilab engines in a coordinated way. For example, a function called cluster.apply uses the workers belonging to a logical cluster in parallel to apply a function to large scale R data.

FIG. 8 Distributed Computing on EC2 (diagram labels: Amazon EC2 Hypervisor; Running Machine Instances X, Y and Z; NFS; EBS Volume; Virtual Workbench)

X. BRIDGING THE GAP BETWEEN MAINSTREAM SCES

The platform has a server-side extensions architecture that enables the creation of bridges between the remote computational engine and any third party tool. Besides R and Scilab, several widely used environments will be integrated in the future (Matlab, Root, SAS, etc.). Since R and Scilab are running within the same process (same Java Virtual Machine), it is easy and very fast to exchange data between them. This can be achieved, for example, by using the Groovy interpreter available as part of the remote engine. The Python client provided by the platform makes it possible for the SciPy community to use R/Scilab engines on Grids/Clouds directly from within their Python scripts.

XI. BRIDGING THE GAP BETWEEN MAINSTREAM SCES AND WORKFLOW WORKBENCHES

Biocep-R enables automatic exposure of R functions and packages as Web Services. The generated Web Services are easy to deploy and can use back-end computational engines running at any location. They can be seamlessly integrated as workflow nodes and used within environments such as Knime, Taverna or Pipeline Pilot. They can be stateless (an anonymous R worker performs the computation) or stateful (an R worker reserved and associated with a session ID is used and can be reused until the session is destroyed). The statefulness solves the overhead problem caused by the transfer of intermediate results between workflow nodes.

FIG. 9 Generated stateful Web Services workflows
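The difference between stateless and stateful services, built on a borrow/return pool of engines, can be made concrete with a minimal Python sketch. It is written under stated assumptions and is not the RPF API: `Engine`, `EnginePool` and all method names are hypothetical, and an engine is reduced to a named in-memory workspace.

```python
import queue
import uuid


class Engine:
    """Hypothetical stand-in for a computational engine with a workspace."""

    def __init__(self, name):
        self.name = name
        self.workspace = {}


class EnginePool:
    """Borrow/return pool supporting stateless and stateful calls."""

    def __init__(self, engines):
        self._idle = queue.Queue()
        for engine in engines:
            self._idle.put(engine)
        self._sessions = {}  # session ID -> reserved engine

    def borrow(self, timeout=None):
        # Blocks until an engine is idle (or raises queue.Empty on timeout).
        return self._idle.get(timeout=timeout)

    def give_back(self, engine):
        # Wipe the workspace so no state leaks to the next borrower.
        engine.workspace.clear()
        self._idle.put(engine)

    def idle_count(self):
        return self._idle.qsize()

    def call_stateless(self, func, *args):
        # An anonymous worker does the job and goes straight back.
        engine = self.borrow()
        try:
            return func(engine, *args)
        finally:
            self.give_back(engine)

    def open_session(self):
        # Reserve one engine and tie it to a fresh session ID.
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = self.borrow()
        return session_id

    def call_stateful(self, session_id, func, *args):
        # The reserved engine keeps intermediate results between calls.
        return func(self._sessions[session_id], *args)

    def close_session(self, session_id):
        self.give_back(self._sessions.pop(session_id))
```

In a workflow, each node of a stateful chain would pass only the session ID to the next node, so intermediate results stay in the reserved engine's workspace instead of travelling between nodes. The cost is that a reserved engine is unavailable to other callers until `close_session` is called.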

XII. THE BUILDING BLOCKS OF A PLATFORM FOR STATISTICS AND APPLIED MATHEMATICS EDUCATION

Besides being free and open source and therefore accessible to students and educators, Biocep-R provides education-friendly features that only proprietary software could offer so far (for example the centralized and controlled server-side deployment of the Scientific Computing Environments) and enables new scenarios and practices in the teaching of statistics and applied mathematics. With Biocep-R, it becomes possible for educators to hide the complexity of R, Scilab, Matlab, etc. with User Interfaces such as Biocep-R plug-ins/spreadsheets. These are very easy to create and to distribute to students. The User Interfaces reduce the complexity of the learning environment and keep beginning students away from the steep learning curves of R, Scilab or Matlab. Once created by one educator, the User Interfaces can be shared, reused and improved by other educators. Dedicated repositories can be provided to centralize the efforts and contributions of the community of educators and help them share the insight gained in using this new environment. One could envisage these methods being used from primary schools to graduate-level studies.

Virtual appliances (VMware/VirtualBox/Xen virtual machines) prepared by educators can be provided to students on USB keys. The virtual machines contain the SCE, the libraries used for the course and Biocep-R. The students only need to have Java and a virtual machine player (the free VMware player for example) installed on their laptops to run the virtual Biocep-R workbench and to connect to a computational engine on the virtual machine. The virtual appliance is fully self-contained: the code needed to run the workbench or the plug-ins prepared by the educator is delivered by the virtual appliance itself thanks to the Biocep-R code server that runs at startup. The interaction between the student and the SCE as well as the artifacts he produces are saved within the Biocep-R-enabled virtual machine. The educator can retrieve the USB keys used by the students and check not only the validity of the different intermediate results they obtained but also the path they followed to get those results.

The collaboration capabilities of the virtual workbench also open new perspectives in distributed learning. The educator can connect at any time to the SCEs of students at any location. He can then see/update their environments and guide them remotely. Collaborative problem solving also becomes possible and can be used as a support for learning.

XIII. THE BUILDING BLOCKS OF A TRACEABLE AND REPRODUCIBLE COMPUTATIONAL RESEARCH PLATFORM

We provide a system so that the computational environment, the data and the manipulations of the data (scripts, applications) can be recorded. These can be used by reviewers, collaborators and anyone wanting to investigate the data. Biocep-R provides an end-to-end solution for traceable and reproducible computational research. Snapshots of computational environments can be created as virtual machine images (AMIs). Snapshots of versioned libraries and working directories can be created as Elastic Block Stores (EBSs). The Biocep-R virtual workbench makes it possible for all scientists to work with these snapshots (AMIs and EBSs) and to produce them easily. By providing the Biocep-R-enabled AMI identifier, the identifiers of the complementary computational libraries EBS snapshots and the identifier of the working directory (data) EBS snapshot that have been used for his research, the scientist makes it possible for anyone to rebuild all the data and the computational environment required to process that data.

FIG. 10 Biocep-R within the Technology Ecosystem (diagram labels: Hardware; Hypervisor; Operating System; Java Virtual Machine; Biocep; High-level Web Services; Virtual Workbench; Collaborative Data Analysis; User 1; User 2; Developer)

XIV. CONCLUSION

This new environment has the potential to democratize the cloud and to push forward the reproducibility of computational research. Its current availability and easy access on Amazon's Elastic Cloud and its planned deployments on major Grids (NGS, EGEE, TeraGrid) maximize its chances for uptake and adoption. Academia, industry and educational institutions would benefit from the emergence of a new environment for the interoperability, sharing and reuse of computational artifacts. The creation and sharing of analytical tools and resources can become accessible to anyone (open science). An international portal for on demand computing (www.elasticr.net) is being built using the different frameworks provided by Biocep-R and could become a single point of access to virtualized SCEs on public servers and on virtual appliances that are ready for use on various clouds. There is no question about the need for more usability in the computational landscape. Java, Xen, EC2, R and Biocep-R prove that the target of a universal computational environment for science and for everyone is definitely within reach.

REFERENCES

[1] R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
[2] Hardt, M., Seymour, K., Dongarra, J., Zapf, M., Ruiter, N.V. "Interactive Grid-Access Using Gridsolve and Giggle," Computing and Informatics, Vol. 27, No. 2, 233-248, ISSN 1335-9150, 2008.
[3] http://www.scilab.org
[4] Theus, M. and Urbanek, S. (2008) Interactive Graphics for Data Analysis: Principles and Examples, CRC Press, ISBN 978-1-5848-85948

Karim Chine, Cloud Era Ltd, Cambridge, UK, karim.chine@polytechnique.org