TIBCO Spotfire S Big Data User’s Guide


Big Data User’s Guide for TIBCO Spotfire S 8.2
November 2010
TIBCO Software Inc.

IMPORTANT INFORMATION

SOME TIBCO SOFTWARE EMBEDS OR BUNDLES OTHER TIBCO SOFTWARE. USE OF SUCH EMBEDDED OR BUNDLED TIBCO SOFTWARE IS SOLELY TO ENABLE THE FUNCTIONALITY (OR PROVIDE LIMITED ADD-ON FUNCTIONALITY) OF THE LICENSED TIBCO SOFTWARE. THE EMBEDDED OR BUNDLED SOFTWARE IS NOT LICENSED TO BE USED OR ACCESSED BY ANY OTHER TIBCO SOFTWARE OR FOR ANY OTHER PURPOSE.

USE OF TIBCO SOFTWARE AND THIS DOCUMENT IS SUBJECT TO THE TERMS AND CONDITIONS OF A LICENSE AGREEMENT FOUND IN EITHER A SEPARATELY EXECUTED SOFTWARE LICENSE AGREEMENT, OR, IF THERE IS NO SUCH SEPARATE AGREEMENT, THE CLICKWRAP END USER LICENSE AGREEMENT WHICH IS DISPLAYED DURING DOWNLOAD OR INSTALLATION OF THE SOFTWARE (AND WHICH IS DUPLICATED IN TIBCO SPOTFIRE S LICENSES). USE OF THIS DOCUMENT IS SUBJECT TO THOSE TERMS AND CONDITIONS, AND YOUR USE HEREOF SHALL CONSTITUTE ACCEPTANCE OF AND AN AGREEMENT TO BE BOUND BY THE SAME.

This document contains confidential information that is subject to U.S. and international copyright laws and treaties. No part of this document may be reproduced in any form without the written authorization of TIBCO Software Inc.

TIBCO Software Inc., TIBCO, Spotfire, TIBCO Spotfire S, Insightful, the Insightful logo, the tagline "the Knowledge to Act," Insightful Miner, S, S-PLUS, TIBCO Spotfire Axum, S ArrayAnalyzer, S EnvironmentalStats, S FinMetrics, S NuOpt, S SeqTrial, S SpatialStats, S Wavelets, S-PLUS Graphlets, Graphlet, Spotfire S FlexBayes, Spotfire S Resample, TIBCO Spotfire Miner, TIBCO Spotfire S Server, TIBCO Spotfire Statistics Services, and TIBCO Spotfire Clinical Graphics are either registered trademarks or trademarks of TIBCO Software Inc. and/or subsidiaries of TIBCO Software Inc. in the United States and/or other countries. All other product and company names and marks mentioned in this document are the property of their respective owners and are mentioned for identification purposes only. This software may be available on multiple operating systems. However, not all operating system platforms for a specific software version are released at the same time. Please see the readme.txt file for the availability of this software version on a specific operating system platform.

THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. THIS DOCUMENT COULD INCLUDE TECHNICAL INACCURACIES OR TYPOGRAPHICAL ERRORS. CHANGES ARE PERIODICALLY ADDED TO THE INFORMATION HEREIN; THESE CHANGES WILL BE INCORPORATED IN NEW EDITIONS OF THIS DOCUMENT. TIBCO SOFTWARE INC. MAY MAKE IMPROVEMENTS AND/OR CHANGES IN THE PRODUCT(S) AND/OR THE PROGRAM(S) DESCRIBED IN THIS DOCUMENT AT ANY TIME.

Copyright 1996-2010 TIBCO Software Inc. ALL RIGHTS RESERVED. THE CONTENTS OF THIS DOCUMENT MAY BE MODIFIED AND/OR QUALIFIED, DIRECTLY OR INDIRECTLY, BY OTHER DOCUMENTATION WHICH ACCOMPANIES THIS SOFTWARE, INCLUDING BUT NOT LIMITED TO ANY RELEASE NOTES AND "READ ME" FILES.

TIBCO Software Inc. Confidential Information

Reference
The correct bibliographic reference for this document is as follows:
Big Data User’s Guide for TIBCO Spotfire S 8.2, TIBCO Software Inc.

Technical Support
For technical support, please visit http://spotfire.tibco.com/support and register for a support account.

TIBCO SPOTFIRE S BOOKS

Note about Naming
Throughout the documentation, we have attempted to distinguish between the language (S-PLUS) and the product (Spotfire S).
• “S-PLUS” refers to the engine, the language, and its constituents (that is, objects, functions, expressions, and so forth).
• “Spotfire S” refers to all and any parts of the product beyond the language, including the product user interfaces, libraries, and documentation, as well as general product and language behavior.

The TIBCO Spotfire S documentation includes books to address your focus and knowledge level. Review the following table to help you choose the Spotfire S book that meets your needs. These books are available in PDF format in the following locations:
• In your Spotfire S installation directory (SHOME\help on Windows, SHOME/doc on UNIX/Linux).
• In the Spotfire S Workbench, from the Help > Spotfire S Manuals menu item.
• In Microsoft Windows, in the Spotfire S GUI, from the Help > Online Manuals menu item.

Spotfire S documentation.

Information you need if you...
    See the...

Must install or configure your current installation of Spotfire S; review system requirements.
    Installation and Administration Guide

Want to review the third-party products included in Spotfire S, along with their legal notices and licenses.
    Licenses

Are new to the S language and the Spotfire S GUI, and you want an introduction to importing data, producing simple graphs, applying statistical models, and viewing data in Microsoft Excel.
    Getting Started Guide

Are a new Spotfire S user and need to learn how to use Spotfire S, primarily through the GUI.
    User’s Guide

Are familiar with the S language and Spotfire S, and you want to use the Spotfire S plug-in, or customization, of the Eclipse Integrated Development Environment (IDE).
    Spotfire S Workbench User’s Guide

Have used the S language and Spotfire S, and you want to know how to write, debug, and program functions from the Commands window.
    Programmer’s Guide

Are familiar with the S language and Spotfire S, and you want to extend its functionality in your own application or within Spotfire S.
    Application Developer’s Guide

Are familiar with the S language and Spotfire S, and you are looking for information about creating or editing graphics, either from a Commands window or the Windows GUI, or using Spotfire S supported graphics devices.
    Guide to Graphics

Are familiar with the S language and Spotfire S, and you want to use the Big Data library to import and manipulate very large data sets.
    Big Data User’s Guide

Want to download or create Spotfire S packages for submission to the Comprehensive S-PLUS Archive Network (CSAN) site, and need to know the steps.
    Guide to Packages

Are looking for categorized information about individual S-PLUS functions.
    Function Guide

Are familiar with the S language and Spotfire S, and you need a reference for the range of statistical modelling and analysis techniques in Spotfire S. Volume 1 includes information on specifying models in Spotfire S, on probability, on estimation and inference, on regression and smoothing, and on analysis of variance.
    Guide to Statistics, Vol. 1

Are familiar with the S language and Spotfire S, and you need a reference for the range of statistical modelling and analysis techniques in Spotfire S. Volume 2 includes information on multivariate techniques, time series analysis, survival analysis, resampling techniques, and mathematical computing in Spotfire S.
    Guide to Statistics, Vol. 2

CONTENTS

Chapter 1  Introduction to the Big Data Library  1
    Introduction  2
    Working with a Large Data Set  3
    Size Considerations  7
    The Big Data Library Architecture  8

Chapter 2  Census Data Example  21
    Introduction  22
    Exploratory Analysis  25
    Data Manipulation  36
    More Graphics  40
    Clustering  44
    Modeling Group Membership  52

Chapter 3  Analyzing Large Datasets for Association Rules  59
    Introduction  60
    Big Data Association Rules Implementation  62
    Association Rule Sample  73
    More information  77

Chapter 4  Creating Graphical Displays of Large Data Sets  79
    Introduction  80
    Overview of Graph Functions  81
    Example Graphs  87

Chapter 5  Advanced Programming Information  123
    Introduction  124
    Big Data Block Size Issues  125
    Big Data String and Factor Issues  131
    Storing and Retrieving Large S Objects  137
    Increasing Efficiency  139

Appendix: Big Data Library Functions  141
    Introduction  142
    Big Data Library Functions  143

Index  181

INTRODUCTION TO THE BIG DATA LIBRARY  1

Introduction  2
Working with a Large Data Set  3
    Finding a Solution  3
    No 64-Bit Solution  5
Size Considerations  7
    Summary  7
The Big Data Library Architecture  8
    Block-based Computations  8
    Data Types  11
    Classes  14
    Functions  15
    Summary  19

Chapter 1 Introduction to the Big Data Library

INTRODUCTION

In this chapter, we discuss the history of the S language and large data sets and describe improvements that the Big Data library presents. This chapter discusses data set size considerations, including when to use the Big Data library. The chapter also describes in further detail the Big Data library architecture: its data objects, classes, functions, and advanced operations.

To use the Big Data library, you must load it as you would any other library provided with Spotfire S: that is, at the command prompt, type library(bigdata).
• To ensure that the library is always loaded on startup, add library(bigdata) to your SHOME/local/S.init file.
• Alternatively, in the Spotfire S GUI for Microsoft Windows, you can set this option in the General Settings dialog box.
• In the Spotfire S Workbench, you can set this option in the Spotfire S section of the Preferences dialog box, available from the Window menu.

WORKING WITH A LARGE DATA SET

When it was first developed, the S programming language was designed to hold and manipulate data in memory. Historically, this design made sense; it provided faster and more efficient calculations and modeling by not requiring the user’s program to access information stored on the hard drive. Since then, data set sizes have grown faster than RAM sizes; consequently, S program users could encounter an error similar to the following:

    Problem in read.table: Unable to obtain requested dynamic memory.

This error occurs because Spotfire S requires the operating system to provide a block of memory large enough to contain the contents of the data file, and the operating system responds that not enough memory is available.

While Spotfire S can access data contained in virtual memory, the maximum size of data files depends on the amount of virtual memory available to Spotfire S, which depends in turn on the user’s hardware and operating system. In typical environments, the available virtual memory limits your data file size, and exceeding that limit produces an out-of-memory error.

Finally, you can also encounter an out-of-memory error after successfully reading in a large data object, because many S functions require one or more temporary copies of the source data in RAM for certain manipulation or analysis functions.

Finding a Solution

S programmers with large data sets have historically dealt with memory limitations in a variety of ways. Some opted to use other applications, and some divided their data into “digestible” batches and then recombined the results. For S programmers who like the flexibility and elegant syntax of the S language and the support provided to owners of a Spotfire S license, the option to analyze and model large data sets in S has been a long-awaited enhancement.

Out-of-Memory Processing

The Big Data library provides this enhancement by processing large data sets using scalable algorithms and data streaming. Instead of loading the contents of a large data file into memory, Spotfire S creates a special binary cache file of the data on the user’s hard disk, and then refers to the cache file on disk. This out-of-memory design requires relatively small amounts of RAM, regardless of the total size of the data.

Scalable Algorithms

Although the large data set is stored on the hard drive, the scalable algorithms of the Big Data library are designed to optimize access to the data, reading from disk a minimum number of times. Many techniques require a single pass through the data, and the data is read from the disk in blocks, not randomly, to minimize disk access times. These scalable algorithms are described in more detail in the section The Big Data Library Architecture on page 8.

Data Streaming

Spotfire S operates on the binary cache file directly, using “streaming” techniques, where data flows through the application rather than being processed all at once in memory. The cache file is processed on a row-by-row basis, meaning that only a small part of the data is stored in RAM at any one time. It is this out-of-memory data processing technique that enables Spotfire S to process data sets hundreds of megabytes, or even gigabytes, in size without requiring large quantities of RAM.

Data Type

Spotfire S provides the large data frame, an object of class bdFrame. A big data frame object is similar in function to a standard S-PLUS data frame, except that its data is stored in a cache file on disk, rather than in RAM. The bdFrame object is essentially a reference to that external file: while you can create a bdFrame object that represents an extremely large data set, the bdFrame object itself requires very little RAM.

For more information on bdFrame, see the section Data Frames on page 11.

Spotfire S also provides time date (bdTimeDate), time span (bdTimeSpan), and series (bdSeries, bdSignalSeries, and bdTimeSeries) support for large data sets.
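The out-of-memory approach described above (a single sequential pass, reading the data in blocks so that only a small part is in RAM at any time) can be illustrated with a short sketch. The example below is written in Python rather than S, and it is not the Big Data library's implementation or API: the function names read_blocks, streaming_mean, and write_demo_file, the CSV file layout, and the block sizes are all assumptions made for illustration.

```python
import csv
import itertools
import os
import tempfile


def read_blocks(path, block_size):
    """Yield lists of at most block_size rows, read sequentially from a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls the next block_size rows without loading the rest.
            block = list(itertools.islice(reader, block_size))
            if not block:
                return
            yield block


def streaming_mean(path, column, block_size=1000):
    """Compute the mean of one column in a single pass, one block at a time."""
    total = 0.0
    count = 0
    for block in read_blocks(path, block_size):
        # Only the current block is held in memory; the running totals
        # carry the state from one block to the next.
        total += sum(float(row[column]) for row in block)
        count += len(block)
    if count == 0:
        raise ValueError("no rows found")
    return total / count


def write_demo_file(n_rows):
    """Write a small demo CSV (hypothetical data) and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])
        for i in range(1, n_rows + 1):
            writer.writerow([i, i * 0.5])
    return path


if __name__ == "__main__":
    demo_path = write_demo_file(10_000)
    try:
        print(streaming_mean(demo_path, "value", block_size=512))
    finally:
        os.remove(demo_path)
```

In this sketch, memory use depends on the block size rather than on the number of rows in the file, which is the property the text attributes to the streaming design. Statistics that cannot be accumulated from running totals would need a different strategy, such as more than one pass.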
For more information on the time date, time span, and series classes, see the section Time Date Creation on page 175 in the Appendix.

Flexibility

The Big Data library provides reading, manipulating, and analyzing capability for large data sets using the familiar S programming language. Because most existing data frame methods work in the same way with bdFrame objects as they do with data.frame objects, the style of programming is familiar to Spotfire S programmers. Much existing code from previous versions of Spotfire S runs without modification in the Big Data library, and only minor modifications are needed to take advantage of the big-data capabilities of the pipeline engine.

Balancing Scalability with Performance

While accessing data on disk (rather than in RAM) allows for scalable statistical computing, some compromises are inevitable. The most obvious of these is computation speed. The Big Data library provides scalable algorithms that are designed to
