MEETING BIG DATA CHALLENGES WITH EMC ISILON


MEETING BIG DATA CHALLENGES WITH EMC ISILON STORAGE SYSTEMS
Anuj Sharma

Contents
Abstract
Big Data Challenges and EMC Isilon Storage Systems
Big Data Value-Add To Business
OneFS Architecture
Which EMC Isilon Cluster to choose?
EMC Isilon Cluster Networking Best Practices
EMC Isilon Smart Connect Internals
SmartConnect Architecture Example
EMC Isilon Smart Quotas Internals
EMC Isilon and vSphere Integration Best Practices
EMC Isilon SyncIQ Architecture and Tips
EMC Isilon NDMP Backup Configuration for EMC NetWorker
Cluster Performance Tuning
EMC Isilon Cluster Maintenance
References
Glossary

Figures
Figure 2: Big Data Sources
Figure 3: Isilon Node Types
Figure 5: Big Data Enabled Property and Casualty Insurance Policy Premium Factors
Figure 6: OneFS vs. Traditional File Systems
Figure 7: OneFS vs. Traditional File Systems
Figure 8: Isilon Cluster
Figure 9: OneFS Protection
Figure 10: 10GigE Networking with Accelerator Node
Figure 11: Redundant Internal Network Topology
Figure 12: SmartConnect Communication
Figure 14: SmartConnect Configuration
Figure 15: Optimizing Isilon NFS for VM I/O Operations
Figure 16: Isilon NFS Architecture
Figure 17: Isilon iSCSI Architecture
Figure 19: Direct NDMP Method
Figure 20: Remote NDMP Model

Disclaimer: The views, processes, or methodologies published in this article are those of the author. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.

2012 EMC Proven Professional Knowledge Sharing

Abstract

What is Big Data? Big Data does not refer to a specific type of data; every kind of unstructured data can be considered big data when a single file size is in terabytes. Digital data is growing at an exponential rate today, and “big data” is the new buzzword in IT circles. According to International Data Corporation (IDC), the amount of digital data created and replicated will surpass 1.8 zettabytes (1.8 trillion GB) in 2011, having grown by a factor of nine in just five years. The information we deal with today is very different from the information that we used to deal with 20-30 years ago. Chip manufacturers render terabyte files, oil and gas exploration companies deal with terabytes of data to be analyzed, advancements in healthcare have led to the creation of high-definition 4-D imaging files ranging up to terabytes, and NASA deals with a large number of files with sizes in terabytes. Social networking sites and online communities generate data in huge numbers. It seems no industry is safe from massive data growth, and the storage implications are profound.

Figure 1: Data Growth

The massive amount of rich unstructured file data generated by richer file formats and Internet Era computing is creating a demand for new and innovative scale-out file storage solutions to economically scale bandwidth and performance to previously unheard-of capacities.

I have seen many organizations relying on scale-up storage for Big Data storage and analytics. However, in the long run, companies are bound to encounter performance and storage problems using scale-up storage systems for Big Data storage. Scale-up storage systems are monolithic: a large amount of storage sits behind one or two file server heads and is designed to scale to the multi-TB range behind those heads. Once the storage and performance limits are reached, a new monolithic storage system must be added. Because the existing file systems residing on the previous scale-up storage system cannot be expanded to use the storage of the new system, a new file system needs to be managed, even if only minimal incremental capacity is needed. This is one of the problems that enterprises run into while dealing with Big Data using scale-up storage systems. A single file system is limited to terabytes in the case of scale-up storage systems, and file system migration is often a painful exercise that requires downtime.

Also, traditional analysis tools cannot be used to analyze Big Data. Data needs to be mined in real time and results need to be published. For example, a retailer can see instantly which stores are most profitable, which item is in demand, the consumer choices per region, and so on. Big Data analysis is a critical factor that plays a significant role in the future business decisions of the organization. Parallel processing is required to mine data of such huge volumes simultaneously. Consequently, scale-out storage systems are the best candidates for high parallel processing power.

Scale-out storage architectures differ significantly from monolithic scale-up storage architectures (e.g. traditional NAS or SAN systems) that were developed to meet distributed computing needs. “Scale-out NAS” systems are designed from the ground up to scale dynamically and economically and to support extremely high-bandwidth applications dealing with multi-terabyte files, often referred to as Big Data. The EMC Isilon storage system is the world leader in the scale-out NAS category.

Figure 2: Big Data Sources

In this article we will discuss why scale-up storage is not able to meet the performance, cost, and storage requirements of Big Data efficiently and how scale-out storage successfully meets all the requirements of Big Data. We also discuss an example of how Big Data can add value to business. Additionally, the following areas will be covered:

- Big Data Challenges
- EMC Isilon Storage Systems and OneFS Architecture
- EMC Isilon Storage Systems Features such as SyncIQ, SmartConnect, SmartQuotas
- Best Practices for the Implementation of EMC Isilon Storage Clustered Systems
- Best Practices for vSphere 4 Integration with EMC Isilon Storage Systems
- Cluster Maintenance
- EMC Isilon NDMP Backup Configuration
- And much more

Big Data Challenges and EMC Isilon Storage Systems

- Unstructured Data is being generated at exponential rates
  The pace of data growth requires storage that can scale on demand. Typically, storage is purchased with a view to future or peak requirements. Most often, we end up spending more than necessary, because the cost of the equipment declines over time. EMC Isilon provides the benefit of adding storage on demand instead of buying it all at once; start with minimal nodes and then scale out over time.

- Seismic applications, NASA satellite imaging, and high performance video rendering applications require storage that supports huge IOPS and data transfer throughput
  EMC Isilon has a scale-out architecture; whenever a node is added, storage and compute are also added. Hence, compute increases linearly as nodes are added to an EMC Isilon cluster. EMC accelerator nodes can be added to increase compute and data transfer throughput. For example, using Isilon IQ, NASA has been able to consolidate more than 8,000 large Landsat 7 satellite images into a single volume and single file system while providing high-performance, concurrent access to i-cubed geoprocessing applications.

- File System Size required in Petabytes
  Some Big Data applications require that petabyte-size data be stored in a single file system. EMC Isilon OneFS spans the EMC Isilon cluster nodes and presents the application with one petabyte-scale file system spread across the nodes.

- Big Data analytics requirements
  Analyzing Big Data in real time requires storage that can withstand the simultaneous read and write requirements of the analytical engine. Owing to the architecture of EMC Isilon, data can be analyzed almost in real time, so that organizations can look at the analytics as they are produced and make quick decisions. EMC Isilon addresses all the requirements of the Big Data analytical application.

  For example, for oil and gas exploration companies, the cost of exploration continues to skyrocket, making the rapid analysis of exploratory data essential in order to stay ahead of competitors. Companies cannot afford to have crews sit idle while drill/no-drill decision-making data is being analyzed. To speed up computational data analysis workflows, the analytical applications of oil and gas organizations require storage such as EMC Isilon with the latest multi-core processors, petabytes of storage, the fastest-available networking, and the intelligence to divide and handle workloads across an array of compute nodes.

- Data transfer types for Big Data can be random or sequential
  Depending on the data transfer type, organizations can choose from different Isilon models, which gives them the flexibility to select the model that best meets their requirements.

- Tight Backup Window
  As data grows exponentially, so does the time required to back it up. EMC Isilon has NDMP accelerator nodes that increase the NDMP backup throughput, thus reducing the backup window.

- Isilon Solves Media Industry Challenges
  - Rendering/compositing/encoding jobs no longer need to be scheduled and queued based on storage limitations.
  - Artists do not need to wait for one job to complete before another can begin. Nor do they need to determine what volume or drive a particular file resides on or where it should be written.
  - Data wranglers are no longer needed to manually move files and processes from over-taxed drives.
  - No downtime is required when more space is added.
  - And most importantly, heavy workloads with concurrent access patterns will not degrade the performance of an Isilon IQ cluster.

- High Performance Computing Challenges Solved
  High Performance Computing (HPC) applications need multiple processors, memory modules, and data paths. HPC needs parallel data services, which break up single files and deliver them in pieces in parallel. Isilon meets all the requirements of High Performance Computing, i.e. multiple computing nodes can access the data in parallel from the cluster nodes and perform the desired operations on the data efficiently and at faster rates. Isilon prevents storage from becoming a bottleneck in High Performance Computing.

EMC Isilon is designed with a view toward addressing all of the Big Data challenges above. Organizations can mix and match various hardware elements depending on specific needs. For example, the Isilon S-Series delivers the performance needed for IOPS-intensive applications, the X-Series is ideal for high-concurrency and sequential throughput workflows, and the NL-Series provides economical storage that enables organizations to keep data online and available for longer periods of time. This article will look at implementing EMC Isilon features using best practices to get the best performance out of EMC Isilon systems.

Figure 3: Isilon Node Types
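The node-series guidance above can be written down as a simple lookup. This is only an illustrative sketch: the workload labels (`iops_intensive` and so on) and the function name are hypothetical names chosen for this example, and the mapping encodes nothing beyond the three sentences in the text.

```python
# Hypothetical workload labels mapped to the Isilon series named in the text.
NODE_SERIES_BY_WORKLOAD = {
    "iops_intensive": "S-Series",               # IOPS-intensive applications
    "high_concurrency_sequential": "X-Series",  # concurrent / sequential throughput
    "nearline_archive": "NL-Series",            # economical, long-term online storage
}


def suggest_series(workload: str) -> str:
    """Return the series the article associates with a workload label."""
    try:
        return NODE_SERIES_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload label: {workload!r}") from None


print(suggest_series("iops_intensive"))
```

A real sizing exercise would weigh capacity, throughput, and cost together; the lookup only mirrors the rule of thumb stated above.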

Big Data Value-Add To Business

Business houses, corporations, and enterprises are dealing with a huge amount of unstructured data. This data can turn out to be a real value-add in terms of revenue to the organizations. There are many use cases where Big Data can do wonders for an organization.

The insurance industry can benefit from Big Data analytics by analyzing large amounts of data almost in real time.

Typically, to generate a quote, an insurance company will judge the premium by the application form and the credit history of the individual. Thus, insurance companies depend on the data that the applicant fills in and on the credit history.

Figure 4: Traditional Property and Casualty Insurance Policy Premium Factors [surviving figure label: application form]

Now, with the power of Big Data analytics, insurance companies can analyze the factors below to decide on the insurance premium.

Figure 5: Big Data Enabled Property and Casualty Insurance Policy Premium Factors [surviving figure labels: application form, social networking data analysis, insurance premium]

Suppose a request is made for an auto insurance quote. The insurance company can use big data analytics to calculate the insurance premium by analyzing the data points below, which go beyond the typical application form data.

- An individual purchases a new car and requests an insurance premium quote. His previous car, insured by the same insurance company, was fitted with a telematics device. The device provides data points such as the speed at which the driver drives the car, accidents, fuel economy, rapid acceleration, average speed, highway speeds, and city speeds. The big data analytical software can grade the insurance seeker by comparing these data points with the poor, average, good, and excellent values decided by the insurance company. For example, the company can set ratings (a safe driver has a 5-star rating, a poor driver has a 1-star rating) and factor these ratings into the premium calculation. Consequently, companies are able to increase or decrease the premium amount according to the driving behavior of the quote seeker.

- In addition to the data points above, data analytical software can take data from social networking sites such as Facebook, Twitter, and YouTube for the quote seeker. For example, their YouTube activity may show that they liked or shared videos related to Formula 1 racing or car stunts, which would influence the quote calculation. Similarly, Facebook status updates such as “car bumped into other car” or “touched 150 miles/hour on the highway” can also be used by analytical software for calculating the premium.

User    Driving Rating  Social Networking Analysis Rating  Average Premium  Big Data Analytics Calculated Premium
User A  5               4                                  800              600
User B  1               1                                  800              1200

These examples provide an overview of how companies can use big data to add value to their business while also benefiting users. To store this big data efficiently and economically and analyze it in real time, EMC Isilon is the storage system that companies should consider.
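The rating-based adjustment described for User A and User B can be sketched in a few lines. The article does not give the insurer's actual formula, so everything here is an assumption made for illustration: the ratings are averaged, and the thresholds and multipliers (0.75 and 1.5) are hypothetical values picked only so that the sketch agrees with the two example rows.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    name: str
    driving_rating: int  # 1 (poor) to 5 (safe), derived from in-car device data
    social_rating: int   # 1 to 5, derived from social networking analysis


def premium_multiplier(combined_rating: float) -> float:
    """Map a combined star rating to a premium multiplier.

    Thresholds and multipliers are hypothetical, not taken from the article.
    """
    if combined_rating >= 4.0:
        return 0.75  # assumed low-risk discount
    if combined_rating <= 2.0:
        return 1.5   # assumed high-risk surcharge
    return 1.0       # no adjustment for mid-range ratings


def calculated_premium(applicant: Applicant, average_premium: float) -> float:
    """Adjust the average premium using the mean of the two ratings."""
    combined = (applicant.driving_rating + applicant.social_rating) / 2
    return average_premium * premium_multiplier(combined)


user_a = Applicant("User A", driving_rating=5, social_rating=4)
user_b = Applicant("User B", driving_rating=1, social_rating=1)
print(calculated_premium(user_a, 800))  # 600.0
print(calculated_premium(user_b, 800))  # 1200.0
```

A production system would more plausibly use a fitted risk model than fixed thresholds; the step function is used here only because it is easy to read.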

OneFS Architecture

Figure 6: OneFS vs. Traditional File Systems

OneFS eliminates the need for a separate file system, volume manager, and RAID system.

OneFS runs on a scale-out NAS architecture across the cluster of Isilon nodes. It creates a single namespace and file system on each Isilon cluster. OneFS is spread across all the nodes in the cluster. All information is shared among nodes; the entire file system is accessible by clients connecting to any node in the cluster. Because all nodes in the cluster are peers, the Isilon clustered storage system does not have any master or slave nodes. All data is striped across all nodes in the cluster. As nodes are added, the file system grows dynamically and content is redistributed. Each Isilon storage node contains globally coherent RAM, meaning that, as a cluster becomes larger, it also becomes faster. Each time a node is added, the cluster's concurrent performance scales linearly.
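To make the idea of striping and redistribution concrete, here is a toy model. It is not the OneFS layout algorithm, which the article does not describe; it simply assumes fixed-size stripe units placed round-robin on peer nodes, and the function name is hypothetical.

```python
from collections import Counter


def stripe_across_nodes(num_stripe_units: int, num_nodes: int) -> Counter:
    """Place stripe units on nodes round-robin and count units per node."""
    placement = Counter()
    for unit in range(num_stripe_units):
        placement[unit % num_nodes] += 1
    return placement


# The same 120 stripe units before and after a fourth node joins the cluster.
before = stripe_across_nodes(num_stripe_units=120, num_nodes=3)
after = stripe_across_nodes(num_stripe_units=120, num_nodes=4)
print(dict(before))  # {0: 40, 1: 40, 2: 40}
print(dict(after))   # {0: 30, 1: 30, 2: 30, 3: 30}
```

In this simplified picture each node holds a smaller share of the data once a node is added, which is the intuition behind content being redistributed as the cluster grows.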

Figure 7: OneFS vs. Traditional File Systems

- On-the-fly node expansion
  Adding a new node requires no downtime and takes under 60 seconds. Scaling a cluster requires no reconfiguration, no changes to server or client mount points, and no application changes.

- OneFS filesystem scalability
  OneFS can scale to 15.5 PB of storage in a single file system, so there is no need to create small volumes or logical units. As the cluster scales, Isilon AutoBalance migrates content to new storage nodes while the system is online and in production. The percentage of used capacity never differs by more than 5% between any two nodes in the cluster. Data is automatically balanced across all nodes, reducing costs, complexity, and risk.
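The 5% balance bound can be expressed as a simple check on per-node utilization. This sketch only illustrates the stated invariant; it does not use any Isilon API, and the node names, capacities, and function names are hypothetical.

```python
def used_percentages(nodes: dict) -> dict:
    """Return percent of capacity used per node.

    `nodes` maps a node name to a (used_tb, capacity_tb) tuple.
    """
    return {name: 100.0 * used / cap for name, (used, cap) in nodes.items()}


def max_imbalance(nodes: dict) -> float:
    """Largest gap in used-percentage between any two nodes."""
    pct = list(used_percentages(nodes).values())
    return max(pct) - min(pct)


def is_balanced(nodes: dict, threshold_pct: float = 5.0) -> bool:
    """True when no two nodes differ by more than the threshold."""
    return max_imbalance(nodes) <= threshold_pct


# A freshly added, empty node makes the cluster unbalanced ...
just_expanded = {"node1": (30, 36), "node2": (29, 36), "node3": (0, 36)}
# ... and the same data spread evenly brings it back within the bound.
rebalanced = {"node1": (20, 36), "node2": (20, 36), "node3": (19, 36)}

print(is_balanced(just_expanded))  # False
print(is_balanced(rebalanced))     # True
```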

Figure 8: Isilon Cluster

- Separate Internal and External Networks
  Isilon clusters use separate internal and external networks, so each node in the cluster requires multiple network connections. Even the simplest non-redundant network topology requires two network connections per node: an internal network connection for intra-cluster communication and an external network connection for client traffic. The internal network, also called the back-end network, uses InfiniBand to connect the nodes in a cluster. InfiniBand is a switched-fabric I/O standard that offers high throughput and low latency. Essentially, the back-end network acts as the backplane of the cluster, enabling each node to act as a contributor to the whole. Clusters using an InfiniBand back end can grow to a maximum of 144 nodes.

- OneFS uses Reed-Solomon to provide redundancy and high availability
  As traditional storage systems scale, techniques that were appropriate at a small size become inadequate at a larger size, and there is no better example of this than RAID. RAID can only be effective if the data can be reconstructed before another failure can occur. However, as the amount of data increases, the speed to access that data does not, and the probability of additional failures continues to increase. OneFS does not depend on hardware-based RAID technologies to pro
