Avoiding the Disk Bottleneck in the Data Domain Deduplication File System

Transcription

Avoiding the Disk Bottleneck in the Data Domain Deduplication File System

Benjamin Zhu, Data Domain, Inc.
Kai Li, Data Domain, Inc. and Princeton University
Hugo Patterson, Data Domain, Inc.

Abstract

Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store backups on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment.

This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques include: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.

1 Introduction

The massive storage requirements for data protection have presented a serious problem for data centers. Typically, data centers perform a weekly full backup of all the data on their primary storage systems to secondary storage devices where they keep these backups for weeks to months. In addition, they may perform daily incremental backups that copy only the data which has changed since the last backup. The frequency, type and retention of backups vary for different kinds of data, but it is common for the secondary storage to hold 10 to 20 times more data than the primary storage. For disaster recovery, additional offsite copies may double the secondary storage capacity needed. If the data is transferred offsite over a wide area network, the network bandwidth requirement can be enormous.

Given the data protection use case, there are two main requirements for a secondary storage system storing backup data. The first is low cost so that storing backups and moving copies offsite does not end up costing significantly more than storing the primary data. The second is high performance so that backups can complete in a timely fashion. In many cases, backups must complete overnight so the load of performing backups does not interfere with normal daytime usage.

The traditional solution has been to use tape libraries as secondary storage devices and to transfer physical tapes for disaster recovery. Tape cartridges cost a small fraction of disk storage systems and they have good sequential transfer rates in the neighborhood of 100 MB/sec. But, managing cartridges is a manual process that is expensive and error prone. It is quite common for restores to fail because a tape cartridge cannot be located or has been damaged during handling. Further, random access performance, needed for data restores, is extremely poor. Disk-based storage systems and network replication would be much preferred if they were affordable.
During the past few years, disk-based, “deduplication” storage systems have been introduced for data protection [QD02, MCM01, KDLT04, Dat05, JDT05]. Such systems compress data by removing duplicate data across files and often across all the data in a storage system. Some implementations achieve a 20:1 compression ratio (total data size divided by physical space used) for 3 months of backup data using the daily-incremental and weekly-full backup policy. By substantially reducing the footprint of versioned data, deduplication can make the costs of storage on disk and tape comparable and make replicating data over a WAN to a remote site for disaster recovery practical.

The specific deduplication approach varies among system vendors. Certainly the different approaches vary in how effectively they reduce data. But the goal of this paper is not to investigate how to get the greatest data reduction, but rather how to do deduplication at high speed in order to meet the performance requirement for secondary storage used for data protection.

The most widely used deduplication method for secondary storage, which we call Identical Segment Deduplication, breaks a data file or stream into contiguous segments and eliminates duplicate copies of identical segments. Several emerging commercial systems have used this approach.

The focus of this paper is to show how to implement a high-throughput Identical Segment Deduplication storage system at low system cost. The key performance challenge is finding duplicate segments. Given a segment size of 8 KB and a performance target of 100 MB/sec, a deduplication system must process approximately 12,000 segments per second.

An in-memory index of all segment fingerprints could easily achieve this performance, but the size of the index would limit system size and increase system cost. Consider a segment size of 8 KB and a segment fingerprint size of 20 bytes. Supporting 8 TB worth of unique segments would require 20 GB just to store the fingerprints.

An alternative approach is to maintain an on-disk index of segment fingerprints and use a cache to accelerate segment index accesses. Unfortunately, a traditional cache would not be effective for this workload. Since fingerprint values are random, there is no spatial locality in the segment index accesses. Moreover, because the backup workload streams large data sets through the system, there is very little temporal locality. Most segments are referenced just once every week during the full backup of one particular system. Reference-based caching algorithms such as LRU do not work well for such workloads. The Venti system, for example, implemented such a cache [QD02]. Its combination of index and block caches only improves its write throughput by about 16% (from 5.6 MB/sec to 6.5 MB/sec) even with 8 parallel disk index lookups. The primary reason is its low cache hit ratios.

With low cache hit ratios, most index lookups require disk operations. If each index lookup requires a disk access which may take 10 msec and 8 disks are used for index lookups in parallel, the write throughput will be about 6.4 MB/sec, roughly corresponding to Venti’s throughput of less than 6.5 MB/sec with 8 drives. While Venti’s performance may be adequate for the archival usage of a small workgroup, it is a far cry from the throughput goal of deduplicating at 100 MB/sec to compete with high-end tape libraries. Achieving 100 MB/sec would require 125 disks doing index lookups in parallel! This would increase the system cost of deduplication storage to an unattainable level.
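These back-of-envelope figures can be checked directly. The short Python sketch below reproduces the arithmetic using only the numbers quoted above (an 8 KB average segment, 20-byte fingerprints, a 10 msec random disk access, and decimal MB/GB/TB); it illustrates the sizing argument and is not code from the system.

```python
# Back-of-envelope check of the figures quoted above (illustrative arithmetic,
# decimal MB/GB/TB; the 10 msec random access time is the paper's assumption).
MB, GB, TB = 10**6, 10**9, 10**12
SEGMENT_SIZE = 8 * 1024          # 8 KB average segment
FINGERPRINT_SIZE = 20            # bytes per SHA-1 fingerprint

# Segments that must be processed per second at 100 MB/sec.
segments_per_sec = 100 * MB / SEGMENT_SIZE
print(f"{segments_per_sec:,.0f} segments/sec")              # roughly 12,000

# RAM for a full in-memory fingerprint index covering 8 TB of unique segments.
index_bytes = (8 * TB / SEGMENT_SIZE) * FINGERPRINT_SIZE
print(f"{index_bytes / GB:.0f} GB of fingerprints")          # roughly 20 GB

# Throughput if every lookup costs one 10 msec disk access, with 8 disks in parallel.
disk_bound = (8 / 0.010) * SEGMENT_SIZE / MB
print(f"{disk_bound:.1f} MB/sec disk-bound throughput")      # roughly 6.5 MB/sec

# Disks needed if every lookup still goes to disk at the 100 MB/sec target.
print(f"{segments_per_sec * 0.010:.0f} disks needed")        # on the order of 125
```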
Our key idea is to use a combination of three methods to reduce the need for on-disk index lookups during the deduplication process. We present in detail each of the three techniques used in the production Data Domain deduplication file system. The first is to use a Bloom filter, which we call a Summary Vector, as the summary data structure to test if a data segment is new to the system. It avoids wasted lookups for segments that do not exist in the index. The second is to store data segments and their fingerprints in the same order that they occur in a data file or stream. Such Stream-Informed Segment Layout (SISL) creates spatial locality for segment and fingerprint accesses. The third, called Locality Preserved Caching, takes advantage of the segment layout to fetch and cache groups of segment fingerprints that are likely to be accessed together. A single disk access can result in many cache hits and thus avoid many on-disk index lookups.
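To make the interplay of the three techniques concrete, the sketch below shows one way a write path could consult them in order: the Summary Vector rules out segments that are certainly new, the locality-preserved cache absorbs most duplicate lookups in RAM, and only the remaining cases touch the on-disk index. This is a schematic illustration, not the DDFS implementation; every collaborator and method name here is a hypothetical placeholder.

```python
# Schematic write path (an illustration, not the DDFS implementation): the
# Summary Vector screens out segments that are certainly new, the locality-
# preserved cache absorbs most duplicate lookups in RAM, and only the
# remaining cases touch the on-disk index. Every collaborator (Bloom filter,
# cache, index, container store) is a hypothetical placeholder object.

import hashlib

class DedupWritePath:
    def __init__(self, summary_vector, fingerprint_cache, on_disk_index, container_store):
        self.summary = summary_vector    # in-memory Bloom filter over stored fingerprints
        self.cache = fingerprint_cache   # holds groups of fingerprints fetched together
        self.index = on_disk_index       # authoritative fingerprint -> container mapping
        self.store = container_store     # stream-informed containers of segments + metadata

    def write_segment(self, data: bytes) -> bytes:
        fp = hashlib.sha1(data).digest()          # 20-byte segment fingerprint

        if not self.summary.might_contain(fp):
            # Definitely a new segment: no cache or index lookup is needed at all.
            self._store_new(fp, data)
        elif self.cache.contains(fp):
            # Duplicate whose fingerprint was prefetched with its neighbors: RAM hit.
            pass
        else:
            # Possibly a duplicate (or a Bloom-filter false positive): one disk lookup.
            container_id = self.index.lookup(fp)
            if container_id is None:
                self._store_new(fp, data)         # false positive; store it as new
            else:
                # Locality Preserved Caching: pull in that container's whole group of
                # fingerprints so the stream's upcoming duplicates hit in RAM.
                self.cache.insert_group(self.store.read_metadata(container_id))
        return fp

    def _store_new(self, fp: bytes, data: bytes) -> None:
        self.store.append(fp, data)               # packed into the current stream's container
        self.summary.add(fp)
        self.index.insert(fp, self.store.current_container_id())
```

Because a miss against the on-disk index pulls an entire container's worth of fingerprints into the cache, a single disk read can satisfy the many duplicate segments that follow it in the stream.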

Our evaluation shows that these techniques are effective in removing the disk bottleneck in an Identical Segment Deduplication storage system. For a system running on a server with two dual-core CPUs with one shelf of 15 drives, these techniques can eliminate about 99% of index lookups for variable-length segments with an average size of about 8 KB. We show that the system indeed delivers high throughput: achieving over 100 MB/sec for single-stream write and read performance, and over 210 MB/sec for multi-stream write performance. This is an order-of-magnitude throughput improvement over the parallel indexing techniques presented in the Venti system.

The rest of the paper is organized as follows. Section 2 presents challenges and observations in designing a deduplication storage system for data protection. Section 3 describes the software architecture of the production Data Domain deduplication file system. Section 4 presents our methods for avoiding the disk bottleneck. Section 5 shows our experimental results. Section 6 gives an overview of the related work, and Section 7 draws conclusions.

2 Challenges and Observations

2.1 Variable vs. Fixed Length Segments

An Identical Segment Deduplication system could choose to use either fixed length segments or variable-length segments created in a content dependent manner. Fixed length segments are the same as the fixed-size blocks of many non-deduplication file systems. For the purposes of this discussion, extents that are multiples of some underlying fixed size unit such as a disk sector are the same as fixed-size blocks.

Variable-length segments can be any number of bytes in length within some range. They are the result of partitioning a file or data stream in a content dependent manner [Man93, BDH94].

The main advantage of a fixed segment size is simplicity. A conventional file system can create fixed-size blocks in the usual way and a deduplication process can then be applied to deduplicate those fixed-size blocks or segments. The approach is effective at deduplicating whole files that are identical because every block of identical files will of course be identical.

In backup applications, single files are backup images that are made up of large numbers of component files. These files are rarely entirely identical even when they are successive backups of the same file system. A single addition, deletion, or change of any component file can easily shift the remaining image content. Even if no other file has changed, the shift would cause each fixed-size segment to be different than it was last time, containing some bytes from one neighbor and giving up some bytes to its other neighbor. The approach of partitioning the data into variable length segments based on content allows a segment to grow or shrink as needed so the remaining segments can be identical to previously stored segments.

Even for storing individual files, variable length segments have an advantage. Many files are very similar to, but not identical to, other versions of the same file. Variable length segments can accommodate these differences and maximize the number of identical segments.

Because variable length segments are essential for deduplication of the shifted content of backup images, we have chosen them over fixed-length segments.
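The content-dependent partitioning cited above [Man93, BDH94] is commonly realized with a rolling hash over a small sliding window: a segment boundary is declared wherever the hash matches a fixed pattern, so boundaries depend only on nearby bytes and re-synchronize after an insertion or deletion. The sketch below is one minimal realization of that idea; the window size, divisor, and size limits are illustrative assumptions, not the anchoring parameters used by DDFS.

```python
# One minimal realization of content-dependent partitioning (a sketch; the
# window size, divisor, and size limits are illustrative, not the anchoring
# parameters used by DDFS). A rolling hash of the last WINDOW bytes is kept,
# and a boundary is declared wherever the hash matches a fixed pattern, so the
# expected segment size is roughly DIVISOR bytes plus the minimum size.

WINDOW = 48                      # bytes of context that determine a boundary
DIVISOR = 8192                   # anchor on average once every ~8 KB
MIN_SEG, MAX_SEG = 2048, 65536   # hard limits on segment length
BASE, MOD = 257, (1 << 31) - 1   # rolling polynomial hash parameters

def segments(data: bytes):
    """Yield (start, end) byte ranges of variable-length, content-defined segments."""
    drop_coeff = pow(BASE, WINDOW - 1, MOD)   # coefficient of the byte leaving the window
    start, h = 0, 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * drop_coeff) % MOD   # remove the oldest byte
        h = (h * BASE + b) % MOD                            # shift in the new byte
        length = i + 1 - start
        if length >= MAX_SEG or (length >= MIN_SEG and h % DIVISOR == DIVISOR - 1):
            yield start, i + 1
            start, h = i + 1, 0                             # restart the hash per segment
    if start < len(data):
        yield start, len(data)                              # trailing partial segment
```

Because the boundary decision looks only at the last few dozen bytes, inserting data near the start of a backup image perturbs only the segments that overlap the edit; the segments that follow regain their previous boundaries and deduplicate against the prior backup.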
2.2 Segment Size

Whether fixed or variable sized, the choice of average segment size is difficult because of its impact on compression and performance. The smaller the segments, the more duplicate segments there will be. Put another way, if there is a small modification to a file, the smaller the segment, the smaller the new data that must be stored and the more of the file’s bytes will be in duplicate segments. Within limits, smaller segments will result in a better compression ratio.

On the other hand, with smaller segments, there are more segments to process, which reduces performance. At a minimum, more segments mean more times through the deduplication loop, but they are also likely to mean more on-disk index lookups.

With smaller segments, there are more segments to manage. Since each segment requires the same metadata size, smaller segments will require a larger storage footprint for their metadata, and the segment fingerprints for fewer total user bytes can be cached in a given amount of memory. The segment index is larger. There are more updates to the index. To the extent that any data structures scale with the number of segments, they will limit the overall capacity of the system. Since commodity servers typically have a hard limit on the amount of physical memory in a system, the decision on the segment size can greatly affect the cost of the system.

A well-designed deduplication storage system should have the smallest segment size possible given the throughput and capacity requirements for the product. After several design iterations, we have chosen to use 8 KB as the average segment size for the variable-sized data segments in our deduplication storage system.
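The memory side of this tradeoff is easy to quantify for the fingerprint index alone. The sketch below reuses the 20-byte fingerprint and the 8 TB of unique data from Section 1 and varies only the average segment size; it is illustrative arithmetic, not a measurement, and per-segment metadata beyond the fingerprint would add further overhead.

```python
# How the in-memory fingerprint index alone scales with average segment size,
# using the 20-byte SHA-1 fingerprints and the 8 TB of unique segments from
# the example in Section 1. Illustrative arithmetic, not a measurement.
TB, GB = 10**12, 10**9
FINGERPRINT_SIZE = 20
for seg_kb in (4, 8, 16, 32):
    count = 8 * TB / (seg_kb * 1024)
    print(f"{seg_kb:2d} KB segments -> {count * FINGERPRINT_SIZE / GB:5.1f} GB of fingerprints")
```

Each halving of the segment size doubles the fingerprint footprint, which is the capacity-side pressure against very small segments.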

2.3 Performance-Capacity Balance

A secondary storage system used for data protection must support a reasonable balance between capacity and performance. Since backups must complete within a fixed backup window time, a system with a given performance can only back up so much data within the backup window. Further, given a fixed retention period for the data being backed up, the storage system needs only so much capacity to retain the backups that can complete within the backup window. Conversely, given a particular storage capacity, backup policy, and deduplication efficiency, it is possible to compute the throughput that the system must sustain to justify the capacity. This balance between performance and capacity motivates the need to achieve good system performance with only a small number of disk drives.

Assume a backup policy of weekly fulls and daily incrementals with a retention period of 15 weeks, and a system that achieves a 20x compression ratio storing backups for such a policy. As a rough rule of thumb, storing all the backup images then requires approximately as much capacity as the primary data. That is, for 1 TB of primary data, the deduplication secondary storage would consume approximately 1 TB of physical capacity to store the 15 weeks of backups.

Weekly full backups are commonly done over the weekend with a backup window of 16 hours. The balance of the weekend is reserved for restarting failed backups or making additional copies. Using the rule of thumb above, 1 TB of capacity can protect approximately 1 TB of primary data. All of that must be backed up within the 16-hour backup window, which implies a throughput of about 18 MB/sec per terabyte of capacity.

Following this logic, a system with a shelf of 15 SATA drives, each with a capacity of 500 GB and a total usable capacity after RAID, spares, and other overhead of 6 TB, could protect 6 TB of primary storage and must therefore be able to sustain over 100 MB/sec of deduplication throughput.
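As a numeric check of this rule of thumb (arithmetic only, using decimal units; the 16-hour window and 6 TB of usable capacity are the figures above):

```python
# The performance-capacity rule of thumb above, checked as arithmetic
# (decimal units; the 16-hour window and 6 TB usable capacity come from the text).
MB, TB = 10**6, 10**12

window_sec = 16 * 3600                          # weekend backup window
per_tb = (1 * TB / window_sec) / MB             # full backup of 1 TB of protected data
print(f"{per_tb:.1f} MB/sec per TB of capacity")        # ~17.4, i.e. roughly 18 MB/sec

usable_tb = 6                                   # 15 x 500 GB shelf after RAID, spares, overhead
print(f"{usable_tb * per_tb:.0f} MB/sec required")      # ~104, i.e. over 100 MB/sec
```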
2.4 Fingerprint vs. Byte Comparisons

An Identical Segment Deduplication storage system needs a method to determine that two segments are identical. This could be done with a byte-by-byte comparison of the newly written segment with the previously stored segment. However, such a comparison is only possible by first reading the previously stored segment from disk. This would be much more onerous than looking up a segment in an index and would make it extremely difficult, if not impossible, to maintain the needed throughput.

To avoid this overhead, we rely on comparisons of segment fingerprints to determine the identity of a segment. The fingerprint is a collision-resistant hash value computed over the content of each segment. SHA-1 is such a collision-resistant function [NIST95]. At a 160-bit output value, the probability of fingerprint collision by a pair of different segments is extremely small, many orders of magnitude smaller than hardware error rates [QD02]. When data corruption occurs, it will almost certainly be the result of undetected errors in RAM, IO busses, network transfers, disk storage devices, other hardware components or software errors, and not from a collision.
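The "extremely small" collision probability can be estimated with the standard birthday bound p ≈ n(n−1)/2^161 for n stored segments. The snippet below evaluates it for roughly 10^9 segments, which corresponds to the 8 TB of unique 8 KB segments used as an example in Section 1; it is an illustrative calculation, not a figure from the paper.

```python
# Standard birthday-bound estimate of the chance that any two distinct stored
# segments share a SHA-1 fingerprint (illustrative; not a figure from the paper).
# n ~ 10**9 corresponds to 8 TB of unique data at an 8 KB average segment size.
n = 10**9                          # stored unique segments
space = 2**160                     # possible SHA-1 fingerprint values
p_collision = n * (n - 1) / (2 * space)
print(f"{p_collision:.1e}")        # ~3.4e-31, far below hardware error rates
```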

3 Deduplication Storage System Architecture

To provide the context for presenting our methods for avoiding the disk bottleneck, this section describes the architecture of the production Data Domain File System, DDFS, for which Identical Segment Deduplication is an integral feature. Note that the methods presented in the next section are general and can apply to other Identical Segment Deduplication storage systems.

At the highest level, DDFS breaks a file into variable-length segments in a content dependent manner [Man93, BDH94] and computes a fingerprint for each segment. DDFS uses the fingerprints both to identify duplicate segments and as part of a segment descriptor used to reference a segment. It represents files as sequences of segment fingerprints. During writes, DDFS identifies duplicate segments and does its best to store only one copy of any particular segment. Before storing a new segment, DDFS uses a variation of the Ziv-Lempel algorithm to compress the segment [ZL77].

[Figure 1: Data Domain File System architecture.]

Figure 1 is a block diagram of DDFS, which is made up of a stack of software components. At the top of the stack, DDFS supports multiple access protocols which are layered on a common File Services interface. Supported protocols include NFS, CIFS, and a virtual tape library interface (VTL).

When a data stream enters the system, it goes through one of the standard interfaces to the generic File Services layer, which manages the name space and file metadata. The File Services layer forwards write requests to Content Store, which manages the data content within a file. Content Store breaks a data stream into segments, uses Segment Store to perform deduplication, and keeps track of the references for a file. Segment Store does the actual work of deduplication. It packs deduplicated (unique) segments into relatively large units, compresses such units using a variation of the Ziv-Lempel algorithm to further compress the data, and then writes the compressed results into containers supported by Container Manager.

To read a data stream from the system, a client drives the read operation through one of the standard interfaces and the File Services layer. Content Store uses the references to deduplicated segments to deliver the desired data stream to the client. Segment Store prefetches, decompresses, reads and caches data segments from Container Manager.

The following describes the Content Store, Segment Store and the Container Manager in detail and discusses our design decisions.

3.1 Content Store

Content Store implements byte-range writes and reads for deduplicated data objects, where an object is a linear sequence of client data bytes and has intrinsic and client-settable attributes or metadata. An object may be a conventional file, a backup image of an entire volume or a tape cartridge.

To write a range of bytes into an object, Content Store performs several operations (a schematic sketch follows below):

- Anchoring partitions the byte range into variable-length segments in a content dependent manner [Man93, BDH94].
- Segment fingerprinting computes the SHA-1 hash and generates the segment descriptor based on it. Each segment descriptor contains per-segment information, including at least the fingerprint and size.
- Segment mapping builds the tree of segments that records the mapping between object byte ranges and segment descriptors. The goal is to represent a data object using references to deduplicated segments.

To read a range of bytes in an object, Content Store traverses the tree of segments created by the segment mapping operation above to obtain the segment descriptors for the relevant segments.
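The sketch below illustrates how these Content Store operations could fit together. It is not DDFS code: the names are hypothetical, the segment tree is flattened to a per-object list, and writes are whole-object rather than byte-range for brevity. The anchoring function is injected, for example the content-defined chunker sketched in Section 2.1, and the segment store collaborator stands in for the deduplicating Segment Store layer.

```python
# Schematic sketch of the Content Store operations listed above: anchoring,
# segment fingerprinting, and segment mapping on writes, plus traversal of the
# mapping on reads. Names are hypothetical placeholders; the segment "tree" is
# flattened to a per-object list and writes are whole-object for brevity.

import hashlib
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

@dataclass
class SegmentDescriptor:
    fingerprint: bytes            # SHA-1 of the segment content
    size: int                     # segment length in bytes

class ContentStore:
    def __init__(self, segment_store, anchor_fn: Callable[[bytes], Iterable[Tuple[int, int]]]):
        self.segment_store = segment_store   # performs the actual deduplication
        self.anchor_fn = anchor_fn           # e.g. the content-defined chunker sketched in 2.1
        self.objects = {}                    # object id -> list of segment descriptors

    def write(self, object_id: str, data: bytes) -> None:
        descriptors = []
        for start, end in self.anchor_fn(data):               # anchoring
            segment = data[start:end]
            fp = hashlib.sha1(segment).digest()               # segment fingerprinting
            descriptors.append(SegmentDescriptor(fp, len(segment)))
            self.segment_store.store(fp, segment)             # deduplication happens below
        self.objects[object_id] = descriptors                 # segment mapping

    def read(self, object_id: str) -> bytes:
        # Traverse the (flattened) segment mapping and reassemble the byte stream.
        return b"".join(self.segment_store.fetch(d.fingerprint)
                        for d in self.objects[object_id])
```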
