Transcription
Deduplication: Overview & Case Studies. CSCI 333 – Spring 2020, Williams College
Lecture Outline:
- Background
- Content Addressable Storage (CAS)
- Deduplication
- Chunking
- The Index
- Other CAS applications
Content Addressable Storage (CAS): Deduplication systems often rely on Content Addressable Storage (CAS). Data is indexed by some content identifier. The content identifier is determined by some function over the data itself, often a cryptographically strong hash function.
CAS Example: I send a document to be stored remotely on some content addressable storage.
CAS Example: The server receives the document, and calculates a unique identifier called the data's fingerprint.
CAS: The fingerprint should be: unique to the data (NO collisions), and one-way (hard to invert).
CAS: The fingerprint should be unique to the data (no collisions) and one-way (hard to invert). SHA-1: 20 bytes (160 bits). P(collision(a, b)) ≈ (1/2)^160, and P(coll(N, 2^160)) ≈ C(N, 2) · (1/2)^160, so it takes about 10^24 objects before it is more likely than not that a collision has occurred.
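The collision bound above can be sanity-checked numerically. This is a sketch of the birthday-bound arithmetic, not anything from the lecture's code: we want the N where C(N, 2) · (1/2)^160 ≈ 1/2, i.e. N ≈ sqrt(2^160) = 2^80.

```python
import math

# Birthday-bound check for a 160-bit fingerprint: how many objects N
# before a collision is more likely than not?
# C(N, 2) / 2^160 ~= 1/2  =>  N^2 ~= 2^160  =>  N ~= 2^80.
BITS = 160
n = math.isqrt(2 ** BITS)  # N ~= 2^80
print(f"N ~= 2^{n.bit_length() - 1} ~= {n:.3e} objects")  # on the order of 10^24
```

So the slide's "10^24 objects" is just 2^80 written in decimal.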
[Figure: the server stores the document under its fingerprint, e.g. de9f2c7fd25e1b3a….data]
CAS Example: I submit my homework, and my “buddy” Harold also submits my homework.
CAS Example: Same contents, same fingerprint (de9f2c7fd25e1b3a…). The data is only stored once.
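The store-once behavior in the example falls out naturally if the object store is keyed by fingerprint. A toy sketch (this `CAS` class is hypothetical, not the server from the lecture):

```python
import hashlib

# Toy content-addressable store: objects are indexed by the SHA-1 of
# their contents, so identical submissions are stored exactly once.
class CAS:
    def __init__(self):
        self.store = {}  # fingerprint (hex) -> data

    def put(self, data: bytes) -> str:
        fp = hashlib.sha1(data).hexdigest()
        self.store.setdefault(fp, data)  # a duplicate put is a no-op
        return fp

    def get(self, fp: str) -> bytes:
        return self.store[fp]

cas = CAS()
hw = b"my homework"
fp1 = cas.put(hw)  # I submit my homework
fp2 = cas.put(hw)  # Harold submits the same bytes
assert fp1 == fp2 and len(cas.store) == 1  # one fingerprint, one copy
```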
CAS Example: Now suppose Harold writes his name at the top of my document.
CAS Example: The fingerprints are completely different (fad3e85a… vs. 0bd17d9b…), despite the (mostly) identical data.
CAS Problem Statement: What is the appropriate granularity to address our data? What are the tradeoffs associated with this choice?
Deduplication: Chunking breaks a data stream into segments. SHA1(DATA) becomes SHA1(CK1), SHA1(CK2), SHA1(CK3). How do we divide a data stream? How do we reassemble a data stream?
Deduplication: Division.
Option 1: fixed-size blocks
- every 4 KB, start a new chunk
Option 2: variable-size chunks
- chunk boundaries dependent on chunk contents
Deduplication: Division with fixed-size blocks. [Figure: hw-bill.txt and hw-harold.txt split into aligned fixed-size blocks]
Deduplication: Division with fixed-size blocks. Suppose Harold adds his name to the top of my homework: every block boundary after the insertion shifts, so the two files share almost no blocks. This is called the boundary shifting problem.
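The boundary shifting problem is easy to demonstrate. A sketch with tiny 8-byte blocks and made-up homework contents (the filenames and text are illustrative, not the lecture's actual files):

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    """Split data into fixed-size blocks (toy 8-byte blocks)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

bill = b"An essay about B-trees and write optimization."
harold = b"Harold\n" + bill  # Harold prepends his 7-byte name

fps_bill = {hashlib.sha1(c).hexdigest() for c in fixed_chunks(bill)}
fps_harold = {hashlib.sha1(c).hexdigest() for c in fixed_chunks(harold)}

# Every boundary after the insertion shifts by 7 bytes, so no blocks
# are shared even though the files are nearly identical.
shared = len(fps_bill & fps_harold)
print(shared, "shared blocks")  # -> 0 shared blocks
```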
Deduplication: Division with variable-size chunks. Parameters: a window of width w, and a target pattern t.
- Slide the window byte by byte across the data, and compute a window fingerprint at each position.
- If the fingerprint matches the target t, then we have a fingerprint match at that position.
Deduplication: Division with variable-size chunks. [Figure: hw-wkj.txt and hw-harold.txt split at content-defined chunk boundaries]
Deduplication: Division with variable-size chunks. Suppose Harold adds his name to the top of my homework: the content-defined boundaries after the insertion do not move, so we only introduce one new chunk to storage.
Deduplication: Division with variable-size chunks. Sliding window properties:
- collisions are OK, but
- average chunk size should be configurable
- reuse overlapping window calculations (Rabin fingerprints)
With window w and target t, expect a chunk every 2^t − 1 + w ≈ 2^t bytes. LBFS: w = 48, t = 13, so expect a chunk every 8 KB.
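The sliding-window idea can be sketched in a few lines. This is a simplified illustration, not LBFS's actual Rabin code: SHA-1 over the window stands in for a real rolling fingerprint (so it is slow and recomputes each window), and w = 16, t = 6 are tiny demo parameters, giving an expected chunk size around 2^6 = 64 bytes.

```python
import hashlib

W, T, TARGET = 16, 6, 0  # toy window width, target bits, target pattern

def window_fp(window: bytes) -> int:
    # Stand-in fingerprint: last 2 bytes of SHA-1 over the window.
    return int.from_bytes(hashlib.sha1(window).digest()[-2:], "big")

def chunks(data: bytes):
    start = 0
    for i in range(W, len(data) + 1):
        # Declare a chunk boundary when the low T bits match the target.
        if window_fp(data[i - W:i]) & ((1 << T) - 1) == TARGET:
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]  # final partial chunk

data = bytes(range(256)) * 8  # 2 KB of sample data
parts = list(chunks(data))
assert b"".join(parts) == data  # chunking is lossless
```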
Deduplication: Division with variable-size chunks. Rabin fingerprint: preselect a divisor D and an irreducible polynomial p. Over an arbitrary window of width w:

R(b_1, b_2, …, b_w) = (b_1·p^(w−1) + b_2·p^(w−2) + … + b_w) mod D

Sliding the window reuses the previous window's calculation, after subtracting its first term:

R(b_i, …, b_(i+w−1)) = ((R(b_(i−1), …, b_(i+w−2)) − b_(i−1)·p^(w−1))·p + b_(i+w−1)) mod D
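The rolling-update identity can be checked numerically. A sketch using ordinary modular integer arithmetic with arbitrarily chosen p and D (a real Rabin fingerprint works with polynomials over GF(2) and an irreducible polynomial, but the algebra of the update is the same):

```python
p, D, w = 257, 1000003, 48  # demo constants, not LBFS's parameters

def full_fp(window: bytes) -> int:
    """R(b1..bw) = (b1*p^(w-1) + ... + bw) mod D, computed directly (O(w))."""
    acc = 0
    for b in window:
        acc = (acc * p + b) % D
    return acc

def roll(prev_fp: int, outgoing: int, incoming: int) -> int:
    """Slide the window one byte in O(1): drop b_(i-1), append b_(i+w-1)."""
    return ((prev_fp - outgoing * pow(p, w - 1, D)) * p + incoming) % D

data = bytes(range(200))
fp = full_fp(data[0:w])
for i in range(1, len(data) - w + 1):
    fp = roll(fp, data[i - 1], data[i + w - 1])
    assert fp == full_fp(data[i:i + w])  # O(1) update matches O(w) recompute
```

This is exactly why overlapping window calculations can be reused: each position costs one multiply, one subtract, and one add, regardless of w.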
Deduplication Recap: Chunking breaks a data stream into smaller segments. What do we gain from chunking? What are the tradeoffs?
- Finer granularity of sharing, but fingerprinting is an expensive operation.
- Finer granularity of addressing, but not suitable for all data patterns, and there is index overhead.
Deduplication: Reassembling chunks. Recipes provide directions for reconstructing files from chunks.
Deduplication: Reassembling chunks. [Figure: a recipe (e.g. de9f2c7fd25e1b3a….recipe) is a metadata object listing SHA-1 fingerprints, each pointing to a data block]
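A recipe is just an ordered list of fingerprints over a chunk store. A minimal sketch of the idea (names like `write_file`/`read_file` are illustrative, not from the lecture):

```python
import hashlib

store = {}  # chunk store: fingerprint (hex) -> chunk

def put_chunk(chunk: bytes) -> str:
    fp = hashlib.sha1(chunk).hexdigest()
    store.setdefault(fp, chunk)  # duplicates are stored once
    return fp

def write_file(chunks) -> list:
    """Store each chunk; the ordered fingerprint list IS the recipe."""
    return [put_chunk(c) for c in chunks]

def read_file(recipe) -> bytes:
    """Reassemble the file by following the recipe."""
    return b"".join(store[fp] for fp in recipe)

recipe = write_file([b"chunk-A", b"chunk-B", b"chunk-A"])
assert read_file(recipe) == b"chunk-Achunk-Bchunk-A"
assert len(store) == 2  # the repeated chunk is stored only once
```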
Deduplication: The Index. The SHA-1 fingerprint uniquely identifies data, but the index translates fingerprints to chunks: sha-1_1 → chunk_1, sha-1_2 → chunk_2, …, sha-1_n → chunk_n, where chunk_i = {location, size?, refcount?, compressed?, …}.
Deduplication: The Index. For small chunk stores: a database, hash table, or tree. For a large index, legacy data structures won't fit in main memory, so each index query requires a disk seek. Why? SHA-1 fingerprints are independent and randomly distributed: there is no locality. This is known as the index disk bottleneck.
Deduplication: The Index. Back of the envelope: average chunk size 4 KB, fingerprint 20 B. 20 TB of unique data → 100 GB of SHA-1 fingerprints.
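The back-of-the-envelope arithmetic, spelled out:

```python
# 20 TB of unique data / 4 KB average chunks = number of chunks;
# each chunk contributes one 20-byte SHA-1 fingerprint to the index.
TB, KB, GB = 2**40, 2**10, 2**30
n_chunks = 20 * TB // (4 * KB)   # 5 * 2^30 chunks
index_bytes = n_chunks * 20      # 20 B per fingerprint
print(index_bytes / GB, "GB")    # -> 100.0 GB
```

100 GB of fingerprints clearly cannot live in RAM, which is what makes the disk bottleneck a real problem.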
Deduplication: Disk bottleneck. Data Domain strategy: filter unnecessary lookups, and piggyback useful work onto the disk lookups that are necessary. In memory: the Summary Vector and the Locality Preserving Cache. On disk: Stream-Informed Segment Layout (containers).
Deduplication: Disk bottleneck. Summary vector: a Bloom filter (any AMQ data structure works). [Figure: bit array with hash functions h1, h2, h3 setting bits]
Filter properties:
- No false negatives: if an FP is in the index, it is in the summary vector.
- Tuneable false positive rate: we can trade memory for accuracy.
Note: on a false positive, we are no worse off; we just do the disk seek we would have done anyway.
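A minimal Bloom filter sketch of the summary vector, assuming k independent hash functions derived by salting SHA-1 (the parameters M and K are toy values, not Data Domain's):

```python
import hashlib

M, K = 1 << 16, 3          # bit-array size and number of hash functions
bits = bytearray(M // 8)   # the summary vector itself

def _positions(fp: bytes):
    # Derive K "independent" hash positions by salting SHA-1.
    for i in range(K):
        h = hashlib.sha1(bytes([i]) + fp).digest()
        yield int.from_bytes(h[:4], "big") % M

def add(fp: bytes):
    for pos in _positions(fp):
        bits[pos // 8] |= 1 << (pos % 8)

def maybe_contains(fp: bytes) -> bool:
    # False => definitely absent; True => present OR a false positive.
    return all(bits[pos // 8] & (1 << (pos % 8)) for pos in _positions(fp))

add(b"fingerprint-1")
assert maybe_contains(b"fingerprint-1")  # no false negatives
```

A `False` answer lets us skip the index seek entirely; a `True` answer just means we do the seek we would have done anyway.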
Deduplication: Disk bottleneck. Stream-Informed Segment Layout (SISL):
- variable-sized chunks are written to fixed-size containers
- chunk descriptors are stored in a list at the head, giving “temporal locality” for the hashes within a container
Principle: backup workloads exhibit chunk locality.
Deduplication: Disk bottleneck. Locality Preserving Cache (LPC): an LRU cache of candidate fingerprint groups. [Figure: on-disk containers, each holding a group of chunk descriptors (CD1 CD2 CD3 CD4, …)]
Principle: if you must go to disk, make it worth your while.
Deduplication: Disk bottleneck. [Flowchart] Read request for a chunk fingerprint:
1. Is the fingerprint in the Bloom filter? No → no lookup necessary (the chunk is new).
2. Is the fingerprint in the LPC? Yes → done, no disk access.
3. Otherwise, perform the on-disk fingerprint index lookup to get the container location, read the data from the target container, and prefetch the fingerprints from the head of that container into the LPC.
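The flow above can be written out as code. This is a sketch with dictionaries standing in for the real structures (`bloom`, `lpc`, `disk_index`, and `containers` are hypothetical stand-ins, not Data Domain's data structures):

```python
def lookup(fp, bloom, lpc, disk_index, containers):
    if fp not in bloom:
        return None                  # definitely new: no lookup necessary
    if fp in lpc:
        return lpc[fp]               # cheap hit in the in-memory cache
    if fp not in disk_index:
        return None                  # Bloom false positive: one wasted seek
    container_id = disk_index[fp]    # the expensive on-disk lookup
    for cfp, loc in containers[container_id]["descriptors"]:
        lpc[cfp] = loc               # piggyback: prefetch the whole
    return lpc[fp]                   # container's fingerprints (locality)

# Toy usage: two chunks laid out together in container 0.
bloom = {"fp-a"}
lpc = {}
disk_index = {"fp-a": 0}
containers = {0: {"descriptors": [("fp-a", ("c0", 0)), ("fp-b", ("c0", 1))]}}
assert lookup("fp-a", bloom, lpc, disk_index, containers) == ("c0", 0)
assert "fp-b" in lpc  # fp-a's container neighbors were prefetched
```

Because backup streams tend to repeat in the same order, the prefetched neighbors (`fp-b` here) are likely the very next lookups, turning many disk seeks into LPC hits.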
Deduplication Summary: Dedup and the 4 W's. Dedup goal: eliminate repeat instances of identical data. What (granularity) to dedup? Where to dedup? When to dedup? Why dedup?
Deduplication Summary: What (granularity) to dedup?
- Whole file: no chunking needed; low overheads; ease of implementation, selective caching, synchronization.
- Fixed-size blocks: chunk at fixed offsets; low overheads; suffers the boundary shifting problem.
- Content-defined (sliding window) chunks: best deduplication; latency and CPU intensive.
Deduplication Summary: Where to dedup?
- Source: dedup before sending data over the network. Saves bandwidth, but adds client complexity; can we trust clients?
- Destination: dedup at the storage server. The server is more powerful, and the data structures are centralized.
- Hybrid: the client index checks membership; the server index stores location.
Deduplication Summary: When to dedup?
- Inline (Data → Dedup → Disk): never store duplicate data, and save I/O on duplicate data, but pay a slower index lookup per chunk.
- Post-process (Data → Disk → Dedup): faster writes (stream long writes, reclaim in the background), but temporarily wasted storage, and may create (even more) fragmentation.
- Hybrid: post-processing is faster for initial commits; switch to inline to take advantage of the I/O savings.
Deduplication: Why dedup? Perhaps you have a loooooot of data (enterprise backups). Or data that is particularly amenable to deduplication: small or incremental changes, and data that is not encrypted or compressed. Or data that changes infrequently: blocks are immutable (there is no such thing as a “block modify”), and the rate of change determines container chunk locality. Ideal use case: “cold storage.”
Deduplication: Why dedup? Perhaps your bottleneck isn't the CPU. Use dedup if you can favorably trade other resources, such as bandwidth. Example: “A Protocol-Independent Technique for Eliminating Redundant Network Traffic.” [Figure: two endpoints, each with a shared cache and a FIFO packet store, deduplicating redundant packets on the link]
Other CAS Applications: Data verification. CAS can be used to build tamper-evident storage. Suppose that:
- you can't fix a compromised server,
- but you never want to be fooled by one.
Insight: fingerprints uniquely identify data. Hash before storing data, and save the fingerprint locally; rehash the data and compare fingerprints upon receipt!
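The verification idea fits in a few lines. A sketch (the `store`/`fetch` names and the dict-as-server are illustrative assumptions): keep fingerprints locally, and rehash whatever the untrusted server returns before trusting it.

```python
import hashlib

local_fps = {}  # trusted, client-side: name -> fingerprint

def store(name: str, data: bytes, server: dict):
    local_fps[name] = hashlib.sha1(data).hexdigest()  # hash BEFORE storing
    server[name] = data                               # server is untrusted

def fetch(name: str, server: dict) -> bytes:
    data = server[name]
    # Rehash on receipt and compare against the locally saved fingerprint.
    if hashlib.sha1(data).hexdigest() != local_fps[name]:
        raise ValueError(f"{name}: server returned tampered data")
    return data

server = {}
store("hw.txt", b"my homework", server)
assert fetch("hw.txt", server) == b"my homework"

server["hw.txt"] = b"tampered"  # a compromised server edits the file
try:
    fetch("hw.txt", server)
except ValueError:
    pass  # tampering is detected rather than silently accepted
```

This doesn't fix the compromised server, but it guarantees we are never fooled by one, which is exactly the property the slide asks for.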