Deduplication: Overview & Case Studies - Computer Science

Transcription

Deduplication: Overview & Case Studies
CSCI 333 – Spring 2020
Williams College

Lecture Outline
- Background
- Content Addressable Storage (CAS)
- Deduplication
- Chunking
- The Index
- Other CAS applications


Content Addressable Storage (CAS)
Deduplication systems often rely on Content Addressable Storage (CAS).
- Data is indexed by some content identifier.
- The content identifier is determined by some function over the data itself, often a cryptographically strong hash function.

CAS
Example: I send a document to be stored remotely on some content addressable storage.

CAS
Example: The server receives the document and calculates a unique identifier called the data's fingerprint.

CAS
The fingerprint should be:
- unique to the data (NO collisions)
- one-way (hard to invert)

CAS
The fingerprint should be:
- unique to the data (NO collisions)
- one-way (hard to invert)
SHA-1: 20 bytes (160 bits)
P(collision(a, b)) ≈ (1/2)^160
P(any collision among N objects) ≈ C(N, 2) · (1/2)^160
≈ 10^24 objects before it is more likely than not that a collision has occurred
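The 10^24 figure above follows from the standard birthday bound; a small sketch (the sqrt(2 ln 2) constant is the usual approximation, not from the slides):

```python
import math

BITS = 160                      # SHA-1 produces a 160-bit fingerprint
space = 2 ** BITS               # number of distinct fingerprints

# Birthday bound: a collision becomes more likely than not around
# n ~ sqrt(2 * ln(2) * 2^160) randomly fingerprinted objects.
n = math.sqrt(2 * math.log(2) * space)
print(f"{n:.2e}")               # on the order of 10^24
```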

[Slide figure: SHA-1(data) yields the fingerprint de9f2c7fd25e1b3a..., which names the stored data]

CAS
Example: I submit my homework, and my “buddy” Harold also submits my homework.

CAS
Example: Same contents, same fingerprint. The data is only stored once.
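The example can be made concrete with a toy CAS; a minimal sketch (the dict-based store and function names are illustrative, not a real system):

```python
import hashlib

store = {}  # toy content-addressable store: fingerprint -> data

def put(data: bytes) -> str:
    fp = hashlib.sha1(data).hexdigest()  # the content identifier
    store.setdefault(fp, data)           # identical content is stored only once
    return fp

def get(fp: str) -> bytes:
    return store[fp]

# Bill and Harold submit the same homework: one copy is stored.
f1 = put(b"my homework")
f2 = put(b"my homework")
assert f1 == f2 and len(store) == 1
```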


CAS
Example: Now suppose Harold writes his name at the top of my document.

CAS
Example: The fingerprints are completely different (de9f2c7fd25e1b3a... vs. fad3e85a0bd17d9b...), despite the (mostly) identical data.

CAS
Problem Statement:
- What is the appropriate granularity to address our data?
- What are the tradeoffs associated with this choice?


Deduplication
Chunking breaks a data stream into segments:
SHA1(DATA) becomes SHA1(CK1), SHA1(CK2), SHA1(CK3)
- How do we divide a data stream?
- How do we reassemble a data stream?

Deduplication
Division.
Option 1: fixed-size blocks
- Every (?)KB, start a new chunk
Option 2: variable-size chunks
- Chunk boundaries depend on chunk contents

Deduplication
Division: fixed-size blocks
[Slide figure: hw-bill.txt and hw-harold.txt split into fixed-size blocks]

Deduplication
Division: fixed-size blocks
Suppose Harold adds his name to the top of my homework (hw-bill.txt vs. hw-harold.txt): every block boundary shifts, so the two files share no blocks. This is called the boundary shifting problem.
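The boundary shifting problem is easy to demonstrate; a small sketch (the file contents and 8-byte block size are made up for illustration):

```python
def fixed_chunks(data: bytes, size: int = 8):
    # split the stream into fixed-size blocks
    return [data[i:i + size] for i in range(0, len(data), size)]

doc = b"CSCI 333 homework: deduplication is great, chunking is fun."
shifted = b"Harold " + doc        # Harold prepends his name

a, b = fixed_chunks(doc), fixed_chunks(shifted)
# Every boundary moves by 7 bytes, so the two files share no blocks
# even though almost all of their content is identical.
print(len(set(a) & set(b)))       # 0
```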

Deduplication
Division.
Option 1: fixed-size blocks
- Every 4KB, start a new chunk
Option 2: variable-size chunks
- Chunk boundaries depend on chunk contents

Deduplication
Division: variable-size chunks
Parameters:
- Window of width w
- Target pattern t
Algorithm:
- Slide the window byte by byte across the data, and compute a window fingerprint at each position.
- If the fingerprint matches the target t, then we have a fingerprint match (a chunk boundary) at that position.


Deduplication
Division: variable-size chunks
[Slide figure: hw-wkj.txt and hw-harold.txt split at content-defined boundaries]

Deduplication
Division: variable-size chunks
Suppose Harold adds his name to the top of my homework: because the chunk boundaries are content-defined, we only introduce one new chunk to storage.

Deduplication
Division: variable-size chunks
Sliding window properties:
- collisions are OK, but
- average chunk size should be configurable
- reuse overlapping window calculations (Rabin fingerprints)
Window w, target t:
- expect a chunk roughly every 2^t bytes (where t is the number of fingerprint bits that must match the target)
LBFS: w = 48, t = 13
- expect a chunk every 8KB

Deduplication
Division: variable-size chunks
Rabin fingerprint: preselect a divisor D and an irreducible polynomial p.

R(b1, b2, ..., bw) = (b1·p^(w-1) + b2·p^(w-2) + ... + bw) mod D

R(bi, ..., b(i+w-1)) = ((R(b(i-1), ..., b(i+w-2)) - b(i-1)·p^(w-1)) · p + b(i+w-1)) mod D

An arbitrary window of width w is computed from the previous window's calculation: subtract the previous first term, multiply by p, and add the new last byte.
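The sliding-window scheme can be sketched with a simple polynomial rolling hash (a stand-in for a true Rabin fingerprint; the window size, modulus, multiplier, and mask below are arbitrary illustrative choices):

```python
def chunks(data: bytes, w: int = 16, mask: int = 0x3F,
           p: int = 31, mod: int = 1 << 32):
    """Content-defined chunking with a polynomial rolling hash."""
    pw = pow(p, w - 1, mod)        # p^(w-1), used to drop the oldest byte
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i < w:
            h = (h * p + b) % mod                       # fill the first window
        else:
            h = ((h - data[i - w] * pw) * p + b) % mod  # roll the window by one byte
        if i >= w - 1 and (h & mask) == mask:           # low bits hit the target
            out.append(data[start:i + 1])               # declare a chunk boundary
            start = i + 1
    if start < len(data):
        out.append(data[start:])                        # trailing partial chunk
    return out
```

Because a boundary depends only on the bytes inside the window, prepending a prefix changes at most the first chunk or two; the later boundaries resynchronize and the remaining chunks are shared.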

Deduplication
Recap: Chunking breaks a data stream into smaller segments.
What do we gain from chunking? What are the tradeoffs?
+ Finer granularity of sharing
+ Finer granularity of addressing
- Fingerprinting is an expensive operation
- Not suitable for all data patterns
- Index overhead

Deduplication
Reassembling chunks:
Recipes provide directions for reconstructing files from chunks.

Deduplication
Reassembling chunks:
Recipes provide directions for reconstructing files from chunks.
[Slide figure: a recipe = metadata + an ordered list of SHA-1 fingerprints, each pointing to a data block]
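Recipe-based reassembly can be sketched on top of a toy chunk store (the names and dict structure are illustrative, not a real system):

```python
import hashlib

chunk_store = {}  # toy CAS for chunks: fingerprint -> chunk

def store_file(file_chunks):
    """Store each chunk by fingerprint; return the recipe."""
    recipe = []
    for c in file_chunks:
        fp = hashlib.sha1(c).hexdigest()
        chunk_store.setdefault(fp, c)   # duplicate chunks are stored once
        recipe.append(fp)
    return recipe                        # ordered fingerprints = the recipe

def load_file(recipe):
    """Reassemble the file by following the recipe in order."""
    return b"".join(chunk_store[fp] for fp in recipe)
```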

[Slide figure: the recipe (metadata + SHA-1 fingerprints) has its own fingerprint de9f2c7fd25e1b3a...: is it recipe or data?]


Deduplication
The Index:
A SHA-1 fingerprint uniquely identifies data, but the index translates fingerprints to chunks:
sha-1_1 → chunk_1, sha-1_2 → chunk_2, sha-1_3 → chunk_3, ..., sha-1_n → chunk_n
chunk_i = {location, size?, refcount?, compressed?, ...}

Deduplication
The Index:
For small chunk stores:
- database, hash table, tree
For a large index, legacy data structures won't fit in main memory, so each index query requires a disk seek. Why? SHA-1 fingerprints are independent and randomly distributed: there is no locality.
This is known as the index disk bottleneck.

Deduplication
The Index:
Back of the envelope:
- Average chunk size: 4KB
- Fingerprint: 20B
- 20TB unique data → 100GB of SHA-1 fingerprints
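The back-of-the-envelope arithmetic checks out (using binary units):

```python
CHUNK = 4 * 1024             # 4 KB average chunk
FP = 20                      # 20 B SHA-1 fingerprint
DATA = 20 * 1024 ** 4       # 20 TB of unique data

num_chunks = DATA // CHUNK   # ~5.4 billion chunks
index_size = num_chunks * FP
print(index_size // 1024 ** 3, "GB")   # 100 GB
```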

Deduplication
Disk bottleneck:
Data Domain strategy:
- filter unnecessary lookups
- piggyback useful work onto the disk lookups that are necessary
Memory: Summary Vector, Locality Preserving Cache
Disk: Stream Informed Segment Layout (Containers)

Deduplication
Disk bottleneck:
Summary vector: a Bloom filter (any AMQ data structure works).
Filter properties:
- No false negatives: if a fingerprint is in the index, it is in the summary vector
- Tuneable false positive rate: we can trade memory for accuracy
Note: on a false positive, we are no worse off; we just do the disk seek we would have done anyway.
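A minimal Bloom filter sketch showing the no-false-negatives property (the sizes and the hash construction are arbitrary illustrative choices, not Data Domain's):

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int = 1 << 16, k: int = 3):
        self.m, self.k = m, k            # m bits, k hash functions
        self.bits = bytearray(m // 8)

    def _positions(self, item: bytes):
        # derive k bit positions by salting a SHA-1 hash (a toy construction)
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # may return a false positive, but never a false negative
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```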


Deduplication
Disk bottleneck:
Stream informed segment layout (SISL):
- variable-size chunks are written to fixed-size containers
- chunk descriptors are stored in a list at the head: “temporal locality” for hashes within a container
Principle: backup workloads exhibit chunk locality.


Deduplication
Disk bottleneck:
Locality Preserving Cache (LPC):
- LRU cache of candidate fingerprint groups
[Slide figure: an on-disk container's chunk descriptors (CD1, CD2, CD3, ...) cached as a group]
Principle: if you must go to disk, make it worth your while.

Deduplication
Disk bottleneck:
Read request for a chunk fingerprint:
1. Fingerprint in Bloom filter? If no, the chunk is new: no lookup necessary.
2. Fingerprint in LPC? If yes, done: no disk seek.
3. Otherwise, do an on-disk fingerprint index lookup to get the container location, read the data from the target container, and prefetch the fingerprints from the head of the target data container.
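The lookup flow above can be sketched as a single function (the index and container types below are toy stand-ins, not Data Domain's actual interfaces):

```python
def lookup(fp, bloom, lpc, disk_index, load_container):
    """Data Domain-style chunk lookup path (toy sketch)."""
    if fp not in bloom:
        return None                # definitely new: skip the disk entirely
    if fp in lpc:
        return lpc[fp]             # LPC hit: no disk seek
    loc = disk_index.get(fp)       # the expensive on-disk index lookup
    if loc is None:
        return None                # Bloom filter false positive
    for g, meta in load_container(loc).items():
        lpc[g] = meta              # prefetch the whole container's fingerprints
    return lpc[fp]
```

Subsequent lookups for fingerprints stored in the same container then hit the LPC, which is the point of the stream-informed layout.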

Deduplication
Summary: Dedup and the 4 W's
Dedup goal: eliminate repeat instances of identical data.
- What (granularity) to dedup?
- Where to dedup?
- When to dedup?
- Why dedup?

Deduplication
Summary: Dedup and the 4 W's
What (granularity) to dedup?
- Fixed offsets: low overheads; ease of implementation, selective caching, synchronization; suffers from the boundary shifting problem.
- Sliding window (content-defined chunking): avoids boundary shifting, best dedup; latency, CPU intensive.

Deduplication
Summary: Dedup and the 4 W's
Where to dedup?
- Source: dedup before sending data over the network. + save bandwidth; - client complexity; - trust clients?
- Destination: dedup at the storage server. + server more powerful; + centralized data structures
- Hybrid: client index checks membership, server index stores location

Deduplication
Summary: Dedup and the 4 W's
When to dedup?
- Inline (Data → Dedup → Disk): + never store duplicate data; + save I/O for duplicate data; - slower (index lookup per chunk)
- Post-process (Data → Disk → Dedup): + faster (stream long writes, reclaim in the background); - temporarily wasted storage; - may create (even more) fragmentation
- Hybrid: post-processing is faster for initial commits; switch to inline to take advantage of I/O savings

Deduplication
Why dedup?
Perhaps you have a loooooot of data.
- enterprise backups
Or data that is particularly amenable to deduplication.
- small or incremental changes
- data that is not encrypted or compressed
Or data that changes infrequently.
- blocks are immutable: there is no such thing as a “block modify”
- the rate of change determines container chunk locality
Ideal use case: “Cold Storage”

Deduplication
Why dedup?
Perhaps your bottleneck isn't the CPU.
- Use dedup if you can favorably trade other resources, e.g., CPU for bandwidth.
Example: “A Protocol-Independent Technique for Eliminating Redundant Network Traffic” dedups packet data using shared caches and FIFO packet stores at both ends of a link.


Other CAS Applications
Data verification: CAS can be used to build tamper-evident storage. Suppose that:
- you can't fix a compromised server,
- but you never want to be fooled by one.
Insight: Fingerprints uniquely identify data.
- hash before storing data, and save the fingerprint locally
- rehash the data and compare fingerprints upon receipt
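The insight can be sketched directly (a toy check, not a full tamper-evident protocol):

```python
import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

# Before handing data to an untrusted server, save its fingerprint locally.
local_fp = fingerprint(b"important document")

def verify(received: bytes, expected_fp: str) -> bool:
    # Rehash on receipt: any modification changes the fingerprint.
    return fingerprint(received) == expected_fp
```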
