Disaster Recovery - IEEE Computer Society

Transcription

IC2E 2018
Feasibility Study of Location-Conscious Multi-Site Erasure-Coded Ceph Storage for Disaster Recovery
Keitaro Uehara* (Hitachi, Ltd.)
Yih-Farn Robin Chen, Matti Hiltunen, Kaustubh Joshi, Richard Schlichting (AT&T Labs-Research)

Background

Introduction
- Software-Defined Storage (SDS) is emerging, and Ceph is one of the most popular open-source SDS projects.
- To achieve high availability for disaster recovery, erasure code is a key technology, but using erasure codes brings a performance drawback.
- We have studied the feasibility of using Ceph's flexible mechanisms to implement a storage system with both high availability and improved performance.

Assumption: 48 nodes in 4 data centers
[Diagram: four data centers (dc east, dc west, dc north, dc south), each holding 12 OSDs, plus Monitor daemons, for 48 nodes in total. OSD: Object Storage Device; MON: Monitor daemon.]

Assumptions and availabilities
Assumptions for failure probabilities:
- Node failure rate: once per 4.3 months
- Datacenter power outage: once per year
- Average disk lifetime: three years
- MTTR for node failure and DC power outage: one day
- Target availability: 99.999% (five nines)
Availability comparison between x3 replication and the 9+15 erasure code:
- Simultaneous node failures: x3 replication 99.774% (3-node failure across 3 DCs); 9+15 erasure code 100% (16-node failure)
- 1-DC node failures: x3 replication 99.978% (2-node failure in 1 DC); 9+15 erasure code 100% (7-node failure in 1 DC)
- 2-DC node failures: x3 replication 99.999% (1-node failure in 2 DCs); 9+15 erasure code 99.999% (2-node failure in 2 DCs)
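The slide does not spell out the availability model behind these numbers; the following is only a rough back-of-the-envelope sketch in Python, assuming independent node failures with per-node unavailability MTTR / (MTBF + MTTR) and ignoring datacenter outages and disk wear-out, so it will not reproduce the table exactly.

    from math import comb

    # Slide assumptions: a node fails about once per 4.3 months and takes
    # one day to repair.  (Simple independent-failure model; illustrative only.)
    MTBF_DAYS = 4.3 * 30.44
    MTTR_DAYS = 1.0
    p = MTTR_DAYS / (MTBF_DAYS + MTTR_DAYS)   # probability a given node is down

    def availability(n_chunks, tolerated, p):
        """P(at most `tolerated` of `n_chunks` chunk-holding nodes are down)."""
        return sum(comb(n_chunks, i) * p**i * (1 - p)**(n_chunks - i)
                   for i in range(tolerated + 1))

    print("x3 replication (3 copies, survives 2 losses):", availability(3, 2, p))
    print("9+15 erasure code (24 chunks, survives 15 losses):", availability(24, 15, p))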

Issue: Longer read latency in symmetric distribution
9+15 erasure code in symmetric distribution (6 chunks in each data center) on 4 data centers.
[Diagram: data chunks (D) and parity chunks (P) spread evenly over dc east, dc west, dc north, and dc south, so a user reading from one DC must fetch chunks from remote DCs. OSD: Object Storage Device; MON: Monitor daemon.]

Solution: Asymmetric/localized distribution
Instead of the symmetric distribution (6 chunks per data center), use the 9+15 erasure code in a localized distribution, where all 9 data chunks are placed in dc east.
[Diagram: all data chunks (D) placed in dc east, parity chunks (P) distributed over dc west, dc north, and dc south. D: Data Chunk; P: Parity Chunk.]
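A tiny Python sketch of why the localized layout helps read latency, under the assumption (confirmed by the iostat results later in the deck) that a normal read fetches exactly the k = 9 data chunks:

    # Count how many data chunks a reader in dc east has to fetch from
    # remote DCs (illustrative only; chunk counts taken from the slides).
    k = 9
    layouts = {
        "symmetric (6 chunks per DC)": 6,                 # at most 6 of the local chunks can be data
        "localized (all 9 data chunks in dc east)": 9,
    }
    for name, max_local_data in layouts.items():
        remote = k - min(k, max_local_data)
        print(f"{name}: at least {remote} data chunk(s) must come from remote DCs")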

Implementation with Ceph CRUSH map

Erasure Coded Pool on Ceph
9+3 erasure coded pool on 16 OSDs:
- An object is divided into data chunks (A, B, C, D, E, F, G, H, I), and parity chunks (x, y, z) are calculated.
- Each PG (Placement Group) has its own permutation of 12 (9+3) OSDs chosen from osd.0 to osd.15, for example [13, 10, 1, 4, 8, 12, 6, 3, 5, 15, 11, 2], [7, 14, 3, 9, 15, 0, 4, 10, 2, 6, 11, 13], or [0, 9, 12, 5, 2, 6, 13, 8, 4, 1, 15, 11].
- The first k OSDs hold the data chunks and the remaining m OSDs hold the parity chunks, so the object is scattered among the OSDs.
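A toy sketch (not Ceph's actual erasure-coding code) of how an object ends up scattered: split it into k data chunks, append m parity placeholders, and hand chunk i to the i-th OSD of the PG's permutation. The permutation below is the first one shown on the slide.

    def place_object(data: bytes, pg_osds: list, k: int, m: int):
        """Map each chunk of `data` to one OSD of the placement group."""
        stripe = -(-len(data) // k)                        # ceil(len / k)
        chunks = [data[i * stripe:(i + 1) * stripe].ljust(stripe, b"\0")
                  for i in range(k)]
        chunks += [b"<parity>"] * m                        # Ceph computes real parity here
        return {f"osd.{osd}": chunk for osd, chunk in zip(pg_osds, chunks)}

    pg = [13, 10, 1, 4, 8, 12, 6, 3, 5, 15, 11, 2]         # permutation from the slide
    for osd, chunk in place_object(b"ABCDEFGHI", pg, k=9, m=3).items():
        print(osd, chunk)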

Ceph CRUSH Map
- Ceph provides the CRUSH (Controlled Replication Under Scalable Hashing) map.
- It defines a hierarchy of multiple layers: Root (Default) > Data center > Host > Device (OSD). Here, dc east holds osd.0-osd.11, dc west osd.12-osd.23, dc north osd.24-osd.35, and dc south osd.36-osd.47.
- It defines "rule sets" for each pool that retrieve OSDs from the hierarchy in a recursive way to meet the replication requirements:
  - In x3 replication, 3 OSDs need to be chosen.
  - In the 9+15 erasure code, 24 OSDs need to be chosen.
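As a mental model of the hierarchy and rule sets (a toy Python sketch, not CRUSH's actual pseudo-random placement algorithm), a rule can be thought of as recursively picking buckets and then OSDs:

    import random

    # Hierarchy from the slide: one root, four datacenters, 12 OSDs each.
    hierarchy = {f"dc_{name}": [f"osd.{i}" for i in range(j * 12, (j + 1) * 12)]
                 for j, name in enumerate(["east", "west", "north", "south"])}

    def choose(n_dcs, osds_per_dc, seed=0):
        """Naive stand-in for a rule set: pick DCs, then OSDs inside each."""
        rng = random.Random(seed)
        dcs = rng.sample(sorted(hierarchy), n_dcs)
        return [osd for dc in dcs for osd in rng.sample(hierarchy[dc], osds_per_dc)]

    print(choose(3, 1))   # x3 replication: 3 OSDs spread over 3 DCs
    print(choose(4, 6))   # 9+15 erasure code, symmetric: 24 OSDs, 6 per DC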

Ceph CRUSH Map for EC with Primary Affinity
We define two different kinds of "root", treating the East DC as the primary DC:
- "primary east" includes only "dc east" (osd.0-osd.11).
- "secondary east" includes the other three DCs: dc west (osd.12-osd.23), dc north (osd.24-osd.35), and dc south (osd.36-osd.47).

Ceph CRUSH Map for EC with Primary Affinity
Define a rule set for the 9+15 EC:
- take the first 9 chunks from different hosts under "primary east"
- take 3 DCs from "secondary east", then take 5 hosts under each DC

    root primary_east {
        id -54
        alg straw
        hash 0
        item dc_east weight 12
    }
    root secondary_east {
        id -55
        alg straw
        hash 0
        item dc_west weight 12
        item dc_north weight 12
        item dc_south weight 12
    }
    rule primary_ec_ruleset {
        ruleset 2
        type erasure
        min_size 9
        max_size 48
        step set_chooseleaf_tries 5
        step take primary_east
        step chooseleaf indep 9 type host
        step emit
        step take secondary_east
        step choose firstn 3 type datacenter
        step chooseleaf indep 5 type host
        step emit
    }
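One way to sanity-check what this rule is meant to produce (a hypothetical Python helper, not part of Ceph): given a 24-OSD mapping, the first 9 entries should all sit in dc east and the remaining 15 should be split 5/5/5 over the other DCs. The device-ID ranges come from the hierarchy slide above.

    def dc_of(osd: int) -> str:
        """dc east: 0-11, dc west: 12-23, dc north: 24-35, dc south: 36-47."""
        return ["dc_east", "dc_west", "dc_north", "dc_south"][osd // 12]

    def matches_localized_rule(osds, k=9):
        data, parity = osds[:k], osds[k:]
        if any(dc_of(o) != "dc_east" for o in data):
            return False
        counts = {}
        for o in parity:
            counts[dc_of(o)] = counts.get(dc_of(o), 0) + 1
        return counts == {"dc_west": 5, "dc_north": 5, "dc_south": 5}

    print(matches_localized_rule(
        [0, 1, 2, 3, 4, 5, 6, 7, 8,                 # 9 data chunks in dc east
         12, 13, 14, 15, 16, 24, 25, 26, 27, 28,    # 5 chunks in dc west, 5 in dc north
         36, 37, 38, 39, 40]))                      # 5 chunks in dc south -> True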

Experiments of placement with crushtool

Tests of CRUSH Map with crushtool
- Ceph provides "crushtool", which enables users to test user-defined CRUSH maps without an actual Ceph cluster environment.
- It automatically produces 1024 patterns (by default) of object placement and shows statistics or bad mappings.

    crushtool -c test-crushmap.txt -o test-crushmap.bin
    crushtool -i test-crushmap.bin --test --rule 2 --num-rep 24 --output-csv
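To go from these commands to the per-device statistics on the next slides, one can also let crushtool print the raw mappings and tally them. A Python sketch, assuming the usual "CRUSH rule R x N [a,b,c,...]" line format of --show-mappings (adjust the regex if your crushtool version prints differently):

    import re
    import subprocess
    from collections import Counter

    proc = subprocess.run(
        ["crushtool", "-i", "test-crushmap.bin", "--test",
         "--rule", "2", "--num-rep", "24", "--show-mappings"],
        capture_output=True, text=True, check=True)

    usage = Counter()
    for match in re.finditer(r"\[([\d,\s]+)\]", proc.stdout + proc.stderr):
        usage.update(int(x) for x in match.group(1).replace(" ", "").split(",") if x)

    for osd in sorted(usage):          # device utilization over the 1024 patterns
        print(f"osd.{osd}: {usage[osd]}")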

crushtool test Results: Placement Information
[Table: for each placement group #, the chunk order (chunks 0-23) and the devices chosen. The first 9 chunks (k = 9 data chunks) land on devices 0-11, i.e. in dc east; the remaining 15 chunks (m = 15 parity chunks) land on devices 12-47, spread over dc west (device IDs 12-23), dc north (24-35), and dc south (36-47).]

crushtool test Results: Device Utilization
- Total number of objects stored on each device (OSD) over the 1024 patterns.
- In a symmetric distribution, 1024 * 24 / 48 = 512 would be the expected value per device.
- The first 12 devices (in the East DC) are chosen more often than the others due to primary affinity.
[Table: per-device object counts, grouped by datacenter (dc east: devices 0-11, dc west: 12-23, dc north: 24-35, dc south: 36-47).]

Experiments of I/O traffic with iostat

Experiments overview
Target:
- To confirm Ceph's behavior when reading erasure-coded objects under normal conditions, in particular whether parity chunks are always read or not.
Method (a measurement sketch follows this list):
- Aggregate "iostat" of the volumes on each physical host.
- Write a 50MB single object to a 9+3 erasure coded pool (on a VM).
- Flush caches (on both the VMs and the physical hosts).
- Read the 50MB single object back from the erasure coded pool.
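The paper aggregates iostat output on every physical host; as a self-contained stand-in, the Python sketch below diffs the cumulative counters in /proc/diskstats around the write/read step (using /proc/diskstats instead of iostat is an assumption made only to keep the example short):

    def disk_mb():
        """Cumulative (read MB, written MB) per block device from /proc/diskstats."""
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                dev = fields[2]
                sectors_read, sectors_written = int(fields[5]), int(fields[9])
                stats[dev] = (sectors_read * 512 / 1e6, sectors_written * 512 / 1e6)
        return stats

    before = disk_mb()
    input("Run the 50MB object write/read against the EC pool, then press Enter...")
    after = disk_mb()
    for dev, (r1, w1) in after.items():
        r0, w0 = before.get(dev, (0.0, 0.0))
        if r1 - r0 or w1 - w0:
            print(f"{dev}: read {r1 - r0:.1f} MB, written {w1 - w0:.1f} MB")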

Ceph data write sequence (1/2)
In the x3 replication case:
(1) Client write request. (2) Journal write on the primary OSD. (3) Replication to the secondary OSDs / (3') Ack. (4) Journal writes on the secondary OSDs.
Each OSD writes to the journal prior to the data, to reduce write latency while keeping durability.

Ceph data write sequence (2/2)
In the x3 replication case, x6 write traffic occurs:
(4') Data write on the primary OSD. (5) Ack. (6) Data writes on the secondary OSDs / (6') Ack. (7) Client write completion.
(Journal + Data) x (x3 replication) = 2 x 3, i.e. x6 write traffic.
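Putting the two sequence slides together with the expected-amount formula on the next slide gives a simple write-amplification comparison (a sketch; it assumes every chunk or replica is journaled once and written once):

    def write_amplification(k, m=None):
        """Bytes written to disk per byte of user data (journal + data)."""
        if m is None:            # replication: k full copies
            return 2 * k
        return 2 * (k + m) / k   # erasure code: (k+m)/k chunks per user byte

    print("x3 replication:", write_amplification(3))                   # 6x
    print("9+3 erasure code:", round(write_amplification(9, 3), 2))    # ~2.67x
    print("9+15 erasure code:", round(write_amplification(9, 15), 2))  # ~5.33x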

Experimental Results
- OSD Placement Group Map: [9, 5, 13, 1, 11, 2, 10, 14, 4, 15, 12, 3]
- Expected write amount: 50MB * (9+3) / 9 * 2 (Data + Journal) = 133.3 MB
- Expected read amount: 50MB (if only data chunks are read) or 66.7MB (if parity chunks are always read)
- Measured writes: roughly 10.7-10.9 MB on each of the 12 chunk OSDs (data and parity alike); non-related volumes see only negligible traffic.
- Measured reads: roughly 5.7-6.0 MB on each of the 9 data-chunk OSDs and 0 MB on the parity-chunk OSDs.
Writes are almost equally distributed across the data-chunk and parity-chunk OSDs.
All reads come from the data-chunk OSDs; no parity chunks are read.
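For reference, a quick check of the per-OSD expectations behind these measurements (pure arithmetic from the formulas on this slide):

    obj_mb, k, m = 50, 9, 3
    expected_write_total = obj_mb * (k + m) / k * 2       # data + journal
    expected_read_total = obj_mb                          # only data chunks are read
    print("total write:", round(expected_write_total, 1), "MB")                     # 133.3
    print("write per chunk OSD:", round(expected_write_total / (k + m), 1), "MB")   # ~11.1
    print("read per data-chunk OSD:", round(expected_read_total / k, 1), "MB")      # ~5.6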

Conclusion

Conclusion and Future Work
Conclusion:
- The experimental results show that our proposed erasure code placement can be applied to satisfy both high availability and improved read performance.
Future Work:
- Deploy a large storage system in four geographically distant data centers based on the proposed erasure code scheme.

Ceph provides "crushtool", which enables users to test user-defined CRUSH maps without actual Ceph cluster environment. Automatically produce 1024 patterns (default) of object placement, and show statistics or bad-mappings. crushtool -c test-crushmap.txt -o test-crushmap.bin crushtool -i test-crushmap.bin --test --rule 2 --num .