ZBC/ZAC Support In Linux - Storage Networking Industry Association

Transcription

ZBC/ZAC Support in Linux
Damien Le Moal, Western Digital
2016 Storage Developer Conference. 2016 Western Digital Corporation. All Rights Reserved.

Outline
- Background: Shingled Magnetic Recording (SMR)
  - Device interface, standard and constraints on host software
- Linux kernel support
  - SCSI stack, block I/O stack, API
- Some evaluation results
  - File systems and device mapper
- Conclusion

Foreword and Acknowledgement
- ZBC/ZAC support in Linux is an ongoing effort
  - Mechanisms and API presented here may change in the final release
- This development is a community effort with many contributors
  - Dr Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, Damien Le Moal
  - And many others

Shingled Magnetic Recording (SMR)
- Conventional PMR HDD: data in discrete tracks. Capacity increase achieved with narrower tracks
- SMR HDD: data in zones of overlapped wider tracks

Higher Disk Capacity, And More!
- Higher (read) track density increases disk capacity, and more
  - Wider write head produces higher fields, enabling smaller grains and lower noise
  - Better sector erasure coding, reduced ATI exposure, and more powerful data detection and recovery

But
- While track zones are independent, sectors cannot be modified independently within a zone
  - Random reads similar to PMR
  - But sequential writes within a zone
- Disk firmware can hide or expose zones and write constraint
  - Standardized disk interface

SMR Standards
- Command set
  - T10 (SCSI) Zoned Block Commands (ZBC) and T13 (ATA) Zoned-device ATA Command Set (ZAC)
  - Both semantically identical
  - Latest drafts r05 forwarded to INCITS for processing towards publication
- SCSI to ATA translation (SAT) specifications updated
  - Draft in ballot review

Standardized Disk Models
- Drive Managed (DM)
  - Disk firmware handles random write processing
  - Backward compatible (standard device type 0h)
  - Performance can be unpredictable in some workloads
  - Impact on host software: NONE
- Host Managed (HM)
  - Host must use zone commands to handle write operations
  - Not backward compatible (device type 14h)
  - Predictable performance
  - Impact on host software: HIGH (host must write sequentially into zones)
- Host Aware (HA)
  - Disk firmware handles random write processing
  - Backward compatible (standard device type 0h)
  - Host can use zone commands to optimize write behavior
  - Performance can be unpredictable if the host sends a "sub-optimal" request
  - Impact on host software: NONE to HIGH (depends on the amount of optimization)

Standardized Zone Types
- Conventional zones
  - Unconstrained read & write operations
  - Optional for HA and HM
- Write pointer zones
  - HA: Sequential write preferred zones
    - Unconstrained read & write operations possible
  - HM: Sequential write required zones
    - Write operations must be sequential
    - No read after write pointer position

Host Disk View
[Figure: disk LBA range showing a conventional zone and sequential write zones (zone empty, partially written zone, zone full), with the write pointer, the no-read area, and a WRITE command marked.]

ZBC & ZAC Command Set
- 2 main commands
  - REPORT ZONES: get disk zone layout and zone status
    - Sequential zone write pointer position
  - RESET WRITE POINTER: "rewind" a sequential zone
    - Set write pointer at the beginning of the zone
- 3 additional commands for software optimization
  - OPEN ZONE: keep a zone's FW resources locked
  - CLOSE ZONE: release a zone's FW resources
  - FINISH ZONE: fill a zone

Linux Kernel: What Do We Have?
- As of kernel v4.7
  - ZAC command set and translation from ZBC implemented
    - But no ZBC support in the SCSI disk driver
  - SG_IO is the only interface available to issue ZBC commands
    - From applications only
  - Host aware drives are seen as regular block devices
    - No differentiation with regular disks
  - Host managed drives are exposed as SG node
    - No block device file

What Is Needed?
- API integrated in block I/O stack
- Respect read and write constraints
  - Ensure sequential write command ordering
  - No read after write pointer
- New device type support
  - Host managed
- ZBC and ZAC command set support
  - Zone information and on [...]
[Figure: I/O stack, top to bottom: Virtual File System, Page Cache, File System, Block Layer, Block I/O Scheduler, SCSI stack / libata, Zoned Disk.]

What Is Not Being Considered
- Hide HM sequential write constraint
  - No changes to page cache
    - Too complex
  - Responsibility of disk user (FS, device mapper or application)
- Natively support zoned devices in all file systems
  - Some are better suited than others
    - f2fs, nilfs, btrfs are good candidates
  - Device mapper for n [...]
[Figure: I/O stack, top to bottom: Virtual File System, Page Cache, File System, Block Layer, Block I/O Scheduler, SCSI stack / libata, Zoned Disk.]

Upper Block Layer
- I/O constraints require differentiation from regular block devices
  - Block device request queue is flagged as "zoned" with the device type (HA or HM)
  - A zone information cache is attached to the device request queue
    - On-the-fly I/O checks possible without needing a disk access for a zone report
    - Implemented as an RB-tree for efficiency

    struct blk_zone {
            struct rb_node  node;
            unsigned long   flags;
            sector_t        len;
            sector_t        start;
            sector_t        wp;
            unsigned int    type : 4;
            unsigned int    cond : 4;
            unsigned int    non_seq : 1;
            unsigned int    reset : 1;
    };

    unsigned int blk_queue_zoned(struct request_queue *q)
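The purpose of the zone information cache is to map a sector to its zone without a disk access. The slide uses an RB-tree; the sketch below deliberately swaps in a binary search over a sorted array to stay self-contained, and all names (zone_info, zone_lookup) are hypothetical rather than the patch set's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached zone descriptor; fields mirror start, len and wp above. */
struct zone_info {
    uint64_t start;  /* first sector of the zone */
    uint64_t len;    /* zone length in sectors */
    uint64_t wp;     /* cached write pointer position */
};

/*
 * Return the zone containing 'sector', or NULL if no zone covers it.
 * Assumes 'zones' is sorted by start sector and zones do not overlap.
 * A sorted array stands in for the RB-tree mentioned on the slide.
 */
static const struct zone_info *zone_lookup(const struct zone_info *zones,
                                           size_t nr_zones, uint64_t sector)
{
    size_t lo = 0, hi = nr_zones;

    assert(zones != NULL || nr_zones == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct zone_info *z = &zones[mid];

        if (sector < z->start)
            hi = mid;               /* sector is before this zone */
        else if (sector - z->start >= z->len)
            lo = mid + 1;           /* sector is after this zone */
        else
            return z;               /* sector falls inside this zone */
    }
    return NULL;
}
```

Both structures give a logarithmic lookup; a tree is the more natural fit when entries are inserted and updated incrementally, which is presumably why the slide chose it.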

Zoned Block Device API
- Zone information access
  - Cache only or with update from disk
- Zone manipulation
  - Reset write pointer, open zone, close zone, finish zone
- Upper block I/O layer communicates operations down to lower layers in the usual manner
  - Block I/O operation codes
- Functions: blk_lookup_zone, blkdev_report_zone, blkdev_reset_zone, blkdev_open_zone, blkdev_close_zone, blkdev_finish_zone
- Operation codes: REQ_OP_ZONE_REPORT, REQ_OP_ZONE_RESET, REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_FINISH

Lower Layers: SCSI Disk Driver
- Modified to create a zoned block device for HA and HM drives
  - Initializes zone cache
    - Fills zone information for entire LBA range
  - Zone report is outside of critical I/O path
    - Single threaded work queue
    - Avoid deadlocks and simplify error processing
- Request order is not modified
  - Ensure single threaded HBA request submission from dispatch queue to maintain user submission order
    - Unaligned write or read errors can be tracked to HBA problems

Lower Layers: Read & Write Processing
- All read and write requests in sequential zones are checked at dispatch time
  - Reads after the write pointer are not sent to the disk
    - Zero-out request buffer and return success
    - Avoids boot-time errors for HM disks (partition table read)
  - Writes not at the write pointer are failed without being sent to the disk
    - Write pointer position advanced in zone information cache for successfully checked write requests
- Requests completing with an error trigger a zone report execution
  - Update zone cache information with current disk state
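The dispatch-time checks described above can be sketched as follows. This is an illustration under stated assumptions, not the driver code: the names (seq_zone, check_read, check_write, dispatch_result) are hypothetical, a 512-byte sector is assumed, and failing a read that straddles the write pointer is a choice made here because the slide does not cover that case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u  /* assumed sector size for the sketch */

enum dispatch_result {
    DISPATCH_TO_DISK,       /* request passed the checks and goes to the disk */
    DISPATCH_COMPLETE_ZERO, /* read after write pointer: zero-filled, completed with success */
    DISPATCH_FAIL           /* request failed without being sent to the disk */
};

/* Hypothetical cached state of one sequential zone. */
struct seq_zone {
    uint64_t start;  /* first sector of the zone */
    uint64_t len;    /* zone length in sectors */
    uint64_t wp;     /* cached write pointer, absolute sector */
};

static enum dispatch_result check_read(const struct seq_zone *z, uint64_t sector,
                                       uint32_t nr_sectors, unsigned char *buf)
{
    assert(buf != NULL);
    if (sector >= z->wp) {
        /* Entirely after the write pointer: return zeroes, no disk access. */
        memset(buf, 0, (size_t)nr_sectors * SECTOR_SIZE);
        return DISPATCH_COMPLETE_ZERO;
    }
    if (sector + nr_sectors > z->wp)
        return DISPATCH_FAIL;   /* straddles the write pointer (assumption of this sketch) */
    return DISPATCH_TO_DISK;
}

static enum dispatch_result check_write(struct seq_zone *z, uint64_t sector,
                                        uint32_t nr_sectors)
{
    /* A write must start at the write pointer and fit in the zone. */
    if (sector != z->wp || nr_sectors > z->start + z->len - z->wp)
        return DISPATCH_FAIL;
    z->wp += nr_sectors;        /* advance the cached write pointer at dispatch time */
    return DISPATCH_TO_DISK;
}
```

Advancing the cached write pointer when the write is checked, rather than when it completes, is what lets several sequential writes to the same zone be queued back to back.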

Lower Layers: Zone Commands
- A minimal zone state machine is maintained with the zone cache
  - Zone condition: empty, open, closed, full
- Upper layer initiated zone operation requests trigger an update of the zone cache information at dispatch time
  - Before command completion
  - Consistent with command queueing and read/write checks
- Similarly to read & write errors, zone command failures trigger a zone report
  - Except for zone report itself, for obvious reasons
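A minimal zone state machine over the four conditions listed above might look like the sketch below. The transition rules are simplifying assumptions made for illustration (for example, an open request on a full zone is ignored), and the names are hypothetical; neither is taken from the ZBC/ZAC text or from the kernel patches.

```c
#include <assert.h>
#include <stdint.h>

enum zone_cond { COND_EMPTY, COND_OPEN, COND_CLOSED, COND_FULL };
enum zone_op { OP_RESET, OP_OPEN, OP_CLOSE, OP_FINISH };

/* Hypothetical cached zone state. */
struct zone_state {
    uint64_t start;       /* first sector of the zone */
    uint64_t len;         /* zone length in sectors */
    uint64_t wp;          /* cached write pointer, absolute sector */
    enum zone_cond cond;  /* cached zone condition */
};

/* Update the cached state for a zone operation, as done at dispatch time. */
static void zone_apply_op(struct zone_state *z, enum zone_op op)
{
    switch (op) {
    case OP_RESET:   /* rewind: write pointer back to the zone start */
        z->wp = z->start;
        z->cond = COND_EMPTY;
        break;
    case OP_OPEN:    /* assumption: a full zone stays full */
        if (z->cond != COND_FULL)
            z->cond = COND_OPEN;
        break;
    case OP_CLOSE:   /* assumption: closing an unwritten open zone returns it to empty */
        if (z->cond == COND_OPEN)
            z->cond = (z->wp == z->start) ? COND_EMPTY : COND_CLOSED;
        break;
    case OP_FINISH:  /* fill the zone: write pointer moves to the zone end */
        z->wp = z->start + z->len;
        z->cond = COND_FULL;
        break;
    }
}

/* Account for a successfully checked write of 'nr' sectors at the write pointer. */
static void zone_advance_wp(struct zone_state *z, uint64_t nr)
{
    assert(nr <= z->start + z->len - z->wp);
    z->wp += nr;
    z->cond = (z->wp == z->start + z->len) ? COND_FULL : COND_OPEN;
}
```

If a zone command fails on the disk, the cached state would be discarded and refreshed from a zone report, as the last bullet above describes; that recovery path is not modeled here.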

Block I/O Stack Final Overview
[Figure: block I/O stack, top to bottom. Application; file systems and device mapper, calling blkdev_report_zone, blkdev_reset_zone, blkdev_open_zone, blkdev_close_zone, blkdev_finish_zone; block layer with zone cache and workqueue, issuing REQ_OP_ZONE_REPORT, REQ_OP_ZONE_RESET, REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_FINISH; dispatch queue checks; SCSI layer; HBA; zoned disk.]

File Systems
- Work to natively support zoned block devices in file systems is also ongoing
  - f2fs and btrfs
- Basic problem to solve is common to both candidates
  - Block allocation and write block I/O issuing are not atomic
    - Sequential block allocation does not necessarily result in sequential writes
  - Some optimizations doing "update-in-place" must be disabled
    - Maintain sequential write pattern
  - Integration of zone reset on block reclaim

Device Mapper: dm-zoned
- Expose a zoned block device as a regular block device
  - Allows using any file system
- Uses conventional zones as "write buffer"
  - Aligned writes go straight to sequential zone
  - Random/unaligned writes are first staged to write buffer zones
    - Configurable number of buffer zones
    - Buffer zones must be reclaimed (rewritten to sequential zones)
  - Zone indirection table used to track write locations
    - Used for read processing
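The write routing and indirection table described above can be sketched for a single zone. This is a toy model, not the dm-zoned implementation: the names, the block-granularity table, the fixed sizes, and the reading of "aligned" as "landing exactly on the write pointer" are all assumptions made for illustration.

```c
#include <assert.h>

#define ZONE_BLOCKS  8     /* blocks per sequential zone (toy size) */
#define BUF_BLOCKS   2     /* blocks available in the buffer zone (toy size) */
#define NOT_BUFFERED (-1)

enum dz_target {
    DZ_SEQ_ZONE,      /* block is in, or goes to, the sequential zone */
    DZ_BUFFER_ZONE,   /* block is in, or goes to, a conventional buffer zone */
    DZ_UNWRITTEN,     /* block was never written */
    DZ_NEED_RECLAIM   /* buffer zone is full: the write must wait for reclaim */
};

/* Hypothetical per-zone mapping state. */
struct dz_zone {
    unsigned int wp;             /* write pointer of the sequential zone, in blocks */
    unsigned int buf_used;       /* next free block in the buffer zone */
    int buf_map[ZONE_BLOCKS];    /* indirection table: block -> buffer block, or NOT_BUFFERED */
};

static void dz_init(struct dz_zone *z)
{
    z->wp = 0;
    z->buf_used = 0;
    for (int i = 0; i < ZONE_BLOCKS; i++)
        z->buf_map[i] = NOT_BUFFERED;
}

/* Decide where a one-block write goes and update the indirection table. */
static enum dz_target dz_write(struct dz_zone *z, unsigned int block)
{
    assert(block < ZONE_BLOCKS);
    if (block == z->wp) {
        /* Lands on the write pointer: write straight to the sequential zone. */
        z->buf_map[block] = NOT_BUFFERED;
        z->wp++;
        return DZ_SEQ_ZONE;
    }
    if (z->buf_map[block] == NOT_BUFFERED) {
        if (z->buf_used == BUF_BLOCKS)
            return DZ_NEED_RECLAIM;
        z->buf_map[block] = (int)z->buf_used++;
    }
    /* Buffer zones are conventional, so a buffered block is overwritten in place. */
    return DZ_BUFFER_ZONE;
}

/* Reads consult the indirection table first, then the sequential zone. */
static enum dz_target dz_read(const struct dz_zone *z, unsigned int block)
{
    assert(block < ZONE_BLOCKS);
    if (z->buf_map[block] != NOT_BUFFERED)
        return DZ_BUFFER_ZONE;
    return block < z->wp ? DZ_SEQ_ZONE : DZ_UNWRITTEN;
}
```

The DZ_NEED_RECLAIM result corresponds to the situation where incoming writes have to wait until buffered blocks have been rewritten to sequential zones; the reclaim step itself is left out of the sketch.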

Performance Evaluation
- Patched 4.7 kernel base
- Focus on file systems
  - Native support file systems: f2fs, btrfs
  - Unmodified file systems with dm-zoned: ext4, XFS
- Comparison of ZBC enabled solutions with regular disk use
  - Same physical disk for all experiments
    - SAS 6 TB disk with regular firmware or "hacked" ZBC enabled firmware (256 MB zones with 1% of LBA space as conventional zones)
  - dbench scores are used as a performance metric

Dbench Results (1 client)
- Small score drop for native f2fs and btrfs
  - Loss of some optimizations leading to random writes
- dm-zoned cases show significantly higher scores
  - Short term benefits of pure sequential write pattern (reduced seek overhead)

Dbench Results (32 clients)
- Same small score drop observed for btrfs
  - f2fs improves
- dm-zoned cases advantage still present
  - Write pattern not changing with higher number of clients

dm-zoned High Duty-Cycle Performance
- Buffer zone reclaim has a cost under sustained write access
  - Incoming write operations must wait for buffer zone reclaim

Release Schedule
- Aiming for inclusion of block I/O stack changes into kernel 4.9
  - Stable release likely in December
  - May be delayed to 4.10 (February 2017)
    - 4.9 merge window rapidly approaching
- Following releases will likely see inclusion of support for file systems and ideally a device mapper
  - f2fs, btrfs, ...
  - dm-zoned, zdm, ...

Conclusion
- ZBC support plan is a compromise between simplicity and usability
  - Changes limited to the block I/O stack
    - Most within the SCSI disk driver
    - Critical areas such as the page cache are untouched
- Early work on file systems validated the overall architecture and API
  - Changes for native support mostly limited to ensuring sequential write submission
- Device mapper enables all that cannot easily be natively supported
  - Performance will depend on application

Thank you! Questions?
