Curing The Ailments Of Nanometer CMOS Through Self-Healing And Resiliency

Transcription

Curing the Ailments of Nanometer CMOS through Self-Healing and Resiliency
Jan M. Rabaey
Director, Gigascale Silicon Research Center
Co-Director, Berkeley Wireless Research Center
University of California at Berkeley
SOCC, Sept. 25, 2006

The Silicon Age Still on a Roll, But ...
(2003 ITRS Roadmap)
Rows, by high-volume manufacturing year and technology node (nm):
- Integration Capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256
- Delay (CV/I) scaling: 0.7, 0.7, 0.7 (delay scaling will slow down)
- Energy/Logic Op scaling: 0.35, 0.5, 0.5 (energy scaling will slow down)
- Bulk Planar CMOS: High Probability, then Low Probability
- Alternate, 3G etc: Low Probability, then High Probability
- Variability: Medium, High, Very High
- ILD (K): 3, 3, reduce slowly towards 2-2.5
- RC Delay: 1, 1, 1, 1, 1, 1, 1, 1
- Metal Layers: 6-7, 7-8, 8-9, then 0.5 to 1 layer per generation
... Some Major Hurdles on The Way!

The Challenges of the Next Decade(s)
- The Physics and Manufacturing Challenges
  – A whole slew of static and dynamic variations and error mechanisms
- The Design Introduction Challenge
  – Complexity, risk, time, cost
- The n-furcation of the Market

Variations Becoming ...
[Figures: normalized frequency vs. normalized leakage (Isb) at 130nm, with a 30% frequency spread and a 5X leakage spread; die temperature (C) map over X and Y; feature size (micron) plot with labels 180nm, 130nm, 13nm EUV and "Gap"]
Design becoming “statistical”
- makes verification substantially harder
- challenging synchronization strategies
- “error-free” design untenable
Courtesy: Shekhar Borkar, Intel

Just One Example of Where We are Going
[Figures: VT Variation – Short/Narrow; VT Variation – Long/Wide]
Courtesy: Colin McAndrew, Freescale

Variations Come in Many Different Flavors
Also, local versus global, correlated versus random, temporal versus spatial
Different sources lead to different solutions

Variations Become Indistinguishable from Failure
Source: K. Nowka, IBM

Failures Becoming More Prominent
- Electromigration (weak-defective interconnects)
- Transient faults due to cosmic rays & alpha particles (increase exponentially with number of devices on chip)
- Time-Dependent Dielectric Breakdown (TDDB) (ultra-thin gate oxides)
- Manufacturing defects that escape testing
- Transistor reliability (inefficient burn-in testing)
[Figure: transistor lifetime (years), with labels "Now" and "Higher Transistor Leakage"]
... just more complexity
Courtesy: T. Austin

Failures Becoming More Prominent
Erratic bit failures in memories caused by temporary trapped charges

Complexity
[Timeline figure: 2000, 2005, 2010, Beyond, The far beyond; labels include "Dealing with variations and ..." and "Fully structured and regular fabrics"]

Curing the Nanometer Ailments
- Regularity and Structure
- Self-Healing
- Error-Resiliency
- Embracing Randomness
Regular implementation fabrics
- Absolutely required for manufacturability
- Driven by photo-lithography and eventually self-assembly constraints
- Also for variability, reliability, and time-to-market

Regular Fabrics – A Plethora of Choices
- VPGA (CMU)
- River PLA (Berkeley)
- FPGA
- Structured ASIC (e.g. LSI RapidChip)
Trade-off between area, performance, power and time-to-market (factors 5 to 10)

Regular Fabrics - Example
CMU Regular Logic Bricks
- Standard-cell library with fewer ( 10), coarser, configurable (w/ vias), micro-regular brick layouts that exhibit macro-regularity when assembled at chip-level
[Figures: 2-D FFT plots of poly-Si patterns showing ASIC “spatial” regularity and brick “spatial” regularity]
[Courtesy: Larry Pileggi, Andrzej Strojwas, CMU – C2S2]

CMU Regular Logic Bricks
[Courtesy: Larry Pileggi, Andrzej Strojwas, CMU – C2S2]

Curing the Nanometer Ailments
- Regularity and Structure
- Self-Healing
- Error-Resiliency
- Embracing Randomness
Self-Healing Architectures
- On-chip test and diagnostics used to correct for variations and stress
- Static and dynamic

Self-Healing
- Introduce sensors that monitor key aspects of system
  – Manufacturing and environmental conditions: process variations, temperature, voltage, activity, etc
  – Key properties that accelerate failure mechanisms
- Employ system-level intelligent control to reduce stress
  – Temperature control via resource assignment
  – Active management of voltage-reliability trade-offs
- Utilize tuning and healing to alleviate reliability threats
  – NBTI reversal
  – In-field clock tuning
Courtesy: T. Austin

Test Moving On-Line
- On-chip resources used to minimize test cost
- Also available for dynamic re-evaluation and adaptation
[Figure: diagnostic test program and response map in on-chip memory, CPU, VCI bus interface, master wrapper]
[Figures: 90 nm Itanium on-chip noise samplers; on-chip leakage sensor]

Adaptive Biasing Using On-Line Test
Dynamically adjust supply and threshold design parameters to center the design in the presence of process variations!
[Figure: a test module applies test inputs to the module and reads responses against Tclock, then sets Vdd and Vbb]
[Plot: energy-performance trade-off, Eswitching (fJ) vs. path delay (ps); curves: adaptive tuning; worst case, w/o Vth tuning; nominal, w/ Vth tuning; annotation "10x"]
Easier Again in Regular Fabrics
Courtesy: K. Cao, Berkeley
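
The tuning loop described on this slide can be sketched roughly as below. This is only an illustration under stated assumptions, not the presenters' method: the function `tune_body_bias`, the bias steps in millivolts, and the linear delay models `slow_die` and `fast_die` are hypothetical stand-ins for what the on-line test module would actually measure.

```python
def tune_body_bias(measure_delay_ps, t_clock_ps, vbb_steps_mv):
    """Pick the first body-bias setting whose measured path delay meets the clock.

    vbb_steps_mv is ordered from most reverse bias (lowest leakage) to most
    forward bias (fastest), so the first passing setting is the cheapest one.
    Returns None if no setting meets timing.
    """
    for vbb in vbb_steps_mv:
        if measure_delay_ps(vbb) <= t_clock_ps:
            return vbb
    return None


# Hypothetical dies (assumption): forward bias (positive mV) shortens delay.
def slow_die(vbb_mv):
    return 1100 - vbb_mv // 2


def fast_die(vbb_mv):
    return 900 - vbb_mv // 2


STEPS_MV = [-200, -100, 0, 100, 200, 300]
slow_setting = tune_body_bias(slow_die, 1000, STEPS_MV)  # needs forward bias
fast_setting = tune_body_bias(fast_die, 1000, STEPS_MV)  # tolerates reverse bias
```

In this toy model the slow die ends up forward biased and the fast die reverse biased, which is the "centering" effect the slide refers to.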

Adaptive (Body) Biasing
[Figure with dimension label: 5.3 mm]
Courtesy: P. Gelsinger and S. Borkar, Intel (DAC04)

Dynamic Resource Allocation
In the Interconnect Space
- Use routing throttling to perform thermal management
- ThermalHerd (L.S. Peh, Princeton)
In the Multi-Processor Space
- Compiler combines load assignment with DVS
- More savings with more processors!
- mdl group at PSU
[Plot: normalized energy vs. number of processors (2, 4, 8, 16, 32) for several benchmarks]

Rejuvenation
Negative Bias Temperature Instability
Source: D. Blaauw, UMich

Curing the Nanometer Ailments
- Regularity and Structure
- Self-Healing
- Error-Resiliency
- Embracing Randomness
Redundancy Galore
The only way to provide true error-resiliency!
With billions of transistors, overhead factors of 2 to 3 are reasonable if leading to 100% yield, supreme performance, or new applications.

Error-Resilient Systems
Incorporate facilities to push through system faults
- Error detection technologies
  – Systems checkers, online testing, continuous functional verification
- Fault diagnosis
  – Fine-grained testing, online testing
- System state recovery
  – Microarchitectural checkpointing, algorithmic tolerance
- Physical repair
  – Sparing, TMR
Courtesy: T. Austin
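
As a small illustration of the TMR (triple modular redundancy) item above, the sketch below runs three replicas of a function and takes a bitwise 2-of-3 majority. The replica functions and the injected fault are hypothetical, chosen only to show the voter masking a single faulty module.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant module outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(modules, x):
    """Run three replicas of the same function and vote on their results."""
    r0, r1, r2 = (m(x) for m in modules)
    return majority_vote(r0, r1, r2)


def good_incrementer(x):
    return (x + 1) & 0xFF


def faulty_incrementer(x):
    # Hypothetical defect: bit 4 of the output is always flipped.
    return ((x + 1) & 0xFF) ^ 0x10


result = tmr([good_incrementer, faulty_incrementer, good_incrementer], 41)
```

The voter masks any fault confined to one replica; two replicas failing on the same bit would defeat it.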

A Gradual Introduction Process
Example: Aggressive Deployment using “Razor”
A “pseudo-synchronous” approach to address process variations and power minimization with minimal overhead by combining circuit and architectural techniques
[Figure: Razor flip-flop with a main DFF on clk, a second sample on the delayed clock clk_del, and an error output]
[Plot: energy vs. supply voltage, showing total energy, recovery energy and the optimal voltage]
[Figure: “razored pipeline” with Razor FFs between the IF, ID, EX, MEM and WB (reg/mem) stages, per-stage error signals, bubble, recover and flush ID control, PC, and a Stabilizer FF]
Courtesy: T. Austin, D. Blaauw, Michigan
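
The detection idea behind the Razor flip-flop can be sketched behaviorally as follows. This is a simplified model under assumptions, not the Michigan circuit: the names `razor_sample`, `t_clk` and `t_shadow` are hypothetical, the shadow sample is assumed to always see the settled value, and recovery is modeled as a flat one-cycle penalty.

```python
from dataclasses import dataclass


@dataclass
class RazorResult:
    value: int           # value forwarded to the next stage after any recovery
    error: bool          # True when the main and shadow samples disagree
    penalty_cycles: int  # recovery cost in this simplified model


def razor_sample(old: int, new: int, path_delay: float,
                 t_clk: float, t_shadow: float) -> RazorResult:
    """Model one Razor flip-flop capturing a signal that changes old -> new."""
    if path_delay > t_clk + t_shadow:
        # Outside the detection window; the design must rule this case out.
        raise ValueError("path delay exceeds the shadow sampling window")
    main = new if path_delay <= t_clk else old  # main FF may latch stale data
    shadow = new                                # delayed sample sees settled data
    error = main != shadow
    return RazorResult(value=shadow, error=error,
                       penalty_cycles=1 if error else 0)


on_time = razor_sample(old=0, new=1, path_delay=0.9, t_clk=1.0, t_shadow=0.3)
late = razor_sample(old=0, new=1, path_delay=1.2, t_clk=1.0, t_shadow=0.3)
```

Lowering the supply voltage lengthens `path_delay`; the trade-off on the slide is between the energy saved and the recovery penalty paid each time `error` is raised.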

Example 2: Minimizing standby leakage in SRAMs
The Memory Data-Retention Voltage (DRV)
[Figure: SRAM cell with transistors M1 to M6, storage nodes V1 and V2, supply VDD, and leakage currents]
DRV Condition: $\left.\frac{\partial V_1}{\partial V_2}\right|_{\text{Left inverter}} = \left.\frac{\partial V_1}{\partial V_2}\right|_{\text{Right inverter}}$, when $V_{DD} = DRV$
When Vdd scales down to DRV, the Voltage Transfer Curves (VTC) of the internal inverters degrade to such a level that the Static Noise Margin (SNM) of the SRAM cell reduces to zero.
[Plot: VTC of SRAM cell inverters, V2 (V) vs. V1 (V), curves VTC1 and VTC2 at VDD = 0.4V and VDD = 0.18V]
Source: Huifang Qin, ISQED 2004

The Impact of Process Variations
[Plot: histogram of 32K SRAM cells vs. DRV (mV)]
[Plot: DRV spatial distribution (256*128 cells)]
130 nm CMOS

Supply based tradeoff
[Figure: Data in, Error Control Code, SRAM, Data out; supply vS over the standby time from t = 0 to t = Tst]
Goal: Minimize power/bit

Power tradeoff with ECC
ECC saves standby power, at the expense of time and area overhead
- Hamming [31, 26, 3] achieves 33% power saving
- Reed-Muller [256, 219, 8] achieves 35% power saving
[Plot: minimum standby time to achieve power savings]

Prototype Design
- Error tolerant SRAM optimized for ultra-low voltage standby
- Selected implementation: Hamming [31, 26, 3]
- 50% cell design overhead
- 19% parity overhead
- Tapeout: May 2006
[Die figure, 1.1mm x 1.1mm: original mem 1024x26, enc, customized 1024x31, dec]
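
To make the Hamming [31, 26, 3] choice concrete, here is a minimal software sketch of a single-error-correcting Hamming code with 26 data bits and 5 parity bits (5/26 is about 19%, matching the parity overhead quoted above). The textbook layout with parity bits at positions 1, 2, 4, 8 and 16 is an assumption; the bit ordering of the actual prototype is not given in the slides.

```python
N, K = 31, 26
PARITY_POS = (1, 2, 4, 8, 16)  # textbook layout (assumed, not from the slides)


def encode(data):
    """Encode a 26-bit integer into a list of bits indexed 1..31 (index 0 unused)."""
    if not 0 <= data < (1 << K):
        raise ValueError("data must fit in 26 bits")
    code = [0] * (N + 1)
    bit = 0
    for pos in range(1, N + 1):
        if pos not in PARITY_POS:
            code[pos] = (data >> bit) & 1
            bit += 1
    for p in PARITY_POS:
        parity = 0
        for pos in range(1, N + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity  # makes the parity over all positions covering p even
    return code


def syndrome(code):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, N + 1):
        if code[pos]:
            s ^= pos
    return s


def decode(code):
    """Correct at most one flipped bit; return (data, corrected_position_or_0)."""
    code = list(code)
    s = syndrome(code)
    if s:
        code[s] ^= 1
    data, bit = 0, 0
    for pos in range(1, N + 1):
        if pos not in PARITY_POS:
            data |= code[pos] << bit
            bit += 1
    return data, s


word = 0x2ABCDEF
stored = encode(word)
stored[13] ^= 1                      # one cell loses its value in standby
recovered, fixed_at = decode(stored)
parity_overhead_percent = round(100 * (N - K) / K)
```

With minimum distance 3 the code corrects one error per 31-bit word; two errors in the same word would be miscorrected.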

“Aggressive” Deployment At the Algorithm Level
Voltage overscale Main Block. Correct errors using Estimator.
Power savings 3X!
[Figure: x[n] feeds the Main Block (output ya[n]) and the Estimator (output ye[n]); a threshold Th selects the final output ŷ[n]]
[Plot: power vs. voltage, showing PTOT, PEC, Pmain and the energy savings]
Courtesy: N. Shanbhag, Illinois
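
The block diagram suggests a per-sample decision between the main block output ya[n] and the estimator output ye[n]. A minimal sketch, assuming the commonly described rule "keep ya[n] unless it differs from ye[n] by Th or more" (the exact rule, the sample values and the threshold below are assumptions, not taken from the slide):

```python
def ant_correct(ya, ye, th):
    """Algorithmic noise tolerance: fall back to the estimate on large deviations.

    ya: outputs of the voltage-overscaled main block (occasionally grossly wrong)
    ye: outputs of a low-precision but reliable estimator
    th: decision threshold
    """
    return [a if abs(a - e) < th else e for a, e in zip(ya, ye)]


# Hypothetical data: timing errors under overscaling tend to hit high-order bits,
# so erroneous samples are far from the estimate and easy to detect.
ya = [100, 102 + 256, 98, 101 - 128]
ye = [96, 104, 96, 100]
y_hat = ant_correct(ya, ye, th=16)
```

Error-free samples keep the full precision of the main block; erroneous samples degrade only to the precision of the estimator.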

Leveraging resiliency to increase value
Low power motion estimation architecture using Algorithmic Noise Tolerance (Shanbhag, UIUC)
Up to 71% energy reduction demonstrated
[Images: error-free, with errors, error-corrected]

Moving the Verification on the Chip
Self-checking processor
- Core function validated by checker
- Checker relaxes burden of correctness on core processor
- Core does the heavy lifting, removes hazards that could slow the simple checker
[Figure: performance core (IF, ID, REN, REG, SCHEDULER, EX/MEM) sends speculative instructions in-order with PC, inst, inputs, addr to the correctness checker (CHK, CT); die areas: Alpha 21264 205 mm2, REMORA checker 12 mm2]
Courtesy: Todd Austin, Univ. of Michigan

“On-Line X”
(X = Verification, Test, Tuning, Reliability, Resource, Power and Leakage Management)
“Turning lemons into lemonade” (T. Austin)
From Design Time to Run Time
Yield Improvement!

Runtime Validation of Multithreaded Processors
[Figure: SMT processor with a hardware synchronization unit, context status register, and per-thread retired instructions dispatched to DIVA checker processors and the register file; correctness properties checked: intra-thread control flow, intra-thread data flow; "Coordinated Forward Error ..."]
[Plot: runtime validation configurations for FFT, LU, CHOLESKY, BARNES, FMM, WATER-NSQUARED and WATER-SPATIAL at fault rates 1/1K and 1/1M]
Courtesy: S. Malik, Princeton

BulletProof Silicon – The Next Generation
Goal: Single-defect tolerance for 5% area overhead
Approach:
1. Execute and protect state
2. Test concurrently when Hw idle
3. If tests fail: roll back state, disable component, restart
Key ideas:
- No expensive computation checking
- Protect computation and test Hw
- Repair by disabling redundant parts
CIRCUIT ENVELOPE
  – logic-level testing and reconfiguration
ARCHITECTURAL ENVELOPE
  – Check-pointing and epoch restore
[Figure: µprocessor pipeline (IF, ID, EX, MEM, WB) with checker and BIST; speculative and non-speculative state separated at epoch boundaries; reconfiguration]
Courtesy: Austin, Bertacco, U. Mich
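
The three-step approach above (execute and protect state, test, then roll back, disable and restart on failure) can be sketched in software as below. This is a toy model under assumptions: `run_epochs`, the `ALUS` table and the self-test are hypothetical, and the test runs at each epoch boundary rather than concurrently in idle cycles as on the slide.

```python
import copy


def run_epochs(state, epochs, components, self_test):
    """Run epochs with checkpointing; disable failing components and retry.

    state: dict of architectural state, mutated in place by each epoch
    epochs: list of callables epoch(state, active_components)
    self_test: callable component -> bool (True when the component passes)
    """
    active = list(components)
    i = 0
    while i < len(epochs):
        checkpoint = copy.deepcopy(state)      # protect non-speculative state
        epochs[i](state, active)               # speculative execution
        failed = [c for c in active if not self_test(c)]
        if failed:
            state.clear()
            state.update(checkpoint)           # roll back to the epoch boundary
            active = [c for c in active if c not in failed]  # disable component
            if not active:
                raise RuntimeError("no working components left")
            continue                           # restart the same epoch
        i += 1                                 # tests passed: commit the epoch
    return state, active


# Hypothetical hardware: alu1 has a defect and adds 2 instead of 1.
ALUS = {"alu0": lambda x: x + 1, "alu1": lambda x: x + 2}


def increment_epoch(state, active):
    state["acc"] = ALUS[active[-1]](state["acc"])


def alu_self_test(name):
    return ALUS[name](1) == 2


final_state, working = run_epochs({"acc": 0}, [increment_epoch] * 3,
                                  ["alu0", "alu1"], alu_self_test)
```

Note that no result is ever compared against a reference: correctness rests on testing the hardware and discarding any epoch that ran on hardware later found to be defective.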

BulletProof Router
Exploit the properties of the CMP switch design to provide end-to-end error detection and recovery
  – Enhance switch output channels with CRC checkers
  – Split flits into two parts and route them independently using different resources
  – Add a Recovery Pointer to each input buffer
  – On Error
    - All CRC checkers drop outgoing packets
    - Switch pipeline is flushed
    - Head pointers are set to recovery pointers
    - Restart execution
[Figure: input buffers with Tail, Head and Recovery pointers holding flits a to e (a: correctly routed flit; b, c: in the switch pipeline; d: next flit to be routed; e: last flit buffered); routing logic, VC state, switch arbiter, cross-bar controller, cross-bar, CRC checkers on routed flits, error detection signal, switch recovery]
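
A rough software sketch of the detect-and-replay idea follows: a CRC checker at the switch output rejects corrupted flits, and the sender replays from its recovery pointer. It is heavily simplified and assumption-laden: only one flit is in flight at a time, CRC-32 from Python's `zlib` stands in for whatever CRC the router uses, and `route_with_recovery` and `make_flaky_switch` are hypothetical names.

```python
import zlib


def attach_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the flit payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(flit: bytes) -> bool:
    """CRC checker at the switch output channel."""
    payload, tag = flit[:-4], flit[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == tag


def route_with_recovery(buffer, switch, max_retries=3):
    """Send buffered payloads through a possibly faulty switch, replaying on error."""
    delivered = []
    recovery = 0   # recovery pointer: first flit not yet delivered correctly
    retries = 0
    while recovery < len(buffer):
        out = switch(attach_crc(buffer[recovery]))
        if check_crc(out):
            delivered.append(out[:-4])
            recovery += 1          # advance only after a verified delivery
            retries = 0
        else:
            retries += 1           # drop the flit and replay from the pointer
            if retries > max_retries:
                raise RuntimeError("persistent switch fault")
    return delivered


def make_flaky_switch(bad_calls):
    """Hypothetical switch that flips one bit on the listed traversals."""
    calls = {"n": 0}

    def switch(flit: bytes) -> bytes:
        calls["n"] += 1
        if calls["n"] in bad_calls:
            return bytes([flit[0] ^ 0x01]) + flit[1:]
        return flit

    return switch


payloads = [b"head", b"body", b"tail"]
received = route_with_recovery(payloads, make_flaky_switch({2}))
```

A transient fault costs one replay; a fault that persists past `max_retries` is surfaced to the caller, where a real design would reconfigure around the defective resource.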

Towards malleable, resilient architectures
The Quest: Scaleable (hard and soft) architectures that provide flexible redundancy to accommodate systematic and random, static and dynamic errors while avoiding brittleness!

Curing the Nanometer Ailments
- Regularity and Structure
- Self-Healing
- Error-Resiliency
- Embracing Randomness
Maintaining a purely deterministic Boolean abstraction ultimately becomes untenable!
Maintaining our abstractions
Slowly abandon them !!

The Search for (New) Scaleable and Stackable Abstractions
Allow devices to make errors and use models-of-computation that tolerate them (signal processing, communication, coding, information theory)
An Interesting Case Study: The “Neural Network” MOC
[Figure: artificial neuron]
Properties:
- Works well on noisy signals
- Uses “soft” decisions
- Operates in the presence of failures of components and interconnections
Challenge: Limited scope. Works mostly for classification problems

Exploring the Yellow Brick Road
Humans
- 10-15% of terrestrial animal biomass
- 10^9 Neurons/"node"
- Since 10^5 years ago
Ants
- 10-15% of terrestrial animal biomass
- 10^5 Neurons/"node"
- Since 10^8 years ago
Easier to make ants than humans
“Small, simple, swarm”
Courtesy: D. Petrovic, UCB

Inspired by the Sensor Network Paradigm
- Artificial Skin
- Communication Backplanes
- Smart Surfaces
- Real-time Health Monitoring

Example: Collaborative Networks
- Large number of states/nodes
- ... non-deterministic links
- emergent behavior
- ... resilient to failure
Sensor Network-on-a-chip
Source: N. Shanbhag, D. Jones

SN-on-a-chip – A simple example
Estimators need to be independent for this scheme to be effective
A simple study: 2 different adders with voltage over-scaling
Source: N. Shanbhag, UIUC

Distributed Collaborative Systems on a Chip
Example: A configurable radio architecture based on collaborative autonomous entities
Array of locally-coupled cheap low-power oscillator-based units
- Known to exhibit complex, spontaneous pattern formation
- Operation mode selected through choice of coupling factors and operational nodes
[Figure: emerging pattern as a function of coupling factor]
Source: J. Roychowdhury, J. Rabaey

The Mechanical Radio
The Ultimate ULP Tunable Wireless Transceiver?
[Figure: wine-glass disk resonator, R = 32 μm, with support beams, input electrode, output electrode, coupling beam and anchor]
9 wine-glass disc oscillator-based GSM compliant oscillator
Source: C. Nguyen, UC Michigan

Transitioning to the Post-Silicon ...
... platforms that work under very low SNR, are non-deterministic, unpredictable and unreliable

Some Concluding Remarks
Formidable challenges over the next decades to dramatically alter design paradigms
- Variability and reliability to lead to novel micro-architectures and computational models
- Regularity and redundancy central tenets
The opportunities:
- Use the abundance of transistors to move the burden from pre- or post-manufacturing evaluation to on-line activities
- Gradual incorporation of error-resilient computational models

GSRC: The best answer to formidable challenges is a critical mass ...
[Figure: The GSRC System-Design Roadmap, from Now to the 2020's; labels include Concurrent and Core]

The GSRC Agenda
41 Faculty, 17 Institutions
Structured along the line of big challenges rather than technologies
Provokes multi-... out-of-the-box thinking
Themes: System Design, Core Framework, Resilient Systems, Concurrent Systems, Design Driver
Names on slide: S. Malik, K. Lutz, J. Rabaey, ASV, N. ..., T. Austin, W.M. Hwu, J. Wawrzynek

Thank you!
“Creativity is the ability to introduce order into the randomness of nature” - Eric Hoffer
The contributions of all the GSRC faculty to this presentation are greatly appreciated, as is the funding by the MARCO member companies and the US Government.
