Curing the Ailments of Nanometer CMOS through Self-Healing and Resiliency
Jan M. Rabaey
Director, Gigascale Silicon Research Center
Co-Director, Berkeley Wireless Research Center
University of California at Berkeley
SOCC, Sept. 25, 2006

The Silicon Age Still on a Roll, But Some Major Hurdles on The Way!
[Table: scaling outlook per generation, by High Volume Manufacturing year and Technology Node (nm)]
Integration Capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256
Delay (CV/I) scaling: 0.7, 0.7, 0.7 (delay scaling will slow down)
Energy/Logic Op scaling: 0.35, 0.5, 0.5 (energy scaling will slow down)
Bulk Planar CMOS: High Probability, later Low Probability
Alternate, 3G etc: Low Probability, later High Probability
Variability: Medium, High, Very High
ILD (K): 3, 3, reduce slowly towards 2-2.5
RC Delay: 1, 1, 1, 1, 1, 1, 1, 1
Metal Layers: 6-7, 7-8, 8-9 (0.5 to 1 layer per generation)
Source: 2003 ITRS Roadmap

The Challenges of the Next Decade(s)
The Physics and Manufacturing Challenges
– A whole slew of static and dynamic variations and error mechanisms
The Design Introduction Challenge
– Complexity, risk, time, cost
The n-furcation of the Market

Variations Becoming ...
[Figure: lithography gap (micron), with 180nm, 130nm and 13nm EUV marked]
[Figure: Normalized Frequency versus Normalized Leakage (Isb) at 130nm, annotated 30% and 5X]
[Figure: Temperature (C) across the die (X, Y)]
Design becoming “statistical”
– makes verification substantially harder
– challenging synchronization strategies
– “error-free” design untenable
Courtesy: Shekhar Borkar, Intel

Just One Example of Where We are Going
[Figure: VT Variation – Short/Narrow]
[Figure: VT Variation – Long/Wide]
Courtesy: Colin McAndrew, Freescale

Variations Come in Many Different Flavors
Also, local versus global, correlated versus random, temporal versus spatial
Different sources lead to different solutions

Variations Become Indistinguishable from Failure
Source: K. Nowka, IBM

Failures Becoming More Prominent
Electromigration (Weak-defective interconnects)
Transient Faults due to Cosmic Rays & Alpha Particles (Increase exponentially with number of devices on chip)
Time-Dependent Dielectric Breakdown (TDDB) (Ultra-thin gate oxides)
Manufacturing Defects That Escape Testing
Transistor Reliability (Inefficient Burn-in Testing)
Increased ... Higher Transistor Leakage
[Figure: Transistor Lifetime (years) axis, with “Now” marked]
... just more complexity
Courtesy: T. Austin

Failures Becoming More Prominent
Erratic bit failures in memories caused by temporary trapped charges

[Figure: Complexity over time (2000, 2005, 2010, Beyond, The far beyond); labels include “Dealing with variations and ...” and “Fully structured and regular fabrics”]

Curing the Nanometer Ailments
Regularity and Structure
Self-Healing
Error-Resiliency
Embracing Randomness

Regular implementation fabrics
Absolutely required for manufacturability
Driven by photo-lithography and eventually self-assembly constraints
Also for variability, reliability, and time-to-market

Regular Fabrics – A Plethora of Choices
VPGA (CMU)
River PLA (Berkeley)
FPGA
Structured ASIC (e.g. LSI RapidChip)
Trade-off between area, performance, power and time-to-market (factors 5 to 10)

Regular Fabrics - Example
CMU Regular Logic Bricks
Standard-cell library with fewer ( 10), coarser, configurable (w/ vias), micro-regular brick layouts that exhibit macro-regularity when assembled at chip-level
[Figure: ASIC “spatial” regularity, 2-D FFT plots of poly-Si patterns]
[Figure: Brick “spatial” regularity, 2-D FFT plots of poly-Si patterns]
[Courtesy: Larry Pileggi, Andrzej Strojwas, CMU – C2S2]

CMU Regular Logic Bricks
[Courtesy: Larry Pileggi, Andrzej Strojwas, CMU – C2S2]

Curing the Nanometer Ailments
Regularity and Structure
Self-Healing
Error-Resiliency
Embracing Randomness

Self-Healing Architectures
On-chip test and diagnostics used to correct for variations and stress
Static and dynamic

Self-Healing
Introduce sensors that monitor key aspects of system
– Manufacturing and environmental conditions: process variations, temperature, voltage, activity, etc.
– Key properties that accelerate failure mechanisms
Employ system-level intelligent control to reduce stress
– Temperature control via resource assignment
– Active management of voltage-reliability trade-offs
Utilize tuning and healing to alleviate reliability threats
– NBTI reversal
– In-field clock tuning
Courtesy: T. Austin

Test Moving On-Line
On-chip resources used to minimize test cost
Also available for dynamic re-evaluation and adaptation
[Figure: CPU with on-chip memory holding the diagnostic test program and response map, attached through a VCI bus interface master wrapper]
[Figure: 90 nm Itanium on-chip noise samplers]
[Figure: On-chip leakage sensor]

Adaptive Biasing Using On-Line Test
Dynamically adjust supply and threshold design parameters to center the design in the presence of process variations!
[Figure: Test Module applying test inputs and responses to the Module, controlling Tclock, Vdd and Vbb]
[Figure: Energy-performance trade-off, Eswitching (fJ) versus Path Delay (ps); curves: Adaptive Tuning; Worst Case, w/o Vth tuning; Nominal, w/ Vth tuning; annotated 10x]
Easier Again in Regular Fabrics
Courtesy: K. Cao, Berkeley

Adaptive (Body) Biasing
[Figure: die photo, 5.3 mm]
Courtesy: P. Gelsinger and S. Borkar, Intel (DAC04)

Dynamic Resource Allocation
In the Interconnect Space
– Use routing throttling to perform thermal management
– ThermalHerd (L.S. Peh, Princeton)
In the Multi-Processor Space
– Compiler combines load assignment with DVS
– More savings with more processors!
[Figure: Normalized Energy versus Number of Processors (2, 4, 8, 16, 32) for 3D, DFE, LU, SPLAT, MGRID and WAVE5; mdl group at PSU]

Rejuvenation
Negative Bias Temperature Instability
Source: D. Blaauw, UMich

Curing the Nanometer Ailments
Regularity and Structure
Self-Healing
Error-Resiliency
Embracing Randomness

Redundancy Galore
The only way to provide true error-resiliency!
With billions of transistors, overhead factors of 2 to 3 are reasonable if leading to 100% yield, supreme performance, or new applications.

Error-Resilient Systems
Incorporate facilities to push through system faults
Error detection technologies
– Systems checkers, online testing, continuous functional verification
Fault diagnosis
– Fine-grained testing, online testing
System state recovery
– Microarchitectural checkpointing, algorithmic tolerance
Physical repair
– Sparing, TMR
Courtesy: T. Austin

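As an illustration of the TMR (triple modular redundancy) entry above, here is a minimal sketch of a bitwise majority voter. The function names `tmr_vote` and `disagreeing_copies` are hypothetical and the slide does not specify any particular voter design; this only shows the general idea of masking a fault in one of three redundant results.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three redundant integer results."""
    # Each output bit is 1 when at least two of the three inputs have it set.
    return (a & b) | (b & c) | (a & c)


def disagreeing_copies(a, b, c):
    """Indices (0, 1, 2) of the copies that differ from the voted result."""
    voted = tmr_vote(a, b, c)
    return [i for i, value in enumerate((a, b, c)) if value != voted]


if __name__ == "__main__":
    good = 0b1010
    faulty = 0b0110  # hypothetical corrupted copy
    print(bin(tmr_vote(good, good, faulty)))
    print(disagreeing_copies(good, good, faulty))
```

The list returned by `disagreeing_copies` could feed a diagnosis or sparing step, which is an assumption about how such a voter might be used rather than something stated on the slide.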
A Gradual Introduction Process
Example: Aggressive Deployment using “Razor”
A “pseudo-synchronous” approach to address process variations and power minimization with minimal overhead by combining circuit and architectural techniques
[Figure: Razor flip-flop with inputs D and clk, output Q, a delayed clock (clk del) and an error output]
[Figure: Energy versus Supply Voltage, showing Total Energy, Recovery Energy and the Optimal Voltage]
[Figure: “razored pipeline” with stages IF, ID, EX, MEM, WB (reg/mem) separated by Razor FFs; signals: error, recover, flush ID, bubble; PC; Stabilizer FF]
Courtesy: T. Austin, D. Blaauw, Michigan

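The Razor idea can be sketched behaviorally as follows. This is a simplified model under stated assumptions, not the circuit from the slide: the main flip-flop samples at the clock edge, a shadow element samples the same data after a delay, and a mismatch raises the error signal and costs one recovery cycle. The names `razor_sample` and `RazorResult`, the one-cycle penalty and all timing numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RazorResult:
    value: int    # value committed downstream after any recovery
    error: bool   # True when main flip-flop and shadow element disagreed
    cycles: int   # 1 normally, 2 when a recovery bubble is inserted (assumed)


def razor_sample(new_value, old_value, arrival_ps, clock_ps, shadow_delay_ps):
    """Model one Razor-protected pipeline register for one cycle.

    new_value arrives at arrival_ps; the main flip-flop samples at clock_ps,
    the shadow element at clock_ps + shadow_delay_ps.
    """
    if arrival_ps > clock_ps + shadow_delay_ps:
        # Assumption: the design guarantees data settles before the shadow
        # element samples; anything later is outside what Razor can recover.
        raise ValueError("data arrives after the shadow sampling point")
    # A late arrival leaves the main flip-flop holding the stale value.
    main = new_value if arrival_ps <= clock_ps else old_value
    shadow = new_value
    error = main != shadow
    return RazorResult(value=shadow, error=error, cycles=2 if error else 1)


if __name__ == "__main__":
    print(razor_sample(1, 0, arrival_ps=900, clock_ps=1000, shadow_delay_ps=300))
    print(razor_sample(1, 0, arrival_ps=1200, clock_ps=1000, shadow_delay_ps=300))
```

Lowering the supply voltage corresponds to larger `arrival_ps` values in this model: errors become more frequent, and the recovery cycles are the price paid for the energy saved.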
Example 2: Minimizing standby leakage in SRAMs
The Memory Data-Retention Voltage (DRV)
[Figure: SRAM cell with transistors M1 to M6, internal nodes V1 and V2, supply VDD, and leakage current paths]
DRV Condition: \left.\frac{\partial V_1}{\partial V_2}\right|_{\text{Left inverter}} = \left.\frac{\partial V_1}{\partial V_2}\right|_{\text{Right inverter}}, when V_{DD} = DRV
When Vdd scales down to DRV, the Voltage Transfer Curves (VTC) of the internal inverters degrade to such a level that Static Noise Margin (SNM) of the SRAM cell reduces to zero.
[Figure: VTC of SRAM cell inverters, V2 (V) versus V1 (V), curves VTC1 and VTC2 at VDD = 0.4 V and VDD = 0.18 V]
Source: Huifang Qin, ISQED 2004

The Impact of Process Variations
[Figure: Histogram of 32K SRAM cells, cell count versus DRV (mV)]
[Figure: DRV Spatial Distribution (256*128 Cells)]
130 nm CMOS

Supply based tradeoff
[Figure: Data in, Error Control Code, SRAM, Data out; supply vS between t = 0 and t = Tst]
Goal: Minimize power/bit

Power tradeoff with ECC
ECC saves standby power
– Hamming [31, 26, 3] achieves 33% power saving
– Reed-Muller [256, 219, 8] achieves 35% power saving
At the expense of time and area overhead
[Figure: Minimum standby time to achieve power savings]

Prototype Design
Error tolerant SRAM optimized for ultra-low voltage standby
Selected implementation: Hamming [31, 26, 3]
50% cell design overhead
19% parity overhead
Tapeout: May 2006
[Figure: die, 1.1mm by 1.1mm, with blocks: Original mem 1024x26, enc, Customized 1024x31, dec]

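To make the Hamming [31, 26, 3] choice concrete, the sketch below encodes 26 data bits into a 31-bit word and corrects a single flipped bit, using the textbook layout with parity bits at positions 1, 2, 4, 8 and 16. The bit layout and the function names are assumptions for illustration; the prototype's actual encoder and decoder are not described on the slide. The 5 parity bits per 26 data bits give the 19% parity overhead quoted above.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16)
# Remaining positions 1..31 carry the 26 data bits.
DATA_POSITIONS = [p for p in range(1, 32) if p not in PARITY_POSITIONS]


def _syndrome(code):
    """XOR of the (1-based) positions of all set bits; code[0] is unused."""
    s = 0
    for pos in range(1, 32):
        if code[pos]:
            s ^= pos
    return s


def encode(data_bits):
    """Return a 31-bit codeword (list of 0/1) for 26 data bits."""
    if len(data_bits) != 26:
        raise ValueError("expected 26 data bits")
    code = [0] * 32
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        code[pos] = bit
    s = _syndrome(code)
    # Set the parity bits so that the overall syndrome becomes zero.
    for p in PARITY_POSITIONS:
        if s & p:
            code[p] = 1
    return code[1:]


def decode(codeword):
    """Return (data_bits, syndrome); a single-bit error is corrected."""
    if len(codeword) != 31:
        raise ValueError("expected 31 code bits")
    code = [0] + list(codeword)
    s = _syndrome(code)
    if s:
        code[s] ^= 1  # a nonzero syndrome names the flipped position
    return [code[pos] for pos in DATA_POSITIONS], s


if __name__ == "__main__":
    word = [i % 2 for i in range(26)]
    stored = encode(word)
    stored[12] ^= 1  # hypothetical retention failure at low standby voltage
    recovered, position = decode(stored)
    print(recovered == word, position)
    print(round(100 * len(PARITY_POSITIONS) / len(DATA_POSITIONS)), "% parity overhead")
```

With minimum distance 3, this code corrects one error per word; two errors in the same word would be miscorrected, which is one reason the achievable standby voltage depends on the DRV distribution.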
“Aggressive” Deployment At the Algorithm Level
[Figure: input x[n] feeds a Main Block (output ya[n]) and an Estimator (output ye[n]); a decision with threshold Th produces ŷ[n]]
Voltage overscale Main Block.
Correct errors using Estimator.
Power savings 3X!
[Figure: Power versus Voltage (both normalized to 1.0), showing PTOT, PEC and Pmain and the energy savings]
Courtesy: N. Shanbhag, Illinois

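The decision step in the block diagram can be sketched as a simple threshold rule: keep the main block's output ya[n] unless it differs from the estimator's output ye[n] by more than Th, in which case fall back to the estimate. This is the commonly described form of algorithmic noise tolerance and is offered as an assumption about what the slide's decision block does; the function names and all numbers are hypothetical.

```python
def ant_output(y_main, y_est, threshold):
    """Select the final output sample for one time step.

    y_main:    output of the voltage-overscaled main block (may be wrong)
    y_est:     output of the cheaper, less precise estimator
    threshold: largest difference still attributed to estimator imprecision
    """
    if abs(y_main - y_est) <= threshold:
        return y_main  # main block looks plausible, keep its precision
    return y_est       # large deviation: treat as a timing error


def ant_filter(main_samples, est_samples, threshold):
    """Apply the decision rule sample by sample."""
    return [ant_output(m, e, threshold) for m, e in zip(main_samples, est_samples)]


if __name__ == "__main__":
    exact = [100, 102, 98, 101]
    # Hypothetical overscaling error: a high-order bit flips in sample 2.
    main = [100, 102, 98 + 1024, 101]
    est = [98, 104, 97, 103]  # coarse but error-free estimates
    print(ant_filter(main, est, threshold=8))
```

The rule relies on overscaling errors being large in magnitude and on the estimator staying error-free, so the corrected sample is only as accurate as the estimate.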
Leveraging resiliency to increase value
Low power motion estimation architecture using Algorithmic Noise Tolerance (Shanbhag, UIUC)
Up to 71% energy reduction demonstrated
[Figure: three images labeled error-free, with errors, error-corrected]

Moving the Verification on the Chip
Core function validated by checker
Checker relaxes burden of correctness on core processor
Core does the heavy lifting, removes hazards that could slow the simple checker
[Figure: Performance Core (IF, ID, REN, REG, SCHEDULER, EX/MEM) passes speculative instructions in-order with PC, inst, inputs, addr to the Correctness Checker (CHK, CT)]
[Figure: Self-checking processor, Alpha 21264 (205 mm2) with REMORA Checker (12 mm2)]
Courtesy: Todd Austin, Univ. of Michigan

“On-Line X”
(X = Verification, Test, Tuning, Reliability, Resource, Power and Leakage Management)
“Turning lemons into lemonade” (T. Austin)
From Design time to Run Time Yield Improvement!

Runtime Validation of Multithreaded Processors
[Figure: SMT processor dispatching per-thread retired instructions to DIVA checker processors (intra-thread control flow, intra-thread data flow, register file), with a hardware synchronization unit and context status register; labels include “Correctness Properties ...” and “Coordinated Forward Error ...”]
[Figure: results per Runtime Validation Configuration for FFT, LU, CHOLESKY, BARNES, FMM, WATER-NSQUARED and WATER-SPATIAL at Fault Rate 1/1K and Fault Rate 1/1M; axis range 0.99 to 1.01]
Courtesy: S. Malik, Princeton

BulletProof Silicon – The Next Generation
Goal: Single-defect tolerance for 5% area overhead
Key ideas:
– No expensive computation checking
– Protect computation and test Hw
– Repair by disabling redundant parts
Approach:
1. Execute and protect state
2. Test concurrently when Hw idle
3. If a test fails: roll back state, disable component, restart
CIRCUIT ENVELOPE – logic-level testing and reconfiguration
ARCHITECTURAL ENVELOPE – check-pointing and epoch restore
[Figure: µprocessor pipeline (IF, ID, EX, MEM, WB) with checker and BIST; speculative and non-speculative state separated by epoch boundaries; Reconfiguration]
Courtesy: Austin, Bertacco, U. Mich

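The three-step approach above (execute and protect state, test, roll back and disable on failure) can be sketched as an epoch loop. This is a software analogy under stated assumptions, not the hardware mechanism: testing happens at the end of each epoch rather than concurrently during idle cycles, and `run_with_epochs`, the unit names and the state dictionary are all hypothetical.

```python
import copy


def run_with_epochs(state, epochs, units, test_unit):
    """Run each epoch on a working unit, rolling back when a test fails.

    state:     dict holding the architectural (non-speculative) state
    epochs:    list of callables epoch(state, unit) that mutate state
    units:     interchangeable redundant components, by name
    test_unit: callable(unit) -> bool, True when the unit passes its test
    """
    active = list(units)
    for epoch in epochs:
        while True:
            if not active:
                raise RuntimeError("no working unit left")
            checkpoint = copy.deepcopy(state)  # 1. protect state
            unit = active[0]
            epoch(state, unit)                 # speculative execution
            if test_unit(unit):                # 2. test the hardware
                break                          # epoch commits
            state.clear()                      # 3. roll back state,
            state.update(checkpoint)
            active.remove(unit)                #    disable the component,
            # and the while loop restarts the epoch on the next unit
    return state, active


if __name__ == "__main__":
    def add_one(state, unit):
        # Hypothetical defect: "alu0" corrupts the result.
        state["acc"] += 100 if unit == "alu0" else 1

    final, working = run_with_epochs(
        {"acc": 0}, [add_one] * 3, ["alu0", "alu1"], lambda u: u != "alu0"
    )
    print(final, working)
```

Nothing in the loop checks the computed values themselves; correctness follows from testing the hardware and discarding any epoch that ran on a unit later found defective, which mirrors the "no expensive computation checking" idea.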
BulletProof Router
Exploit the properties of the CMP switch design to provide end-to-end error detection and recovery
– Enhance switch output channels with CRC checkers
– Split flits into two parts and route them independently using different resources
– Add a Recovery Pointer to each input buffer
– On Error:
  - All CRC checkers drop outgoing packets
  - Switch pipeline is flushed
  - Head pointers are set to recovery pointers
  - Restart execution
[Figure: input buffers holding flits a to e, with Tail, Head and Recovery Head pointers. a: Correctly routed flit; b, c: In the switch pipeline; d: Next flit to be routed; e: Last flit buffered]
[Figure: switch with Routing Logic, VC State, Switch Arbiter, Cross-bar and Cross-bar Controller; CRC Checkers on the routed flits raise the Error Detection Signal that triggers Switch Recovery]

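A minimal sketch of the detect-and-replay idea: each flit carries a CRC, the output checker verifies it, and on a mismatch the input buffer's head pointer is reset to the recovery pointer so that unconfirmed flits are routed again. The class and function names are hypothetical, and `zlib.crc32` stands in for whatever CRC the router actually uses, which the slide does not specify.

```python
import zlib


def attach_crc(payload):
    """Append a 4-byte CRC-32 to a flit payload (bytes)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(flit):
    """True when the trailing CRC matches the payload."""
    payload, crc = flit[:-4], flit[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc


class InputBuffer:
    """Input buffer with a head pointer and a recovery pointer."""

    def __init__(self, flits):
        self.flits = list(flits)
        self.head = 0      # next flit to be routed
        self.recovery = 0  # oldest flit not yet confirmed at the output

    def route_next(self):
        flit = self.flits[self.head]
        self.head += 1
        return flit

    def confirm(self):
        self.recovery += 1  # flit left the switch with a valid CRC

    def recover(self):
        self.head = self.recovery  # replay everything unconfirmed


if __name__ == "__main__":
    buf = InputBuffer(attach_crc(p) for p in (b"flit-a", b"flit-b"))
    first = buf.route_next()
    corrupted = bytes([first[0] ^ 0x01]) + first[1:]  # hypothetical switch fault
    if not crc_ok(corrupted):
        buf.recover()  # drop the flit and reset the head pointer
    replayed = buf.route_next()
    print(crc_ok(replayed), replayed[:-4])
```

Keeping the flit in the buffer until it is confirmed is what makes the replay possible; the recovery pointer marks how far back the head may need to move.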
Towards malleable, resilient architectures
The Quest: Scaleable (hard and soft) architectures that provide flexible redundancy to accommodate systematic and random, static and dynamic errors while avoiding brittleness!

Curing the Nanometer Ailments
Regularity and Structure
Self-Healing
Error-Resiliency
Embracing Randomness

Maintaining a purely deterministic Boolean abstraction ultimately becomes untenable!
Maintaining our abstractions: slowly abandon them!!

The Search for (New) Scaleable and Stackable Abstractions
Allow devices to make errors and use models-of-computation that tolerate them (signal processing, communication, coding, information theory)
An Interesting Case Study: The “Neural Network” MOC
[Figure: Artificial neuron]
Properties:
– Works well on noisy signals
– Uses “soft” decisions
– Operates in the presence of failures of components and interconnections
Challenge: Limited scope
– Works mostly for classification problems

Exploring the Yellow Brick Road
Humans
– 10-15% of terrestrial animal biomass
– 10^9 Neurons/“node”
– Since 10^5 years ago
Ants
– 10-15% of terrestrial animal biomass
– 10^5 Neurons/“node”
– Since 10^8 years ago
Easier to make ants than humans
“Small, simple, swarm”
Courtesy: D. Petrovic, UCB

Inspired by the Sensor Network Paradigm
Artificial Skin
Communication Backplanes
Smart Surfaces
Real-time Health Monitoring

Example: Collaborative Networks
Large number of states/nodes
..., non-deterministic links
Emergent behavior
... resilient to failure
Sensor Network-on-a-chip
Source: N. Shanbhag, D. Jones

SN-on-a-chip – A simple example
A simple study: 2 different adders with voltage over-scaling
Estimators need to be independent for this scheme to be effective
Source: N. Shanbhag, UIUC

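Why independence matters can be shown with a small fusion sketch. It uses a median over three hypothetical estimators, which is a swapped-in illustration of the general sensor-fusion idea and not the two-adder study from the slide; the function name `fuse` and all values are assumptions.

```python
from statistics import median


def fuse(estimates):
    """Combine several noisy estimates of the same quantity by their median."""
    return median(estimates)


if __name__ == "__main__":
    true_sum = 42
    big_error = 1024  # hypothetical over-scaling error in a high-order bit

    # Independent errors: only one estimator fails, the median masks it.
    print(fuse([true_sum + big_error, true_sum, true_sum - 1]))

    # Correlated errors: two estimators fail the same way, the median follows them.
    print(fuse([true_sum + big_error, true_sum + big_error, true_sum]))
```

When two estimators share the same failure, the fused value inherits it, which is the point of the slide's remark that the estimators need to be independent.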
Distributed Collaborative Systems on a Chip
Example: A configurable radio architecture based on collaborative autonomous entities
Array of locally-coupled cheap low-power oscillator-based units
– Known to exhibit complex, spontaneous pattern formation
– Operation mode selected through choice of coupling factors and operational nodes
[Figure: Emerging pattern as a function of coupling factor]
Source: J. Roychowdhury, J. Rabaey

The Mechanical Radio
The Ultimate ULP Tunable Wireless Transceiver?
[Figure: wine-glass disk resonator with Support Beams, Input Electrode, Wine-Glass Disk, Output Electrode, Coupling Beam and Anchor; R = 32 μm]
9 wine-glass disc oscillator-based GSM compliant oscillator
Source: C. Nguyen, UC Michigan

Transitioning to the Post-Silicon ...
...ion platforms that work under very low SNR, are non-deterministic, unpredictable and unreliable

Some Concluding Remarks
Formidable challenges over the next decades to dramatically alter design paradigms
– Variability and reliability to lead to novel micro-architectures and computational models
– Regularity and redundancy central tenets
The opportunities:
– Use the abundance of transistors to move the burden from pre- or post-manufacturing evaluation to on-line activities
– Gradual incorporation of error-resilient computational models

GSRC: The best answer to formidable challenges is a critical mass ...
[Figure: The GSRC System-Design Roadmap, from Now to the 2020’s; labels include “...ncurrent” and “Core”]

The GSRC Agenda
Structured along the line of big challenges rather than technologies
Provokes multi-...y out-of-the-box thinking
41 Faculty, 17 Institutions
[Figure: agenda themes: System Design, Core Framework, Resilient Systems (T. Austin), Concurrent Systems (W.M. Hwu), Design Driver (J. Wawrzynek); other names shown: S. Malik, K. Lutz, J. Rabaey, ASV, N. ...]

Thank you!
“Creativity is the ability to introduce order into the randomness of nature” (Eric Hoffer)
The contributions of all the GSRC faculty to this presentation are greatly appreciated, as is the funding by the MARCO member companies and the US Government.