Fundamental Limits on the Precision of In-memory Architectures
(Invited Talk)

Sujan K. Gonugondla, Charbel Sakr, Hassan Dbouk, and Naresh R. Shanbhag
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign

ABSTRACT
This paper obtains the fundamental limits on the computational precision of in-memory computing architectures (IMCs). Various compute SNR metrics for IMCs are defined and their interrelationships analyzed to show that the accuracy of IMCs is fundamentally limited by the compute SNR ($\mathrm{SNR}_a$) of their analog core, and that activation, weight and output precision need to be assigned appropriately for the final output SNR $\mathrm{SNR}_T \rightarrow \mathrm{SNR}_a$. The minimum precision criterion (MPC) is proposed to minimize the output and hence the column analog-to-digital converter (ADC) precision. The charge summing (QS) compute model and its associated IMC, QS-Arch, are studied to obtain analytical models for its compute SNR, minimum ADC precision, energy and latency. Compute SNR models of QS-Arch are validated via Monte Carlo simulations in a 65 nm CMOS process. Employing these models, upper bounds on $\mathrm{SNR}_a$ of a QS-Arch-based IMC employing a 512 row SRAM array are obtained, and it is shown that QS-Arch's energy cost reduces by 3.3× for every 6 dB drop in $\mathrm{SNR}_a$, and that the maximum achievable $\mathrm{SNR}_a$ reduces with technology scaling while the energy cost at the same $\mathrm{SNR}_a$ increases. These models also indicate the existence of an upper bound on the dot product dimension $N$ due to voltage headroom clipping, and this bound can be doubled for every 3 dB drop in $\mathrm{SNR}_a$.

1 INTRODUCTION
In-memory computing (IMC) [13, 19, 28, 34] has emerged as an attractive alternative to conventional von Neumann (digital) architectures for addressing the energy and latency cost of memory accesses in data-centric machine learning workloads.
IMCs embed analog mixed-signal computations in close proximity to the bit-cell array (BCA) in order to execute machine learning computations such as matrix-vector multiply (MVM) and dot products (DPs) as an intrinsic part of the read cycle and thereby avoid the need to access raw data.

IMCs exhibit a fundamental trade-off between their energy-delay product (EDP) and the accuracy or signal-to-noise ratio (SNR) of their analog computations. This trade-off arises due to constraints on the maximum bit-line (BL) voltage discharge and due to process variations, specifically spatial variations in the threshold voltage $V_t$, which limit the dynamic range and the SNR. Additionally, IMCs also exhibit noise due to the quantization of their input activation and weight parameters and due to the column analog-to-digital converters (ADCs). Henceforth, we use "compute SNR" to refer to the computational precision/accuracy of an IMC, and "precision" to refer to the number of bits assigned to various signals.

Today, a large number of IMC prototype ICs have been demonstrated [1, 3, 4, 7, 12, 15–17, 31–33, 36, 38, 40]. While these IMCs have shown impressive reductions in the EDP over a von Neumann equivalent with minimal loss in inference accuracy, it is not clear that these gains are sustainable for larger problem sizes across data sets and inference tasks. Unlike digital architectures, whose compute SNR can be made arbitrarily high by assigning sufficiently high precision to various signals, IMCs need to contend with both quantization noise and analog non-idealities. Therefore, IMCs will have intrinsic limits on their compute SNR. Since the compute SNR trades off with energy and delay, it raises the following question: What are the fundamental limits on the achievable computational precision of IMCs?

Answering this question is made challenging by the rich design space occupied by IMCs, encompassing a huge diversity of available memory devices, bitcell circuit topologies, and circuit and architectural design methods.
Today's IMCs tend to employ ad-hoc approaches to assign input and ADC precisions or tend to over-provision their analog SNR in order to emulate the determinism of digital computations. An analytical understanding of the relationship between precision, compute SNR, energy, and delay in IMCs is presently missing.

KEYWORDS
in-memory computing, taxonomy of in-memory, in-memory noise, machine learning, accelerator, in-memory precision, in-memory accuracy, compute in-memory

ACM Reference Format:
Sujan K. Gonugondla, Charbel Sakr, Hassan Dbouk, and Naresh R. Shanbhag. 2020. Fundamental Limits on the Precision of In-memory Architectures: (Invited Talk). In IEEE/ACM International Conference on Computer-Aided Design (ICCAD '20), November 2–5, 2020, Virtual Event, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3400302.3416344

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICCAD '20, November 2–5, 2020, Virtual Event, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8026-3/20/11

This paper attempts to fill this gap by: 1) defining compute SNR metrics for IMCs, 2) developing a systematic methodology to obtain
a minimum precision assignment for activations, weights and outputs of fixed-point DPs realized on IMCs to meet network accuracy requirements, and 3) employing this methodology to obtain the limits on achievable compute SNR of a commonly employed IMC topology and quantify its energy vs. accuracy trade-off.

2 NOTATION AND PRELIMINARIES

2.1 General Notation
We employ the term signal-to-quantization noise ratio (SQNR) when only quantization noise (denoted as $q$) is involved. The term SNR is employed when analog noise sources are included, and we use $\eta$ to denote such sources. SNR is also employed when both quantization and analog noise sources are present.

2.2 The Additive Quantization Noise Model
Under the additive quantization noise model, a floating-point (FL) signal $x$ quantized to $B_x$ bits is represented as $x_q = x + q_x$, where $q_x$ is the quantization noise, assumed to be independent of the signal $x$.

If $x \in [-x_m, x_m]$ and $q_x \sim U[-0.5\Delta_x, 0.5\Delta_x]$, where $\Delta_x = x_m 2^{-(B_x - 1)}$ is the quantization step size and $U[a, b]$ denotes the uniform distribution over the interval $[a, b]$, then the signal-to-quantization noise ratio ($\mathrm{SQNR}_x$) is given by:

$$\mathrm{SQNR}_{x(\mathrm{dB})} = 10\log_{10}(\mathrm{SQNR}_x) = 6B_x + 4.78 - \zeta_{x(\mathrm{dB})} \tag{1}$$

where $\mathrm{SQNR}_x = \sigma_x^2 / \sigma_{q_x}^2$, $\sigma_{q_x}^2 = \Delta_x^2 / 12$, and $\zeta_{x(\mathrm{dB})} = 10\log_{10}(x_m^2 / \sigma_x^2)$ is the peak-to-average (power) ratio (PAR) of $x$. Equation (1) quantifies the familiar 6 dB SQNR gain per bit of precision.

2.3 The Dot-Product (DP) Computation
Consider the FL dot product (DP) computation defined as:

$$y_o = \mathbf{w}^T \mathbf{x} = \sum_{j=1}^{N} w_j x_j \tag{2}$$

where $y_o$ is the DP of two $N$-dimensional real-valued vectors $\mathbf{w} = [w_1, \ldots, w_N]^T$ (weight vector) and $\mathbf{x} = [x_1, \ldots, x_N]^T$ (activation vector) of precision $B_w$ and $B_x$, respectively.

In DNNs, the dot product in (2) is computed with $w \in [-w_m, w_m]$ (signed weights), input $x \in [0, x_m]$ (unsigned activations) and output $y \in [-y_m, y_m]$ (signed outputs). Assuming the additive quantization noise model from Section 2.2, the fixed-point (FX) computation of the DP (2) is described by:

$$y = \mathbf{w}_q^T \mathbf{x}_q + q_y = (\mathbf{w} + \mathbf{q}_w)^T (\mathbf{x} + \mathbf{q}_x) + q_y \tag{3}$$
$$\approx \mathbf{w}^T \mathbf{x} + \mathbf{w}^T \mathbf{q}_x + \mathbf{q}_w^T \mathbf{x} + q_y = y_o + q_{iy} + q_y \tag{4}$$

where $\mathbf{w}_q = \mathbf{w} + \mathbf{q}_w$ and $\mathbf{x}_q = \mathbf{x} + \mathbf{q}_x$ are the quantized weight and activation vectors, respectively, $q_{iy}$ is the total input (weight and activation) quantization noise seen at the output $y$, and $q_y$ is the additional output quantization noise due to round-off/truncation in digital architectures or from the finite resolution of the column ADCs in IMC architectures.

Assuming that the weights (signed) and inputs (unsigned) are i.i.d. random variables (RVs), the variances of the signals in (4) are given by:

$$\sigma_{y_o}^2 = N \sigma_w^2 E[x^2]; \quad \sigma_{q_y}^2 = \frac{\Delta_y^2}{12}; \quad \sigma_{q_{iy}}^2 = \frac{N}{12}\left(\Delta_w^2 E[x^2] + \Delta_x^2 \sigma_w^2\right) \tag{5}$$

where $\sigma_w^2$ is the variance of the weights, and $\Delta_w = w_m 2^{-B_w + 1}$, $\Delta_x = x_m 2^{-B_x}$ and $\Delta_y = y_m 2^{-B_y + 1}$ are the weight, activation, and output quantization step-sizes, respectively.

3 COMPUTE SNR LIMITS OF IMCS
We propose the system noise model in Fig. 1 for obtaining precision limits on IMC architectures. Such architectures (Fig. 1(a)) accept a quantized input ($\mathbf{x}_q$) and a quantized weight vector ($\mathbf{w}_q$) to implement multiple FX DP computations of (4) in parallel in their analog core. Hence, unlike digital architectures, IMC architectures suffer from both quantization and analog noise sources such as SRAM cell current variations, thermal noise, and charge injection, as well as limited headroom, all of which limit their compute SNR.

Figure 1: System noise model of IMC: (a) a generic IMC block diagram, and (b) dominant noise sources in fixed-point DP computation on IMCs.

3.1 Compute SNR Metrics for IMCs
The following equations describe the IMC noise model in Fig. 1:

$$y = y_o + q_{iy} + \eta_a + q_y; \quad \eta_a = \eta_e + \eta_h \tag{6}$$

where $y_o$ is the ideal DP value defined in (2), $q_{iy}$ is the input quantization noise reflected at the output, $\eta_a$ is the analog noise term comprising both the clipping noise $\eta_h$ due to limited headroom and $\eta_e$, which represents all other noise sources, and $q_y$ is the quantization noise introduced by the ADC.

We define the following fundamental compute SNR metrics:

$$\mathrm{SQNR}_{q_{iy}} = \frac{\sigma_{y_o}^2}{\sigma_{q_{iy}}^2}; \quad \mathrm{SNR}_a = \frac{\sigma_{y_o}^2}{\sigma_{\eta_a}^2}; \quad \mathrm{SQNR}_{q_y} = \frac{\sigma_{y_o}^2}{\sigma_{q_y}^2} \tag{7}$$

where $\mathrm{SNR}_a$ is the analog SNR, and $\mathrm{SQNR}_{q_{iy}}$ is the propagated SQNR at the output due to input (weight and activation) quantization noise, given by:

$$\mathrm{SQNR}_{q_{iy}(\mathrm{dB})} = 6(B_x + B_w) + 4.8 - [\zeta_{x(\mathrm{dB})} + \zeta_{w(\mathrm{dB})}] - 10\log_{10}\left(\frac{2^{2B_x}}{\zeta_x} + \frac{2^{2B_w}}{\zeta_w}\right) \tag{8}$$

where $\zeta_{x(\mathrm{dB})} = 10\log_{10}\left(\frac{x_m^2}{4E[x^2]}\right)$ and $\zeta_{w(\mathrm{dB})} = 10\log_{10}\left(\frac{w_m^2}{\sigma_w^2}\right)$ are the PARs of the (unsigned) activations and (signed) weights, respectively. $\mathrm{SQNR}_{q_y}$ is the digitization SQNR solely due to ADC quantization noise and is given by:

$$\mathrm{SQNR}_{q_y(\mathrm{dB})} = 6B_y + 4.8 - [\zeta_{x(\mathrm{dB})} + \zeta_{w(\mathrm{dB})}] - 10\log_{10}(N) \tag{9}$$

which is obtained by the substitutions $B_x \leftarrow B_y$ and $\zeta_{x(\mathrm{dB})} \leftarrow \zeta_{y(\mathrm{dB})} = \zeta_{x(\mathrm{dB})} + \zeta_{w(\mathrm{dB})} + 10\log_{10}(N)$ in (1).

From (6) and (7), it is straightforward to show:

$$\mathrm{SNR}_A = \frac{\sigma_{y_o}^2}{\sigma_{q_{iy}}^2 + \sigma_{\eta_a}^2} = \left[\frac{1}{\mathrm{SNR}_a} + \frac{1}{\mathrm{SQNR}_{q_{iy}}}\right]^{-1} \tag{10}$$

$$\mathrm{SNR}_T = \frac{\sigma_{y_o}^2}{\sigma_{q_{iy}}^2 + \sigma_{\eta_a}^2 + \sigma_{q_y}^2} = \left[\frac{1}{\mathrm{SNR}_A} + \frac{1}{\mathrm{SQNR}_{q_y}}\right]^{-1} \tag{11}$$

where $\mathrm{SNR}_A$ is the pre-ADC SNR and $\mathrm{SNR}_T$ is the total output SNR including all noise sources. Note: (10)-(11) can be repurposed for digital architectures by setting $\mathrm{SNR}_a \rightarrow \infty$, since quantization is the only noise source, implying $\mathrm{SNR}_A = \mathrm{SQNR}_{q_{iy}}$. Equations (8)-(9) indicate that $\mathrm{SQNR}_{q_{iy}}$ and $\mathrm{SQNR}_{q_y}$ can be made arbitrarily large by assigning sufficiently high precision to the DP inputs ($B_x$ and $B_w$) and the output ($B_y$). Thus, from (10)-(11), $\mathrm{SNR}_T$ in IMCs is fundamentally limited by $\mathrm{SNR}_a$, which depends on the analog noise sources, as one expects.

3.2 Precision Assignment Methodology for IMCs
Prior work [25, 26] indicates the requirement $\mathrm{SNR}_{T(\mathrm{dB})} > \mathrm{SNR}^*_{T(\mathrm{dB})}$ = 10 dB-40 dB (see Fig. 2) for the inference accuracy of an FX network to be within 1% of the corresponding FL network for popular DNNs (AlexNet, VGG-9, VGG-16, ResNet-18) deployed on the ImageNet and CIFAR-10 datasets. To meet this $\mathrm{SNR}_{T(\mathrm{dB})}$ requirement, digital architectures choose $B_x$ and $B_w$ such that $\mathrm{SQNR}_{q_{iy}} > \mathrm{SNR}^*_T$, and then choose $B_y$ sufficiently high to guarantee $\mathrm{SQNR}_{q_y} \gg \mathrm{SQNR}_{q_{iy}}$ so that $\mathrm{SNR}_T \rightarrow \mathrm{SQNR}_{q_{iy}}$.

Figure 2: Per-layer $\mathrm{SNR}_{T(\mathrm{dB})}$ requirements of DP computations in VGG-16 deployed on ImageNet (x-axis: layer index; y-axis: $\mathrm{SNR}_{T(\mathrm{dB})}$).

In contrast, for IMCs, we first need to ensure that $\mathrm{SNR}_a > \mathrm{SNR}^*_T$ so that $\mathrm{SNR}_T$ can be made to approach $\mathrm{SNR}_a$ with appropriate precision assignment via the following methodology:
(1) Assign sufficiently high values for $B_x$ and $B_w$ per (8) such that $\mathrm{SQNR}_{q_{iy}} \gg \mathrm{SNR}_a$ so that $\mathrm{SNR}_A \rightarrow \mathrm{SNR}_a$ per (10).
(2) Assign a sufficiently high value for $B_y$ per (9) such that $\mathrm{SQNR}_{q_y} \gg \mathrm{SNR}_A$ so that $\mathrm{SNR}_T \rightarrow \mathrm{SNR}_A$ per (11).

For example, if $\mathrm{SQNR}_{q_{iy}(\mathrm{dB})}, \mathrm{SQNR}_{q_y(\mathrm{dB})} \ge \mathrm{SNR}_{a(\mathrm{dB})} + 9$ dB, then $\mathrm{SNR}_{a(\mathrm{dB})} - \mathrm{SNR}_{T(\mathrm{dB})} \le 0.5$ dB, i.e., $\mathrm{SNR}_{T(\mathrm{dB})}$ lies within 0.5 dB of $\mathrm{SNR}_{a(\mathrm{dB})}$. In this manner, by appropriate choices for $B_x$, $B_w$, and $B_y$, IMCs can be designed such that $\mathrm{SNR}_T \rightarrow \mathrm{SNR}_a$, which is the fundamental limit on $\mathrm{SNR}_T$.

From the above discussion it is clear that the input precisions $B_x$ and $B_w$ are dictated by network accuracy requirements, while the output precision $B_y$ needs to be set sufficiently high to avoid becoming a significant noise contributor. To ensure a sufficiently high value for $B_y$, digital architectures employ the bit growth criterion (BGC) described next.

3.3 Bit Growth Criterion (BGC)
The BGC is commonly employed to assign the output precision $B_y$ in digital architectures [9, 25]. BGC sets $B_y$ as:

$$B_y^{\mathrm{BGC}} = B_x + B_w + \log_2(N) \tag{12}$$

Substituting $B_y = B_y^{\mathrm{BGC}}$ from (12) into (9) and employing the relationship $\zeta_{y(\mathrm{dB})} = 10\log_{10}(N) + \zeta_{x(\mathrm{dB})} + \zeta_{w(\mathrm{dB})}$, the resulting SQNR due to output quantization using the BGC is given by:

$$\mathrm{SQNR}^{\mathrm{BGC}}_{q_y(\mathrm{dB})} = 10\log_{10}\left(\frac{\sigma_{y_o}^2}{\sigma_{q_y}^2}\right) = 6(B_x + B_w) + 4.8 - [\zeta_{x(\mathrm{dB})} + \zeta_{w(\mathrm{dB})}] + 10\log_{10}(N) \tag{13}$$

Recall that $\mathrm{SQNR}^{\mathrm{BGC}}_{q_y} \gg \mathrm{SNR}_A$ is required in order to ensure $\mathrm{SNR}_T$ is close to its upper bound. Comparing (9) and (13), we see that, for high values of the DP dimensionality $N$, BGC is overly conservative since it assigns large values to $B_y$ per (12). Some digital architectures truncate the LSBs to control bit growth. The SQNR of such truncated BGC (tBGC) can be obtained from (9) by setting $B_y < B_y^{\mathrm{BGC}}$.

Digital architectures accommodate BGC's high precision requirements by increasing the precision of arithmetic units, at a commensurate increase in the computational energy, latency, and activation storage costs.
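The expressions (8), (9), (12) and the composition rule (10)-(11) are easy to evaluate numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the exact constants 20 log10(2) ≈ 6.02 and 10 log10(3) ≈ 4.77 are used where the text rounds to 6 and 4.8.

```python
import math

DB_PER_BIT = 20 * math.log10(2)   # ~6.02 dB per bit (rounded to 6 in the text)
OFFSET_DB = 10 * math.log10(3)    # ~4.77 dB (rounded to 4.8 in the text)


def sqnr_qiy_db(bx, bw, zeta_x_db, zeta_w_db):
    """Eq. (8): output SQNR due to input (weight/activation) quantization."""
    zx = 10 ** (zeta_x_db / 10)   # PARs converted to linear scale
    zw = 10 ** (zeta_w_db / 10)
    spread = 4.0 ** bx / zx + 4.0 ** bw / zw
    return (DB_PER_BIT * (bx + bw) + OFFSET_DB
            - (zeta_x_db + zeta_w_db) - 10 * math.log10(spread))


def sqnr_qy_db(by, zeta_x_db, zeta_w_db, n):
    """Eq. (9): SQNR due to output (ADC) quantization with B_y bits."""
    return (DB_PER_BIT * by + OFFSET_DB
            - (zeta_x_db + zeta_w_db) - 10 * math.log10(n))


def bgc_bits(bx, bw, n):
    """Eq. (12): output precision assigned by the bit growth criterion."""
    return bx + bw + math.ceil(math.log2(n))


def combine_db(*terms_db):
    """Eqs. (10)-(11): independent noise sources add in the linear domain,
    so the combined SNR is the reciprocal of the sum of reciprocals."""
    return -10 * math.log10(sum(10 ** (-t / 10) for t in terms_db))


if __name__ == "__main__":
    # Uniform unsigned activations and signed weights, 7 bits each.
    print(round(sqnr_qiy_db(7, 7, -1.3, 4.8), 1), "dB input SQNR")
    print(bgc_bits(7, 7, 64), "bits required by BGC at N = 64")
    # SNR_a = 31 dB with both quantization terms at 40 dB.
    print(round(combine_db(31.0, 40.0, 40.0), 2), "dB total SNR")
```

Because `combine_db` sums noise powers, the smallest term dominates the result, which is the numerical counterpart of the statement that $\mathrm{SNR}_T$ is bounded by $\mathrm{SNR}_a$.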
However, IMCs cannot afford to use this criterion since $B_y$ is the precision of the BL ADCs, which impacts their energy, latency, and area. Indeed, recent works [24] have claimed that BL ADCs dominate the energy and latency costs of IMCs, assuming BGC is used to assign $B_y$.

In the next section, we propose an alternative to BGC, referred to as the minimum precision criterion (MPC), which can be employed by both digital and IMC architectures and which achieves a desired $\mathrm{SQNR}_{q_y}$ with far fewer bits than BGC.

3.4 The Minimum Precision Criterion (MPC)
We propose MPC to reduce $B_y$ without incurring any loss in $\mathrm{SQNR}_{q_y}$ compared to BGC. Unlike BGC, MPC accounts for the statistics of $y_o$ to permit controlled amounts of clipping to occur. In MPC (see Fig. 3(a)), the output $y_o$ is clipped to lie in the range $[-y_c, y_c]$ instead of $[-y_m, y_m]$ as in BGC (see Fig. 3(b)), where $y_c < y_m$ ($y_c$: clipping level), and the $B_y$ bits are employed to quantize this reduced range. The clipping probability $p_c = \Pr\{|y_o| > y_c\}$ is kept to a small user-defined value, e.g., $y_c = 4\sigma_{y_o}$ ensures that $p_c < 0.001$ if $y_o \sim \mathcal{N}(0, \sigma_{y_o}^2)$. The resulting $\mathrm{SQNR}_{q_y}$ is given by:

$$\mathrm{SQNR}^{\mathrm{MPC}}_{q_y(\mathrm{dB})} = 6B_y + 4.8 - \zeta^{\mathrm{MPC}}_{y(\mathrm{dB})} - 10\log_{10}\left(1 + p_c \frac{\sigma_{cc}^2}{\sigma_{q_y}^2}\right) \tag{14}$$

where $\zeta^{\mathrm{MPC}}_{y(\mathrm{dB})} = 10\log_{10}\left(\frac{y_c^2}{\sigma_{y_o}^2}\right)$, and $\sigma_{cc}^2 = E\left[(|y_o| - y_c)^2 \,\middle|\, |y_o| > y_c\right]$ is the conditional clipping noise variance. Setting $y_c = \zeta_y^{\mathrm{MPC}} \sigma_{y_o}$ yields $\zeta^{\mathrm{MPC}}_{y(\mathrm{dB})} = 10\log_{10}(\zeta_y^{\mathrm{MPC}})^2$, indicating that $p_c$ is a decreasing function of $\zeta_y^{\mathrm{MPC}}$. Thus, (14) has the same form as (1) with an additional (last term) clipping noise factor.

Figure 3: Comparison of BGC and MPC: (a) MPC quantization levels, (b) BGC quantization levels, and (c) distribution $f_Y(y_o)$ of the ideal DP output $y_o$ vs. DP dimensionality $N$.

MPC exploits a key insight (see Fig. 3(c)), which follows from the Central Limit Theorem (CLT): in an $N$-dimensional DP computation (2), $\sigma_{y_o}$ grows sub-linearly (as $\sqrt{N}$) as compared to the maximum $y_m$, which grows linearly with $N$. Furthermore, (14) shows a quantization vs. clipping noise trade-off controlled by the clipping level $y_c$. This trade-off, illustrated in Fig. 3(c), is absent in BGC and tBGC, and is critical to MPC's ability to realize desired values of $\mathrm{SQNR}_{q_y}$ with smaller values of $B_y$.

Assuming $y_o \sim \mathcal{N}(0, \sigma_{y_o}^2)$, and substituting $y_c = 4\sigma_{y_o}$ and $p_c = 0.001$ into (14), we obtain the following lower bound:

$$B_y^{\mathrm{MPC}} \ge \frac{1}{6}\left[\mathrm{SNR}_{A(\mathrm{dB})} + 7.2 - \gamma - 10\log_{10}\left(1 - 10^{-\gamma/10}\right)\right] \tag{15}$$

in order for $\mathrm{SNR}_{A(\mathrm{dB})} - \mathrm{SNR}_{T(\mathrm{dB})} \le \gamma$. For instance, the choice $\gamma = 0.5$ dB yields $B_y^{\mathrm{MPC}} \ge \frac{1}{6}\left[\mathrm{SNR}_{A(\mathrm{dB})} + 16.3\right]$, which corresponds to $\mathrm{SQNR}^{\mathrm{MPC}}_{q_y(\mathrm{dB})} \ge \mathrm{SNR}_{A(\mathrm{dB})} + 9$ dB as discussed in Section 3.2.

Figure 4: Trends in $\mathrm{SQNR}_{q_y(\mathrm{dB})}$ for DP computation with $B_x = B_w = 7$: (a) $\mathrm{SQNR}_{q_y(\mathrm{dB})}$ vs. $N$ for MPC ($\zeta_y = 4$), BGC, and tBGC, and (b) $\mathrm{SQNR}^{\mathrm{MPC}}_{q_y(\mathrm{dB})}$ vs. $\zeta_y^{\mathrm{MPC}}$ when $B_y = 8$.

3.5 Simulation Results
To illustrate the difference between MPC, BGC and tBGC, we assume that $\mathrm{SNR}_{a(\mathrm{dB})} = 31$ dB, so that $\mathrm{SNR}_{T(\mathrm{dB})} \ge 30$ dB provided $\mathrm{SQNR}_{q_{iy}(\mathrm{dB})}, \mathrm{SQNR}_{q_y(\mathrm{dB})} \ge 40$ dB per (10)-(11).
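As a quick numerical check of the MPC bound (15) at this operating point, the sketch below evaluates the bound and the Gaussian clipping probability. It is an illustration only: the function names are hypothetical, $\mathrm{SNR}_A \approx \mathrm{SNR}_a$ = 31 dB is assumed, and $y_o$ is taken to be zero-mean Gaussian as in Section 3.4.

```python
import math


def clip_probability(zeta):
    """Pr{|y_o| > zeta * sigma} for a zero-mean Gaussian y_o."""
    return math.erfc(zeta / math.sqrt(2.0))


def required_margin_db(gamma_db):
    """SQNR_qy - SNR_A (in dB) needed so that SNR_A - SNR_T <= gamma_db,
    obtained by solving (11) for SQNR_qy."""
    return -10 * math.log10(10 ** (gamma_db / 10) - 1)


def mpc_min_bits(snr_A_db, gamma_db=0.5):
    """Smallest integer B_y satisfying the lower bound (15)."""
    bound = (snr_A_db + 7.2 - gamma_db
             - 10 * math.log10(1 - 10 ** (-gamma_db / 10))) / 6.0
    return math.ceil(bound)


if __name__ == "__main__":
    print(clip_probability(4.0))     # clipping probability at y_c = 4 sigma
    print(required_margin_db(0.5))   # roughly the 9 dB margin quoted in the text
    print(mpc_min_bits(31.0))        # ADC bits needed under MPC
```

Note that the bound depends only on $\mathrm{SNR}_A$ and $\gamma$, not on $N$, in contrast to the BGC assignment (12), which grows with $\log_2(N)$.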
We further assume DPs of varying dimension $N$ with 7-b quantized unsigned inputs and signed weights randomly sampled from uniform distributions. Substituting $B_x = B_w = 7$, $\zeta_{x(\mathrm{dB})} = -1.3$ dB, and $\zeta_{w(\mathrm{dB})} = 4.8$ dB into (8), we obtain $\mathrm{SQNR}_{q_{iy}(\mathrm{dB})} = 41$ dB. Thus, all that remains is to assign $B_y$ such that $\mathrm{SQNR}_{q_y(\mathrm{dB})} \ge 40$ dB, for which there are three choices: MPC, BGC and tBGC.

Figure 4(a) compares the $\mathrm{SQNR}_{q_y}$ achieved by the three methods. Per (15), MPC meets the $\mathrm{SQNR}_{q_y(\mathrm{dB})} \ge 40$ dB requirement by setting $B_y = 8$ and $\zeta_y^{\mathrm{MPC}} = 4$, independent of $N$. In contrast, per (12), BGC assigns $16 \le B_y \le 20$ as a function of $N$ to achieve the same $\mathrm{SNR}_T$ as MPC. Furthermore, tBGC meets the $\mathrm{SQNR}_{q_y}$ requirement with $11 \le B_y \le 13$ but fails to do so with $B_y = 8$. Figure 4(b) shows that $\mathrm{SQNR}^{\mathrm{MPC}}_{q_y(\mathrm{dB})}$ is maximized when $\zeta_y^{\mathrm{MPC}} = 4$, i.e., when the clipping level is $y_c = 4\sigma_{y_o}$, thereby illustrating MPC's quantization vs. clipping noise trade-off described by (14). Figure 4 also validates the analytical expressions (8), (9), (13), and (14) (bold) by indicating a close match to ensemble-averaged values of $\mathrm{SQNR}_{q_y}$ obtained from Monte Carlo simulations (dotted).

Note: the theoretically optimal quantizer for an arbitrary signal distribution is obtained from the Lloyd-Max (LM) algorithm [18]. Unfortunately, the LM quantization levels are non-uniformly spaced, which makes it hard to design efficient arithmetic units to process such signals. MPC offers a practical alternative to LM.

Table 1: A Taxonomy of IMCs using In-memory Compute Models. Columns: design (grouped into CMOS and beyond-CMOS), in-memory compute model (QS, IS, QR), analog core precision, and ADC precision $B_{ADC}$. Designs listed: Kang et al. [15], Biswas et al. [1], Zhang et al. [40], Valavi et al. [33], Khwa et al. [16], Jiang et al. [12], Si et al. [30], Jia et al. [11], Okumura et al. [23], Kim et al. [17], Guo et al. [8], Yue et al. [38], Su et al. [32], Dong et al. [4], Si et al. [31], Chen et al. [2], Fick et al. [5], Xue et al. [35], Yan et al. [37], Zha et al. [39], Xue et al. [36]. T: ternary; A: analog/continuous-valued. [Per-design table entries are not legible in this transcription.]

4 ANALYTICAL MODELS FOR COMPUTE SNR
This section derives analytical expressions for $\mathrm{SNR}_a$ of a typical IMC. First, we show that most IMCs can be "explained" via a few in-memory compute models.

4.1 In-memory Compute Models
All IMCs are viewed as employing one or more in-memory compute models, defined as a mapping of the algorithmic variables $y_o$, $x_j$ and $w_j$ in (2) to physical quantities such as time, charge, current, or voltage, in order to (usually partially) realize an analog BL computation of the multi-bit DP in (2).

Furthermore, we suggest that most IMCs today employ one or more of the following three in-memory compute models (see Fig.
5): (a) charge summing (QS) [7, 14, 15, 40]; (b) current summing (IS) [12, 16, 17, 30]; and (c) charge redistribution (QR) [1, 7, 15, 33], and conjecture that these compute models are in some sense universal in that they represent an approximation to a "complete set" of practical, i.e., realizable, mappings of variables from the algorithmic to the circuit domain, as shown in Table 1.

Henceforth, we discuss the QS model and the corresponding QS-based IMC, referred to as QS-Arch, in detail since it is very commonly used. Analytical expressions for circuit domain equivalents of $\eta_e$ and $\eta_h$ in (6) for the QS model are presented. These will be combined with the algorithm and precision-dependent noise sources $q_{iy}$ and $q_y$ to obtain $\mathrm{SNR}_T$.

Figure 5: In-memory compute models: (a) charge summing (QS), (b) current summing (IS), and (c) charge redistribution (QR) models.

4.2 The Charge Summing (QS) Model
The QS model (see Fig. 5(a)) realizes the DP in (2) via the variable mapping ($y_o \rightarrow V_o$, $w_j \rightarrow I_j$, $x_j \rightarrow T_j$), where the cell current $I_j$ is integrated over the WL pulse duration $T_j$ ($j = 1, \ldots, N$) on a BL (or cell) capacitor $C$, resulting in an output voltage as shown below:

$$(y_o \rightarrow V_o) = \frac{1}{C}\sum_{j=1}^{N} (w_j \rightarrow I_j)(x_j \rightarrow T_j) \tag{16}$$

where $V_o$ is the DP output assuming infinite voltage head-room, i.e., no clipping. The cell current $I_j$ depends upon transistor sizes and the WL voltage $V_{WL}$, and typical values are: $C$ (a few hundred fF), $I_j$ (tens of µA), and $T_j$ (hundreds of ps).

Noise Models: The noise contributions in QS arise from the following sources: (1) variations in the pulse-widths $T_j$ of the current switch pulses $\phi_j$ (Fig. 5(a)); (2) their finite rise and fall times (see Fig. 6(b)); (3) spatial variations in the currents $I_j$; (4) thermal noise in the discharge RC-network; and (5) clipping due to limited voltage head-room. Thus, the analog DP output $V_a$ corresponding to $y_a = y_o + \eta_a$ is given by:

$$(y_a \rightarrow V_a) = (y_o \rightarrow V_o) + (\eta_e \rightarrow v_e) + (\eta_h \rightarrow v_c),$$
$$v_e = v_\theta + \frac{1}{C}\sum_{j=1}^{N}\left[i_j T_j + I_j (t_j - t_{rf})\right],$$
$$v_c = \min(V_o, V_{o,\max}) - V_o \tag{17}$$

where $V_{o,\max}$ is the maximum allowable output voltage, $v_e$ and $v_c$ are the voltage domain noise due to circuit non-idealities and clipping, respectively, $i_j \sim \mathcal{N}(0, \sigma_{I_j}^2)$ is the noise due to (spatial) current mismatch, and $t_j \sim \mathcal{N}(0, \sigma_{T_j}^2)$ is the noise due to (temporal) pulse-width mismatch, both of which are modeled as zero-mean Gaussian random variables, $t_{rf}$ models the impact of finite rise and fall times of the current switching pulses, and $v_\theta \sim \mathcal{N}(0, \sigma_\theta^2)$ is the thermal noise. Note: $V_{o,\max}$ can be as high as 0.9 V when $V_{dd} = 1$ V.

Figure 6: Modeling the discharge process in the QS compute model: (a) cell current $I_j$, and (b) the word-line voltage pulse $V_{WL}$.

Analytical expressions to estimate the noise standard deviations $\sigma_{I_j}$, $\sigma_{T_j}$, $\sigma_\theta$, and the offset $t_{rf}$ (see appendix) are provided below:

$$\sigma_{I_j} = I_j \sigma_D = I_j \frac{\alpha\, \sigma_{V_t}}{V_{WL} - V_t} \tag{18}$$
$$t_{rf} = T_r - \frac{V_{WL} - V_t}{V_{WL}(\alpha + 1)}(T_r + T_f) \tag{19}$$
$$\sigma_{T_j} = \sqrt{h_j}\,\sigma_{T_0}, \quad \sigma_\theta = \sqrt{\frac{kT}{C}} \tag{20}$$

where $\sigma_D^2$ is the normalized current mismatch variance, $T_j \propto h_j$ is the delay of an $h_j$-stage WL driver composed of unit elements with delay $T_0$ each, $\sigma_{T_0}$ is the standard deviation of $T_0$, $T_r$ and $T_f$ are the WL pulse rise and fall times (see Fig. 6(b)), $\alpha$ is a fitting parameter in the $\alpha$-law transistor equation, $\sigma_{V_t}$ is the standard deviation of $V_t$ variations, $k$ is the Boltzmann constant, and $T$ is the absolute temperature.

Note that typically the WL voltage $V_{WL}$ is identical for all rows in the memory array, with a few exceptions such as [40], which modulate $V_{WL}$ to tune the cell current $I_j$. The effects of rise/fall times and delay variations can be mitigated by carefully designing the WL pulse generators. Therefore, noise in QS is dominated by spatial threshold voltage variations. Indeed, using the typical values from Table 2, we find that $\sigma_{I_j}/I_j$ ranges from 8% to 25%, while $\sigma_{T_j}/T_j$ ranges from 0.5% to 3%.

Table 2: QS Model Parameters in a 65 nm CMOS Process

Parameter                      Value         Parameter              Value
$k'$ (µA/V$^2$)                220           $\alpha$               1.8
$\sigma_{T_0}$ (ps)            2.3           $\sigma_{V_t}$ (mV)    23.8
$\Delta V_{BL,\max}$ (V)       0.8 to 0.9    $V_{WL}$ (V)           0.4 to 0.8
$V_t$ (V)                      0.4           $T_0$ (ps)             100
$T$ (K)                        270           $k$ (J K$^{-1}$)       1.38e-23

Energy and Delay Models: The average energy consumption in the QS model is given by:

$$E_{QS} = E[V_a]\, V_{dd}\, C + E_{su} \tag{21}$$

where the spatio-temporal expectation $E[V_a]$ is taken over inputs (temporal) and over columns (spatial), and $E_{su}$ is the energy cost of toggling the switches $\phi_j$. Equation (21) shows that the energy consumption in the QS model increases with $C$ (array size), the supply voltage $V_{dd}$, and the mean value of the DP, $E[V_a]$.

The delay of the QS model is given by $T_{QS} = T_{\max} + T_{su}$, where $T_{su}$ is the time required to precharge the capacitors and set up currents, and $T_{\max} = \max\{T_j\}$ is the longest allowable pulse-width.

Table 2 tabulates parameters of the QS model in a representative 65 nm CMOS process.

Figure 7: The charge summing IMC (QS-Arch).

4.3 QS-Arch
The charge summing architecture (QS-Arch) in Fig. 7(b) employs a 6T [8] or 8T [30] SRAM bitcell within the QS model (see Section 4.2). This architecture implements fully-binarized DPs on the BLs by mapping the input bit $\hat{x}_{i,j}$ to the WL access pulse $V_{WL,j}$ while the weights $\hat{w}_{i,j}$ are stored across $B_w$ columns of the BCA so that the BC currents $I_{i,j} \propto \hat{w}_{i,j}$.
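The QS model of Section 4.2 can be exercised numerically to see why headroom clipping bounds the DP dimension. The sketch below is illustrative only: the function names are hypothetical, the mismatch relation assumes the $\alpha$-law form $\sigma_I / I = \alpha \sigma_{V_t} / (V_{WL} - V_t)$, and the cell current, pulse width and capacitance are picked from the "typical value" ranges quoted in the text rather than from any specific design.

```python
def sigma_d(alpha, sigma_vt, v_wl, v_t):
    """Normalized current mismatch sigma_I / I, assuming an alpha-law
    transistor: I ~ (V_WL - V_t)^alpha, so dI/I = alpha * dVt / (V_WL - V_t)."""
    return alpha * sigma_vt / (v_wl - v_t)


def qs_output(currents, pulse_widths, cap, v_max):
    """Ideal BL voltage V_o per (16) and the clipping error v_c per (17)."""
    v_o = sum(i * t for i, t in zip(currents, pulse_widths)) / cap
    v_c = min(v_o, v_max) - v_o   # zero when no clipping, negative otherwise
    return v_o, v_c


if __name__ == "__main__":
    # Table 2 values at the upper end of the V_WL range.
    print(sigma_d(alpha=1.8, sigma_vt=23.8e-3, v_wl=0.8, v_t=0.4))

    # Assumed operating point: 50 uA cells, 200 ps pulses, 300 fF, 0.9 V headroom.
    for n in (20, 100):
        v_o, v_c = qs_output([50e-6] * n, [200e-12] * n, 300e-15, 0.9)
        print(n, v_o, v_c)
```

With these assumed values, 20 active cells stay within the headroom, while 100 active cells would require several volts of discharge and are clipped. Lowering the cell current (via $V_{WL}$) pushes the clipping point to larger $N$ but raises the relative mismatch returned by `sigma_d`.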
The output $V_o = \Delta V_{BL}$ is the voltage discharge on the BL, and the capacitance $C = C_{BL}$ is the BL capacitance in (16). QS-Arch sequentially (bit-serially) processes one multi-bit input vector $\mathbf{x}$ in $B_x$ in-memory compute cycles, followed by a digital summing of the binarized DPs to obtain the final multi-bit DP (2). Table 3 summarizes the noise and energy models for QS-Arch.

We derive the analytical expressions of architecture-level noise models for QS-Arch using those of the QS model described in Section 4.2. In QS-Arch, clipping occurs in each of the $B_x B_w$ binarized DPs and contributes to the overall clipping noise variance $\sigma_{\eta_h}^2$ at the multi-bit DP output. Circuit noise from each binarized DP is aggregated to obtain the final circuit noise variance $\sigma_{\eta_e}^2$. In addition, employing the MPC-imposed requirement on the final DP output precision $B_y$ (15), we obtain the lower bound on the ADC precision $B_{ADC}$.

Since the multi-bit DP computation in (2) is high-dimensional ($N$ can be in the hundreds), it is clear that the limited BL dynamic range, e.g., $V_{o,\max}$ in (17), will begin to dominate $\mathrm{SNR}_a$ in (7). It is for this reason that most, if not all, IMCs resort to some form of binarization of the multi-bit DP in (2) prior to employing one of the in-memory compute models (see Table 1). Ultimately, $\mathrm{SNR}_a$ limits the number and accuracy of BL computations per read cycle and hence the overall energy efficiency of IMCs.

Table 3: Model Parameters for QS-Arch

Bitcell type              6T or 8T
Analog core precision     $B_x = 1$, $B_w = 1$
Compute model mapping     $C \rightarrow C_{BL}$, $V_o \rightarrow \Delta V_{BL}$, $T_j \rightarrow V_{WL,j}$ (WL access pulse)
Energy cost per DP        $E_{QS\text{-}Arch} = B_w B_x (E_{QS} + E_{ADC}) + E_{misc}$
$\sigma_{q_{iy}}^2$       $\frac{1}{12} N \Delta_x^2 \sigma_w^2 + \frac{1}{12} N \Delta_w^2 E[x^2]$
$\sigma_{\eta_e}^2$       [expression not legible in this transcription]
$\sigma_{\eta_h}^2$       [expression not legible in this transcription]
$B_{ADC}$                 $\ge \min\left(\frac{\mathrm{SNR}_{A(\mathrm{dB})} + 16.2}{6},\ \log_2(k_h),\ \log_2(N)\right)$

$k_h = \Delta V_{BL,\max} / \Delta V_{BL,unit}$; $\sigma_D = \sigma_I / I$ is the normalized standard deviation of the bit-cell current (18); $(x)^+ = \max(x, 0)$.

Figure 8: SNR validation methodology (blocks: component-level Monte Carlo SPICE simulations, measurements from the DIMA prototype, noise models, architecture model, SNR expressions, non-linear behavioral model, sample-accurate Python simulation, SNR comparison).

5 SIMULATION RESULTS
This section describes the methodology for validating the noise expressions in Table 3 and presents simulation results for QS-Arch.

5.1 Noise Model Validation Methodology
As shown in Figure 8, we obtain the QS model parameters (Section 4) using Monte Carlo circuit simulations in a representative 65 nm CMOS process, with experimental validation of some of these, e.g., $\sigma_{\eta_e}$, from our IMC prototype ICs [6, 15] when possible.

Incorporating non-linear circuit behavior along with noise models, sample-accurate Monte Carlo Python simulations are employed to numerically calculate SNR values using ensemble-averaged (over 1000 instances) statistics. We compare the SNR values obtained through sample-accurate simulations with those obtained by evaluating the analytical expressions in Table 3.

The quantitative results in subsequent sections employ the QS model parameter values in Table 2 along with the QS-Arch energy and noise models from Table 3. An SRAM BCA with 512 rows and $C_{BL} = 270$ fF is assumed throughout.
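The validation idea of comparing ensemble-averaged Monte Carlo estimates with closed-form expressions can be illustrated on the quantization-only expression (8). The sketch below is not the authors' simulator: analog noise and non-linear circuit behavior are omitted, the function names are hypothetical, and uniform weights on [-1, 1] and activations on [0, 1] are assumed (so that $\zeta_w = 3$ and $\zeta_x = 0.75$).

```python
import math
import random


def quantize(v, step):
    """Uniform rounding quantizer with step size `step` (no clipping)."""
    return step * round(v / step)


def simulate_sqnr_qiy_db(n=64, bx=7, bw=7, trials=2000, seed=0):
    """Ensemble-averaged SQNR of an N-dimensional DP with quantized inputs."""
    rng = random.Random(seed)
    dw = 2.0 ** (-(bw - 1))   # weight step size, w_m = 1
    dx = 2.0 ** (-bx)         # activation step size, x_m = 1
    signal_power = noise_power = 0.0
    for _ in range(trials):
        y_ideal = y_fx = 0.0
        for _ in range(n):
            w = rng.uniform(-1.0, 1.0)
            x = rng.uniform(0.0, 1.0)
            y_ideal += w * x
            y_fx += quantize(w, dw) * quantize(x, dx)
        signal_power += y_ideal ** 2
        noise_power += (y_fx - y_ideal) ** 2
    return 10 * math.log10(signal_power / noise_power)


def analytic_sqnr_qiy_db(bx=7, bw=7, zeta_x=0.75, zeta_w=3.0):
    """Linear-domain form of (8): 3 / (zeta_w 4^-Bw + zeta_x 4^-Bx)."""
    return 10 * math.log10(
        3.0 / (zeta_w * 4.0 ** (-bw) + zeta_x * 4.0 ** (-bx)))


if __name__ == "__main__":
    print("simulated:", simulate_sqnr_qiy_db())
    print("analytic: ", analytic_sqnr_qiy_db())
```

The two values should agree to within the sampling error of the ensemble average; a full validation in the sense of this section would add the analog noise terms of Table 3 and the non-linear discharge behavior to the simulated DP.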
The energy and accuracy of QS-Arch are traded off by tuning $V_{WL}$. We assume zero-mean signed weights $w_j$ and unsigned inputs $x_j$ drawn independently from two different distributions. We set $B_x = B_w = 6$ everywhere, unless otherwise stated, so that $\mathrm{SQNR}_{q_{iy}(\mathrm{dB})} = 38.9$ dB $\gg \mathrm{SNR}_{a(\mathrm{dB})}$ and therefore $\mathrm{SNR}_A \approx \mathrm{SNR}_a$ from (10). Next, we show how $\mathrm{SNR}_A$ and $\mathrm{SNR}_T$ trade off with $N$ and $B_{ADC}$.

5.2 SNR Trade-offs in QS-Arch
Figure 9(a) shows that the maximum achievable $\mathrm{SNR}_A$ increases with $V_{WL}$. Further, for a fixed $V_{WL}$, QS-Arch also exhibits a sharp drop in $\mathrm{SNR}_A$ at high values of $N > N_{\max}$, e.g., $\mathrm{SNR}_A \approx 19.6$ dB for $N \le 125$, after which it drops as $N$ increases. A key reason for this trade-off is that $\sigma_{\eta_h}^2$ decreases while $\sigma_{\eta_e}^2$ increases as $V_{WL}$ is reduced (see Table 3), and that $\sigma_{\eta_h}^2$ limits $N$ while $\sigma_{\eta_e}^2$ limits $\mathrm{SNR}_a$. Thus, by controlling $V_{WL}$, we can trade off $N_{\max}$ with $\mathrm{SNR}_A$. Specifically, $N_{\max}$ increases by 2× for every 3 dB drop in $\mathrm{SNR}_A$.

In QS-Arch, the minimum value of $B_{ADC}$ (see Table 3) depends upon the minimum of: 1) the MPC term (15); 2) the headroom clipping term; and 3) the small-$N$ case, where the BL discharge $\Delta V_{BL}$ has a finite number of discrete levels. Figure 9(b) shows that $\mathrm{SNR}_T \rightarrow \mathrm{SNR}_A$ of Fig. 9(a) when $B_{ADC}$ is greater than the lower bound (circled) in Table 3 for different values of $V_{WL}$ and $N$.

5.3 Impact of ADC Precision
Minimizing the column ADC energy is c