Fundamental Limits On The Precision Of In-memory Architectures

Transcription

Fundamental Limits on the Precision of In-memory Architectures (Invited Talk)

Sujan K. Gonugondla, Charbel Sakr, Hassan Dbouk, and Naresh R. Shanbhag
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign

ABSTRACT

This paper obtains the fundamental limits on the computational precision of in-memory computing architectures (IMCs). Various compute SNR metrics for IMCs are defined and their interrelationships analyzed to show that the accuracy of IMCs is fundamentally limited by the compute SNR (SNR_a) of its analog core, and that activation, weight and output precision needs to be assigned appropriately for the final output SNR SNR_T to approach SNR_a. The minimum precision criterion (MPC) is proposed to minimize the output and hence the column analog-to-digital converter (ADC) precision. The charge summing (QS) compute model and its associated IMC (QS-Arch) are studied to obtain analytical models for its compute SNR, minimum ADC precision, energy and latency. Compute SNR models of QS-Arch are validated via Monte Carlo simulations in a 65 nm CMOS process. Employing these models, upper bounds on SNR_a of a QS-Arch-based IMC employing a 512-row SRAM array are obtained, and it is shown that QS-Arch's energy cost reduces by 3.3x for every 6 dB drop in SNR_a, and that the maximum achievable SNR_a reduces with technology scaling while the energy cost at the same SNR_a increases. These models also indicate the existence of an upper bound on the dot product dimension N due to voltage headroom clipping, and that this bound can be doubled for every 3 dB drop in SNR_a.

KEYWORDS

in-memory computing, taxonomy of in-memory, in-memory noise, machine learning, accelerator, in-memory precision, in-memory accuracy, compute in-memory

ACM Reference Format:
Sujan K. Gonugondla, Charbel Sakr, Hassan Dbouk, and Naresh R. Shanbhag. 2020. Fundamental Limits on the Precision of In-memory Architectures (Invited Talk). In IEEE/ACM International Conference on Computer-Aided Design (ICCAD '20), November 2-5, 2020, Virtual Event, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3400302.3416344

1 INTRODUCTION

In-memory computing (IMC) [13, 19, 28, 34] has emerged as an attractive alternative to conventional von Neumann (digital) architectures for addressing the energy and latency cost of memory accesses in data-centric machine learning workloads. IMCs embed analog mixed-signal computations in close proximity to the bit-cell array (BCA) in order to execute machine learning computations such as matrix-vector multiplies (MVMs) and dot products (DPs) as an intrinsic part of the read cycle, thereby avoiding the need to access raw data.

IMCs exhibit a fundamental trade-off between their energy-delay product (EDP) and the accuracy or signal-to-noise ratio (SNR) of their analog computations. This trade-off arises due to constraints on the maximum bit-line (BL) voltage discharge and due to process variations, specifically spatial variations in the threshold voltage V_t, which limit the dynamic range and the SNR. Additionally, IMCs also exhibit noise due to the quantization of their input activation and weight parameters and due to the column analog-to-digital converters (ADCs). Henceforth, we use "compute SNR" to refer to the computational precision/accuracy of an IMC, and "precision" to refer to the number of bits assigned to various signals.

Today, a large number of IMC prototype ICs have been demonstrated [1, 3, 4, 7, 12, 15-17, 31-33, 36, 38, 40]. While these IMCs have shown impressive reductions in the EDP over a von Neumann equivalent with minimal loss in inference accuracy, it is not clear that these gains are sustainable for larger problem sizes across data sets and inference tasks. Unlike digital architectures, whose compute SNR can be made arbitrarily high by assigning sufficiently high precision to various signals, IMCs need to contend with both quantization noise and analog non-idealities. Therefore, IMCs will have intrinsic limits on their compute SNR. Since the compute SNR trades off with energy and delay, this raises the following question: What are the fundamental limits on the achievable computational precision of IMCs?

Answering this question is made challenging by the rich design space occupied by IMCs, encompassing a huge diversity of available memory devices, bitcell circuit topologies, and circuit and architectural design methods. Today's IMCs tend to employ ad hoc approaches to assign input and ADC precisions, or tend to over-provision their analog SNR in order to emulate the determinism of digital computations. An analytical understanding of the relationship between precision, compute SNR, energy, and delay in IMCs is presently missing.

This paper attempts to fill this gap by: 1) defining compute SNR metrics for IMCs, 2) developing a systematic methodology to obtain a minimum precision assignment for activations, weights and outputs of fixed-point DPs realized on IMCs to meet network accuracy requirements, and 3) employing this methodology to obtain the limits on the achievable compute SNR of a commonly employed IMC topology, and to quantify its energy vs. accuracy trade-off.

Figure 1: System noise model of IMC: (a) a generic IMC block diagram, and (b) dominant noise sources in fixed-point DP computation on IMCs.

2 NOTATION AND PRELIMINARIES

2.1 General Notation

We employ the term signal-to-quantization noise ratio (SQNR) when only quantization noise (denoted as q) is involved. The term SNR is employed when analog noise sources are included, and we use η to denote such sources. SNR is also employed when both quantization and analog noise sources are present.

2.2 The Additive Quantization Noise Model

Under the additive quantization noise model, a floating-point (FL) signal x quantized to B_x bits is represented as x_q = x + q_x, where q_x is the quantization noise assumed to be independent of the signal x. If x ∈ [-x_m, x_m] and q_x ~ U[-0.5Δ_x, 0.5Δ_x], where Δ_x = x_m 2^-(B_x-1) is the quantization step size and U[a, b] denotes the uniform distribution over the interval [a, b], then the signal-to-quantization noise ratio (SQNR_x) is given by:

    SQNR_x(dB) = 10 log10(SQNR_x) = 6 B_x + 4.78 - ζ_x(dB)    (1)

where SQNR_x = σ_x^2 / σ_qx^2, σ_qx^2 = Δ_x^2 / 12, and ζ_x(dB) = 10 log10(x_m^2 / σ_x^2) is the peak-to-average (power) ratio (PAR) of x. Equation (1) quantifies the familiar 6 dB SQNR gain per bit of precision.
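The 6 dB-per-bit behavior in (1) is easy to check numerically. The sketch below (in Python, in the spirit of the sample-accurate simulations used later in the paper) quantizes an illustrative zero-mean test signal and compares the measured SQNR against (1); the signal distribution, its spread, and the bit-widths are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, x_m, B):
    """Uniform quantizer over [-x_m, x_m] with step size x_m * 2**-(B-1), per Section 2.2."""
    delta = x_m * 2.0 ** -(B - 1)
    return np.clip(np.round(x / delta) * delta, -x_m, x_m)

x_m = 1.0
x = np.clip(rng.normal(0.0, 0.25, size=1_000_000), -x_m, x_m)   # illustrative test signal
zeta_x_db = 10 * np.log10(x_m**2 / np.mean(x**2))               # PAR of x in dB

for B_x in (4, 6, 8):
    q_x = quantize(x, x_m, B_x) - x                              # additive quantization noise q_x
    sqnr_meas = 10 * np.log10(np.mean(x**2) / np.mean(q_x**2))
    sqnr_pred = 6 * B_x + 4.78 - zeta_x_db                       # equation (1)
    print(f"B_x={B_x}: measured {sqnr_meas:.1f} dB, predicted {sqnr_pred:.1f} dB")
```

Each additional bit buys roughly 6 dB of SQNR, as (1) predicts.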
2.3 The Dot-Product (DP) Computation

Consider the FL dot product (DP) computation defined as:

    y_o = w^T x = Σ_{j=1}^{N} w_j x_j    (2)

where y_o is the DP of two N-dimensional real-valued vectors w = [w_1, ..., w_N]^T (weight vector) and x = [x_1, ..., x_N]^T (activation vector) of precision B_w and B_x, respectively.

In DNNs, the dot product in (2) is computed with w ∈ [-w_m, w_m] (signed weights), input x ∈ [0, x_m] (unsigned activations) and output y ∈ [-y_m, y_m] (signed outputs). Assuming the additive quantization noise model from Section 2.2, the fixed-point (FX) computation of the DP (2) is described by:

    y = w_q^T x_q + q_y = (w + q_w)^T (x + q_x) + q_y ≈ w^T x + w^T q_x + q_w^T x + q_y    (3)
      = y_o + q_iy + q_y    (4)

where w_q = w + q_w and x_q = x + q_x are the quantized weight and activation vectors, respectively, q_iy is the total input (weight and activation) quantization noise seen at the output y, and q_y is the additional output quantization noise due to round-off/truncation in digital architectures or to the finite resolution of the column ADCs in IMC architectures.

Assuming that the weights (signed) and inputs (unsigned) are i.i.d. random variables (RVs), the variances of the signals in (4) are given by:

    σ_yo^2 = N σ_w^2 E[x^2];   σ_qy^2 = Δ_y^2 / 12;   σ_qiy^2 = (N/12) (Δ_w^2 E[x^2] + Δ_x^2 σ_w^2)    (5)

where σ_w^2 is the variance of the weights, and Δ_w = w_m 2^-(B_w-1), Δ_x = x_m 2^-B_x, and Δ_y = y_m 2^-(B_y-1) are the weight, activation, and output quantization step-sizes, respectively.

3 COMPUTE SNR LIMITS OF IMCS

We propose the system noise model in Fig. 1 for obtaining precision limits on IMC architectures. Such architectures (Fig. 1(a)) accept a quantized input (x_q) and a quantized weight vector (w_q) to implement multiple FX DP computations of (4) in parallel in their analog core. Hence, unlike digital architectures, IMC architectures suffer from both quantization and analog noise sources such as SRAM cell current variations, thermal noise, and charge injection, as well as limited headroom, all of which limit their compute SNR.

3.1 Compute SNR Metrics for IMCs

The following equations describe the IMC noise model in Fig. 1:

    y = y_o + q_iy + η_a + q_y;   η_a = η_e + η_h    (6)

where y_o is the ideal DP value defined in (2), q_iy is the input quantization noise reflected at the output, η_a is the analog noise term comprising both the clipping noise η_h due to limited headroom and η_e capturing all other analog noise sources, and q_y is the quantization noise introduced by the ADC.

We define the following fundamental compute SNR metrics:

    SQNR_qiy = σ_yo^2 / σ_qiy^2;   SNR_a = σ_yo^2 / σ_ηa^2;   SQNR_qy = σ_yo^2 / σ_qy^2    (7)

where SNR_a is the analog SNR, and SQNR_qiy is the propagated SQNR at the output due to input (weight and activation) quantization noise, given by:

    SQNR_qiy(dB) = 6 (B_x + B_w) + 4.8 - [ζ_x(dB) + ζ_w(dB)] - 10 log10(2^{2B_x}/ζ_x + 2^{2B_w}/ζ_w)    (8)

where ζ_x(dB) = 10 log10(x_m^2 / (4 E[x^2])) and ζ_w(dB) = 10 log10(w_m^2 / σ_w^2) are the PARs of the (unsigned) activations and (signed) weights, respectively.
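To illustrate how (8) captures the input quantization noise propagated through a DP, the following sketch estimates SQNR_qiy by Monte Carlo and compares it with the reconstructed form of (8); the dimension N, the uniform operand distributions, and the bit-widths are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, B_x, B_w = 256, 6, 6
x_m, w_m = 1.0, 1.0
d_x = x_m * 2.0 ** (-B_x)          # activation step (unsigned range [0, x_m])
d_w = w_m * 2.0 ** (-(B_w - 1))    # weight step (signed range [-w_m, w_m])

x = rng.uniform(0.0, x_m, size=(10_000, N))                              # activations
w = rng.uniform(-w_m, w_m, size=(10_000, N))                             # weights
y_o = np.sum(w * x, axis=1)                                              # ideal FL dot product (2)
y_q = np.sum(np.round(w / d_w) * d_w * np.round(x / d_x) * d_x, axis=1)  # DP with quantized inputs
q_iy = y_q - y_o                                                         # propagated input quantization noise

sqnr_meas = 10 * np.log10(np.var(y_o) / np.var(q_iy))

zeta_x = x_m**2 / (4 * np.mean(x**2))          # linear PARs per the definitions above
zeta_w = w_m**2 / np.var(w)
sqnr_pred = (6 * (B_x + B_w) + 4.8
             - 10 * np.log10(zeta_x) - 10 * np.log10(zeta_w)
             - 10 * np.log10(2.0 ** (2 * B_x) / zeta_x + 2.0 ** (2 * B_w) / zeta_w))  # equation (8)

print(f"measured {sqnr_meas:.1f} dB, predicted {sqnr_pred:.1f} dB")
```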

SQNR_qy is the digitization SQNR solely due to ADC quantization noise and is given by:

    SQNR_qy(dB) = 6 B_y + 4.8 - [ζ_x(dB) + ζ_w(dB)] - 10 log10(N)    (9)

which is obtained via the substitutions B_x → B_y and ζ_x(dB) → ζ_y(dB) = ζ_x(dB) + ζ_w(dB) + 10 log10(N) in (1).

From (6) and (7), it is straightforward to show:

    SNR_A = σ_yo^2 / (σ_qiy^2 + σ_ηa^2) = [1/SNR_a + 1/SQNR_qiy]^-1    (10)

    SNR_T = σ_yo^2 / (σ_qiy^2 + σ_ηa^2 + σ_qy^2) = [1/SNR_A + 1/SQNR_qy]^-1    (11)

where SNR_A is the pre-ADC SNR and SNR_T is the total output SNR including all noise sources. Note: (10)-(11) can be repurposed for digital architectures by setting SNR_a = ∞, since quantization is then the only noise source, implying SNR_A = SQNR_qiy. Equations (8)-(9) indicate that SQNR_qiy and SQNR_qy can be made arbitrarily large by assigning sufficiently high precision to the DP inputs (B_x and B_w) and the output (B_y). Thus, from (10)-(11), SNR_T in IMCs is fundamentally limited by SNR_a, which depends on the analog noise sources, as one expects.

3.2 Precision Assignment Methodology for IMCs

Prior work [25, 26] indicates the requirement SNR_T(dB) ≥ SNR*_T(dB) ≈ 10 dB-to-40 dB (see Fig. 2) for the inference accuracy of an FX network to be within 1% of the corresponding FL network for popular DNNs (AlexNet, VGG-9, VGG-16, ResNet-18) deployed on the ImageNet and CIFAR-10 datasets. To meet this SNR_T(dB) requirement, digital architectures choose B_x and B_w such that SQNR_qiy ≥ SNR*_T, and then choose B_y sufficiently high to guarantee SQNR_qy ≫ SQNR_qiy so that SNR_T ≈ SQNR_qiy.

Figure 2: Per-layer SNR_T(dB) requirements of DP computations in VGG-16 deployed on ImageNet.

In contrast, for IMCs, we first need to ensure that SNR_a ≥ SNR*_T so that SNR_T can be made to approach SNR_a with an appropriate precision assignment via the following methodology:

(1) Assign sufficiently high values for B_x and B_w per (8) such that SQNR_qiy ≫ SNR_a, so that SNR_A ≈ SNR_a per (10).
(2) Assign a sufficiently high value for B_y per (9) such that SQNR_qy ≫ SNR_A, so that SNR_T ≈ SNR_A per (11).

For example, if SQNR_qiy(dB), SQNR_qy(dB) ≥ SNR_a(dB) + 9 dB, then SNR_a(dB) ≥ SNR_T(dB) ≥ SNR_a(dB) - 0.5 dB, i.e., SNR_T(dB) lies within 0.5 dB of SNR_a(dB). In this manner, by appropriate choices of B_x, B_w, and B_y, IMCs can be designed such that SNR_T → SNR_a, which is the fundamental limit on SNR_T.

From the above discussion it is clear that the input precisions B_x and B_w are dictated by network accuracy requirements, while the output precision B_y needs to be set sufficiently high to avoid becoming a significant noise contributor. To ensure a sufficiently high value for B_y, digital architectures employ the bit growth criterion (BGC) described next.

3.3 Bit Growth Criterion (BGC)

The BGC is commonly employed to assign the output precision B_y in digital architectures [9, 25]. BGC sets B_y as:

    B_y^BGC = B_x + B_w + log2(N)    (12)

Substituting B_y = B_y^BGC from (12) into (9) and employing the relationship ζ_y(dB) = ζ_x(dB) + ζ_w(dB) + 10 log10(N), the resulting SQNR due to output quantization using the BGC is given by:

    SQNR_qy^BGC(dB) = 10 log10(σ_yo^2 / σ_qy^2) = 6 (B_x + B_w) + 4.8 - [ζ_x(dB) + ζ_w(dB)] + 10 log10(N)    (13)

Recall that SQNR_qy^BGC ≫ SNR_A is required in order to ensure that SNR_T is close to its upper bound. Comparing (9) and (13), we see that, for high values of the DP dimensionality N, BGC is overly conservative since it assigns large values to B_y per (12).
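The contrast can be made concrete with a short sketch: given an assumed analog SNR, it reports the B_y assigned by (12) against the smallest B_y for which (9) already delivers the target SQNR_qy, and composes the result per (10)-(11). The SNR_a value, PARs, and N values below are illustrative assumptions.

```python
import numpy as np

def db(x):   return 10 * np.log10(x)
def lin(x):  return 10.0 ** (x / 10.0)

def snr_t_db(snr_a_db, sqnr_qiy_db, sqnr_qy_db):
    """Total output SNR per (10)-(11): 1/SNR_T = 1/SNR_a + 1/SQNR_qiy + 1/SQNR_qy."""
    return db(1.0 / (1.0 / lin(snr_a_db) + 1.0 / lin(sqnr_qiy_db) + 1.0 / lin(sqnr_qy_db)))

B_x = B_w = 7
zeta_sum_db = -1.3 + 4.8          # zeta_x(dB) + zeta_w(dB) for uniform operands
snr_a_db = 31.0                   # assumed analog SNR of the IMC core
sqnr_qiy_db = 41.0                # from (8) with B_x = B_w = 7 (Section 3.5)
target_db = snr_a_db + 9.0        # SQNR_qy target that keeps SNR_T near SNR_a

for N in (16, 64, 256, 1024):
    B_y_bgc = B_x + B_w + int(np.ceil(np.log2(N)))                       # BGC assignment (12)
    B_y_9 = int(np.ceil((target_db - 4.8 + zeta_sum_db + db(N)) / 6.0))  # smallest B_y meeting (9)
    sqnr_qy_db = 6 * B_y_9 + 4.8 - zeta_sum_db - db(N)                   # achieved SQNR_qy per (9)
    print(f"N={N:4d}: B_y(BGC)={B_y_bgc}, B_y per (9)={B_y_9}, "
          f"SNR_T={snr_t_db(snr_a_db, sqnr_qiy_db, sqnr_qy_db):.1f} dB")
```

The gap between the two B_y columns widens with N, which is exactly the over-provisioning that motivates MPC below.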
Some digital architectures truncate the LSBs to control bit growth. The SQNR of such a truncated BGC (tBGC) can be obtained from (9) by setting B_y < B_y^BGC. BGC's high precision requirement is accommodated by digital architectures by increasing the precision of their arithmetic units at a commensurate increase in computational energy, latency, and activation storage costs. However, IMCs cannot afford to use this criterion since B_y is the precision of the BL ADCs, which impacts their energy, latency, and area. Indeed, recent works [24] have claimed that BL ADCs dominate the energy and latency costs of IMCs when BGC is used to assign B_y.

In the next section, we propose an alternative to BGC, referred to as the minimum precision criterion (MPC), that can be employed by both digital and IMC architectures and that achieves a desired SQNR_qy with far fewer bits than BGC.

3.4 The Minimum Precision Criterion (MPC)

We propose MPC to reduce B_y without incurring any loss in SQNR_qy compared to BGC. Unlike BGC, MPC accounts for the statistics of y_o and permits controlled amounts of clipping to occur. In MPC (see Fig. 3(a)), the output y_o is clipped to lie in the range [-y_c, y_c] instead of [-y_m, y_m] as in BGC (see Fig. 3(b)), where y_c < y_m (y_c: clipping level), and the B_y bits are employed to quantize this reduced range. The clipping probability p_c = Pr{|y_o| > y_c} is kept to a small user-defined value, e.g., y_c = 4σ_yo ensures that p_c ≤ 0.001 if y_o ~ N(0, σ_yo^2).

The resulting SQNR is given by:

    SQNR_qy^MPC(dB) = 6 B_y + 4.8 - ζ_y^MPC(dB) - 10 log10(1 + p_c σ_pc^2 / σ_qy^2)    (14)

where ζ_y^MPC(dB) = 10 log10(y_c^2 / σ_yo^2), and σ_pc^2 = E[(y_o - y_c)^2 | |y_o| > y_c] is the conditional clipping noise variance. Setting y_c = ζ_y^MPC σ_yo yields ζ_y^MPC(dB) = 10 log10((ζ_y^MPC)^2), indicating that p_c is a decreasing function of ζ_y^MPC. Thus, (14) has the same form as (1) with an additional (last-term) clipping noise factor.

MPC exploits a key insight (see Fig. 3(c)), which follows from the Central Limit Theorem (CLT): in an N-dimensional DP computation (2), σ_yo grows sub-linearly (as √N) as compared to the maximum y_m, which grows linearly with N. Furthermore, (14) shows a quantization vs. clipping noise trade-off controlled by the clipping level y_c. This trade-off, illustrated in Fig. 3(c), is absent in BGC and tBGC, and is critical to MPC's ability to realize desired values of SQNR_qy with smaller values of B_y.

Figure 3: Comparison of BGC and MPC: (a) MPC quantization levels, (b) BGC quantization levels, and (c) distribution f_Y(y_o) of the ideal DP output y_o vs. DP dimensionality N.

Assuming y_o ~ N(0, σ_yo^2), and substituting y_c = 4σ_yo and p_c = 0.001 into (14), we obtain the following lower bound:

    B_y^MPC ≥ (1/6) [ SNR_A(dB) + 7.2 - γ - 10 log10(1 - 10^(-γ/10)) ]    (15)

in order for SNR_A(dB) - SNR_T(dB) ≤ γ. For instance, the choice γ = 0.5 dB yields B_y^MPC ≥ (1/6)[SNR_A(dB) + 16.3], which corresponds to SQNR_qy^MPC(dB) = SNR_A(dB) + 9 dB as discussed in Section 3.2.

3.5 Simulation Results

To illustrate the difference between MPC, BGC and tBGC, we assume SNR_a(dB) = 31 dB, so that SNR_T(dB) ≥ 30 dB provided SQNR_qiy(dB), SQNR_qy(dB) ≥ 40 dB per (10)-(11). We further assume DPs of varying dimension N with 7-b quantized unsigned inputs and signed weights randomly sampled from uniform distributions. Substituting B_x = B_w = 7, ζ_x(dB) = -1.3 dB, and ζ_w(dB) = 4.8 dB into (8), we obtain SQNR_qiy(dB) ≈ 41 dB. Thus, all that remains is to assign B_y such that SQNR_qy(dB) ≥ 40 dB, for which there are three choices: MPC, BGC and tBGC.

Figure 4: Trends in SQNR_qy(dB) for DP computation with B_x = B_w = 7: (a) SQNR_qy(dB) vs. N for MPC (ζ_y = 4), BGC, and tBGC, and (b) SQNR_qy^MPC(dB) vs. ζ_y^MPC when B_y = 8.

Figure 4(a) compares the SQNR_qy achieved by the three methods. Per (15), MPC meets the SQNR_qy(dB) ≥ 40 dB requirement by setting B_y = 8 and ζ_y^MPC = 4, independent of N. In contrast, per (12), BGC assigns 16 ≤ B_y ≤ 20 as a function of N to achieve the same SNR_T as MPC. Furthermore, tBGC meets the SQNR_qy requirement with 11 ≤ B_y ≤ 13 but fails to do so with B_y = 8. Figure 4(b) shows that SQNR_qy^MPC(dB) is maximized when ζ_y^MPC = 4, i.e., when the clipping level y_c = 4σ_yo, thereby illustrating MPC's quantization vs. clipping noise trade-off described by (14). Figure 4 also validates the analytical expressions (8), (9), (13), and (14) (bold) by indicating a close match to ensemble-averaged values of SQNR_qy obtained from Monte Carlo simulations (dotted).
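The quantization vs. clipping trade-off of (14) can be reproduced with a few lines of Python: a Gaussian surrogate for y_o (justified by the CLT argument above) is clipped at y_c = ζ σ_yo and uniformly quantized with B_y = 8 bits. The sweep below is a sketch with illustrative values; its peak appears near ζ = 4, consistent with Fig. 4(b).

```python
import numpy as np

rng = np.random.default_rng(2)
y_o = rng.normal(0.0, 1.0, size=1_000_000)      # Gaussian surrogate for the DP output (CLT)
B_y = 8

def mpc_sqnr_db(y, zeta, B):
    """Clip y to [-y_c, y_c] with y_c = zeta * sigma_y, then quantize the reduced range with B bits."""
    y_c = zeta * np.std(y)
    delta = y_c * 2.0 ** -(B - 1)
    y_hat = np.clip(np.round(np.clip(y, -y_c, y_c) / delta) * delta, -y_c, y_c)
    return 10 * np.log10(np.var(y) / np.mean((y - y_hat) ** 2))

for zeta in (2, 3, 4, 5, 6):
    print(f"zeta_y = {zeta}: SQNR_qy = {mpc_sqnr_db(y_o, zeta, B_y):.1f} dB")
# small zeta: clipping noise dominates; large zeta: coarser quantization steps dominate
```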

Note: the theoretically optimal quantizer for an arbitrary signal distribution is obtained from the Lloyd-Max (LM) algorithm [18]. Unfortunately, the LM quantization levels are non-uniformly spaced, which makes it hard to design efficient arithmetic units to process such signals. MPC offers a practical alternative to LM.

4 ANALYTICAL MODELS FOR COMPUTE SNR

This section derives analytical expressions for SNR_a of a typical IMC. First, we show that most IMCs can be 'explained' via a few in-memory compute models.

4.1 In-memory Compute Models

All IMCs are viewed as employing one or more in-memory compute models, defined as mappings of the algorithmic variables y_o, x_j and w_j in (2) to physical quantities such as time, charge, current, or voltage, in order to (usually partially) realize an analog BL computation of the multi-bit DP in (2).

Furthermore, we suggest that most IMCs today employ one or more of the following three in-memory compute models (see Fig. 5): (a) charge summing (QS) [7, 14, 15, 40]; (b) current summing (IS) [12, 16, 17, 30]; and (c) charge redistribution (QR) [1, 7, 15, 33], and conjecture that these compute models are in some sense universal in that they represent an approximation to a 'complete set' of practical, i.e., realizable, mappings of variables from the algorithmic to the circuit domain, as shown in Table 1.

Figure 5: In-memory compute models: (a) charge summing (QS), (b) current summing (IS), and (c) charge redistribution (QR) models.

Table 1: A Taxonomy of IMCs using In-memory Compute Models. The table classifies published IMCs by in-memory compute model (QS, IS, QR), technology (CMOS vs. beyond-CMOS), analog core precision, and ADC precision B_ADC (T: ternary; A: analog/continuous-valued). The CMOS entries include Kang et al. [15], Biswas et al. [1], Zhang et al. [40], Valavi et al. [33], Khwa et al. [16], Jiang et al. [12], Si et al. [30], Jia et al. [11], Okumura et al. [23], Kim et al. [17], Guo et al. [8], Yue et al. [38], Su et al. [32], Dong et al. [4], and Si et al. [31]; the beyond-CMOS entries include Chen et al. [2], Fick et al. [5], Xue et al. [35], Yan et al. [37], Zha et al. [39], and Xue et al. [36].

Henceforth, we discuss the QS model and the corresponding QS-based IMC, referred to as QS-Arch, in detail, since it is very commonly used. Analytical expressions for the circuit-domain equivalents of η_e and η_h in (6) for the QS model are presented. These will be combined with the algorithm- and precision-dependent noise sources q_iy and q_y to obtain SNR_T.

4.2 The Charge Summing (QS) Model

The QS model (see Fig. 5(a)) realizes the DP in (2) via the variable mapping (y_o → V_o, w_j → I_j, x_j → T_j), where the cell current I_j is integrated over the WL pulse duration T_j (j = 1, ..., N) on a BL (or cell) capacitor C, resulting in an output voltage as shown below:

    (y_o → V_o) = (1/C) Σ_{j=1}^{N} (w_j → I_j)(x_j → T_j)    (16)

where V_o is the DP output assuming infinite voltage headroom, i.e., no clipping. The cell current I_j depends upon the transistor sizes and the WL voltage V_WL; typical values are: C (a few hundred fF), I_j (tens of μA), and T_j (hundreds of ps).
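A back-of-the-envelope evaluation of the QS mapping (16) is given below; the capacitance, current, and pulse-width values are only representative of the magnitudes quoted above and are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# QS model of (16): V_o = (1/C) * sum_j I_j * T_j, with w_j -> I_j and x_j -> T_j.
C = 300e-15                               # BL/cell capacitance: a few hundred fF (assumed 300 fF)
N = 64                                    # DP dimension (illustrative)
I = rng.uniform(5e-6, 30e-6, size=N)      # cell currents: tens of uA
T = rng.uniform(50e-12, 300e-12, size=N)  # WL pulse widths: hundreds of ps

V_o = np.sum(I * T) / C                   # ideal QS output voltage, infinite headroom
V_clip = min(V_o, 0.9)                    # limited headroom, e.g. an assumed V_o,max of 0.9 V
print(f"V_o = {V_o:.3f} V, after headroom clipping = {V_clip:.3f} V")
```

Even at moderate N the accumulated charge approaches the available headroom, which previews the clipping analysis that follows.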
Noise Models: The noise contributions in QS arise from the following sources: (1) variations in the pulse-widths T_j of the current switch pulses φ_j (Fig. 5(a)); (2) their finite rise and fall times (see Fig. 6(b)); (3) spatial variations in the currents I_j; (4) thermal noise in the discharge RC-network; and (5) clipping due to limited voltage headroom. Thus, the analog DP output V_a corresponding to y_a = y_o + η_a is given by:

    (y_a → V_a) = (y_o → V_o) + (η_e → v_e) + (η_h → v_c),
    v_e = v_θ + (1/C) Σ_{j=1}^{N} [ i_j T_j + I_j (t_j - t_rf) ],
    v_c = min(V_o, V_o,max) - V_o    (17)

where V_o,max is the maximum allowable output voltage, v_e and v_c are the voltage-domain noise terms due to circuit non-idealities and clipping, respectively, i_j ~ N(0, σ_Ij^2) is the noise due to (spatial) current mismatch, t_j ~ N(0, σ_Tj^2) is the noise due to (temporal) pulse-width mismatch (both modeled as zero-mean Gaussian random variables), t_rf models the impact of the finite rise and fall times of the current switching pulses, and v_θ ~ N(0, σ_θ^2) is the thermal noise. Note: V_o,max can be as high as 0.9 V when V_dd = 1 V.

Figure 6: Modeling the discharge process in the QS compute model: (a) cell current I_j, and (b) the word-line voltage pulse V_WL.

Analytical expressions to estimate the noise standard deviations σ_Ij, σ_Tj, σ_θ, and t_rf (see appendix) are provided below:

    σ_Ij = σ_D I_j,   σ_D = α σ_Vt / (V_WL - V_t)    (18)
    σ_Tj = sqrt(h_j) σ_T0    (19)
    σ_θ = sqrt(kT / C)    (20)

where σ_D^2 is the normalized current mismatch variance, T_j = h_j T_0 is the delay of an h_j-stage WL driver composed of unit elements with delay T_0 each, σ_T0 is the standard deviation of T_0, T_r and T_f are the WL pulse rise and fall times (see Fig. 6(b)) that, together with V_WL, V_t, and α, determine t_rf, α is a fitting parameter in the α-law transistor equation, σ_Vt is the standard deviation of the V_t variations, k is the Boltzmann constant, and T is the absolute temperature.

Note that the WL voltage V_WL is typically identical for all rows in the memory array, with a few exceptions such as [40], which modulates V_WL to tune the cell current I_j. The effects of rise/fall times and delay variations can be mitigated by carefully designing the WL pulse generators. Therefore, noise in QS is dominated by spatial threshold voltage variations. Indeed, using the typical values from Table 2, we find that σ_Ij/I_j ranges from 8% to 25%, while σ_Tj/T_j ranges from 0.5% to 3%.

Energy and Delay Models: The average energy consumption in the QS model is given by:

    E_QS = C V_dd E[V_a] + E_su    (21)

where the spatio-temporal expectation E[V_a] is taken over inputs (temporal) and over columns (spatial), and E_su is the energy cost of toggling the switches φ_j. Equation (21) shows that the energy consumption in the QS model increases with the capacitance C (array size), the supply voltage V_dd, and the mean value of the DP, E[V_a].

The delay of the QS model is given by T_QS = T_max + T_su, where T_su is the time required to precharge the capacitors and set up the currents, and T_max = max{T_j} is the longest allowable pulse-width.

Table 2 tabulates the parameters of the QS model in a representative 65 nm CMOS process.

Table 2: QS Model Parameters in a 65 nm CMOS Process

    k' (μA/V^2): 220             α: 1.8
    σ_T0 (ps): 2.3               σ_Vt (mV): 23.8
    ΔV_BL,max (V): 0.8-to-0.9    V_WL (V): 0.4-to-0.8
    V_t (V): 0.4                 T_0 (ps): 100
    T (K): 270                   k (J K^-1): 1.38e-23
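Plugging the Table 2 values into (18)-(19) gives a quick feel for the relative noise magnitudes; the specific V_WL points and WL-driver depths h_j swept below are illustrative choices.

```python
import numpy as np

alpha, sigma_Vt, V_t = 1.8, 23.8e-3, 0.4      # Table 2: alpha-law exponent, sigma_Vt, V_t
sigma_T0, T0 = 2.3e-12, 100e-12               # Table 2: spread and nominal value of the unit delay T_0

for V_WL in (0.55, 0.65, 0.8):                # WL voltages within the 0.4-to-0.8 V range of Table 2
    sigma_D = alpha * sigma_Vt / (V_WL - V_t) # normalized current mismatch sigma_I/I, per (18)
    print(f"V_WL = {V_WL:.2f} V: sigma_I/I = {100 * sigma_D:.1f} %")

for h in (1, 4, 16):                          # number of unit stages in the WL driver
    rel = np.sqrt(h) * sigma_T0 / (h * T0)    # sigma_Tj / T_j with T_j = h * T_0, per (19)
    print(f"h_j = {h:2d}: sigma_T/T = {100 * rel:.2f} %")
```

Current mismatch comes out one to two orders of magnitude larger than the timing mismatch, consistent with the observation that V_t variations dominate.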
4.3 QS-Arch

The charge summing architecture (QS-Arch) in Fig. 7(b) employs a 6T [8] or 8T [30] SRAM bitcell within the QS model (see Section 4.2). This architecture implements fully-binarized DPs on the BLs by mapping the input bit x̂_{i,j} to the WL access pulse V_WL,j, while the weight bits ŵ_{i,j} are stored across B_w columns of the BCA so that the bitcell currents I_{i,j} ∝ ŵ_{i,j}. The output V_o = ΔV_BL is the voltage discharge on the BL, and the capacitance C = C_BL is the BL capacitance in (16). QS-Arch sequentially (bit-serially) processes one multi-bit input vector x over B_x in-memory compute cycles, followed by a digital summing of the binarized DPs to obtain the final multi-bit DP (2).

Figure 7: The charge summing IMC (QS-Arch).

Table 3 summarizes the noise and energy models for QS-Arch. We derive the architecture-level noise models for QS-Arch using those of the QS model described in Section 4.2. In QS-Arch, clipping occurs in each of the B_x B_w binarized DPs and contributes to the overall clipping noise variance σ_ηh^2 at the multi-bit DP output. Circuit noise from each binarized DP is aggregated to obtain the final circuit noise variance σ_ηe^2. In addition, applying the MPC-imposed requirement on the final DP output precision B_y (15), we obtain the lower bound on the ADC precision B_ADC.

Since the multi-bit DP computation in (2) is high-dimensional (N can be in the hundreds), it is clear that the limited BL dynamic range, e.g., V_o,max in (17), will begin to dominate SNR_a in (7). It is for this reason that most, if not all, IMCs resort to some form of binarization of the multi-bit DP in (2) prior to employing one of the in-memory compute models (see Table 1). Ultimately, SNR_a limits the number and accuracy of BL computations per read cycle and hence the overall energy efficiency of IMCs.
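The bit-serial recombination performed by QS-Arch's digital back end can be illustrated as follows: each of the B_x·B_w binarized DPs (here computed ideally, with no analog noise, clipping, or ADC quantization) is shifted and accumulated to reconstruct the multi-bit DP (2). The operand bit-widths and dimension below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, B_x, B_w = 256, 4, 4

x = rng.integers(0, 2**B_x, size=N)                     # unsigned B_x-bit activations
w = rng.integers(-2**(B_w - 1), 2**(B_w - 1), size=N)   # signed B_w-bit (two's-complement) weights

x_bits = [(x >> i) & 1 for i in range(B_x)]             # activation bit-planes, applied bit-serially
w_bits = [((w & (2**B_w - 1)) >> j) & 1 for j in range(B_w)]  # weight bit-planes stored column-wise

y = 0
for i in range(B_x):                                    # B_x in-memory compute cycles
    for j in range(B_w):                                # B_w weight columns per cycle
        bl_dp = int(np.sum(x_bits[i] * w_bits[j]))      # binarized DP formed on one BL (idealized)
        col_wt = -(2**j) if j == B_w - 1 else 2**j      # MSB column carries the sign (2's complement)
        y += (2**i) * col_wt * bl_dp                    # digital shift-and-add recombination

print(y, int(np.dot(x, w)))                             # identical values: the recombination is exact
```

In the real architecture each bl_dp is corrupted by the analog noise and ADC quantization modeled above, which is what Table 3 accounts for.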

Table 3: Model Parameters for QS-Arch

    Bitcell type: 6T or 8T
    Analog core precision: B_x = 1, B_w = 1
    Compute model mapping: C → C_BL, V_o → ΔV_BL, T_j → T_WL,j
    Energy cost per DP: E_QS-Arch = B_w B_x (E_QS + E_ADC) + E_misc
    σ_qiy^2 = (N/12)(Δ_x^2 σ_w^2 + Δ_w^2 E[x^2])
    σ_ηe^2: aggregated circuit-noise variance, proportional to N σ_D^2 over the B_x B_w binarized DPs
    σ_ηh^2: headroom-clipping noise variance, a function of N, k_h, B_x, and B_w
    B_ADC = min{ (SNR_A(dB) + 16.2)/6, log2(k_h), log2(N) }

where k_h = ΔV_BL,max / ΔV_BL,unit, and σ_D = σ_I/I is the normalized standard deviation of the bitcell current (18).

Figure 8: SNR validation methodology.

5 SIMULATION RESULTS

This section describes the methodology for validating the noise expressions in Table 3 and presents simulation results for QS-Arch.

5.1 Noise Model Validation Methodology

As shown in Figure 8, we obtain the QS model parameters (Section 4) using Monte Carlo circuit simulations in a representative 65 nm CMOS process, with experimental validation of some of these, e.g., σ_ηe, from our IMC prototype ICs [6, 15] where possible.

Incorporating non-linear circuit behavior along with the noise models, sample-accurate Monte Carlo Python simulations are employed to numerically calculate SNR values using ensemble-averaged (over 1000 instances) statistics. We compare the SNR values obtained through sample-accurate simulations with those obtained by evaluating the analytical expressions in Table 3.

The quantitative results in subsequent sections employ the QS model parameter values in Table 2 along with the QS-Arch energy and noise models from Table 3. An SRAM BCA with 512 rows and C_BL = 270 fF is assumed throughout. Energy and accuracy of QS-Arch are traded off by tuning V_WL. We assume zero-mean signed weights w_j and unsigned inputs x_j drawn independently from two different distributions. We set B_x = B_w = 6 everywhere, unless otherwise stated, so that SQNR_qiy(dB) = 38.9 dB ≫ SNR_a(dB) and therefore SNR_A ≈ SNR_a from (10). Next, we show how SNR_A and SNR_T trade off with N and B_ADC.

5.2 SNR Trade-offs in QS-Arch

Figure 9(a) shows that the maximum achievable SNR_A increases with V_WL. Further, for a fixed V_WL, QS-Arch exhibits a sharp drop in SNR_A at high values of N ≥ N_max, e.g., SNR_A = 19.6 dB at N = 125, and then drops with further increase in N. A key reason for this trade-off is that σ_ηh^2 decreases while σ_ηe^2 increases as V_WL is reduced (see Table 3), and σ_ηh^2 limits N while σ_ηe^2 limits SNR_a. Thus, by controlling V_WL, we can trade off N_max with SNR_A. Specifically, N_max increases by 2x for every 3 dB drop in SNR_A.

In QS-Arch, the minimum value of B_ADC (see Table 3) depends upon the minimum of: 1) the MPC term (15); 2) the headroom clipping term; and 3) the small-N case, where the BL discharge ΔV_BL has a finite number of discrete levels. Figure 9(b) shows that SNR_T ≈ SNR_A of Fig. 9(a) when B_ADC is greater than the lower bound (circled) in Table 3, for different values of V_WL and N.

5.3 Impact of ADC Precision

Minimizing the column ADC energy is c
