# A Low-Power 8-bit Switched Capacitor Convolution Engine Optimized for Artificial Neural Networks in 65nm CMOS

## Beom Kyu Seo<sup>1</sup> and Jintae Kim<sup>a</sup>

Department of Electronics Engineering, Konkuk University E-mail : <sup>1</sup>bk.seo@msel.konkuk.ac.kr

Abstract - In this paper, we present a study on a neural network operator that performs low-resolution, low-power, and high-efficiency convolution operations in the analog domain. The proposed operator consists of a multiplying DAC (MDAC) with an integrator structure and a successive-approximation ADC (SAR ADC). The memory access frequency is lower than that of a digital implementation because the addition is performed at the same time as the multiplication, and the result is stored in the form of charge on the opamp output terminal. A digital-input, digital-output operator consisting of the MDAC and the ADC was designed in a 65nm CMOS process. Transistor-level simulation shows a power consumption of 30.11uW at 33.3MHz, which is equivalent to 2.21TOPS/W, an improvement in power efficiency over a conventional digital convolution operator.

*Keywords*—Convolutional Neural Network, Deep Learning, Switched-Capacitor

#### I. INTRODUCTION

The advent of deep learning has spurred rapid progress in artificial intelligence. In addition to demonstrating excellent performance in image processing, deep learning has evolved rapidly since it exceeded human recognition rates in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 [1]. Convolution accounts for most of the operations in the convolutional neural networks that underlie deep learning, and GPUs are used to parallelize these operations massively. As a by-product of continuous research to improve the recognition rate, the complexity and computational requirements of neural networks are steadily increasing. For example, AlexNet [2], released in 2012, consisted of only eight hidden layers, whereas GoogLeNet [3], released in 2014, had 22 hidden layers and ResNet [4], released in 2015, had up to 152 hidden layers.

a. Corresponding author; jintae.kim@msel.konkuk.ac.kr

Manuscript Received Jun. 13, 2019, Revised Sep. 23, 2019, Accepted Sep. 30, 2019

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (<u>http://creativecommons.org/licenses/by-nc/3.0</u>) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

The remainder of this paper is organized as follows. Section II describes the overall architecture of the operator and the operation of the analog integrator-based circuit. Section III analyzes power consumption and performance through SPICE simulation of the composite multiplier designed in a 65nm CMOS process. Section IV concludes the paper by summarizing the merits of the proposed circuit.

### II. OVERALL ARCHITECTURE OF CONVOLUTION ENGINE

## A. Circuit Structure

Two multiplying DACs (MDACs) are pipelined as shown in Figure 1-(a). An ADC is added to this structure, which multiplies the data by the weight, to complete the operator. An 8-bit switched-capacitor architecture suitable for low-power arithmetic-unit design is adopted as the analog operation unit, and a relatively simple, low-power inverter-based amplifier is used as the opamp. Inverter-based amplifiers have been used in low-power ADCs, including audio ADCs [9]. Finally, the ADC is a successive-approximation ADC (SAR ADC), which has low power consumption and is well suited to future process scaling [10].



Fig. 1. (a) Operation circuit with 8-bit input and output, (b) Timing diagram of operation circuit

The operation circuit works as follows. In each period of the two-phase non-overlapping clock, a data value is multiplied by its weight and the result is stored in the form of charge at the output of the multiply-accumulate (MAC) DAC. From the second operation onward, each product is added to the previously stored result at the MAC DAC output stage. As shown in Figure 1-(b), the SAR ADC operates once four results computed in this manner have been stored.
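The accumulate-then-convert sequence described above can be sketched behaviorally in Python. This is an idealized model under stated assumptions, not the authors' circuit: the names `accumulate_and_convert` and `MACS_PER_CONVERSION` are hypothetical, floating-point products stand in for stored charge, the ADC is represented only by the moment at which the accumulated value is read out, and clearing the accumulator after each read-out is an assumption.

```python
# Idealized behavioral sketch of the MAC/ADC sequencing (hypothetical names).
MACS_PER_CONVERSION = 4  # the text states the SAR ADC runs after four results are stored


def accumulate_and_convert(data, weights, macs_per_conversion=MACS_PER_CONVERSION):
    """Return the accumulated sums that would be handed to the ADC.

    Each (x, w) pair models one clock period: the product is added to the
    value already held at the integrator output. After every
    `macs_per_conversion` products the sum is read out, and the accumulator
    is cleared for the next group (ASSUMPTION: reset after conversion).
    """
    if len(data) != len(weights):
        raise ValueError("data and weights must have the same length")
    results = []
    acc = 0.0
    for i, (x, w) in enumerate(zip(data, weights), start=1):
        acc += x * w  # multiplication and addition happen in the same step
        if i % macs_per_conversion == 0:
            results.append(acc)  # point at which the ADC would convert
            acc = 0.0
    return results


if __name__ == "__main__":
    xs = [0.5, -0.25, 0.125, 0.5, 0.25, 0.25, -0.5, 0.0]
    ws = [0.5, 0.5, -0.5, 0.25, 0.5, -0.25, 0.5, 0.5]
    print(accumulate_and_convert(xs, ws))
```

Products that do not fill a complete group of four are left in the accumulator and not reported, which mirrors the idea that the ADC only runs once a full group has been stored.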

Figure 2-(a) shows the internal circuit of the multiplying DAC in Figure 1. A pseudo-differential switched-capacitor structure is adopted; it takes the digital input value and applies the corresponding voltage to the amplifier output stage. The unit capacitor $C_u$ used to realize the 8-bit DAC is 748aF, the minimum size provided by the process. The LSB sampling capacitors of the multiplying DACs use two $C_u$ in series, which halves the size of the entire sampling capacitor. As shown in Figure 2-(b), the opamp is implemented as an inverter-based circuit. Its similarity to digital circuits is also an advantage for future process scaling.
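The capacitor sizing can be checked with a short calculation. The sketch below computes the series combination of two 748aF unit capacitors and, as an assumption that is not stated in the text, multiplies the resulting LSB capacitor by $2^8$ units. That count is inferred only because it lands close to the 95.7fF full sampling capacitor quoted in the paper; the names are illustrative.

```python
# Worked arithmetic for the capacitor sizes quoted in the text.
C_UNIT_F = 748e-18  # unit capacitor from the text (748 aF)


def series_capacitance(c1, c2):
    """Equivalent capacitance of two capacitors connected in series."""
    return (c1 * c2) / (c1 + c2)


# Two unit capacitors in series give half a unit: the LSB capacitor.
c_lsb = series_capacitance(C_UNIT_F, C_UNIT_F)

# ASSUMPTION: the array holds 2**8 LSB-sized units in total. This count is
# inferred from the numbers, not stated by the authors.
N_LSB_UNITS = 2 ** 8
c_sample_estimate = N_LSB_UNITS * c_lsb

if __name__ == "__main__":
    print(f"LSB capacitor: {c_lsb * 1e18:.0f} aF")
    print(f"Estimated full sampling capacitor: {c_sample_estimate * 1e15:.1f} fF")
```

Under this assumption the LSB capacitor is 374aF and the array totals roughly 95.7fF, half of what 256 full unit capacitors would require.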

The result of a single multiplication operation, the differential output $V_{DIFF,OUT}$, is

$$V_{DIFF,OUT} = \frac{C_{sample}^2 V_{DD} X W}{C_{feedback}^2} \quad (1)$$

and it appears as a differential voltage at the output terminal of the MAC DAC. The full sampling capacitor $C_{sample}$ of the multiplying DAC is 95.7fF and the feedback capacitor $C_{feedback}$ is 74.8fF. $X$ and $W$ are the normalized values of the input data and the weight, each ranging from -0.5 to +0.5.
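As a numerical illustration of Eq. (1), the following sketch evaluates the differential output for given normalized inputs. The capacitor values come from the text; the supply voltage of 1.2 V is taken from Table I and is assumed here to be the $V_{DD}$ of Eq. (1). The function name `v_diff_out` is hypothetical.

```python
# Numerical evaluation of Eq. (1) with the values quoted in the text.
C_SAMPLE_F = 95.7e-15    # full sampling capacitor (from the text)
C_FEEDBACK_F = 74.8e-15  # feedback capacitor (from the text)
V_DD = 1.2               # ASSUMPTION: supply voltage taken from Table I


def v_diff_out(x, w, c_sample=C_SAMPLE_F, c_feedback=C_FEEDBACK_F, v_dd=V_DD):
    """Differential output voltage of one multiplication per Eq. (1).

    x and w are the normalized data and weight in [-0.5, +0.5].
    """
    for name, value in (("x", x), ("w", w)):
        if not -0.5 <= value <= 0.5:
            raise ValueError(f"{name} must lie in [-0.5, +0.5]")
    gain = (c_sample / c_feedback) ** 2  # squared capacitor ratio
    return gain * v_dd * x * w


if __name__ == "__main__":
    # Largest positive product: both operands at +0.5.
    print(f"{v_diff_out(0.5, 0.5):.4f} V")
    # Opposite signs flip the polarity of the differential output.
    print(f"{v_diff_out(0.5, -0.5):.4f} V")
```

With these values the squared capacitor ratio is about 1.64, so a full-scale product of 0.25 maps to a differential output of roughly 0.49 V.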

## B. SAR ADC

The ADC operates only after all MAC operations have been completed, yet it accounts for about 10% of the total power consumption. Moreover, the ADC is an overhead that arises only when the arithmetic unit is designed in the analog domain; a digital arithmetic unit does not need it.

On the other hand, in an artificial neural network the result of the MAC operation passes through a post-processing function before it is used as the input of the next hidden layer. ReLU, leaky ReLU, and sigmoid are examples of such post-processing functions. To date, the ReLU in Figure 3-(a) has been recognized as the most efficient post-processing function [2]. When an ADC is designed with a differential structure, a difference between the $V_{CM}$ values at the two ends in Figure 4 causes undesirable effects such as offset, and additional techniques may be required to compensate for it [13]. However, if this phenomenon is exploited instead, it produces the output shown in Figure 3-(b) and implements the ReLU function without additional power consumption. The differential output of the CDAC inside the ADC is

$$DAC_{OUTP} - DAC_{OUTN} = 2V_{CMP} - 2V_{CMN} + V_{INN} - V_{INP} \quad (2)$$



Fig. 2. (a) Multiplying DAC internal structure, (b) Inverter based amplifier circuit



Fig. 3. (a) Post-processing function ReLU, (b) Implementation of ReLU function in ADC



Fig. 4. Back-end successive-approximation ADC (SAR ADC)

If $V_{CMP}$ is increased and $V_{CMN}$ is decreased by the same voltage, the nonlinear section of the CDAC output increases in proportion to the difference. Figure 3-(b) shows the SPICE-simulated characteristic of the entire operator when $V_{CMP} = 712$ mV, $V_{CMN} = 332$ mV, and $V_{CM} = 522$ mV. When the analog output of the operator is negative, it is converted to a specific DC value, and the simulation confirms that this value is then digitized. To achieve this, the ADC adopts the asynchronous bottom-plate-sampling SAR ADC [14], in which an offset can be introduced by adjusting the $V_{CM}$ voltage.
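To make the role of the common-mode voltages concrete, the sketch below evaluates Eq. (2) for the voltages quoted above. It only reproduces the algebra of Eq. (2); it does not model how the SAR conversion turns the offset into the ReLU-like characteristic of Figure 3-(b). The function names are hypothetical.

```python
def cdac_differential(v_inp, v_inn, v_cmp, v_cmn):
    """Differential CDAC output per Eq. (2); all voltages in volts."""
    return 2.0 * v_cmp - 2.0 * v_cmn + v_inn - v_inp


def common_mode_offset(v_cmp, v_cmn):
    """Input-independent term of Eq. (2) caused by unequal common modes."""
    return 2.0 * (v_cmp - v_cmn)


if __name__ == "__main__":
    # Common-mode voltages quoted in the text for Figure 3-(b).
    v_cmp, v_cmn = 0.712, 0.332
    print(f"offset term: {common_mode_offset(v_cmp, v_cmn):.3f} V")
    # With equal common modes the offset term vanishes and only
    # V_INN - V_INP remains.
    print(f"no offset:   {common_mode_offset(0.522, 0.522):.3f} V")
```

For the quoted voltages the input-independent term is about 0.76 V, and the two common-mode voltages sit symmetrically around the 522 mV value of $V_{CM}$.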

|                    | This work | [6]    | [16]                  | [5]     | Stratix10 FPGA |
|--------------------|-----------|--------|-----------------------|---------|----------------|
| Process (nm)       | 65        | 28     | 65                    | 65      | 14             |
| Operation method   | Analog    | Analog | Analog                | Digital | Digital        |
| Resolution (bit)   | 8         | 8      | Input: 7<br>Filter: 1 | 16      | 8              |
| Speed (Hz)         | 33.3 M    | 19.2 M | 364 M                 | 250 M   | 920 M          |
| Supply voltage (V) | 1.2       | 1.0    | 1.2                   | 1.17    |                |
| Power (W)          | 30.11 u   | 7.74 u | 380.7 u               | 278 m   |                |
| Efficiency (OPS/W) | 2.21 T    | 9.61 T | 28.1 T                | 302 G   | 400 G          |
| Area $(um^2)$      | 0.092*    | 0.012* | 0.067*                | 16      |                |

TABLE I. Performance comparison table

Efficiency = $\left(\frac{\text{Speed}}{\text{Power}}\right) \times (\text{number of operations in one period})$

\* Direct comparison is difficult because the number of arithmetic cores differs.
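The efficiency figure for this work can be reproduced from the speed and power in Table I. The sketch below takes efficiency in OPS/W as speed divided by power, multiplied by the number of operations per clock period. The value of two operations per period (one multiplication and one addition) is an assumption; it is used because it reproduces the reported 2.21 TOPS/W. The function name is hypothetical.

```python
def efficiency_ops_per_watt(speed_hz, power_w, ops_per_period):
    """Operations per second per watt: (speed / power) * ops per period."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return speed_hz / power_w * ops_per_period


if __name__ == "__main__":
    # Values for "This work" from Table I.
    speed_hz = 33.3e6
    power_w = 30.11e-6
    # ASSUMPTION: one multiplication and one addition per clock period.
    ops_per_period = 2
    eff = efficiency_ops_per_watt(speed_hz, power_w, ops_per_period)
    print(f"{eff / 1e12:.2f} TOPS/W")
```

Dividing speed by power gives operations per joule, which is the same unit as OPS/W, so the result can be compared directly with the Efficiency row of the table.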





#### III. SIMULATION RESULT

In this paper, we verified the actual operation and performance through SPICE simulation. Power consumption was calculated by transistor-level PEX simulation. The arithmetic unit was designed in a 65nm CMOS process, and its layout is shown in Figure 5. The area of the computing core is $0.092 \ \mu\text{m}^2$.

Figure 6 shows the result of 2304 operations on all input data and some filter data from -127 to +127, together with the operation error extracted from the largest output. In the simulation in Figure 6, only the multiplication operation was performed to collect data.

Some computation results and errors show an inverted staircase-type error each time the LSB changes. In this design, two LSB unit capacitors are connected in series in order to reduce the power consumption and the load impedance of the amplifier. The final computation error in the result of Figure 6 is limited to a range of 2 LSB. An error of this size does not affect the final recognition rate in artificial neural network computation; a previous study [6] confirmed that recognition is maintained even with an operation error of 5 LSB.

Fig. 8. Hidden layer data distribution diagram of VGG-F artificial neural network

Figure 8 shows the distribution of the hidden layer coefficient data of VGG-F [15], which is the simplest of the VGGNet artificial neural networks. Figure 7 shows that in the worst case, when the results of multiplication and addition are accumulated 16 times, the error is about 4 LSB. Because the hidden layer coefficient data of an actual artificial neural network follows a distribution close to normal, as shown in Figure 8, the worst-case operation does not occur frequently.
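The argument that near-normal data rarely exercises the worst case can be illustrated with a small Monte Carlo sketch. Everything numerical here is an assumption for illustration only: the standard deviation `sigma`, the 80% threshold, and the sample count are arbitrary and are not derived from VGG-F or from the paper's simulations. The function name is hypothetical.

```python
import random


def fraction_near_worst_case(n_samples=100_000, sigma=0.1, threshold=0.8, seed=0):
    """Estimate how often |x * w| exceeds `threshold` of its maximum value.

    x and w are drawn from a zero-mean normal distribution and clipped to
    the normalized range [-0.5, +0.5], so the largest possible |x * w| is 0.25.
    ASSUMPTION: sigma and threshold are illustrative, not measured values.
    """
    rng = random.Random(seed)  # fixed seed for a repeatable estimate
    max_product = 0.5 * 0.5

    def draw():
        return max(-0.5, min(0.5, rng.gauss(0.0, sigma)))

    hits = 0
    for _ in range(n_samples):
        if abs(draw() * draw()) > threshold * max_product:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    print(f"fraction near worst case: {fraction_near_worst_case():.6f}")
```

With values concentrated near zero, products close to full scale are rare in this model, which is consistent with the qualitative claim that the worst-case accumulation seldom occurs.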

Table I compares the transistor-level PEX simulation results with conventional digital and analog operators. The proposed arithmetic unit is slower than the digital Stratix10 FPGA but has about 5 times higher computation efficiency. In addition, because the conventional analog arithmetic unit [6] uses a more advanced process, the proposed unit has lower power efficiency, but it has a faster computation speed. The unit in [16] has higher speed and efficiency than the arithmetic unit proposed in this paper. However, since its hidden layer data is stored in the SRAM as binary data, its applicable range is limited to relatively simple data sets such as MNIST. The proposed arithmetic unit can be applied to relatively complicated data sets such as CIFAR-10 because 8-bit operation is possible.

http://www.idec.or.kr

#### IV. CONCLUSIONS

In this paper, a low-resolution, high-efficiency artificial neural network computing circuit was designed. The design builds on previous findings that artificial neural networks, which model the human brain, maintain reliable inference accuracy even at low resolution [7], and that at low resolution analog computation is more efficient than digital computation [8]. The proposed arithmetic unit achieves a higher computation speed than the conventional analog arithmetic unit [6]. In addition, the addition within the MAC operation consumes no additional energy, and the unit saves the energy spent on memory access compared with digital arithmetic units [5].

We designed and laid out a digital-input, digital-output 8-bit arithmetic circuit for artificial neural networks in a 65nm CMOS process, achieving a computation speed of 33.3MHz and a computation efficiency of 2.21TOPS/W.

## ACKNOWLEDGMENT

This study was carried out with the support of the Nanomaterial Technology Development Project of the Ministry of Science and Technology (2016M3A7B4909668) and the Industrial Innovation Technology Future Semiconductor Project (10080611) of the Ministry of Industry and Commerce.

#### REFERENCES

- [1] O. Russakovsky, et al., "ImageNet Large Scale Visual Recognition Challenge", International Journal of Computer Vision, Vol. 115, Issue 3, pp. 211-252, December 2015.
- [2] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks", In Advances in Neural Information Processing Systems 25, pp. 1106-1114, 2012.
- [3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions", CVPR, 2015.
- [4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition", arXiv preprint arXiv:1512.03385, 2015.
- [5] Y.H. Chen and J.S. Emer, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks", IEEE Journal of Solid-State Circuits, Vol. 52, pp. 127-138, January 2017.
- [6] D. Bankman and B. Murmann, "An 8-bit, 16 input, 3.2 pJ/op Switched-Capacitor Dot Product Circuit in 28-nm FDSOI CMOS", IEEE Asian Solid-State Circuits Conference, pp. 21-24, November 2016.
- [7] D. Miyashita, S. Kousai, T. Suzuki, J. Deguchi, "Time-Domain Neural Network: A 48.5 TSOp/s/W Neuromorphic Chip Optimized for Deep Learning and CMOS Technology", IEEE Asian Solid-State Circuits Conference, pp. 25-28, November 2016.
- [8] R. Sarpeshkar, "Analog Versus Digital: Extrapolating from Electronics to Neurobiology", IEEE Neural Computation, pp. 1601-1638, October 1998.
- [9] T. Christen, "A 15-bit 140-µW Scalable-Bandwidth Inverter-Based ΔΣ Modulator for a MEMS Microphone With Digital Output", IEEE Journal of Solid-State Circuits, Vol. 48, pp. 1605-1614, July 2013.
- [10] H.W. Shin, J.M. Jeong, T.J. An, J.S. Park, S.H. Lee, "A 0.16mm2 12b 30MS/s 0.18um CMOS SAR ADC Based on Low-Power Composite Switching", Journal of The Institute of Electronics and Information Engineers, Vol. 53, No. 7, pp. 1027-1038, July 2016.
- [11] Y. Chae, G. Han, "Low Voltage, Low Power, Inverter-Based Switched-Capacitor Delta-Sigma Modulator", IEEE Journal of Solid-State Circuits, Vol. 44, pp. 458-472, February 2009.
- [12] J.H. Choi, J.H. Seong, K.S. Yoon, "Design of a Inverter-Based 3rd Order ΔΣ Modulator Using 1.5bit Comparators", Journal of The Institute of Electronics and Information Engineers, Vol. 53, No. 7, pp. 1039-1046, July 2016.
- [13] Y.S. Cho, H.S. Shim, S.H. Lee, "A Non-Calibrated 2x Interleaved 10b 120MS/s Pipeline SAR ADC with Minimized Channel Offset Mismatch", Journal of The Institute of Electronics and Information Engineers, Vol. 52, No. 9, pp. 1631-1641, September 2015.
- [14] C.-C. Liu, S.-J. Chang, G.-Y. Huang, and Y.-Z. Lin, "A 10-bit 50-MS/s SAR ADC With a monotonic capacitor switching procedure", IEEE Journal of Solid-State Circuits, Vol. 45, No. 4, pp. 731-740, March 2010.
- [15] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the Devil in the Details: Delving Deep into Convolutional Nets", arXiv:1405.3531, November 2014.
- [16] A. Biswas, A. P. Chandrakasan, "Conv-RAM: An Energy-Efficient SRAM with Embedded Convolution Computation for Low-Power CNN-Based Machine Learning Applications", ISSCC, pp. 488-490, 2018.



**Beom Kyu Seo** received the B.S. and M.S. degrees in electrical engineering from Konkuk University, Seoul, Korea, in 2018. His research interests include converter circuits and neural networks for the design of convolution engines. He is currently conducting research on low-power convolution engines for image-processing neural networks.



**Jin Tae Kim** received the B.S. degree in Electrical Engineering from Seoul National University, Seoul, Korea, in 1997, and the M.S. and Ph.D. degrees in Electrical Engineering from University of California, Los Angeles, CA, in 2004 and 2008, respectively. He held various industry positions at Barcelona Design, CA, SiTime Corporation, CA, and Agilent Technologies, CA, as a key technical contributor for their high-speed A/D converters and timing IC products. He is currently an Associate Professor in the Electronics Engineering Department at Konkuk University, Seoul, Korea, where he is focusing on low-power mixed-signal IC designs for communication and sensor applications. Dr. Kim is a recipient of the IEEE Solid-State Circuits Predoctoral Fellowship in 2007.