Authors: Kao, Chen Chien; Hsieh, Yi Yen; Chen, Chao Hung; Yang, Chia-Hsiang
Date Available: 2023-06-15
Issue Date: 2022-01-01
ISBN: 9781665402798
15483746
URI: https://scholars.lib.ntu.edu.tw/handle/123456789/632697
Abstract: A tensor is a multi-dimensional array, a core data structure in neural networks. The multiply-accumulate (MAC) operations involved in a large-scale tensor introduce high computational complexity. Since such a tensor usually features a low rank, the computational complexity can be greatly reduced through canonical polyadic decomposition (CPD). This work presents an energy-efficient hardware accelerator that implements randomized CPD of large-scale tensors for neural network compression. A mixing method that combines the Walsh-Hadamard transform and the discrete cosine transform is proposed to replace the fast Fourier transform, with faster convergence. It reduces the computations for the transformation by 83% and those for solving the required least-squares problem by 75%. The proposed accelerator is flexible, supporting tensor decomposition for sizes of up to 512×512×9×9. Compared to the prior dedicated processor for tensor computation, this work supports larger tensors and achieves 112× lower latency under the same conditions.
Keywords: canonical polyadic decomposition | hardware acceleration | neural network compression | tensor decomposition
[SDGs]: SDG7
Title: Hardware Acceleration in Large-Scale Tensor Decomposition for Neural Network Compression
Type: conference paper
DOI: 10.1109/MWSCAS54063.2022.9859440
Scopus: 2-s2.0-85137476085 (https://api.elsevier.com/content/abstract/scopus_id/85137476085)
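A minimal sketch of why CPD compresses a neural-network tensor: a rank-R CP factorization stores one R-column factor matrix per mode instead of the full dense tensor. The tensor size 512×512×9×9 is the maximum the accelerator supports per the abstract; the rank R = 32 is an assumed illustrative value, not taken from the paper.

```python
import numpy as np

# Assumed illustrative rank; the paper does not state a specific R here.
dims = (512, 512, 9, 9)
R = 32

dense_params = int(np.prod(dims))  # entries of the full dense tensor
cp_params = R * sum(dims)          # one R-column factor matrix per mode

compression = dense_params / cp_params
print(dense_params, cp_params, round(compression, 1))
```

The parameter count drops from about 21.2 million to about 33 thousand, which is why the MAC workload of inference shrinks correspondingly once the layer is evaluated through its factors.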
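The abstract's mixing method combines the Walsh-Hadamard transform (WHT) with the discrete cosine transform to replace the FFT; one reason the WHT is cheap in hardware is that its matrix has only ±1 entries, so it needs additions and subtractions rather than complex multiplications. A hedged sketch of the WHT alone, via the standard Sylvester construction (the actual mixing scheme is detailed in the paper, not reproduced here):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Hadamard matrix of order n (n a power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])  # H_{2n} = [[H, H], [H, -H]]
    return H

H8 = hadamard(8)
x = np.arange(8.0)
y = H8 @ x               # forward WHT (unnormalized); only +/- of inputs
x_back = (H8 @ y) / 8    # H is symmetric and H @ H = n * I, so this inverts
print(np.allclose(x_back, x))
```

The round trip recovers the input exactly because the Sylvester Hadamard matrix satisfies H·H = n·I.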