This paper presents a novel tensor-wise post-training quantization flow suitable for the sparse and dense tensors present in Graph Convolutional Networks (GCNs), a popular type of graph neural network. The quantization approach employs Kullback-Leibler (KL) divergence and range analysis at tensor granularity to address the distinct sources of quantization error in GCNs; its attractiveness lies in requiring neither retraining nor access to the full training dataset. Evaluated on popular citation datasets, our method is competitive in accuracy with the original floating-point model and with quantization-aware training (QAT) approaches tailored for INT8 and INT4 precision, even though QAT has significant potential for precision optimization at the cost of retraining. The resulting quantized and sparse tensors are consumed by a hardware overlay accelerator generated with high-level synthesis (HLS) and integrated into PyTorch. Results on the Zynq Z7020 device of the PYNQ-Z2 board show an 11x speedup over the integrated CPU and a 4.2x reduction in memory consumption.
Funding Agencies: Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation
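To make the KL-based calibration idea concrete, the sketch below shows one common way to pick a symmetric clipping threshold for INT8 post-training quantization by minimizing the KL divergence between a tensor's value histogram and its quantized counterpart (in the spirit of entropy calibration). This is a minimal illustrative assumption, not the paper's actual flow: the function name, histogram parameters, and tie-breaking details are all hypothetical.

```python
import numpy as np

def kl_calibrate_scale(values, num_bits=8, num_bins=2048):
    """Hypothetical KL-divergence calibration for symmetric quantization.

    Searches over clipping thresholds and returns the quantization scale
    (float units per integer step) whose clipped-and-quantized histogram
    is closest, in KL divergence, to the original value distribution.
    """
    num_levels = 2 ** (num_bits - 1)  # e.g. 128 levels for symmetric INT8
    hist, edges = np.histogram(np.abs(values), bins=num_bins)
    best_kl, best_threshold = np.inf, edges[-1]

    for i in range(num_levels, num_bins + 1):
        # Reference distribution P: clip all mass beyond bin i into the last bin.
        p = hist[:i].astype(np.float64).copy()
        p[-1] += hist[i:].sum()

        # Candidate distribution Q: collapse the i bins into num_levels
        # quantized bins, then expand back so P and Q share the same support.
        q = np.zeros(i, dtype=np.float64)
        step = i / num_levels
        for level in range(num_levels):
            lo, hi = int(level * step), int((level + 1) * step)
            chunk = hist[lo:hi].astype(np.float64)
            nonzero = chunk > 0
            if nonzero.any():
                q[lo:hi][nonzero] = chunk.sum() / nonzero.sum()

        # KL(P || Q) over bins where both are nonzero (a simple stand-in
        # for the smoothing a production calibrator would apply).
        mask = (p > 0) & (q > 0)
        if not mask.any():
            continue
        p_n, q_n = p / p.sum(), q / q.sum()
        kl = np.sum(p_n[mask] * np.log(p_n[mask] / q_n[mask]))
        if kl < best_kl:
            best_kl, best_threshold = kl, edges[i]

    return best_threshold / (num_levels - 1)

# Example: calibrate a scale for a synthetic heavy-tailed activation tensor.
acts = np.random.laplace(scale=0.5, size=100_000)
scale = kl_calibrate_scale(acts, num_bits=8)
q_acts = np.clip(np.round(acts / scale), -128, 127).astype(np.int8)
```

A plain range-analysis alternative would instead set the threshold directly from the observed min/max; the KL search trades a wider representable range for lower quantization error on the bulk of the distribution, which is why the two are suited to different tensors.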