Graph neural networks (GNNs) combine sparse and dense data compute requirements that are challenging to meet in resource-constrained embedded hardware. In this paper, we investigate a dataflow-of-dataflows architecture that optimizes data access and processing-element utilization. The architecture is described with high-level synthesis and offers multiple configuration options, including the number of independent hardware threads, the interface data width, and the number of compute units per thread. Each hardware thread uses a fine-grained dataflow to stream words with a bit-width that depends on the network precision, while a coarse-grained dataflow links the thread stages, streaming partially computed matrix tiles. The accelerator is mapped to the programmable logic of a Zynq UltraScale device whose processing system runs PyTorch extended with PYNQ overlays. Results on the citation network datasets show a performance gain of up to 140x for multi-threaded hardware configurations compared with the optimized software implementation available in PyTorch. The results also show that the embedded hardware is competitive with other high-performance state-of-the-art hardware accelerators.
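To make the dataflow-of-dataflows structure concrete, below is a minimal HLS C++ sketch of one hardware thread, assuming standard Vitis HLS conventions (hls::stream, the DATAFLOW and PIPELINE pragmas). All identifiers (feat_t, TILE, aggregate, transform, gnn_thread) and the tile size are illustrative assumptions, not taken from the paper: each stage is a pipelined loop forming the fine-grained word-level stream, while the DATAFLOW region composes the stages into the coarse-grained flow of partially computed tiles.

    #include <hls_stream.h>
    #include <ap_fixed.h>

    typedef ap_fixed<16, 6> feat_t;  // word width tracks network precision (assumed 16-bit)
    static const int TILE = 32;      // tile edge length (illustrative)

    // Fine-grained dataflow, stage 1: stream one feature word per cycle.
    static void aggregate(hls::stream<feat_t> &in, hls::stream<feat_t> &mid) {
        for (int i = 0; i < TILE * TILE; i++) {
    #pragma HLS PIPELINE II=1
            feat_t w = in.read();
            mid.write(w);            // placeholder for sparse neighbor accumulation
        }
    }

    // Fine-grained dataflow, stage 2: consume the partial tile as it arrives.
    static void transform(hls::stream<feat_t> &mid, hls::stream<feat_t> &out) {
        for (int i = 0; i < TILE * TILE; i++) {
    #pragma HLS PIPELINE II=1
            feat_t w = mid.read();
            out.write(w);            // placeholder for dense weight multiplication
        }
    }

    // Coarse-grained dataflow: both stages run concurrently, linked by a
    // stream carrying partially computed matrix-tile elements.
    void gnn_thread(hls::stream<feat_t> &in, hls::stream<feat_t> &out) {
    #pragma HLS DATAFLOW
        hls::stream<feat_t> mid("tile_stream");
    #pragma HLS STREAM variable=mid depth=64
        aggregate(in, mid);
        transform(mid, out);
    }

Under this reading, the multi-threaded configurations reported in the paper would correspond to replicating gnn_thread instances in the programmable logic, with the interface data width and the number of compute units per stage as further build-time parameters.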
Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation