In this paper, we present a hardware architecture optimized for sparse and dense matrix processing in TensorFlow Lite and compatible with embedded heterogeneous devices that integrate CPU and FPGA resources. The FADES (Fused Architecture for DEnse and Sparse matrices) design offers multiple configuration options that trade off parallelism against complexity, and uses a dataflow model to create four stages that read, compute, scale and write results. All stages are designed to support TensorFlow Lite operations, including asymmetric quantized activations, column-major matrix write, per-filter/per-axis bias values and current scaling specifications. The configurable accelerator is integrated with the TensorFlow Lite inference engine running on the ARMv8 processor. We compare performance, power and energy against the state-of-the-art RUY software multiplication library, showing up to 18x acceleration in dense mode and up to 48x in sparse mode. The sparse mode benefits from structural pruning to fully utilize the DSP blocks present in the FPGA device.
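To make the four-stage dataflow concrete, the following is a minimal software sketch of a read, compute, scale and write pipeline for a quantized matrix product. It is illustrative only and does not describe the FADES hardware: all function names are hypothetical, weights are assumed symmetric (zero point of 0), and the scale stage uses floating-point multiplication with Python's `round` rather than the fixed-point multiplier and shift that TensorFlow Lite specifies.

```python
# Illustrative software model of a read / compute / scale / write pipeline.
# Assumptions (not taken from the paper): function names, symmetric weights,
# and float-based requantization instead of fixed-point multiplier + shift.

def read_stage(activations, act_zero_point):
    """Read activations (K x N) and remove the asymmetric zero point."""
    return [[a - act_zero_point for a in row] for row in activations]


def compute_stage(weights, centered_acts, bias):
    """Accumulate weights (F x K) times activations (K x N) plus per-filter bias."""
    n_cols = len(centered_acts[0])
    acc = []
    for f, w_row in enumerate(weights):
        out_row = []
        for n in range(n_cols):
            total = bias[f]
            for k, w in enumerate(w_row):
                if w != 0:  # zero weights contribute nothing and can be skipped
                    total += w * centered_acts[k][n]
            out_row.append(total)
        acc.append(out_row)
    return acc


def scale_stage(acc, scales, out_zero_point):
    """Requantize with per-filter scales, add the output zero point, clamp to int8."""
    return [
        [max(-128, min(127, round(v * scales[f]) + out_zero_point)) for v in row]
        for f, row in enumerate(acc)
    ]


def write_stage(matrix):
    """Flatten the F x N result in column-major order."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


def run_pipeline(weights, activations, bias, scales, act_zero_point, out_zero_point):
    """Chain the four stages in order."""
    centered = read_stage(activations, act_zero_point)
    acc = compute_stage(weights, centered, bias)
    scaled = scale_stage(acc, scales, out_zero_point)
    return write_stage(scaled)


if __name__ == "__main__":
    weights = [[1, 0, -2], [0, 3, 0]]         # 2 filters, K = 3, some zeros
    activations = [[5, 7], [1, -3], [4, 0]]   # K = 3, N = 2
    print(run_pipeline(weights, activations, bias=[10, -5],
                       scales=[0.5, 0.25], act_zero_point=1, out_zero_point=3))
```

In this sketch the zero-weight check in `compute_stage` only hints at how a sparse mode can avoid wasted multiplications; a hardware design that exploits structural pruning would rely on a known sparsity pattern rather than a per-element test.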
Funding: Royal Society Industry Fellowship [INF\192044]; EPSRC HOPWARE [EP040863\1]; Leverhulme Trust International Fellowship "High-performance video analytics with parallel heterogeneous neural networks" [IF-2021-003]