This post contains my personal notes from the course Efficient Computing of Deep Neural Networks (04835640, Summer 2025, at Peking University). The course is instructed by Professor Bei Yu of The Chinese University of Hong Kong, who teaches an identically named formal course there (CMSC5743, Fall 2024). Copyright for all relevant figures belongs to the lecturer.
Overview
In contemporary deep neural network (DNN) applications, real-time online inference places strict speed demands on many models (e.g., ~10 fps, or a latency budget of ~60 ms per frame; a rough sketch of what such a budget implies follows the list below). In this course, we discuss effective methods for accelerating inference by reducing both the number of computational operations and the memory burden. Specifically, our discussion is organized into two branches, the model level (Mo) and the implementation level (Im):
Model level
Mo1: Pruning
Mo2: Decomposition
Mo3: Quantization
Mo4: BNN (Binary Neural Networks)
Mo5: KD (Knowledge Distillation)
Mo6: NAS (Neural Architecture Search)
Implementation level
Im1: GEMM (General Matrix Multiplication)
Im2: Direct Convolution
Im3: Winograd Convolution
Im4: Sparse Convolution
Im5: CUDA
Im6: TVM
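To make the speed demand above concrete, here is a minimal back-of-the-envelope sketch (my own illustration, not from the course) that counts the multiply-accumulate (MAC) operations of a single dense convolution layer; the layer shape is hypothetical. At billions of MACs for just one layer, it becomes clear why both model-level reductions (pruning, quantization, ...) and implementation-level optimizations (GEMM, Winograd, ...) are needed to stay within a tens-of-milliseconds budget.

```python
# Back-of-the-envelope MAC count for one standard (dense) 2D convolution layer.
# All shapes below are hypothetical, chosen to resemble a typical CNN backbone.

def conv2d_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """MACs for a dense 2D convolution with a k x k kernel.

    Each of the h_out * w_out output positions, for each of the c_out output
    channels, needs a dot product over c_in * k * k input values.
    """
    return c_in * c_out * k * k * h_out * w_out

# A single 3x3 conv: 256 -> 256 channels on a 56x56 feature map.
macs = conv2d_macs(c_in=256, c_out=256, k=3, h_out=56, w_out=56)
print(f"{macs / 1e9:.2f} GMACs")  # ~1.85 GMACs for this one layer alone
```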