Edge AI for Real-Time Vision: Challenges, Limitations, and Future Research Directions
DOI: https://doi.org/10.26389/
Keywords: Edge AI, real-time vision, lightweight CNNs, vision transformers (ViT), hybrid CNN–ViT, quantization, pruning, distillation, compiler optimization, NPU, TPU, FPGA, microcontroller
Abstract
This survey synthesizes methods that enable real-time computer vision on edge hardware under tight latency, energy, and memory constraints. We conducted a systematic review of 150+ studies (2018–2025) spanning lightweight CNNs and ViTs, compression (pruning, quantization, distillation), compiler/runtime optimization, hardware acceleration (NPUs/TPUs/FPGAs), and edge–cloud collaboration. Across comparable settings, INT8 quantization typically yields 2–4× higher throughput and 2–5× lower energy than FP32; representative mobile backbones (e.g., MobileNetV3-L) achieve millisecond-level latency with competitive accuracy on NPUs; and hybrid CNN–ViT models offer a ~15–20% better accuracy–latency balance than pure CNN or ViT baselines when compiler fusion is effective. We also document trade-offs where preprocessing/postprocessing can account for 20–60% of end-to-end time, and cases where compression underperforms due to operator support gaps. Our unique contribution is a unified taxonomy aligned to practical deployment choices (model class × optimization × hardware), plus prescriptive “when-to-use-what” recommendations for mobile, embedded, and micro-edge targets. Recommendations: prefer INT8 with hardware-supported ops; pair hybrids with fusion-aware toolchains; budget for pre/post-processing; and consider split inference for heavy workloads.
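To make the "prefer INT8 with hardware-supported ops" recommendation concrete, the sketch below shows post-training INT8 quantization with the TensorFlow Lite converter. It is a minimal illustration, not a procedure from the surveyed studies: the MobileNetV3-L backbone matches the example cited in the abstract, but the random calibration data and output file name are placeholders chosen so the sketch runs end to end.

```python
import numpy as np
import tensorflow as tf

# A representative mobile backbone; the abstract cites MobileNetV3-L.
# weights=None keeps the sketch self-contained (no download needed).
model = tf.keras.applications.MobileNetV3Large(weights=None)

def representative_dataset():
    # A few hundred real calibration images are typical in practice;
    # random tensors stand in here so the example is runnable as-is.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]        # enable quantization
converter.representative_dataset = representative_dataset  # calibrate activation ranges
# Restrict conversion to full-integer builtin kernels so every layer maps
# onto the accelerator's INT8 datapath; conversion fails loudly on any op
# without INT8 support, surfacing exactly the operator-support gaps where
# compression can underperform.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```

Forcing TFLITE_BUILTINS_INT8 (rather than allowing float fallback) is what keeps the whole graph on the NPU's integer path; when the conversion errors out, the offending operator is a candidate for replacement or for the split-inference fallback discussed above.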
License
Copyright (c) 2025 The Arab Institute for Science and Research Publishing (AISRP). This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.