Artificial Intelligence (AI) is no longer confined to high-performance computing centers. It has ventured into the compact, resource-constrained world of embedded systems, thanks to innovations in model optimization and hardware capabilities. A recent study showcases how a self-supervised audio spectrogram transformer (SSAST) can efficiently operate on a low-power NVIDIA Jetson Orin Nano System-on-Chip (SoC), paving the way for AI’s integration into devices like IoT gadgets, wearable tech, and more.
Transformers Meet Embedded Systems
Transformers, like the SSAST, are known for their prowess in natural language processing and audio recognition tasks. However, their deployment on embedded systems has been held back by the heavy compute and memory demands of inference. This study breaks new ground by demonstrating how to run these models efficiently on low-power GPUs, highlighting optimization techniques that keep resource usage low without compromising performance.
Key Optimizations for Efficiency
- Batch Size Tuning: Larger batch sizes sharply reduce per-sample inference time and energy consumption, but they must be managed carefully to stay within the device's memory limits. In this study, a batch size of 16 struck the best balance among time, energy, and memory.
- Model Compilation with TensorRT: NVIDIA’s TensorRT framework delivered accelerated inference through precision-optimized kernels and reduced memory overhead. Compiled models ran up to 2x faster than their non-compiled counterparts (a minimal compilation sketch follows this list).
- Precision Reduction: Lowering data precision to half-precision floating point (FP16) or even 8-bit integers yielded significant gains in speed and energy efficiency. Accuracy degradation was negligible, with less than 1% loss observed under 8-bit post-training quantization.
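The paper does not publish its deployment scripts, so the snippet below is only a minimal sketch of an FP16 Torch-TensorRT compilation flow on a Jetson-class device; the torchvision ViT stands in for the actual SSAST checkpoint, and the input shape and batch size are illustrative assumptions.

```python
import torch
import torch_tensorrt  # NVIDIA's Torch-TensorRT bridge, shipped in Jetson PyTorch containers
from torchvision.models import vit_b_16

# Stand-in transformer: the study's SSAST checkpoint is not available here, so a torchvision
# ViT is used purely to illustrate the compilation flow.
model = vit_b_16().eval().cuda().half()

# Batch of 16 inputs, matching the batch size the study found to balance time, energy, and memory.
example = torch.randn(16, 3, 224, 224, device="cuda", dtype=torch.half)

# Compile with TensorRT, permitting FP16 precision-optimized kernels.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input(example.shape, dtype=torch.half)],
    enabled_precisions={torch.half},
)

with torch.no_grad():
    logits = trt_model(example)  # accelerated inference on the compiled engine
```

INT8 deployment follows the same pattern, except that TensorRT additionally needs a small calibration dataset (or a quantization-aware checkpoint) to choose the scaling factors.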
Experimental Insights
The team used the Google Speech Commands Dataset to test the SSAST model on the Jetson Orin Nano SoC, which houses a six-core ARM CPU and an NVIDIA Ampere GPU. Results showed that:
- The GPU was 6x faster than the CPU for single-sample inference and up to 32x faster for batched inputs (a minimal timing sketch follows this list).
- Compiled models consistently reduced energy consumption and inference time, even at larger batch sizes.
- Memory utilization increased with batch size, but the optimized configurations left enough headroom for other tasks running on the device.
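The article does not include the measurement harness, but a rough CPU-versus-GPU latency comparison of the kind reported can be reproduced with a short PyTorch loop; the model and input shapes below are stand-ins, and on a Jetson board energy is typically read separately with tegrastats or the jetson-stats (jtop) utility rather than from Python.

```python
import time
import torch
from torchvision.models import vit_b_16  # stand-in for the SSAST model

def avg_latency(model, batch, repeats=30):
    """Average per-batch latency after a warm-up, synchronizing CUDA where needed."""
    model.eval()
    with torch.no_grad():
        for _ in range(5):              # warm-up so one-time initialization is excluded
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()    # wait for queued GPU kernels before timing
        start = time.perf_counter()
        for _ in range(repeats):
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

model = vit_b_16()
for batch_size in (1, 4, 16):
    x = torch.randn(batch_size, 3, 224, 224)
    cpu_t = avg_latency(model.cpu(), x)
    gpu_t = avg_latency(model.cuda(), x.cuda())
    print(f"batch {batch_size:2d}: CPU {cpu_t*1e3:7.1f} ms | GPU {gpu_t*1e3:7.1f} ms | "
          f"speedup {cpu_t / gpu_t:4.1f}x")
```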
Real-World Applications
These findings open the door to deploying AI in areas like:
- Healthcare: Wearable devices for real-time health monitoring.
- IoT: Smart home systems capable of real-time voice recognition.
- Automotive: Low-power AI solutions for autonomous vehicle features.
Looking Ahead
The study sets the stage for broader adoption of AI on embedded systems. Future work will explore:
- Quantization-Aware Training (QAT) to further minimize accuracy loss during precision reduction (see the sketch after this list).
- Deployment of larger models, including Large Language Models (LLMs), on edge devices using tools like TensorRT-LLM.
- Utilization of advanced hardware like the Jetson Xavier for even greater performance.
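The paper leaves QAT for future work, so the following is only a generic sketch of PyTorch's eager-mode quantization-aware training flow on a toy stand-in network, not the authors' pipeline; the class count and backend choice are assumptions.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (QuantStub, DeQuantStub,
                                   get_default_qat_qconfig, prepare_qat, convert)

class TinyAudioClassifier(nn.Module):
    """Toy stand-in network; a real SSAST would need per-module qconfig handling."""
    def __init__(self, num_classes=35):           # 35 keyword classes in Speech Commands
        super().__init__()
        self.quant = QuantStub()                  # marks where tensors enter the quantized domain
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, num_classes),
        )
        self.dequant = DeQuantStub()              # back to float for the loss / softmax

    def forward(self, x):
        return self.dequant(self.features(self.quant(x)))

model = TinyAudioClassifier()
model.qconfig = get_default_qat_qconfig("qnnpack")  # ARM-friendly backend; use "fbgemm" on x86
model.train()
prepare_qat(model, inplace=True)                  # insert fake-quantization observers

# ... run the usual fine-tuning loop here so the weights adapt to quantization noise ...

model.eval()
int8_model = convert(model)                       # materialize the INT8 model for deployment
```

Because the fake-quantization observers expose the network to quantization noise during fine-tuning, QAT can recover much of the accuracy that plain post-training quantization would otherwise lose.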
Conclusion
This research shows that cutting-edge transformer models can thrive in low-power environments, balancing time, energy, and memory efficiency without sacrificing accuracy. By leveraging optimization techniques such as TensorRT compilation and precision reduction, we can bring the transformative power of AI to embedded systems, driving innovation across industries.
Article derived from: Martin-Salinas, I., Badia, J.M., Valls, O. et al. Evaluating and accelerating vision transformers on GPU-based embedded edge AI systems. J Supercomput 81, 349 (2025). https://doi.org/10.1007/s11227-024-06807-1
Check out the cool NewsWade YouTube video about this article!