Runtime Monitoring of ML-Based Scheduling Algorithms Toward Robust Domain-Specific SoCs

A. Alper Goksoy,Alish Kanani,Satrajit Chatterjee,Umit Ogras
DOI: https://doi.org/10.1109/tcad.2024.3445815
IF: 2.9
2024-11-09
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract:Machine learning (ML) algorithms are being rapidly adopted to perform dynamic resource management tasks in heterogeneous system on chips. For example, ML-based task schedulers can make quick, high-quality decisions at runtime. Like any ML model, these offline-trained policies depend critically on the representative power of the training data. Hence, their performance may diminish or even catastrophically fail under unknown workloads, especially new applications. This article proposes a novel framework to continuously monitor the system to detect unforeseen scenarios using a gradient-based generalization metric called coherence. The proposed framework accurately determines whether the current policy generalizes to new inputs. If not, it incrementally trains the ML scheduler to ensure the robustness of the task-scheduling decisions. The proposed framework is evaluated thoroughly with a domain-specific SoC and six real-world applications. It can detect whether the trained scheduler generalizes to the current workload with 88.75%–98.39% accuracy. Furthermore, it enables faster execution time when the scheduler is incrementally trained. Finally, overhead analysis performed on an Nvidia Jetson Xavier NX board shows that the proposed framework can run as a real-time background task.
engineering, electrical & electronic,computer science, interdisciplinary applications, hardware & architecture
What problem does this paper attempt to address?