Abstract: Federated Learning (FL) enables multiple devices to collaboratively train a shared model while preserving data privacy. Ever-increasing model complexity, coupled with the limited memory of participating devices, severely bottlenecks the deployment of FL in real-world scenarios. A framework that can effectively break the memory wall while jointly accounting for the hardware and statistical heterogeneity in FL is therefore urgently required. In this paper, we propose SmartSplit, a framework that effectively reduces the memory footprint on the device side while guaranteeing the training progress and model accuracy for heterogeneous FL through model splitting. Towards this end, SmartSplit employs a hierarchical structure to adaptively guide the overall training process. In each training round, the central manager, hosted on the server, dynamically selects the participating devices and sets the cutting layer by jointly considering the memory budget, training capacity, and data distribution of each device. The MEC manager, deployed within the edge server, proceeds to split the local model and trains the server-side portion. Meanwhile, it fine-tunes the splitting points based on the time-evolving statistical importance. The on-device manager, embedded inside each mobile device, continuously monitors the local training status while employing cost-aware checkpointing to match the runtime dynamic memory budget. Extensive experiments on representative datasets are conducted on commercial off-the-shelf mobile device testbeds. The experimental results show that SmartSplit excels in FL training on highly memory-constrained mobile SoCs, offering up to a 94% peak latency reduction and 100-fold memory savings. It improves accuracy by 1.49%-57.18% and adaptively adjusts to dynamic memory budgets through cost-aware recomputation.
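
As a rough illustration of only the memory-budget part of the cutting-layer decision, the sketch below keeps as many leading layers on the device as fit within its budget and leaves the rest to the server side. The abstract does not specify the actual selection algorithm, which also weighs training capacity and data distribution; the function name `pick_cut_layer`, the per-layer memory estimates, and the budget values are all hypothetical.

```python
from itertools import accumulate


def pick_cut_layer(layer_mem_mb, budget_mb):
    """Return how many leading layers fit within the device memory budget.

    layer_mem_mb: hypothetical per-layer training memory estimates (MB).
    budget_mb:    hypothetical device memory budget (MB).
    A return value of 0 means no layer fits on the device.
    """
    cut = 0
    # Walk the running total of memory and stop at the first layer
    # that would push the device over its budget.
    for i, used in enumerate(accumulate(layer_mem_mb), start=1):
        if used > budget_mb:
            break
        cut = i
    return cut


# Assumed (made-up) per-layer costs for a five-layer model.
layers = [40, 60, 120, 200, 80]
print(pick_cut_layer(layers, 250))  # layers 1-3 use 220 MB, so the cut is 3
```

A greedy prefix rule like this captures only the memory constraint; a scheme in the spirit of the abstract would further adjust the cut per round as budgets and statistical importance change.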
