Abstract: Differentially private stochastic gradient descent (DP-SGD) is known to have poorer training and test performance on large neural networks, compared to ordinary stochastic gradient descent (SGD). In this paper, we perform a detailed study and comparison of the two processes and unveil several new insights. By comparing the behavior of the two processes separately in early and late epochs, we find that while DP-SGD makes slower progress in early stages, it is the behavior in the later stages that determines the end result. This separate analysis of the clipping and noise addition steps of DP-SGD shows that while noise introduces errors to the process, gradient descent can recover from these errors when it is not clipped, and clipping appears to have a larger impact than noise. These effects are amplified in higher dimensions (large neural networks), where the loss basin occupies a lower dimensional space. We argue theoretically and using extensive experiments that magnitude pruning can be a suitable dimension reduction technique in this regard, and find that heavy pruning can improve the test accuracy of DP-SGD.
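
For orientation, the clipping and noise addition steps named in the abstract can be written in the commonly used DP-SGD formulation. The symbols are assumptions chosen for illustration rather than the paper's own notation: \( g_i \) is the gradient of sample \( i \), \( B \) the batch size, \( C \) the clipping threshold, \( \sigma \) the noise multiplier, and \( d \) the number of parameters.

\[
\bar{g}_i = g_i \cdot \min\!\left(1, \frac{C}{\lVert g_i \rVert_2}\right), \qquad
\tilde{g} = \frac{1}{B}\left(\sum_{i=1}^{B} \bar{g}_i + z\right), \qquad
z \sim \mathcal{N}\!\left(0, \sigma^2 C^2 I_d\right)
\]

\[
\mathbb{E}\,\lVert z \rVert_2^2 = \sigma^2 C^2 d, \qquad
\Big\lVert \sum_{i=1}^{B} \bar{g}_i \Big\rVert_2 \le B C
\]

Under these assumptions the clipped signal is bounded independently of \( d \), while the expected squared norm of the noise grows linearly in \( d \), which is one way to read the claim that the effects are amplified in higher dimensions.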

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the **poor performance of differentially private stochastic gradient descent (DP-SGD) when training large-scale neural networks**. Compared with ordinary stochastic gradient descent (SGD), DP-SGD achieves worse training and test performance. By studying and comparing the behavior of the two processes in detail, the authors offer several new insights into this performance gap.

#### Main problems and challenges

1. **Impact of the early stage vs the late stage**:
   - Previous research assumed that DP-SGD performs poorly in the early stage of optimization and therefore fails to find a good loss basin. The experiments in this paper indicate instead that the behavior of DP-SGD in the later stage is what determines the end result.
2. **Impact of clipping and noise addition**:
   - Both noise addition and clipping introduce errors, and the effect is amplified in high-dimensional spaces. Gradient descent can recover from the errors caused by noise when gradients are not clipped, and clipping appears to have a larger impact on model performance than noise.
3. **Characteristics of the loss basin in high-dimensional space**:
   - In high-dimensional space, the loss basin occupies a lower-dimensional space, which makes it harder for DP-SGD to find the bottom of the basin and stay there.
4. **Impact of the number of model parameters**:
   - As the number of model parameters increases, the performance of DP-SGD drops significantly because more noise must be added to ensure privacy.

#### Solutions

To address these problems, the authors propose and evaluate several methods:

- **Magnitude pruning**: Reducing the number of model parameters reduces the impact of noise and thereby improves the performance of DP-SGD. Experiments show that heavy pruning can improve the test accuracy of DP-SGD.
- **Phased training strategy**: By dividing the training process into two phases (Phase 1 and Phase 2) and using SGD and DP-SGD for the two phases, the authors find that the training method used in the later phase has a greater impact on the final performance.
- **Theoretical analysis**: By defining a term \( R \) based on the variance of each dimension and the norm of the true gradient, the authors quantitatively analyze the impact of pruning on gradient descent and argue that pruning can reduce the harmful effects of clipping.

### Summary

This paper analyzes the performance problems of DP-SGD when training large-scale neural networks and proposes methods such as pruning to alleviate them, improving the practicality and performance of DP-SGD.
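
The clipping, noise addition, and magnitude pruning steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`clip_per_sample`, `dp_sgd_gradient`, `magnitude_mask`), the Gaussian noise scale `noise_multiplier * clip_norm`, and the choice to prune by a fixed global sparsity level are all hypothetical, and the sketch does not do any privacy accounting.

```python
import numpy as np


def clip_per_sample(grads, clip_norm):
    """Scale each per-sample gradient (one row) to L2 norm at most clip_norm."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Rows already inside the clipping ball are left unchanged (scale = 1).
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return grads * scale


def dp_sgd_gradient(grads, clip_norm, noise_multiplier, rng):
    """Clip per-sample gradients, sum them, add Gaussian noise, then average.

    Assumption: noise standard deviation is noise_multiplier * clip_norm
    per coordinate, as in the commonly used DP-SGD formulation.
    """
    batch_size, dim = grads.shape
    clipped = clip_per_sample(grads, clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=dim)
    return (clipped.sum(axis=0) + noise) / batch_size


def magnitude_mask(weights, sparsity):
    """Boolean mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    flat = np.abs(weights).ravel()
    n_prune = int(round(sparsity * flat.size))
    mask = np.ones(flat.size, dtype=bool)
    if n_prune > 0:
        order = np.argsort(flat, kind="stable")  # smallest magnitudes first
        mask[order[:n_prune]] = False
    return mask.reshape(weights.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=1000)          # hypothetical flattened model
    grads = rng.normal(size=(32, 1000))      # hypothetical per-sample gradients

    mask = magnitude_mask(weights, sparsity=0.9)
    full = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
    pruned = dp_sgd_gradient(grads[:, mask], clip_norm=1.0, noise_multiplier=1.0, rng=rng)
    print(full.shape, pruned.shape)
```

In this sketch the pruned model receives a noise vector with far fewer coordinates while the clipping threshold stays the same, which mirrors the intuition in the summary that reducing the dimension reduces the impact of noise. How the mask is chosen without affecting the privacy guarantee is outside the scope of the example.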

Inference and Interference: The Role of Clipping, Pruning and Loss Landscapes in Differentially Private Stochastic Gradient Descent

Improving Differentially Private SGD via Randomly Sparsified Gradients

DP-SGD with weight clipping

Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach

A(DP)$^2$SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent with Differential Privacy

DPDR: Gradient Decomposition and Reconstruction for Differentially Private Deep Learning

Differential Privacy Meets Neural Network Pruning

Implicit Bias in Noisy-SGD: With Applications to Differentially Private Training

Enhancing DP-SGD through Non-monotonous Adaptive Scaling Gradient Weight

Towards Efficient and Scalable Training of Differentially Private Deep Learning

Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex Optimization

Equivariant Differentially Private Deep Learning: Why DP-SGD Needs Sparser Models

Pre-Pruning and Gradient-Dropping Improve Differentially Private Image Classification

Clip Body and Tail Separately: High Probability Guarantees for DPSGD with Heavy Tails

Privacy Loss of Noisy Stochastic Gradient Descent Might Converge Even for Non-Convex Losses

Batch Clipping and Adaptive Layerwise Clipping for Differential Private Stochastic Gradient Descent

Gradients Look Alike: Sensitivity is Often Overestimated in DP-SGD

Differentially Private Learning with Per-Sample Adaptive Clipping

Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent