Publications
I became too busy to update this list regularly. Please check my
Google Scholar Profile to see what I have been working on recently.
Refereed Publications
- Simultaneous Training of Partially Masked Neural NetworksAmirkeivan MohtashamiMartin JaggiSebastian U. StichAbstractAISTATS2022
For deploying deep learning models to lower end devices, it is necessary to train less resource-demanding variants of state-of-the-art architectures. This does not eliminate the need for more expensive models as they have a higher performance. In order to avoid training two separate models, we show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split-off from the trained full network with remarkable good performance. We extend on prior methods that focused only on core networks of smaller width, while we focus on supporting arbitrary core network architectures. Our proposed training scheme switches consecutively between optimizing only the core part of the network and the full one. The accuracy of the full model remains comparable, while the core network achieves better performance than when it is trained in isolation. In particular, we show that training a Transformer with a low-rank core gives a low-rank model with superior performance than when training the low-rank model alone. We analyze our training scheme theoretically, and show its convergence under assumptions that are either standard or practically justified. Moreover, we show that the developed theoretical framework allows analyzing many other partial training schemes for neural networks.
- The Peril of Popular Deep Learning Uncertainty Estimation MethodsYehao LiuMatteo PagliardiniTatjana ChavdarovaSebastian U. StichAbstractCodeNeurIPS 2021 Workshop: Bayesian Deep Learning2021
Uncertainty estimation (UE) techniques -- such as the Gaussian process (GP), Bayesian neural networks (BNN), Monte Carlo dropout (MCDropout) -- aim to improve the interpretability of machine learning models by assigning an estimated uncertainty value to each of their prediction outputs. However, since too high uncertainty estimates can have fatal consequences in practice, this paper analyzes the above techniques.
Firstly, we show that GP methods always yield high uncertainty estimates on out of distribution (OOD) data. Secondly, we show on a 2D toy example that both BNNs and MCDropout do not give high uncertainty estimates on OOD samples. Finally, we show empirically that this pitfall of BNNs and MCDropout holds on real world datasets as well. Our insights (i) raise awareness for the more cautious use of currently popular UE methods in Deep Learning, (ii) encourage the development of UE methods that approximate GP-based methods -- instead of BNNs and MCDropout, and (iii) our empirical setups can be used for verifying the OOD performances of any other UE method.
- An Improved Analysis of Gradient Tracking for Decentralized Machine LearningAnastasia KoloskovaTao LinSebastian U. StichAbstractPosterNeurIPS342021
We consider decentralized machine learning over a network where the training data is distributed across agents, each of which can compute stochastic model updates on their local data. The agent's common goal is to find a model that minimizes the average of all local loss functions. While gradient tracking (GT) algorithms can overcome a key challenge, namely accounting for differences between workers' local data distributions, the known convergence rates for GT algorithms are not optimal with respect to their dependence on the mixing parameter (related to the spectral gap of the connectivity matrix).
We provide a tighter analysis of the GT method in the stochastic strongly convex, convex and non-convex settings. We improve the dependency on p from p to pc in the noiseless case and from p to pc in the general stochastic case, where c>p is related to the negative eigenvalues of the connectivity matrix (and is a constant in most practical applications). This improvement was possible due to a new proof technique which could be of independent interest.
- RelaySum for Decentralized Deep Learning on Heterogeneous DataThijs VogelsLie HeAnastasia KoloskovaTao LinSai Praneeth R. KarimireddySebastian U. StichMartin JaggiAbstractCodeNeurIPS342021
In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers. A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions. To tackle this challenge, we introduce the RelaySum mechanism for information propagation in decentralized learning. RelaySum uses spanning trees to distribute information exactly uniformly across all workers with finite delays depending on the distance between nodes. In contrast, the typical gossip averaging mechanism only distributes data uniformly asymptotically while using the same communication volume per step as RelaySum. We prove that RelaySGD, based on this mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data.
- Breaking the centralized barrier for cross-device federated learning (aka MIME)Sai Praneeth R. KarimireddyMartin JaggiSatyen KaleMehryar MohriSashank J. ReddiSebastian U. StichAnanda T. SureshAbstractCodeNeurIPS342021
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon. In fact, obtaining an algorithm for FL which is uniformly better than simple centralized training has been a major open problem thus far. In this work, we propose a general algorithmic framework, Mime, which i) mitigates client drift and ii) adapts arbitrary centralized optimization algorithms such as momentum and Adam to the cross-device federated learning setting. Mime uses a combination of control-variates and server-level statistics (e.g. momentum) at every client-update step to ensure that each local update mimics that of the centralized method run on iid data. We prove a reduction result showing that Mime can translate the convergence of a generic algorithm in the centralized setting into convergence in the federated setting. Further, we show that when combined with momentum based variance reduction, Mime is provably faster than any centralized method--the first such result. We also perform a thorough experimental exploration of Mime's performance on real world datasets.
- Semantic Perturbations with Normalizing Flows for Improved GeneralizationOğuz Kaan YükselSebastian U. StichMartin JaggiTatjana ChavdarovaAbstractICCV (extended abstract in: ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models)2021
Data augmentation is a widely adopted technique for avoiding overfitting when
training deep neural networks. However, this approach requires domain-specific
knowledge and is often limited to a fixed set of hard-coded transformations.
Recently, several works proposed to use generative models for generating
semantically meaningful perturbations to train a classifier. However, because
accurate encoding and decoding are critical, these methods, which use
architectures that approximate the latent-variable inference, remained limited
to pilot studies on small datasets. Exploiting the exactly reversible encoder-decoder structure of normalizing
flows, we perform on-manifold perturbations in the latent space to define fully
unsupervised data augmentations. We demonstrate that such perturbations match
the performance of advanced data augmentation techniques -- reaching 96.6% test
accuracy for CIFAR-10 using ResNet-18 and outperform existing methods,
particularly in low data regimes -- yielding 10--25% relative improvement of
test accuracy from classical training. We find that our latent adversarial
perturbations adaptive to the classifier throughout its training are most
effective, yielding the first test accuracy improvement results on real-world
datasets -- CIFAR-10/100 -- via latent-space perturbations.
- Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous DataTao LinSai Praneeth R. KarimireddySebastian U. StichMartin JaggiAbstractVideoICML2021
Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks.
In realistic learning scenarios, the presence of heterogeneity across different clients' local datasets poses an optimization
challenge and may severely deteriorate the generalization performance.
In this paper, we investigate and identify the limitation of several decentralized optimization algorithms for different degrees
of data heterogeneity. We propose a novel momentum-based method to mitigate this decentralized training difficulty. We show in extensive
empirical experiments on various CV/NLP datasets (CIFAR-10, ImageNet, AG News, and SST2) and several network topologies (Ring and
Social Network) that our method is much more robust to the heterogeneity of clients' data than other existing methods, by a
significant improvement in test performance (1%−20%).
- Consensus Control for Decentralized Deep LearningLingjing KongTao LinAnastasia KoloskovaMartin JaggiSebastian U. StichAbstractVideoICML2021
Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters. Experiments in earlier works reveal that, even in a data-center setup, decentralized training often suffers from the degradation in the quality of the model: the training and test performance of models trained in a decentralized fashion is in general worse than that of models trained in a centralized fashion, and this performance drop is impacted by parameters such as network size, communication topology and data partitioning.
We identify the changing consensus distance between devices as a key parameter to explain the gap between centralized and decentralized training. We show in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as the centralized counterpart. We empirically validate that the relation between generalization performance and consensus distance is consistent with this theoretical observation. Our empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop. To this end, we propose practical training guidelines for the data-center setup as the important first step.
- Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous UpdatesSebastian U. StichAmirkeivan MohtashamiMartin JaggiAbstractVideoAISTATS2021
It has been experimentally observed that the efficiency of distributed training with stochastic gradient (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the gradient staleness. Especially, it has been observed that the speedup saturates beyond a certain batch size and/or when the delays grow too large. We identify a data-dependent parameter that explains the speedup saturation in both these settings. Our comprehensive theoretical analysis, for strongly convex, convex and non-convex settings, unifies and generalized prior work directions that often focused on only one of these two aspects. In particular, our approach allows us to derive improved speedup results under frequently considered sparsity assumptions. Our insights give rise to theoretically based guidelines on how the learning rates can be adjusted in practice. We show that our results are tight and illustrate key findings in numerical experiments.
- LENA: Communication-Efficient Distributed Learning with Self-Triggered Gradient UploadsHossein Shokri GhadikolaeiSebastian U. StichMartin JaggiAbstractVideoAISTATS2021
In distributed optimization, parameter updates from the gradient computing node devices have to be aggregated in every iteration on the orchestrating server. When these updates are sent over an arbitrary commodity network, bandwidth and latency can be limiting factors. We propose a communication framework where nodes may skip unnecessary uploads. Every node locally accumulates an error vector in memory and self-triggers the upload of the memory contents to the parameter server using a significance filter. The server then uses a history of the nodes’ gradients to update the parameter. We characterize the convergence rate of our algorithm in smooth settings (strongly-convex, convex, and non-convex) and show that it enjoys the same convergence rate as when sending gradients every iteration, with substantially fewer uploads. Numerical experiments on real data indicate a significant reduction of used network resources (total communicated bits and latency), especially in large networks, compared to state-of-the-art algorithms. Our results provide important practical insights for using machine learning over resource-constrained networks, including Internet-of-Things and geo-separated datasets across the globe.
- A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free!Dmitry KovalevAnastasia KoloskovaMartin JaggiPeter RichtárikSebastian U. StichAbstractVideoAISTATS2021
Decentralized optimization methods enable on-device training of machine learning models without a central coordinator. In many scenarios communication between devices is energy demanding and time consuming and forms the bottleneck of the entire system.
We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators to the communicated messages. By combining our scheme with a new variance reduction technique that progressively throughout the iterations reduces the adverse effect of the injected quantization noise, we obtain a scheme that converges linearly on strongly convex decentralized problems while using compressed communication only.
We prove that our method can solve the problems without any increase in the number of communications compared to the baseline which does not perform any communication compression while still allowing for a significant compression factor which depends on the conditioning of the problem and the topology of the network. Our key theoretical findings are supported by numerical experiments.
- Taming GANs with Lookahead-MinmaxTatjana ChavdarovaMatteo PagliardiniSebastian U. StichFrançois FleuretMartin JaggiAbstractVideoICLR2021
Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single objective minimization only. The backtracking step of our Lookahead–minmax naturally handles the rotational game dynamics, a property which was identified to be key for enabling gradient ascent descent methods to converge on challenging examples often analyzed in the literature. Moreover, it implicitly handles high variance without using large mini-batches, known to be essential for reaching state of the art performance. Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead–minmax with Adam or extragradient, in terms of performance and improved stability, for negligible memory and computational cost. Using 30-fold fewer parameters and 16-fold smaller minibatches we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining FID of 12.19 without using the class labels, bringing state-of-the-art GAN training within reach of common computational resources.
- Ensemble Distillation for Robust Model Fusion in Federated LearningTao LinLingjing KongSebastian U. StichMartin JaggiAbstractVideoNeurIPS332020
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model while keeping the training data decentralized. In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side. However, directly averaging model parameters is only possible if all models have the same structure and size, which could be a restrictive constraint in many scenarios.
In this work we investigate more powerful and more flexible aggregation schemes for FL. Specifically, we propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients. This knowledge distillation technique mitigates privacy risk and cost to the same extent as the baseline FL algorithms, but allows flexible aggregation over heterogeneous client models that can differ e.g. in size, numerical precision or structure. We show in extensive empirical experiments on various CV/NLP datasets (CIFAR-10/100, ImageNet, AG News, SST2) and settings (heterogeneous models/data) that the server model can be trained much faster, requiring fewer communication rounds than any existing FL technique so far.
- The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed CommunicationSebastian U. StichSai Praneeth R. KarimireddyAbstractBibJournal of Machine Learning ResearchJMLR 21(237), 1–362020
We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic, convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for strongly-convex quadratic functions only. We further show similar results for SGD with more intricate form of delayed gradients&emdash;compressed gradients under error compensation and for localSGD where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is in particular important for distributed parallel implementations, where asynchronous and communication efficient methods are the key to achieve linear speedups for optimization with multiple devices.
- A Unified Theory of Decentralized SGD with Changing Topology and Local UpdatesAnastasia KoloskovaNicolas LoizouSadra BoreiriMartin JaggiSebastian U. StichAbstractVideoICML2020
Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communication-efficiency. In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities.
Our algorithmic framework covers local SGD updates and synchronous and pairwise gossip updates on adaptive network topology. We derive universal convergence rates for smooth (convex and non-convex) problems and the rates interpolate between the heterogeneous (non-identically distributed data) and iid-data settings, recovering linear convergence rates in many special cases, for instance for over-parametrized models. Our proofs rely on weak assumptions (typically improving over prior work in several aspects) and recover (and improve) the best known complexity results for a host of important scenarios, such as for instance coorperative SGD and federated averaging (local SGD).
- Extrapolation for Large-batch Training in Deep LearningTao LinLingjing KongSebastian U. StichMartin JaggiAbstractVideoICML2020
Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock faced when increasing the batch size to a substantial fraction of the training data for improving training time is the persistent degradation in performance (generalization gap). To address this issue, recent work propose to add small perturbations to the model parameters when computing the stochastic gradients and report improved generalization performance due to smoothing effects. However, this approach is poorly understood; it requires often model-specific noise and fine-tuning. To alleviate these drawbacks, we propose to use instead computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. This principled approach is well grounded from an optimization perspective and we show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before whilst reaching or surpassing SOTA accuracy.
- Is Local SGD Better than Minibatch SGD?Blake WoodworthKumar Kshitij PatelSebastian U. StichZhen DaiBrian BullinsH. Brendan McMahanOhad ShamirNathan SrebroAbstractVideoICML2020
We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
- Advances and Open Problems in Federated LearningPeter KairouzH. Brendan McMahanBrendan AventAurélien BelletMehdi BennisArjun Nitin BhagojiKeith BonawitzZachary CharlesGraham CormodeRachel CummingsRafael G.L. D'OliveiraHubert EichnerSalim El RouayhebDavid EvansJosh GardnerZachary GarrettAdrià GascónBadih GhaziPhillip B. GibbonsMarco GruteserZaid HarchaouiChaoyang HeLie HeZhouyuan HuoBen HutchinsonJustin HsuMartin JaggiTara JavidiGauri JoshiMikhail KhodakJakub KonečnýAleksandra KorolovaFarinaz KoushanfarSanmi KoyejoTancrède LepointYang LiuPrateek MittalMehryar MohriRichard NockAyfer ÖzgürRasmus PaghMariana RaykovaHang QiDaniel RamageRamesh RaskarDawn SongWeikang SongSebastian U. StichZiteng SunAnanda T. SureshFlorian TramèrPraneeth VepakommaJianyu WangLi XiongZheng XuQiang YangFelix X. YuHan YuSen ZhaoAbstractFoundations and Trends® in Machine Learning14(1–2), 1–2102021
Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
- SCAFFOLD: Stochastic Controlled Averaging for On-Device Federated LearningSai Praneeth R. KarimireddySatyen KaleMehryar MohriSashank J. ReddiSebastian U. StichAnanda T. SureshAbstractVideoICML2020
Federated learning is a key scenario in modern large-scale machine learning. In that scenario, the training data remains distributed over a large number of clients, which may be phones, other mobile devices, or network sensors and a centralized model is learned without ever transmitting client data over the network. The standard optimization algorithm used in this scenario is Federated Averaging (FedAvg). However, when client data is heterogeneous, which is typical in applications, FedAvg does not admit a favorable convergence guarantee. This is because local updates on clients can drift apart, which also explains the slow convergence and hard-to-tune nature of FedAvg in practice. This paper presents a new Stochastic Controlled Averaging algorithm (SCAFFOLD) which uses control variates to reduce the drift between different clients. We prove that the algorithm requires significantly fewer rounds of communication and benefits from favorable convergence guarantees.
- Dynamic Model Pruning with FeedbackTao LinSebastian U. StichLuis BarbaDaniil DmitrievMartin JaggiAbstractBibICLR2020
Deep neural networks often have millions of parameters. This can hinder their deployment to low-end devices, not only due to high memory requirements but also because of increased latency at inference. We propose a novel model compression method that generates a sparse trained model without additional overhead: by allowing (i) dynamic allocation of the sparsity pattern and (ii) incorporating feedback signal to reactivate prematurely pruned weights we obtain a performant sparse model in one single training pass (retraining is not needed, but can further improve the performance). We evaluate the method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models and further that their performance surpasses all previously proposed pruning schemes (that come without feedback mechanisms).
- Decentralized Deep Learning with Arbitrary Communication CompressionAnastasia KoloskovaTao LinSebastian U. StichMartin JaggiAbstractBibICLR2020
Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches suffer from limited bandwidth of the network, we propose the use of communication compression in the decentralized training context. We show that Choco-SGD&emdash;recently introduced and analyzed for strongly-convex objectives only&emdash;converges under arbitrary high compression ratio on general non-convex functions at the rate O(1/√(nT)) where T denotes the number of iterations and n the number of workers. The algorithm achieves linear speedup in the number of workers and supports higher compression than previous state-of-the art methods. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over distributed user devices, connected by a social network and (ii) in a datacenter (outperforming all-reduce time-wise).
- Don't Use Large Mini-Batches, Use Local SGDTao LinSebastian U. StichKumar Kshitij PatelMartin JaggiAbstractBibICLR2020
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have lead to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose a post-local SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.
- Decentralized Stochastic Optimization and Gossip Algorithms with Compressed CommunicationAnastasia KoloskovaSebastian U. StichMartin JaggiAbstractVideoCodeBibICMLPMLR 97, 3478–34872019
We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning task) being distributed over n machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by ω≤1 (ω=1 meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate O(1/(nT)+1/(Tδ2ω)2) for strongly convex objectives, where T denotes the number of iterations and δ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, O(1/(nT)), is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time O(1/(δ2ω)log(1/ε)) for accuracy ε>0. This is (up to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for ω>0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms do outperform the respective state-of-the-art baselines and CHOCO-SGD can reduce communication by at least two orders of magnitudes.
- Error Feedback Fixes SignSGD and other Gradient Compression SchemesSai Praneeth R. KarimireddyQuentin RebjockSebastian U. StichMartin JaggiAbstractCodeBibICMLPMLR 97, 3252–32612019
Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm EF-SGD achieves the same rate of convergence as SGD without any additional assumptions for arbitrary compression operators (including the sign operator), indicating that we get gradient compression for free. Our experiments thoroughly substantiate the theory showing the superiority of our algorithm.
- Local SGD Converges Fast and Communicates LittleSebastian U. StichAbstractPosterBibICLR2019
Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck recent works propose to reduce the communication frequency. An algorithm of this type is local SGD that runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but eluded thorough theoretical analysis.
We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced up to a factor of T^{1/2}---where T denotes the number of total steps---compared to mini-batch SGD. This also holds for asynchronous implementations.
Local SGD can also be used for large scale training of deep learning models. The results shown here aim serving as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.
- Efficient Greedy Coordinate Descent for Composite ProblemsSai Praneeth R. KarimireddyAnastasia KoloskovaSebastian U. StichMartin JaggiAbstractBibAISTATSPMLR 89, 2887–28962019
Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension n, and requiring upto n times fewer iterations.
In this paper, we consider greedy updates that are based on subgradients for a class of non-smooth composite problems, which includes L1-regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of n, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy.
Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.
- Sparsified SGD with MemorySebastian U. StichJean-Baptiste CordonnierMartin JaggiAbstractPosterCodeBibNeurIPS31, 4452–44632018
Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far.
In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.
- Global linear convergence of Newton's method without strong-convexity or Lipschitz gradientsSai Praneeth R. KarimireddySebastian U. StichMartin JaggiBibNeurIPS 2019 Workshop: Beyond First Order Methods in ML2018
- Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order OptimizationRobert M. GowerFilip HanzelyPeter RichtárikSebastian U. StichAbstractPosterBibNeurIPS31, 1626–16362018
We present the first accelerated randomized algorithm for solving linear systems in Euclidean spaces. One essential problem of this type is the matrix inversion problem. In particular, our algorithm can be specialized to invert positive definite matrices in such a way that all iterates (approximate solutions) generated by the algorithm are positive definite matrices themselves. This opens the way for many applications in the field of optimization and machine learning. As an application of our general theory, we develop the first accelerated (deterministic and stochastic) quasi-Newton updates. Our updates lead to provably more aggressive approximations of the inverse Hessian, and lead to speed-ups over classical non-accelerated rules in numerical experiments. Experiments with empirical risk minimization show that our rules can accelerate training of machine learning models.
- On Matching Pursuit and Coordinate DescentFrancesco LocatelloAnant RajSai Praneeth R. KarimireddyGunnar RätschBernhard SchölkopfSebastian U. StichMartin JaggiBibICMLPMLR 80, 3198–32072018
- Adaptive balancing of gradient and update computation times using global geometry and approximate subproblemsSai Praneeth R. KarimireddySebastian U. StichMartin JaggiSlidesBibAISTATSPMLR 84, 1204–12132018
- Safe Adaptive Importance SamplingSebastian U. StichAnant RajMartin JaggiVideoSlidesPosterBibNIPS30, 4381–43912017
- On the existence of ordinary trianglesRadoslav FulekHossein N. MojarradMárton NaszódiJózsef SolymosiSebastian U. StichMay SzedlákBibComputational Geometry66, 28–312017
- Approximate Steepest Coordinate DescentSebastian U. StichAnant RajMartin JaggiPosterBibICMLPMLR 70, 3251–32592017
- Efficiency of accelerated coordinate descent method on structured optimization problemsYurii NesterovSebastian U. StichBibSIAM Journal on Optimization27:1, 110–1232017
- Variable Metric Random PursuitSebastian U. StichChristian L. MüllerBernd GärtnerBibMathematical Programming156(1), 549–5792016
- On two continuum armed bandit problems in high dimensionsHemant TyagiSebastian U. StichBernd GärtnerBibTheory of Computing Systems58:1, 191–2222016
- On low complexity Acceleration Techniques for Randomized OptimizationSebastian U. StichSupplBibPPSN XIIISpringer, 130–1402014
- Optimization of Convex Functions with Random PursuitSebastian U. StichChristian L. MüllerBernd GärtnerBibSIAM Journal on Optimization23:2, 1284–13092013
- On spectral invariance of Randomized Hessian and Covariance Matrix Adaptation schemesSebastian U. StichChristian L. MüllerPosterBibPPSN XIISpringer, 448–4572012
- On Two Problems Regarding the Hamiltonian Cycle GameDan HefetzSebastian U. StichBibThe Electronic Journal of Combinatorics16(1)2009
Preprints / Various:
- Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient MethodsAmirkeivan MohtashamiMartin JaggiSebastian U. StichAbstractTechnical Report2022
While SGD, which samples from the data with replacement is widely studied in theory, a variant called Random Reshuffling (RR) is more common in practice. RR iterates through random permutations of the dataset and has been shown to converge faster than SGD. When the order is chosen deterministically, a variant called incremental gradient descent (IG), the existing convergence bounds show improvement over SGD but are worse than RR. However, these bounds do not differentiate between a good and a bad ordering and hold for the worst choice of order. Meanwhile, in some cases, choosing the right order when using IG can lead to convergence faster than RR. In this work, we quantify the effect of order on convergence speed, obtaining convergence bounds based on the chosen sequence of permutations while also recovering previous results for RR. In addition, we show benefits of using structured shuffling when various levels of abstractions (e.g. tasks, classes, augmentations, etc.) exists in the dataset in theory and in practice. Finally, relying on our measure, we develop a greedy algorithm for choosing good orders during training, achieving superior performance (by more than 14 percent in accuracy) over RR.
- ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!Konstantin MishchenkoGrigory MalinovskySebastian U. StichPeter RichtárikAbstractTechnical Report2022
We introduce ProxSkip -- a surprisingly simple and provably efficient method for minimizing the sum of a smooth (f)
and an expensive nonsmooth proximable (ψ) function. The canonical approach to solving such problems is via the proximal gradient descent
(ProxGD) algorithm, which is based on the evaluation of the gradient of f and the prox operator of ψ in each iteration.
In this work we are specifically interested in the regime in which the evaluation of prox is costly relative to the evaluation
of the gradient, which is the case in many applications. ProxSkip allows for the expensive prox operator to be skipped in most iterations.
Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently
on all devices, and evaluation of prox corresponds to (expensive) communication in the form of gradient averaging.
In this context, ProxSkip offers an effective acceleration of communication complexity. Unlike other local gradient-type methods,
such as FedAvg, SCAFFOLD, S-Local-GD and FedLin, whose theoretical communication complexity is worse than, or at best matching,
that of vanilla GD in the heterogeneous data regime, we obtain a provable and large improvement without any heterogeneity-bounding assumptions.
- Linear Speedup in Personalized Collaborative LearningEl-Mahdi ChaytiSai Praneeth R. KarimireddySebastian U. StichNicolas FlammarionMartin JaggiAbstractTechnical Report2021
Personalization in federated learning can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user). In order to develop training algorithms that optimally balance this trade-off, it is necessary to extend our theoretical foundations. In this work, we formalize the personalized collaborative learning problem as stochastic optimization of a user's objective f0(x) while given access to N related but different objectives of other users {f1(x),…,fN(x)}. We give convergence guarantees for two algorithms in this setting -- a popular personalization method known as weighted gradient averaging, and a novel bias correction method -- and explore conditions under which we can optimally trade-off their bias for a reduction in variance and achieve linear speedup w.r.t. the number of users N. Further, we also empirically study their performance confirming our theoretical insights.
- ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive TrainingHui-Po WangSebastian U. StichYang HeMario FritzAbstractTechnical Report2021
Federated learning is a powerful distributed learning scheme that allows numerous edge devices to collaboratively train a model without sharing their data. However, training is resource-intensive for edge devices, and limited network bandwidth is often the main bottleneck. Prior work often overcomes the constraints by condensing the models or messages into compact formats, e.g., by gradient compression or distillation. In contrast, we propose ProgFed, the first progressive training framework for efficient and effective federated learning. It inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models. We theoretically prove that ProgFed converges at the same asymptotic rate as standard training on full models. Extensive results on a broad range of architectures, including CNNs (VGG, ResNet, ConvNets) and U-nets, and diverse tasks from simple classification to medical image segmentation show that our highly effective training approach saves up to 20% computation and up to 63% communication costs for converged models. As our approach is also complimentary to prior work on compression, we can achieve a wide range of trade-offs, showing reduced communication of up to 50× at only 0.1% loss in utility.
- Escaping Local Minima With Stochastic NoiseHarsh VardhanSebastian U. StichAbstractNeurIPS 2021 Workshop: Optimization for Machine Learning2021
Non-convex optimization problems are ubiquitous in machine learning, especially in Deep Learning. While such complex problems can often be successfully optimized in practice by using
stochastic gradient descent (SGD), theoretical analysis cannot adequately explain this success.
In particular, the standard analyses don’t show global convergence of SGD on non-convex functions, and instead show convergence to stationary points (which can also be local minima or saddle
points). In this work, we identify a broad class of nonconvex functions for which we can show that
perturbed SGD (gradient descent perturbed by stochastic noise—covering SGD as a special case)
converges to a global minimum, in contrast to gradient descent without noise that can get stuck
in local minima. In particular, for non-convex functions that are relative close to a convex-like
(strongly convex or PŁ) function we prove that SGD converges linearly to a global optimum.
- On Second-order Optimization Methods for Federated LearningSebastian BischoffStephan GunnemannMartin JaggiSebastian U. StichAbstractICML 2021 Workshop: Beyond first-order methods in ML systems2021
We consider federated learning (FL), where the training data is distributed across a large number of clients. The standard optimization method in this setting is Federated Averaging (FedAvg), which performs multiple local first-order optimization steps between communication rounds. In this work, we evaluate the performance of several second-order distributed methods with local steps in the FL setting which promise to have favorable convergence properties.
We (i) show that FedAvg performs surprisingly well against its second-order competitors when evaluated under fair metrics (equal amount of local computations)-in contrast to the results of previous work. Based on our numerical study, we propose (ii) a novel variant that uses second-order local information for updates and a global line search to counteract the resulting local specificity.
- Bi-directional Adaptive Communication for Heterogenous Distributed LearningDmitrii AvdiukhinNikita IvkinMartin JaggiVladimir BravermanAbstractICML 2021 Workshop: Federated Learning for User Privacy and Data Confidentiality2021
Communication constraints are a key bottleneck
in distributed optimization, in particularly bandwidth and latency can be limiting factors when
devices are connected over commodity networks,
such as in Federated Learning. State-of-the-art
techniques tackle these challenges by advanced
compression techniques or delaying communication rounds according to predefined schedules.
We present a new scheme that adaptively skips
communication (broadcast and client uploads) by
detecting slow-varying updates. The scheme automatically adjusts the communication frequency
independently for each worker and the server.
By utilizing an error-feedback mechanism – borrowed from the compression literature – we prove
that the convergence rate is the same as for batch
gradient descent in the convex and nonconvex
smooth cases. We show reduction of the total
number of communication rounds between server
and clients needed to achieve a targeted accuracy, even in the case when the data distribution
is highly non-IID.
- A Field Guide to Federated OptimizationJianyu WangZachary CharlesZheng XuGauri JoshiH. Brendan McMahanBlaise Aguera y ArcasMaruan Al-ShedivatGalen AndrewSalman AvestimehrKatharine DalyDeepesh DataSuhas DiggaviHubert EichnerAdvait GadhikarZachary GarrettAntonious M. GirgisFilip HanzelyAndrew HardChaoyang HeSamuel HorváthZhouyuan HuoAlex IngermanMartin JaggiTara JavidiPeter KairouzSatyen KaleSai Praneeth R. KarimireddyJakub KonečnýSanmi KoyejoTian LiLuyang LiuMehryar MohriHang QiSashank J. ReddiPeter RichtárikKaran SinghalVirginia SmithMahdi SoltanolkotabWeikang SongAnanda T. SureshSebastian U. StichAmeet TalwalkarHongyi WangBlake WoodworthShanshan WuFelix X. YuHonglin YuanManzil ZaheerMi ZhangTong ZhangChunxiang ZhengChen ZhuWennan ZhuAbstractTechnical Report2021
Federated learning and analytics are a distributed approach for
collaboratively learning models (or statistics) from decentralized data,
motivated by and designed for privacy protection. The distributed learning
process can be formulated as solving federated optimization problems, which
emphasize communication efficiency, data heterogeneity, compatibility with
privacy and system requirements, and other constraints that are not primary
considerations in other problem settings. This paper provides recommendations
and guidelines on formulating, designing, evaluating and analyzing federated
optimization algorithms through concrete examples and practical implementation,
with a focus on conducting effective simulations to infer real-world
performance. The goal of this work is not to survey the current literature, but
to inspire researchers and practitioners to design federated learning
algorithms that can be used in various practical applications.
- Decentralized Local Stochastic Extra-Gradient for Variational InequalitiesAleksandr BeznosikovPavel DvurechenskyAnastasia KoloskovaValentin SamokhinSebastian U. StichAlexander GasnikovAbstractTechnical Report2021
We consider decentralized stochastic variational inequalities where the problem data is distributed across many participating devices (heterogeneous, or non-IID data setting). We propose a novel method - based on stochastic extra-gradient - where participating devices can communicate over arbitrary, possibly time-varying network topologies. This covers both the fully decentralized optimization setting and the centralized topologies commonly used in Federated Learning. Our method further supports multiple local updates on the workers for reducing the communication frequency between workers. We theoretically analyze the proposed scheme in the strongly monotone, monotone and non-monotone setting. As a special case, our method and analysis apply in particular to decentralized stochastic min-max problems which are being studied with increased interest in Deep Learning. For example, the training objective of Generative Adversarial Networks (GANs) are typically saddle point problems and the decentralized training of GANs has been reported to be extremely challenging. While SOTA techniques rely on either repeated gossip rounds or proximal updates, we alleviate both of these requirements. Experimental results for decentralized GAN demonstrate the effectiveness of our proposed algorithm.
- On Communication Compression for Distributed Optimization on Heterogeneous DataSebastian U. StichAbstractTechnical Report2020
Lossy gradient compression, with either unbiased or biased compressors, has become a key tool to avoid the communication bottleneck in centrally coordinated distributed training of machine learning models. We analyze the performance of two standard and general types of methods: (i) distributed quantized SGD (D-QSGD) with arbitrary unbiased quantizers and (ii) distributed SGD with error-feedback and biased compressors (D-EF-SGD) in the heterogeneous (non-iid) data setting.
Our results indicate that D-EF-SGD is much less affected than D-QSGD by non-iid data, but both methods can suffer a slowdown if data-skewness is high. We propose two alternatives that are not (or much less) affected by heterogenous data distributions: a new method that is only applicable to strongly convex problems, and we point out a more general approach that is applicable to linear compressors.
- On the Convergence of SGD with Biased GradientsAhmad AjalloeianSebastian U. StichAbstractICML 2020 Workshop: Beyond First Order Methods in ML Systems2020
The classic stochastic gradient (SGD) algorithm crucially depends on the availability of unbiased stochastic gradient oracles. However, in certain applications only biased gradient updates are available or preferred over unbiased ones for performance reasons.
For instance, in the domain of distributed learning, biased gradient compression techniques such as top-$k$ compression have been proposed as a tool to alleviate the communication bottleneck and in derivative-free optimization, only biased gradient estimators can be queried.
We present a general framework to analyze SGD with biased gradient oracles and derive convergence rates for smooth (non-convex) functions and give improved rates under the Polyak-Lojasiewicz condition. Our rates show how biased estimators impact the attainable convergence guarantees. We discuss a few guiding examples that show the broad applicability of our analysis.
- Unified Optimal Analysis of the (Stochastic) Gradient MethodSebastian U. StichAbstractBibTechnical Report2019
In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on μ-strongly convex functions under a (milder than standard) L-smoothness assumption. We show that SGD converges after T iterations as O(L∥x0−x*∥² exp[−μT/4L]+σ²/μT) where σ² measures the variance. For deterministic gradient descent (GD) and SGD in the interpolation setting we have σ²=0 and we recover the exponential convergence rate. The bound matches with the best known iteration complexity of GD and SGD, up to constants.
- Stochastic Distributed Learning with Gradient Quantization and Variance ReductionSamuel HorváthDmitry KovalevKonstantin MishchenkoSebastian U. StichPeter RichtárikAbstractBibTechnical Report2019
We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g. quantize or sparsify) the gradients, thereby introducing additional variance ω≥1 that might slow down convergence. For strongly convex functions with condition number κ distributed among n machines, we (i) give a scheme that converges in O((κ+κωn+ω) log(1/ϵ)) steps to a neighborhood of the optimal solution. For objective functions with a finite-sum structure, each worker having less than m components, we (ii) present novel variance reduced schemes that converge in O((κ+κωn+ω+m)log(1/ϵ)) steps to arbitrary accuracy ϵ>0. These are the first methods that achieve linear convergence for arbitrary quantized updates. We also (iii) give analysis for the weakly convex and non-convex cases and (iv) verify in experiments that our novel variance reduced schemes are more efficient than the baselines.
- k-SVRG: Variance Reduction for Large Scale OptimizationAnant RajSebastian U. StichBibTechnical Report2018
- Stochastic continuum armed bandit problem of few linear parameters in high dimensionsHemant TyagiSebastian U. StichBernd GärtnerBibTechnical Report2013
- Random Pursuit in Hilbert SpaceSebastian U. StichBernd GärtnerBibTechnical ReportCGL-TR-882013
- Matrix-valued Iterative Random ProjectionsSebastian U. StichChristian L. MüllerBernd GärtnerBibTechnical ReportCGL-TR-872013
Theses