Picture Sebastian Stich

Sebastian U. Stich

mail: sebastian.stich@epfl.ch
address: EPFL Lausanne, Machine Learning and Optimization Laboratory, EPFL IC IINFCOM MLO INJ 337, Station 14, CH-1015 Lausanne

Picture Sebastian Stich
Research scientist at EPFL,
in the group of Prof. Martin Jaggi, Machine Learning and Optimization Laboratory (MLO).

Research Interests

(upcoming) Workshops

Reviewing for

Education and Work


Refereed Publications

  • A Unified Theory of Decentralized SGD with Changing Topology and Local UpdatesAnastasia KoloskovaNicolas LoizouSadra BoreiriMartin JaggiAbstractBibIn:ICML2020
    Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communication-efficiency. In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities. Our algorithmic framework covers local SGD updates and synchronous and pairwise gossip updates on adaptive network topology. We derive universal convergence rates for smooth (convex and non-convex) problems and the rates interpolate between the heterogeneous (non-identically distributed data) and iid-data settings, recovering linear convergence rates in many special cases, for instance for over-parametrized models. Our proofs rely on weak assumptions (typically improving over prior work in several aspects) and recover (and improve) the best known complexity results for a host of important scenarios, such as for instance coorperative SGD and federated averaging (local SGD).
  • Extrapolation for Large-batch Training in Deep LearningTao LinLingjing KongMartin JaggiAbstractBibIn:ICML2020
    Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock faced when increasing the batch size to a substantial fraction of the training data for improving training time is the persistent degradation in performance (generalization gap). To address this issue, recent work propose to add small perturbations to the model parameters when computing the stochastic gradients and report improved generalization performance due to smoothing effects. However, this approach is poorly understood; it requires often model-specific noise and fine-tuning. To alleviate these drawbacks, we propose to use instead computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. This principled approach is well grounded from an optimization perspective and we show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before whilst reaching or surpassing SOTA accuracy.
  • Is Local SGD Better than Minibatch SGD?Blake WoodworthKumar Kshitij PatelZhen DaiBrian BullinsH. Brendan McMahanOhad ShamirNathan SrebroAbstractBibIn:ICML2020
    We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
  • SCAFFOLD: Stochastic Controlled Averaging for On-Device Federated LearningSai Praneeth R. KarimireddySatyen KaleMehryar MohriSashank J. ReddiAnanda T. SureshAbstractBibIn:ICML2020
    Federated learning is a key scenario in modern large-scale machine learning. In that scenario, the training data remains distributed over a large number of clients, which may be phones, other mobile devices, or network sensors and a centralized model is learned without ever transmitting client data over the network. The standard optimization algorithm used in this scenario is Federated Averaging (FedAvg). However, when client data is heterogeneous, which is typical in applications, FedAvg does not admit a favorable convergence guarantee. This is because local updates on clients can drift apart, which also explains the slow convergence and hard-to-tune nature of FedAvg in practice. This paper presents a new Stochastic Controlled Averaging algorithm (SCAFFOLD) which uses control variates to reduce the drift between different clients. We prove that the algorithm requires significantly fewer rounds of communication and benefits from favorable convergence guarantees.
  • Dynamic Model Pruning with FeedbackTao LinLuis BarbaDaniil DmitrievMartin JaggiAbstractBibIn:ICLR2020
    Deep neural networks often have millions of parameters. This can hinder their deployment to low-end devices, not only due to high memory requirements but also because of increased latency at inference. We propose a novel model compression method that generates a sparse trained model without additional overhead: by allowing (i) dynamic allocation of the sparsity pattern and (ii) incorporating feedback signal to reactivate prematurely pruned weights we obtain a performant sparse model in one single training pass (retraining is not needed, but can further improve the performance). We evaluate the method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models and further that their performance surpasses all previously proposed pruning schemes (that come without feedback mechanisms).
  • Decentralized Deep Learning with Arbitrary Communication CompressionAnastasia KoloskovaTao LinMartin JaggiAbstractBibIn:ICLR2020
    Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches suffer from limited bandwidth of the network, we propose the use of communication compression in the decentralized training context. We show that Choco-SGD&emdash;recently introduced and analyzed for strongly-convex objectives only&emdash;converges under arbitrary high compression ratio on general non-convex functions at the rate O(1/√(nT)) where T denotes the number of iterations and n the number of workers. The algorithm achieves linear speedup in the number of workers and supports higher compression than previous state-of-the art methods. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over distributed user devices, connected by a social network and (ii) in a datacenter (outperforming all-reduce time-wise).
  • Don't Use Large Mini-Batches, Use Local SGDTao LinKumar Kshitij PatelMartin JaggiAbstractBibIn:ICLR2020
    Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have lead to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose a post-local SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.
  • Decentralized Stochastic Optimization and Gossip Algorithms with Compressed CommunicationAnastasia KoloskovaMartin JaggiAbstractVideoCodeBibIn:ICMLPMLR 97, 3478–34872019
    We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning task) being distributed over n machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by ω≤1 (ω=1 meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate O(1/(nT)+1/(Tδ2ω)2) for strongly convex objectives, where T denotes the number of iterations and δ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, O(1/(nT)), is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time O(1/(δ2ω)log(1/ε)) for accuracy ε>0. This is (up to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for ω>0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms do outperform the respective state-of-the-art baselines and CHOCO-SGD can reduce communication by at least two orders of magnitudes.
  • Error Feedback Fixes SignSGD and other Gradient Compression SchemesSai Praneeth R. KarimireddyQuentin RebjockMartin JaggiAbstractCodeBibIn:ICMLPMLR 97, 3252–32612019
    Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm EF-SGD achieves the same rate of convergence as SGD without any additional assumptions for arbitrary compression operators (including the sign operator), indicating that we get gradient compression for free. Our experiments thoroughly substantiate the theory showing the superiority of our algorithm.
  • Local SGD Converges Fast and Communicates LittleAbstractPosterBibIn:ICLR2019
    Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck recent works propose to reduce the communication frequency. An algorithm of this type is local SGD that runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but eluded thorough theoretical analysis.
    We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced up to a factor of T^{1/2}---where T denotes the number of total steps---compared to mini-batch SGD. This also holds for asynchronous implementations.
    Local SGD can also be used for large scale training of deep learning models. The results shown here aim serving as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.
  • Efficient Greedy Coordinate Descent for Composite ProblemsSai Praneeth R. KarimireddyAnastasia KoloskovaMartin JaggiAbstractBibIn:AISTATSPMLR 89, 2887–28962019
    Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension n, and requiring upto n times fewer iterations.
    In this paper, we consider greedy updates that are based on subgradients for a class of non-smooth composite problems, which includes L1-regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of n, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy.
    Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.
  • Sparsified SGD with MemoryJean-Baptiste CordonnierMartin JaggiAbstractPosterCodeBibIn:NeurIPS31, 4452–44632018
    Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far.
    In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.
  • Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order OptimizationRobert M. GowerFilip HanzelyPeter RichtárikAbstractPosterBibIn:NeurIPS31, 1626–16362018
    We present the first accelerated randomized algorithm for solving linear systems in Euclidean spaces. One essential problem of this type is the matrix inversion problem. In particular, our algorithm can be specialized to invert positive definite matrices in such a way that all iterates (approximate solutions) generated by the algorithm are positive definite matrices themselves. This opens the way for many applications in the field of optimization and machine learning. As an application of our general theory, we develop the first accelerated (deterministic and stochastic) quasi-Newton updates. Our updates lead to provably more aggressive approximations of the inverse Hessian, and lead to speed-ups over classical non-accelerated rules in numerical experiments. Experiments with empirical risk minimization show that our rules can accelerate training of machine learning models.
  • On Matching Pursuit and Coordinate DescentFrancesco LocatelloAnant RajSai Praneeth R. KarimireddyGunnar RätschBernhard SchölkopfMartin JaggiBibIn:ICMLPMLR 80, 3198–32072018
  • Adaptive balancing of gradient and update computation times using global geometry and approximate subproblemsSai Praneeth R. KarimireddyMartin JaggiSlidesBibIn:AISTATSPMLR 84, 1204–12132018
  • Safe Adaptive Importance SamplingAnant RajMartin JaggiVideoSlidesPosterBibSpotlight talk, In:NIPS30, 4381–43912017
  • On the existence of ordinary trianglesRadoslav FulekHossein N. MojarradMárton NaszódiJózsef SolymosiMay SzedlákBibComputational Geometry66, 28–312017
  • Approximate Steepest Coordinate DescentAnant RajMartin JaggiPosterBibIn:ICMLPMLR 70, 3251–32592017
  • Efficiency of accelerated coordinate descent method on structured optimization problemsYurii NesterovBibSIAM Journal on Optimization27:1, 110–1232017
  • Variable Metric Random PursuitChristian L. MüllerBernd GärtnerBibMathematical Programming156(1), 549–5792016
  • On two continuum armed bandit problems in high dimensionsHemant TyagiBernd GärtnerBibTheory of Computing Systems58:1, 191–2222016
  • On low complexity Acceleration Techniques for Randomized OptimizationSupplBibIn:PPSN XIIISpringer, 130–1402014
  • Optimization of Convex Functions with Random PursuitChristian L. MüllerBernd GärtnerBibSIAM Journal on Optimization23:2, 1284–13092013
  • On spectral invariance of Randomized Hessian and Covariance Matrix Adaptation schemesChristian L. MüllerPosterBibIn:PPSN XIISpringer, 448–4572012
  • On Two Problems Regarding the Hamiltonian Cycle GameDan HefetzBibThe Electronic Journal of Combinatorics16(1)2009

Preprints / Various:

  • Ensemble Distillation for Robust Model Fusion in Federated LearningTao LinLingjing KongMartin JaggiAbstractBibTechnical Report2020
    Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model while keeping the training data decentralized. In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side. However, directly averaging model parameters is only possible if all models have the same structure and size, which could be a restrictive constraint in many scenarios.
    In this work we investigate more powerful and more flexible aggregation schemes for FL. Specifically, we propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients. This knowledge distillation technique mitigates privacy risk and cost to the same extent as the baseline FL algorithms, but allows flexible aggregation over heterogeneous client models that can differ e.g. in size, numerical precision or structure. We show in extensive empirical experiments on various CV/NLP datasets (CIFAR-10/100, ImageNet, AG News, SST2) and settings (heterogeneous models/data) that the server model can be trained much faster, requiring fewer communication rounds than any existing FL technique so far.
  • Advances and Open Problems in Federated LearningPeter KairouzH. Brendan McMahanBrendan AventAurélien BelletMehdi BennisArjun Nitin BhagojiKeith BonawitzZachary CharlesGraham CormodeRachel CummingsRafael G.L. D'OliveiraSalim El RouayhebDavid EvansJosh GardnerZachary GarrettAdrià GascónBadih GhaziPhillip B. GibbonsMarco GruteserZaid HarchaouiChaoyang HeLie HeZhouyuan HuoBen HutchinsonJustin HsuMartin JaggiTara JavidiGauri JoshiMikhail KhodakJakub KonečnýAleksandra KorolovaFarinaz KoushanfarSanmi KoyejoTancrède LepointYang LiuPrateek MittalMehryar MohriRichard NockAyfer ÖzgürRasmus PaghMariana RaykovaHang QiDaniel RamageRamesh RaskarDawn SongWeikang SongZiteng SunAnanda T. SureshFlorian TramèrPraneeth VepakommaJianyu WangLi XiongZheng XuQiang YangFelix X. YuHan YuSen ZhaoAbstractBibTechnical Report2019
    Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
  • The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed CommunicationSai Praneeth R. KarimireddyAbstractBibTechnical Report2019
    We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic, convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for strongly-convex quadratic functions only. We further show similar results for SGD with more intricate form of delayed gradients&emdash;compressed gradients under error compensation and for localSGD where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is in particular important for distributed parallel implementations, where asynchronous and communication efficient methods are the key to achieve linear speedups for optimization with multiple devices.
  • Unified Optimal Analysis of the (Stochastic) Gradient MethodAbstractBibTechnical Report2019
    In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on μ-strongly convex functions under a (milder than standard) L-smoothness assumption. We show that SGD converges after T iterations as O(L∥x0−x*∥² exp[−μT/4L]+σ²/μT) where σ² measures the variance. For deterministic gradient descent (GD) and SGD in the interpolation setting we have σ²=0 and we recover the exponential convergence rate. The bound matches with the best known iteration complexity of GD and SGD, up to constants.
  • Stochastic Distributed Learning with Gradient Quantization and Variance ReductionSamuel HorváthDmitry KovalevKonstantin MishchenkoPeter RichtárikAbstractBibTechnical Report2019
    We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g. quantize or sparsify) the gradients, thereby introducing additional variance ω≥1 that might slow down convergence. For strongly convex functions with condition number κ distributed among n machines, we (i) give a scheme that converges in O((κ+κωn+ω) log(1/ϵ)) steps to a neighborhood of the optimal solution. For objective functions with a finite-sum structure, each worker having less than m components, we (ii) present novel variance reduced schemes that converge in O((κ+κωn+ω+m)log(1/ϵ)) steps to arbitrary accuracy ϵ>0. These are the first methods that achieve linear convergence for arbitrary quantized updates. We also (iii) give analysis for the weakly convex and non-convex cases and (iv) verify in experiments that our novel variance reduced schemes are more efficient than the baselines.
  • Global linear convergence of Newton's method without strong-convexity or Lipschitz gradientsSai Praneeth R. KarimireddyMartin JaggiBibTechnical Report2018
  • k-SVRG: Variance Reduction for Large Scale OptimizationAnant RajBibTechnical Report2018
  • Stochastic continuum armed bandit problem of few linear parameters in high dimensionsHemant TyagiBernd GärtnerBibTechnical Report2013
  • Random Pursuit in Hilbert SpaceBernd GärtnerBibTechnical ReportCGL-TR-882013
  • Matrix-valued Iterative Random ProjectionsChristian L. MüllerBernd GärtnerBibTechnical ReportCGL-TR-872013


Talks / Posters / Workshops

Student Projects

  • Sadra BoreiriDecentralized Local SGDMaster Project2019
  • Damian DudziczDecentralized Asynchronous SGDMaster Project2019
  • Oriol Barbany MayorAffine-Invariant Robust TrainingMaster Project2019
  • Rayan ElalamyConvergence Anlysis for Hogwild! Under Less Restrictive AssumptionsMaster Project2019
  • Lionel ConstantinStochastic Extragradient for GAN TrainingMaster Project2019
  • Harshvardhan HarshvardhanBetter bounds for Optimal Batch Size in SGDMaster Project2019
  • Srijan Bansal (now at IIT Kharagpur)Adaptive Stepsize Strategies for SGDSummer Internship2019
  • Guillaume van Dessel (now PhD student at UCLouvain)Stochastic gradient based methods for large-scale optimization: review, unification and new approachesMaster Thesis2019
  • Ahmad Ajalloeian (now PhD student at UNIL)Stochastic Zeroth-Order Optimisation Algorithms with Variance ReductionMaster Thesis2019
  • Claire Capelo (now exchange student at Stanford)Adaptive schemes for communication compression: Trajectory Normalized Gradients for Distributed OptimizationMaster Project2019
  • Aakash Sunil Lahoti Special Cases of Local SGDMaster Project2019
  • Cem Musluoglu (now PhD student at KU Leuven)Quantization and Compression for Distributed OptimizationMaster Project2018
  • Quentin Rebjock (now visiting student at INRIA)Error Feedback For SignSGDMaster Project2018Paper
  • Jimi Vaubien (now at Visum Technologies)Derivative-Free Empirical Risk MinimizationBachelor Project2018
  • Polina Kirichenko (now PhD student at NYU)Zero-order training for DLSummer Internship2018
  • Nikhil Krishna (now at Detect Technologies)Adaptive Sampling with LSHSummer Internship2018
  • Alberto Chiappa (now at Schindler Group)Real-time recognition of industry-specific context from mobile phone sensor dataMaster Thesis (industry project with Schindler)2018
  • Jean-Baptiste Cordonnier (now PhD student at EPFL)Distributed Machine Learning with Sparsified Stochastic Gradient DescentMaster Thesis2018Paper
  • Arthur Deschamps (now MSc student at ETH)Asynchronous Methods in Machine LearningBachelor Project2018
  • Chia-An Yu (now at Swisscom)Compression for Stochastic Gradient DescentMaster Project2018
  • Frederik Künstner (now PhD student at UBC)Fully Quantized Distributed Gradient DescentMaster Project2017
  • Fabian Latorre Gomez (now PhD Student at EPFL)Importance Sampling for MLEDIC Project2017
  • Pooja Kulkarni (now PhD student at UIUC)Steepest Descent and Adaptive SamplingSummer Internship2017
  • Anastasia Koloskova (not PhD student at EPFL)Steepest CD and LSHSummer Internship2017Paper
  • William Borgeaud dit AvocatAdaptive Sampling in Stochastic Coordinate DescentMaster Project2017
  • Joel Castellon (now at Amazon)Complexity analysis for AdaRBFGS: a primitive for methods between first and second orderSemester Project2017
  • Alberto Chiappa (now at Schindler Group)Asynchronous updates for stochastic gradient descentMaster Project2017
  • Marina Mannari (now at Cartier)Faster Coordinate Descent for Machine Learning through Limited Precision OperationsMaster Thesis2017
  • Anant Raj (now PhD student at MPI)Steepest Coordinate DescentInternship2017Paper


Teaching Assistance