Picture Sebastian Stich

Sebastian U. Stich

mail: sebastian.stich@epfl.ch
address: EPFL Lausanne, Machine Learning and Optimization Laboratory, EPFL IC IINFCOM MLO INJ 337, Station 14, CH-1015 Lausanne

Picture Sebastian Stich
Research scientist at EPFL,
in the group of Prof. Martin Jaggi, Machine Learning and Optimization Laboratory (MLO).

Research Interests

(upcoming) Workshops

Reviewing for

Education and Work


Refereed Publications

  • Decentralized Stochastic Optimization and Gossip Algorithms with Compressed CommunicationAnastasia KoloskovaMartin JaggiAbstractVideoCodeBibIn:ICMLPMLR 97, 3478–34872019
    We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning task) being distributed over n machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by ω≤1 (ω=1 meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate O(1/(nT)+1/(Tδ2ω)2) for strongly convex objectives, where T denotes the number of iterations and δ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, O(1/(nT)), is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time O(1/(δ2ω)log(1/ε)) for accuracy ε>0. This is (up to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for ω>0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms do outperform the respective state-of-the-art baselines and CHOCO-SGD can reduce communication by at least two orders of magnitudes.
  • Error Feedback Fixes SignSGD and other Gradient Compression SchemesSai Praneeth R. KarimireddyQuentin RebjockMartin JaggiAbstractCodeBibIn:ICMLPMLR 97, 3252–3261, 20192019
    Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm EF-SGD achieves the same rate of convergence as SGD without any additional assumptions for arbitrary compression operators (including the sign operator), indicating that we get gradient compression for free. Our experiments thoroughly substantiate the theory showing the superiority of our algorithm.
  • Local SGD Converges Fast and Communicates LittleAbstractPosterBibIn:ICLR2019
    Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck recent works propose to reduce the communication frequency. An algorithm of this type is local SGD that runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but eluded thorough theoretical analysis.
    We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced up to a factor of T^{1/2}---where T denotes the number of total steps---compared to mini-batch SGD. This also holds for asynchronous implementations.
    Local SGD can also be used for large scale training of deep learning models. The results shown here aim serving as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.
  • Efficient Greedy Coordinate Descent for Composite ProblemsSai Praneeth R. KarimireddyAnastasia KoloskovaMartin JaggiAbstractBibIn:AISTATSPMLR 89, 2887–28962019
    Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension n, and requiring upto n times fewer iterations.
    In this paper, we consider greedy updates that are based on subgradients for a class of non-smooth composite problems, which includes L1-regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of n, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy.
    Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.
  • Sparsified SGD with MemoryJean-Baptiste CordonnierMartin JaggiAbstractPosterCodeBibIn:NeurIPS31, 4452–44632018
    Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far.
    In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.
  • Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order OptimizationRobert M. GowerFilip HanzelyPeter RichtárikAbstractPosterBibIn:NeurIPS31, 1626–16362018
    We present the first accelerated randomized algorithm for solving linear systems in Euclidean spaces. One essential problem of this type is the matrix inversion problem. In particular, our algorithm can be specialized to invert positive definite matrices in such a way that all iterates (approximate solutions) generated by the algorithm are positive definite matrices themselves. This opens the way for many applications in the field of optimization and machine learning. As an application of our general theory, we develop the first accelerated (deterministic and stochastic) quasi-Newton updates. Our updates lead to provably more aggressive approximations of the inverse Hessian, and lead to speed-ups over classical non-accelerated rules in numerical experiments. Experiments with empirical risk minimization show that our rules can accelerate training of machine learning models.
  • On Matching Pursuit and Coordinate DescentFrancesco LocatelloAnant RajSai Praneeth R. KarimireddyGunnar RätschBernhard SchölkopfMartin JaggiBibIn:ICMLPMLR 80, 3198–32072018
  • Adaptive balancing of gradient and update computation times using global geometry and approximate subproblemsSai Praneeth R. KarimireddyMartin JaggiSlidesBibIn:AISTATSPMLR 84, 1204–12132018
  • Safe Adaptive Importance SamplingAnant RajMartin JaggiVideoSlidesPosterBibSpotlight talk, In:NIPS30, 4381–43912017
  • On the existence of ordinary trianglesRadoslav FulekHossein N. MojarradMárton NaszódiJózsef SolymosiMay SzedlákBibComputational Geometry66, 28–312017
  • Approximate Steepest Coordinate DescentAnant RajMartin JaggiPosterBibIn:ICMLPMLR 70, 3251–32592017
  • Efficiency of accelerated coordinate descent method on structured optimization problemsYurii NesterovBibSIAM Journal on Optimization27:1, 110–1232017
  • Variable Metric Random PursuitChristian L. MüllerBernd GärtnerBibMathematical Programming156(1), 549–5792016
  • On two continuum armed bandit problems in high dimensionsHemant TyagiBernd GärtnerBibTheory of Computing Systems58:1, 191–2222016
  • On low complexity Acceleration Techniques for Randomized OptimizationSupplBibIn:PPSN XIIISpringer, 130–1402014
  • Optimization of Convex Functions with Random PursuitChristian L. MüllerBernd GärtnerBibSIAM Journal on Optimization23:2, 1284–13092013
  • On spectral invariance of Randomized Hessian and Covariance Matrix Adaptation schemesChristian L. MüllerPosterBibIn:PPSN XIISpringer, 448–4572012
  • On Two Problems Regarding the Hamiltonian Cycle GameDan HefetzBibThe Electronic Journal of Combinatorics16(1)2009

Preprints / Various:

  • The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed CommunicationSai Praneeth R. KarimireddyAbstractBibTechnical Report2019
    We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic, convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for strongly-convex quadratic functions only. We further show similar results for SGD with more intricate form of delayed gradients&emdash;compressed gradients under error compensation and for localSGD where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is in particular important for distributed parallel implementations, where asynchronous and communication efficient methods are the key to achieve linear speedups for optimization with multiple devices.
  • Decentralized Deep Learning with Arbitrary Communication CompressionAnastasia KoloskovaTao LinMartin JaggiAbstractBibTechnical Report2019
    Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches suffer from limited bandwidth of the network, we propose the use of communication compression in the decentralized training context. We show that Choco-SGD&emdash;recently introduced and analyzed for strongly-convex objectives only&emdash;converges under arbitrary high compression ratio on general non-convex functions at the rate O(1/√(nT)) where T denotes the number of iterations and n the number of workers. The algorithm achieves linear speedup in the number of workers and supports higher compression than previous state-of-the art methods. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over distributed user devices, connected by a social network and (ii) in a datacenter (outperforming all-reduce time-wise).
  • Unified Optimal Analysis of the (Stochastic) Gradient MethodAbstractBibTechnical Report2019
    In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on μ-strongly convex functions under a (milder than standard) L-smoothness assumption. We show that SGD converges after T iterations as O(L∥x0−x*∥² exp[−μT/4L]+σ²/μT) where σ² measures the variance. For deterministic gradient descent (GD) and SGD in the interpolation setting we have σ²=0 and we recover the exponential convergence rate. The bound matches with the best known iteration complexity of GD and SGD, up to constants.
  • Stochastic Distributed Learning with Gradient Quantization and Variance ReductionSamuel HorváthDmitry KovalevKonstantin MishchenkoPeter RichtárikAbstractBibTechnical Report2019
    We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g. quantize or sparsify) the gradients, thereby introducing additional variance ω≥1 that might slow down convergence. For strongly convex functions with condition number κ distributed among n machines, we (i) give a scheme that converges in O((κ+κωn+ω) log(1/ϵ)) steps to a neighborhood of the optimal solution. For objective functions with a finite-sum structure, each worker having less than m components, we (ii) present novel variance reduced schemes that converge in O((κ+κωn+ω+m)log(1/ϵ)) steps to arbitrary accuracy ϵ>0. These are the first methods that achieve linear convergence for arbitrary quantized updates. We also (iii) give analysis for the weakly convex and non-convex cases and (iv) verify in experiments that our novel variance reduced schemes are more efficient than the baselines.
  • Don't Use Large Mini-Batches, Use Local SGDTao LinKumar Kshitij PatelMartin JaggiBibTechnical Report2018
  • Global linear convergence of Newton's method without strong-convexity or Lipschitz gradientsSai Praneeth R. KarimireddyMartin JaggiBibTechnical Report2018
  • k-SVRG: Variance Reduction for Large Scale OptimizationAnant RajBibTechnical Report2018
  • Stochastic continuum armed bandit problem of few linear parameters in high dimensionsHemant TyagiBernd GärtnerBibTechnical Report2013
  • Random Pursuit in Hilbert SpaceBernd GärtnerBibTechnical ReportCGL-TR-882013
  • Matrix-valued Iterative Random ProjectionsChristian L. MüllerBernd GärtnerBibTechnical ReportCGL-TR-872013


Talks / Posters / Workshops

Student Projects

  • Srijan BansalAdaptive Stepsize Strategies for SGDSummer Internship2019
  • Guillaume van DesselStochastic gradient based methods for large-scale optimization: review, unification and new approachesMaster Thesis2019
  • Ahmad AjalloeianStochastic Zeroth-Order Optimisation Algorithms with Variance ReductionMaster Thesis2019
  • Claire CapeloAdaptive schemes for communication compression: Trajectory Normalized Gradients for Distributed OptimizationMaster Project2019
  • Aakash Sunil LahotiSpecial Cases of Local SGDMaster Project2019
  • Cem MusluogluQuantization and Compression for Distributed OptimizationMaster Project2018
  • Quentin RebjockError Feedback For SignSGDMaster Project2018Paper
  • Jimi VaubienDerivative-Free Empirical Risk MinimizationBachelor Project2018
  • Polina KirichenkoZero-order training for DLSummer Internship2018
  • Nikhil KrishnaAdaptive Sampling with LSHSummer Internship2018
  • Alberto ChiappaReal-time recognition of industry-specific context from mobile phone sensor dataMaster Thesis (industry project with Schindler)2018
  • Jean-Baptiste CordonnierDistributed Machine Learning with Sparsified Stochastic Gradient DescentMaster Thesis2018Paper
  • Arthur DeschampsAsynchronous Methods in Machine LearningBachelor Project2018
  • Chia-An YuCompression for Stochastic Gradient DescentMaster Project2018
  • Frederik KünstnerFully Quantized Distributed Gradient DescentMaster Project2017
  • Fabian Latorre GomezImportance Sampling for MLEDIC Project2017
  • Pooja KulkarniSteepest Descent and Adaptive SamplingSummer Internship2017
  • Anastasia KoloskovaSteepest CD and LSHSummer Internship2017Paper
  • William Borgeaud DitAdaptive Sampling in Stochastic Coordinate DescentMaster Project2017
  • Joel CastellonComplexity analysis for AdaRBFGS: a primitive for methods between first and second orderSemester Project2017
  • Alberto ChiappaAsynchronous updates for stochastic gradient descentMaster Project2017
  • Marina MannariFaster Coordinate Descent for Machine Learning through Limited Precision OperationsMaster Thesis2017
  • Anant RajSteepest Coordinate DescentInternship2017Paper


Teaching Assistance