
Sebastian U. Stich

email: sebastian.stich@epfl.ch
address: EPFL Lausanne, Machine Learning and Optimization Laboratory, EPFL IC IINFCOM MLO INJ 337, Station 14, CH-1015 Lausanne

Research scientist at EPFL,
in the group of Prof. Martin Jaggi, Machine Learning and Optimization Laboratory (MLO).

Research Interests

Upcoming Workshops

Reviewing for

Education and Work

Publications

Refereed Publications

  • Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication (with Anastasia Koloskova and Martin Jaggi). In: ICML, PMLR 97, 3478–3487, 2019. [Video] [Code]
    We consider decentralized stochastic optimization with the objective function (e.g. data samples for a machine learning task) being distributed over n machines that can only communicate with their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by ω ≤ 1 (ω = 1 meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate O(1/(nT) + 1/(T δ^2 ω)^2) for strongly convex objectives, where T denotes the number of iterations and δ the eigengap of the connectivity matrix. Although the compression quality and the network connectivity affect the higher-order terms, the first term in the rate, O(1/(nT)), is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time O(1/(δ^2 ω) log(1/ε)) for accuracy ε > 0. This is (to the best of our knowledge) the first gossip algorithm that supports arbitrary compressed messages for ω > 0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms outperform the respective state-of-the-art baselines and that CHOCO-SGD can reduce communication by at least two orders of magnitude.
    (An illustrative code sketch of compressed gossip averaging in this spirit appears after this publication list.)
  • Error Feedback Fixes SignSGD and other Gradient Compression Schemes (with Sai Praneeth R. Karimireddy, Quentin Rebjock, and Martin Jaggi). In: ICML, PMLR 97, 3252–3261, 2019. [Code]
    Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm, EF-SGD, achieves the same rate of convergence as SGD without any additional assumptions for arbitrary compression operators (including the sign operator), indicating that we get gradient compression for free. Our experiments thoroughly substantiate the theory and show the superiority of our algorithm.
    (An illustrative code sketch of the error-feedback mechanism appears after this publication list.)
  • Local SGD Converges Fast and Communicates Little. In: ICLR, 2019. [Poster]
    Mini-batch stochastic gradient descent (SGD) is the state of the art in large-scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice, as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck, recent works propose to reduce the communication frequency. An algorithm of this type is local SGD, which runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but has so far eluded thorough theoretical analysis.
    We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced by up to a factor of T^{1/2}, where T denotes the total number of steps, compared to mini-batch SGD. This also holds for asynchronous implementations.
    Local SGD can also be used for large-scale training of deep learning models. The results shown here aim to serve as a guideline for further exploring the theoretical and practical aspects of local SGD in these applications.
    (An illustrative code sketch of local SGD appears after this publication list.)
  • Efficient Greedy Coordinate Descent for Composite Problems (with Sai Praneeth R. Karimireddy, Anastasia Koloskova, and Martin Jaggi). In: AISTATS, PMLR 89, 2887–2896, 2019.
    Coordinate descent with random coordinate selection is the current state of the art for many large-scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension n, requiring up to n times fewer iterations.
    In this paper, we consider greedy updates that are based on subgradients for a class of non-smooth composite problems, which includes L1-regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of n, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy.
    Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing us to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.
    (An illustrative code sketch of greedy coordinate selection in the smooth case appears after this publication list.)
  • Sparsified SGD with Memory (with Jean-Baptiste Cordonnier and Martin Jaggi). In: NeurIPS 31, 4452–4463, 2018. [Poster] [Code]
    Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far.
    In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.
    (The error-feedback code sketch after this publication list also covers top-k sparsification with memory.)
  • Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order Optimization (with Robert M. Gower, Filip Hanzely, and Peter Richtárik). In: NeurIPS 31, 1626–1636, 2018. [Poster]
    We present the first accelerated randomized algorithm for solving linear systems in Euclidean spaces. One essential problem of this type is the matrix inversion problem. In particular, our algorithm can be specialized to invert positive definite matrices in such a way that all iterates (approximate solutions) generated by the algorithm are positive definite matrices themselves. This opens the way for many applications in the field of optimization and machine learning. As an application of our general theory, we develop the first accelerated (deterministic and stochastic) quasi-Newton updates. Our updates lead to provably more aggressive approximations of the inverse Hessian and to speed-ups over classical non-accelerated rules in numerical experiments. Experiments with empirical risk minimization show that our rules can accelerate training of machine learning models.
  • On Matching Pursuit and Coordinate Descent (with Francesco Locatello, Anant Raj, Sai Praneeth R. Karimireddy, Gunnar Rätsch, Bernhard Schölkopf, and Martin Jaggi). In: ICML, PMLR 80, 3198–3207, 2018.
  • Adaptive balancing of gradient and update computation times using global geometry and approximate subproblems (with Sai Praneeth R. Karimireddy and Martin Jaggi). In: AISTATS, PMLR 84, 1204–1213, 2018. [Slides]
  • Safe Adaptive Importance Sampling (with Anant Raj and Martin Jaggi). Spotlight talk. In: NIPS 30, 4381–4391, 2017. [Video] [Slides] [Poster]
  • On the existence of ordinary triangles (with Radoslav Fulek, Hossein N. Mojarrad, Márton Naszódi, József Solymosi, and May Szedlák). Computational Geometry 66, 28–31, 2017.
  • Approximate Steepest Coordinate Descent (with Anant Raj and Martin Jaggi). In: ICML, PMLR 70, 3251–3259, 2017. [Poster]
  • Efficiency of accelerated coordinate descent method on structured optimization problems (with Yurii Nesterov). SIAM Journal on Optimization 27(1), 110–123, 2017.
  • Variable Metric Random Pursuit (with Christian L. Müller and Bernd Gärtner). Mathematical Programming 156(1), 549–579, 2016.
  • On two continuum armed bandit problems in high dimensions (with Hemant Tyagi and Bernd Gärtner). Theory of Computing Systems 58(1), 191–222, 2016.
  • On low complexity Acceleration Techniques for Randomized Optimization. In: PPSN XIII, Springer, 130–140, 2014. [Suppl]
  • Optimization of Convex Functions with Random Pursuit (with Christian L. Müller and Bernd Gärtner). SIAM Journal on Optimization 23(2), 1284–1309, 2013.
  • On spectral invariance of Randomized Hessian and Covariance Matrix Adaptation schemes (with Christian L. Müller). In: PPSN XII, Springer, 448–457, 2012. [Poster]
  • On Two Problems Regarding the Hamiltonian Cycle Game (with Dan Hefetz). The Electronic Journal of Combinatorics 16(1), 2009.
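
The following is a minimal, illustrative Python sketch of compressed gossip averaging in the spirit of CHOCO-GOSSIP (see the decentralized optimization paper above). The ring topology, the top-k compressor, the step size gamma, and all names are assumptions made for this sketch only, not the exact setup or tuning from the paper. A conservative step size is used because the lag between the private values x and the public estimates x_hat grows with gamma.

    import numpy as np

    def top_k(v, k=1):
        # Biased top-k compressor: keep only the k largest-magnitude entries of v.
        out = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
        return out

    def compressed_gossip(x_init, W, gamma=0.05, steps=4000):
        # Gossip averaging where nodes exchange only compressed differences and
        # maintain publicly known estimates x_hat of each other's values.
        x = x_init.copy()                    # private values, shape (n, d)
        x_hat = np.zeros_like(x)             # public estimates, known to neighbors
        for _ in range(steps):
            q = np.stack([top_k(xi - hi) for xi, hi in zip(x, x_hat)])
            x_hat = x_hat + q                       # all nodes update the public copies
            x = x + gamma * (W @ x_hat - x_hat)     # mix using public estimates only
        return x

    # Toy example: 5 nodes on a ring, each holding its own vector; gossip should
    # drive every node towards the global average while preserving the mean exactly.
    n, d = 5, 3
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25
    x0 = np.random.default_rng(1).normal(size=(n, d))
    x_final = compressed_gossip(x0, W)
    print(np.abs(x_final - x0.mean(axis=0)).max())   # should be small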
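
Below is a minimal, illustrative sketch of the error-feedback idea described in the EF-SGD and Sparsified SGD with Memory abstracts above: the part of the update removed by the compressor is kept in memory and added back before the next compression. The quadratic toy objective, the step size, and the particular scaled-sign and top-k compressors are assumptions for this sketch, not the exact setups from the papers; exact gradients replace stochastic ones for brevity.

    import numpy as np

    def scaled_sign(v):
        # Biased sign compressor, rescaled by the mean magnitude (one common variant).
        return np.sign(v) * np.mean(np.abs(v))

    def top_k(v, k=2):
        # Biased top-k compressor: keep only the k largest-magnitude entries of v.
        out = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
        return out

    def ef_gd(grad, x0, steps=2000, lr=0.1, compress=scaled_sign):
        # Gradient descent with error feedback: compress (lr * gradient + memory),
        # apply only the compressed part, and carry the residual over to the next step.
        x = x0.copy()
        memory = np.zeros_like(x0)       # accumulated compression error
        for _ in range(steps):
            p = lr * grad(x) + memory    # add back what was dropped previously
            delta = compress(p)          # only this part would be communicated
            memory = p - delta           # remember the rest
            x = x - delta
        return x

    # Toy example: minimize 0.5 * ||x - b||^2, whose gradient is x - b.
    b = np.linspace(-1.0, 1.0, 20)
    grad = lambda x: x - b
    print(np.linalg.norm(ef_gd(grad, np.zeros_like(b)) - b))                  # sign compressor
    print(np.linalg.norm(ef_gd(grad, np.zeros_like(b), compress=top_k) - b))  # top-k compressor

Both printed distances should be small; the memory term is what prevents the biased compressors from stalling the iterates away from the optimum.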
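
A minimal, illustrative sketch of the local SGD scheme from "Local SGD Converges Fast and Communicates Little": each worker runs several update steps on its own objective and the iterates are averaged only occasionally, so communication happens once per round instead of once per step. The quadratic worker objectives, the number of local steps, and the learning rate are toy assumptions, and exact gradients replace stochastic ones for brevity.

    import numpy as np

    def local_sgd(grads, x0, rounds=20, local_steps=10, lr=0.05):
        # Each worker runs `local_steps` of (here deterministic) gradient descent on
        # its own objective, then all workers average their iterates; only the
        # averaging step requires communication.
        workers = [x0.copy() for _ in grads]
        for _ in range(rounds):
            for i, grad in enumerate(grads):
                for _ in range(local_steps):
                    workers[i] = workers[i] - lr * grad(workers[i])
            avg = np.mean(workers, axis=0)        # single communication round
            workers = [avg.copy() for _ in grads]
        return workers[0]

    # Toy example: worker i holds the objective 0.5 * ||x - b_i||^2, so the
    # global optimum is the mean of the b_i.
    rng = np.random.default_rng(0)
    targets = [rng.normal(size=5) for _ in range(4)]
    grads = [(lambda x, b=b: x - b) for b in targets]
    x_hat = local_sgd(grads, np.zeros(5))
    print(np.linalg.norm(x_hat - np.mean(targets, axis=0)))   # should be small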
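
A minimal, illustrative sketch of greedy (steepest, Gauss-Southwell) coordinate selection in the smooth setting mentioned at the start of the "Efficient Greedy Coordinate Descent" abstract above; it does not implement the paper's subgradient-based rule for composite problems or the maximum-inner-product-search speedup. The least-squares objective, problem size, and step count are arbitrary choices for this sketch.

    import numpy as np

    def greedy_coordinate_descent(A, b, steps=2000):
        # Gauss-Southwell coordinate descent for f(x) = 0.5 * ||Ax - b||^2:
        # at each step, update the coordinate with the largest gradient magnitude.
        x = np.zeros(A.shape[1])
        L = (A ** 2).sum(axis=0)            # per-coordinate curvature ||A[:, i]||^2
        for _ in range(steps):
            g = A.T @ (A @ x - b)           # full gradient, recomputed for clarity;
                                            # efficient code would maintain the residual
            i = int(np.argmax(np.abs(g)))   # greedy (steepest) coordinate
            x[i] -= g[i] / L[i]             # exact minimization along coordinate i
        return x

    # Toy example on a random least-squares problem.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(40, 15))
    b = rng.normal(size=40)
    x_gcd = greedy_coordinate_descent(A, b)
    print(np.linalg.norm(A.T @ (A @ x_gcd - b)))   # gradient norm, should be small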

Preprints / Various

Theses

Talks / Posters / Workshops

Student Projects

  • Guillaume van Dessel, Stochastic gradient based methods for large-scale optimization: review, unification and new approaches (Master Thesis, 2019)
  • Ahmad Ajalloeian, Stochastic Zeroth-Order Optimisation Algorithms with Variance Reduction (Master Thesis, 2019)
  • Claire Capelo, Adaptive schemes for communication compression: Trajectory Normalized Gradients for Distributed Optimization (Master Project, 2019)
  • Aakash Sunil Lahoti, Special Cases of Local SGD (Master Project, 2019)
  • Cem Musluoglu, Quantization and Compression for Distributed Optimization (Master Project, 2018)
  • Quentin Rebjock, Error Feedback For SignSGD (Master Project, 2018) [Paper]
  • Jimi Vaubien, Derivative-Free Empirical Risk Minimization (Bachelor Project, 2018)
  • Polina Kirichenko, Zero-order training for DL (Summer Internship, 2018)
  • Nikhil Krishna, Adaptive Sampling with LSH (Summer Internship, 2018)
  • Alberto Chiappa, Real-time recognition of industry-specific context from mobile phone sensor data (Master Thesis, industry project with Schindler, 2018)
  • Jean-Baptiste Cordonnier, Distributed Machine Learning with Sparsified Stochastic Gradient Descent (Master Thesis, 2018) [Paper]
  • Arthur Deschamps, Asynchronous Methods in Machine Learning (Bachelor Project, 2018)
  • Chia-An Yu, Compression for Stochastic Gradient Descent (Master Project, 2018)
  • Frederik Künstner, Fully Quantized Distributed Gradient Descent (Master Project, 2017)
  • Fabian Latorre Gomez, Importance Sampling for ML (EDIC Project, 2017)
  • Pooja Kulkarni, Steepest Descent and Adaptive Sampling (Summer Internship, 2017)
  • Anastasia Koloskova, Steepest CD and LSH (Summer Internship, 2017) [Paper]
  • William Borgeaud Dit, Adaptive Sampling in Stochastic Coordinate Descent (Master Project, 2017)
  • Joel Castellon, Complexity analysis for AdaRBFGS: a primitive for methods between first and second order (Semester Project, 2017)
  • Alberto Chiappa, Asynchronous updates for stochastic gradient descent (Master Project, 2017)
  • Marina Mannari, Faster Coordinate Descent for Machine Learning through Limited Precision Operations (Master Thesis, 2017)
  • Anant Raj, Steepest Coordinate Descent (Internship, 2017) [Paper]

Teaching

Teaching Assistance