
Sebastian U. Stich

email: sebastian.stich@epfl.ch
address: EPFL Lausanne, Machine Learning and Optimization Laboratory, EPFL IC IINFCOM MLO INJ 337, Station 14, CH-1015 Lausanne

I am a scientist at EPFL, working with Prof. Martin Jaggi in the Machine Learning and Optimization Laboratory (MLO).

Research Interests


Reviewing for

Education and Work


Refereed Publications

Preprints / Various:

  • Stochastic Distributed Learning with Gradient Quantization and Variance Reduction, with Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, and Peter Richtárik. Technical Report, 2019.
    We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g. quantize or sparsify) the gradients, thereby introducing additional variance ω ≥ 1 that might slow down convergence. For strongly convex functions with condition number κ distributed among n machines, we (i) give a scheme that converges in O((κ + κω/n + ω) log(1/ε)) steps to a neighborhood of the optimal solution. For objective functions with a finite-sum structure, each worker having less than m components, we (ii) present novel variance reduced schemes that converge in O((κ + κω/n + ω + m) log(1/ε)) steps to arbitrary accuracy ε > 0. These are the first methods that achieve linear convergence for arbitrary quantized updates. We also (iii) give analysis for the weakly convex and non-convex cases and (iv) verify in experiments that our novel variance reduced schemes are more efficient than the baselines.
  • Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication, with Anastasia Koloskova and Martin Jaggi. Technical Report, 2019.
    We consider decentralized stochastic optimization with the objective function (e.g. data samples for a machine learning task) being distributed over n machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by ω ≤ 1 (ω = 1 meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate O(1/(nT) + 1/(Tδ²ω)²) for strongly convex objectives, where T denotes the number of iterations and δ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, O(1/(nT)), is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time O(1/(δ²ω) log(1/ε)) for accuracy ε > 0. This is (to the best of our knowledge) the first gossip algorithm that supports arbitrary compressed messages for ω > 0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms outperform the respective state-of-the-art baselines and that CHOCO-SGD can reduce communication by at least two orders of magnitude.
  • Error Feedback Fixes SignSGD and other Gradient Compression Schemes, with Sai Praneeth R. Karimireddy, Quentin Rebjock, and Martin Jaggi. Technical Report, 2019.
    Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm EF-SGD achieves the same rate of convergence as SGD without any additional assumptions for arbitrary compression operators (including the sign operator), indicating that we get gradient compression for free. Our experiments thoroughly substantiate the theory showing the superiority of our algorithm.
  • Don't Use Large Mini-Batches, Use Local SGD, with Tao Lin, Kumar Kshitij Patel, and Martin Jaggi. Technical Report, 2018.
  • Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients, with Sai Praneeth R. Karimireddy and Martin Jaggi. Technical Report, 2018.
  • SVRG meets SAGA: k-SVRG — A Tale of Limited Memory, with Anant Raj. Technical Report, 2018.
  • Stochastic continuum armed bandit problem of few linear parameters in high dimensions, with Hemant Tyagi and Bernd Gärtner. Technical Report, 2013.
  • Random Pursuit in Hilbert Space, with Bernd Gärtner. Technical Report CGL-TR-88, 2013.
  • Matrix-valued Iterative Random Projections, with Christian L. Müller and Bernd Gärtner. Technical Report CGL-TR-87, 2013.
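The compressed-gradient setting described in the abstracts above can be illustrated with a small sketch. Below, a hypothetical unbiased random-sparsification operator stands in for the generic unbiased compression operators considered in the quantization paper; the toy objective, step size, and sparsity level are illustrative assumptions, not the paper's setup:

```python
import random

random.seed(0)

def rand_sparsify(g, k):
    """Unbiased random sparsification: keep k of the d coordinates,
    scaled by d/k so that E[rand_sparsify(g)] = g."""
    d = len(g)
    kept = set(random.sample(range(d), k))
    return [gi * d / k if i in kept else 0.0 for i, gi in enumerate(g)]

def compressed_gd(grad, x0, lr, steps, k):
    """Gradient descent applying only the compressed gradient, as a
    single-worker stand-in for workers sending compressed updates."""
    x = list(x0)
    for _ in range(steps):
        g = rand_sparsify(grad(x), k)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Toy strongly convex objective f(x) = 0.5 * ||x - 1||^2 with gradient x - 1.
grad = lambda x: [xi - 1.0 for xi in x]
x = compressed_gd(grad, [0.0] * 10, lr=0.1, steps=2000, k=2)
```

Since the compressed gradient vanishes at the optimum, this toy run converges to the solution despite each step touching only 2 of the 10 coordinates.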
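The compressed gossip idea from the CHOCO abstract above can be sketched for the consensus case: nodes gossip using public copies of each other's values and transmit only compressed corrections to their own copy. This is a minimal scalar toy; the ring topology, mixing weights, step size, and the contraction compressor Q(v) = ω·v are all illustrative assumptions:

```python
def choco_gossip(x0, W, gamma, omega, steps):
    """Gossip averaging where nodes exchange only compressed corrections
    to public copies xhat, in the spirit of CHOCO-GOSSIP. The toy
    compressor Q(v) = omega * v satisfies ||Q(v) - v|| <= (1 - omega)||v||."""
    n = len(x0)
    x = list(x0)
    xhat = [0.0] * n  # publicly known (compressed) copies of each node's value
    for _ in range(steps):
        # gossip step: each node mixes using only the public copies
        x = [x[i] + gamma * sum(W[i][j] * (xhat[j] - xhat[i]) for j in range(n))
             for i in range(n)]
        # each node broadcasts a compressed correction to its public copy
        xhat = [xhat[i] + omega * (x[i] - xhat[i]) for i in range(n)]
    return x

# Ring of 4 nodes with symmetric, doubly stochastic mixing weights.
W = [[0.50, 0.25, 0.00, 0.25],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.25, 0.00, 0.25, 0.50]]
x = choco_gossip([4.0, 0.0, 2.0, 2.0], W, gamma=0.2, omega=0.5, steps=400)
# all nodes approach the global average 2.0
```

Because the gossip step uses a doubly stochastic W, the average of the x values is preserved at every iteration, so the nodes can only converge to the true mean.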
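The error-feedback mechanism from the EF-SGD abstract above admits a compact sketch: remember what the compressor discarded and add it back before the next compression. The scaled-sign compressor and the toy quadratic below are illustrative choices, not the paper's experimental setup:

```python
def scaled_sign(v):
    """Sign compressor scaled by the mean absolute value, so only the
    signs plus one scalar would need to be transmitted."""
    scale = sum(abs(c) for c in v) / len(v)
    return [scale if c >= 0 else -scale for c in v]

def ef_sgd(grad, x0, lr, steps):
    """Error-feedback SGD: the compression residual e is carried over
    and added to the next step before compressing again."""
    x = list(x0)
    e = [0.0] * len(x0)                                  # accumulated compression error
    for _ in range(steps):
        p = [lr * g + ei for g, ei in zip(grad(x), e)]   # error-corrected step
        d = scaled_sign(p)                               # what actually gets transmitted
        e = [pi - di for pi, di in zip(p, d)]            # remember the residual
        x = [xi - di for xi, di in zip(x, d)]
    return x

# Toy quadratic f(x) = 0.5 * ||x - b||^2 with gradient x - b.
b = [1.0, -2.0, 3.0]
grad = lambda x: [xi - bi for xi, bi in zip(x, b)]
x = ef_sgd(grad, [0.0, 0.0, 0.0], lr=0.1, steps=3000)
```

The key invariant is that the virtual iterate x − e performs exact gradient descent, so the bias of the sign compressor does not prevent convergence here, unlike plain signSGD.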


Talks / Posters / Workshops

Student Projects

  • Cem Musluoglu: Quantization and Compression for Distributed Optimization. Master Project, 2018.
  • Quentin Rebjock: Error Feedback For SignSGD. Master Project, 2018. Paper.
  • Jimi Vaubien: Derivative-Free Empirical Risk Minimization. Bachelor Project, 2018.
  • Polina Kirichenko: Zero-order training for DL. Summer Internship, 2018.
  • Nikhil Krishna: Adaptive Sampling with LSH. Summer Internship, 2018.
  • Alberto Chiappa: Real-time recognition of industry-specific context from mobile phone sensor data. Master Thesis (industry project with Schindler), 2018.
  • Jean-Baptiste Cordonnier: Distributed Machine Learning with Sparsified Stochastic Gradient Descent. Master Thesis, 2018. Paper.
  • Arthur Deschamps: Asynchronous Methods in Machine Learning. Bachelor Project, 2018.
  • Chia-An Yu: Compression for Stochastic Gradient Descent. Master Project, 2018.
  • Frederik Künstner: Fully Quantized Distributed Gradient Descent. Master Project, 2017.
  • Fabian Latorre Gomez: Importance Sampling for ML. EDIC Project, 2017.
  • Pooja Kulkarni: Steepest Descent and Adaptive Sampling. Summer Internship, 2017.
  • Anastasia Koloskova: Steepest CD and LSH. Summer Internship, 2017. Paper.
  • William Borgeaud Dit: Adaptive Sampling in Stochastic Coordinate Descent. Master Project, 2017.
  • Joel Castellon: Complexity analysis for AdaRBFGS: a primitive for methods between first and second order. Semester Project, 2017.
  • Alberto Chiappa: Asynchronous updates for stochastic gradient descent. Master Project, 2017.
  • Marina Mannari: Faster Coordinate Descent for Machine Learning through Limited Precision Operations. Master Thesis, 2017.
  • Anant Raj: Steepest Coordinate Descent. Internship, 2017. Paper.


Teaching Assistance