Research scientist at EPFL,
in the group of Prof. Martin Jaggi, Machine Learning and Optimization Laboratory (MLO).
in the group of Prof. Martin Jaggi, Machine Learning and Optimization Laboratory (MLO).
Research Interests
 machine learning—at the interface of theory and practice
 optimization—analysis of zeroth/first/secondorder serial/parallel/decentralized/federated + accelerated algorithms
(upcoming) Workshops

Advances in ML: Theory meets practice,
coorganized with Aymeric Dieuleveut
January 27, 2019, Lausanne, Switzerland. This workshop is a part of the Applied Machine Learning Days 2019.
Full Program and Abstracts 
Advances in ML: Theory meets practice,
coorganized with Dan Alistarh
January 28, 2018, Lausanne, Switzerland. This workshop is a part of the Applied Machine Learning Days 2018.
Full Program and Abstracts
Reviewing for
 Conferences: Genetic and Evolutionary Computation Conference (GECCO)191817161514International Conference on Artificial Intelligence and Statistics (AISTATS)19International Conference on Machine Learning (ICML)191817Neural Information Processing Systems (NIPS)1918Symposium on Discrete Algorithms (SODA)19Parallel Problem Solving from Nature (PPSN)18Symposium on the Theory of Computing (STOC)18
 Journals: European Journal of Operational Research (EJOR)IEEE Transactions on Evolutionary Computation (TEVC)Journal of Machine Learning Research (JMLR)Markov Processes And Related Fields (MPRF)Mathematical Programming (MAPR)Numerical Algebra, Control and Optimization (NACO)Optimization Methods and Software (OMS)SIAM Journal on Mathematics of Data Science (SIMODS)SIAM Journal on Optimization (SIOPT)
Education and Work
 From Nov 1 2014 to Okt, 31, 2016, I worked with Prof. Yurii Nesterov and Prof. François Glineur at the Center for Operations Research and Econometrics (CORE). I was also member of the Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM).
 From Sep 15, 2010 to Sep 30, 2014, I was a PHD student in Prof. Emo Welzl's research group, supervised by Prof. Bernd Gärtner and Christian Lorenz Müller.
 Until Jun 31, 2012, I was also member of Prof. Ivo Sbalzarini's MOSAIC group. The group now moved to the Max Planck Institute of Molecular Cell Biology and Genetics (MPICBG) in Dresden.
 From Sep 2005 to Mar 2010 I did my Bachelor and Master in Mathematics at ETH Zurich.
Publications
Refereed Publications
 Decentralized Stochastic Optimization and Gossip Algorithms with Compressed CommunicationAbstractVideoCodeBibIn:ICML2019We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning task) being distributed over n machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by ω≤1 (ω=1 meaning no compression). We (i) propose a novel gossipbased stochastic gradient descent algorithm, CHOCOSGD, that converges at rate O(1/(nT)+1/(Tδ2ω)2) for strongly convex objectives, where T denotes the number of iterations and δ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, O(1/(nT)), is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCOGOSSIP, for the average consensus problem that converges in time O(1/(δ2ω)log(1/ε)) for accuracy ε>0. This is (up to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for ω>0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms do outperform the respective stateoftheart baselines and CHOCOSGD can reduce communication by at least two orders of magnitudes.
 Error Feedback Fixes SignSGD and other Gradient Compression SchemesAbstractCodeBibIn:ICML2019Signbased algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counterexamples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using errorfeedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm EFSGD achieves the same rate of convergence as SGD without any additional assumptions for arbitrary compression operators (including the sign operator), indicating that we get gradient compression for free. Our experiments thoroughly substantiate the theory showing the superiority of our algorithm.
 Local SGD Converges Fast and Communicates LittleAbstractPosterBibIn:ICLR2019Minibatch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck recent works propose to reduce the communication frequency. An algorithm of this type is local SGD that runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but eluded thorough theoretical analysis.
We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as minibatch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and minibatch size. The number of communication rounds can be reduced up to a factor of T^{1/2}where T denotes the number of total stepscompared to minibatch SGD. This also holds for asynchronous implementations.
Local SGD can also be used for large scale training of deep learning models. The results shown here aim serving as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.  Efficient Greedy Coordinate Descent for Composite ProblemsAbstractBibIn:AISTATSPMLR 89, 2887–28962019Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension n, and requiring upto n times fewer iterations.
In this paper, we consider greedy updates that are based on subgradients for a class of nonsmooth composite problems, which includes L1regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of n, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy.
Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.  Sparsified SGD with MemoryAbstractPosterCodeBibIn:NeurIPS31, 4452–44632018Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (topk sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far.
In this work we analyze Stochastic Gradient Descent (SGD) with ksparsification or compression (for instance topk or randomk) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.  Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster SecondOrder OptimizationAbstractPosterBibIn:NeurIPS31, 1626–16362018We present the first accelerated randomized algorithm for solving linear systems in Euclidean spaces. One essential problem of this type is the matrix inversion problem. In particular, our algorithm can be specialized to invert positive definite matrices in such a way that all iterates (approximate solutions) generated by the algorithm are positive definite matrices themselves. This opens the way for many applications in the field of optimization and machine learning. As an application of our general theory, we develop the first accelerated (deterministic and stochastic) quasiNewton updates. Our updates lead to provably more aggressive approximations of the inverse Hessian, and lead to speedups over classical nonaccelerated rules in numerical experiments. Experiments with empirical risk minimization show that our rules can accelerate training of machine learning models.
 On Matching Pursuit and Coordinate DescentBibIn:ICMLPMLR 80, 3198–32072018
 Adaptive balancing of gradient and update computation times using global geometry and approximate subproblemsSlidesBibIn:AISTATSPMLR 84, 1204–12132018
 Safe Adaptive Importance SamplingVideoSlidesPosterBibSpotlight talk, In:NIPS30, 4381–43912017
 On the existence of ordinary trianglesBibComputational Geometry66, 28–312017
 Approximate Steepest Coordinate DescentPosterBibIn:ICMLPMLR 70, 3251–32592017
 Efficiency of accelerated coordinate descent method on structured optimization problemsBibSIAM Journal on Optimization27:1, 110–1232017
 Variable Metric Random PursuitBibMathematical Programming156(1), 549–5792016
 On two continuum armed bandit problems in high dimensionsBibTheory of Computing Systems58:1, 191–2222016
 On low complexity Acceleration Techniques for Randomized OptimizationSupplBibIn:PPSN XIIISpringer, 130–1402014
 Optimization of Convex Functions with Random PursuitBibSIAM Journal on Optimization23:2, 1284–13092013
 On spectral invariance of Randomized Hessian and Covariance Matrix Adaptation schemesPosterBibIn:PPSN XIISpringer, 448–4572012
 On Two Problems Regarding the Hamiltonian Cycle GameBibThe Electronic Journal of Combinatorics16(1)2009
Preprints / Various:
 Stochastic Distributed Learning with Gradient Quantization and Variance ReductionAbstractBibTechnical Report2019We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g.\ quantize or sparsify) the gradients, thereby introducing additional variance ω≥1 that might slow down convergence. For strongly convex functions with condition number κ distributed among n machines, we (i) give a scheme that converges in O((κ+κωn+ω) log(1/ϵ)) steps to a neighborhood of the optimal solution. For objective functions with a finitesum structure, each worker having less than m components, we (ii) present novel variance reduced schemes that converge in O((κ+κωn+ω+m)log(1/ϵ)) steps to arbitrary accuracy ϵ>0. These are the first methods that achieve linear convergence for arbitrary quantized updates. We also (iii) give analysis for the weakly convex and nonconvex cases and (iv) verify in experiments that our novel variance reduced schemes are more efficient than the baselines.
 Don't Use Large MiniBatches, Use Local SGDBibTechnical Report2018
 Global linear convergence of Newton's method without strongconvexity or Lipschitz gradientsBibTechnical Report2018
 SVRG meets SAGA: kSVRG — A Tale of Limited MemoryBibTechnical Report2018
 Stochastic continuum armed bandit problem of few linear parameters in high dimensionsBibTechnical Report2013
 Random Pursuit in Hilbert SpaceBibTechnical ReportCGLTR882013
 Matrixvalued Iterative Random ProjectionsBibTechnical ReportCGLTR872013
Theses
 Convex Optimization with Random PursuitBibPhD thesis in Theoretical Computer ScienceETH Zurich2014
 Graph sparsification and applicationsBibMaster thesis in MathematicsETH Zurich2010
 On two problems regarding the Hamilton Cycle GamePaperBibBachelor thesis in MathematicsETH Zurich2008
Talks / Posters / Workshops
 Gradient Compression with Error Feedback for Distributed Optimization4 June2019OwkinParisFrance
 Gradient Compression with Error Feedback for Distributed Optimization3 June2019Séminaire Parisien d'OptimisationParisFrance
 Local SGD Converges Fast and Communicates LittlePoster6–9 May2019International Conference on Learning Representations (ICLR)New OrleansUSA
 Communication Efficient SGD for Distributed Optimization(invited)20–22 March2019Operator Splitting Methods in Data Analysis WorkshopFlatiron Institute, New YorkUSA
 Error Feedback for Communication Efficient SGD(Machine Learning Seminar UChicago and TTIC)11–15 March2019research visit Nati Srebo, Toyota Technological Institute at Chicago (TTIC)ChicagoUSA
 Error Feedback for Communication Efficient SGD14 February2019MPITübingenGermany
 Advances in ML: Theory meets Practice(workshop organizer)Workshop26–19 January2019Applied Machine Learning DaysEPFL LausanneSwitzerland
 Sparsified SGD with MemoryPosterAccelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster SecondOrder OptimizationPoster2–8 December2018Thirtysecond Annual Conference on Neural Information Processing Systems (NeurIPS)MontréalCanada
 Communication Efficient Variants of SGD for Distributed Computing23 November2018Machine Learning meeting, EPFLLausanneSwitzerland
 Communication Efficient Variants of SGD for Distributed Computing(CS Graduate Seminar)kSVRG: Variance Reduction for Large Scale Optimization(All Hands Meetings on Big Data Optimization)23 October–14 November2018research visit Peter Richtárik, KAUSTThuwalKSA
 Communication Efficient Variants of SGD for Distributed Computing(CORE–INMA Seminar)2–14 September2018research visit François Glineur, Yurii Nesterov, UCLouvainLouvainlaNeuveBelgium
 Communication Efficient SGD through Quantization with Memory30 August2018MittagsseminarETH ZürichSwitzerland
 On Matching Pursuit and Coordinate Descent(talk by Francesco)10–15 July201835th International Conference on Machine Learning (ICML)StockholmSweden
 Approximate Composite Minimization: Convergence Rates and Examples2–6 July201823rd International Symposium on Mathematical Programming (ISMP)BordeauxFrance
 Adaptive balancing of gradient and update computation times(talk by Praneeth)9–11 April2018The 21st International Conference on Artificial Intelligence and Statistics (AISTATS)Playa Blanca, LanzaroteSpain
 19–20 February2018Swiss Joint Research Center Workshop 2018EPFL LausanneSwitzerland
 Advances in ML: Theory meets Practice(workshop organizer)Workshop27–30 January2018Applied Machine Learning DaysEPFL LausanneSwitzerland
 Adaptive importance sampling for convex optimization(invited)15–18 January2018Optimization for Learning WorkshopPontificia Universidad Católica de Chile, SantiagoChile
 Adaptive importance sampling for convex optimization2–13 January2018research visit Hemant Tyagi, Alan Touring InstituteLondonUK
 Safe Adaptive Importance Sampling4–9 December2017Thirtyfirst Annual Conference on Neural Information Processing Systems (NIPS)Long BeachUSAPoster
 27 November – 1 December2017Optimization, Statistics and Uncertainty Simons Institute for the Theory of Computing, BerkelyUSA
 Approximate Steepest Coordinate Descent3–21 October2017research visit Peter Richtárik, KAUSTThuwalKSA
 Approximate Steepest Coordinate Descent6–11 August201734th International Conference on Machine Learning (ICML)SidneyAustraliaPoster
 14–16 June2017The Summer Research Institute 2017EPFL LausanneSwitzerland
 12–14 June2017Google Machine Learning SummitGoogle ZürichSwitzerland
 Efficiency of Coordinate Descent on Structured Problems22–25 May2017SIAM Conference on Optimization (OP17)VancouverCanada
 2–3 February2017Swiss Joint Research Center Workshop 2017Microsoft CambridgeUK
 30–31 January2017Applied Machine Learning DaysEPFL LausanneSwitzerland
 5–10 December2016Thirtieth Annual Conference on Neural Information Processing Systems (NIPS)BarcelonaSpain
 Randomized Algorithms for Convex Optimization18 October2016Operations Research SeminarCORE, LouvainlaNeuveBelgium
 Efficiency of Random Search on Structured Problems6–11 August20165th International Conference on Continuous Optimization (ICCOPT)TokyoJapan
 6–10 June201614th Gremo Workshop on Open ProblemsSt. NiklausenSwitzerland
 Adaptive QuasiNewton Methods6 June20169th Combinatorial Algorithms DayETH ZürichSwitzerland
 7–12 February2016Optimization Without Borders—dedicated to the 60th birthday of Yuri NesterovLes HouchesFrance
 Efficient Methods for a Class of Truss Topology Design Problems28–29 January201630th annual conference of the Belgian Operational Research Society (ORBEL)LouvainlaNeuveBelgium
 23 November2015IAP DYSCO Study Day: Dynamical systems, control and optimizationLeuvenBelgium
 26 October – 20 November2015SOCN Grad Course: Randomized algorithms for big data optimizationLouvainlaNeuveBelgium
 Accelerated Random Search8–10 July201513th EUROPT Workshop on Advances in Continuous OptimizationEdinburghUK
 1–5 June201513th Gremo Workshop on Open ProblemsFeldisSwitzerland
 Solving generalized Laplacian linear systems1 June20158th Combinatorial Algorithms DayETH ZürichSwitzerland
 6–8 May2015Optimization and Big Data 2015, Workshop, Trek and ColloquiumEdinburghUK
 12 November2014IAP DYSCO Study Day: Dynamical systems, control and optimizationGentBelgium
 Optimization of convex function with Random Pursuit4 November2014Simons FoundationNew YorkUSA
 On Low Complexity Acceleration Techniques for Randomized Optimization13–17 September201413th International Conference on Parallel Problem Solving From Nature (PPSN)LjubljanaSlovenia
 30 June – 4 July201412th Gremo Workshop on Open ProblemsVal SinestraSwitzerland
 30 June20147th Combinatorial Algorithms DayETH ZürichSwitzerland
 Probabilistic Estimate Sequences4–5 April2014Workshop on Theory of Randomized Search Heuristics (ThRaSH) 2014JenaGermany
 Probabilistic Estimate Sequences25 March2014TAO reserach seminarINRIA Saclay, ÎledeFranceFrance
 Natural Gradient in Evolution Strategies17 October2013MittagsseminarETH ZürichSwitzerland
 Optimization and Learning with Random Pursuit30 September – 2 Oktober2013CG Learning Review MeetingAthensGreece
 30 June – 5 July2013Dagstuhl Seminar "Theory of Evolutionary Algorithms"WadernGermany
 24–28 June201311th Gremo Workshop on Open ProblemsAlp SellamattSwitzerland
 24 June20136th Combinatorial Algorithms DayETH ZürichSwitzerland
 Stochastic Bandits30 May2013MittagsseminarETH ZürichSwitzerland
 6 April – 8 May2013Research visit with Christian Müller and Jonathan GoodmanCourant Institute of Mathematical Sciences, New York UniversityUSA
 Variable Metric Random Pursuit13–14 December2012CG Learning Review MeetingBerlinGermany
 Variable Metric Random Pursuit4 December2012MittagsseminarETH ZürichSwitzerland
 On spectral invariance of Randomized Hessian and Covariance Matrix Adaptation schemesPoster1–5 September201212th International Conference on Parallel Problem Solving From Nature (PPSN)TaorminaItaly
 Convergence of Local Search19–24 August201221st International Symposium on Mathematical Programming (ISMP)TU BerlinGermany
 20–22 June201212th International Workshop on High Performance Optimization (HPOPT): Algorithmic convexity and applicationsDelft University of TechnologyThe Netherlands
 4–8 June201210th Gremo Workshop on Open ProblemsBergünSwitzerland
 4 June20125th Combinatorial Algorithms DayETH ZürichSwitzerland
 Convergence of Local Search2–3 May2012Workshop on Theory of Randomized Search Heuristics (ThRaSH) 2012Lille/Villeneuve d'AscqFrance
 The Heavy Ball Method3 April2012MittagsseminarETH ZürichSwitzerland
 Advertising Randomized derivativefree optimization11–14 March2012The First ETHJapan Workshop on Science and ComputingEngelbergSwitzerland
 7–9 March2012The First ETHJapan Symposium for the Promotion of Academic ExchangesETH ZürichSwitzerland
 Gradientfree optimization with Random Pursuit15 December2011CG Learning Review MeetingZürichSwitzerland
 Dimension reduction with the JohnsonLindenstrauss Lemma29 September2011MittagsseminarETH ZürichSwitzerland
 30 August – 2 September2011International Conference on Operations ResearchZürichSwitzerland
 Randomized DerivativeFree Optimization, a survery of different methodsPoster8–11 August2011MADALGO & CTIC Summer School on Highdimensional Geometric ComputingAarhus UniversityDanmark
 Random derivativefree optimization of convex functions using a line search oracle8–9 July2011Workshop on Theory of Randomized Search Heuristics (ThRaSH) 2011CopenhagenDanmark
 4–8 July201138th International Colloquium on Automata, Languages and Programming (ICALP)ZürichSwitzerland
 9–11 June2011CG Learning Kickoff WorkshopParisFrance
 28–30 March201127th European Workshop on Computational Geometry (EuroCG)MorschachSwitzerland
 7–10 February20119th Gremo Workshop on Open ProblemsWergensteinSwitzerland
 Principles of selfadaptation in randomized optimization16 December2010MittagsseminarETH ZürichSwitzerland
 Graph sparsification with applications11 March2010Institute for Operations Research (IFOR)ETH ZürichSwitzerland
Student Projects
 Cem MusluogluQuantization and Compression for Distributed OptimizationMaster Project2018
 Quentin RebjockError Feedback For SignSGDMaster Project2018Paper
 Jimi VaubienDerivativeFree Empirical Risk MinimizationBachelor Project2018
 Polina KirichenkoZeroorder training for DLSummer Internship2018
 Nikhil KrishnaAdaptive Sampling with LSHSummer Internship2018
 Alberto ChiappaRealtime recognition of industryspecific context from mobile phone sensor dataMaster Thesis (industry project with Schindler)2018
 JeanBaptiste CordonnierDistributed Machine Learning with Sparsified Stochastic Gradient DescentMaster Thesis2018Paper
 Arthur DeschampsAsynchronous Methods in Machine LearningBachelor Project2018
 ChiaAn YuCompression for Stochastic Gradient DescentMaster Project2018
 Frederik KünstnerFully Quantized Distributed Gradient DescentMaster Project2017
 Fabian Latorre GomezImportance Sampling for MLEDIC Project2017
 Pooja KulkarniSteepest Descent and Adaptive SamplingSummer Internship2017
 Anastasia KoloskovaSteepest CD and LSHSummer Internship2017Paper
 William Borgeaud DitAdaptive Sampling in Stochastic Coordinate DescentMaster Project2017
 Joel CastellonComplexity analysis for AdaRBFGS: a primitive for methods between ﬁrst and second orderSemester Project2017
 Alberto ChiappaAsynchronous updates for stochastic gradient descentMaster Project2017
 Marina MannariFaster Coordinate Descent for Machine Learning through Limited Precision OperationsMaster Thesis2017
 Anant RajSteepest Coordinate DescentInternship2017Paper
Teaching
Teaching Assistance
 Game TheorySpring 2016
 Algorithms LabFall 2013
 Informaticsfor mathematics and physics (in C++) (head assistant)Fall 2013
 Algorithms LabFall 2012
 Informaticsfor mathematics and physics (in C++)Fall 2012
 Approximation Algorithms and Semidefinite Programming(head assistant)Spring 2012
 Algorithms LabFall 2011
 Informaticsfor mathematics and physics (in C++)Fall 2010
 Analysis Ifor mechanical engineeringFall 2009
 Analysis IIfor mechanical engineeringSpring 2009
 Analysis Ifor mechanical engineeringFall 2008
 Analysis IIfor mechanical engineeringSpring 2008
 Complex AnalysisFall 2007