Research scientist at EPFL, Member of the European Lab for Learning and Intelligent Systems.
Hosted by Prof. Martin Jaggi, Machine Learning and Optimization Laboratory (MLO).
Hosted by Prof. Martin Jaggi, Machine Learning and Optimization Laboratory (MLO).
Research Interests
 methods for machine learning and statistics—at the interface of theory and practice
 collaborative learning (distributed, federated and decentralized methods)
 optimization for machine learning (adaptive stochastic methods and generalization performance)
Workshops (Organizer)

12th OPT Workshop on Optimization for Machine Learning, coorganizer
December 11, 2020, Virtual Event, Online. (together with NeurIPS). Submission Deadline October 8, 2020. 
Advances in ML: Theory meets practice,
coorganizer, with Aymeric Dieuleveut
January 26, 2020, Lausanne, Switzerland. This workshop is a part of the Applied Machine Learning Days 2020.
Reviewing for
 Conferences: Neural Information Processing Systems (NeurIPS)21 (area chair)201918International Symposium on Computational Geometry (SoCG)21International Workshop on Federated Learning for User Privacy and Data Confidentiality (FLICML)2120Genetic and Evolutionary Computation Conference (GECCO)20191817161514International Conference on Artificial Intelligence and Statistics (AISTATS)2019International Conference on Machine Learning (ICML)20191817NeurIPS Workshop on Optimization for Machine Learning (OPT)2019Parallel Problem Solving from Nature (PPSN)2018Symposium on Discrete Algorithms (SODA)19Symposium on the Theory of Computing (STOC)18
 Journals: Artificial Intelligence (ARTINT)European Journal of Operational Research (EJOR)IEEE Journal on Selected Areas in Information Theory (JSAIT)IEEE Transactions on Evolutionary Computation (TEVC)IEEE Transactions on Pattern Analysis and Machine Learning (TPAMI)IEEE Transactions on Signal Processing (TSP)Journal of Machine Learning Research (JMLR)Machine Learning (MACH)Markov Processes And Related Fields (MPRF)Mathematical Programming (MAPR)Numerical Algebra, Control and Optimization (NACO)Optimization Methods and Software (OMS)SIAM Journal on Mathematics of Data Science (SIMODS)SIAM Journal on Optimization (SIOPT)
Education and Work
 since June 2020 member of the European Lab for Learning and Intelligent Systems.
 since December 2016 research scientist at the Machine Learning and Optimization Laboratory at EPFL.
 From Nov 1 2014 to Okt, 31, 2016, I worked with Prof. Yurii Nesterov and Prof. François Glineur at the Center for Operations Research and Econometrics (CORE). I was also member of the Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM).
 From Sep 15, 2010 to Sep 30, 2014, I was a PHD student in Prof. Emo Welzl's research group, supervised by Prof. Bernd Gärtner and Christian Lorenz Müller.
 Until Jun 31, 2012, I was also member of Prof. Ivo Sbalzarini's MOSAIC group. The group now moved to the Max Planck Institute of Molecular Cell Biology and Genetics (MPICBG) in Dresden.
 From Sep 2005 to Mar 2010 I did my Bachelor and Master in Mathematics at ETH Zurich.
Publications
Refereed Publications
 QuasiGlobal Momentum: Accelerating Decentralized Deep Learning on Heterogeneous DataAbstractIn:ICML2021Decentralized training of deep learning models is a key element for enabling data privacy and ondevice learning over networks. In realistic learning scenarios, the presence of heterogeneity across different clients' local datasets poses an optimization challenge and may severely deteriorate the generalization performance. In this paper, we investigate and identify the limitation of several decentralized optimization algorithms for different degrees of data heterogeneity. We propose a novel momentumbased method to mitigate this decentralized training difficulty. We show in extensive empirical experiments on various CV/NLP datasets (CIFAR10, ImageNet, AG News, and SST2) and several network topologies (Ring and Social Network) that our method is much more robust to the heterogeneity of clients' data than other existing methods, by a significant improvement in test performance (1%−20%).
 Consensus Control for Decentralized Deep LearningAbstractIn:ICML2021Decentralized training of deep learning models enables ondevice learning over networks, as well as efficient scaling to large compute clusters. Experiments in earlier works reveal that, even in a datacenter setup, decentralized training often suffers from the degradation in the quality of the model: the training and test performance of models trained in a decentralized fashion is in general worse than that of models trained in a centralized fashion, and this performance drop is impacted by parameters such as network size, communication topology and data partitioning. We identify the changing consensus distance between devices as a key parameter to explain the gap between centralized and decentralized training. We show in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as the centralized counterpart. We empirically validate that the relation between generalization performance and consensus distance is consistent with this theoretical observation. Our empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop. To this end, we propose practical training guidelines for the datacenter setup as the important first step.
 Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous UpdatesAbstractIn:AISTATS2021It has been experimentally observed that the efficiency of distributed training with stochastic gradient (SGD) depends decisively on the batch size and  in asynchronous implementations  on the gradient staleness. Especially, it has been observed that the speedup saturates beyond a certain batch size and/or when the delays grow too large. We identify a datadependent parameter that explains the speedup saturation in both these settings. Our comprehensive theoretical analysis, for strongly convex, convex and nonconvex settings, unifies and generalized prior work directions that often focused on only one of these two aspects. In particular, our approach allows us to derive improved speedup results under frequently considered sparsity assumptions. Our insights give rise to theoretically based guidelines on how the learning rates can be adjusted in practice. We show that our results are tight and illustrate key findings in numerical experiments.
 LENA: CommunicationEfficient Distributed Learning with SelfTriggered Gradient UploadsAbstractIn:AISTATS2021In distributed optimization, parameter updates from the gradient computing node devices have to be aggregated in every iteration on the orchestrating server. When these updates are sent over an arbitrary commodity network, bandwidth and latency can be limiting factors. We propose a communication framework where nodes may skip unnecessary uploads. Every node locally accumulates an error vector in memory and selftriggers the upload of the memory contents to the parameter server using a significance filter. The server then uses a history of the nodes’ gradients to update the parameter. We characterize the convergence rate of our algorithm in smooth settings (stronglyconvex, convex, and nonconvex) and show that it enjoys the same convergence rate as when sending gradients every iteration, with substantially fewer uploads. Numerical experiments on real data indicate a significant reduction of used network resources (total communicated bits and latency), especially in large networks, compared to stateoftheart algorithms. Our results provide important practical insights for using machine learning over resourceconstrained networks, including InternetofThings and geoseparated datasets across the globe.
 A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free!AbstractIn:AISTATS2021Decentralized optimization methods enable ondevice training of machine learning models without a central coordinator. In many scenarios communication between devices is energy demanding and time consuming and forms the bottleneck of the entire system. We propose a new randomized firstorder method which tackles the communication bottleneck by applying randomized compression operators to the communicated messages. By combining our scheme with a new variance reduction technique that progressively throughout the iterations reduces the adverse effect of the injected quantization noise, we obtain a scheme that converges linearly on strongly convex decentralized problems while using compressed communication only. We prove that our method can solve the problems without any increase in the number of communications compared to the baseline which does not perform any communication compression while still allowing for a significant compression factor which depends on the conditioning of the problem and the topology of the network. Our key theoretical findings are supported by numerical experiments.
 Taming GANs with LookaheadMinmaxAbstractIn:ICLR2021Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single objective minimization only. The backtracking step of our Lookahead–minmax naturally handles the rotational game dynamics, a property which was identified to be key for enabling gradient ascent descent methods to converge on challenging examples often analyzed in the literature. Moreover, it implicitly handles high variance without using large minibatches, known to be essential for reaching state of the art performance. Experimental results on MNIST, SVHN, CIFAR10, and ImageNet demonstrate a clear advantage of combining Lookahead–minmax with Adam or extragradient, in terms of performance and improved stability, for negligible memory and computational cost. Using 30fold fewer parameters and 16fold smaller minibatches we outperform the reported performance of the classdependent BigGAN on CIFAR10 by obtaining FID of 12.19 without using the class labels, bringing stateoftheart GAN training within reach of common computational resources.
 Ensemble Distillation for Robust Model Fusion in Federated LearningAbstractBibIn:NeurIPS, 322020Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model while keeping the training data decentralized. In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side. However, directly averaging model parameters is only possible if all models have the same structure and size, which could be a restrictive constraint in many scenarios.
In this work we investigate more powerful and more flexible aggregation schemes for FL. Specifically, we propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients. This knowledge distillation technique mitigates privacy risk and cost to the same extent as the baseline FL algorithms, but allows flexible aggregation over heterogeneous client models that can differ e.g. in size, numerical precision or structure. We show in extensive empirical experiments on various CV/NLP datasets (CIFAR10/100, ImageNet, AG News, SST2) and settings (heterogeneous models/data) that the server model can be trained much faster, requiring fewer communication rounds than any existing FL technique so far.  The ErrorFeedback Framework: Better Rates for SGD with Delayed Gradients and Compressed CommunicationAbstractBibJournal of Machine Learning ResearchJMLR 21(237), 1–362020We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasiconvex and nonconvex functions and derive concise, nonasymptotic, convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for stronglyconvex quadratic functions only. We further show similar results for SGD with more intricate form of delayed gradients&emdash;compressed gradients under error compensation and for localSGD where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is in particular important for distributed parallel implementations, where asynchronous and communication efficient methods are the key to achieve linear speedups for optimization with multiple devices.
 A Unified Theory of Decentralized SGD with Changing Topology and Local UpdatesAbstractBibIn:ICML2020Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communicationefficiency. In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities. Our algorithmic framework covers local SGD updates and synchronous and pairwise gossip updates on adaptive network topology. We derive universal convergence rates for smooth (convex and nonconvex) problems and the rates interpolate between the heterogeneous (nonidentically distributed data) and iiddata settings, recovering linear convergence rates in many special cases, for instance for overparametrized models. Our proofs rely on weak assumptions (typically improving over prior work in several aspects) and recover (and improve) the best known complexity results for a host of important scenarios, such as for instance coorperative SGD and federated averaging (local SGD).
 Extrapolation for Largebatch Training in Deep LearningAbstractBibIn:ICML2020Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock faced when increasing the batch size to a substantial fraction of the training data for improving training time is the persistent degradation in performance (generalization gap). To address this issue, recent work propose to add small perturbations to the model parameters when computing the stochastic gradients and report improved generalization performance due to smoothing effects. However, this approach is poorly understood; it requires often modelspecific noise and finetuning. To alleviate these drawbacks, we propose to use instead computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. This principled approach is well grounded from an optimization perspective and we show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before whilst reaching or surpassing SOTA accuracy.
 Is Local SGD Better than Minibatch SGD?AbstractBibIn:ICML2020We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
 SCAFFOLD: Stochastic Controlled Averaging for OnDevice Federated LearningAbstractBibIn:ICML2020Federated learning is a key scenario in modern largescale machine learning. In that scenario, the training data remains distributed over a large number of clients, which may be phones, other mobile devices, or network sensors and a centralized model is learned without ever transmitting client data over the network. The standard optimization algorithm used in this scenario is Federated Averaging (FedAvg). However, when client data is heterogeneous, which is typical in applications, FedAvg does not admit a favorable convergence guarantee. This is because local updates on clients can drift apart, which also explains the slow convergence and hardtotune nature of FedAvg in practice. This paper presents a new Stochastic Controlled Averaging algorithm (SCAFFOLD) which uses control variates to reduce the drift between different clients. We prove that the algorithm requires significantly fewer rounds of communication and benefits from favorable convergence guarantees.
 Dynamic Model Pruning with FeedbackAbstractBibIn:ICLR2020Deep neural networks often have millions of parameters. This can hinder their deployment to lowend devices, not only due to high memory requirements but also because of increased latency at inference. We propose a novel model compression method that generates a sparse trained model without additional overhead: by allowing (i) dynamic allocation of the sparsity pattern and (ii) incorporating feedback signal to reactivate prematurely pruned weights we obtain a performant sparse model in one single training pass (retraining is not needed, but can further improve the performance). We evaluate the method on CIFAR10 and ImageNet, and show that the obtained sparse models can reach the stateoftheart performance of dense models and further that their performance surpasses all previously proposed pruning schemes (that come without feedback mechanisms).
 Decentralized Deep Learning with Arbitrary Communication CompressionAbstractBibIn:ICLR2020Decentralized training of deep learning models is a key element for enabling data privacy and ondevice learning over networks, as well as for efficient scaling to large compute clusters. As current approaches suffer from limited bandwidth of the network, we propose the use of communication compression in the decentralized training context. We show that ChocoSGD&emdash;recently introduced and analyzed for stronglyconvex objectives only&emdash;converges under arbitrary high compression ratio on general nonconvex functions at the rate O(1/√(nT)) where T denotes the number of iterations and n the number of workers. The algorithm achieves linear speedup in the number of workers and supports higher compression than previous stateofthe art methods. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over distributed user devices, connected by a social network and (ii) in a datacenter (outperforming allreduce timewise).
 Don't Use Large MiniBatches, Use Local SGDAbstractBibIn:ICLR2020Minibatch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the minibatch sizes have lead to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose a postlocal SGD and show that it significantly improves the generalization performance compared to largebatch training on standard benchmarks while enjoying the same efficiency (timetoaccuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance tradeoffs associated with a host of local SGD variants.
 Decentralized Stochastic Optimization and Gossip Algorithms with Compressed CommunicationAbstractVideoCodeBibIn:ICMLPMLR 97, 3478–34872019We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning task) being distributed over n machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by ω≤1 (ω=1 meaning no compression). We (i) propose a novel gossipbased stochastic gradient descent algorithm, CHOCOSGD, that converges at rate O(1/(nT)+1/(Tδ2ω)2) for strongly convex objectives, where T denotes the number of iterations and δ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, O(1/(nT)), is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCOGOSSIP, for the average consensus problem that converges in time O(1/(δ2ω)log(1/ε)) for accuracy ε>0. This is (up to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for ω>0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms do outperform the respective stateoftheart baselines and CHOCOSGD can reduce communication by at least two orders of magnitudes.
 Error Feedback Fixes SignSGD and other Gradient Compression SchemesAbstractCodeBibIn:ICMLPMLR 97, 3252–32612019Signbased algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counterexamples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using errorfeedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm EFSGD achieves the same rate of convergence as SGD without any additional assumptions for arbitrary compression operators (including the sign operator), indicating that we get gradient compression for free. Our experiments thoroughly substantiate the theory showing the superiority of our algorithm.
 Local SGD Converges Fast and Communicates LittleAbstractPosterBibIn:ICLR2019Minibatch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck recent works propose to reduce the communication frequency. An algorithm of this type is local SGD that runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but eluded thorough theoretical analysis.
We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as minibatch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and minibatch size. The number of communication rounds can be reduced up to a factor of T^{1/2}where T denotes the number of total stepscompared to minibatch SGD. This also holds for asynchronous implementations.
Local SGD can also be used for large scale training of deep learning models. The results shown here aim serving as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.  Efficient Greedy Coordinate Descent for Composite ProblemsAbstractBibIn:AISTATSPMLR 89, 2887–28962019Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension n, and requiring upto n times fewer iterations.
In this paper, we consider greedy updates that are based on subgradients for a class of nonsmooth composite problems, which includes L1regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of n, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy.
Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.  Sparsified SGD with MemoryAbstractPosterCodeBibIn:NeurIPS31, 4452–44632018Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (topk sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far.
In this work we analyze Stochastic Gradient Descent (SGD) with ksparsification or compression (for instance topk or randomk) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.  Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster SecondOrder OptimizationAbstractPosterBibIn:NeurIPS31, 1626–16362018We present the first accelerated randomized algorithm for solving linear systems in Euclidean spaces. One essential problem of this type is the matrix inversion problem. In particular, our algorithm can be specialized to invert positive definite matrices in such a way that all iterates (approximate solutions) generated by the algorithm are positive definite matrices themselves. This opens the way for many applications in the field of optimization and machine learning. As an application of our general theory, we develop the first accelerated (deterministic and stochastic) quasiNewton updates. Our updates lead to provably more aggressive approximations of the inverse Hessian, and lead to speedups over classical nonaccelerated rules in numerical experiments. Experiments with empirical risk minimization show that our rules can accelerate training of machine learning models.
 On Matching Pursuit and Coordinate DescentBibIn:ICMLPMLR 80, 3198–32072018
 Adaptive balancing of gradient and update computation times using global geometry and approximate subproblemsSlidesBibIn:AISTATSPMLR 84, 1204–12132018
 Safe Adaptive Importance SamplingVideoSlidesPosterBibSpotlight talk, In:NIPS30, 4381–43912017
 On the existence of ordinary trianglesBibComputational Geometry66, 28–312017
 Approximate Steepest Coordinate DescentPosterBibIn:ICMLPMLR 70, 3251–32592017
 Efficiency of accelerated coordinate descent method on structured optimization problemsBibSIAM Journal on Optimization27:1, 110–1232017
 Variable Metric Random PursuitBibMathematical Programming156(1), 549–5792016
 On two continuum armed bandit problems in high dimensionsBibTheory of Computing Systems58:1, 191–2222016
 On low complexity Acceleration Techniques for Randomized OptimizationSupplBibIn:PPSN XIIISpringer, 130–1402014
 Optimization of Convex Functions with Random PursuitBibSIAM Journal on Optimization23:2, 1284–13092013
 On spectral invariance of Randomized Hessian and Covariance Matrix Adaptation schemesPosterBibIn:PPSN XIISpringer, 448–4572012
 On Two Problems Regarding the Hamiltonian Cycle GameBibThe Electronic Journal of Combinatorics16(1)2009
Preprints / Various:
 On Communication Compression for Distributed Optimization on Heterogeneous DataAbstractTechnical Report2020Lossy gradient compression, with either unbiased or biased compressors, has become a key tool to avoid the communication bottleneck in centrally coordinated distributed training of machine learning models. We analyze the performance of two standard and general types of methods: (i) distributed quantized SGD (DQSGD) with arbitrary unbiased quantizers and (ii) distributed SGD with errorfeedback and biased compressors (DEFSGD) in the heterogeneous (noniid) data setting. Our results indicate that DEFSGD is much less affected than DQSGD by noniid data, but both methods can suffer a slowdown if dataskewness is high. We propose two alternatives that are not (or much less) affected by heterogenous data distributions: a new method that is only applicable to strongly convex problems, and we point out a more general approach that is applicable to linear compressors.
 Mime: Mimicking Centralized Stochastic Algorithms in Federated LearningAbstractTechnical Report2020Federated learning is a challenging optimization problem due to the heterogeneity of the data across different clients. Such heterogeneity has been observed to induce client drift and significantly degrade the performance of algorithms designed for this setting. In contrast, centralized learning with centrally collected data does not experience such drift, and has seen great empirical and theoretical progress with innovations such as momentum, adaptivity, etc. In this work, we propose a general framework Mime which mitigates clientdrift and adapts arbitrary centralized optimization algorithms (e.g. SGD, Adam, etc.) to federated learning. Mime uses a combination of controlvariates and serverlevel statistics (e.g. momentum) at every clientupdate step to ensure that each local update mimics that of the centralized method. Our thorough theoretical and empirical analyses strongly establish Mime's superiority over other baselines.
 Analysis of SGD with Biased Gradient EstimatorsAbstractIn:ICML 2020 Workshop: Beyond First Order Methods in ML Systems2020The classic stochastic gradient (SGD) algorithm crucially depends on the availability of unbiased stochastic gradient oracles. However, in certain applications only biased gradient updates are available or preferred over unbiased ones for performance reasons. For instance, in the domain of distributed learning, biased gradient compression techniques such as top$k$ compression have been proposed as a tool to alleviate the communication bottleneck and in derivativefree optimization, only biased gradient estimators can be queried.
We present a general framework to analyze SGD with biased gradient oracles and derive convergence rates for smooth (nonconvex) functions and give improved rates under the PolyakLojasiewicz condition. Our rates show how biased estimators impact the attainable convergence guarantees. We discuss a few guiding examples that show the broad applicability of our analysis.  Advances and Open Problems in Federated LearningAbstractBibTechnical Report2019Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
 Unified Optimal Analysis of the (Stochastic) Gradient MethodAbstractBibTechnical Report2019In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on μstrongly convex functions under a (milder than standard) Lsmoothness assumption. We show that SGD converges after T iterations as O(L∥x0−x*∥² exp[−μT/4L]+σ²/μT) where σ² measures the variance. For deterministic gradient descent (GD) and SGD in the interpolation setting we have σ²=0 and we recover the exponential convergence rate. The bound matches with the best known iteration complexity of GD and SGD, up to constants.
 Stochastic Distributed Learning with Gradient Quantization and Variance ReductionAbstractBibTechnical Report2019We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g. quantize or sparsify) the gradients, thereby introducing additional variance ω≥1 that might slow down convergence. For strongly convex functions with condition number κ distributed among n machines, we (i) give a scheme that converges in O((κ+κωn+ω) log(1/ϵ)) steps to a neighborhood of the optimal solution. For objective functions with a finitesum structure, each worker having less than m components, we (ii) present novel variance reduced schemes that converge in O((κ+κωn+ω+m)log(1/ϵ)) steps to arbitrary accuracy ϵ>0. These are the first methods that achieve linear convergence for arbitrary quantized updates. We also (iii) give analysis for the weakly convex and nonconvex cases and (iv) verify in experiments that our novel variance reduced schemes are more efficient than the baselines.
 Global linear convergence of Newton's method without strongconvexity or Lipschitz gradientsBibIn:NeurIPS 2019 Workshop: Beyond First Order Methods in ML2018
 kSVRG: Variance Reduction for Large Scale OptimizationBibTechnical Report2018
 Stochastic continuum armed bandit problem of few linear parameters in high dimensionsBibTechnical Report2013
 Random Pursuit in Hilbert SpaceBibTechnical ReportCGLTR882013
 Matrixvalued Iterative Random ProjectionsBibTechnical ReportCGLTR872013
Theses
 Convex Optimization with Random PursuitBibPhD thesis in Theoretical Computer ScienceETH Zurich2014
 Graph sparsification and applicationsBibMaster thesis in MathematicsETH Zurich2010
 On two problems regarding the Hamilton Cycle GamePaperBibBachelor thesis in MathematicsETH Zurich2008
Talks / Posters / Workshops
 TBA(invited session)20–23 July2021SIAM Conference on Optimization (OP21)Spokane, WashingtonUSA
 11 December202012th Workshop on Optimization for Machine Learning (OPT)(workshop organizer)Virtual, formerly VancouverOnlineWorkshop
 Towards Decentralized Machine Learning with SGD12 November2020SeminarETH ZurichSwitzerland
 A Unified Theory of Decentralized SGD with Changing Topology and Local Updates (talk by Anastasia Koloskova)VideoExtrapolation for Largebatch Training in Deep Learning (talk by Tao Lin)VideoIs Local SGD Better than Minibatch SGD? (talk by Blake Woodworth)VideoSCAFFOLD: Stochastic Controlled Averaging for Federated Learning (talk by Praneeth Karimireddy)Video12–18 July202037th International Conference on Machine Learning (ICML)Virtual conference, formerly ViennaOnline
 Decentralized Deep Learning with Arbitrary Communication Compression (talk by Anastasia Koloskova)VideoDon't Use Large Minibatches, Use Local SGD (talk by Tao Lin)VideoDynamic Model Pruning with Feedback (talk by Tao Lin)Video26 April–1st May2020International Conference on Learning Representations (ICLR)Virtual Conference, formerly Addis AbabaOnline
 27 February2020research visit Suvrit Sra, MITCambridge MAUSA
 A Unifying Framework for Communication Efficient Decentralized Machine Learning20 February2020MLO Group SeminarLausanneSwitzerland
 Advances in ML: Theory meets Practice (workshop organizer)Workshop25–29 January2020Applied Machine Learning DaysEPFL LausanneSwitzerland
 14 December201911th Workshop on Optimization for Machine Learning (OPT)(workshop organizer)VancouverCanadaWorkshop
 8–14 December2019Thirtythird Conference on Neural Information Processing Systems (NeurIPS)VancouverCanada
 23–26 November2019KAUSTTsinghua Industry Workshop on Advances in Artificial IntelligenceThuwalKSA
 17–29 November2019research visit Peter Richtárik, KAUSTThuwalKSA
 22–30 October2019research visit Yurii Nesterov, UCLouvainLouvainlaNeuveBelgium
 Tutorial On Theory of Federated Learning15 October2019MLO SeminarLausanneSwitzerland
 Communication Efficient Variants of SGD for Distributed Training(invited session)5–8 August20196th International Conference on Continuous Optimization (ICCOPT)BerlinGermany
 On Lottery Tickets in DL5 July2019EPFL, MLTheory SeminarLausanneSwitzerland
 Variance Reduced Methods in Derivative Free Optimization(invited session)24–26 June201930th European Conference on Operational Research (EURO)DublinIreland
 Decentralized Stochastic Optimization and Gossip Algorithms with Compressed CommunicationError Feedback Fixes SignSGD and other Gradient Compression Schemes(talk by Praneeth)9–15 June201936th International Conference on Machine Learning (ICML)Long BeachUSA
 Gradient Compression with Error Feedback for Distributed Optimization(invited)4 June2019OwkinParisFrance
 Gradient Compression with Error Feedback for Distributed Optimization(invited)3 June2019Séminaire Parisien d'OptimisationParisFrance
 Tutorial On Robust Optimization28 May2019MLO SeminarLausanneSwitzerland
 Local SGD Converges Fast and Communicates LittlePoster6–9 May2019International Conference on Learning Representations (ICLR)New OrleansUSA
 Communication Efficient SGD for Distributed Optimization(invited)20–22 March2019Operator Splitting Methods in Data Analysis WorkshopFlatiron Institute, New YorkUSA
 Error Feedback for Communication Efficient SGD(Machine Learning Seminar UChicago and TTIC, invited)11–15 March2019research visit Nati Srebo, Toyota Technological Institute at Chicago (TTIC)ChicagoUSA
 Error Feedback for Communication Efficient SGD(invited)14 February2019MPITübingenGermany
 Advances in ML: Theory meets Practice(workshop organizer)Workshop26–29 January2019Applied Machine Learning DaysEPFL LausanneSwitzerland
 Sparsified SGD with MemoryPosterAccelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster SecondOrder OptimizationPoster2–8 December2018Thirtysecond Annual Conference on Neural Information Processing Systems (NeurIPS)MontréalCanada
 Communication Efficient Variants of SGD for Distributed Computing23 November2018Machine Learning meeting, EPFLLausanneSwitzerland
 Communication Efficient Variants of SGD for Distributed Computing(CS Graduate Seminar, invited)kSVRG: Variance Reduction for Large Scale Optimization(All Hands Meetings on Big Data Optimization)23 October–14 November2018research visit Peter Richtárik, KAUSTThuwalKSA
 Tutorial On Convex and NonConvex Optimization26 September2018MLO SeminarLausanneSwitzerland
 Communication Efficient Variants of SGD for Distributed Computing(CORE–INMA Seminar, invited)2–14 September2018research visit François Glineur, Yurii Nesterov, UCLouvainLouvainlaNeuveBelgium
 Communication Efficient SGD through Quantization with Memory30 August2018MittagsseminarETH ZürichSwitzerland
 On Matching Pursuit and Coordinate Descent(talk by Francesco)10–15 July201835th International Conference on Machine Learning (ICML)StockholmSweden
 Approximate Composite Minimization: Convergence Rates and Examples(invited session)2–6 July201823rd International Symposium on Mathematical Programming (ISMP)BordeauxFrance
 Tutorial On Random Pursuit Methods13 June2018MLO SeminarLausanneSwitzerland
 Adaptive balancing of gradient and update computation times(talk by Praneeth)9–11 April2018The 21st International Conference on Artificial Intelligence and Statistics (AISTATS)Playa Blanca, LanzaroteSpain
 19–20 February2018Swiss Joint Research Center Workshop 2018EPFL LausanneSwitzerland
 Advances in ML: Theory meets Practice(workshop organizer)Workshop27–30 January2018Applied Machine Learning DaysEPFL LausanneSwitzerland
 Adaptive importance sampling for convex optimization(invited)15–18 January2018Optimization for Learning WorkshopPontificia Universidad Católica de Chile, SantiagoChile
 Adaptive importance sampling for convex optimization(invited)2–13 January2018research visit Hemant Tyagi, Alan Touring InstituteLondonUK
 Safe Adaptive Importance Sampling4–9 December2017Thirtyfirst Annual Conference on Neural Information Processing Systems (NIPS)Long BeachUSAPoster
 27 November – 1 December2017Optimization, Statistics and Uncertainty Simons Institute for the Theory of Computing, BerkelyUSA
 Approximate Steepest Coordinate Descent3–21 October2017research visit Peter Richtárik, KAUSTThuwalKSA
 Tutorial On Coordinate Descent30 August2017MLO SeminarLausanneSwitzerland
 Approximate Steepest Coordinate Descent6–11 August201734th International Conference on Machine Learning (ICML)SidneyAustraliaPoster
 14–16 June2017The Summer Research Institute 2017EPFL LausanneSwitzerland
 12–14 June2017Google Machine Learning SummitGoogle ZürichSwitzerland
 Efficiency of Coordinate Descent on Structured Problems(invited session)22–25 May2017SIAM Conference on Optimization (OP17)VancouverCanada
 2–3 February2017Swiss Joint Research Center Workshop 2017Microsoft CambridgeUK
 30–31 January2017Applied Machine Learning DaysEPFL LausanneSwitzerland
 5–10 December2016Thirtieth Annual Conference on Neural Information Processing Systems (NIPS)BarcelonaSpain
 Randomized Algorithms for Convex Optimization18 October2016Operations Research SeminarCORE, LouvainlaNeuveBelgium
 Efficiency of Random Search on Structured Problems(invited session)6–11 August20165th International Conference on Continuous Optimization (ICCOPT)TokyoJapan
 6–10 June201614th Gremo Workshop on Open ProblemsSt. NiklausenSwitzerland
 Adaptive QuasiNewton Methods(invited)6 June20169th Combinatorial Algorithms DayETH ZürichSwitzerland
 11 April–15 April2016research visit Peter Richtárik, University of EdinburghEdinburghUK
 7–12 February2016Optimization Without Borders—dedicated to the 60th birthday of Yuri NesterovLes HouchesFrance
 Efficient Methods for a Class of Truss Topology Design Problems28–29 January201630th annual conference of the Belgian Operational Research Society (ORBEL)LouvainlaNeuveBelgium
 23 November2015IAP DYSCO Study Day: Dynamical systems, control and optimizationLeuvenBelgium
 26 October – 20 November2015SOCN Grad Course: Randomized algorithms for big data optimizationLouvainlaNeuveBelgium
 Accelerated Random Search8–10 July201513th EUROPT Workshop on Advances in Continuous OptimizationEdinburghUK
 1–5 June201513th Gremo Workshop on Open ProblemsFeldisSwitzerland
 Solving generalized Laplacian linear systems(invited)1 June20158th Combinatorial Algorithms DayETH ZürichSwitzerland
 6–8 May2015Optimization and Big Data 2015, Workshop, Trek and ColloquiumEdinburghUK
 12 November2014IAP DYSCO Study Day: Dynamical systems, control and optimizationGentBelgium
 Optimization of convex function with Random Pursuit(invited)4 November2014Simons FoundationNew YorkUSA
 30 October–6 November2016research visit Christian Müller, Simons FoundationNew YorkUSA
 On Low Complexity Acceleration Techniques for Randomized Optimization13–17 September201413th International Conference on Parallel Problem Solving From Nature (PPSN)LjubljanaSlovenia
 30 June – 4 July201412th Gremo Workshop on Open ProblemsVal SinestraSwitzerland
 30 June20147th Combinatorial Algorithms DayETH ZürichSwitzerland
 Probabilistic Estimate Sequences4–5 April2014Workshop on Theory of Randomized Search Heuristics (ThRaSH) 2014JenaGermany
 Probabilistic Estimate Sequences(invited)25 March2014TAO reserach seminarINRIA Saclay, ÎledeFranceFrance
 Natural Gradient in Evolution Strategies17 October2013MittagsseminarETH ZürichSwitzerland
 Optimization and Learning with Random Pursuit30 September – 2 Oktober2013CG Learning Review MeetingAthensGreece
 30 June – 5 July2013Dagstuhl Seminar "Theory of Evolutionary Algorithms"WadernGermany
 24–28 June201311th Gremo Workshop on Open ProblemsAlp SellamattSwitzerland
 24 June20136th Combinatorial Algorithms DayETH ZürichSwitzerland
 Stochastic Bandits30 May2013MittagsseminarETH ZürichSwitzerland
 6 April – 8 May2013Research visit with Christian Müller and Jonathan GoodmanCourant Institute of Mathematical Sciences, New York UniversityUSA
 Variable Metric Random Pursuit13–14 December2012CG Learning Review MeetingBerlinGermany
 Variable Metric Random Pursuit4 December2012MittagsseminarETH ZürichSwitzerland
 On spectral invariance of Randomized Hessian and Covariance Matrix Adaptation schemesPoster1–5 September201212th International Conference on Parallel Problem Solving From Nature (PPSN)TaorminaItaly
 Convergence of Local Search19–24 August201221st International Symposium on Mathematical Programming (ISMP)TU BerlinGermany
 20–22 June201212th International Workshop on High Performance Optimization (HPOPT): Algorithmic convexity and applicationsDelft University of TechnologyThe Netherlands
 4–8 June201210th Gremo Workshop on Open ProblemsBergünSwitzerland
 4 June20125th Combinatorial Algorithms DayETH ZürichSwitzerland
 Convergence of Local Search2–3 May2012Workshop on Theory of Randomized Search Heuristics (ThRaSH) 2012Lille/Villeneuve d'AscqFrance
 The Heavy Ball Method3 April2012MittagsseminarETH ZürichSwitzerland
 Advertising Randomized derivativefree optimization11–14 March2012The First ETHJapan Workshop on Science and ComputingEngelbergSwitzerland
 7–9 March2012The First ETHJapan Symposium for the Promotion of Academic ExchangesETH ZürichSwitzerland
 Gradientfree optimization with Random Pursuit15 December2011CG Learning Review MeetingZürichSwitzerland
 Dimension reduction with the JohnsonLindenstrauss Lemma29 September2011MittagsseminarETH ZürichSwitzerland
 30 August – 2 September2011International Conference on Operations ResearchZürichSwitzerland
 Randomized DerivativeFree Optimization, a survery of different methodsPoster8–11 August2011MADALGO & CTIC Summer School on Highdimensional Geometric ComputingAarhus UniversityDanmark
 Random derivativefree optimization of convex functions using a line search oracle8–9 July2011Workshop on Theory of Randomized Search Heuristics (ThRaSH) 2011CopenhagenDanmark
 4–8 July201138th International Colloquium on Automata, Languages and Programming (ICALP)ZürichSwitzerland
 9–11 June2011CG Learning Kickoff WorkshopParisFrance
 28–30 March201127th European Workshop on Computational Geometry (EuroCG)MorschachSwitzerland
 7–10 February20119th Gremo Workshop on Open ProblemsWergensteinSwitzerland
 Principles of selfadaptation in randomized optimization16 December2010MittagsseminarETH ZürichSwitzerland
 Graph sparsification with applications11 March2010Institute for Operations Research (IFOR)ETH ZürichSwitzerland
Student Projects
 Oguz Kaan YükselTBAMaster Project2021
 Damian Dudzicz (now exchange student at ETH)PushSum Algorithms for Decentralized OptimizationMaster Project2020
 Harshvardhan HarshvardhanConvergenceDiagnostic based Adaptive Batch Sizes for SGDOptional Project2020
 Pavlo KaralupovImportance Sampling to Accelerate DL TrainingMaster Project2020
 Raphael BonattiOn Solving Combinatorial Games with Neural NetworksBachelor Project2020
 Sebastian BischoffImportance Sampling to Accelerate DL TrainingMaster Project2020
 Sadra BoreiriDecentralized Local SGDMaster Project2019Paper
 Damian Dudzicz (now exchange student at ETH)Decentralized Asynchronous SGDMaster Project2019
 Oriol Barbany Mayor (now at Logitec)AffineInvariant Robust TrainingMaster Project2019
 Rayan ElalamyConvergence Anlysis for Hogwild! Under Less Restrictive AssumptionsMaster Project2019
 Lionel ConstantinStochastic Extragradient for GAN TrainingMaster Project2019
 Harshvardhan HarshvardhanBetter bounds for Optimal Batch Size in SGDMaster Project2019
 Srijan Bansal (now at IIT Kharagpur)Adaptive Stepsize Strategies for SGDSummer Internship2019
 Guillaume van Dessel (now PhD student at UCLouvain)Stochastic gradient based methods for largescale optimization: review, unification and new approachesMaster Thesis2019
 Ahmad Ajalloeian (now PhD student at UNIL)Stochastic ZerothOrder Optimisation Algorithms with Variance ReductionMaster Thesis2019
 Claire Capelo (now exchange student at Stanford)Adaptive schemes for communication compression: Trajectory Normalized Gradients for Distributed OptimizationMaster Project2019
 Aakash Sunil Lahoti Special Cases of Local SGDMaster Project2019
 Cem Musluoglu (now PhD student at KU Leuven)Quantization and Compression for Distributed OptimizationMaster Project2018
 Quentin Rebjock (now PhD student at EPFL)Error Feedback For SignSGDMaster Project2018Paper
 Jimi Vaubien (now at Visum Technologies)DerivativeFree Empirical Risk MinimizationBachelor Project2018
 Polina Kirichenko (now PhD student at NYU)Zeroorder training for DLSummer Internship2018
 Nikhil Krishna (now at Detect Technologies)Adaptive Sampling with LSHSummer Internship2018
 Alberto Chiappa (now at Schindler Group)Realtime recognition of industryspecific context from mobile phone sensor dataMaster Thesis (industry project with Schindler)2018
 JeanBaptiste Cordonnier (now PhD student at EPFL)Distributed Machine Learning with Sparsified Stochastic Gradient DescentMaster Thesis2018Paper
 Arthur Deschamps (now research intern at National University of Singapore)Asynchronous Methods in Machine LearningBachelor Project2018
 ChiaAn Yu (now at Bloomberg)Compression for Stochastic Gradient DescentMaster Project2018
 Frederik Künstner (now PhD student at UBC)Fully Quantized Distributed Gradient DescentMaster Project2017
 Fabian Latorre Gomez (now PhD Student at EPFL)Importance Sampling for MLEDIC Project2017
 Pooja Kulkarni (now PhD student at UIUC)Steepest Descent and Adaptive SamplingSummer Internship2017
 Anastasia Koloskova (not PhD student at EPFL)Steepest CD and LSHSummer Internship2017Paper
 William Borgeaud dit AvocatAdaptive Sampling in Stochastic Coordinate DescentMaster Project2017
 Joel Castellon (now at Amazon)Complexity analysis for AdaRBFGS: a primitive for methods between ﬁrst and second orderSemester Project2017
 Alberto Chiappa (now at Schindler Group)Asynchronous updates for stochastic gradient descentMaster Project2017
 Marina Mannari (now at Cartier)Faster Coordinate Descent for Machine Learning through Limited Precision OperationsMaster Thesis2017
 Anant Raj (now PhD student at MPI)Steepest Coordinate DescentInternship2017Paper
Teaching
Teaching Assistance
 Game TheorySpring 2016
 Algorithms LabFall 2013
 Informaticsfor mathematics and physics (in C++) (head assistant)Fall 2013
 Algorithms LabFall 2012
 Informaticsfor mathematics and physics (in C++)Fall 2012
 Approximation Algorithms and Semidefinite Programming(head assistant)Spring 2012
 Algorithms LabFall 2011
 Informaticsfor mathematics and physics (in C++)Fall 2010
 Analysis Ifor mechanical engineeringFall 2009
 Analysis IIfor mechanical engineeringSpring 2009
 Analysis Ifor mechanical engineeringFall 2008
 Analysis IIfor mechanical engineeringSpring 2008
 Complex AnalysisFall 2007