News & Open Positions

I just moved to
CISPA Helmholtz Center for Information Security as a tenuretrack faculty!

I am looking for a strong postdocs, PhD students, interns, master theses, or other research collaborations possible (see here).
If you are interested in applying for a position, please read this page and send me an email with your CV, research interests and motivation.
 [Teaching Summer 22]: I am teaching an advanced lecture on Optimization for Machine Learning at Saarland University.
 [Teaching Winter 21]: I am teaching a seminar on Topics in Optimization for Machine Learning
at Saarland University.
 [March 1315, 22] Invited Talk at the KAUST AI Symposium organized by Jürgen Schmidhuber. Anastasia talks about our recent work on [Gradient Tracking] and Grigory Malinovsky on [ProxSkip].
 [Nov 10, 21] Invited talk on Decentralized Deep Learning [Slides] [Video] at the All Russian Optimization seminar (head B. Polyak).
 [Nov 4, 21] Invited talk at the LEAPS Seminar at Johns Hopkins.
 [Oct 27, 21] Invited talk on Bidirectional Compression for FL [Slides] [Video] at the Federated Learning One World (FLOW) Seminar.
 [June 24, 21] Invited talk on Efficient Federated Learning Algorithms [Slides] at the FLICML 2021 Workshop.
 [June 20, 21] Invited talk on Asynchronous SGD [Slides] at SIOP21 (SIAM Conference on Optimization).
Research Interests
 optimization for machine learning
 methods for collaborative learning (distributed, federated and decentralized methods)
 efficient optimization methods
 adaptive stochastic methods and generalization performance
 theory of deep learning
 privacy and security in machine learning
Reviewing for
 Conferences:
Conference on Computer Vision and Pattern Recognition (CVPR)22International Conference on Artificial Intelligence and Statistics (AISTATS)222019Neural Information Processing Systems (NeurIPS)21 (area chair)201918Genetic and Evolutionary Computation Conference (GECCO)2120191817161514International Symposium on Computational Geometry (SoCG)21International Conference on Machine Learning (ICML)20191817Parallel Problem Solving from Nature (PPSN)2018Symposium on Discrete Algorithms (SODA)19Symposium on the Theory of Computing (STOC)18
 Journals:
Annals of Statistics (AOS)Artificial Intelligence (ARTINT)European Journal of Operational Research (EJOR)IEEE Journal on Selected Areas in Information Theory (JSAIT)IEEE Transactions on Evolutionary Computation (TEVC)IEEE Transactions on Pattern Analysis and Machine Learning (TPAMI)IEEE Transactions on Signal Processing (TSP)Journal of Machine Learning Research (JMLR)Machine Learning (MACH)Markov Processes And Related Fields (MPRF)Mathematical Programming (MAPR)Numerical Algebra, Control and Optimization (NACO)Optimization Methods and Software (OMS)SIAM Journal on Mathematics of Data Science (SIMODS)SIAM Journal on Optimization (SIOPT)
 Workshops:
International Workshop on Trustable, Verifiable and Auditable Federated Learning in Conjunction with AAAI 2022 (FLAAAI22)22International Workshop on Federated Learning for User Privacy and Data Confidentiality (FLICML)2120NeurIPS Workshop on Optimization for Machine Learning (OPT)212019
Publications
Refereed Publications
 Simultaneous Training of Partially Masked Neural NetworksAmirkeivan MohtashamiMartin JaggiSebastian U. StichAbstractAISTATS2022
For deploying deep learning models to lower end devices, it is necessary to train less resourcedemanding variants of stateoftheart architectures. This does not eliminate the need for more expensive models as they have a higher performance. In order to avoid training two separate models, we show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be splitoff from the trained full network with remarkable good performance. We extend on prior methods that focused only on core networks of smaller width, while we focus on supporting arbitrary core network architectures. Our proposed training scheme switches consecutively between optimizing only the core part of the network and the full one. The accuracy of the full model remains comparable, while the core network achieves better performance than when it is trained in isolation. In particular, we show that training a Transformer with a lowrank core gives a lowrank model with superior performance than when training the lowrank model alone. We analyze our training scheme theoretically, and show its convergence under assumptions that are either standard or practically justified. Moreover, we show that the developed theoretical framework allows analyzing many other partial training schemes for neural networks.
 The Peril of Popular Deep Learning Uncertainty Estimation MethodsYehao LiuMatteo PagliardiniTatjana ChavdarovaSebastian U. StichAbstractCodeNeurIPS 2021 Workshop: Bayesian Deep Learning2021
Uncertainty estimation (UE) techniques  such as the Gaussian process (GP), Bayesian neural networks (BNN), Monte Carlo dropout (MCDropout)  aim to improve the interpretability of machine learning models by assigning an estimated uncertainty value to each of their prediction outputs. However, since too high uncertainty estimates can have fatal consequences in practice, this paper analyzes the above techniques.
Firstly, we show that GP methods always yield high uncertainty estimates on out of distribution (OOD) data. Secondly, we show on a 2D toy example that both BNNs and MCDropout do not give high uncertainty estimates on OOD samples. Finally, we show empirically that this pitfall of BNNs and MCDropout holds on real world datasets as well. Our insights (i) raise awareness for the more cautious use of currently popular UE methods in Deep Learning, (ii) encourage the development of UE methods that approximate GPbased methods  instead of BNNs and MCDropout, and (iii) our empirical setups can be used for verifying the OOD performances of any other UE method.
 An Improved Analysis of Gradient Tracking for Decentralized Machine LearningAnastasia KoloskovaTao LinSebastian U. StichAbstractPosterNeurIPS342021
We consider decentralized machine learning over a network where the training data is distributed across agents, each of which can compute stochastic model updates on their local data. The agent's common goal is to find a model that minimizes the average of all local loss functions. While gradient tracking (GT) algorithms can overcome a key challenge, namely accounting for differences between workers' local data distributions, the known convergence rates for GT algorithms are not optimal with respect to their dependence on the mixing parameter (related to the spectral gap of the connectivity matrix).
We provide a tighter analysis of the GT method in the stochastic strongly convex, convex and nonconvex settings. We improve the dependency on p from p to pc in the noiseless case and from p to pc in the general stochastic case, where c>p is related to the negative eigenvalues of the connectivity matrix (and is a constant in most practical applications). This improvement was possible due to a new proof technique which could be of independent interest.
 RelaySum for Decentralized Deep Learning on Heterogeneous DataThijs VogelsLie HeAnastasia KoloskovaTao LinSai Praneeth R. KarimireddySebastian U. StichMartin JaggiAbstractCodeNeurIPS342021
In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without alltoall connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers. A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions. To tackle this challenge, we introduce the RelaySum mechanism for information propagation in decentralized learning. RelaySum uses spanning trees to distribute information exactly uniformly across all workers with finite delays depending on the distance between nodes. In contrast, the typical gossip averaging mechanism only distributes data uniformly asymptotically while using the same communication volume per step as RelaySum. We prove that RelaySGD, based on this mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data.
 Breaking the centralized barrier for crossdevice federated learning (aka MIME)Sai Praneeth R. KarimireddyMartin JaggiSatyen KaleMehryar MohriSashank J. ReddiSebastian U. StichAnanda T. SureshAbstractCodeNeurIPS342021
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon. In fact, obtaining an algorithm for FL which is uniformly better than simple centralized training has been a major open problem thus far. In this work, we propose a general algorithmic framework, Mime, which i) mitigates client drift and ii) adapts arbitrary centralized optimization algorithms such as momentum and Adam to the crossdevice federated learning setting. Mime uses a combination of controlvariates and serverlevel statistics (e.g. momentum) at every clientupdate step to ensure that each local update mimics that of the centralized method run on iid data. We prove a reduction result showing that Mime can translate the convergence of a generic algorithm in the centralized setting into convergence in the federated setting. Further, we show that when combined with momentum based variance reduction, Mime is provably faster than any centralized methodthe first such result. We also perform a thorough experimental exploration of Mime's performance on real world datasets.
 Semantic Perturbations with Normalizing Flows for Improved GeneralizationOğuz Kaan YükselSebastian U. StichMartin JaggiTatjana ChavdarovaAbstractICCV (extended abstract in: ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models)2021
Data augmentation is a widely adopted technique for avoiding overfitting when
training deep neural networks. However, this approach requires domainspecific
knowledge and is often limited to a fixed set of hardcoded transformations.
Recently, several works proposed to use generative models for generating
semantically meaningful perturbations to train a classifier. However, because
accurate encoding and decoding are critical, these methods, which use
architectures that approximate the latentvariable inference, remained limited
to pilot studies on small datasets. Exploiting the exactly reversible encoderdecoder structure of normalizing
flows, we perform onmanifold perturbations in the latent space to define fully
unsupervised data augmentations. We demonstrate that such perturbations match
the performance of advanced data augmentation techniques  reaching 96.6% test
accuracy for CIFAR10 using ResNet18 and outperform existing methods,
particularly in low data regimes  yielding 1025% relative improvement of
test accuracy from classical training. We find that our latent adversarial
perturbations adaptive to the classifier throughout its training are most
effective, yielding the first test accuracy improvement results on realworld
datasets  CIFAR10/100  via latentspace perturbations.
 QuasiGlobal Momentum: Accelerating Decentralized Deep Learning on Heterogeneous DataTao LinSai Praneeth R. KarimireddySebastian U. StichMartin JaggiAbstractVideoICML2021
Decentralized training of deep learning models is a key element for enabling data privacy and ondevice learning over networks.
In realistic learning scenarios, the presence of heterogeneity across different clients' local datasets poses an optimization
challenge and may severely deteriorate the generalization performance.
In this paper, we investigate and identify the limitation of several decentralized optimization algorithms for different degrees
of data heterogeneity. We propose a novel momentumbased method to mitigate this decentralized training difficulty. We show in extensive
empirical experiments on various CV/NLP datasets (CIFAR10, ImageNet, AG News, and SST2) and several network topologies (Ring and
Social Network) that our method is much more robust to the heterogeneity of clients' data than other existing methods, by a
significant improvement in test performance (1%−20%).
 Consensus Control for Decentralized Deep LearningLingjing KongTao LinAnastasia KoloskovaMartin JaggiSebastian U. StichAbstractVideoICML2021
Decentralized training of deep learning models enables ondevice learning over networks, as well as efficient scaling to large compute clusters. Experiments in earlier works reveal that, even in a datacenter setup, decentralized training often suffers from the degradation in the quality of the model: the training and test performance of models trained in a decentralized fashion is in general worse than that of models trained in a centralized fashion, and this performance drop is impacted by parameters such as network size, communication topology and data partitioning.
We identify the changing consensus distance between devices as a key parameter to explain the gap between centralized and decentralized training. We show in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as the centralized counterpart. We empirically validate that the relation between generalization performance and consensus distance is consistent with this theoretical observation. Our empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop. To this end, we propose practical training guidelines for the datacenter setup as the important first step.
 Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous UpdatesSebastian U. StichAmirkeivan MohtashamiMartin JaggiAbstractVideoAISTATS2021
It has been experimentally observed that the efficiency of distributed training with stochastic gradient (SGD) depends decisively on the batch size and  in asynchronous implementations  on the gradient staleness. Especially, it has been observed that the speedup saturates beyond a certain batch size and/or when the delays grow too large. We identify a datadependent parameter that explains the speedup saturation in both these settings. Our comprehensive theoretical analysis, for strongly convex, convex and nonconvex settings, unifies and generalized prior work directions that often focused on only one of these two aspects. In particular, our approach allows us to derive improved speedup results under frequently considered sparsity assumptions. Our insights give rise to theoretically based guidelines on how the learning rates can be adjusted in practice. We show that our results are tight and illustrate key findings in numerical experiments.
 LENA: CommunicationEfficient Distributed Learning with SelfTriggered Gradient UploadsHossein Shokri GhadikolaeiSebastian U. StichMartin JaggiAbstractVideoAISTATS2021
In distributed optimization, parameter updates from the gradient computing node devices have to be aggregated in every iteration on the orchestrating server. When these updates are sent over an arbitrary commodity network, bandwidth and latency can be limiting factors. We propose a communication framework where nodes may skip unnecessary uploads. Every node locally accumulates an error vector in memory and selftriggers the upload of the memory contents to the parameter server using a significance filter. The server then uses a history of the nodes’ gradients to update the parameter. We characterize the convergence rate of our algorithm in smooth settings (stronglyconvex, convex, and nonconvex) and show that it enjoys the same convergence rate as when sending gradients every iteration, with substantially fewer uploads. Numerical experiments on real data indicate a significant reduction of used network resources (total communicated bits and latency), especially in large networks, compared to stateoftheart algorithms. Our results provide important practical insights for using machine learning over resourceconstrained networks, including InternetofThings and geoseparated datasets across the globe.
 A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free!Dmitry KovalevAnastasia KoloskovaMartin JaggiPeter RichtárikSebastian U. StichAbstractVideoAISTATS2021
Decentralized optimization methods enable ondevice training of machine learning models without a central coordinator. In many scenarios communication between devices is energy demanding and time consuming and forms the bottleneck of the entire system.
We propose a new randomized firstorder method which tackles the communication bottleneck by applying randomized compression operators to the communicated messages. By combining our scheme with a new variance reduction technique that progressively throughout the iterations reduces the adverse effect of the injected quantization noise, we obtain a scheme that converges linearly on strongly convex decentralized problems while using compressed communication only.
We prove that our method can solve the problems without any increase in the number of communications compared to the baseline which does not perform any communication compression while still allowing for a significant compression factor which depends on the conditioning of the problem and the topology of the network. Our key theoretical findings are supported by numerical experiments.
 Taming GANs with LookaheadMinmaxTatjana ChavdarovaMatteo PagliardiniSebastian U. StichFrançois FleuretMartin JaggiAbstractVideoICLR2021
Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single objective minimization only. The backtracking step of our Lookahead–minmax naturally handles the rotational game dynamics, a property which was identified to be key for enabling gradient ascent descent methods to converge on challenging examples often analyzed in the literature. Moreover, it implicitly handles high variance without using large minibatches, known to be essential for reaching state of the art performance. Experimental results on MNIST, SVHN, CIFAR10, and ImageNet demonstrate a clear advantage of combining Lookahead–minmax with Adam or extragradient, in terms of performance and improved stability, for negligible memory and computational cost. Using 30fold fewer parameters and 16fold smaller minibatches we outperform the reported performance of the classdependent BigGAN on CIFAR10 by obtaining FID of 12.19 without using the class labels, bringing stateoftheart GAN training within reach of common computational resources.
 Ensemble Distillation for Robust Model Fusion in Federated LearningTao LinLingjing KongSebastian U. StichMartin JaggiAbstractVideoNeurIPS332020
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model while keeping the training data decentralized. In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side. However, directly averaging model parameters is only possible if all models have the same structure and size, which could be a restrictive constraint in many scenarios.
In this work we investigate more powerful and more flexible aggregation schemes for FL. Specifically, we propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients. This knowledge distillation technique mitigates privacy risk and cost to the same extent as the baseline FL algorithms, but allows flexible aggregation over heterogeneous client models that can differ e.g. in size, numerical precision or structure. We show in extensive empirical experiments on various CV/NLP datasets (CIFAR10/100, ImageNet, AG News, SST2) and settings (heterogeneous models/data) that the server model can be trained much faster, requiring fewer communication rounds than any existing FL technique so far.
 The ErrorFeedback Framework: Better Rates for SGD with Delayed Gradients and Compressed CommunicationSebastian U. StichSai Praneeth R. KarimireddyAbstractBibJournal of Machine Learning ResearchJMLR 21(237), 1–362020
We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasiconvex and nonconvex functions and derive concise, nonasymptotic, convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for stronglyconvex quadratic functions only. We further show similar results for SGD with more intricate form of delayed gradients&emdash;compressed gradients under error compensation and for localSGD where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is in particular important for distributed parallel implementations, where asynchronous and communication efficient methods are the key to achieve linear speedups for optimization with multiple devices.
 A Unified Theory of Decentralized SGD with Changing Topology and Local UpdatesAnastasia KoloskovaNicolas LoizouSadra BoreiriMartin JaggiSebastian U. StichAbstractVideoICML2020
Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communicationefficiency. In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities.
Our algorithmic framework covers local SGD updates and synchronous and pairwise gossip updates on adaptive network topology. We derive universal convergence rates for smooth (convex and nonconvex) problems and the rates interpolate between the heterogeneous (nonidentically distributed data) and iiddata settings, recovering linear convergence rates in many special cases, for instance for overparametrized models. Our proofs rely on weak assumptions (typically improving over prior work in several aspects) and recover (and improve) the best known complexity results for a host of important scenarios, such as for instance coorperative SGD and federated averaging (local SGD).
 Extrapolation for Largebatch Training in Deep LearningTao LinLingjing KongSebastian U. StichMartin JaggiAbstractVideoICML2020
Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock faced when increasing the batch size to a substantial fraction of the training data for improving training time is the persistent degradation in performance (generalization gap). To address this issue, recent work propose to add small perturbations to the model parameters when computing the stochastic gradients and report improved generalization performance due to smoothing effects. However, this approach is poorly understood; it requires often modelspecific noise and finetuning. To alleviate these drawbacks, we propose to use instead computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. This principled approach is well grounded from an optimization perspective and we show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before whilst reaching or surpassing SOTA accuracy.
 Is Local SGD Better than Minibatch SGD?Blake WoodworthKumar Kshitij PatelSebastian U. StichZhen DaiBrian BullinsH. Brendan McMahanOhad ShamirNathan SrebroAbstractVideoICML2020
We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
 Advances and Open Problems in Federated LearningPeter KairouzH. Brendan McMahanBrendan AventAurélien BelletMehdi BennisArjun Nitin BhagojiKeith BonawitzZachary CharlesGraham CormodeRachel CummingsRafael G.L. D'OliveiraHubert EichnerSalim El RouayhebDavid EvansJosh GardnerZachary GarrettAdrià GascónBadih GhaziPhillip B. GibbonsMarco GruteserZaid HarchaouiChaoyang HeLie HeZhouyuan HuoBen HutchinsonJustin HsuMartin JaggiTara JavidiGauri JoshiMikhail KhodakJakub KonečnýAleksandra KorolovaFarinaz KoushanfarSanmi KoyejoTancrède LepointYang LiuPrateek MittalMehryar MohriRichard NockAyfer ÖzgürRasmus PaghMariana RaykovaHang QiDaniel RamageRamesh RaskarDawn SongWeikang SongSebastian U. StichZiteng SunAnanda T. SureshFlorian TramèrPraneeth VepakommaJianyu WangLi XiongZheng XuQiang YangFelix X. YuHan YuSen ZhaoAbstractFoundations and Trends® in Machine Learning14(1–2), 1–2102021
Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
 SCAFFOLD: Stochastic Controlled Averaging for OnDevice Federated LearningSai Praneeth R. KarimireddySatyen KaleMehryar MohriSashank J. ReddiSebastian U. StichAnanda T. SureshAbstractVideoICML2020
Federated learning is a key scenario in modern largescale machine learning. In that scenario, the training data remains distributed over a large number of clients, which may be phones, other mobile devices, or network sensors and a centralized model is learned without ever transmitting client data over the network. The standard optimization algorithm used in this scenario is Federated Averaging (FedAvg). However, when client data is heterogeneous, which is typical in applications, FedAvg does not admit a favorable convergence guarantee. This is because local updates on clients can drift apart, which also explains the slow convergence and hardtotune nature of FedAvg in practice. This paper presents a new Stochastic Controlled Averaging algorithm (SCAFFOLD) which uses control variates to reduce the drift between different clients. We prove that the algorithm requires significantly fewer rounds of communication and benefits from favorable convergence guarantees.
 Dynamic Model Pruning with FeedbackTao LinSebastian U. StichLuis BarbaDaniil DmitrievMartin JaggiAbstractBibICLR2020
Deep neural networks often have millions of parameters. This can hinder their deployment to lowend devices, not only due to high memory requirements but also because of increased latency at inference. We propose a novel model compression method that generates a sparse trained model without additional overhead: by allowing (i) dynamic allocation of the sparsity pattern and (ii) incorporating feedback signal to reactivate prematurely pruned weights we obtain a performant sparse model in one single training pass (retraining is not needed, but can further improve the performance). We evaluate the method on CIFAR10 and ImageNet, and show that the obtained sparse models can reach the stateoftheart performance of dense models and further that their performance surpasses all previously proposed pruning schemes (that come without feedback mechanisms).
 Decentralized Deep Learning with Arbitrary Communication CompressionAnastasia KoloskovaTao LinSebastian U. StichMartin JaggiAbstractBibICLR2020
Decentralized training of deep learning models is a key element for enabling data privacy and ondevice learning over networks, as well as for efficient scaling to large compute clusters. As current approaches suffer from limited bandwidth of the network, we propose the use of communication compression in the decentralized training context. We show that ChocoSGD&emdash;recently introduced and analyzed for stronglyconvex objectives only&emdash;converges under arbitrary high compression ratio on general nonconvex functions at the rate O(1/√(nT)) where T denotes the number of iterations and n the number of workers. The algorithm achieves linear speedup in the number of workers and supports higher compression than previous stateofthe art methods. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over distributed user devices, connected by a social network and (ii) in a datacenter (outperforming allreduce timewise).
 Don't Use Large MiniBatches, Use Local SGDTao LinSebastian U. StichKumar Kshitij PatelMartin JaggiAbstractBibICLR2020
Minibatch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the minibatch sizes have lead to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose a postlocal SGD and show that it significantly improves the generalization performance compared to largebatch training on standard benchmarks while enjoying the same efficiency (timetoaccuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance tradeoffs associated with a host of local SGD variants.
 Decentralized Stochastic Optimization and Gossip Algorithms with Compressed CommunicationAnastasia KoloskovaSebastian U. StichMartin JaggiAbstractVideoCodeBibICMLPMLR 97, 3478–34872019
We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning task) being distributed over n machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by ω≤1 (ω=1 meaning no compression). We (i) propose a novel gossipbased stochastic gradient descent algorithm, CHOCOSGD, that converges at rate O(1/(nT)+1/(Tδ2ω)2) for strongly convex objectives, where T denotes the number of iterations and δ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, O(1/(nT)), is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCOGOSSIP, for the average consensus problem that converges in time O(1/(δ2ω)log(1/ε)) for accuracy ε>0. This is (up to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for ω>0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms do outperform the respective stateoftheart baselines and CHOCOSGD can reduce communication by at least two orders of magnitudes.
 Error Feedback Fixes SignSGD and other Gradient Compression SchemesSai Praneeth R. KarimireddyQuentin RebjockSebastian U. StichMartin JaggiAbstractCodeBibICMLPMLR 97, 3252–32612019
Signbased algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counterexamples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using errorfeedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm EFSGD achieves the same rate of convergence as SGD without any additional assumptions for arbitrary compression operators (including the sign operator), indicating that we get gradient compression for free. Our experiments thoroughly substantiate the theory showing the superiority of our algorithm.
 Local SGD Converges Fast and Communicates LittleSebastian U. StichAbstractPosterBibICLR2019
Minibatch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck recent works propose to reduce the communication frequency. An algorithm of this type is local SGD that runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but eluded thorough theoretical analysis.
We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as minibatch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and minibatch size. The number of communication rounds can be reduced up to a factor of T^{1/2}where T denotes the number of total stepscompared to minibatch SGD. This also holds for asynchronous implementations.
Local SGD can also be used for large scale training of deep learning models. The results shown here aim serving as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.
 Efficient Greedy Coordinate Descent for Composite ProblemsSai Praneeth R. KarimireddyAnastasia KoloskovaSebastian U. StichMartin JaggiAbstractBibAISTATSPMLR 89, 2887–28962019
Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension n, and requiring upto n times fewer iterations.
In this paper, we consider greedy updates that are based on subgradients for a class of nonsmooth composite problems, which includes L1regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of n, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy.
Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.
 Sparsified SGD with MemorySebastian U. StichJeanBaptiste CordonnierMartin JaggiAbstractPosterCodeBibNeurIPS31, 4452–44632018
Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (topk sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far.
In this work we analyze Stochastic Gradient Descent (SGD) with ksparsification or compression (for instance topk or randomk) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.
 Global linear convergence of Newton's method without strongconvexity or Lipschitz gradientsSai Praneeth R. KarimireddySebastian U. StichMartin JaggiBibNeurIPS 2019 Workshop: Beyond First Order Methods in ML2018
 Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster SecondOrder OptimizationRobert M. GowerFilip HanzelyPeter RichtárikSebastian U. StichAbstractPosterBibNeurIPS31, 1626–16362018
We present the first accelerated randomized algorithm for solving linear systems in Euclidean spaces. One essential problem of this type is the matrix inversion problem. In particular, our algorithm can be specialized to invert positive definite matrices in such a way that all iterates (approximate solutions) generated by the algorithm are positive definite matrices themselves. This opens the way for many applications in the field of optimization and machine learning. As an application of our general theory, we develop the first accelerated (deterministic and stochastic) quasiNewton updates. Our updates lead to provably more aggressive approximations of the inverse Hessian, and lead to speedups over classical nonaccelerated rules in numerical experiments. Experiments with empirical risk minimization show that our rules can accelerate training of machine learning models.
 On Matching Pursuit and Coordinate DescentFrancesco LocatelloAnant RajSai Praneeth R. KarimireddyGunnar RätschBernhard SchölkopfSebastian U. StichMartin JaggiBibICMLPMLR 80, 3198–32072018
 Adaptive balancing of gradient and update computation times using global geometry and approximate subproblemsSai Praneeth R. KarimireddySebastian U. StichMartin JaggiSlidesBibAISTATSPMLR 84, 1204–12132018
 Safe Adaptive Importance SamplingSebastian U. StichAnant RajMartin JaggiVideoSlidesPosterBibNIPS30, 4381–43912017
 On the existence of ordinary trianglesRadoslav FulekHossein N. MojarradMárton NaszódiJózsef SolymosiSebastian U. StichMay SzedlákBibComputational Geometry66, 28–312017
 Approximate Steepest Coordinate DescentSebastian U. StichAnant RajMartin JaggiPosterBibICMLPMLR 70, 3251–32592017
 Efficiency of accelerated coordinate descent method on structured optimization problemsYurii NesterovSebastian U. StichBibSIAM Journal on Optimization27:1, 110–1232017
 Variable Metric Random PursuitSebastian U. StichChristian L. MüllerBernd GärtnerBibMathematical Programming156(1), 549–5792016
 On two continuum armed bandit problems in high dimensionsHemant TyagiSebastian U. StichBernd GärtnerBibTheory of Computing Systems58:1, 191–2222016
 On low complexity Acceleration Techniques for Randomized OptimizationSebastian U. StichSupplBibPPSN XIIISpringer, 130–1402014
 Optimization of Convex Functions with Random PursuitSebastian U. StichChristian L. MüllerBernd GärtnerBibSIAM Journal on Optimization23:2, 1284–13092013
 On spectral invariance of Randomized Hessian and Covariance Matrix Adaptation schemesSebastian U. StichChristian L. MüllerPosterBibPPSN XIISpringer, 448–4572012
 On Two Problems Regarding the Hamiltonian Cycle GameDan HefetzSebastian U. StichBibThe Electronic Journal of Combinatorics16(1)2009
Preprints / Various:
 Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient MethodsAmirkeivan MohtashamiMartin JaggiSebastian U. StichAbstractTechnical Report2022
While SGD, which samples from the data with replacement is widely studied in theory, a variant called Random Reshuffling (RR) is more common in practice. RR iterates through random permutations of the dataset and has been shown to converge faster than SGD. When the order is chosen deterministically, a variant called incremental gradient descent (IG), the existing convergence bounds show improvement over SGD but are worse than RR. However, these bounds do not differentiate between a good and a bad ordering and hold for the worst choice of order. Meanwhile, in some cases, choosing the right order when using IG can lead to convergence faster than RR. In this work, we quantify the effect of order on convergence speed, obtaining convergence bounds based on the chosen sequence of permutations while also recovering previous results for RR. In addition, we show benefits of using structured shuffling when various levels of abstractions (e.g. tasks, classes, augmentations, etc.) exists in the dataset in theory and in practice. Finally, relying on our measure, we develop a greedy algorithm for choosing good orders during training, achieving superior performance (by more than 14 percent in accuracy) over RR.
 ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!Konstantin MishchenkoGrigory MalinovskySebastian U. StichPeter RichtárikAbstractTechnical Report2022
We introduce ProxSkip  a surprisingly simple and provably efficient method for minimizing the sum of a smooth (f)
and an expensive nonsmooth proximable (ψ) function. The canonical approach to solving such problems is via the proximal gradient descent
(ProxGD) algorithm, which is based on the evaluation of the gradient of f and the prox operator of ψ in each iteration.
In this work we are specifically interested in the regime in which the evaluation of prox is costly relative to the evaluation
of the gradient, which is the case in many applications. ProxSkip allows for the expensive prox operator to be skipped in most iterations.
Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently
on all devices, and evaluation of prox corresponds to (expensive) communication in the form of gradient averaging.
In this context, ProxSkip offers an effective acceleration of communication complexity. Unlike other local gradienttype methods,
such as FedAvg, SCAFFOLD, SLocalGD and FedLin, whose theoretical communication complexity is worse than, or at best matching,
that of vanilla GD in the heterogeneous data regime, we obtain a provable and large improvement without any heterogeneitybounding assumptions.
 Linear Speedup in Personalized Collaborative LearningElMahdi ChaytiSai Praneeth R. KarimireddySebastian U. StichNicolas FlammarionMartin JaggiAbstractTechnical Report2021
Personalization in federated learning can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user). In order to develop training algorithms that optimally balance this tradeoff, it is necessary to extend our theoretical foundations. In this work, we formalize the personalized collaborative learning problem as stochastic optimization of a user's objective f0(x) while given access to N related but different objectives of other users {f1(x),…,fN(x)}. We give convergence guarantees for two algorithms in this setting  a popular personalization method known as weighted gradient averaging, and a novel bias correction method  and explore conditions under which we can optimally tradeoff their bias for a reduction in variance and achieve linear speedup w.r.t. the number of users N. Further, we also empirically study their performance confirming our theoretical insights.
 ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive TrainingHuiPo WangSebastian U. StichYang HeMario FritzAbstractTechnical Report2021
Federated learning is a powerful distributed learning scheme that allows numerous edge devices to collaboratively train a model without sharing their data. However, training is resourceintensive for edge devices, and limited network bandwidth is often the main bottleneck. Prior work often overcomes the constraints by condensing the models or messages into compact formats, e.g., by gradient compression or distillation. In contrast, we propose ProgFed, the first progressive training framework for efficient and effective federated learning. It inherently reduces computation and twoway communication costs while maintaining the strong performance of the final models. We theoretically prove that ProgFed converges at the same asymptotic rate as standard training on full models. Extensive results on a broad range of architectures, including CNNs (VGG, ResNet, ConvNets) and Unets, and diverse tasks from simple classification to medical image segmentation show that our highly effective training approach saves up to 20% computation and up to 63% communication costs for converged models. As our approach is also complimentary to prior work on compression, we can achieve a wide range of tradeoffs, showing reduced communication of up to 50× at only 0.1% loss in utility.
 Escaping Local Minima With Stochastic NoiseHarsh VardhanSebastian U. StichAbstractNeurIPS 2021 Workshop: Optimization for Machine Learning2021
Nonconvex optimization problems are ubiquitous in machine learning, especially in Deep Learning. While such complex problems can often be successfully optimized in practice by using
stochastic gradient descent (SGD), theoretical analysis cannot adequately explain this success.
In particular, the standard analyses don’t show global convergence of SGD on nonconvex functions, and instead show convergence to stationary points (which can also be local minima or saddle
points). In this work, we identify a broad class of nonconvex functions for which we can show that
perturbed SGD (gradient descent perturbed by stochastic noise—covering SGD as a special case)
converges to a global minimum, in contrast to gradient descent without noise that can get stuck
in local minima. In particular, for nonconvex functions that are relative close to a convexlike
(strongly convex or PŁ) function we prove that SGD converges linearly to a global optimum.
 On Secondorder Optimization Methods for Federated LearningSebastian BischoffStephan GunnemannMartin JaggiSebastian U. StichAbstractICML 2021 Workshop: Beyond firstorder methods in ML systems2021
We consider federated learning (FL), where the training data is distributed across a large number of clients. The standard optimization method in this setting is Federated Averaging (FedAvg), which performs multiple local firstorder optimization steps between communication rounds. In this work, we evaluate the performance of several secondorder distributed methods with local steps in the FL setting which promise to have favorable convergence properties.
We (i) show that FedAvg performs surprisingly well against its secondorder competitors when evaluated under fair metrics (equal amount of local computations)in contrast to the results of previous work. Based on our numerical study, we propose (ii) a novel variant that uses secondorder local information for updates and a global line search to counteract the resulting local specificity.
 Bidirectional Adaptive Communication for Heterogenous Distributed LearningDmitrii AvdiukhinNikita IvkinMartin JaggiVladimir BravermanAbstractICML 2021 Workshop: Federated Learning for User Privacy and Data Confidentiality2021
Communication constraints are a key bottleneck
in distributed optimization, in particularly bandwidth and latency can be limiting factors when
devices are connected over commodity networks,
such as in Federated Learning. Stateoftheart
techniques tackle these challenges by advanced
compression techniques or delaying communication rounds according to predefined schedules.
We present a new scheme that adaptively skips
communication (broadcast and client uploads) by
detecting slowvarying updates. The scheme automatically adjusts the communication frequency
independently for each worker and the server.
By utilizing an errorfeedback mechanism – borrowed from the compression literature – we prove
that the convergence rate is the same as for batch
gradient descent in the convex and nonconvex
smooth cases. We show reduction of the total
number of communication rounds between server
and clients needed to achieve a targeted accuracy, even in the case when the data distribution
is highly nonIID.
 A Field Guide to Federated OptimizationJianyu WangZachary CharlesZheng XuGauri JoshiH. Brendan McMahanBlaise Aguera y ArcasMaruan AlShedivatGalen AndrewSalman AvestimehrKatharine DalyDeepesh DataSuhas DiggaviHubert EichnerAdvait GadhikarZachary GarrettAntonious M. GirgisFilip HanzelyAndrew HardChaoyang HeSamuel HorváthZhouyuan HuoAlex IngermanMartin JaggiTara JavidiPeter KairouzSatyen KaleSai Praneeth R. KarimireddyJakub KonečnýSanmi KoyejoTian LiLuyang LiuMehryar MohriHang QiSashank J. ReddiPeter RichtárikKaran SinghalVirginia SmithMahdi SoltanolkotabWeikang SongAnanda T. SureshSebastian U. StichAmeet TalwalkarHongyi WangBlake WoodworthShanshan WuFelix X. YuHonglin YuanManzil ZaheerMi ZhangTong ZhangChunxiang ZhengChen ZhuWennan ZhuAbstractTechnical Report2021
Federated learning and analytics are a distributed approach for
collaboratively learning models (or statistics) from decentralized data,
motivated by and designed for privacy protection. The distributed learning
process can be formulated as solving federated optimization problems, which
emphasize communication efficiency, data heterogeneity, compatibility with
privacy and system requirements, and other constraints that are not primary
considerations in other problem settings. This paper provides recommendations
and guidelines on formulating, designing, evaluating and analyzing federated
optimization algorithms through concrete examples and practical implementation,
with a focus on conducting effective simulations to infer realworld
performance. The goal of this work is not to survey the current literature, but
to inspire researchers and practitioners to design federated learning
algorithms that can be used in various practical applications.
 Decentralized Local Stochastic ExtraGradient for Variational InequalitiesAleksandr BeznosikovPavel DvurechenskyAnastasia KoloskovaValentin SamokhinSebastian U. StichAlexander GasnikovAbstractTechnical Report2021
We consider decentralized stochastic variational inequalities where the problem data is distributed across many participating devices (heterogeneous, or nonIID data setting). We propose a novel method  based on stochastic extragradient  where participating devices can communicate over arbitrary, possibly timevarying network topologies. This covers both the fully decentralized optimization setting and the centralized topologies commonly used in Federated Learning. Our method further supports multiple local updates on the workers for reducing the communication frequency between workers. We theoretically analyze the proposed scheme in the strongly monotone, monotone and nonmonotone setting. As a special case, our method and analysis apply in particular to decentralized stochastic minmax problems which are being studied with increased interest in Deep Learning. For example, the training objective of Generative Adversarial Networks (GANs) are typically saddle point problems and the decentralized training of GANs has been reported to be extremely challenging. While SOTA techniques rely on either repeated gossip rounds or proximal updates, we alleviate both of these requirements. Experimental results for decentralized GAN demonstrate the effectiveness of our proposed algorithm.
 On Communication Compression for Distributed Optimization on Heterogeneous DataSebastian U. StichAbstractTechnical Report2020
Lossy gradient compression, with either unbiased or biased compressors, has become a key tool to avoid the communication bottleneck in centrally coordinated distributed training of machine learning models. We analyze the performance of two standard and general types of methods: (i) distributed quantized SGD (DQSGD) with arbitrary unbiased quantizers and (ii) distributed SGD with errorfeedback and biased compressors (DEFSGD) in the heterogeneous (noniid) data setting.
Our results indicate that DEFSGD is much less affected than DQSGD by noniid data, but both methods can suffer a slowdown if dataskewness is high. We propose two alternatives that are not (or much less) affected by heterogenous data distributions: a new method that is only applicable to strongly convex problems, and we point out a more general approach that is applicable to linear compressors.
 On the Convergence of SGD with Biased GradientsAhmad AjalloeianSebastian U. StichAbstractICML 2020 Workshop: Beyond First Order Methods in ML Systems2020
The classic stochastic gradient (SGD) algorithm crucially depends on the availability of unbiased stochastic gradient oracles. However, in certain applications only biased gradient updates are available or preferred over unbiased ones for performance reasons.
For instance, in the domain of distributed learning, biased gradient compression techniques such as top$k$ compression have been proposed as a tool to alleviate the communication bottleneck and in derivativefree optimization, only biased gradient estimators can be queried.
We present a general framework to analyze SGD with biased gradient oracles and derive convergence rates for smooth (nonconvex) functions and give improved rates under the PolyakLojasiewicz condition. Our rates show how biased estimators impact the attainable convergence guarantees. We discuss a few guiding examples that show the broad applicability of our analysis.
 Unified Optimal Analysis of the (Stochastic) Gradient MethodSebastian U. StichAbstractBibTechnical Report2019
In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on μstrongly convex functions under a (milder than standard) Lsmoothness assumption. We show that SGD converges after T iterations as O(L∥x0−x*∥² exp[−μT/4L]+σ²/μT) where σ² measures the variance. For deterministic gradient descent (GD) and SGD in the interpolation setting we have σ²=0 and we recover the exponential convergence rate. The bound matches with the best known iteration complexity of GD and SGD, up to constants.
 Stochastic Distributed Learning with Gradient Quantization and Variance ReductionSamuel HorváthDmitry KovalevKonstantin MishchenkoSebastian U. StichPeter RichtárikAbstractBibTechnical Report2019
We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g. quantize or sparsify) the gradients, thereby introducing additional variance ω≥1 that might slow down convergence. For strongly convex functions with condition number κ distributed among n machines, we (i) give a scheme that converges in O((κ+κωn+ω) log(1/ϵ)) steps to a neighborhood of the optimal solution. For objective functions with a finitesum structure, each worker having less than m components, we (ii) present novel variance reduced schemes that converge in O((κ+κωn+ω+m)log(1/ϵ)) steps to arbitrary accuracy ϵ>0. These are the first methods that achieve linear convergence for arbitrary quantized updates. We also (iii) give analysis for the weakly convex and nonconvex cases and (iv) verify in experiments that our novel variance reduced schemes are more efficient than the baselines.
 kSVRG: Variance Reduction for Large Scale OptimizationAnant RajSebastian U. StichBibTechnical Report2018
 Stochastic continuum armed bandit problem of few linear parameters in high dimensionsHemant TyagiSebastian U. StichBernd GärtnerBibTechnical Report2013
 Random Pursuit in Hilbert SpaceSebastian U. StichBernd GärtnerBibTechnical ReportCGLTR882013
 Matrixvalued Iterative Random ProjectionsSebastian U. StichChristian L. MüllerBernd GärtnerBibTechnical ReportCGLTR872013
Theses
Talks / Posters / Workshops
 13/14 December202113th Workshop on Optimization for Machine Learning (OPT)VirtualOnlineWorkshop
 Algorithms for Efficient Federated and Decentralized Learning24 July2021International Workshop on Federated Learning for User Privacy and Data ConfidentialityVirtualOnline
 Asynchronous Methods in Distributed Optimization and Insights from the ErrorFeedback FrameworkSlides20–23 July2021SIAM Conference on Optimization (OP21)Spokane, WashingtonUSA
 Escaping local minima for a class of smooth nonconvex problems14 July2021Optimization without BordersHybrid, SochiRussia
 Taming GANs with LookaheadMinmax (talk by Matteo Pagliardini)Video3 May–7 May2021International Conference on Learning Representations (ICLR)VirtualOnline
 11 December202012th Workshop on Optimization for Machine Learning (OPT)Virtual, formerly VancouverOnlineWorkshop
 Towards Decentralized Machine Learning with SGD12 November2020SeminarETH ZurichSwitzerland
 A Unified Theory of Decentralized SGD with Changing Topology and Local Updates (talk by Anastasia Koloskova)VideoExtrapolation for Largebatch Training in Deep Learning (talk by Tao Lin)VideoIs Local SGD Better than Minibatch SGD? (talk by Blake Woodworth)VideoSCAFFOLD: Stochastic Controlled Averaging for Federated Learning (talk by Praneeth Karimireddy)Video12–18 July202037th International Conference on Machine Learning (ICML)Virtual conference, formerly ViennaOnline
 Decentralized Deep Learning with Arbitrary Communication Compression (talk by Anastasia Koloskova)VideoDon't Use Large Minibatches, Use Local SGD (talk by Tao Lin)VideoDynamic Model Pruning with Feedback (talk by Tao Lin)Video26 April–1st May2020International Conference on Learning Representations (ICLR)Virtual Conference, formerly Addis AbabaOnline
 27 February2020research visit Suvrit Sra, MITCambridge MAUSA
 A Unifying Framework for Communication Efficient Decentralized Machine Learning20 February2020MLO Group SeminarLausanneSwitzerland
 Advances in ML: Theory meets PracticeWorkshop25–29 January2020Applied Machine Learning DaysEPFL LausanneSwitzerland
 14 December201911th Workshop on Optimization for Machine Learning (OPT)VancouverCanadaWorkshop
 8–14 December2019Thirtythird Conference on Neural Information Processing Systems (NeurIPS)VancouverCanada
 23–26 November2019KAUSTTsinghua Industry Workshop on Advances in Artificial IntelligenceThuwalKSA
 17–29 November2019research visit Peter Richtárik, KAUSTThuwalKSA
 22–30 October2019research visit Yurii Nesterov, UCLouvainLouvainlaNeuveBelgium
 Tutorial On Theory of Federated Learning15 October2019MLO SeminarLausanneSwitzerland
 Communication Efficient Variants of SGD for Distributed Training5–8 August20196th International Conference on Continuous Optimization (ICCOPT)BerlinGermany
 On Lottery Tickets in DL5 July2019EPFL, MLTheory SeminarLausanneSwitzerland
 Variance Reduced Methods in Derivative Free Optimization24–26 June201930th European Conference on Operational Research (EURO)DublinIreland
 Decentralized Stochastic Optimization and Gossip Algorithms with Compressed CommunicationError Feedback Fixes SignSGD and other Gradient Compression Schemes9–15 June201936th International Conference on Machine Learning (ICML)Long BeachUSA
 Gradient Compression with Error Feedback for Distributed Optimization4 June2019OwkinParisFrance
 Gradient Compression with Error Feedback for Distributed Optimization3 June2019Séminaire Parisien d'OptimisationParisFrance
 Tutorial On Robust Optimization28 May2019MLO SeminarLausanneSwitzerland
 Local SGD Converges Fast and Communicates LittlePoster6–9 May2019International Conference on Learning Representations (ICLR)New OrleansUSA
 Communication Efficient SGD for Distributed Optimization20–22 March2019Operator Splitting Methods in Data Analysis WorkshopFlatiron Institute, New YorkUSA
 Error Feedback for Communication Efficient SGD11–15 March2019research visit Nati Srebo, Toyota Technological Institute at Chicago (TTIC)ChicagoUSA
 Error Feedback for Communication Efficient SGD14 February2019MPITübingenGermany
 Advances in ML: Theory meets PracticeWorkshop26–29 January2019Applied Machine Learning DaysEPFL LausanneSwitzerland
 Sparsified SGD with MemoryPosterAccelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster SecondOrder OptimizationPoster2–8 December2018Thirtysecond Annual Conference on Neural Information Processing Systems (NeurIPS)MontréalCanada
 Communication Efficient Variants of SGD for Distributed Computing23 November2018Machine Learning meeting, EPFLLausanneSwitzerland
 Communication Efficient Variants of SGD for Distributed ComputingkSVRG: Variance Reduction for Large Scale Optimization23 October–14 November2018research visit Peter Richtárik, KAUSTThuwalKSA
 Tutorial On Convex and NonConvex Optimization26 September2018MLO SeminarLausanneSwitzerland
 Communication Efficient Variants of SGD for Distributed Computing2–14 September2018research visit François Glineur, Yurii Nesterov, UCLouvainLouvainlaNeuveBelgium
 Communication Efficient SGD through Quantization with Memory30 August2018MittagsseminarETH ZürichSwitzerland
 On Matching Pursuit and Coordinate Descent10–15 July201835th International Conference on Machine Learning (ICML)StockholmSweden
 Approximate Composite Minimization: Convergence
Rates and Examples2–6 July201823rd International Symposium on Mathematical Programming (ISMP)BordeauxFrance
 Tutorial On Random Pursuit Methods13 June2018MLO SeminarLausanneSwitzerland
 Adaptive balancing of gradient and update computation times9–11 April2018The 21st International Conference on Artificial Intelligence and Statistics (AISTATS)Playa Blanca, LanzaroteSpain
 19–20 February2018Swiss Joint Research Center Workshop 2018EPFL LausanneSwitzerland
 Advances in ML: Theory meets PracticeWorkshop27–30 January2018Applied Machine Learning DaysEPFL LausanneSwitzerland
 Adaptive importance sampling for convex optimization15–18 January2018Optimization for Learning WorkshopPontificia Universidad Católica de Chile, SantiagoChile
 Adaptive importance sampling for convex optimization2–13 January2018research visit Hemant Tyagi, Alan Touring InstituteLondonUK
 Safe Adaptive Importance Sampling4–9 December2017Thirtyfirst Annual Conference on Neural Information Processing Systems (NIPS)Long BeachUSAPoster
 27 November – 1 December2017Optimization, Statistics and Uncertainty Simons Institute for the Theory of Computing, BerkelyUSA
 Approximate Steepest Coordinate Descent3–21 October2017research visit Peter Richtárik, KAUSTThuwalKSA
 Tutorial On Coordinate Descent30 August2017MLO SeminarLausanneSwitzerland
 Approximate Steepest Coordinate Descent6–11 August201734th International Conference on Machine Learning (ICML)SidneyAustraliaPoster
 14–16 June2017The Summer Research Institute 2017EPFL LausanneSwitzerland
 12–14 June2017Google Machine Learning SummitGoogle ZürichSwitzerland
 Efficiency of Coordinate Descent on Structured Problems22–25 May2017SIAM Conference on Optimization (OP17)VancouverCanada
 2–3 February2017Swiss Joint Research Center Workshop 2017Microsoft CambridgeUK
 30–31 January2017Applied Machine Learning DaysEPFL LausanneSwitzerland
 5–10 December2016Thirtieth Annual Conference on Neural Information Processing Systems (NIPS)BarcelonaSpain
 Randomized Algorithms for Convex Optimization18 October2016Operations Research SeminarCORE, LouvainlaNeuveBelgium
 Efficiency of Random Search on Structured Problems6–11 August20165th International Conference on Continuous Optimization (ICCOPT)TokyoJapan
 6–10 June201614th Gremo Workshop on Open ProblemsSt. NiklausenSwitzerland
 Adaptive QuasiNewton Methods6 June20169th Combinatorial Algorithms DayETH ZürichSwitzerland
 11 April–15 April2016research visit Peter Richtárik, University of EdinburghEdinburghUK
 7–12 February2016Optimization Without BordersLes HouchesFrance
 Efficient Methods for a Class of Truss Topology Design Problems28–29 January201630th annual conference of the Belgian Operational Research Society (ORBEL)LouvainlaNeuveBelgium
 23 November2015IAP DYSCO Study Day: Dynamical systems, control and optimizationLeuvenBelgium
 26 October – 20 November2015SOCN Grad Course: Randomized algorithms for big data optimizationLouvainlaNeuveBelgium
 Accelerated Random Search8–10 July201513th EUROPT Workshop on Advances in Continuous OptimizationEdinburghUK
 1–5 June201513th Gremo Workshop on Open ProblemsFeldisSwitzerland
 Solving generalized Laplacian linear systems1 June20158th Combinatorial Algorithms DayETH ZürichSwitzerland
 6–8 May2015Optimization and Big Data 2015, Workshop, Trek and ColloquiumEdinburghUK
 12 November2014IAP DYSCO Study Day: Dynamical systems, control and optimizationGentBelgium
 Optimization of convex function with Random Pursuit4 November2014Simons FoundationNew YorkUSA
 30 October–6 November2016research visit Christian Müller, Simons FoundationNew YorkUSA
 On Low Complexity Acceleration Techniques for Randomized Optimization13–17 September201413th International Conference on Parallel Problem Solving From Nature (PPSN)LjubljanaSlovenia
 30 June – 4 July201412th Gremo Workshop on Open ProblemsVal SinestraSwitzerland
 30 June20147th Combinatorial Algorithms DayETH ZürichSwitzerland
 Probabilistic Estimate Sequences4–5 April2014Workshop on Theory of Randomized Search Heuristics (ThRaSH) 2014JenaGermany
 Probabilistic Estimate Sequences25 March2014TAO reserach seminarINRIA Saclay, ÎledeFranceFrance
 Natural Gradient in Evolution Strategies17 October2013MittagsseminarETH ZürichSwitzerland
 Optimization and Learning with Random Pursuit30 September – 2 Oktober2013CG Learning Review MeetingAthensGreece
 30 June – 5 July2013Dagstuhl Seminar "Theory of Evolutionary Algorithms"WadernGermany
 24–28 June201311th Gremo Workshop on Open ProblemsAlp SellamattSwitzerland
 24 June20136th Combinatorial Algorithms DayETH ZürichSwitzerland
 Stochastic Bandits30 May2013MittagsseminarETH ZürichSwitzerland
 6 April – 8 May2013Research visit with Christian Müller and Jonathan GoodmanCourant Institute of Mathematical Sciences, New York UniversityUSA
 Variable Metric Random Pursuit13–14 December2012CG Learning Review MeetingBerlinGermany
 Variable Metric Random Pursuit4 December2012MittagsseminarETH ZürichSwitzerland
 On spectral invariance of Randomized Hessian and Covariance Matrix Adaptation schemesPoster1–5 September201212th International Conference on Parallel Problem Solving From Nature (PPSN)TaorminaItaly
 Convergence of Local Search19–24 August201221st International Symposium on Mathematical Programming (ISMP)TU BerlinGermany
 20–22 June201212th International Workshop on High Performance Optimization (HPOPT): Algorithmic convexity and applicationsDelft University of TechnologyThe Netherlands
 4–8 June201210th Gremo Workshop on Open ProblemsBergünSwitzerland
 4 June20125th Combinatorial Algorithms DayETH ZürichSwitzerland
 Convergence of Local Search2–3 May2012Workshop on Theory of Randomized Search Heuristics (ThRaSH) 2012Lille/Villeneuve d'AscqFrance
 The Heavy Ball Method3 April2012MittagsseminarETH ZürichSwitzerland
 Advertising Randomized derivativefree optimization11–14 March2012The First ETHJapan Workshop on Science and ComputingEngelbergSwitzerland
 7–9 March2012The First ETHJapan Symposium for the Promotion of Academic ExchangesETH ZürichSwitzerland
 Gradientfree optimization with Random Pursuit15 December2011CG Learning Review MeetingZürichSwitzerland
 Dimension reduction with the JohnsonLindenstrauss Lemma29 September2011MittagsseminarETH ZürichSwitzerland
 30 August – 2 September2011International Conference on Operations ResearchZürichSwitzerland
 Randomized DerivativeFree Optimization, a survery of different methodsPoster8–11 August2011MADALGO & CTIC Summer School on Highdimensional Geometric ComputingAarhus UniversityDanmark
 Random derivativefree optimization of convex functions using a line search oracle8–9 July2011Workshop on Theory of Randomized Search Heuristics (ThRaSH) 2011CopenhagenDanmark
 4–8 July201138th International Colloquium on Automata, Languages and Programming (ICALP)ZürichSwitzerland
 9–11 June2011CG Learning Kickoff WorkshopParisFrance
 28–30 March201127th European Workshop on Computational Geometry (EuroCG)MorschachSwitzerland
 7–10 February20119th Gremo Workshop on Open ProblemsWergensteinSwitzerland
 Principles of selfadaptation in randomized optimization16 December2010MittagsseminarETH ZürichSwitzerland
 Graph sparsification with applications11 March2010Institute for Operations Research (IFOR)ETH ZürichSwitzerland
Student Projects
 Pedram KhorsandiMomentum Methods in DLSummer Intern (Virtual)2021
 Yehao LiuUncertainity Estimation in DLMaster Project2021Paper
 Yue Liu (now graduate student at U Toronto)Variance reduction on decentralized training over heterogeneous dataMaster Thesis2021
 Giovanni IgowaCompression and Training for DLBachelor Project2021
 Ilyas Fatkhullin (now PhD student at ETH Zurich)Robust Acceleration with Error FeedbackSummer Intern (Virtual)2020
 Oguz Kaan YükselSemantic Perturbations with Normalizing Flows for Improved GeneralizationMaster Project2021Paper
 Nicolas BrandtDitGrieurinAdaptive Optimization Methods in DLMaster Project2020
 Sophie MauranModel Parallel Training of DNNMaster Project2020
 Pierre VuillecardPreconditioned SGD For DLMaster Project2020
 Thomas de Boissonneaux de ChevignyOn DeAI SimulatorsBachelor Project2020
 Manjot SinghThe Interplay between Noise and CurvatureOptional Master Project2020
 Damian Dudzicz (now exchange student at ETH)PushSum Algorithms for Decentralized OptimizationMaster Project2020
 Harshvardhan Harshvardhan (now graduate student at UC San Diego)ConvergenceDiagnostic based Adaptive Batch Sizes for SGDOptional Project2020
 Pavlo KaralupovImportance Sampling to Accelerate DL TrainingMaster Project2020
 Raphael BonattiOn Solving Combinatorial Games with Neural NetworksBachelor Project2020
 Sebastian BischoffImportance Sampling to Accelerate DL TrainingMaster Project2020
 Sadra BoreiriDecentralized Local SGDMaster Project2019Paper
 Damian Dudzicz (now exchange student at ETH)Decentralized Asynchronous SGDMaster Project2019
 Oriol Barbany Mayor (now at Logitec)AffineInvariant Robust TrainingMaster Project2019
 Rayan ElalamyConvergence Anlysis for Hogwild! Under Less Restrictive AssumptionsMaster Project2019
 Lionel ConstantinStochastic Extragradient for GAN TrainingMaster Project2019
 Harshvardhan HarshvardhanBetter bounds for Optimal Batch Size in SGDMaster Project2019
 Srijan Bansal (now at IIT Kharagpur)Adaptive Stepsize Strategies for SGDSummer Internship2019
 Guillaume van Dessel (now PhD student at UCLouvain)Stochastic gradient based methods for largescale optimization: review, unification and new approachesMaster Thesis2019
 Ahmad Ajalloeian (now PhD student at UNIL)Stochastic ZerothOrder Optimisation Algorithms with Variance ReductionMaster Thesis2019
 Claire Capelo (now exchange student at Stanford)Adaptive schemes for communication compression: Trajectory Normalized Gradients for Distributed OptimizationMaster Project2019
 Aakash Sunil Lahoti Special Cases of Local SGDMaster Project2019
 Cem Musluoglu (now PhD student at KU Leuven)Quantization and Compression for Distributed OptimizationMaster Project2018
 Quentin Rebjock (now PhD student at EPFL)Error Feedback For SignSGDMaster Project2018Paper
 Jimi Vaubien (now at Visum Technologies)DerivativeFree Empirical Risk MinimizationBachelor Project2018
 Polina Kirichenko (now PhD student at NYU)Zeroorder training for DLSummer Internship2018
 Nikhil Krishna (now at Detect Technologies)Adaptive Sampling with LSHSummer Internship2018
 Alberto Chiappa (now at Schindler Group)Realtime recognition of industryspecific context from mobile phone sensor dataMaster Thesis (industry project with Schindler)2018
 JeanBaptiste Cordonnier (now PhD student at EPFL)Distributed Machine Learning with Sparsified Stochastic Gradient DescentMaster Thesis2018Paper
 Arthur Deschamps (now research intern at National University of Singapore)Asynchronous Methods in Machine LearningBachelor Project2018
 ChiaAn Yu (now at Bloomberg)Compression for Stochastic Gradient DescentMaster Project2018
 Frederik Künstner (now PhD student at UBC)Fully Quantized Distributed Gradient DescentMaster Project2017
 Fabian Latorre Gomez (now PhD Student at EPFL)Importance Sampling for MLEDIC Project2017
 Pooja Kulkarni (now PhD student at UIUC)Steepest Descent and Adaptive SamplingSummer Internship2017
 Anastasia Koloskova (now PhD student at EPFL)Steepest CD and LSHSummer Internship2017Paper
 William Borgeaud dit AvocatAdaptive Sampling in Stochastic Coordinate DescentMaster Project2017
 Joel Castellon (now at Amazon)Complexity analysis for AdaRBFGS: a primitive for methods between ﬁrst and second orderSemester Project2017
 Alberto Chiappa (now at Schindler Group)Asynchronous updates for stochastic gradient descentMaster Project2017
 Marina Mannari (now at Cartier)Faster Coordinate Descent for Machine Learning through Limited Precision OperationsMaster Thesis2017
 Anant Raj (now PhD student at MPI)Steepest Coordinate DescentInternship2017Paper
Teaching
Lectures
 Optimization for Machine LearningSummer 22
Seminars
 Topics in Optimization for Machine LearningWinter 21/22
Teaching Assistance