ABSTRACT

We consider distributed optimization over several devices, each sending incremental model updates to a central server. This setting is considered, for instance, in federated learning. Various schemes have been designed to compress the model updates in order to reduce the overall communication cost. However, existing methods suffer from a significant slowdown due to the additional variance ω > 0 coming from the compression operator and, as a result, converge only sublinearly. What is needed ...
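To make the role of the compression variance ω concrete, the sketch below implements random-k sparsification, a standard unbiased compression operator satisfying E[C(x)] = x and E‖C(x) − x‖² ≤ ω‖x‖² with ω = d/k − 1. The choice of operator, the function names, and the NumPy-based check are illustrative assumptions, not taken from this paper.

```python
# Minimal sketch (assumption: NumPy illustration, not the paper's code) of an
# unbiased compression operator and an empirical check of its variance parameter ω.
import numpy as np


def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Random-k sparsification: keep k coordinates chosen uniformly at random,
    rescaled by d/k so that E[C(x)] = x (unbiasedness)."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 100, 10
    x = rng.standard_normal(d)

    # Compare the empirical variance E||C(x) - x||^2 with ω ||x||^2, where ω = d/k - 1.
    samples = np.array([np.linalg.norm(rand_k(x, k, rng) - x) ** 2 for _ in range(20_000)])
    omega = d / k - 1
    print(f"empirical E||C(x)-x||^2 ≈ {samples.mean():.3f}")
    print(f"bound     ω ||x||^2     = {omega * np.linalg.norm(x) ** 2:.3f}")
```

For rand-k this bound is tight, so more aggressive compression (smaller k) means larger ω and, as the abstract notes, a larger slowdown for methods that do not counteract this extra variance.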