In the last few years, we have seen a surge of empirical and theoretical work on “scaling laws”, whose goal is to characterize the performance of learning methods as a function of various problem parameters (e.g., number of observations and parameters, or amount of compute). From a theoretical point of view, this marks a renewed interest in asymptotic equivalents, something the machine learning community had mostly moved away from (and, let’s be honest, kind of looked down on) in favor of…
This month, we continue our exploration of the spectral properties of kernel matrices. As mentioned in a previous post, understanding how eigenvalues decay is not only fun but also key to understanding the algorithmic and statistical properties of many learning methods (see, e.g., chapter 7 of my book “Learning Theory from First Principles”). Here, we look at the common situation of translation-invariant kernels on \(\mathbb{R}^d\), review asymptotic results, and provide non-asymptotic bounds…
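For readers who have not seen the term, a translation-invariant (or shift-invariant) kernel is one of the form below; this is the standard definition, with the notation \(q\) and \(\hat{q}\) chosen here rather than taken from the post:
\[
k(x, y) = q(x - y), \qquad \hat{q}(\omega) = \int_{\mathbb{R}^d} q(t)\, e^{-i\, \omega^\top t}\, dt .
\]
By Bochner’s theorem, \(k\) is positive definite as soon as \(\hat{q}\) is nonnegative, and the rate at which \(\hat{q}(\omega)\) decays as \(\|\omega\| \to \infty\) is closely related to how fast the eigenvalues of the associated kernel matrices decay.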
Just in time for Christmas, I received the first hard copies of my book two days ago! It is a mix of relief and pride after three years of work. As most book writers will probably acknowledge, it took much longer than I expected when I started, but overall it was an enriching experience for several reasons: (1) I learned a lot of new math to write it; (2) trying to distill the simplest arguments for the analysis of algorithms led to many new research questions; (3) I was able to get t…
Scaling laws have been one of the key achievements of theoretical analysis in various fields of applied mathematics and computer science, answering the following central question: how fast does my method or my algorithm converge as a function of (potentially partially) observable problem parameters?
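To make the question concrete, here is the prototypical shape such an answer takes for the error after seeing \(n\) observations (an illustrative template with generic constants, not a result stated in the post):
\[
\mathrm{error}(n) \;\approx\; \varepsilon_\infty + \frac{c}{n^{\alpha}},
\]
where \(\varepsilon_\infty\) is an irreducible error floor, and the exponent \(\alpha > 0\) and the constant \(c\) depend on problem parameters such as the smoothness of the target or the decay of covariance eigenvalues.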
Since my early PhD years, I have plotted and studied the eigenvalues of kernel matrices. In the simplest setting, take independent and identically distributed (i.i.d.) data, such as in the cube below in 2 dimensions, take your favorite kernels, such as the Gaussian or Abel kernels, plot the eigenvalues in decreasing order, and see what happens. The goal of this sequence of blog posts is to explain what we observe.
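Here is a minimal sketch of that experiment, assuming i.i.d. uniform samples in the unit square and the usual forms of the Gaussian kernel \(\exp(-\|x-y\|^2/(2\sigma^2))\) and the Abel (exponential) kernel \(\exp(-\|x-y\|/\sigma)\); the sample size and bandwidth below are arbitrary choices, not values from the post.

```python
import numpy as np
import matplotlib.pyplot as plt

# i.i.d. data: n uniform samples in the unit square (d = 2)
rng = np.random.default_rng(0)
n, d, sigma = 1000, 2, 0.5
X = rng.uniform(size=(n, d))

# pairwise Euclidean distances between all samples
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Gaussian and Abel (exponential) kernel matrices
K_gauss = np.exp(-dists**2 / (2 * sigma**2))
K_abel = np.exp(-dists / sigma)

# eigenvalues in decreasing order (matrices are symmetric)
for name, K in [("Gaussian", K_gauss), ("Abel", K_abel)]:
    eigvals = np.linalg.eigvalsh(K)[::-1]
    # clamp tiny/negative numerical values so the log-log plot is well defined
    plt.loglog(np.arange(1, n + 1), np.maximum(eigvals, 1e-16), label=name)

plt.xlabel("eigenvalue index")
plt.ylabel("eigenvalue")
plt.legend()
plt.show()
```

On the resulting log-log plot, the decay profiles of the two kernels can be compared directly, which is the phenomenon the series of posts sets out to explain.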
There are a few mathematical results that any researcher in applied mathematics uses on a daily basis. One of them is Jensen’s inequality, which allows bounding expectations of functions of random variables. It comes up constantly in probabilistic arguments, but also as a tool to derive inequalities and optimization algorithms. In this blog post, I will present a collection of fun facts about the inequality, from very classical to more obscure. If you know other cool ones, please add…
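For reference, the inequality in question states that, for a convex function \(\varphi\) and an integrable random variable \(X\),
\[
\varphi\big(\mathbb{E}[X]\big) \;\leq\; \mathbb{E}\big[\varphi(X)\big],
\]
with the reverse inequality for concave \(\varphi\). For instance, \(\varphi(x) = x^2\) gives \((\mathbb{E}[X])^2 \leq \mathbb{E}[X^2]\), i.e., the nonnegativity of the variance.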
Among continuous optimization problems, convex problems (with convex objectives and convex constraints) form a class that can be solved efficiently, with a variety of algorithms and to arbitrary precision. This is no longer true when the convexity assumption is removed (see this post). This of course does not mean that (1) nobody should attempt to solve high-dimensional non-convex problems (in fact, the spell checker run on this document was trained by solving such a problem…), and…
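As a reminder, the defining property of this class (the standard definition, not specific to the post) is that the objective \(f\) and the feasible set are convex, i.e.,
\[
f\big(\lambda x + (1-\lambda) y\big) \;\leq\; \lambda f(x) + (1-\lambda) f(y) \quad \text{for all } x, y \text{ and } \lambda \in [0,1],
\]
which in particular ensures that every local minimum is a global minimum.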
In optimization, acceleration is the art of modifying an algorithm in order to obtain faster convergence. Building accelerated methods and explaining their performance have been the subject of countless publications; see [2] for a review. In this blog post, we give a vignette of these discussions on a minimal but challenging example, Nesterov acceleration. We introduce and illustrate different frameworks to understand and design acceleration techniques: the discrete framework based on i…
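As a companion to the post, here is a minimal sketch of one textbook form of Nesterov acceleration on a smooth convex quadratic, compared with plain gradient descent; the objective, the step size \(1/L\), and the momentum schedule below are standard illustrative choices, not taken from the post.

```python
import numpy as np

# A simple smooth convex objective: f(x) = 0.5 * x^T A x - b^T x
rng = np.random.default_rng(0)
d = 50
M = rng.standard_normal((d, d))
A = M.T @ M / d + 1e-3 * np.eye(d)   # positive definite matrix
b = rng.standard_normal(d)
L = np.linalg.eigvalsh(A)[-1]        # smoothness constant (largest eigenvalue)

def grad(x):
    return A @ x - b

def gradient_descent(steps):
    x = np.zeros(d)
    for _ in range(steps):
        x = x - grad(x) / L
    return x

def nesterov(steps):
    # accelerated gradient with the classical t_k momentum schedule
    x = y = np.zeros(d)
    t = 1.0
    for _ in range(steps):
        x_next = y - grad(y) / L
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2
        y = x_next + (t - 1) / t_next * (x_next - x)
        x, t = x_next, t_next
    return x

f = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)       # exact minimizer for comparison
for steps in [100, 1000]:
    print(steps,
          f(gradient_descent(steps)) - f(x_star),
          f(nesterov(steps)) - f(x_star))
```

For smooth convex functions with this step size, gradient descent converges at rate \(O(1/k)\) in function values while the accelerated variant achieves \(O(1/k^2)\), which the printed suboptimality gaps illustrate.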
Over the last two years, I have been intensively studying sum-of-squares relaxations for optimization, learning a lot from many great research papers [1, 2], review papers [3], books [4, 5, 6, 7, 8], and even websites.
In the previous post, we showed (or at least tried to!) that the inherent noise of the stochastic gradient descent algorithm (SGD), in the context of modern overparametrised architectures, is structured and carries two important features: (i) it vanishes for interpolating solutions, and (ii) it belongs to a low-dimensional manifold spanned by the gradients.
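To recall why these two features hold in the standard finite-sum setting (a generic argument with notation chosen here, not taken from the post): writing the objective as \(F(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)\), the noise injected by SGD at \(w\) when sampling index \(i\) is
\[
\varepsilon(w) \;=\; \nabla f_i(w) - \nabla F(w).
\]
If \(w^\star\) interpolates the data, so that \(\nabla f_i(w^\star) = 0\) for every \(i\), then \(\varepsilon(w^\star) = 0\), which is feature (i); and for any \(w\), \(\varepsilon(w)\) is a linear combination of the individual gradients \(\nabla f_1(w), \dots, \nabla f_n(w)\), hence lies in the subspace they span, which is behind feature (ii).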