Training a deep neural network is essentially a compression task. We want to represent our training data distribution as a function parameterized by a bunch of weight matrices. The more complex the distribution, the more parameters we need. The rationale for approximating the entire distribution up front is that, at inference, we can forward any valid input through the same model, with the same frozen weights. But what if the model were trained on-the-fly, at inference time?
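To make that question concrete, here is a minimal sketch of the idea, often called test-time training: instead of freezing the weights, we take a few gradient steps on each incoming input before predicting. Everything here is illustrative, not a reference implementation: the `encoder`, `predictor`, and `reconstructor` modules are hypothetical, and the self-supervised reconstruction loss is just one common choice of adaptation objective.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical pre-trained pieces: a shared encoder, a main task head,
# and an auxiliary self-supervised reconstruction head.
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
predictor = nn.Linear(32, 1)       # main task head
reconstructor = nn.Linear(32, 16)  # auxiliary head used only for adaptation

def predict_with_test_time_training(x, steps=5, lr=1e-3):
    """Adapt a copy of the encoder to a single test input x, then predict."""
    enc = copy.deepcopy(encoder)   # keep the original weights untouched
    opt = torch.optim.SGD(enc.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Self-supervised objective: reconstruct x from its own encoding.
        loss = nn.functional.mse_loss(reconstructor(enc(x)), x)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return predictor(enc(x))   # prediction uses the adapted weights

y = predict_with_test_time_training(torch.randn(1, 16))
```

The trade-off is clear even in this toy version: each query now pays for a handful of backward passes, but the weights it sees are tuned to that query rather than to the training distribution as a whole.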