Muon from first principles, what makes it different from other optimizers, and why it works so well.| Franz Louis Cesista
The blocked matrix formulation of linear attention mechanisms, multi-step online gradient descent at inference time, and chunk-wise parallelism.| Franz Louis Cesista
A unifying framework for linear attention mechanisms as test-time regression and how to parallelize training and inference.| Franz Louis Cesista
In the article 《初探muP:超参数的跨模型尺度迁移规律》 (A first look at muP: cross-model-scale transfer laws for hyperparameters), we derived muP (Maximal Update Parametrization) from the scale invariance of the forward pass, the backward pass, the loss increment, and the feature change. Possibly for...| kexue.fm