Fast and robust model training.| Franz Louis Cesista
What would Muon look like if we constrained the weights to be semi-orthogonal?| Franz Louis Cesista
Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced bey...| Franz Louis Cesista
Towards a maximal update parameterization of n-simplicial attention| Franz Louis Cesista
Why does Adam with aggressive gradient value/norm clipping have sparse updates and do well with higher learning rates? Here we show that it is essentially equivalent to a smoothed version of SignSGD/NormSGD.| Franz Louis Cesista
Muon from first principles, what makes it different from other optimizers, and why it works so well.| Franz Louis Cesista
A possible reason why Muon converges faster & does better at higher learning rates than Adam.| Franz Louis Cesista
The blocked matrix formulation of linear attention mechanisms, multi-step online gradient descent at inference time, and chunk-wise parallelism.| Franz Louis Cesista
Why Muon still works despite not perfectly semi-orthogonalizing the gradients.| Franz Louis Cesista
Simply switching to Muon can already get you 2x efficiency gains. But you can squeeze out an extra 1-2% by optimizing the Newton-Schulz coefficients.| Franz Louis Cesista
The CASPR optimizer, a variant of Shampoo, reduces to Muon when we remove the accumulation on the preconditioners.| Franz Louis Cesista
GRPO may not be the best choice for training reasoning models. Here's why.| Franz Louis Cesista
A unifying framework for linear attention mechanisms as test-time regression and how to parallelize training and inference.| Franz Louis Cesista
Instead of asking, 'Which optimizer should I use?' ask, 'In which space do my features live?'| Franz Louis Cesista
Generate interleaved text and image content in a structured format you can directly pass to downstream APIs.| Franz Louis Cesista
Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to ...| Franz Louis Cesista
A minimal implementation of Flash Attention 1 & 2 in just ~350 lines of CUDA code.| Franz Louis Cesista
Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a no...| Franz Louis Cesista
Could ChatGPT's shorter responses be an indication of something more bizarre going on?| Franz Louis Cesista
Years of experience in building artificial minds have led me to believe that these AIs may end up seeming more 'human' than we currently imagine them to be.| Franz Louis Cesista
A C++ implementation of Meta's Llama2 generative large-language model. I also optimized Karpathy's original C implementation by parallelizing the multi-head attention layer.| Franz Louis Cesista
Expedock Assistant is a chatbot that allows you to ask questions about your shipments and get answers in real time. It’s like having a personal assistant that knows everything about your business, shipments and industry.| Franz Louis Cesista
Expedock's AutoML Library -- fit a model, run batch inference, and get explanations in one line of code each.| Franz Louis Cesista
A thought dump on mRNA vaccines and the future of computational biology| Franz Louis Cesista
Booking demand prediction for Grab's Southeast Asia operations. The project involves spatio-temporal forecasting, anomaly detection, and econometric modeling.| Franz Louis Cesista
My entry for the World Finals of the Russian AI Cup 2018 - Codeball. A 3D physics-aware orchestrator of a pair of bots in a Rocket League-esque soccer game.| Franz Louis Cesista
My entry for the World Finals of the Russian AI Cup 2017 - Codewars. A particle swarm-based AI that uses potential flows and fluid mechanics to direct units in a Command-and-Conquer-esque game.| Franz Louis Cesista
A collection of algorithms, data structures and other useful information for competitive programming. Used and maintained by members of the Ateneo de Manila University Programming Varsity.| leloykun.github.io
Mathematician | Machine Learning (AI) Research Scientist| leloykun.github.io
A small step towards hardware-architecture-optimizer codesign in deep learning.| leloykun.github.io