I’m doing my master’s thesis on distributed low-communication training. Essentially, how can we train large models efficiently across distributed nodes without being utterly destroyed by network latency and bandwidth? Below is some of what I’ve learned and investigated, day by day. Day 3: Current Work on Heterogeneous Workers# A desirable problem to solve is being able to use different kinds of hardware for training. Even within the same generation, NVIDIA B300 GPUs are 50% fas...
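To make the “low-communication” part concrete, here is a minimal sketch (my own illustration, not from the thesis) of local SGD with periodic parameter averaging: workers take many cheap local steps and synchronize only occasionally, so the network is touched `steps / sync_every` times instead of once per step. The names `grad_fn` and `sync_every` are hypothetical.

```python
import numpy as np

def local_sgd(init_params, grad_fn, n_workers=4, steps=1000, sync_every=50, lr=0.01):
    """Local SGD sketch: each worker steps independently on its own replica,
    and replicas are averaged (one all-reduce) every `sync_every` steps."""
    params = [init_params.copy() for _ in range(n_workers)]
    for t in range(steps):
        for i in range(n_workers):
            params[i] -= lr * grad_fn(params[i], worker=i)  # cheap local step
        if (t + 1) % sync_every == 0:
            avg = np.mean(params, axis=0)                   # rare, expensive sync
            params = [avg.copy() for _ in range(n_workers)]
    return np.mean(params, axis=0)
```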
A few weeks back, I implemented GPT-2 using WebGL and shaders (GitHub Repo), which made the front page of Hacker News (discussion). This is a short write-up of what I learned about old-school general-purpose GPU programming over the course of this project. The Origins of General-Purpose GPU Programming# --- In the early 2000s, NVIDIA introduced programmable shaders with the GeForce 3 (2001) and GeForce FX (2003). Instead of being limited to the predetermined transformations and effects of earlie...
My notes over Mark Maxwell’s course, Introduction to Mathematical Statistics, and his textbook, Probability & Statistics with Applications, Second Edition. Sampling Distributions and Estimation# --- Normally in a probability experiment, we don’t know the true values of a model’s parameters, and therefore, we must estimate them using random observations. Because the observations are random, our estimates are subject to the vagaries of chance. We find ourselves in a paradoxical situation ...
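The excerpt cuts off here, but the “vagaries of chance” are easy to see in a quick simulation (a sketch of my own, assuming a normal model with known true parameters): repeated samples from the same distribution give different estimates of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma, n = 5.0, 2.0, 30

# Estimate the mean from 1,000 independent samples of size n.
estimates = [rng.normal(true_mu, true_sigma, n).mean() for _ in range(1000)]

print(np.mean(estimates))   # close to true_mu = 5.0 (unbiased)
print(np.std(estimates))    # close to true_sigma / sqrt(n) ≈ 0.365
```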
An overview of common discrete and continuous distributions found in probability and statistics, from Mark Maxwell’s textbook, Probability & Statistics with Applications, Second Edition. Common Discrete Distributions# --- Discrete Uniform# A random variable $X$ is said to have a discrete uniform distribution if its probability function is: $$Pr(X=x)=\frac{1}{n}$$ for $x=1,2,\dots,n$. Main Properties# Expected Value: $$E[X]=\frac{n+1}{2}$$ Variance: $$Var[X]=\frac{n^2-1}{12}$$ Additional ...
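Both properties are easy to sanity-check numerically; a quick sketch of mine (not from the textbook):

```python
import numpy as np

n = 10
x = np.arange(1, n + 1)         # outcomes 1..n, each with probability 1/n

mean = x.mean()                  # should equal (n + 1) / 2 = 5.5
var = ((x - mean) ** 2).mean()   # should equal (n^2 - 1) / 12 = 8.25

assert mean == (n + 1) / 2
assert var == (n * n - 1) / 12
```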
Lately, I’ve been coming across many blogs that have weird font-size rendering issues for code blocks on iOS. Basically, in a code snippet, the text size would sometimes be much larger for some lines than others. Below is a screenshot of the issue from a website where I’ve seen this occur. As you can see, the text size isn’t uniform across code block lines. I’ve seen this issue across many blogs that compile markdown files to HTML, such as sites built using Hugo, Jekyll, or even custom...
For a while, I wanted to build a complete autograd engine. What is an autograd engine, you might ask? To find the answer, we first must know what a neural network is. Neural Network Crash Course# --- A neural network can be seen simply as a black-box function. We pass an input into this black box and receive an output. Normally, in a function, we define the rules for how to manipulate the input to get an output. For example, if we want a function that doubles the input, i.e., $f(x) = 2x$, then ...
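The excerpt ends before the post answers its own question, so here is a minimal sketch of an autograd engine’s core, in the spirit of micrograd (my illustration, not the post’s actual code): each value records how it was produced, and `backward()` walks that graph in reverse, applying the chain rule.

```python
class Value:
    """A scalar that remembers how it was produced, for reverse-mode autodiff."""
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():                      # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():                      # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x = Value(3.0)
y = x * x + x          # f(x) = x^2 + x, so f'(3) = 2*3 + 1 = 7
y.backward()
print(y.data, x.grad)  # 12.0 7.0
```

Accumulating gradients with `+=` rather than overwriting is what makes reuse of the same value, as in `x * x`, come out right.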
Recently, I was going through Thorsten Ball’s “Writing An Interpreter in Go”. In this book, you create a basic interpreted language and write a lexer, parser, evaluator, and REPL for it. A lexer takes in source code and turns it into an intermediate representation, usually a stream of tokens. This is called lexical analysis. A parser then takes this stream of tokens and turns it into an Abstract Syntax Tree, which is then evaluated and run.
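The book’s implementation is in Go; as a stand-in, here is a toy lexer of my own in Python showing the token-stream idea on a Monkey-style `let` statement:

```python
import re

# Token spec: order matters, the first alternative to match wins.
TOKEN_SPEC = [
    ("LET",    r"let\b"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("INT",    r"\d+"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    """Turn source text into a stream of (type, literal) tokens."""
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(lex("let five = 5 + 3;")))
# [('LET', 'let'), ('IDENT', 'five'), ('ASSIGN', '='), ('INT', '5'),
#  ('PLUS', '+'), ('INT', '3'), ('SEMI', ';')]
```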
Below are all the books I’ve read since middle school, roughly in order. Those highlighted in blue are ones I particularly enjoyed :) 2025 – Age 22# --- Willpower - Roy F. Baumeister & John Tierney I was at a ZFellows event in SF where one of the speakers, Adam Guild, recommended this book. It’s been a long time since I’ve read a self-help book like this. I gave it a shot because I’ve felt lazy recently and was looking for something to help me pick up my old good habits again. EDIT: I d...
Here are a few quotes I’ve particularly liked over the years. Life# “I believe that a man should strive for only one thing in life, and that is to have a touch of greatness” — Félix Martí-Ibáñez “In three words, I can sum up everything I’ve learned about life: it goes on.” — Robert Frost “But you see,” said Roark quietly, “I have, let’s say, sixty years to live. Most of that time will be spent working. I’ve chosen the work I want to do. If I find no joy in it, t...
These are some of my notes over Qiang Liu’s course, Machine Learning II. Gradient Descent# --- Gradient Descent is a fundamental, first-order iterative optimization algorithm for minimizing a function. The primary objective of Gradient Descent is to find a minimum of a function by iteratively stepping in the direction of the negative gradient. Update Rule: The parameters $ \theta $ are updated as follows in each iteration: $$\theta_{t+1} = \theta_t - \alpha \nabla_\theta J(\theta_t)$$ where $\alpha$ is the learning rate and $J$ is the function being minimized.
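A tiny numeric instance (mine, not from the notes), minimizing $f(\theta) = (\theta - 3)^2$ from $\theta_0 = 0$:

```python
def grad(theta):
    return 2 * (theta - 3.0)     # derivative of f(theta) = (theta - 3)^2

theta, lr = 0.0, 0.1
for _ in range(100):
    theta -= lr * grad(theta)    # step against the gradient

print(theta)                     # ≈ 3.0, the minimizer
```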
These are a few of my notes from Eunsol Choi’s NLP class at UT Austin. Word Embeddings# --- Word embeddings are a type of word representation that captures the semantic meaning of words in a continuous vector space. Unlike one-hot encoding, where each word is represented as a binary vector of all zeros except for a single ‘1’, word embeddings capture much richer information, including semantic relationships, word context, and even aspects of syntax.
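One payoff of a continuous vector space is that similarity becomes geometry. A small sketch of mine (the vectors are made up; real embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: near 1 for similar directions, near 0 for unrelated."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

king, queen, banana = (np.array(v) for v in
                       ([0.9, 0.8, 0.1], [0.85, 0.82, 0.12], [0.1, 0.05, 0.9]))

print(cosine(king, queen))   # high: related words point the same way
print(cosine(king, banana))  # low: unrelated words diverge
```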
These are a few of my notes from Eunsol Choi’s NLP class at UT Austin. Recurrent Neural Networks (RNNs)# --- Recurrent Neural Networks (RNNs) are a class of artificial neural networks specifically designed to tackle sequence-based problems. Unlike traditional feedforward neural networks, RNNs possess a memory in the form of a hidden state, enabling them to remember and leverage past information when making decisions. This makes them particularly effective for tasks like language modeling, t...
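A single recurrent step is small enough to write out. Here is a sketch of mine using the standard tanh cell (not the course’s code); the hidden state `h` is the memory that carries past information forward:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
Wxh = rng.normal(size=(d_hid, d_in)) * 0.1   # input -> hidden weights
Whh = rng.normal(size=(d_hid, d_hid)) * 0.1  # hidden -> hidden weights
b = np.zeros(d_hid)

def rnn_step(x, h):
    """One step: the new hidden state mixes the input with the old state."""
    return np.tanh(Wxh @ x + Whh @ h + b)

h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):  # a length-5 input sequence
    h = rnn_step(x, h)                # h now summarizes everything seen so far
```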
These are a few of my notes from Eunsol Choi’s NLP class at UT Austin. Generative Models vs. Discriminative Models# --- When it comes to classification, models are broadly categorized into Generative Models and Discriminative Models. Generative Models# In generative models, we aim to model the joint distribution of the data $ p(x, y) $. These models often assume a particular functional form for both $ P(x|y) $ and $ P(y) $. To classify a new data point, we maximize: $$\hat{y}=\arg\max_y P(x|y)\,P(y)$$
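A generative classifier in miniature (my own illustration, assuming one-dimensional Gaussian class-conditionals): fit $P(x|y)$ per class, estimate $P(y)$ from class counts, and classify with the maximization above.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Toy 1-d training data for two classes.
x0 = np.array([1.0, 1.2, 0.8, 1.1])   # class 0
x1 = np.array([3.0, 2.8, 3.2, 3.1])   # class 1

params = [(x.mean(), x.std() + 1e-6, len(x)) for x in (x0, x1)]
total = len(x0) + len(x1)

def classify(x):
    # argmax over y of P(x|y) * P(y)
    scores = [gauss_pdf(x, mu, sd) * (n / total) for mu, sd, n in params]
    return int(np.argmax(scores))

print(classify(1.05), classify(2.9))  # 0 1
```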
My notes over Mark Maxwell’s course, Probability I, and his textbook, Probability & Statistics with Applications, Second Edition. Combinatorial Probability# --- The fundamental theorem of counting is also known as the multiplication principle. If a first experiment has $N(A)$ outcomes and, for each of these outcomes, a second experiment has $N(B)$ outcomes, then the total number of outcomes for the two combined is $N(A)\cdot N(B)$. Example 1
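The excerpt cuts off before Example 1, but the principle itself is exactly what `itertools.product` enumerates; a stand-in sketch of mine:

```python
from itertools import product

shirts = ["red", "blue", "green"]   # N(A) = 3 outcomes
pants = ["jeans", "khakis"]         # N(B) = 2 outcomes for each shirt

outfits = list(product(shirts, pants))
print(len(outfits))                 # 3 * 2 = 6, by the multiplication principle
```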
These are my notes over my review of Linear Algebra, going through Gilbert Strang’s Introduction To Linear Algebra. Introduction to Vectors# --- The core of linear algebra is vector addition and scalar multiplication. Combining these two operations gives us a set of linear combinations. $$ c\mathbf{v} + d\mathbf{w} = c\begin{bmatrix} 1 \\ 2 \end{bmatrix} + d\begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} c + 3d \\ 2c + 4d \end{bmatrix}. $$
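The same combination evaluated numerically (a sketch of mine), with $c = 2$ and $d = -1$:

```python
import numpy as np

v = np.array([1, 2])
w = np.array([3, 4])
c, d = 2, -1

print(c * v + d * w)   # [-1, 0]: one particular linear combination of v and w
```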
October 14th, 2025: This post is old and most likely outdated if you’re reading this! Dioxus has possibly changed a substantial amount, so do not read this as a how-to guide. Why Rust for Front-End Development# --- I’ve been using React and Next.js for front-end development ever since high school; it was one of the first things I learned when it came to programming. Recently, I’ve had the itch to learn something new, specifically Rust front-end. As someone with a “.rs” doma...
A small review of Calculus 1, 2, and 3, based on the textbook Calculus: Early Transcendentals (Eighth Edition). Differentiation Rules# --- Product Rule# If $f$ and $g$ are both differentiable, then $$\frac{d}{dx}[f(x)g(x)]=f(x)g^\prime(x)+g(x)f^\prime(x)$$ Quotient Rule# If $f$ and $g$ are differentiable, then $$\frac{d}{dx}\bigg[\frac{f(x)}{g(x)}\bigg]=\frac{g(x)f^\prime(x)-f(x)g^\prime(x)}{[g(x)]^2}$$ Integration# --- The Substitution Rule# If an integral has both an $x$ value and the der...
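A worked instance of substitution (my example, not the textbook’s): with $u = x^2$ and $du = 2x\,dx$,

$$\int 2x\cos(x^2)\,dx = \int \cos u\,du = \sin u + C = \sin(x^2) + C.$$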
A while back, Google DeepMind unveiled Gemini Diffusion, an experimental language model that generates text using diffusion. Unlike traditional GPT-style models that generate one token at a time, Gemini Diffusion creates whole blocks of text by refining random noise step by step. I read the paper Large Language Diffusion Models and was surprised to find that discrete language diffusion is just a generalization of masked language modeling (MLM), something we’ve been doing since 2018. The firs...
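To see the MLM connection concretely, here is a toy sketch of mine (a real model replaces `predict` with a trained transformer): each reverse step re-masks a fraction of positions and asks the model to fill them in. Run at a fixed mask rate of roughly 15%, this is the BERT-style MLM objective; sweeping the rate from 1.0 down toward 0 is the diffusion generalization.

```python
import random

random.seed(0)
MASK = "<mask>"

def predict(tokens):
    # Stand-in denoiser: a trained model would predict each masked token
    # from the visible context; here we just guess a filler word.
    return [("the" if t == MASK else t) for t in tokens]

def diffusion_step(tokens, mask_rate):
    """One reverse step: hide a fraction of positions, then denoise them."""
    noised = [(MASK if random.random() < mask_rate else t) for t in tokens]
    return predict(noised)

text = "the cat sat on the mat".split()
for rate in (1.0, 0.5, 0.15):   # coarse-to-fine refinement schedule
    text = diffusion_step(text, rate)
print(" ".join(text))
```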