Flash attention, a recent implementation of attention that makes fewer calls to high-bandwidth memory, uses a numerically stable version of the softmax function. In this post, I’ll briefly show how this is done, along with an example of an unstable softmax. The softmax function is used in machine learning to convert a vector of real numbers into a vector of probabilities that sum to 1, and is defined as:

softmax(x) = [exp(xi) / sum(exp(xj) for xj in x) for xi in x]

where x is a vector of real numbers.
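As a quick illustration (this is my own minimal NumPy sketch, not the Flash Attention kernel itself, and the function names are just for this example), here is the naive softmax from the definition above next to the standard max-subtraction trick. Subtracting the maximum doesn't change the result, because the common factor exp(-max(x)) cancels in the numerator and denominator, but it keeps every exponent at or below zero so nothing overflows:

```python
import numpy as np

def naive_softmax(x):
    # Direct translation of the definition; exp overflows for large inputs.
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    # Shift by the max so the largest exponent is exp(0) = 1.
    # Mathematically identical, since exp(-max(x)) cancels in the ratio.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1000.0, 1000.5, 1001.0])
print(naive_softmax(x))   # [nan nan nan] -- exp(1000) overflows to inf
print(stable_softmax(x))  # ~[0.186, 0.307, 0.506]
```

The unstable version returns NaNs because exp of anything much above ~709 exceeds the largest double-precision float, turning both numerator and denominator into inf.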