In this post, we will take a look at relative positional encoding, as introduced in Shaw et al. (2018) and refined by Huang et al. (2018). This is a topic I had meant to explore earlier, but only recently did I force myself to dive into it, as I started reading about music generation with NLP language models. That is a topic for a separate post of its own, so let’s not get distracted.| Jake Tae
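Since the linked post centers on Shaw et al.'s relative positional encoding, here is a minimal sketch of the core idea: attention logits receive an extra term from learned embeddings indexed by the (clipped) distance between query and key positions. The class name `RelativeSelfAttention` and the clipping distance `max_rel_dist` are my own illustrative choices, not taken from the post; this is a simplified single-head version, not the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention with Shaw-style relative position embeddings.

    Relative distances are clipped to [-max_rel_dist, max_rel_dist], and each
    clipped distance has a learned embedding that adds an extra term to the
    attention logits (the simplified form used by Huang et al.).
    """

    def __init__(self, d_model: int, max_rel_dist: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.max_rel_dist = max_rel_dist
        # One embedding per clipped relative distance in [-k, k].
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len, d_model = x.size(1), x.size(2)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Content-based logits: (batch, seq_len, seq_len)
        content_scores = torch.matmul(q, k.transpose(-2, -1))

        # Relative distances j - i, clipped and shifted to valid indices.
        positions = torch.arange(seq_len, device=x.device)
        rel = positions[None, :] - positions[:, None]
        rel = rel.clamp(-self.max_rel_dist, self.max_rel_dist) + self.max_rel_dist
        rel_k = self.rel_emb(rel)  # (seq_len, seq_len, d_model)

        # Position-based logits: q_i . r_{ij} for every query/key pair (i, j).
        rel_scores = torch.einsum("bid,ijd->bij", q, rel_k)

        attn = F.softmax((content_scores + rel_scores) / d_model ** 0.5, dim=-1)
        return torch.matmul(attn, v)
```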
Building the world's most influential neural network architecture from scratch...| cameronrwolfe.substack.com
Breaking down the capabilities of Google's highly anticipated OpenAI competitor...| cameronrwolfe.substack.com
This criterion computes the cross entropy loss between input logits and target.| pytorch.org
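As a quick reminder of what the linked criterion expects, here is a small sketch with hypothetical shapes: `nn.CrossEntropyLoss` takes raw (unnormalized) logits and integer class targets, applying log-softmax internally.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a batch of 4 examples over 10 classes.
logits = torch.randn(4, 10)            # raw scores; no softmax applied beforehand
targets = torch.tensor([1, 0, 4, 9])   # integer class indices

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)      # log-softmax + negative log-likelihood in one call
print(loss.item())
```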
Understanding FlashAttention, the most efficient exact attention implementation out there, which optimizes for both memory requirements and wall-clock time.| shreyansh26.github.io
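For readers who want to try a FlashAttention-style kernel without writing one, PyTorch's `torch.nn.functional.scaled_dot_product_attention` can dispatch to a fused FlashAttention backend on supported GPUs. The shapes and the CUDA/float16 setup below are illustrative assumptions, not taken from the linked post.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, heads, seq_len, head_dim); assumes a CUDA GPU.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# On supported hardware this dispatches to a fused (FlashAttention-style) kernel,
# avoiding materializing the full seq_len x seq_len attention matrix in memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```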