I'm continuing through chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", which covers training the LLM. Last time I wrote about cross entropy loss. Before moving on to the next section, I wanted to post about something that the book only covers briefly in a sidebar: perplexity.

Back in May, I thought I had understood it:

> Just as I was finishing this off, I found myself thinking that logits were interesting because you could take some measure of how certain t...
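
For reference, here's a minimal sketch (not from the book, with made-up toy values for the logits and targets) of how perplexity relates to the cross entropy loss from the last post: it's just the exponential of the average loss.

```python
import torch
import torch.nn.functional as F

# Toy logits for two token positions over a 5-token vocabulary,
# plus the target token IDs -- stand-ins for real model outputs.
logits = torch.tensor([
    [2.0, 0.5, 0.1, -1.0, 0.3],
    [0.2, 1.5, -0.5, 0.0, 0.8],
])
targets = torch.tensor([0, 1])

# Average cross entropy loss over the positions (as in the previous post).
loss = F.cross_entropy(logits, targets)

# Perplexity is the exponential of that average loss.
perplexity = torch.exp(loss)

print(f"loss = {loss.item():.4f}, perplexity = {perplexity.item():.4f}")
```

Because the exp undoes the log inside the cross entropy, the result can be read as an effective branching factor: lower is better, and a perplexity of 1 would mean the model was completely certain of every target token.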