Noam Shazeer and Jeff Dean on Dwarkesh

- arithmetic is very cheap; moving data around is expensive.
- model parameters are very memory efficient: roughly one fact per parameter? (this probably isn't the right way to think about it because of superposition?)
- versus in context: the per-token KV entries can carry many more bits each.
- inference improvement thing: big model as verifier, small model goes first — "drafter models". are these real? i didn't see how these parallelize at first. oh wait, you can batch it: the big model verifies all the drafted tokens in one forward pass.
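The "arithmetic cheap, data movement expensive" point can be made concrete with a quick back-of-envelope ratio. The specific numbers below are my assumptions (ballpark public specs for one modern accelerator, an H100-class GPU), not from the talk:

```python
# Rough compute-vs-bandwidth ratio for an H100-class accelerator.
# Numbers are approximate assumptions from public spec sheets.
flops_per_s = 989e12   # ~989 TFLOP/s dense bf16 compute
bytes_per_s = 3.35e12  # ~3.35 TB/s HBM memory bandwidth

# Arithmetic intensity needed to be compute-bound rather than memory-bound:
ops_per_byte = flops_per_s / bytes_per_s
print(f"need ~{ops_per_byte:.0f} FLOPs per byte moved to stay compute-bound")
```

This is also why batching matters for decoding: generating one token at a time does roughly one multiply-add per parameter byte read, which is hundreds of times below that ratio, so single-stream decode is memory-bandwidth-bound; batching B requests reuses each parameter read B times.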