This post analyzes the memory and computation overhead of transformer models. Many transformer-based models are described simply as "a model with X B parameters"; I wanted to break that down and look into the model structure: how the parameters are actually stored and used on real computing hardware. Many of the illustrations and analyses here are based on the following papers.[1]

## Transformer-based Model

Since Google announced attention and the transformer model in 2017,[2] NLP and image classification models are rapidly…
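Before diving into the model structure, here is a minimal back-of-the-envelope sketch of what "X B parameters" means in raw memory terms. This is my own illustration, not taken from the papers above; the 7B parameter count and the choice of dtypes are assumptions picked purely for the example.

```python
# Rough weight-memory estimate for an "X-B parameter" model.
# Assumption: a hypothetical 7B-parameter model, weights only
# (no activations, optimizer state, or KV cache).

NUM_PARAMS = 7e9  # hypothetical "7B" model

BYTES_PER_PARAM = {
    "fp32": 4,  # 32-bit float, common for training
    "fp16": 2,  # 16-bit float, common for inference
    "int8": 1,  # 8-bit quantized weights
}

for dtype, nbytes in BYTES_PER_PARAM.items():
    gib = NUM_PARAMS * nbytes / 2**30
    print(f"{dtype}: {gib:.1f} GiB just for the weights")
```

Even this crude arithmetic shows why the parameter count alone is an incomplete description: the same 7B model occupies roughly 26 GiB in fp32 but only about 6.5 GiB in int8, which is the kind of hardware-level breakdown the rest of this post walks through.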