Introduction

One of the key components of the Transformer architecture is the Attention layer, which lets every word (or, more generally, every token) learn from the context provided by every other token in the sequence. It was introduced in the seminal paper "Attention Is All You Need". In