Theoretical analysis provides insight into the optimization process during model training and reveals that for some optimizations, the Gaussian attention kernel may work better than softmax.| Amazon Science