Attention as a General-Purpose Primitive, Examined Closely
Transformers have become the dominant architecture in machine learning not by accident, but because the attention mechanism turned out to be a surprisingly general-purpose computational primitive. If