Transformers In Self and Cross Attention
Transformers were initially designed for sequential inputs, which are first encoded as embeddings. The original paper can be referenced here,
The most general architecture of a Transformer consists of an encoder and a decoder, as discussed in the Attention article that I wrote earlier.
Both the encoder and the decoder have a stack of layers, where each layer's first sub-layer is a multi-head attention block, followed by layer normalization.
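To make this concrete, here is a minimal sketch of one such encoder-style layer in PyTorch, using the built-in nn.MultiheadAttention module. The dimensions (d_model = 512, num_heads = 8) and the class name are illustrative assumptions, not part of the original article; the point is only the ordering of multi-head self-attention, a residual connection, and layer normalization.

import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Illustrative encoder-style layer: multi-head self-attention + layer norm."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        # Multi-head attention block; it becomes *self*-attention when the
        # query, key, and value all come from the same sequence.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: the sequence attends to itself.
        attn_out, _ = self.self_attn(x, x, x)
        # Residual connection followed by layer normalization.
        return self.norm(x + attn_out)

# Usage: a batch of 2 sequences, 10 tokens each, embedded in 512 dimensions.
layer = EncoderLayerSketch()
x = torch.randn(2, 10, 512)
print(layer(x).shape)  # torch.Size([2, 10, 512])

A full Transformer layer also contains a position-wise feed-forward sub-layer after the attention block, but the sketch above captures the attention-then-normalization step described here.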