The most notable thing about this model is that it uses fewer parameters (20 billion) than many other LLMs, which makes it less resource-intensive to train and easier to run.
It also uses an encoder-decoder architecture, which is common for machine translation, unlike most large language models, which are decoder-only.
In an encoder-decoder architecture, the encoder produces a representation of an input text using a bidirectional encoding, and the decoder uses that representation to perform some task, historically generating a translation of the input.
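A minimal sketch of that flow, using the Hugging Face transformers library with the small, publicly available encoder-decoder model t5-small as a stand-in (not AlexaTM 20B itself; the model name and prompt are just illustrative):

```python
# Translation with a generic encoder-decoder model (t5-small as a stand-in).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 uses a task prefix to select translation.
inputs = tokenizer("translate English to German: The house is small.",
                   return_tensors="pt")

# The encoder builds a bidirectional representation of the whole input;
# the decoder then generates the translation token by token from it.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```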
By contrast, a decoder-only model uses a left-to-right (unidirectional) encoding of the input text. This works well for language modeling, in which the task is to predict the next token in a sequence based on those that precede it, but it’s less effective for machine translation and text summarization, the tasks on which AlexaTM 20B outperforms GPT-3.
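To make the contrast concrete, here is a minimal sketch (in PyTorch, with an arbitrary sequence length) of the attention masks behind the two approaches: a bidirectional encoder lets every token attend to every other token, while a decoder-only model's causal mask lets each token attend only to itself and the tokens before it.

```python
# Bidirectional vs. causal (left-to-right) attention masks, illustrated.
import torch

seq_len = 5  # arbitrary example length

# Encoder-style (bidirectional): every position may attend to every position.
bidirectional_mask = torch.ones(seq_len, seq_len)

# Decoder-only (causal): position i may attend only to positions 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# In attention, disallowed positions are typically set to -inf before the softmax,
# so they receive zero weight.
scores = torch.randn(seq_len, seq_len)  # raw attention scores
masked = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)  # each row sums to 1 over allowed positions

print(bidirectional_mask.int())
print(causal_mask.int())
print(weights)
```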
For reference on model sizes:
- Argos Translate - 150M params per model, 7B total
- DeepMind Chinchilla - 100B params
- BLOOM - 176B params