The authors suggest treating translation as a generic language modeling task, predicting the next token in a sequence, rather than using separate networks to encode the input text and decode the output text. In practice this would mean doing few-shot translation with an autoregressive, decoder-only Transformer model.
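As a rough illustration of the idea, here is a minimal sketch of few-shot translation framed as next-token prediction. The Hugging Face `transformers` API and the `gpt2` checkpoint are illustrative assumptions, not the authors' setup; any causal (decoder-only) language model could be swapped in.

```python
# Sketch: translation as plain next-token prediction with a decoder-only LM.
# Model choice and library are assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Few-shot prompt: a couple of translation examples, then the sentence to translate.
prompt = (
    "English: The cat sleeps. French: Le chat dort.\n"
    "English: I like coffee. French: J'aime le cafe.\n"
    "English: Where is the station? French:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,                     # the "translation" is just more next tokens
    do_sample=False,                       # greedy decoding keeps the sketch deterministic
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated continuation after the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

The point of the sketch is the framing: there is no separate encoder or decoder network, just one model continuing a token sequence, and the in-context examples are what steer that continuation toward translation.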