Neural Machine Translation Without Tokenization

ByT5: Towards a token-free future with pre-trained byte-to-byte models

I think using characters directly as tokens, without a tokenizer like SentencePiece, as described in this paper, will make sense for Argos Translate 2.0. Combined with seq2seq sentence boundary detection, this would allow translation using only CTranslate2.
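To make the idea concrete, here is a minimal sketch of byte-level "tokenization" in the ByT5 style: each UTF-8 byte is used as a token ID directly, so no learned vocabulary is needed. (This is an illustration of the concept, not ByT5's exact scheme — the real model also reserves a few IDs for special tokens like padding and end-of-sequence.)

```python
def bytes_to_ids(text: str) -> list[int]:
    """Map text to token IDs: one ID per UTF-8 byte, no vocabulary needed."""
    return list(text.encode("utf-8"))

def ids_to_text(ids: list[int]) -> str:
    """Invert the mapping: bytes decode back to the original text."""
    return bytes(ids).decode("utf-8")

ids = bytes_to_ids("héllo")
print(ids)                 # "é" is two bytes in UTF-8, so 6 IDs for 5 characters
print(ids_to_text(ids))    # round-trips back to "héllo"
```

Note the trade-off: sequences get longer (one token per byte rather than per subword), but any string in any language maps losslessly with no out-of-vocabulary tokens.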

Related links:

I find it amazing that byte-to-byte models can be used to train translation models. But it makes sense that a model could learn tokenization on its own. How would that apply to translating content like HTML/XML?


This probably wouldn’t help with XML very much (it might even make it harder, since you can’t tokenize tags as single entities). I’m currently running a tag-injection server to create data for translating XML, but it is very slow. Long term, I think XML will be solved simply by having language models powerful enough to manage tags across large sections of text.
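For context, the tag-injection approach might look something like this sketch: replace XML tags with opaque placeholders before translation so the model can copy them through, then restore the originals afterwards. The helper names and the `translate` call are hypothetical stand-ins, not the actual server's API.

```python
import re

# Matches opening and closing tags like <b> or </b> (a simplification;
# real XML handling would use a proper parser).
TAG_RE = re.compile(r"</?[^>]+>")

def protect_tags(text: str) -> tuple[str, list[str]]:
    """Swap each tag for a placeholder token and remember the originals."""
    tags = TAG_RE.findall(text)
    for i, tag in enumerate(tags):
        text = text.replace(tag, f"__TAG{i}__", 1)
    return text, tags

def restore_tags(text: str, tags: list[str]) -> str:
    """Put the original tags back in place of the placeholders."""
    for i, tag in enumerate(tags):
        text = text.replace(f"__TAG{i}__", tag, 1)
    return text

protected, tags = protect_tags("<b>Hello</b> world")
print(protected)                        # __TAG0__Hello__TAG1__ world
# translated = translate(protected)    # hypothetical MT call
translated = protected                 # identity "translation" for illustration
print(restore_tags(translated, tags))  # <b>Hello</b> world
```

The fragile part is the assumption that the model copies placeholders through unchanged and keeps them attached to the right words after reordering, which is why this works better on short, simple segments.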

Visual explanation of Charformers:


Interesting, thanks for sharing!