ByT5: Towards a token-free future with pre-trained byte-to-byte models
I think using bytes directly as tokens, without a tokenizer like SentencePiece, as described in this paper, will make sense for Argos Translate 2.0. This combined with seq2seq sentence boundary detection would allow translation using only CTranslate2.
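To illustrate the idea, here is a minimal sketch of byte-level "tokenization" in the ByT5 style: each UTF-8 byte maps directly to a token ID, shifted by a small offset to reserve IDs for special tokens, so no learned vocabulary or SentencePiece model is needed. The offset value and helper names are illustrative assumptions, not the exact ByT5 implementation.

```python
OFFSET = 3  # assumption: reserve IDs 0..2 for special tokens (e.g. pad, eos, unk)

def encode(text: str) -> list[int]:
    """Map text to token IDs: one ID per UTF-8 byte, no learned vocab."""
    return [b + OFFSET for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    """Invert encode(), skipping any reserved special-token IDs."""
    data = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return data.decode("utf-8", errors="ignore")

ids = encode("Hola")
print(ids)          # [75, 114, 111, 100]
print(decode(ids))  # Hola
```

Because the "vocabulary" is just the 256 possible byte values plus a few specials, the same encoder works for every language out of the box, at the cost of longer sequences.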
I find it amazing that one can use byte-to-byte models to train translation models. But it makes sense that a model would be able to learn tokenization on its own. How would that apply to translating content like HTML/XML?
This probably wouldn’t help XML very much (it might even make it harder, since you can’t tokenize tags as one entity). I’m currently running a tag injection server to create data for translating XML, but it is very slow. Long term I think XML will be solved by simply having language models powerful enough that they can manage tags in large sections of text.
Visual explanation of Charformers: https://youtu.be/debgj24BAZE
Interesting, thanks for sharing!