Neural Machine Translation Without Tokenization

argosopentech · June 17, 2021, 11:31pm

ByT5: Towards a token-free future with pre-trained byte-to-byte models

I think using characters (or raw UTF-8 bytes, as in this paper) directly as tokens, without a tokenizer like SentencePiece, will make sense for Argos Translate 2.0. Combined with seq2seq sentence boundary detection, this would allow translation using only CTranslate2.
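To make the idea concrete, here is a minimal sketch of what "no tokenizer" can look like: every UTF-8 byte becomes one ID, so there is no learned vocabulary to train or ship. The function names and the choice to reserve IDs 0-2 for special symbols are assumptions for illustration (the reserved IDs mirror the convention I understand ByT5 to use); this is not Argos Translate or CTranslate2 code.

```python
from typing import List

# Assumption: IDs 0-2 are reserved for special symbols (pad, eos, unk),
# so raw byte values are shifted up by 3.
SPECIAL_OFFSET = 3
EOS_ID = 1


def encode(text: str) -> List[int]:
    """One ID per UTF-8 byte, shifted past the special IDs, followed by EOS."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: List[int]) -> str:
    """Invert encode(): drop special IDs, shift back, and decode the bytes."""
    raw = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ids = encode("Hola, ¿qué tal?")
    print(len(ids), ids)
    print(decode(ids))
```

The trade-off is sequence length: non-ASCII characters take two or more bytes each, so inputs get noticeably longer than with subword tokens.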

pierotofy · June 18, 2021, 3:06pm

I find it amazing that translation models can be trained byte-to-byte. But it makes sense that a model would be able to learn tokenization on its own. How would that apply to translating content like HTML/XML?

argosopentech · June 18, 2021, 10:11pm

This probably wouldn’t help with XML very much (it might even make it harder, since you can’t tokenize a tag as a single entity). I’m currently running a tag injection server to create data for translating XML, but it is very slow. Long term, I think XML will be solved by simply having language models powerful enough to manage tags across large sections of text.
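A rough illustration of why tags get harder at the byte level: with a subword tokenizer a tag could in principle be registered as a single symbol, but byte-to-byte it costs one symbol per byte. The helper below is hypothetical and uses a deliberately simple regex to split tags from text; it is only meant to show the counts, not to parse real XML.

```python
import re
from typing import List, Tuple

# Simplistic pattern for illustration only: anything between "<" and ">".
TAG_RE = re.compile(r"(<[^>]+>)")


def byte_cost(markup: str) -> List[Tuple[str, bool, int]]:
    """Return (piece, is_tag, n_bytes) for each tag or text piece in the markup."""
    pieces = [p for p in TAG_RE.split(markup) if p]
    return [(p, bool(TAG_RE.fullmatch(p)), len(p.encode("utf-8"))) for p in pieces]


if __name__ == "__main__":
    for piece, is_tag, n in byte_cost('<a href="/docs">Read the docs</a>'):
        kind = "tag " if is_tag else "text"
        print(f"{kind} {n:3d} bytes  {piece!r}")
```

Each of those tag bytes has to be copied through the model exactly, which is where I would expect a byte-level model to struggle more than one with atomic tag tokens.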

argosopentech · June 26, 2021, 1:43pm

argosopentech · June 29, 2021, 10:30pm

Visual explanation of Charformer: https://youtu.be/debgj24BAZE

pierotofy · June 30, 2021, 11:40pm

Interesting, thanks for sharing!