Overview of Tokenization

A good overview of different tokenization strategies. I’m most optimistic about single-character or byte-level tokenization, with the network left to figure out the rest. Argos Translate currently uses Unigram models in SentencePiece. I also liked the idea in this paper of using character hashes to make single-character tokenization more straightforward.
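
To make the byte-level and character-hash ideas a bit more concrete, here is a minimal Python sketch. The function names (`byte_tokenize`, `byte_detokenize`, `hashed_char_ids`), the bucket count, and the multiplicative hash are all assumptions made up for illustration; this is not the scheme from the paper and not what Argos Translate does.

```python
# Illustrative sketch only: names and hashing scheme are hypothetical.

def byte_tokenize(text: str) -> list[int]:
    """Map text to its UTF-8 byte values, so the vocabulary is always 256 ids."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Invert byte_tokenize; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


def hashed_char_ids(text: str, num_buckets: int = 1024, num_hashes: int = 2) -> list[tuple[int, ...]]:
    """Map each Unicode code point to a few bucket ids instead of one vocabulary id.

    A model could sum one embedding per bucket, so the embedding tables stay
    small no matter how many distinct characters appear. The hash used here
    (code point times a small prime, modulo the bucket count) is an assumption
    chosen for simplicity.
    """
    primes = (31, 43, 59, 61, 73, 97, 103, 113)
    if not 1 <= num_hashes <= len(primes):
        raise ValueError("num_hashes must be between 1 and %d" % len(primes))
    return [
        tuple(((ord(ch) + 1) * primes[k]) % num_buckets for k in range(num_hashes))
        for ch in text
    ]


if __name__ == "__main__":
    ids = byte_tokenize("hé")
    print(ids)                   # [104, 195, 169]: "é" takes two bytes in UTF-8
    print(byte_detokenize(ids))  # hé
    print(hashed_char_ids("hé"))
```

The tradeoff is sequence length: byte-level input is longer than subword input, especially for non-Latin scripts, while hashing keeps one position per character at the cost of occasional bucket collisions.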