One of the biggest flaws of NLLB, in my opinion, is that they had to constrain the vocabulary size to be able to handle all 200 languages. This is best explained in the paper:
8.1.1 Training a Tokenizer for 200+ languages

To represent the 200+ languages of No Language Left Behind, we trained a new SentencePiece (SPM; Kudo and Richardson, 2018) tokenizer. To train this SentencePiece model, we sample a total of 100M sentences from primary bitext corpora. Given that most of the languages in NLLB are low-resource languages (150), uniform sampling would over-represent high-resource languages and under-represent low-resource languages, leading to too much fragmentation of low-resource language text. To mitigate this, we apply temperature sampling (with temperature T = 5), which effectively downsamples high-resource languages and upsamples low-resource languages. This results in a more balanced distribution of samples over all languages. To validate the quality of the SPM, we first examine the rate of unknown tokens (<unk>) for each language. We observe that even after using a high temperature for sampling, certain languages such as zho_Hans, zho_Hant and yue_Hant had higher <unk> error rates, due to the very large character set of their scripts. To compensate, we further upsample those specific languages by a factor of 5 during training. With these modifications, the <unk> error rate for all languages is below 1%. Another important factor for quality is the tokenization rate, or the average number of tokens per sentence for each language (Mielke et al., 2021). Since SentencePiece identifies subword units based on language perplexity (roughly, frequency), underrepresented languages tend to have a higher tokenization rate than high-resource ones, leading to a near character-based model for those languages. This makes modeling more challenging, especially for long range dependencies and for synthesizing words from near character-level tokens. Based on the above two factors, we choose a vocabulary of size 256,000 for our SentencePiece model to allow for enough capacity to represent the wide spectrum and large number of languages we cover.
As we achieve reasonable tokenization quality with a vocabulary size of 256k, we do not train SentencePiece models with even larger vocabulary sizes (e.g. 384k or more), as a larger vocabulary size would significantly increase the number of model parameters. To evaluate with spBLEU, we use this SPM-200 as the tokenizer to better support the languages of Flores-200. We open source this SentencePiece model along with the Flores-200 dataset.
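The temperature sampling the paper describes is simple to sketch: each language's sampling probability is raised to the power 1/T and renormalized, flattening the distribution. Here is a minimal illustration with T = 5 as in the paper; the per-language sentence counts are made up for the example and are not NLLB's actual corpus sizes.

```python
# Temperature sampling over per-language corpus sizes (T = 5, as in NLLB).
# Counts below are invented for illustration only.
counts = {"eng": 1_000_000, "fra": 200_000, "yor": 5_000, "fuv": 1_000}

T = 5.0
total = sum(counts.values())
probs = {lang: c / total for lang, c in counts.items()}

# Raise each probability to 1/T and renormalize: high-resource languages
# get downsampled, low-resource languages get upsampled.
unnorm = {lang: p ** (1.0 / T) for lang, p in probs.items()}
z = sum(unnorm.values())
sampled = {lang: u / z for lang, u in unnorm.items()}

for lang in counts:
    print(f"{lang}: {probs[lang]:.4f} -> {sampled[lang]:.4f}")
```

With these numbers, English's share drops well below its raw ~83% while the lowest-resource language gains roughly an order of magnitude, which is exactly the rebalancing effect the paper is after.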
Being below 1% might sound reasonable, but in practice I found that roughly 1 in every 10-20 sentences ends up containing an unknown token, and it often happens to be a key word that wrecks the translation. This barely moves BLEU scores, of course, which masks the problem.
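This gap between the two metrics is easy to see: a per-token <unk> rate under 1% can still mean a large fraction of *sentences* contain at least one <unk>. A toy sketch below, using a hypothetical whitespace vocabulary rather than the real SPM-200 model (with which you would use sentencepiece's encoder instead), measures the per-sentence incidence:

```python
# Sketch: fraction of sentences containing at least one OOV token.
# The vocabulary and sentences are toy examples, not real SPM-200 data.
def unk_sentence_rate(sentences, vocab, unk="<unk>"):
    """Return the fraction of sentences with at least one out-of-vocab token."""
    hit = 0
    for s in sentences:
        tokens = [w if w in vocab else unk for w in s.split()]
        if unk in tokens:
            hit += 1
    return hit / len(sentences)

vocab = {"the", "cat", "sat", "on", "mat"}
sents = [
    "the cat sat",
    "the cat sat on the mat",
    "the dog sat",  # "dog" is OOV: one bad token taints the whole sentence
]
print(unk_sentence_rate(sents, vocab))
```

Here only 1 of 13 tokens is OOV (~8% per token), yet a third of the sentences are affected; shrink the per-token rate to 1% and the per-sentence rate still lands in the 1-in-10-to-20 range for typical sentence lengths.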