I’ve been working on implementing multilingual translation with pre-trained models and the primary issue is how to deal with the idiosyncrasies of the different models we may want to run.
For example, some of the models currently supported by CTranslate2 require different types of tokens to be added to the source text before translation and removed from the target text after translation. I’d like to implement Argos Translate as generically as possible and be able to support new models as they come out by updating the package index, without having to update the code.
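One way to keep the code generic is to describe each model’s token handling as data rather than code. This is only a sketch of that idea; the field names ("source_prefix", "target_strip_prefix") are illustrative and not part of any existing .argosmodel format.

```python
import json

# Hypothetical metadata an .argosmodel package could ship to describe
# how to wrap source text and clean target text for its model.
metadata_json = """
{
    "source_prefix": "__en__ ",
    "target_strip_prefix": "__es__ "
}
"""

def prepare_source(text, metadata):
    """Prepend whatever token the model expects before the source text."""
    return metadata.get("source_prefix", "") + text

def clean_target(text, metadata):
    """Remove the model's language token from the output if present."""
    prefix = metadata.get("target_strip_prefix", "")
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text

metadata = json.loads(metadata_json)
print(prepare_source("Hello", metadata))           # __en__ Hello
print(clean_target("__es__ Hola", metadata))       # Hola
```

With this approach, supporting a model with a new token convention would only require publishing a package with different metadata, not a code change.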
The model [FairSeq WMT19] can then be used to sample or score sequences of tokens. All inputs should start with the special token </s>.
For translation [with M2M-100], the language tokens should prefix the source and target sequences. Language tokens have the format __X__ where X is the language code. See the end of the fixed dictionary file for the list of accepted languages.
Similar to M2M-100, the language tokens [for MBART-50] should prefix the source and target sequences.
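The __X__ token format used by M2M-100 and MBART-50 is simple enough to handle with a couple of small helpers. This is a sketch; the function names are my own, not from any library.

```python
def make_lang_token(code):
    """Wrap a language code in the double-underscore token format."""
    return f"__{code}__"

def parse_lang_token(token):
    """Return the language code if token looks like __X__, else None."""
    if token.startswith("__") and token.endswith("__") and len(token) > 4:
        return token[2:-2]
    return None

print(make_lang_token("en"))       # __en__
print(parse_lang_token("__es__"))  # es
print(parse_lang_token("hello"))   # None
```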
I’m watching as new models are released to see if a consistent convention emerges. My current plan is to expect the convention used by M2M-100 of prefixing __LANGUAGECODE__
tokens to the source and target text. This would mean adding the language code token to the source text, and removing it from the target text if it’s present. For example:
__en__ I've been working on implementing multilingual translation with pre-trained models.
__es__ He estado trabajando en implementar traducción multilingüe con modelos pre-entrenados.
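The plan above could look something like this in practice. The function names are illustrative, not the actual Argos Translate API.

```python
def apply_lang_token(text, lang_code):
    """Prefix an M2M-100 style language token to the source text."""
    return f"__{lang_code}__ {text}"

def strip_lang_token(text, lang_code):
    """Remove the language token from the target text if it is present."""
    token = f"__{lang_code}__"
    if text.startswith(token):
        return text[len(token):].lstrip()
    return text

source = apply_lang_token("I've been working on multilingual translation.", "en")
print(source)  # __en__ I've been working on multilingual translation.

target = strip_lang_token("__es__ He estado trabajando en traducción multilingüe.", "es")
print(target)  # He estado trabajando en traducción multilingüe.
```

Checking for the token before stripping it matters because some models may omit the language token from their output entirely.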
I’m also considering adding optional “pretext” and “text” text files to the .argosmodel packages that get concatenated before and after the source text.
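A rough sketch of how those optional files might be applied; the file names and behavior here are speculative, matching the idea described above rather than any implemented format.

```python
import tempfile
from pathlib import Path

def wrap_source(package_dir, source_text):
    """Concatenate optional pretext/text files around the source text."""
    pretext_file = Path(package_dir) / "pretext"
    text_file = Path(package_dir) / "text"
    pretext = pretext_file.read_text() if pretext_file.exists() else ""
    posttext = text_file.read_text() if text_file.exists() else ""
    return pretext + source_text + posttext

# Demo with a throwaway package directory containing only a pretext file.
with tempfile.TemporaryDirectory() as pkg:
    (Path(pkg) / "pretext").write_text("__en__ ")
    print(wrap_source(pkg, "Hello world"))  # __en__ Hello world
```

Since both files are optional, packages that don’t need them behave exactly as before.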
Another possibility is to include Python code in the packages that processes the source and target text in a custom way for individual packages. I’d really like to avoid this though, because if there is executable code in the .argosmodel packages then it’s much more difficult to share packages because of the security risk.