Training a multilingual model with OpenNMT-py

I was able to get M2M-100 working with CTranslate2 and have been trying to train a similar multilingual model from scratch using OpenNMT-py.

What is the best format for the tokens that tell the model what the source and target languages are? For M2M-100 I prepended the source language token to the source text and then called ctranslate2.Translator.translate_batch with target_prefix=[[target_code_token]] * len(tokenized_sentences). Another option is to prepend the target code token to the source text, as in this tutorial.

M2M-100 format:

__en__William Caxton (c. 1422 – c. 1491) was an English merchant, diplomat and writer.
__fr__William Caxton (c. 1422 – c. 1491) était un marchand, diplomate et écrivain anglais.

Prepend the target code to source text:

__fr__William Caxton (c. 1422 – c. 1491) was an English merchant, diplomat and writer.
William Caxton (c. 1422 – c. 1491) était un marchand, diplomate et écrivain anglais.

Prepend the source and target code to source text:

__en__ __fr__William Caxton (c. 1422 – c. 1491) was an English merchant, diplomat and writer.
William Caxton (c. 1422 – c. 1491) était un marchand, diplomate et écrivain anglais.
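
To make the three options concrete, here is a rough formatting sketch in plain Python (illustrative only; the token spelling and whether a space follows the token are assumptions, and in a real pipeline the prefix could be added before or after subword tokenization):

src = "William Caxton (c. 1422 – c. 1491) was an English merchant, diplomat and writer."
tgt = "William Caxton (c. 1422 – c. 1491) était un marchand, diplomate et écrivain anglais."

def m2m100_style(src, tgt, src_lang, tgt_lang):
    # Language token on each side: source gets the source code, target gets the target code.
    return f"__{src_lang}__{src}", f"__{tgt_lang}__{tgt}"

def target_code_on_source(src, tgt, src_lang, tgt_lang):
    # Only the target code, prepended to the source; the target text is left untouched.
    return f"__{tgt_lang}__{src}", tgt

def both_codes_on_source(src, tgt, src_lang, tgt_lang):
    # Source and target codes both prepended to the source; the target text is left untouched.
    return f"__{src_lang}__ __{tgt_lang}__{src}", tgt

for fmt in (m2m100_style, target_code_on_source, both_codes_on_source):
    s, t = fmt(src, tgt, "en", "fr")
    print(s)
    print(t)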

Is there any sort of industry standard for this? I think I prefer prepending the source and target code tokens to the source text, but I also want to maximize compatibility with models trained by other people.

Additionally, how does the target_prefix parameter work in CTranslate2? My understanding is that it forces the decoder to begin the output with the provided prefix tokens and then decode the rest of the translation normally. The target_prefix parameter works with M2M-100 but doesn’t seem to work with my OpenNMT-py model.
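
For reference, here is roughly what I do for M2M-100 with CTranslate2 (a simplified sketch; the model and SentencePiece paths are placeholders). As far as I can tell, the forced prefix tokens come back as part of the hypothesis, so I strip them before detokenizing:

import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("m2m100_ct2/", device="cpu")
sp = spm.SentencePieceProcessor(model_file="m2m100_ct2/sentencepiece.model")

sentences = ["Cheese", "I'm flying to Miami next week."]
src_token, tgt_token = "__en__", "__fr__"

# Source side: source language token followed by the subword tokens.
source = [[src_token] + sp.encode(s, out_type=str) for s in sentences]

# Target side: force the decoder to start with the target language token.
target_prefix = [[tgt_token]] * len(source)

results = translator.translate_batch(source, target_prefix=target_prefix)

for result in results:
    tokens = result.hypotheses[0]
    print(sp.decode(tokens[1:]))  # drop the forced __fr__ token before detokenizing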

I created a multilingual dataset with 95,949,184 lines of data from OPUS and formatted it in the M2M-100 format. I then trained a model with OpenNMT-py the same way I would for an individual language pair. I ran the model with CTranslate2, prepending the source code token to the source text and passing the target language token via the target_prefix parameter. I get completely incorrect output, and target_prefix doesn’t seem to affect the translation at all.

$ argos-translate -f en -t de "Cheese"
es ies.
$ argos-translate -f en -t fr "Cheese"
es ies.
$ argos-translate -f en -t es "Cheese"
es ies.
$ argos-translate -f en -t es "I'm flying to Miami next week."
Miami i Miami.
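
For context, the OpenNMT-py checkpoint is converted to the CTranslate2 format with the standard converter before inference (the checkpoint and output names here are placeholders):

$ ct2-opennmt-py-converter --model_path multilingual_step_100000.pt --output_dir multilingual_ct2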

Will most pretrained models just need custom logic? I know this is often the case when running models from Hugging Face: language models from different companies often need a custom tokenizer or other custom logic.

This might be a good reason to prefer one convention over another.

I’m not very familiar with target prefixing in OpenNMT, but to enable the feature it seems you need to pass two parameters, tgt_file_prefix and tgt.
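
Something along these lines, I believe (untested; file names are placeholders), where the tgt file holds one prefix per source line, e.g. the target language token:

$ onmt_translate -model model.pt -src src-test.txt -tgt tgt-prefixes.txt -tgt_file_prefix -output pred.txt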

I think this is only for the onmt.translate.Translator class, which does inference. I don’t think I’m using that class, since I’m only using OpenNMT-py for training and CTranslate2 for inference.

Maybe I need to add tgt_file_prefix to the config.yml for OpenNMT-py when I train the model to get target_prefix to work in CTranslate2?

https://opennmt.net/OpenNMT-py/FAQ.html#add-custom-prefix-to-examples
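
If the linked FAQ is the right approach, maybe something like this per-corpus prefix transform in the training config would add the language tokens on the fly, as an alternative to baking them into the data files themselves (a sketch, assuming one corpus per language pair; the paths, transform list, and token spellings are illustrative):

data:
    opus_en_fr:
        path_src: data/opus.en-fr.en
        path_tgt: data/opus.en-fr.fr
        transforms: [sentencepiece, prefix]
        src_prefix: "__en__"
        tgt_prefix: "__fr__"
        weight: 1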