Multilingual translation with CTranslate2 and pre-trained FairSeq models

Yes, this is indeed a very big change to the architecture of Argos, and the ability to use models coming from large companies like Meta is significant too.
I think it would be interesting to add a progress bar. When I download models via Chrome, the size, the remaining time, etc. are displayed, so it should be doable to implement something similar in argospm.
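
For illustration, here's a minimal sketch of a streamed download with a progress bar in Python, using requests and tqdm; the URL and filename are placeholders, and this isn't argospm's actual download code:

import requests
from tqdm import tqdm

def download_with_progress(url, path):
    # Stream the file to disk while showing size, speed, and remaining time
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()
    total = int(response.headers.get("Content-Length", 0))
    with open(path, "wb") as f, tqdm(total=total, unit="B", unit_scale=True, desc=path) as bar:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
            bar.update(len(chunk))

# Placeholder URL, for illustration only
download_with_progress("https://example.com/translate-en_es.argosmodel", "translate-en_es.argosmodel")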


Multilingual translation with M2M-100 now works in the dev branch!

pip install git+https://github.com/argosopentech/argos-translate.git@v2
argospm install translate
argos-translate --from-lang en --to-lang es "I am translating from English to Spanish with the M2M-100 language model."
Estoy traducindo del inglés al español con el modelo de lenguaje M2M-100.
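
The same translation should also work from the Python API; a quick sketch, assuming the same v2 install as above:

import argostranslate.translate

installed_languages = argostranslate.translate.get_installed_languages()
from_lang = next(lang for lang in installed_languages if lang.code == "en")
to_lang = next(lang for lang in installed_languages if lang.code == "es")
# Language.get_translation returns a translation object between the two languages
translation = from_lang.get_translation(to_lang)
print(translation.translate("I am translating from English to Spanish with the M2M-100 language model."))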

Great!
How will Argos handle the case where several models are installed? For example:

the M2M model is installed
the Argos model translate-en_es is installed

Which will take priority? The one loaded first?


It’s not strictly determined currently; it would just be whichever is loaded first.

I’ve thought about using some kind of “specificity” heuristic where Argos Translate would load the model that translates to the fewest other languages. Nothing is implemented right now though.
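
As a rough sketch of what that specificity heuristic could look like (the InstalledModel shape here is hypothetical, not Argos Translate's actual package representation):

from dataclasses import dataclass

@dataclass
class InstalledModel:  # hypothetical shape, for illustration only
    name: str
    from_codes: list
    to_codes: list

def pick_most_specific(models, from_code, to_code):
    # Prefer the model that translates to the fewest other languages
    candidates = [m for m in models if from_code in m.from_codes and to_code in m.to_codes]
    return min(candidates, key=lambda m: len(m.to_codes), default=None)

models = [
    InstalledModel("m2m_100", ["en", "es", "fr"], ["en", "es", "fr"]),
    InstalledModel("translate-en_es", ["en"], ["es"]),
]
print(pick_most_specific(models, "en", "es").name)  # translate-en_es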


I couldn’t get these lines to work. When I run argospm update && argospm search, the m2m package doesn’t seem to exist in the index.


Try deleting the existing packages and index:

rm -rf ~/.local/share/argos-translate/
rm -rf ~/.local/cache/argos-translate/

You may also need to run:

argospm update

You could add some metadata for the model in properties such as “prefix,” “suffix,” and more specific ones for the src/tgt sides.

A lot of the annoyance is probably with models such as M2M, where the encoder doesn’t know what language to translate to, and the decoder is given a token and decodes from there. At least with these fields you’d be able to add them to a sequence of tokens.

Ultimately these properties couldn’t just be one token, so they’d either need to be an array of tokens or a string that is tokenized.

E.g.:

src_prefix: “__opt_src_SRCLANG __opt_tgt__TGTLANG”

This would allow a standard process for adding these prefixes to each sequence of tokens going into a model by replacing the placeholders with language names; if the properties don’t exist, then just do nothing.
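
A minimal sketch of that placeholder substitution, with a stand-in metadata dict for whatever format the model metadata would actually use:

def apply_src_prefix(tokens, metadata, src_lang, tgt_lang):
    # If the model carries no src_prefix property, leave the sequence unchanged
    prefix = metadata.get("src_prefix")
    if prefix is None:
        return tokens
    prefix = prefix.replace("SRCLANG", src_lang).replace("TGTLANG", tgt_lang)
    # The prefix may span several tokens, so split it rather than treat it as one
    return prefix.split() + tokens

metadata = {"src_prefix": "__opt_src_SRCLANG __opt_tgt__TGTLANG"}
print(apply_src_prefix(["▁Hello", "▁world"], metadata, "en", "es"))
# ['__opt_src_en', '__opt_tgt__es', '▁Hello', '▁world']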

For some models I’ve made, tokenizing these prefix properties wouldn’t work, as opposed to using an array, because I artificially inserted the target and source tokens to decrease sequence length (to avoid an unnecessary space token between the two when tokenized).

This could be extended beyond just src/tgt tokens to formality or locale tokens.


Metadata with the model may be the best strategy for adding different prefixes to the source and target text.

I also posted about this on the OpenNMT Forum.
