Multilingual translation with CTranslate2 and pre-trained FairSeq models

Ok, thank you. Yes, I think the most complicated part is integrating it properly into argostranslate so it can handle both m2m models like this and one-to-one language models.


I wasn’t able to get the seq2seq system working.
When Stanza is disabled, translations are not possible because no languages are detected, since there is no Stanza system.

For the moment I re-enabled Stanza, and I added the following at the beginning of the apply_packaged_translation() method in translate.py:

itranslation = get_translation_from_codes(pkg.from_code, pkg.to_code)
print(detect_sentence(input_text, itranslation))

But when I try to use detect_sentence() from sbd.py, it kills the process.
How is it supposed to be used?

There’s an "sbd" package that needs to be installed. You can download and install it from the dev index, or run the GUI and it will be installed automatically when you install another language with the same source language.

Ok, but will it only work for the listed languages [“en”, “de”, “es”, “fr”, “pt”, “ru”]?

Would it therefore be necessary either to train a new model that covers the 100 languages of the m2m-100 model, or to find another system?


EDIT: I tried with a language that is not in the listed languages and it runs, but the output doesn’t look right.
Is there anything that needs to be changed to fix this?

$ ARGOS_DEBUG=1 ARGOS_DEV_MODE=1 ARGOS_STANZA_AVAILABLE=0 argos-translate --from-lang en --to-lang it "Hello World"

('get_installed_languages',)
('paragraphs:', ['Hello World'])
('apply_packaged_translation', 'Hello World')
('sentence_guess:', 'Hello World')
('paragraphs:', ['<detect-sentence-boundaries>Hello World'])
('apply_packaged_translation', '<detect-sentence-boundaries>Hello World')
('sentences', ['<detect-sentence-boundaries>Hello World'])
('tokenized', [['▁<', 'detect', '-', 'sentence', '-', 'boundaries', '>', 'H', 'ello', '▁World']])
('translated_batches', [TranslationResult(hypotheses=[['▁Ho', 'ello', '▁World', '▁World', '▁<', 'sentence', '-', 'boundary', '>']], scores=[-4.561014652252197], attention=[])])
('value_hypotheses:', [('Hoello World World <sentence-boundary>', -4.561014652252197)])
('translated_paragraphs:', [[('Hoello World World <sentence-boundary>', -4.561014652252197)]])
('hypotheses_to_return:', [('Hoello World World <sentence-boundary>', -4.561014652252197)])
('sbd_translated_guess:', 'Hoello World World ')
('start_index', 0)
('sbd_index', 10)
('Hello Worl',)
('sentences', ['Hello Worl'])
('tokenized', [['▁Hello', '▁Wor', 'l']])
('translated_batches', [TranslationResult(hypotheses=[['▁Ciao', '▁Wor', 'l']], scores=[-3.1133172512054443], attention=[])])
('value_hypotheses:', [('Ciao Worl', -3.1133172512054443)])
('translated_paragraphs:', [[('Ciao Worl', -3.1133172512054443)]])
('hypotheses_to_return:', [('Ciao Worl', -3.1133172512054443)])
Ciao Worl

The sbd model was only trained on those languages, so it can probably only reliably split input text into sentences for those languages. To support more M2M-100 languages you would probably need to train a new sbd model.


I finally managed to use the m2m-100 model with argos-translate; I used this system instead of Stanza and the seq2seq model:

I modified the Argos source code directly, in a brute-force way, but it works.

For a clean integration it will indeed be necessary to rework the Package classes and many others, I think :sweat_smile:


Thanks for doing these experiments.

My plan is to release Argos Translate 2.0 at some point with breaking changes and more focus on multilingual models. I made a “v2” branch to start tracking the changes planned for version 2.


I’ve been working on implementing multilingual translation with pre-trained models and the primary issue is how to deal with the idiosyncrasies of the different models we may want to run.

For example, if you look at some of the models currently supported by CTranslate2, they require different types of tokens to be added to the source text before translation and removed from the target text after translation. I’d like to keep Argos Translate as generic as possible and be able to support new models as they come out by updating the package index, without having to update the code.

The model [FairSeq WMT19] can then be used to sample or score sequences of tokens. All inputs should start with the special token </s>

For translation [with M2M-100], the language tokens should prefix the source and target sequences. Language tokens have the format __X__ where X is the language code. See the end of the fixed dictionary file for the list of accepted languages.

Similar to M2M-100, the language tokens [for MBART-50] should prefix the source and target sequences.

I’m watching as new models are released to see if a consistent convention emerges. My current plan is to expect the convention used by M2M-100 of prefixing __LANGUAGECODE__ to the source and target text. This would mean adding the language codes to the source text, and removing them from the target text if they’re present. For example:

__en__ I've been working on implementing multilingual translation with pre-trained models.
__es__ He estado trabajando en implementar traducción multilingüe con modelos pre-entrenados.
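As a rough sketch of that convention (the helper functions below are made up for illustration and are not an existing Argos Translate API):

def add_language_token(text, lang_code):
    # Prefix the text with an M2M-100 style language token, e.g. "__en__".
    return f"__{lang_code}__ {text}"

def strip_language_token(text, lang_code):
    # Remove the language token from the model output if it is present.
    token = f"__{lang_code}__"
    if text.startswith(token):
        return text[len(token):].lstrip()
    return text

# add_language_token("I've been working on ...", "en")
# -> "__en__ I've been working on ..."
# strip_language_token("__es__ He estado trabajando en ...", "es")
# -> "He estado trabajando en ..."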

I’m also considering adding optional “pretext” and “text” text files to the .argosmodel packages that get concatenated before and after the source text.
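If that route were taken, applying the files could be as simple as this sketch (the file names just follow the idea above; this is not an implemented format):

from pathlib import Path

def wrap_source_text(package_dir, source_text):
    # Concatenate the optional "pretext" and "text" files shipped in the
    # package, if any, before and after the source text.
    pretext_path = Path(package_dir) / "pretext"
    posttext_path = Path(package_dir) / "text"
    pretext = pretext_path.read_text() if pretext_path.exists() else ""
    posttext = posttext_path.read_text() if posttext_path.exists() else ""
    return pretext + source_text + posttext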

Another possibility is to include Python code in the packages that processes the source and target text in a custom way for individual packages. I’d really like to avoid this though, because if there is executable code in the .argosmodel packages then it’s much more difficult to share packages because of the security risk.

Yes, it is indeed complex to manage all these differences between the multilingual models in addition to the single-language-pair models. One idea might be to add a “type” field in package.json, plus Python files directly in Argos that handle a pretranslate and aftertranslate step for each type, a bit like I did for argostranslatefiles.


This may be what I do. I could release a version of Argos Translate that works well for the M2M-100 model as described above. Then, as new models come out, I could add custom logic to the Argos Translate codebase based on the package’s argostranslate.package.IPackage.code value. This would mean the .argosmodel packages would have decent forward compatibility and continue to work with subsequent versions of Argos Translate. However, if you used a newer package with an older version of the Python code you might get broken results.
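A rough sketch of what that dispatch could look like (only argostranslate.package.IPackage.code is real package metadata; the "m2m100" value and the function itself are hypothetical):

def preprocess_source(pkg, source_text, from_code):
    # Dispatch on the package code to handle model-specific conventions.
    if pkg.code == "m2m100":
        # M2M-100 convention described earlier: prefix the language token.
        return f"__{from_code}__ {source_text}"
    # Default: traditional one-to-one packages need no extra tokens.
    return source_text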


I’m tentatively focusing on support for FairSeq M2M-100; however, CTranslate2 supports a large and increasing number of models, so there are plenty of options.

@pierotofy and @dingedi let me know if there’s a different model you’d like to run in LibreTranslate, for instance one of the Hugging Face Transformers models. If I start publishing multiple redundant language models on the package index we’ll need to change the current LibreTranslate logic of downloading and installing all of the available models.

Different models have different tradeoffs, for example, M2M-100 translations are meaningfully higher quality but will be around 4x slower than the current OpenNMT-py models. M2M-100 is also ~460MB so it should be much faster to download than downloading all of the current language models (~7GB) but slower than downloading a single language pair (~250MB). GPT-2 is probably even more powerful than M2M-100 and even slower. We may also want some way to expose this functionality to the user in LibreTranslate.

For example:

 --translation-provider argos-m2m-100
 --language-model GPT-3
 --translator argos


I’m in favor of offering the ability to choose between different models; I’m not sure I’m knowledgeable enough to say which ones would work best, but I’d be happy to help implement changes to LT as required.


Yes, it is indeed a very big change to the Argos architecture, and so is the ability to use models from large companies like Meta.
I think it would be interesting to add a progress bar. When I download models via Chrome, the size, the remaining time, etc. are shown,
so it should be doable to implement something similar in argospm.


Multilingual translation with M2M-100 now works in the dev branch!

pip install git+https://github.com/argosopentech/argos-translate.git@v2
argospm install translate
argos-translate --from-lang en --to-lang es "I am translating from English to Spanish with the M2M-100 language model."
Estoy traducindo del inglés al español con el modelo de lenguaje M2M-100.

Great!
How will Argos handle the case where several models are installed? For example:
the m2m model is installed
the argos model translate-en_es is installed

Which one will take priority? The one loaded first?


It’s not strictly determined currently; it would just be whichever model is loaded first.

I’ve thought about using some kind of “specificity” heuristic where Argos Translate would load the model that translates to the fewest other languages. Nothing is implemented right now though.
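As a sketch, the heuristic could be as simple as something like this (target_languages is an assumed attribute name, not necessarily a real field):

def pick_most_specific(candidate_packages):
    # Prefer the installed package that translates to the fewest languages,
    # e.g. a dedicated en->es package (1 target) over M2M-100 (100 targets).
    return min(candidate_packages, key=lambda pkg: len(pkg.target_languages))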


I couldn’t get these lines to work. When I run argospm update && argospm search, the m2m package doesn’t seem to exist in the index.


Try deleting the existing packages and index:

rm -rf ~/.local/share/argos-translate/
rm -rf ~/.local/cache/argos-translate/

You may also need to run:

argospm update

You could store some metadata for the model in properties such as “prefix”, “suffix”, and more specific ones for the src/tgt sides.

A lot of the annoyance is probably with models such as M2M, where the encoder doesn’t know what language to translate to and the decoder is given a token and decodes from there. At least with these fields you’d be able to add them to a sequence of tokens.

Ultimately these properties couldn’t just be a single token, so they’d either need to be an array of tokens or a string that gets tokenized.

Eg
src_prefix: “__opt_src_SRCLANG __opt_tgt__TGTLANG”

This would allow a standard process for adding these tokens to each sequence of tokens going into a model, by replacing the placeholders with language names; if the properties don’t exist, just do nothing.
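A sketch of that step, assuming a src_prefix property with SRCLANG/TGTLANG placeholders like the example above (the property name and the SentencePiece tokenizer are assumptions, not an existing format):

def build_src_prefix_tokens(metadata, src_lang, tgt_lang, sp_processor):
    src_prefix = metadata.get("src_prefix")
    if src_prefix is None:
        # Property absent: add nothing.
        return []
    if isinstance(src_prefix, list):
        # Already an array of tokens: just substitute the placeholders.
        return [t.replace("SRCLANG", src_lang).replace("TGTLANG", tgt_lang)
                for t in src_prefix]
    # A plain string: substitute the placeholders, then tokenize it.
    src_prefix = src_prefix.replace("SRCLANG", src_lang).replace("TGTLANG", tgt_lang)
    return sp_processor.encode(src_prefix, out_type=str)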

For some models I’ve made, tokenizing these prefix properties wouldn’t work (as opposed to using an array), because I artificially inserted the target and source tokens to decrease sequence length [to avoid having an unnecessary space token between the two when tokenized].

This could be extended to cover not just src/tgt tokens but also formality or locale tokens.


Metadata with the model may be the best strategy for adding different prefixes to the source and target text.

I also posted about this on the OpenNMT Forum:
