Multilingual translation with CTranslate2 and pre-trained FairSeq models


This script demonstrates using a pre-trained FairSeq multilingual model with CTranslate2.

Multilingual translation works by prepending a token representing the target language to the source text. For example:

__de__ This English sentence will be translated to German.
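
A rough sketch of what this looks like with CTranslate2 and SentencePiece (the model path, tokenizer file, and __de__ token here are placeholder assumptions; the exact token convention depends on the model):

import ctranslate2
import sentencepiece as spm

# Assumes a multilingual model already converted to CTranslate2 format and its
# SentencePiece model; both paths are placeholders.
translator = ctranslate2.Translator("multilingual_model_ct2/")
sp = spm.SentencePieceProcessor(model_file="multilingual_model_ct2/sentencepiece.model")

source_text = "This English sentence will be translated to German."
tokens = sp.encode(source_text, out_type=str)

# Prepend the token for the target language to the source tokens.
tokens = ["__de__"] + tokens

results = translator.translate_batch([tokens])
# Some models also emit a language token in the output, which would need to be
# stripped before detokenizing.
print(sp.decode_pieces(results[0].hypotheses[0]))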

This works well for translating between languages other than English. For example, to translate from French to Spanish, Argos Translate currently translates from French to English and then "pivots" to translate from English to Spanish, which loses some information in the intermediate English representation. With the multilingual strategy the model instead understands multiple languages and can translate directly between all of them.

It's also possible to train Argos Translate models for language pairs that don't include English; for example, you could train a direct Spanish to Portuguese model. However, it's not possible to support direct translations between a large number of languages without multilingual translation. Translating directly between 100 languages would require roughly 100^2 = 10,000 models (one for each directed language pair).

Very interesting! Is the plan for Argos Translate to focus more on multilingual model training, or to continue training classic single-pair models in the short/medium term?


Short to medium term, I'm planning to continue the current strategy of training models for single language pairs, but multilingual translation is probably the best strategy long term.

Is it planned to add support for model types other than Argos models, to be able to use the M2M-100 model, for example?
Would it be easily doable?


Argos Translate loads and runs a CTranslate2 model, and CTranslate2 already supports models from a number of sources. Argos Translate should already mostly support using these, but I'm planning to improve that support in the future.

https://opennmt.net/CTranslate2/guides/transformers.html


Great! I will see if I can get the M2M-100 model to work with argos-translate. This will allow many more languages to be available for translation.


It's a bit hacky but you'll probably need to use the seq2seq sentence boundary detection system instead of Stanza to get this to work.

export ARGOS_STANZA_AVAILABLE=0  # disable the Stanza sentence boundary detection dependency
export ARGOS_DEV_MODE=1          # use the development package index
argos-translate-gui

The plan going forward is to have better support for models that support many languages (not just a single from_lang and to_lang) by adding this to the model package's metadata.json:

{
    "languages": [
        {
            "code": "en",
            "name": "English"
        },
        {
            "code": "es",
            "name": "Spanish"
        },
        {
            "code": "chunk",
            "name": "Chunk"
        }
    ]
}

I then want to make "chunk" a valid language that works similarly to the current seq2seq chunking system, splitting input text into sentences to be translated separately. Currently, if you're not using Stanza you have to install the sbd system as its own package (which is available on the dev index), but I want to also support having it combined with the main translation language model.
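
A minimal sketch of how a package loader could read this metadata (the from_code/from_name/to_code/to_name fallback fields are assumed here to mirror the existing single-pair packages, not taken from the actual code):

import json
from pathlib import Path

def load_package_languages(package_dir):
    """Return a list of {"code", "name"} dicts for the languages a package supports."""
    metadata_path = Path(package_dir) / "metadata.json"
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    if "languages" in metadata:
        # Multilingual package: one entry per supported language (plus "chunk").
        return metadata["languages"]
    # Legacy single-pair package (assumed field names).
    return [
        {"code": metadata["from_code"], "name": metadata["from_name"]},
        {"code": metadata["to_code"], "name": metadata["to_name"]},
    ]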


OK, thank you. Yes, I think the most complicated thing is to integrate it properly into argostranslate so it can manage both M2M models like this and one-to-one language models.


I failed to use the seq2seq system.
When Stanza is disabled, translations are not possible because no languages are detected without the Stanza system.

For the moment I reactivated Stanza and put this in translate.py at the beginning of the apply_packaged_translation() method:

itranslation = get_translation_from_codes(pkg.from_code, pkg.to_code)
print(detect_sentence(input_text, itranslation))

But trying to use detect_sentence() from sbd.py kills the process.
How should it be used?

There's an "sbd" package that needs to be installed. You can download and install it from the dev index, or run the GUI and it will be automatically installed when you install another language with the same source lang.

OK, but it will only work for the indicated languages ["en", "de", "es", "fr", "pt", "ru"]?

It would therefore be necessary either to train a new model to handle the 100 languages of the M2M-100 model, or to find another system?


EDIT: I tried with a language that is not in the listed languages and it runs, but it doesn't seem to work properly.
Is there anything that needs to be changed to fix this?

$ ARGOS_DEBUG=1 ARGOS_DEV_MODE=1 ARGOS_STANZA_AVAILABLE=0 argos-translate --from-lang en --to-lang it "Hello World"

('get_installed_languages',)
('paragraphs:', ['Hello World'])
('apply_packaged_translation', 'Hello World')
('sentence_guess:', 'Hello World')
('paragraphs:', ['<detect-sentence-boundaries>Hello World'])
('apply_packaged_translation', '<detect-sentence-boundaries>Hello World')
('sentences', ['<detect-sentence-boundaries>Hello World'])
('tokenized', [['▁<', 'detect', '-', 'sentence', '-', 'boundaries', '>', 'H', 'ello', '▁World']])
('translated_batches', [TranslationResult(hypotheses=[['▁Ho', 'ello', '▁World', '▁World', '▁<', 'sentence', '-', 'boundary', '>']], scores=[-4.561014652252197], attention=[])])
('value_hypotheses:', [('Hoello World World <sentence-boundary>', -4.561014652252197)])
('translated_paragraphs:', [[('Hoello World World <sentence-boundary>', -4.561014652252197)]])
('hypotheses_to_return:', [('Hoello World World <sentence-boundary>', -4.561014652252197)])
('sbd_translated_guess:', 'Hoello World World ')
('start_index', 0)
('sbd_index', 10)
('Hello Worl',)
('sentences', ['Hello Worl'])
('tokenized', [['▁Hello', '▁Wor', 'l']])
('translated_batches', [TranslationResult(hypotheses=[['▁Ciao', '▁Wor', 'l']], scores=[-3.1133172512054443], attention=[])])
('value_hypotheses:', [('Ciao Worl', -3.1133172512054443)])
('translated_paragraphs:', [[('Ciao Worl', -3.1133172512054443)]])
('hypotheses_to_return:', [('Ciao Worl', -3.1133172512054443)])
Ciao Worl

The sbd model was only trained on those languages, so it can probably only reliably split input text into sentences for those languages. To support more M2M-100 languages you would probably need to train a new sbd model.


I finally managed to use the M2M-100 model with argos-translate. I used this system instead of Stanza and the seq2seq model:

I directly modified the Argos source code in a crude way, but it works.

For a clean integration it will indeed be necessary to review the Package classes and many others, I think :sweat_smile:


Thanks for doing these experiments.

My plan is to release Argos Translate 2.0 at some point with breaking changes and more focus on multilingual models. I made a "v2" branch to start tracking the changes planned for version 2.


I've been working on implementing multilingual translation with pre-trained models, and the primary issue is how to deal with the idiosyncrasies of the different models we may want to run.

For example, if you look at some of the models currently supported by CTranslate2, they require different types of tokens to be added to the source text before translation and removed from the target text after translation. I'd like to implement Argos Translate in as generic a way as possible and be able to support new models as they come out by updating the package index, without having to update the code.

The model [FairSeq WMT19] can then be used to sample or score sequences of tokens. All inputs should start with the special token </s>

For translation [with M2M-100], the language tokens should prefix the source and target sequences. Language tokens have the format __X__ where X is the language code. See the end of the fixed dictionary file for the list of accepted languages.

Similar to M2M-100, the language tokens [for MBART-50] should prefix the source and target sequences.

I'm watching as new models are released to see if a consistent convention emerges. My current plan is to expect the convention used by M2M-100 of prepending __LANGUAGECODE__ to the source and target text. This would mean adding the language codes to the source text, and removing them from the target text if they're present. For example:

__en__ I've been working on implementing multilingual translation with pre-trained models.
__es__ He estado trabajando en implementar traducción multilingüe con modelos pre-entrenados.
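
A minimal sketch of that convention (the helper names are made up, not part of Argos Translate):

import re

# Matches a leading language token like "__en__" or "__zh__", plus any following space.
LANG_TOKEN_RE = re.compile(r"^__[a-zA-Z\-]+__\s*")

def add_lang_token(text, lang_code):
    """Prefix the M2M-100 style language token to the source text."""
    return f"__{lang_code}__ {text}"

def strip_lang_token(text):
    """Remove a leading __xx__ token from the translated text, if present."""
    return LANG_TOKEN_RE.sub("", text)

# add_lang_token("I've been working...", "en")      -> "__en__ I've been working..."
# strip_lang_token("__es__ He estado trabajando...") -> "He estado trabajando..."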

I'm also considering adding optional "pretext" and "text" text files to the .argosmodel packages that get concatenated before and after the source text.
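
For example, a rough sketch assuming hypothetical "pretext" and "text" files stored at the top level of the package directory:

from pathlib import Path

def wrap_source_text(package_dir, source):
    """Concatenate optional pretext/text files around the source text (file names assumed)."""
    pretext_path = Path(package_dir) / "pretext"
    posttext_path = Path(package_dir) / "text"
    pretext = pretext_path.read_text(encoding="utf-8") if pretext_path.exists() else ""
    posttext = posttext_path.read_text(encoding="utf-8") if posttext_path.exists() else ""
    return pretext + source + posttext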

Another possibility is to include Python code in the packages that processes the source and target text in a custom way for individual packages. I'd really like to avoid this though, because if there is executable code in the .argosmodel packages then it's much more difficult to share packages because of the security risk.

Yes, indeed it is complex to manage all these differences between the multilingual models in addition to the one-language-to-one-language models. An idea may be to add a "type" field in the package.json, and Python files directly in Argos which would manage a pretranslate and aftertranslate step for each type, a bit like I did for argostranslatefiles.


This may be what I do. I could release a version of Argos Translate that works well for the M2M-100 model as described above. Then as new models come out I could add custom logic to the Argos Translate codebase based on the package's argostranslate.package.IPackage.code value. This would mean the .argosmodel packages would have decent forward compatibility and continue to work with subsequent versions of Argos Translate. However, if you used a newer package with an older version of the Python code you might get broken results.
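
A rough sketch of what that per-package logic could look like (all names here are hypothetical, not the actual Argos Translate code), reusing the M2M-100 token convention from earlier in the thread:

from typing import Callable, Dict, Tuple

def m2m100_pre(text, from_code, to_code):
    # M2M-100 expects language tokens prefixed to the source and target sequences.
    return f"__{from_code}__ {text}"

def m2m100_post(text, to_code):
    # Strip the target language token if the model emitted it.
    token = f"__{to_code}__"
    return text[len(token):].lstrip() if text.startswith(token) else text

def default_pre(text, from_code, to_code):
    return text

def default_post(text, to_code):
    return text

# Keyed by a package "type" or code value stored in the package metadata.
PROCESSORS: Dict[str, Tuple[Callable, Callable]] = {
    "m2m100": (m2m100_pre, m2m100_post),
    # Other model families (mbart50, wmt19, ...) could be added here over time.
}

def get_processors(package_code):
    return PROCESSORS.get(package_code, (default_pre, default_post))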


I'm tentatively focusing on support for FairSeq M2M-100; however, CTranslate2 supports a large and increasing number of models, so there are plenty of options.

@pierotofy and @dingedi, let me know if there's a different model you'd like to run in LibreTranslate, for instance one of the Hugging Face Transformers models. If I start publishing multiple redundant language models on the package index, we'll need to change the current LibreTranslate logic of downloading and installing all of the available models.

Different models have different tradeoffs, for example, M2M-100 translations are meaningfully higher quality but will be around 4x slower than the current OpenNMT-py models. M2M-100 is also ~460MB so it should be much faster to download than downloading all of the current language models (~7GB) but slower than downloading a single language pair (~250MB). GPT-2 is probably even more powerful than M2M-100 and even slower. We may also want some way to expose this functionality to the user in LibreTranslate.

For example:

 --translation-provider argos-m2m-100
 --language-model GPT-3
 --translator argos


I'm in favor of offering choices for the ability to run different models; I'm not sure I'm knowledgeable enough to say which ones would work best, but I'd be happy to help implement changes to LT as required.
