OPUS-MT Language Models Port Thread

Using Locomotive, I’ve ported and tested a number of languages that exist in OPUS-MT but either aren’t available in argos-translate or have argos-translate models that need improvement.

Here’s the complete list with bidirectional models to/from English:

New:

Norwegian (no)
Chinese (traditional) (zt)
Albanian (sq)
Romanian (ro)
Serbian (sr)
Bengali (bn)
Slovenian (sl)
Vietnamese (vi)
Bulgarian (bg)
Estonian (et)
Latvian (lv)
Thai (th)
Lithuanian (lt)
Malay (ms)
Tagalog (tl)

Updated/Improved:

Chinese (zh)
Greek (el)
Polish (pl)
English => Catalan (ca) (single direction, we already have the Catalan => English direction)

:boom:

Link to argosmodels: OPUS-models – Google Drive

Note that BPE-encoded models will have some spacing issues until https://github.com/argosopentech/argos-translate/pull/373 is merged.

I also didn’t set the proper version numbers for updated models like Greek. If included in the index, the version number should probably be updated.

I’ve loaded the models on libretranslate.com.

This is now merged and available in Argos Translate 1.9.1

Updated/Improved

French (fr)
Czech (cs)

I’ve been working on uploading these OPUS-MT models to argospm. Here are some notes I have from testing them:

translate-en_bn-1_9.argosmodel

  • Works well

translate-en_zt-1_9.argosmodel

  • Works well

translate-en_bg-1_9.argosmodel

  • Works well

translate-en_lt-1_9.argosmodel

  • Works well

translate-en_ro-1_9.argosmodel

  • Works very well

translate-sl_en-1_9.argosmodel

  • Stanza broken. Argos Translate crashes when I try to run this model
2023-10-22 09:08:53 ERROR: Cannot load model from /home/pj/.local/share/argos-translate/packages/translate-sl_en-1_9/stanza/sl/tokenize/ssj.pt
Traceback (most recent call last):
FileNotFoundError: [Errno 2] No such file or directory: '/home/pj/.local/share/argos-translate/packages/translate-sl_en-1_9/stanza/sl/tokenize/ssj.pt'

translate-ms_en-1_9.argosmodel

  • Works well

translate-en_sq-1_9.argosmodel

  • Works well

translate-tl_en-1_9.argosmodel

  • Works

translate-en_sl-1_9.argosmodel

  • Partially broken
  • Frequently generates this specific sentence instead of translating the source English text: “= = Odkritje = = Asteroid je odkril avstrijski astronom Johann Palisa (1877 - 1962) 17. novembra 1892 v Karlu.”

According to Google Translate this translates to: “= = Discovery = = The asteroid was discovered by the Austrian astronomer Johann Palisa (1877 - 1962) on November 17, 1892 in Karl.”

For example, all three of these different English source sentences are translated to the same Slovenian sentence about the asteroid (approximately 5% of sentences have this issue in my tests):

He began running during the first COVID-19 lockdown in March 2020, having been motivated by an American athlete who had run every street in San Francisco in 30 days.

Daniel Roy Gilchrist Noboa Azín (born 30 November 1987) is an Ecuadorian business administrator, politician and businessman in the banana industry, who is the president-elect of Ecuador after winning the 2023 general election.

In March 2023, he was in favor of the muerte cruzada, in the face of the rejection and filing of the Investment Law, presented by the government of Guillermo Lasso.[19] On 17 May 2023,

translate-en_no-1_9.argosmodel

  • Partially broken
    Most translations are correct but ~15% of them have gibberish, for example:
English Source

Daniel Roy-Gilchrist Noboa Azín was born on 30 November 1987 in the city of Guayaquil. He is the son of businessman Álvaro Noboa and physician Anabella Azín.

translate-en_no-1_9

Daniel Roychrist christ Nowas was was ble født 30. november 87 87 i byen il il il. Han er sønn av forretningsmann. o o Noo og lege bella Bella Azín.

Back translation with Google Translate

Daniel Roychrist christ Nowas was was born on 30 November 87 87 in the city of il il il. He is the son of a businessman. o o Noo and doctor bella Bella Azín.

translate-en_et-1_9.argosmodel

  • Works well

translate-sq_en-1_9.argosmodel

  • Works

translate-en_el-1_9.argosmodel

  • Works well

translate-ro_en-1_9.argosmodel

  • Stanza broken
2023-10-22 09:25:52 ERROR: Cannot load model from /home/pj/.local/share/argos-translate/packages/translate-ro_en-1_9/stanza/ro/tokenize/rrt.pt

translate-en_vi-1_9.argosmodel

  • Works well

translate-sr_en-1_9.argosmodel

  • Stanza broken

translate-en_pl-1_9.argosmodel

  • Works very well

translate-bg_en-1_9.argosmodel

  • Works

translate-el_en-1_9.argosmodel

  • Works well

translate-en_sr-1_9.argosmodel

  • Works well

translate-et_en-1_9.argosmodel

  • Works well

translate-en_ca-1_9.argosmodel

  • Works very well

translate-en_zh-1_9.argosmodel

  • Works very well, much better than the current Chinese model

translate-bn_en-1_9.argosmodel

  • Works

translate-lv_en-1_9.argosmodel

  • Works well

translate-en_ms-1_9.argosmodel

  • Works very well

translate-pl_en-1_9.argosmodel

  • Works very well

translate-en_lv-1_9.argosmodel

  • Works well

translate-lt_en-1_9.argosmodel

  • Works well

translate-th_en-1_9.argosmodel

  • Works

translate-zt_en-1_9.argosmodel

  • Works very well

translate-zh_en-1_9.argosmodel

  • Works very well, a major improvement over the current Chinese model

translate-en_th-1_9.argosmodel

  • Works

translate-vi_en-1_9.argosmodel

translate-no_en-1_9.argosmodel

  • Works

translate-en_tl-1_9.argosmodel

  • Works very well
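
The file names in the notes above follow a visible pattern, translate-<from>_<to>-<major>_<minor>.argosmodel. Here is a minimal sketch for splitting them into language codes and a version, assuming every package follows that pattern. The helper name parse_model_filename is made up for illustration and is not part of Argos Translate.

```python
import re
from typing import NamedTuple


class ModelName(NamedTuple):
    from_code: str
    to_code: str
    version: str


# Pattern inferred from names in this thread, e.g. translate-en_bn-1_9.argosmodel.
_MODEL_RE = re.compile(r"^translate-([a-z]+)_([a-z]+)-(\d+)_(\d+)\.argosmodel$")


def parse_model_filename(filename: str) -> ModelName:
    """Split a package file name into source code, target code, and version."""
    match = _MODEL_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected model file name: {filename!r}")
    from_code, to_code, major, minor = match.groups()
    return ModelName(from_code, to_code, f"{major}.{minor}")


if __name__ == "__main__":
    print(parse_model_filename("translate-en_bn-1_9.argosmodel"))
```

Something like this could help tally the test notes per language pair, or spot packages whose version number still needs bumping.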

I’ve just fixed the ro and sl models (both directions) and uploaded the fixed models to Drive. I made a mistake when packaging the Stanza model.
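
For anyone who wants to check a package before loading it, here is a rough sketch that looks for a tokenizer file where the error logs above expected one (stanza/<lang>/tokenize/*.pt inside the package directory). The layout is inferred only from those log paths, and the function name is hypothetical.

```python
from pathlib import Path


def missing_stanza_tokenizer(package_dir: Path, lang: str) -> bool:
    """Return True when no tokenizer .pt file exists under stanza/<lang>/tokenize/."""
    tokenize_dir = Path(package_dir) / "stanza" / lang / "tokenize"
    if not tokenize_dir.is_dir():
        # The whole directory is absent, so the tokenizer cannot be there.
        return True
    return not any(tokenize_dir.glob("*.pt"))
```

Calling it on an installed package directory such as translate-sl_en-1_9 with lang "sl" would flag the kind of packaging mistake described here before Argos Translate crashes on it.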

Since Norwegian has two main written variants (nb and nn), I’ve uploaded a Norwegian Bokmål (nb) model, which should do better than the “no” model.

(input) Daniel Roy-Gilchrist Noboa Azín was born on 30 November 1987 in the city of Guayaquil. He is the son of businessman Álvaro Noboa and physician Anabella Azín.

(output) Daniel Roy-Gilchrist Noboa Azín ble født 30. november 1987 i byen Guayaquil. Han er sønn av forretningsmannen Álvaro Noboa og lege Anabella Azín.

(backtranslation) Daniel Roy-Gilchrist Noboa Azín was born on 30 November 1987 in the city of Guayaquil. He is the son of businessman Álvaro Noboa and physician Anabella Azín.

These models are published on argospm-index now:

Updated/Improved

Ukrainian (uk)

(en)> When people don’t see moose as potentially dangerous, they may approach too closely and put themselves at risk.
(gt)> Якщо люди не ставляться до лосів як до потенційної небезпеки, вони можуть занадто наблизитися і піддати себе ризику.
(uk)> Коли люди не вважають лося бути потенційно небезпечними, то можуть занадто близько підійти до них і піддати себе небезпеці.

(uk)> Якщо люди не ставляться до лосів як до потенційної небезпеки, вони можуть занадто наблизитися і піддати себе ризику.
(gt)> When people don’t see moose as potentially dangerous, they may approach too closely and put themselves at risk.
(en)> If humans do not treat the elk as a potential danger, they may come too close and put themselves at risk.

I tried porting the Opus-MT en->es model to Argos Translate and it doesn’t seem to work.

python opus_mt_convert.py -s en -t es

For almost any source text the Opus-MT model generates repetitive gibberish similar to the (in)famous salad bug.

$ argos-translate -f en -t es "Language models can be trained by providing lots of example translations from a source language to a target language."
ClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClev@@
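
Output like this is easy to catch automatically when testing many models. Below is a small heuristic sketch that flags text in which a short unit repeats back to back. The function name and the thresholds (units of 2 to 20 characters, 5 repeats) are arbitrary choices for illustration and are not something Argos Translate provides.

```python
import re


def looks_degenerate(text: str, min_repeats: int = 5) -> bool:
    """Return True if a 2-20 character unit repeats back to back at least min_repeats times."""
    # \1 refers to the captured unit, so unit + (min_repeats - 1) more copies must follow each other.
    pattern = re.compile(r"(.{2,20}?)\1{%d,}" % (min_repeats - 1), re.DOTALL)
    return pattern.search(text) is not None
```

It only catches long runs of one repeated unit, so shorter stutters inside an otherwise reasonable translation would still need a manual look.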

I’m going to experiment with some more Opus-MT models in different languages. I think our current Spanish model is pretty good, so if the issue is limited to this specific model we can keep our current model, which was trained with OpenNMT-tf.

This is the same issue described in pull request #369 on argosopentech/argos-translate (“BPE encoding, prefix/suffix support, NLLB support” by pierotofy).

I’ve narrowed down the problem to the compute_type parameter for runtime quantization; for some reason, some of these models fail when compute_type is set to default (the current value). Forcing the model to use “float32” fixes the problem, but I’m unsure of the root cause.
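
For reference, this is roughly what forcing the compute type looks like when loading a converted model directly with CTranslate2. It is a sketch, not the code path Argos Translate uses: the wrapper name load_translator is made up, and model_dir is assumed to point at a CTranslate2 model directory.

```python
def load_translator(model_dir: str, compute_type: str = "float32"):
    """Load a CTranslate2 model with an explicit compute type instead of "default"."""
    # Imported lazily so this file can be read without CTranslate2 installed.
    import ctranslate2

    return ctranslate2.Translator(model_dir, device="cpu", compute_type=compute_type)
```

Passing compute_type="default" instead would reproduce the current behaviour, which makes it easy to compare the two settings on the same input.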

I went through the Firefox translation performance testing data (BLEU and COMET scores) to find models where Opus-MT has better performance than Argos Translate. These Opus-MT models could be good candidates to port to Argos Translate to replace the existing models.

Opus-MT model has better performance than the Argos Translate model

cs-en, tr-en, sk-en, en-fi, hu-en, en-hu, fi-en, en-da, en-ar, ko-en, uk-en, en-es, en-uk, en-cs, id-en, da-en, en-id, en-sk, en-de, de-en

Argos Translate model has better performance than the Opus-MT model

en-ru, en-bg (Argos Translate is using the Opus-MT Bulgarian model so this is strange), ca-en, en-hi, hi-en, pl-en

Similar Performance

ar-en, en-it, fr-en, ru-en, en-sv, sy-en, it-en, en-fr, es-en, ja-en

Argos is already using the Opus-MT model

en-et, en-zh, en-el, en-nl, et-en, en-ca, nl-en, bg-en, zh-en
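
The grouping above can be sketched in code as follows. This is only an illustration of the comparison: the tolerance that counts as “similar” is an assumed value, the function names are made up, and the scores themselves have to come from the testing data.

```python
from typing import Dict, List, Tuple


def compare_scores(opus_mt: float, argos: float, tolerance: float = 0.5) -> str:
    """Label a language pair by which model scores higher on one metric."""
    if abs(opus_mt - argos) <= tolerance:
        return "similar"
    return "opus-mt" if opus_mt > argos else "argos"


def group_pairs(
    scores: Dict[str, Tuple[float, float]], tolerance: float = 0.5
) -> Dict[str, List[str]]:
    """Group pairs such as "cs-en" by the label from compare_scores.

    `scores` maps a pair to (opus_mt_score, argos_score) for a single metric.
    """
    groups: Dict[str, List[str]] = {"opus-mt": [], "argos": [], "similar": []}
    for pair, (opus_mt, argos) in sorted(scores.items()):
        groups[compare_scores(opus_mt, argos, tolerance)].append(pair)
    return groups
```

A sensible tolerance depends on the metric, since BLEU and COMET are on different scales, so each metric would need its own run.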

I’m working on converting these Opus-MT models to Argos Translate now. The LibreTranslate Locomotive script for model conversion seems to be working great for me.

I successfully converted these models but still need to test them:

en-fi, fi-en, en-cs, cs-en, en-sk, sk-en, en-hu, hu-en, en-da, da-en, en-es, es-en, en-id, id-en, en-de, de-en

These languages failed with a “Cannot find opus model URL.” error:

Turkish (tr), Arabic (ar), Korean (ko)

Sometimes models are published using a different URL pattern, or there’s no reference to the model archive in the README. A while ago I added a --model-url argument to opus_mt_convert.py (in the LibreTranslate/Locomotive repository on GitHub) that can be used to specify a different path.

It looks like there are only ko-en and ar-en models but not en-ko, en-ar, en-tr, or tr-en Opus models available here:

Using model-url like this works:

python opus_mt_convert.py -s ko -t en --model-url https://object.pouta.csc.fi/OPUS-MT-models/ko-en/opus-2019-12-05.zip
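
When converting many pairs, a small helper that assembles this command can save typing. This is a sketch that uses only the flags mentioned in this thread (-s, -t, --model-url, -q); the helper name is made up, and it builds the argument list without running anything.

```python
from typing import List, Optional


def build_convert_command(
    source: str,
    target: str,
    model_url: Optional[str] = None,
    quantization: Optional[str] = None,
) -> List[str]:
    """Build the argument list for Locomotive's opus_mt_convert.py."""
    cmd = ["python", "opus_mt_convert.py", "-s", source, "-t", target]
    if model_url:
        # Needed when the script cannot find the archive on its own.
        cmd += ["--model-url", model_url]
    if quantization:
        # For example "float32" for models that break with the default.
        cmd += ["-q", quantization]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from inside a Locomotive checkout.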

I tested these models. They mostly look good but a lot of them don’t work with int8 quantization.

en-de

Works

de-en

Works with the -q float32 flag. This will make the model larger, but that’s fine for German since it’s one of our biggest language pairs.

en-cs

Works

cs-en

Works

en-da

Works

da-en

Works

en-es (Excluded)

I get repeated text (“ClevClevClevClevClev…”) even with the -q float32 flag.

es-en

Works with float32 quantization

en-fi

Works

fi-en

Works

en-hu

Works

hu-en

Works

en-id

Works

id-en

Works

en-sk

Works

sk-en

Works

en-uk

Works

uk-en

Works

ko-en

Fails with “stave stave stave stave stave stave” but works with float32

I’m still puzzled by this quantization issue, although good to hear most models seem to work!

I don’t think it’s that surprising that models that were trained with 32-bit floats don’t always work with 8-bit integers.

Using the original quantization makes the models 4x larger compressed but that’s fine as long as we only do it for a few languages.
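
To illustrate the general idea, here is a toy sketch of symmetric int8 quantization in plain Python. It is not a diagnosis of the failures in this thread and not necessarily the scheme CTranslate2 uses; the weights are made up. It shows one way precision can be lost (a single large weight stretches the scale so that small weights round to zero) and where the roughly 4x storage difference per weight comes from.

```python
import struct
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in quantized]


# Storage per weight: 4 bytes for float32 versus 1 byte for int8.
SIZE_RATIO = struct.calcsize("<f") // struct.calcsize("<b")

if __name__ == "__main__":
    # Made-up weights: one large outlier next to several small values.
    weights = [0.001, -0.002, 0.003, 8.0]
    quantized, scale = quantize_int8(weights)
    print(quantized)  # the three small weights all collapse to 0
    print(dequantize(quantized, scale))
    print(SIZE_RATIO)
```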
New Opus-MT models and their size compressed

The new Opus-MT models are live! Please LMK if anyone notices issues with them.

I’ll find some time to test these today. :partying_face:
