OPUS-MT Language Models Port Thread

Using Locomotive, I’ve ported and tested a number of languages that exist in OPUS-MT but either aren’t available in argos-translate or have argos-translate models that need improvement.

Here’s the complete list with bidirectional models to/from English:

New:

Norwegian (no)
Chinese (traditional) (zt)
Albanian (sq)
Romanian (ro)
Serbian (sr)
Bengali (bn)
Slovenian (sl)
Vietnamese (vi)
Bulgarian (bg)
Estonian (et)
Latvian (lv)
Thai (th)
Lithuanian (lt)
Malay (ms)
Tagalog (tl)

Updated/Improved:

Chinese (zh)
Greek (el)
Polish (pl)
English => Catalan (ca) (single direction, we already have the Catalan => English direction)

:boom:

Link to argosmodels: OPUS-models – Google Drive

Note that BPE-encoded models will have some spacing issues until https://github.com/argosopentech/argos-translate/pull/373 is merged.

I also didn’t set the proper version numbers for updated models like Greek. If included in the index, the version number should probably be updated.

I’ve loaded the models on libretranslate.com.

This is now merged and available in Argos Translate 1.9.1

Updated/Improved

French (fr)
Czech (cs)

I’ve been working on uploading these OPUS-MT models to argospm. Here are some notes I have from testing them:

translate-en_bn-1_9.argosmodel

  • Works well

translate-en_zt-1_9.argosmodel

  • Works well

translate-en_bg-1_9.argosmodel

  • Works well

translate-en_lt-1_9.argosmodel

  • Works well

translate-en_ro-1_9.argosmodel

  • Works very well

translate-sl_en-1_9.argosmodel

  • Stanza broken. Argos Translate crashes when I try to run this model
2023-10-22 09:08:53 ERROR: Cannot load model from /home/pj/.local/share/argos-translate/packages/translate-sl_en-1_9/stanza/sl/tokenize/ssj.pt
Traceback (most recent call last):
FileNotFoundError: [Errno 2] No such file or directory: '/home/pj/.local/share/argos-translate/packages/translate-sl_en-1_9/stanza/sl/tokenize/ssj.pt'
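
A quick way to catch this kind of packaging problem before running a model is to check whether the unpacked package actually contains a Stanza tokenizer file. The sketch below is only an illustration: it assumes the layout visible in the error above (`<package>/stanza/<lang>/tokenize/<name>.pt`), and the function names are hypothetical, not part of Argos Translate.

```python
from pathlib import Path


def stanza_tokenizers(package_dir):
    """List tokenizer files found under <package>/stanza/<lang>/tokenize/.

    Assumption: the directory layout matches the path shown in the error log.
    """
    root = Path(package_dir) / "stanza"
    if not root.is_dir():
        return []
    return sorted(root.glob("*/tokenize/*.pt"))


def has_stanza_tokenizer(package_dir):
    """True if at least one tokenizer model file is bundled in the package."""
    return len(stanza_tokenizers(package_dir)) > 0
```

Running `has_stanza_tokenizer` over each directory in the local packages folder would flag packages like `translate-sl_en-1_9` without having to trigger a crash first.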

translate-ms_en-1_9.argosmodel

  • Works well

translate-en_sq-1_9.argosmodel

  • Works well

translate-tl_en-1_9.argosmodel

  • Works

translate-en_sl-1_9.argosmodel

  • Partially broken
  • Frequently generates this specific sentence instead of translating the source English text: “= = Odkritje = = Asteroid je odkril avstrijski astronom Johann Palisa (1877 - 1962) 17. novembra 1892 v Karlu.”

According to Google Translate this translates to: “= = Discovery = = The asteroid was discovered by the Austrian astronomer Johann Palisa (1877 - 1962) on November 17, 1892 in Karl.”

For example, all three of the following English source sentences are translated to the same Slovenian sentence about the asteroid (approximately 5% of sentences have this issue in my tests):

He began running during the first COVID-19 lockdown in March 2020, having been motivated by an American athlete who had run every street in San Francisco in 30 days.

Daniel Roy Gilchrist Noboa Azín (born 30 November 1987) is an Ecuadorian business administrator, politician and businessman in the banana industry, who is the president-elect of Ecuador after winning the 2023 general election.

In March 2023, he was in favor of the muerte cruzada, in the face of the rejection and filing of the Investment Law, presented by the government of Guillermo Lasso.[19] On 17 May 2023,
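
To put a number on this failure mode, one option is to group translations by their output text and flag any output that was produced for several different source sentences. This is a minimal sketch of that idea, not the method used for the 5% estimate above; the function names and the `min_sources` threshold are my own assumptions.

```python
from collections import defaultdict


def collapsed_outputs(pairs, min_sources=2):
    """Map each translation to the distinct sources that produced it.

    Only translations shared by at least `min_sources` different source
    sentences are returned.
    """
    by_output = defaultdict(set)
    for source, translation in pairs:
        by_output[translation.strip()].add(source.strip())
    return {
        output: sorted(sources)
        for output, sources in by_output.items()
        if len(sources) >= min_sources
    }


def collapse_rate(pairs, min_sources=2):
    """Fraction of pairs whose translation is shared with other sources."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    bad = collapsed_outputs(pairs, min_sources)
    hits = sum(1 for _, translation in pairs if translation.strip() in bad)
    return hits / len(pairs)
```

Feeding it `(source, translation)` pairs from a test set would surface the asteroid sentence as a key with many unrelated sources attached.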

translate-en_no-1_9.argosmodel

  • Partially broken
  • Most translations are correct, but ~15% of them contain gibberish, for example:
English Source

Daniel Roy-Gilchrist Noboa Azín was born on 30 November 1987 in the city of Guayaquil. He is the son of businessman Álvaro Noboa and physician Anabella Azín.

translate-en_no-1_9

Daniel Roychrist christ Nowas was was ble født 30. november 87 87 i byen il il il. Han er sønn av forretningsmann. o o Noo og lege bella Bella Azín.

Back translation with Google Translate

Daniel Roychrist christ Nowas was was born on 30 November 87 87 in the city of il il il. He is the son of a businessman. o o Noo and doctor bella Bella Azín.
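
The gibberish in this example mostly shows up as the same token repeated back to back (“was was”, “87 87”, “il il il”). A rough heuristic for flagging such output automatically is to count adjacent duplicate tokens. This is only a sketch with a hypothetical function name; natural text occasionally repeats a word legitimately, so any threshold would need tuning.

```python
import string


def stutter_count(text):
    """Count adjacent token pairs that are identical.

    Tokens are whitespace-separated, lowercased, and stripped of surrounding
    ASCII punctuation before comparison.
    """
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
```

A clean translation of the same source should score zero or close to it, so comparing the count against a small threshold is enough to pick out candidates for manual review.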

translate-en_et-1_9.argosmodel

  • Works well

translate-sq_en-1_9.argosmodel

  • Works

translate-en_el-1_9.argosmodel

  • Works well

translate-ro_en-1_9.argosmodel

  • Stanza broken
2023-10-22 09:25:52 ERROR: Cannot load model from /home/pj/.local/share/argos-translate/packages/translate-ro_en-1_9/stanza/ro/tokenize/rrt.pt

translate-en_vi-1_9.argosmodel

  • Works well

translate-sr_en-1_9.argosmodel

  • Stanza broken

translate-en_pl-1_9.argosmodel

  • Works very well

translate-bg_en-1_9.argosmodel

  • Works

translate-el_en-1_9.argosmodel

  • Works well

translate-en_sr-1_9.argosmodel

  • Works well

translate-et_en-1_9.argosmodel

  • Works well

translate-en_ca-1_9.argosmodel

  • Works very well

translate-en_zh-1_9.argosmodel

  • Works very well, much better than the current Chinese model

translate-bn_en-1_9.argosmodel

  • Works

translate-lv_en-1_9.argosmodel

  • Works well

translate-en_ms-1_9.argosmodel

  • Works very well

translate-pl_en-1_9.argosmodel

  • Works very well

translate-en_lv-1_9.argosmodel

  • Works well

translate-lt_en-1_9.argosmodel

  • Works well

translate-th_en-1_9.argosmodel

  • Works

translate-zt_en-1_9.argosmodel

  • Works very well

translate-zh_en-1_9.argosmodel

  • Works very well, a major improvement over the current Chinese model

translate-en_th-1_9.argosmodel

  • Works

translate-vi_en-1_9.argosmodel

translate-no_en-1_9.argosmodel

  • Works

translate-en_tl-1_9.argosmodel

  • Works very well

I’ve just fixed the ro and sl models (both directions) and uploaded the fixed models to Google Drive. I made a mistake when packaging the Stanza model.

I’ve uploaded a better Norwegian Bokmål (nb) model. Norwegian has two main written variants (nb and nn), so a dedicated nb model should do better than the generic “no” model.

(input) Daniel Roy-Gilchrist Noboa Azín was born on 30 November 1987 in the city of Guayaquil. He is the son of businessman Álvaro Noboa and physician Anabella Azín.

(output) Daniel Roy-Gilchrist Noboa Azín ble født 30. november 1987 i byen Guayaquil. Han er sønn av forretningsmannen Álvaro Noboa og lege Anabella Azín.

(backtranslation) Daniel Roy-Gilchrist Noboa Azín was born on 30 November 1987 in the city of Guayaquil. He is the son of businessman Álvaro Noboa and physician Anabella Azín.

These models are published on argospm-index now:

Updated/Improved

Ukrainian (uk)

(en)> When people don’t see moose as potentially dangerous, they may approach too closely and put themselves at risk.
(gt)> Якщо люди не ставляться до лосів як до потенційної небезпеки, вони можуть занадто наблизитися і піддати себе ризику.
(uk)> Коли люди не вважають лося бути потенційно небезпечними, то можуть занадто близько підійти до них і піддати себе небезпеці.

(uk)> Якщо люди не ставляться до лосів як до потенційної небезпеки, вони можуть занадто наблизитися і піддати себе ризику.
(gt)> When people don’t see moose as potentially dangerous, they may approach too closely and put themselves at risk.
(en)> If humans do not treat the elk as a potential danger, they may come too close and put themselves at risk.

I tried porting the Opus-MT en->es model to Argos Translate and it doesn’t seem to work.

python opus_mt_convert.py -s en -t es

For almost any source text the Opus-MT model generates repetitive gibberish similar to the (in)famous salad bug.

$ argos-translate -f en -t es "Language models can be trained by providing lots of example translations from a source language to a target language."
ClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClevClev@@
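
Output like this, where one short chunk repeats until the length limit, is easy to detect mechanically. Below is a small sketch using a backreference regex; the 2 to 20 character unit size and the ten-repetition threshold are arbitrary assumptions, and the function name is hypothetical.

```python
import re

# A 2-20 character unit followed by at least 9 more copies of itself.
_REPEAT = re.compile(r"(.{2,20}?)\1{9,}", re.DOTALL)


def find_repeat_loop(text):
    """Return the repeated unit if the text contains a long repetition, else None."""
    match = _REPEAT.search(text)
    return match.group(1) if match else None
```

Applied to the output above, a check like this could be run over a batch of test sentences to see which converted models are affected.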

I’m going to experiment with some more Opus-MT models in different languages. I think our current Spanish model is pretty good, so if the issue is limited to this specific model, we can keep our current one, which was trained with OpenNMT-tf.

This is the same issue as in “BPE encoding, prefix/suffix support, NLLB support” by pierotofy (Pull Request #369, argosopentech/argos-translate on GitHub).

I’ve narrowed the problem down to the compute_type parameter used for runtime quantization. For some reason, some of these models fail when compute_type is left at “default” (the current value). Forcing the model to use “float32” fixes the problem, but I’m unsure of the root cause.
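
For anyone who wants to reproduce the workaround outside Argos Translate, here is a minimal sketch that opens a converted model directly with CTranslate2 and forces `compute_type="float32"`. It assumes the CTranslate2 model sits in a `model` subdirectory of the unpacked package, which may not match every package; the helper names are hypothetical, and this is not how Argos Translate itself loads models.

```python
from pathlib import Path


def translator_kwargs(compute_type="float32"):
    """Keyword arguments for ctranslate2.Translator with an explicit compute type."""
    return {"device": "cpu", "compute_type": compute_type}


def load_translator(package_dir, compute_type="float32"):
    """Open the CTranslate2 model of an unpacked Argos package.

    Assumption: the converted model lives in a "model" subdirectory of the
    package directory; adjust the path if your package is laid out differently.
    """
    import ctranslate2  # third-party; imported lazily so the helper above needs no dependencies

    model_dir = Path(package_dir) / "model"
    return ctranslate2.Translator(str(model_dir), **translator_kwargs(compute_type))
```

Loading the same model once with `"default"` and once with `"float32"` and comparing the outputs for a fixed sentence should show whether a given model is affected.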
