Odd translation behavior repeating words

An update: the models have finished training.

The BLEU scores were ~68 and ~58 respectively. Oddly, the pt_en model seems to perform worse than the en_pt one (I just used the --reverse flag).

When evaluating manually, the error for “quase quase” still persists: it outputs “almost almost almost almost almost almost…”, repeated many times.
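
One way to catch this kind of output automatically, rather than by eye, is to compare the longest run of identical consecutive tokens in the output against the source. This is only a minimal sketch: the function names and the `slack` threshold are made up for illustration and are not part of Argos Translate or Locomotive.

```python
def max_repeat_run(text: str) -> int:
    """Return the length of the longest run of identical consecutive tokens."""
    tokens = text.lower().split()
    longest = current = 0
    previous = None
    for token in tokens:
        # Extend the current run if the token repeats, otherwise start a new run.
        current = current + 1 if token == previous else 1
        longest = max(longest, current)
        previous = token
    return longest


def looks_degenerate(source: str, output: str, slack: int = 1) -> bool:
    """Flag outputs that repeat a token more times in a row than the source does.

    `slack` is an arbitrary tolerance (an assumption, tune it for your data).
    """
    return max_repeat_run(output) > max_repeat_run(source) + slack


print(looks_degenerate("quase quase", "almost almost almost almost almost almost"))
```

Running a check like this over a list of short test inputs could give a rough count of how often each model falls into the repetition loop.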

Trying some samples, it does seem to translate well, but I am not sure how I can compare it with the current model in an objective way.

Any feedback or ideas?

Thanks @argosopentech and @pierotofy


Awesome thanks for training these!

These models look good. They didn’t work with Argos Translate out of the box but they worked after I unzipped and re-zipped the model directory. I’m not sure exactly what the issue was or why re-zipping fixed it but they’re working now.

I got this error initially:

Traceback (most recent call last):
  File "/home/pj/Downloads/env/lib/python3.10/site-packages/argostranslategui/gui.py", line 396, in load_languages
    self.languages = translate.load_installed_languages()
  File "/home/pj/Downloads/env/lib/python3.10/site-packages/argostranslate/translate.py", line 636, in load_installed_languages
    return get_installed_languages()
  File "/home/pj/Downloads/env/lib/python3.10/site-packages/argostranslate/translate.py", line 521, in get_installed_languages
    packages = package.get_installed_packages()
  File "/home/pj/Downloads/env/lib/python3.10/site-packages/argostranslate/package.py", line 327, in get_installed_packages
    to_return.append(Package(path))
  File "/home/pj/Downloads/env/lib/python3.10/site-packages/argostranslate/package.py", line 193, in __init__
    raise FileNotFoundError(
FileNotFoundError: Error opening package at /home/pj/.local/share/argos-translate/packages/stanza/metadata.json no metadata.json
Aborted (core dumped)

Here’s some translation samples:

en → pt

English Source Text (Wikipedia)

Hector Berlioz (11 December 1803 – 8 March 1869) was a French Romantic composer. His output includes orchestral works such as Harold in Italy, choral pieces including his Requiem and L’enfance du Christ, and works of hybrid genres such as the “dramatic symphony” Roméo et Juliette and the “dramatic legend” La damnation de Faust. Expected to enter medicine, Berlioz defied his family by taking up music, and won the Prix de Rome in 1830. Berlioz married the Irish Shakespearean actress Harriet Smithson, who inspired his first major success, the Symphonie fantastique, in which an idealised depiction of her occurs throughout. His first opera, Benvenuto Cellini, was a failure. The second, the epic Les Troyens, was so large in scale that it was never staged in its entirety during his lifetime. Meeting only occasional success in France as a composer, Berlioz turned to conducting, in which he gained an international reputation. He also wrote musical journalism throughout much of his career.

1.1 (Proposed) Translation

Texto para traduzir de Hector Berlioz (11 de dezembro de 1803 – 8 de março de 1869) foi um compositor francês. Sua saída inclui obras orquestrais como Harold em Itália, peças coral, incluindo seu Requiem e L’enfance du Cristo, e obras de gêneros híbridos como a “fínfoniadramática” Roméo et Juliette e a " lendadramática" La damnation de Faust. Esperava entrar em medicina, Berlioz defiou sua família tomando música e ganhou o Prix de Roma em 1830. Berlioz casou-se com a atriz irlandesa Shakespearean Harriet Smithson, que inspirou seu primeiro grande sucesso, a fantastique Symphonie, na qual ocorre uma representação idealizada de sua atriz. Sua primeira ópera, Benvenuto Cellini, foi uma falha. O segundo, o épico Les Troyens, foi tão grande em escala que nunca foi palco em sua totalidade durante sua vida. Encontro apenas um sucesso ocasional na França como compositor, Berlioz voltou a conduzir, em que ganhou uma reputação internacional. Ele também escreveu jornalismo musical em grande parte de sua carreira.

1.0 (Prod) Translation

Hector Berlioz (11 de dezembro de 1803 - 8 de março de 1869) foi um compositor romântico francês. Sua produção inclui obras orquestrais como Haroldo na Itália, peças corais incluindo sua Requiem e L’enfance du Christ, e obras de gêneros híbridos como a “sinfonia dramática” Roméo et Juliette e a " legenda dramática" La Damnation de Faust. Esperava entrar na medicina, Berlioz desafiou sua família ao tomar música e ganhou o Prix de Roma em 1830. Berlioz casou-se com a atriz irlandesa de Shakespeare Harriet Smithson, que inspirou seu primeiro grande sucesso, a fantasia de Symphonie, na qual uma representação idealizada dela ocorre em toda parte. Sua primeira ópera, Benvenuto Cellini, foi um fracasso. O segundo, o épico Les Troyens, era tão grande em escala que nunca foi encenado em sua totalidade durante sua vida. Reunindo apenas sucesso ocasional na França como compositor, Berlioz virou-se para conduzir, em que ganhou uma reputação internacional. Ele também escreveu jornalismo musical em grande parte de sua carreira.

pt → en

Portuguese Source Text (Wikipedia)

Mohammed Ould Abdel Aziz (Akjoujt, 20 de dezembro de 1956) é um político e foi o 8.º presidente da Mauritânia entre 2009 a 2019.[1] Soldado de carreira e oficial de alta patente, foi destaque durante o golpe em agosto de 2005 que depôs o presidente Maaouya Ould Sid’Ahmed Taya, e liderou o golpe em agosto de 2008, que derrubou o presidente Sidi Ould Cheikh Abdallahi. Após o golpe de 2008, Abdel Aziz tornou-se Presidente do Conselho Superior de Estado como parte do que foi descrito como uma transição política que conduziu a uma nova eleição. Renunciou ao cargo em abril de 2009 para se apresentar como candidato nas eleições presidenciais de julho de 2009, saindo eleito. Foi empossado em 5 de agosto de 2009. Posteriormente, foi reeleito em 2014 e não buscou a reeleição em 2019. Foi sucedido por Mohamed Ould Ghazouani, que assumiu o cargo em 1 de agosto de 2019.

1.1 (Proposed) Translation

Mohammed Ould Abdel Aziz (born December 20, 1956) is a politician, and was the 8th president of Mauritania between 2009 and 2019.[1] High patent career and official soldier, was highlighted during the coup in August 2005 which led President Maaouya Ould Sid’Ahmed Taya, and led the coup in August 2008, which broke down President Sidi Ould Cheikh Abdallahi. After the 2008 coup, Abdel Aziz became President of the Higher Council of State as part of what was described as a political transition that led to a new election. He denounced the position in April 2009 to present himself as a candidate in the presidential elections of July 2009, leaving elected. On 5 August 2009. Subsequently, it was re-elected in 2014 and did not seek re-election in 2019. It was succeeded by Mohamed Ould Ghazouani, who took office on 1 August 2019.

1.0 (Prod) Translation

Mohammed Ould Abdel Aziz (December 20, 1956) is a politician and was the 8th president of Mauritania from 2009 to 2019.[1] High-ranking officer and career soldier, was featured during the coup in August 2005 which deposed President Maaouya Ould Sid’Ahmed Taya, and led the coup in August 2008, which ousted President Sidi Ould. After the 2008 coup, Abdel Aziz became President of the Superior Council of State as part of what was described as a political transition that led to a new election. He resigned in April 2009 to appear as a candidate in the July 2009 presidential election, leaving elected. It was retired on 5 August 2009. He was re-elected in 2014 and did not seek re-election in 2019. It was succeeded by Mohamed Ould Ghazouani, who took office on 1 August 2019.

Side note: I’ve just merged “Workaround for salad” by pierotofy (Pull Request #554, LibreTranslate/LibreTranslate on GitHub), which should help mitigate this issue with all models for single-word translations.


Thank you @pierotofy and @argosopentech ! A few more questions:

  • When I stopped training before step 50000 and then resumed it (by executing the train command again), it resumed from step 9000. Is that normal?
  • Does the 1.0 model have a BLEU score and/or acc and ppl for comparison?
  • What is the process now for updating the existing model on the argosopentech website (the one that is loaded by default with LibreTranslate)?

I’m not sure exactly. This would depend on the interaction between Locomotive and OpenNMT-py.

I don’t have any BLEU scores for the 1.0 model. BLEU scores aren’t very reliable in my experience, so I don’t use them much.

Are you a Portuguese speaker? Do you think the 1.1 model is an improvement over the 1.0 one? I can push a commit to the argospm-index repo to update the prod model.

Thanks PJ Finlay,

I am a native speaker, but honestly I can’t really say in an objective way whether it improves on the existing model. Some phrases do look better, but other phrases look a lot worse. I think the best way would be to A/B test it with real users. (In other words, please don’t push this new model for now; keep the old model.)

The reason I was training the model was because of that repeated words problem, but it seems that pierotofy has fixed it in another way?

For the pt-es model, do you think we would get visible improvements, based on your previous experience with other language pairs that used the lang1->en->lang2 double translation path? If so, I can try training that.


I normally haven’t included non-English language pairs but I would be willing to merge es-pt. Since the languages are so similar I think there would be a big performance improvement from not pivoting through English.

I’ve seen this issue where the models need to be re-zipped a few times now and I believe it’s caused by Google Drive. Downloading from Google Drive seems to mess with the .zip file compression somehow.

I’ve encountered this too. It’s due to certain compression algorithms not being universally supported by all zip extractors.
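
If anyone wants to check this on a package that fails to load, Python’s standard zipfile module can list which compression method each entry uses and whether a metadata.json entry is present. This is just a diagnostic sketch: the helper names are invented here, and it does not reproduce how Argos Translate itself opens packages.

```python
import zipfile

# Human-readable names for the compression method constants defined by zipfile.
METHOD_NAMES = {
    zipfile.ZIP_STORED: "stored",
    zipfile.ZIP_DEFLATED: "deflate",
    zipfile.ZIP_BZIP2: "bzip2",
    zipfile.ZIP_LZMA: "lzma",
}


def compression_methods(archive) -> dict:
    """Map each compression method used in the archive to its entry count."""
    counts = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            name = METHOD_NAMES.get(
                info.compress_type, f"unknown ({info.compress_type})"
            )
            counts[name] = counts.get(name, 0) + 1
    return counts


def has_metadata(archive) -> bool:
    """Check whether any entry in the archive is named metadata.json."""
    with zipfile.ZipFile(archive) as zf:
        return any(n.split("/")[-1] == "metadata.json" for n in zf.namelist())
```

Comparing the output for the original download and the re-zipped copy, for example `compression_methods("translate-pt_en-1_1.argosmodel")` (a hypothetical filename), might show whether the two archives really differ in compression method.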

Will do! I will start training the pt-es model with this config:

{
    "from": {
        "name": "Portuguese",
        "code": "pt"
    },
    "to": {
        "name": "Spanish",
        "code": "es"
    },
    "version": "1.0",
    "sources": [
        "opus://NLLB",
        "opus://MultiParaCrawl",
        "opus://OpenSubtitles",
        "opus://ELRC-EMEA",
        "opus://LinguaTools-WikiTitles",
        "opus://XLEnt",
        "opus://EUbookshop",
        "opus://TildeMODEL",
        "opus://SciELO",
        "opus://Europarl",
        "opus://WikiMatrix",
        "opus://JRC-Acquis",
        "opus://EMEA",
        "opus://DGT",
        "opus://KDE4",
        "opus://GNOME",
        "opus://GlobalVoices",
        "opus://NeuLab-TedTalks",
        "opus://Tatoeba",
        "opus://News-Commentary"
    ]
}

Are there any other data sources I should include in this list (or maybe exclude)? Do more data sources mean better results, or can they hinder the quality of the model after a certain point? Thanks in advance.
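
With a source list this long it is easy to paste the same corpus twice, so a quick sanity check before starting a long training run may be worth it. The sketch below only uses the fields visible in the config above ("from", "to", "sources"); `check_config` is a hypothetical helper, not something Locomotive provides.

```python
import json


def check_config(raw: str) -> list:
    """Return a list of human-readable problems found in a training config."""
    config = json.loads(raw)
    problems = []
    # A model translating a language into itself is almost certainly a typo.
    if config["from"]["code"] == config["to"]["code"]:
        problems.append("source and target language codes are identical")
    # Report every source that appears more than once.
    seen = set()
    for source in config.get("sources", []):
        if source in seen:
            problems.append(f"duplicate source: {source}")
        seen.add(source)
    return problems
```

An empty list means neither of these two problems was found. It says nothing about whether the corpora themselves are good quality.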


@argosopentech , here are the PT-ES models:

https://github.com/bruno-kakele/argos/raw/main/translate-pt_es-1_0.argosmodel
https://github.com/bruno-kakele/argos/raw/main/translate-es_pt-1_0.argosmodel

Can you check if they look sane? I did some manual tests and the translation looks OK. Thanks in advance!


This model is live! Thanks for contributing


I looked into this more and I think there is something broken with the current zip implementation. I made a pull request with a new version that seems to work better.


Just merged! Thanks.


I’ve encountered the same problem.
I assume it has to do with the way the files are sorted in the script: compared as text, step 9000 sorts after step 50000 because 9 is greater than 5.
I worked around this problem temporarily by simply deleting the checkpoint files for steps 1000 through 9000.
The same problem exists when averaging checkpoints, so be careful: you may end up averaging checkpoints 9000 and 50000, for example. BLEU may be high, but the translation quality will be mediocre.
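
To illustrate the sorting issue, here is a small sketch comparing a plain string sort with a sort on the numeric step. The filename pattern `model_step_N.pt` is an assumption for the example, and `step_of` is a hypothetical helper, not the function used in Locomotive.

```python
import re


def step_of(filename: str) -> int:
    """Extract the training step from a name like 'model_step_9000.pt'."""
    match = re.search(r"_step_(\d+)\.pt$", filename)
    if match is None:
        raise ValueError(f"no step number in {filename!r}")
    return int(match.group(1))


checkpoints = ["model_step_9000.pt", "model_step_50000.pt", "model_step_10000.pt"]

# Plain string sort compares character by character, so "9..." lands last.
lexicographic = sorted(checkpoints)

# Sorting on the parsed integer gives the real training order.
numeric = sorted(checkpoints, key=step_of)

print(lexicographic[-1])  # model_step_9000.pt
print(numeric[-1])        # model_step_50000.pt
```

With the string sort, the "latest" checkpoint is step 9000, which matches training resuming from there.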

The checkpoint sorting should be fixed by the commit “add get_checkpoints” (LibreTranslate/Locomotive@19a3777 on GitHub).


I’m seeing this issue again with a new model. Unzipping and re-zipping the model fixes it but I don’t know what the root cause is.

Oh, interesting!
I added the readme.txt file to the already packaged .argos archive. Perhaps this is the cause.
I used MC in Ubuntu if that helps.

Hmmm, I’m not sure. I use Ubuntu too and normally zip directories by right clicking on them.

It’s good to know you re-zipped it. I think Locomotive’s packaging code probably isn’t broken then.