Odd translation behavior repeating words

Thanks PJ Finlay,

I am a native speaker, but honestly I can’t say whether it improves on the existing model in an objective way. Some phrases look better, but others look a lot worse. I think the best approach would be to A/B test it with real users. (In other words, please don’t push this new model for now; keep the old one.)

The reason I was training the model was the repeated-words problem, but it seems that pierotofy has fixed it in another way?

For the pt-es model: based on your previous experience with other languages that use the lang1->en->lang2 double-translation path, do you think a direct model would give visible improvements? If so, I can try training it.


I normally haven’t included non-English language pairs but I would be willing to merge es-pt. Since the languages are so similar I think there would be a big performance improvement from not pivoting through English.

I’ve seen this issue where the models need to be re-zipped a few times now and I believe it’s caused by Google Drive. Downloading from Google Drive seems to mess with the .zip file compression somehow.

I’ve encountered this also, it’s due to certain compression algorithms not being universally supported by all zip extractors.
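One way to work around extractor incompatibilities is to rewrite the archive so every entry uses plain DEFLATE, which virtually all unzip tools support. Below is a minimal sketch in Python (the helper name `rezip` and the file paths are hypothetical, not part of Argos Translate or Locomotive):

```python
import zipfile

def rezip(src: str, dst: str) -> None:
    """Rewrite a zip archive so every entry uses standard DEFLATE.

    Some archives contain entries compressed with methods (e.g. Deflate64)
    that not every extractor supports; reading and re-writing each entry
    normalizes the compression method.
    """
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            # zin.read() decompresses with whatever method the entry used;
            # writestr() re-compresses with standard DEFLATE.
            zout.writestr(info.filename, zin.read(info.filename))
```

Note that Python’s `zipfile` itself raises `NotImplementedError` for unsupported compression methods, which is essentially the same failure the incompatible extractors hit, so this only helps when the original archive is readable by `zipfile`.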

Will do! I will start training the pt-es model with this config:

{
    "from": {
        "name": "Portuguese",
        "code": "pt"
    },
    "to": {
        "name": "Spanish",
        "code": "es"
    },
    "version": "1.0",
    "sources": [
        "opus://NLLB",
        "opus://MultiParaCrawl",
        "opus://OpenSubtitles",
        "opus://ELRC-EMEA",
        "opus://LinguaTools-WikiTitles",
        "opus://XLEnt",
        "opus://EUbookshop",
        "opus://TildeMODEL",
        "opus://SciELO",
        "opus://Europarl",
        "opus://WikiMatrix",
        "opus://JRC-Acquis",
        "opus://EMEA",
        "opus://DGT",
        "opus://KDE4",
        "opus://GNOME",
        "opus://GlobalVoices",
        "opus://NeuLab-TedTalks",
        "opus://Tatoeba",
        "opus://News-Commentary"
    ]
}

Any other data sources I should include in this list (or maybe exclude)? Do more data sources mean better results, or can they hinder the quality of the model past a certain point? Thanks in advance.


@argosopentech , here are the PT-ES models:

https://github.com/bruno-kakele/argos/raw/main/translate-pt_es-1_0.argosmodel
https://github.com/bruno-kakele/argos/raw/main/translate-es_pt-1_0.argosmodel

Can you check if they look sane? I did some manual tests and the translation looks OK. Thanks in advance!


This model is live! Thanks for contributing.


I looked into this more and I think there is something broken with the current zip implementation. I made a pull request with a new version that seems to work better.


Just merged! Thanks.


I’ve encountered the same problem.
I assume it has to do with the way the checkpoint files are sorted in the script (lexicographically, so “9” sorts after “5”).
I worked around this temporarily by simply deleting the step-1000 through step-9000 checkpoint files.
The same problem exists when averaging checkpoints, so be careful: you may end up averaging, for example, checkpoints 9000 and 50000. BLEU may look high, but the model translates with mediocre quality.
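The sorting pitfall can be shown in a few lines. The checkpoint filenames below are only illustrative, not Locomotive’s actual naming scheme:

```python
# Hypothetical checkpoint filenames as a training run might produce them.
ckpts = ["model_step_1000.pt", "model_step_5000.pt",
         "model_step_9000.pt", "model_step_50000.pt"]

# Plain string sort: "9000" sorts after "50000" because '9' > '5',
# so the "latest" checkpoint picked lexicographically is step 9000.
assert sorted(ckpts)[-1] == "model_step_9000.pt"

def step(name: str) -> int:
    # Extract the numeric step so checkpoints sort by training progress.
    return int(name.split("_step_")[1].split(".")[0])

# Numeric sort puts step 50000 last, as intended.
assert sorted(ckpts, key=step)[-1] == "model_step_50000.pt"
```

Sorting with a numeric key (or zero-padding the step numbers in the filenames) avoids both the wrong-checkpoint and the wrong-averaging problems.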

The checkpoint sorting should be fixed with LibreTranslate/Locomotive@19a3777 (“add get_checkpoints”) on GitHub.


I’m seeing this issue again with a new model. Unzipping and re-zipping the model fixes it but I don’t know what the root cause is.

Oh, interesting!
I added the readme.txt file to the already-packaged .argosmodel archive. Perhaps that is the cause.
I used MC (Midnight Commander) on Ubuntu, if that helps.

Hmmm, I’m not sure. I use Ubuntu too and normally zip directories by right-clicking on them.

It’s good to know you re-zipped it. I think Locomotive’s packaging code probably isn’t broken then.