I am a native speaker, but honestly I can’t say whether it improves on the existing model in an objective way. Some phrases look better, but others look a lot worse. I think the best way to decide would be to A/B test it with real users. (In other words, please don’t push this new model for now; keep the old one.)
The reason I was training the model was that repeated-words problem, but it seems that pierotofy has fixed it in another way?
For the pt-es model: based on your previous experience with other languages that used the lang1->en->lang2 double-translate path, do you think we would get visible improvements? If so, I can try training that.
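Just to make sure we mean the same thing by the double-translate path, here is a minimal sketch of the idea. The `translate` callable is a stand-in for whatever does a single-hop translation, not the project's actual API:

```python
from typing import Callable

def pivot_translate(
    translate: Callable[[str, str, str], str],
    text: str,
    src: str,
    tgt: str,
    pivot: str = "en",
) -> str:
    """Chain two single-hop translations through a pivot language."""
    intermediate = translate(text, src, pivot)  # e.g. pt -> en
    return translate(intermediate, pivot, tgt)  # e.g. en -> es
```

A direct pt-es model would replace both hops with a single call, so any error introduced by the intermediate English step goes away.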
I haven’t normally included non-English language pairs, but I would be willing to merge es-pt. Since the languages are so similar, I think there would be a big performance improvement from not pivoting through English.
I’ve now seen this issue where the models need to be re-zipped a few times, and I believe it’s caused by Google Drive: downloading from Google Drive seems to mess with the .zip file compression somehow.
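If it helps anyone debug this, here is a quick way to check whether a downloaded archive is intact, using Python’s standard `zipfile` module (the path is just a placeholder for the file you pulled from Google Drive):

```python
import zipfile

# Placeholder path; point it at the downloaded model archive.
path = "downloaded_model.argosmodel"

if not zipfile.is_zipfile(path):
    print("Not a valid zip archive at all (header is missing or corrupted)")
else:
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()  # first member with a bad CRC, or None if all pass
        print("Archive OK" if bad is None else f"Corrupt member: {bad}")
```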
Any other data source I should include in this list? (or maybe exclude?). Does more data sources mean better results? Or can it hinder the quality of the model after a certain point? Thanks in advance
I looked into this more and I think there is something broken with the current zip implementation. I made a pull request with a new version that seems to work better.
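For reference, re-creating the archive from a model directory with Python’s standard library is enough to produce a clean zip. This is just a sketch of that idea, not the contents of the pull request; the function name and layout are mine:

```python
import os
import zipfile

def rezip_model(model_dir: str, out_path: str) -> None:
    """Re-create a model archive with standard deflate compression,
    writing paths relative to the model directory's parent."""
    root = os.path.dirname(os.path.abspath(model_dir))
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(model_dir):
            for name in sorted(filenames):  # stable, reproducible ordering
                full = os.path.join(dirpath, name)
                zf.write(full, arcname=os.path.relpath(full, root))
```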
I’ve encountered the same problem.
I assume it has to do with the way the checkpoint files are sorted in the script: they are compared as strings, so "9" sorts higher than "5".
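To illustrate: a plain string sort compares character by character, so "9000" lands after "50000". Sorting on the step number extracted from the filename fixes it. A small sketch (the filename pattern is just an example of what the script might see):

```python
import re

checkpoints = [
    "model_step_1000.pt",
    "model_step_5000.pt",
    "model_step_9000.pt",
    "model_step_50000.pt",
]

# Plain string sort: '9' > '5', so step 9000 sorts after step 50000.
print(sorted(checkpoints)[-1])  # model_step_9000.pt

# Fix: sort on the numeric step number instead of the raw filename.
def step_number(name: str) -> int:
    match = re.search(r"step_(\d+)", name)
    return int(match.group(1)) if match else -1

print(sorted(checkpoints, key=step_number)[-1])  # model_step_50000.pt
```

The same numeric key should be used anywhere the script picks "latest" checkpoints, including for averaging.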
I worked around this problem temporarily by simply deleting step files 1000 through 9000.
The same problem exists when averaging checkpoints, so be careful: you may end up averaging checkpoints 9000 and 50000, for example. The BLEU score may be high, but the model translates with mediocre quality.