Creating a specific corpus for LibreOffice documentation

Hi

First, thank you for the talk at FOSDEM 2023, which I attended in person.

I work on LibreOffice documentation and I also do translation for pt-BR. LibreOffice has a very large amount of documentation, and our community is actively updating its contents. We use Weblate for translating the UI and Help, but nothing for our community literature, including the wiki.

Weblate has a translation memory (TM) feature that we actively use. I’m investigating whether Argos Translate can go beyond TM in quality and efficiency.

The default LibreTranslate model for pt-BR (pt) performs badly on our LibreOffice strings, which contain XML tags. Also, I am not translating generic English but rather a restricted vocabulary related to an office suite.

I am interested in improving the quality of Argos Translate, which I suppose can be achieved if we stick to a restricted set of words and sentences. Correct me if I’m wrong.

We have TMX files and PO/POT files for ~70 languages.

Can you point me to where to start? I have LibreTranslate installed on an Ubuntu box for development.

Thank you for your attention.

Olivier
PS: pt-BR is considered a different language from pt-PT. LibreTranslate’s Portuguese is actually pt-PT.


LibreTranslate uses Argos Translate and its models, so the results should be the same.
If you translate the XML file directly, tags and all, it is indeed likely that you will get bad results. For optimal results it is better to parse the XML (or whatever the file type is) and translate only the text elements, without the tags.
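
Something like this, as a minimal sketch of that approach (file names are placeholders, and it assumes an en→pt package is already installed in Argos Translate):

```python
# Sketch: translate only the text content of an XML file, leaving
# tags and attributes untouched. Assumes an en -> pt Argos Translate
# package is already installed; paths are placeholders.
import xml.etree.ElementTree as ET
import argostranslate.translate

def translate_xml(in_path, out_path, from_code="en", to_code="pt"):
    tree = ET.parse(in_path)
    for elem in tree.iter():
        # Translate element text, skipping whitespace-only nodes
        if elem.text and elem.text.strip():
            elem.text = argostranslate.translate.translate(
                elem.text, from_code, to_code)
        # Tail text sits between tags, e.g. "...</emph> more text"
        if elem.tail and elem.tail.strip():
            elem.tail = argostranslate.translate.translate(
                elem.tail, from_code, to_code)
    tree.write(out_path, encoding="utf-8", xml_declaration=True)

translate_xml("help.xml", "help.pt.xml")
```

One caveat: sentences that are split by inline tags get translated piece by piece this way, so the model loses some context. For heavily tagged strings it can still be better than feeding raw markup to the model, though.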


As dingedi said, translating XML directly might be more difficult (however, have you tried passing the html option to the format parameter?).
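
For example, against a local LibreTranslate instance it would look roughly like this (the URL and port are just the defaults of a dev setup; if your instance requires an API key, add it to the payload):

```python
# Sketch: call a local LibreTranslate instance with format="html"
# so markup is preserved and only the text is translated.
import requests

resp = requests.post(
    "http://localhost:5000/translate",
    json={
        "q": "<p>Choose <b>File - Open</b> to open a document.</p>",
        "source": "en",
        "target": "pt",
        "format": "html",  # default is "text"; "html" keeps the markup
    },
)
print(resp.json()["translatedText"])
```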

You could take your existing translation data (pick two language pairs) and use https://github.com/argosopentech/argos-train to train a new model based on your specific domain. The models can then be loaded in both Argos Translate and LibreTranslate.
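
I'm not sure which of your formats is easiest to start from, but as one possible first step you could flatten the PO files into aligned plain-text source/target files with polib, which is roughly the shape of parallel corpus the training pipeline works from (a sketch; the paths and output layout here are placeholders, not argos-train's required format):

```python
# Sketch: build an aligned en / pt-BR corpus from translated PO entries.
# Paths and file names are placeholders.
import glob
import polib

with open("source.en", "w", encoding="utf-8") as src, \
     open("target.pt_BR", "w", encoding="utf-8") as tgt:
    for path in glob.glob("po/pt-BR/**/*.po", recursive=True):
        po = polib.pofile(path)
        for entry in po.translated_entries():
            # Keep one segment per line; skip multi-line entries
            if "\n" in entry.msgid or "\n" in entry.msgstr:
                continue
            src.write(entry.msgid + "\n")
            tgt.write(entry.msgstr + "\n")
```

Once you have a trained .argosmodel package, I believe you can install it locally with argostranslate.package.install_from_path(), and LibreTranslate will pick it up from the installed packages.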
