Creating specific corpus for LibreOffice documentation


First, I thank you for the talk at FOSDEM2023, which I attended in person.

I work on LibreOffice documentation and I do translation for pt-BR as well. LibreOffice has a very large amount of documentation and our community is actively updating its contents. We use Weblate for translating the UI and Help, but nothing for our community literature, including wiki.

Weblate has a translation memory (TM) feature we actively use. I'm investigating whether Argos Translate can go beyond TM in quality and efficiency.

The default LibreTranslate model for pt-BR (pt) performs badly on our LibreOffice strings, which contain XML tags. Also, I am not translating generic English but a rather restricted vocabulary related to an office suite.

I am interested in improving the quality of Argos Translate, which I suppose can be achieved if we stick to a restricted set of words and sentences. Correct me if I'm wrong.

We have TMX and PO/POT files for ~70 languages.

Can you point me to where to start? I have LibreTranslate installed on an Ubuntu box for development.

Thank you for your attention.

PS: pt-BR is considered a different language than pt-PT. LibreTranslate Portuguese is pt-PT actually.


LibreTranslate uses Argos Translate and its models, so the results should be the same.
If you translate the XML file directly, tags and all, you are indeed likely to get bad results. It is better to parse the XML (or other file type) and translate only the text elements, without the tags, for optimal results.
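A minimal sketch of that idea: walk the XML tree and run the translator only on text nodes, keeping the tag structure intact. The `translate` function below is a placeholder (here it just uppercases, so the example is self-contained); in practice you would call something like `argostranslate.translate.translate(text, "en", "pt")` there.

```python
import xml.etree.ElementTree as ET

def translate(text: str) -> str:
    # Placeholder for the real MT call (e.g. Argos Translate).
    return text.upper()

def translate_xml(xml_string: str) -> str:
    """Translate only text content; tags pass through untouched."""
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        # .text is the text before the first child, .tail the text after
        # the element's closing tag -- both can carry translatable content.
        if elem.text and elem.text.strip():
            elem.text = translate(elem.text)
        if elem.tail and elem.tail.strip():
            elem.tail = translate(elem.tail)
    return ET.tostring(root, encoding="unicode")

doc = '<help><p>Click <b>OK</b> to save.</p></help>'
print(translate_xml(doc))
# -> <help><p>CLICK <b>OK</b> TO SAVE.</p></help>
```

For production use, sentence boundaries split across inline tags (like the `<b>` above) still need care, which is part of why a format-aware option on the server side can help.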


As dingedi said, translating XML directly might be more difficult. However, have you tried passing the `html` option to the `format` parameter?
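For reference, here is a sketch of a `/translate` request with `format` set to `html`, which tells LibreTranslate to treat markup as structure rather than translatable text. The local URL is an assumption for a development instance; the request itself is left commented out since it needs a running server.

```python
import json
import urllib.request

payload = {
    "q": "<p>Insert a <b>table</b> into the document.</p>",
    "source": "en",
    "target": "pt",
    "format": "html",   # "text" (default) or "html"
}

# Assumed local development instance.
req = urllib.request.Request(
    "http://localhost:5000/translate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)   # uncomment with a running server
# print(json.loads(response.read())["translatedText"])
```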

You could take your existing translation data (pick a language pair) and use it to train a new model based on your specific domain. The models can then be loaded in both argos-translate and LibreTranslate.
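Since you already have TMX files, the first step toward domain-specific training is usually extracting aligned sentence pairs into parallel text files, which is the kind of input a training pipeline such as argos-train consumes. A sketch, with an illustrative inline TMX sample (real files would be read from disk):

```python
import xml.etree.ElementTree as ET

TMX = """<tmx version="1.4"><body>
  <tu>
    <tuv xml:lang="en"><seg>Insert a table</seg></tuv>
    <tuv xml:lang="pt-BR"><seg>Inserir uma tabela</seg></tuv>
  </tu>
  <tu>
    <tuv xml:lang="en"><seg>Save the document</seg></tuv>
    <tuv xml:lang="pt-BR"><seg>Salvar o documento</seg></tuv>
  </tu>
</body></tmx>"""

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_pairs(tmx_string, src_lang, tgt_lang):
    """Yield (source, target) sentence pairs for one language pair."""
    root = ET.fromstring(tmx_string)
    for tu in root.iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]

pairs = list(tmx_pairs(TMX, "en", "pt-BR"))
print(pairs)
# From here, write one sentence per line into two aligned files, e.g.
# source.en / target.pt_BR, for the training pipeline.
```

This also sidesteps the pt-BR vs. pt-PT issue: you select exactly the `xml:lang` variant you want from the TMX.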
