Hi all,
I’ve just discovered LibreTranslate today and I’m very keen to try it out. We have our own dedicated environment with enough GPU power, so we’d like to run it and, more importantly, train it within that environment. We’d use our own corpora, which is one more argument for using our own environment. If I understand correctly, the video tutorial for training a new language model shows it being done on vast.ai. So, can we use our environment for training as well?
Secondly, our aim is to use it as a backend service with the API enabled, which we would integrate with our translation tools. This is also possible, right?
Best, seba
The tutorial presents vast.ai because it’s easy to use for people who don’t have a GPU, but if you have an environment with a GPU you can follow the tutorial exactly, minus the vast.ai-specific steps.
Then you can install the resulting model package and it will be available to Argos Translate and LibreTranslate.
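For reference, a minimal sketch of that install step in Python. The helper below only lists trained package files in a directory (the file names and paths are illustrative); the actual install is done with `argostranslate.package.install_from_path`, which is part of the argostranslate Python API.

```python
from pathlib import Path


def find_model_packages(directory: str) -> list:
    """Return trained .argosmodel package files in a directory, sorted by name."""
    return sorted(Path(directory).glob("*.argosmodel"))


# Installing a package (requires argostranslate; call shown for reference):
#   import argostranslate.package
#   argostranslate.package.install_from_path(str(package_path))
# Once installed, the model is available to both Argos Translate and LibreTranslate.
```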
So, just to confirm: we can train the model without exposing our proprietary data outside our environment?
Yes, argos-train can be run on your own computer.
So long as you respect the terms of the AGPLv3 license (LibreTranslate/LICENSE at main · LibreTranslate/LibreTranslate · GitHub), yes. If you modify the software you’ll need to make the source code of your modifications available to your users. (I’m not a lawyer and this does not constitute legal advice.)
Please excuse my poor wording. We wouldn’t change the software at all. We would train LibreTranslate with our own data and have it running separately. Then, our own translation tool would call LibreTranslate API. There wouldn’t be any modification or integration. I was referring to integration from the process point of view.
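For anyone wiring up a similar integration: the call is a plain HTTP POST to LibreTranslate’s /translate endpoint, which returns a JSON body with a `translatedText` field. A minimal client sketch using only the Python standard library (the base URL and the api_key handling are assumptions about your particular deployment):

```python
import json
import urllib.request


def build_translate_request(base_url, text, source, target, api_key=None):
    """Build a POST request for LibreTranslate's /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    if api_key:  # only needed if the instance enforces API keys
        payload["api_key"] = api_key
    return urllib.request.Request(
        base_url.rstrip("/") + "/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage against a local instance:
#   req = build_translate_request("http://localhost:5000", "Hello", "en", "de")
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["translatedText"])
```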
I have two follow-up questions.
- Is it possible to set it up so that we get back bilingual documents? That would make it easier for translators to check the machine translations. Ideally this would feed into a feedback learning loop, if possible.
- Is it possible to hire someone for the initial setup in our environment? If so, is this the place to look for them?
Hi there,
It is perfectly possible. You need:
- a Python dependency mirror in your environment if you want to isolate it from the Internet
- a “Locomotive” server for training the models (this requires GPU)
- a “LibreTranslate” instance, which does not require a GPU for inference; it can serve up to four 1,000-character requests per second on a single CPU core
If you want to use your own models, I advise creating a dedicated user to run LibreTranslate. You will need to copy the packages from root/.local/share/argos-translate/packages into the same path under that user’s home directory; that way, updating LibreTranslate or argostranslate will not change the models used in production.
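To make the copy step concrete: Argos Translate keeps each user’s installed packages under ~/.local/share/argos-translate/packages. A small sketch of the path logic, assuming a dedicated service user named “libretranslate” (that user name is an assumption):

```python
from pathlib import Path


def argos_packages_dir(home: str) -> Path:
    """Default Argos Translate package directory under a given home directory."""
    return Path(home) / ".local" / "share" / "argos-translate" / "packages"


# Copy models trained as root to the dedicated service user, e.g.:
#   import shutil
#   shutil.copytree(argos_packages_dir("/root"),
#                   argos_packages_dir("/home/libretranslate"),
#                   dirs_exist_ok=True)
```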
The pipeline is as follows:
- train models on the Locomotive server (Windows for convenience, but it’s OS-agnostic); the best metrics to follow are val.BLEU and, to a lesser extent, perplexity (ppl)
- transfer them to the LibreTranslate instance using pscp
- install them with the Python “file_to_package.py” script (see there)
- copy them to the user directory (removing existing packages first)
- restart the service
I can help set up the environment. You can email me at [email protected]