Adding a non-supported language

sttx · March 23, 2022, 5:08am

What data do I need to add a new language? I saw this tutorial but it does not explain how to add/train a non-supported language (ISO 639-2) on LibreTranslate. I’m a beginner, sorry if I don’t understand it that much.

argosopentech · March 26, 2022, 1:34pm

To add a new language pair you need to collect data (I normally use Opus), and add the data to data-index.json in the Argos Train repo. Then you need to train an Argos Translate model and install it on your LibreTranslate instance.

Docs

pierotofy · March 26, 2022, 3:57pm

I’ve added these details in the README too.

jefs42 · April 4, 2022, 8:51pm

I’m trying, but having some trouble getting the various instructions put together to get to the end.

I was trying to follow this - Neural Machine Translation : Training OpenNMT on Swedish-English corpus. | by Muhammad Saad Ali | Medium - using a da->en TMX of the Europarl document from Opus.

(Bloom County anyone…? )

With some fixes to their example, I have a directory now like this:

+ corpus_da-en
    - src-test.txt
    - src-train.txt
    - src-val.txt
    - tgt-test.txt
    - tgt-train.txt
    - tgt-val.txt

The next step says to run preprocess.py, but I don’t see that file in the OpenNMT-py directory. There are train.py and translate.py.

Then the train.py example uses models/baseline, but there is no models directory in the OpenNMT-py directory either.

So I’m a little stuck in the middle of how to get from europarl_da-en.tmx to a data-europarl-da_en.argosdata

Hmm, I may have to set up a Linux distro on my PC… Looks like OpenNMT and argostrain need (or prefer?) a GPU to run… I have been doing the above so far on a DigitalOcean droplet.

argosopentech · April 5, 2022, 12:20am

To create a .argosdata package from Opus data you want to download data in the Moses format.

Argos Train uses Moses format data: two files in which line N of one file is the translation of line N of the other. To create the .argosdata file, download the Moses data, save the two data files as source and target in a new directory, and zip that directory together with a metadata.json file. To see the contents of an existing .argosdata package, run unzip sample.argosdata. The tutorial you linked splits the data into train and validation sets manually, but Argos Train will do that for you automatically.
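The packaging step above can be sketched in Python. This is a minimal illustration, not an official Argos Train tool: the function name build_argosdata, the metadata keys, and the file names are assumptions, so compare the result against an existing package (unzip sample.argosdata) before relying on it.

```python
import json
import shutil
import zipfile
from pathlib import Path


def count_lines(path):
    """Count lines in a file without loading it all into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def build_argosdata(src_file, tgt_file, out_dir, metadata):
    """Zip a Moses-format file pair plus metadata.json into <name>.argosdata."""
    # Parallel data only makes sense if both files have the same number of lines
    if count_lines(src_file) != count_lines(tgt_file):
        raise ValueError("source and target files have different line counts")

    out_dir = Path(out_dir)
    package_dir = out_dir / metadata["name"]  # assumed layout: one directory per package
    package_dir.mkdir(parents=True, exist_ok=True)

    # Save the two data files as "source" and "target" in the new directory
    shutil.copyfile(src_file, package_dir / "source")
    shutil.copyfile(tgt_file, package_dir / "target")
    (package_dir / "metadata.json").write_text(
        json.dumps(metadata, indent=4), encoding="utf-8"
    )

    archive = out_dir / f"{metadata['name']}.argosdata"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.iterdir()):
            # Keep the directory name inside the archive, as when zipping the directory
            zf.write(path, arcname=f"{package_dir.name}/{path.name}")
    return archive
```

A call might look like build_argosdata("europarl.da", "europarl.en", "out", {"name": "data-europarl-da_en", "type": "data", "from_code": "da", "to_code": "en"}), where every value is a placeholder.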

It’s easiest to prepare your data on a normal DigitalOcean droplet but to actually train the model you will need a GPU.

jefs42 · April 18, 2022, 5:04am

Is there a direct way? I don’t know much about docker…

I have my argosdata files. I cloned the argostrain git repo to the Manjaro installation on my PC so I’d have a GPU to use. I tried to source bin/argos-train-init and it ran through a bunch of stuff but then said /root/env/bin/activate doesn’t exist (which is true…).

Can I not somehow just download the github, adjust the data-index.json to point to my DA->EN argosdata files and tell it to go train?

argosopentech · April 18, 2022, 12:02pm

Using Docker for Argos Train isn’t necessary; it just makes it easier to have a consistent environment.

I’m guessing part of the argos-train-init script is failing. I recommend inspecting the logs or running the script in sections manually and verifying as you go.

jefs42 · April 19, 2022, 11:50pm

Thanks, I think I’m almost there. I have to steal my HTPC back from the kids

Got pretty far, then error about CUDA (and a number of errors after that, but one at a time)

My findings so far regarding argos-train-init:

It has to be a Debian-based OS, or at least one that uses the apt package manager.
It assumes Python virtualenv is installed. This must vary by distribution; I had to install it manually.
torch seems to be flaky, even on a plain LibreTranslate install. I manually installed the latest version (1.11 or so); the init script uninstalled it and installed 1.9xxx, but then argos-train complained about the torch version. It was fine after I manually reinstalled 1.11.

Does it matter which direction (to/from) the Opus data is in? Can I just switch source and target?

Since I’m getting close on this da->en, I figured I’d prepare a no->en as well. There’s a CCMatrix of en->no. I assumed I can still use that but switch which file is source and which is target?


argosopentech · April 20, 2022, 10:44am

The scripts use apt-get to install dependencies so you will need to either use a Debian based distro or install them manually.
The CUDA uninstall and reinstall is trying to fix this issue. I think the issue is specific to vast.ai, so if you’re using your own GPU just try to match the CUDA version used by torch to what you have installed.
I’ve generally followed the convention of English as the source language for data packages and then Argos Train will automatically swap source and target if you’re training in the other direction. This convention isn’t necessary though.

It sounds like you’re making good progress, let me know if you have more questions. Also once you’ve finished with data packages please make a pull request to Argos Train so we can add links to the packages you’ve mirrored and created.

jefs42 · April 21, 2022, 3:42am

Will do.
I have them up there:
https://libretranslate.fortytwo-it.com/argosdata/data-europarl-da_en.argosdata
https://libretranslate.fortytwo-it.com/argosdata/data-ccmatrix-no_en.argosdata

I think they are good, but since I haven’t yet successfully completed a training they’re technically still work-in-progress

Are you saying, taking the da-en one for example, I can just change what I say when I run argos-train?

So in this case da is the source file and en is the target file. But if I run argos-train and say source is en and target is da it will automatically switch which file it uses and I can train both da->en and en->da models using the same argosdata file?

For my own needs, but also to help out, I want to translate Danish to English (I saw a few other requests for Danish in that one thread on GitHub). I’d do English to Danish as well afterward anyway, but if it’s that easy to do both from the same data…

argosopentech · April 21, 2022, 11:09am

So in this case da is the source file and en is the target file. But if I run argos-train and say source is en and target is da it will automatically switch which file it uses and I can train both da->en and en->da models using the same argosdata file?

Correct, here is the code doing that.
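The linked code is not reproduced in this thread. As a minimal sketch of the idea only (not the actual Argos Train implementation; the function name orient_pair and its parameters are hypothetical), swapping comes down to checking which direction the package is stored in and flipping the two sides if needed:

```python
def orient_pair(data_from, data_to, source_lines, target_lines, train_from, train_to):
    """Return (source, target) lines oriented for the requested training direction."""
    if (data_from, data_to) == (train_from, train_to):
        # Package already matches the training direction
        return source_lines, target_lines
    if (data_from, data_to) == (train_to, train_from):
        # Package is stored in the opposite direction: swap the two sides
        return target_lines, source_lines
    raise ValueError(
        f"package {data_from}->{data_to} does not cover {train_from}->{train_to}"
    )


# One da->en package serves both training directions
da_lines, en_lines = ["Hej"], ["Hello"]
print(orient_pair("da", "en", da_lines, en_lines, "da", "en"))
print(orient_pair("da", "en", da_lines, en_lines, "en", "da"))
```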

jefs42 · April 29, 2022, 8:55pm

I have a translate-da_en-1_0.argosmodel (and en_da-1_0 training now). How can I add this to my local libretranslate installation?

I found and did this:

#!/usr/bin/python3

from argostranslate import package, translate
package.install_from_path('translate-da_en-1_0.argosmodel')
installed_languages = translate.get_installed_languages()

for language in installed_languages:
    print(language)

And I do see Danish in the output. I restarted my libretranslate server.

print_r( $translator->Languages() )

however, does not show a [da] => Danish entry.
I figured I should try it out before submitting for inclusion

argosopentech · April 30, 2022, 11:14am

Like you said you can install packages manually with:

import argostranslate.package

argostranslate.package.install_from_path('translate-da_en-1_0.argosmodel')

You should then be able to see the package installed with:

$ argospm list
translate-da_en

I think restarting LibreTranslate should then make it available; this is how we get installed languages in LibreTranslate.

You can also host your own version of the package index with your package and connect to it from LibreTranslate with the environment variable ARGOS_PACKAGE_INDEX=https://yourindex.com/index.json.

jefs42 · May 1, 2022, 2:05am

Oh, I ran my Python code as root, thinking it would install the models “system-wide”. I had to run it as the libretranslate user; it now shows Danish and Norwegian options on the webpage.

Do models stack when installed?

The Danish translation tests with just the Europarl model were a bit funky. But Norwegian translation tests with CCMatrix data seem pretty decent.

I was going to do CCMatrix for Danish then, but argos-train said the data was too large. Can I increase that limit? (I didn’t see anything relevant in config.yml…)

So I did a CCAligned version instead, installed it, and the translations on the web page seem a lot better. Is it now using both the Europarl and the CCAligned data? If I train on other da_en data sources, will that add more info for translations?

argosopentech · May 1, 2022, 10:30am

The model packages are installed to $HOME/.local, so they are installed per user.
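To illustrate why the installing user matters, here is a small sketch that resolves a per-user package directory and lists what is in it. The exact subdirectory (.local/share/argos-translate/packages) is an assumption on my part, as are the function names; only the $HOME/.local part comes from the answer above.

```python
from pathlib import Path


def user_package_dir(home=None):
    """Guess the per-user package directory (assumed layout, not read from Argos Translate)."""
    home = Path(home) if home is not None else Path.home()
    # Assumption: packages sit under ~/.local/share/argos-translate/packages
    return home / ".local" / "share" / "argos-translate" / "packages"


def list_packages(package_dir):
    """Return the names of the package directories found, or [] if the directory is missing."""
    package_dir = Path(package_dir)
    if not package_dir.is_dir():
        return []
    return sorted(p.name for p in package_dir.iterdir() if p.is_dir())


# root and the libretranslate user resolve to different directories,
# so a package installed by one is not visible to the other
print(user_package_dir("/root"))
print(user_package_dir("/home/libretranslate"))
print(list_packages(user_package_dir()))
```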

All your installed Argos Translate models should be able to pivot through English to translate to different languages.

The max data size is configurable in bin/argos-train.

Generally the more data you use the better. The max data setting is meant to exclude very large datasets (normally CCMatrix), which are possibly lower quality.
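A size limit of this kind can be sketched as a simple filter. This is not the code in bin/argos-train: the limit value, the "size" field and its unit, and the dataset sizes below are all made-up placeholders for illustration.

```python
# Hypothetical limit; the real setting lives in bin/argos-train
MAX_DATA_SIZE = 5 * 10**7


def select_datasets(datasets, max_size=MAX_DATA_SIZE):
    """Split dataset names into those kept and those skipped for being too large."""
    kept, skipped = [], []
    for dataset in datasets:
        if dataset["size"] > max_size:
            skipped.append(dataset["name"])
        else:
            kept.append(dataset["name"])
    return kept, skipped


# Sizes are invented for the example, not real dataset statistics
datasets = [
    {"name": "ccmatrix-en_da", "size": 60_000_000},
    {"name": "europarl-en_da", "size": 2_000_000},
]
kept, skipped = select_datasets(datasets)
print("kept:", kept)
print("skipped:", skipped)
```

Raising max_size lets the large dataset through, at the cost of more memory and training time.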

jefs42 · May 1, 2022, 5:13pm

Thanks. I have been creating the en_da and en_no models as well.

What I was checking on was, is my argostranslate now using both the Europarl and the CCAligned models for Danish or just the last model installed? Would adding a MediaWiki model add to those or replace the previous?

If it increases the data used, I may train more to increase accuracy. If it only uses one, then I guess I’ve completed my Danish and Norwegian and can start using them on my client’s sites page content.

It got me wondering because I noticed that the current data-index.json has multiple datasets for en_es, and I wondered why, if it would only use one of the models…

argosopentech · May 1, 2022, 10:08pm

When you train a .argosmodel package Argos Train will attempt to use all of the data for that language pair. If you install multiple packages for the same language pair only one will be used and it won’t combine their data.

jefs42 · May 1, 2022, 10:35pm

Then why multiple argosdata for en_es (and other combos)?

Oh… am I doing this backwards (maybe sideways)?

Instead of training a europarl-en_da by itself, and then a ccaligned-en_da by itself, I should put both in the argos-train JSON file and the training will use both of them to create one argosmodel?

argosopentech · May 1, 2022, 11:00pm

Correct, you should put links to all of the data you want to use for a language pair in data-index.json and it will be used to train one model.
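As a rough sketch of what "all of the data for a language pair" means, the snippet below picks every entry from an index that covers a pair in either direction. The entry fields (name, from_code, to_code, links) and the example.com URLs are assumptions for illustration; check the real data-index.json for the exact schema.

```python
import json


def datasets_for_pair(index_entries, lang_a, lang_b):
    """Return the entries that cover the pair lang_a/lang_b in either direction."""
    wanted = {lang_a, lang_b}
    return [e for e in index_entries if {e["from_code"], e["to_code"]} == wanted]


# Placeholder index content; the URLs are not real download links
index_json = """
[
    {"name": "europarl", "from_code": "da", "to_code": "en",
     "links": ["https://example.com/data-europarl-da_en.argosdata"]},
    {"name": "ccaligned", "from_code": "en", "to_code": "da",
     "links": ["https://example.com/data-ccaligned-en_da.argosdata"]},
    {"name": "ccmatrix", "from_code": "no", "to_code": "en",
     "links": ["https://example.com/data-ccmatrix-no_en.argosdata"]}
]
"""

entries = json.loads(index_json)
for entry in datasets_for_pair(entries, "en", "da"):
    print(entry["name"], entry["links"][0])
```

Both the europarl and ccaligned entries are selected for en/da, so they would feed a single model rather than two separate ones.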

jefs42 · May 10, 2022, 1:46pm

Do I need a “bigger” server? I increased MAX_DATA_SIZE so that it also processes my larger CCMatrix data file, but now the training keeps getting Killed, and I can’t find anything specific in the logs, df, or top as to why.

(env) argosopentech@b8f995ca3e6e:~/argos-train$ argos-train
From code (ISO 639): en
To code (ISO 639): da
From name: English
To name: Danish
Version: 1.3
ccmatrix-en_da
paracrawl-en_da
ccaligned-en_da
europarl-en_da
wikimatrix-en_da
Read data from file
Killed
(env) argosopentech@b8f995ca3e6e:~/argos-train$