Adding an unsupported language

Thanks, I think I’m almost there. I have to steal my HTPC back from the kids :slight_smile:

Got pretty far, then error about CUDA (and a number of errors after that, but one at a time)

My findings so far regarding argos-train-init:

  • it has to be a Debian-based OS, or at least one that uses the apt package manager
  • it assumes python virtualenv is installed. This must vary by distribution; I had to install it manually
  • torch seems to be flaky, even on just the LibreTranslate install. I manually installed the latest (1.11 or so), the init script uninstalled it and installed 1.9.x, but argos-train complained about the torch version. It was fine after I manually re-installed 1.11

Does it matter which to/from direction of the OPUS data I use? Can I just switch source and target?

Since I’m getting close on this da->en, I figured I’d prepare a no->en as well. There’s a CCMatrix of en->no. I assume I can still use that and just switch which file is source and which is target?


  • The scripts use apt-get to install dependencies, so you will need to either use a Debian-based distro or install them manually.
  • The CUDA uninstall and reinstall is trying to fix this issue. I think it’s specific to vast.ai, so if you’re using your own GPU just try to match the CUDA version used by torch to what you have installed (see the snippet after this list).
  • I’ve generally followed the convention of English as the source language for data packages, and Argos Train will automatically swap source and target if you’re training in the other direction. This convention isn’t necessary though.
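A quick way to see which CUDA version your installed torch build expects (just a sanity check, not part of the training scripts):

import torch

# Check the installed torch build and the CUDA version it was built against,
# so you can match it to what is installed on your own GPU machine.
print(torch.__version__)
print(torch.version.cuda)          # CUDA version torch was built with (None for CPU-only builds)
print(torch.cuda.is_available())   # whether torch can actually see a GPU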

It sounds like you’re making good progress, let me know if you have more questions. Also once you’ve finished with data packages please make a pull request to Argos Train so we can add links to the packages you’ve mirrored and created.

Will do.
I have them up there:
https://libretranslate.fortytwo-it.com/argosdata/data-europarl-da_en.argosdata
https://libretranslate.fortytwo-it.com/argosdata/data-ccmatrix-no_en.argosdata

I think they are good, but since I haven’t yet successfully completed a training they’re technically still work-in-progress :slight_smile:

Are you saying, taking the da-en one for example, I can just change what I say when I run argos-train?

So in this case da is the source file and en is the target file. But if I run argos-train and say source is en and target is da it will automatically switch which file it uses and I can train both da->en and en->da models using the same argosdata file?

For my own needs, but also to help out, I want to translate Danish to English. (I saw a few other requests for Danish in that one thread on GitHub.) I’d do an English to Danish as well afterward anyway, but if it’s that easy to do both from the same data…

So in this case da is the source file and en is the target file. But if I run argos-train and say source is en and target is da it will automatically switch which file it uses and I can train both da->en and en->da models using the same argosdata file?

Correct, here is the code doing that.
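Roughly, the idea is something like this (a simplified sketch, not the exact Argos Train code; I’m assuming the package’s two text files are called source and target here):

# Simplified sketch: a data package stores one direction (e.g. da -> en). If the
# requested training direction is the reverse of the package direction, swap
# which file is read as source and which as target.
def resolve_data_files(package_from, package_to, train_from, train_to):
    source_file, target_file = "source", "target"
    if (train_from, train_to) == (package_to, package_from):
        source_file, target_file = target_file, source_file
    return source_file, target_file

# e.g. a da_en package used to train en -> da:
print(resolve_data_files("da", "en", "en", "da"))  # ('target', 'source')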

I have a translate-da_en-1_0.argosmodel (and en_da-1_0 training now). How can I add this to my local libretranslate installation?

I found and did this:

#!/usr/bin/python3

from argostranslate import package, translate
package.install_from_path('translate-da_en-1_0.argosmodel')
installed_languages = translate.get_installed_languages()

for installed_language in installed_languages:
    print(installed_language)

And I do see Danish in the output. I restarted my libretranslate server.

print_r( $translator->Languages() )

however, does not show a [da] => Danish entry.
I figured I should try it out before submitting it for inclusion.


Like you said, you can install packages manually with:

import argostranslate.package

argostranslate.package.install_from_path('translate-da_en-1_0.argosmodel')

You should then be able to see the package installed with:

$ argospm list
translate-da_en

I think restarting LibreTranslate should then make it available; this is how we get installed languages in LibreTranslate.

You can also host your own version of the package index with your package and connect to it from LibreTranslate with the environment variable ARGOS_PACKAGE_INDEX=https://yourindex.com/index.json.
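If you go that route, I believe Argos Translate itself reads the same variable, so something like this should let you check that your index is being picked up (it has to be set before the import, since settings are read at import time):

import os

# Point Argos Translate at a self-hosted package index (set before importing).
os.environ["ARGOS_PACKAGE_INDEX"] = "https://yourindex.com/index.json"

import argostranslate.package

argostranslate.package.update_package_index()        # download the index
for pkg in argostranslate.package.get_available_packages():
    print(pkg.from_code, "->", pkg.to_code)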

Oh, I ran my Python code as root thinking it would install the models “system-wide”. I had to run it as the libretranslate user; now the webpage shows Danish and Norwegian options.

Do models stack when installed?

The Danish translation tests with just the Europarl model were a bit funky. But Norwegian translation tests with CCMatrix data seem pretty decent.

I was going to do CCMatrix for Danish then, but argos-train said the data was too large. Can I increase that limit? (I didn’t see anything obvious in config.yml…)

So I did a CCAligned version instead, installed it, and the translations on the web page seem a lot better. Is it now using both the Europarl and the CCAligned data? If I train other da_en data sources, will that add more information for the translations?


The model packages are installed to $HOME/.local, so they are per-user.

All your installed Argos Translate models should be able to pivot through English to translate to different languages.
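For example, once your da->en and en->no packages are both installed, something like this should pivot Danish to Norwegian through English (using the normal argostranslate API):

import argostranslate.translate

installed_languages = argostranslate.translate.get_installed_languages()
danish = next(lang for lang in installed_languages if lang.code == "da")
norwegian = next(lang for lang in installed_languages if lang.code == "no")

# With da->en and en->no installed, this translation pivots through English.
translation = danish.get_translation(norwegian)
print(translation.translate("Hej, hvordan går det?"))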

The max data size is configurable in bin/argos-train.

Generally the more data you use the better; the max data setting is there to exclude very large datasets (normally CCMatrix), which are possibly lower quality.

Thanks. I have been creating the en_da and en_no models as well.

What I was checking on was: is my argostranslate now using both the Europarl and the CCAligned models for Danish, or just the last model installed? Would adding a MediaWiki model add to those or replace the previous one?

If it increases the data used, I may train more to increase accuracy. If it only uses one, then I guess I’ve completed my Danish and Norwegian and can start using them on the page content of my clients’ sites.

It got me wondering because I noticed the current data-index.json has multiple datasets for en_es, and I wondered why, if it would only use one of the models…


When you train a .argosmodel package Argos Train will attempt to use all of the data for that language pair. If you install multiple packages for the same language pair only one will be used and it won’t combine their data.
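You can check what is installed for each pair with the regular API, for example:

import argostranslate.package

# List every installed package and its language pair; if two packages cover the
# same pair, only one of them is actually used for translation.
for pkg in argostranslate.package.get_installed_packages():
    print(f"{pkg.from_code} -> {pkg.to_code}: {pkg.from_name} to {pkg.to_name}")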

Then why are there multiple argosdata files for en_es (and other combos)?

Oh… am I doing this backwards (maybe sideways)?

Instead of training a europarl-en_da by itself, and then a ccaligned-en_da by itself, I should put both in the argos-train JSON file and the training will use both of them to create one argosmodel?


Correct, you should put links to all of the data you want to use for a language pair in data-index.json and it will be used to train one model.
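For your Danish example that would mean two entries, one per dataset. Roughly like this (a sketch only, with the field names from memory, so double-check them against the existing entries in data-index.json before opening a pull request; the second URL is a placeholder):

import json

# Hypothetical sketch of the two Danish entries for data-index.json.
entries = [
    {
        "name": "europarl-da_en",
        "type": "data",
        "from_code": "da",
        "to_code": "en",
        "links": ["https://libretranslate.fortytwo-it.com/argosdata/data-europarl-da_en.argosdata"],
    },
    {
        "name": "ccaligned-da_en",
        "type": "data",
        "from_code": "da",
        "to_code": "en",
        "links": ["https://example.com/argosdata/data-ccaligned-da_en.argosdata"],
    },
]
print(json.dumps(entries, indent=4))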


Do I need a “bigger” server? I increased the MAX_DATA_SIZE so it is processing my larger CCMatrix data file too, but now the training just keeps getting Killed, and I can’t find anything specific in the logs, df, or top as to why.

(env) argosopentech@b8f995ca3e6e:~/argos-train$ argos-train
From code (ISO 639): en
To code (ISO 639): da
From name: English
To name: Danish
Version: 1.3
ccmatrix-en_da
paracrawl-en_da
ccaligned-en_da
europarl-en_da
wikimatrix-en_da
Read data from file
Killed
(env) argosopentech@b8f995ca3e6e:~/argos-train$

“Killed” normally means you ran out of RAM; you can try adding swap space.

The max data size config is to prevent this problem since the data is loaded into memory during training. I’ve had good results just excluding the largest datasets since they’re likely lower quality and cause problems.

I did some experiments preprocessing CCMatrix and other large datasets across multiple servers, but ended up with worse results. Sometimes the max data size can exclude OpenSubtitles though, which is a very high-quality dataset in my experience.
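If you want a rough sanity check before kicking off a long run, something like this (just a back-of-the-envelope helper, not part of argos-train) compares the size of the data files you plan to use with the machine’s physical RAM:

import os
import sys

# Back-of-the-envelope check (Linux only): combined size of the given data files
# versus physical RAM, since the data is loaded into memory during training.
total_bytes = sum(os.path.getsize(path) for path in sys.argv[1:])
ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
print(f"data: {total_bytes / 1e9:.1f} GB, RAM: {ram_bytes / 1e9:.1f} GB")
if total_bytes > ram_bytes / 2:
    print("Likely too big to load comfortably; drop the largest dataset or add swap.")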


I get this sometimes:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 67: invalid continuation byte

Is there a problem in the data from OPUS directly? Or is it the chain of download from OPUS, adjust, zip, rename, upload to place A, training then downloads to place B… with the data somehow getting slightly mangled in between those various transfers?


I haven’t seen that before, my best guess is a bad character encoding somewhere in the data.
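If you want to track down where it is, a quick scan like this (just a throwaway helper) will print any lines in the extracted source/target files that aren’t valid UTF-8:

import sys

# Report every line that is not valid UTF-8 in the files passed on the command line.
for path in sys.argv[1:]:
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                print(f"{path}:{lineno}: {err}")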

baby character encoding? Not familiar with that term :slight_smile:

Got this with a few data sources from OPUS for Romanian (CCAligned and OpenSubtitles, I want to say), but also with CCMatrix for Norwegian.

Not sure which component is raising it: argos-train, SentencePiece, CTranslate2, etc…

And I know a downloaded zip could get slightly mangled but still be extractable (I usually do test the zip). Then I extract, change some things, re-zip, and upload. Then train downloads it, extracts…

I assume it’s GIGO of some sort. I just don’t know if it starts with the OPUS data itself, or one of the steps between downloading from OPUS and running train. I’m going to try redoing Romanian from scratch, maybe train with just one OPUS data source first, then add a second if that works, then a third, etc.


Woops, *bad character encoding.

I would like to train a model pair of en-ro and ro-en, but since vast.ai is not free, I wanted to ask how many hours you estimate it would take.

It takes around 8 hours to train a model on an RTX 3090.
