Thanks, I think I’m almost there. I have to steal my HTPC back from the kids
Got pretty far, then hit an error about CUDA (and a number of errors after that, but one at a time).
My findings so far regarding argos-train-init:
It has to be a Debian-based OS, or at least one that uses the apt package manager.
It assumes Python virtualenv is installed. This must vary by distribution; I had to install it manually.
Torch seems to be flaky, even on a plain LibreTranslate install. I manually installed the latest (1.11 or so), the init script uninstalled it and installed 1.9.x, but argos-train complained about the torch version. It was fine after I manually reinstalled 1.11.
Does it matter which to/from direction of the OPUS data I use? Can I just swap source and target?
Since I’m getting close on this da->en, I figured I’d prepare a no->en as well. There’s a CCMatrix for en->no; I assume I can still use that and just switch which file is the source and which is the target?
The scripts use apt-get to install dependencies so you will need to either use a Debian based distro or install them manually.
The CUDA uninstall and reinstall is trying to work around an issue I think is specific to vast.ai, so if you’re using your own GPU just try to match the CUDA version used by torch to what you have installed.
I’ve generally followed the convention of English as the source language for data packages and then Argos Train will automatically swap source and target if you’re training in the other direction. This convention isn’t necessary though.
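The swap itself is simple because a data package is just two aligned text files, one sentence per line, so reversing direction only means treating the target file as the source. A minimal sketch of the idea (my own illustration, not the actual Argos Train internals):

```python
def load_pairs(source_path, target_path, reverse=False):
    """Read an aligned parallel corpus; optionally swap direction.

    With reverse=True an en->da data package yields da->en pairs,
    which is all that "switching source and target" amounts to.
    """
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            s, t = s.strip(), t.strip()
            yield (t, s) if reverse else (s, t)
```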
It sounds like you’re making good progress, let me know if you have more questions. Also once you’ve finished with data packages please make a pull request to Argos Train so we can add links to the packages you’ve mirrored and created.
I think they are good, but since I haven’t yet successfully completed a training run they’re technically still work in progress.
Are you saying that, taking the da-en one for example, I can just change what I specify when I run argos-train?
So in this case da is the source file and en is the target file. But if I run argos-train and say the source is en and the target is da, will it automatically switch which file it uses? Could I train both da->en and en->da models from the same .argosdata file?
For my own needs, but also to help out, I want to translate Danish to English. (I saw a few other requests for Danish in that one thread on GitHub.) I’d do an English-to-Danish model afterward anyway, but if it’s that easy to do both from the same data…
I have a translate-da_en-1_0.argosmodel (and en_da-1_0 is training now). How can I add this to my local LibreTranslate installation?
I found and did this:
#!/usr/bin/python3
from argostranslate import package, translate

package.install_from_path('translate-da_en-1_0.argosmodel')

for language in translate.get_installed_languages():
    print(language)
And I do see Danish in the output. I restarted my LibreTranslate server.
print_r( $translator->Languages() )
however, does not show a [da] => Danish entry.
I figured I should try it out before submitting it for inclusion.
You can also host your own version of the package index with your package and connect to it from LibreTranslate with the environment variable ARGOS_PACKAGE_INDEX=https://yourindex.com/index.json.
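For reference, the index is a JSON file listing one object per package, roughly like the entry below. The exact field names should be checked against the official Argos package index; this is an illustrative sketch with a hypothetical URL:

```json
[
  {
    "name": "Danish -> English",
    "from_code": "da",
    "to_code": "en",
    "package_version": "1.0",
    "links": ["https://yourindex.com/translate-da_en-1_0.argosmodel"]
  }
]
```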
Oh, I had run my Python code as root, thinking it would install the models “system-wide”. I had to run it as the libretranslate user; now the webpage shows Danish and Norwegian options.
Do models stack when installed?
The Danish translation tests with just the Europarl model were a bit funky, but the Norwegian translation tests with the CCMatrix data seem pretty decent.
I was going to do CCMatrix for Danish then, but argos-train said the data was too large. Can I increase that limit? (I didn’t see anything in config.yml that looked relevant…)
So I did a CCAligned version instead, installed it, and the translations on the web page seem a lot better. Is it now using both the Europarl and the CCAligned data? If I train on other da_en data sources, will that add more info for translations?
The model packages are installed to $HOME/.local, so they are per user.
All your installed Argos Translate models should be able to pivot through English to translate to different languages.
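Conceptually the pivot is just composition: if there’s no direct da->no model, the text goes da->en and then en->no. A toy sketch of the idea (stand-in translators for illustration only, not the actual Argos Translate internals):

```python
def compose(first_leg, second_leg):
    """Chain two translation functions into one pivoted translation."""
    def pivoted(text):
        return second_leg(first_leg(text))
    return pivoted

# Stand-in "translators" for illustration only
da_to_en = lambda text: {"hej": "hello"}.get(text, text)
en_to_no = lambda text: {"hello": "hei"}.get(text, text)

da_to_no = compose(da_to_en, en_to_no)
```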
The max data size is configurable in bin/argos-train.
Generally, the more data you use the better; the max data setting is there to exclude very large datasets (normally CCMatrix), which are possibly lower quality.
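The check is just a size threshold applied per dataset before loading. A rough sketch of the idea (the actual constant name, value, and location live in bin/argos-train and may differ):

```python
MAX_DATA_SIZE = 5 * 10**9  # illustrative threshold, not the real default

def datasets_to_load(datasets, max_size=MAX_DATA_SIZE):
    """Keep only datasets small enough to fit the training budget.

    Each dataset is assumed to be a dict with a "size" field; raising
    max_size lets larger corpora like CCMatrix through.
    """
    return [d for d in datasets if d["size"] <= max_size]
```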
Thanks. I have been creating the en_da and en_no models as well.
What I was checking on was: is my argostranslate now using both the Europarl and the CCAligned models for Danish, or just the last model installed? Would adding a MediaWiki model add to those or replace the previous ones?
If it increases the data used, I may train more to increase accuracy. If it only uses one, then I guess I’ve completed my Danish and Norwegian models and can start using them on my client’s sites’ page content.
It got me wondering because I noticed the current data-index.json has multiple datasets for en_es, and I wondered why, if only one of the models would be used…
When you train a .argosmodel package Argos Train will attempt to use all of the data for that language pair. If you install multiple packages for the same language pair only one will be used and it won’t combine their data.
Then why are there multiple .argosdata files for en_es (and other combos)?
Oh… am I doing this backwards (maybe sideways)?
Instead of training a europarl-en_da by itself and then a ccaligned-en_da by itself, should I put both in the argos-train JSON file so the training uses both of them to create one .argosmodel?
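In other words, something like this sketch of what I now think happens (my assumption, not verified against the code): every dataset listed for the pair gets concatenated into one training corpus, and a single model is trained from the combined data.

```python
def combined_corpus(datasets):
    """Concatenate the sentence pairs of every listed dataset.

    Assumes each dataset is a dict with a "pairs" list; the real
    argos-train format differs, this just illustrates the combining.
    """
    pairs = []
    for dataset in datasets:
        pairs.extend(dataset["pairs"])
    return pairs
```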
Do I need a “bigger” server? I increased MAX_DATA_SIZE so it is processing my larger CCMatrix data file too, but now the training just keeps getting Killed, and I can’t find anything in the logs, df, or top that says why.
(env) argosopentech@b8f995ca3e6e:~/argos-train$ argos-train
From code (ISO 639): en
To code (ISO 639): da
From name: English
To name: Danish
Version: 1.3
ccmatrix-en_da
paracrawl-en_da
ccaligned-en_da
europarl-en_da
wikimatrix-en_da
Read data from file
Killed
(env) argosopentech@b8f995ca3e6e:~/argos-train$
“Killed” normally means you ran out of RAM, you can try adding swap space.
The max data size config is meant to prevent this problem, since the data is loaded into memory during training. I’ve had good results just excluding the largest datasets, since they’re likely lower quality and cause problems.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 67: invalid continuation byte
Is there a problem in the data from OPUS directly? Or does the pipeline (download from OPUS, adjust, make a zip, rename, upload to place A, training then downloads to place B…) somehow slightly mangle the data between those various transfers?
“baby character encoding”? I’m not familiar with that term.
I got this with a few data sources from OPUS for Romanian (I want to say CCAligned and OpenSubtitles), but also with CCMatrix for Norwegian.
I’m not sure what’s raising it: argos-train, SentencePiece, CTranslate2, etc.
And I know a downloaded zip could get slightly mangled but still be extractable (I usually do test the zip). Then I extract, change some things, re-zip, and upload; then training downloads it and extracts it…
I assume it’s GIGO of some sort; I just don’t know if it starts with the OPUS data itself or with one of the steps between downloading from OPUS and running training. I’m going to try redoing Romanian from scratch: maybe train with just one OPUS source first, then add a second if that works, then a third, etc.
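If it helps anyone else hitting this, one way to narrow down where the mangling happens is to scan a corpus file for invalid UTF-8 at each stage of the pipeline and count the bad lines. A quick sketch (my own helper, not part of argos-train):

```python
def strip_invalid_utf8(in_path, out_path):
    """Copy a corpus file, dropping lines that aren't valid UTF-8.

    Returns the number of lines dropped; running this on the file as
    downloaded from OPUS versus after re-zipping shows which step
    introduced the bad bytes.
    """
    dropped = 0
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        for raw in fin:
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                dropped += 1
                continue
            fout.write(raw)
    return dropped
```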