Training an Icelandic model

Icelandic appears to be pretty popular on the tracker, with about 21 combined votes.

I found a fairly large Icelandic<->English parallel corpus maintained by the Icelandic language institute: parice.arnastofnun.is


I’d love to add Icelandic support. I’ve thought about training a model for Icelandic before but haven’t gotten around to it. I visited Iceland last year.

Thanks for the link, this site looks really cool. There isn’t too much overlap with the English-Icelandic data on Opus, and I think there’s enough data to train a good model. Plus it includes translations of the Sagas, which would be neat to include in the training data.

Actually, this corpus is already available on OPUS: https://opus.nlpl.eu/results/is&en/corpus-result-table
You can train the best model using the following sources:
CCMatrix or NLLB,
MultiHPLT (the languages are reversed in HPLT; use fasttext, as described below, to filter out the English inserts in Icelandic sentences),
Parice,
XLEnt (this one as a dictionary, do not filter any content),
MaCoCu,
TildeMODEL (you might want to use fasttext on this one, depending on the language pair),
ELRC-5067-SciPar (if you want scientific terminology),
WikiMatrix (use fasttext),
ELRC-www.norden.org and all ELRC corpora that feature official Icelandic sources, starting with
ELRC-4324-Government_Offices_I
ELRC-4295-malfong.is
ELRC-4334-Rkiskaup_2020
ELRC-4338-University_Iceland, and so on.

For the fasttext filtering, I have opened a PR in Locomotive. You can adapt the script or take the PR wholesale; it has other useful features.
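
For anyone who wants a rough idea of what this kind of filtering looks like, here is a minimal sketch. It is not the script from the Locomotive PR. It assumes the `fasttext` Python package and a language-identification model file (for example `lid.176.bin`) downloaded separately; the function names and the 0.5 confidence threshold are made up for illustration.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A predictor maps a sentence to (language code, confidence).
Predictor = Callable[[str], Tuple[str, float]]


def make_fasttext_predictor(model_path: str) -> Predictor:
    """Wrap a fastText language-ID model (assumed to be at model_path)."""
    import fasttext  # imported lazily so the filter itself has no hard dependency

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Tuple[str, float]:
        # fastText expects a single line; labels look like "__label__en".
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    predict: Predictor,
    src_lang: str = "en",
    tgt_lang: str = "is",
    min_conf: float = 0.5,  # hypothetical threshold, tune per corpus
) -> Iterator[Tuple[str, str]]:
    """Keep only pairs where both sides are detected as the expected language."""
    for src, tgt in pairs:
        s_lang, s_conf = predict(src)
        t_lang, t_conf = predict(tgt)
        if (
            s_lang == src_lang
            and t_lang == tgt_lang
            and s_conf >= min_conf
            and t_conf >= min_conf
        ):
            yield src, tgt
```

To use it on a real corpus you would do something like `filter_pairs(zip(en_lines, is_lines), make_fasttext_predictor("lid.176.bin"))`. An Icelandic line that is actually an English insert gets detected as `en` and the whole pair is dropped. For dictionary-style corpora such as XLEnt, skip the filter entirely, since short entries are hard to identify reliably.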

I uploaded the Opus Icelandic data to data.argosopentech.com. I’ll try to do the parice.arnastofnun.is data too when I get a chance. This lets you train a model automatically with Argos Train.