According to the test results, the final models scored as follows:
EN_KAB
LT1.0 - 9.4 BLEU (0.6606 COMET-22)
NLLB-200 3.3B - 6.9 BLEU
opus2m-2020-08-01 - 2.50 BLEU
KAB_EN
LT1.0 - 19.9 BLEU (0.6289 COMET-22)
NLLB-200 1.3B distilled - 22.8 BLEU
tatoeba-lowest/opus-2020-06-15 - 2.3 BLEU
The improvements ranged from 0.9 to 1.8 BLEU.
Unfortunately, this is the best result that has been obtained so far.
I think one effective option would be to collect corrected translations from the community through a translation editing/feedback form in the LibreOffice interface.
I am also posting the entire collected dataset:
kab.all.raw.zip - text only in Kabyle, divided into sentences (needs filtering). Can be used for back translation.
“en-kab.zip\ber” - pure en-ber dataset from various sources, mostly Tatoeba
“en-kab.zip\translit” - en-kab dataset with the Kabyle texts written in the English Latin alphabet
“en-kab.zip\original” - pure en-kab dataset from different sources
“en-kab.zip\dic” - en-kab dictionary
“en-kab.zip\bt1” - back translation from kab to en using NLLB
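Since the raw Kabyle text still needs filtering before it can be used for back translation, here is a minimal sketch of what a first filtering pass might look like. This is not the cleanup that was actually applied to the dataset; the function name `filter_sentences` and the thresholds (`min_words`, `max_words`, `min_alpha_ratio`) are assumptions chosen only for illustration and would need tuning on the real data.

```python
import unicodedata


def filter_sentences(lines, min_words=3, max_words=80, min_alpha_ratio=0.7):
    """Yield normalized, de-duplicated sentences that pass simple heuristics.

    All thresholds are illustrative assumptions, not tuned values.
    """
    seen = set()
    for line in lines:
        # Normalize Unicode so visually identical strings compare equal.
        text = unicodedata.normalize("NFC", line)
        # Collapse runs of whitespace and trim the ends.
        text = " ".join(text.split())
        if not text:
            continue
        # Drop sentences that are too short or too long to be useful.
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):
            continue
        # Drop lines that are mostly digits, punctuation, or markup.
        alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
        if alpha / len(text) < min_alpha_ratio:
            continue
        # Case-insensitive de-duplication; the first occurrence wins.
        key = text.casefold()
        if key in seen:
            continue
        seen.add(key)
        yield text
```

To run it over the extracted text file, something like `filter_sentences(open(path, encoding="utf-8"))` should work, where `path` points to wherever the archive was unpacked. Language identification, which would be needed to remove non-Kabyle lines, is deliberately left out of this sketch.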
I cleaned up the Kabyle text using this wonderful repository:
Model files and dataset: