Help wanted: Kabyle language model for Argos Translate

Hello everyone,

I tried to generate a Kabyle language model for Argos Translate and LibreTranslate using Locomotive and opus://Tatoeba, but I failed.

The Kabyle language is well resourced on Tatoeba and our community is contributing there.

Link: Number of sentences per language - Tatoeba

For now, there is no really good service on the Web offering usable translation. I tried NLLB-200 for Kabyle, for example through MinT as deployed by Wikipedia, but it is far from generating usable translations from English or French into Kabyle.

Someone made a website offering Kabyle translation, but its license is not clear, even though the data and corpus come from Tatoeba.

Link: Tasuqilt

Translations from that website are okay and usable. This is why I’m looking for the help of our LibreTranslate community to generate even a beta model for Kabyle so we can test it.

Tomorrow, January 12th, we Berber people across northern Africa are celebrating our new year, called Yennayer. So happy new year everyone :slight_smile: and best wishes.

Have a nice day.

4 Likes

That’s a great effort! My guess would be that the Tatoeba dataset doesn’t have sufficient sentences to train a good model.

2 Likes

There really isn’t a lot of data. If you could provide links to resources in the Kabyle language, then we could try using iterative back-translation to create a high-quality synthetic corpus.

1 Like

So I did a little research; maybe the following information will be useful:
OPUS has the following EN-KAB text corpora:

  1. Tatoeba v2023-04-12 - 30k
  2. bible-uedin v1 - 15k
  3. NLLB v1 - 4.1M

However, you can download more recent data from the Tatoeba website; I ended up with about 85k sentence pairs. You can download the already processed archive here: en-kab.txt.tatoeba.zip.

I also previously opened and looked through the NLLB v1 corpus. It is large, but it needs serious filtering; for a start, many of the supposedly Kabyle sentences are actually in English.

The OPUS website also has monolingual Kabyle corpora, about 1M sentences in total (for example here and here), which can already be used for back translation.

To summarize, the general plan could be like this:

  1. Thoroughly check and filter the NLLB v1 corpus (a native speaker’s opinion on the most common errors in it is needed; see the filtering sketch below)
  2. Pre-train the model on a large filtered NLLB v1 corpus
  3. Fine-tune on the added high-quality translations such as Tatoeba (give them more weight and reduce the weight of NLLB v1)
    Optional:
  4. Using the obtained models, create synthetic back-translation corpora from the monolingual data and train new models on the extended corpus.

The amount of work is quite large to obtain high-quality models.
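
To make step 1 concrete, here is a minimal sketch in Python of the kind of rough pre-filtering I have in mind for the NLLB v1 corpus. The file names and the stopword-overlap threshold are only assumptions; a proper pass would rather use a language-ID model (for example fastText lid.176) plus a native speaker’s review of what survives.

#!/usr/bin/env python3
"""Rough filter for a parallel en-kab corpus (step 1 of the plan above).

Only a sketch: it drops pairs whose Kabyle side looks like plain English,
using a naive stopword-overlap heuristic. File names and the threshold
are assumptions.
"""

# Common English function words; if too many appear on the "Kabyle" side,
# the pair is probably mis-aligned or left untranslated.
EN_STOPWORDS = {
    "the", "and", "of", "to", "in", "is", "that", "it", "for", "was",
    "with", "as", "his", "her", "they", "you", "are", "this", "have",
}

def looks_english(sentence: str, threshold: float = 0.25) -> bool:
    """Return True if a large share of the tokens are English stopwords."""
    tokens = sentence.lower().split()
    if not tokens:
        return True  # also drop empty lines
    hits = sum(1 for t in tokens if t in EN_STOPWORDS)
    return hits / len(tokens) >= threshold

def filter_corpus(src_path: str, tgt_path: str, out_prefix: str) -> None:
    """Keep only the pairs whose target side does not look English."""
    kept = dropped = 0
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_prefix + ".en", "w", encoding="utf-8") as src_out, \
         open(out_prefix + ".kab", "w", encoding="utf-8") as tgt_out:
        for en_line, kab_line in zip(src, tgt):
            if looks_english(kab_line):
                dropped += 1
                continue
            src_out.write(en_line)
            tgt_out.write(kab_line)
            kept += 1
    print(f"kept {kept} pairs, dropped {dropped}")

if __name__ == "__main__":
    # Hypothetical file layout for the NLLB v1 corpus.
    filter_corpus("nllb.en", "nllb.kab", "nllb.filtered")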

4 Likes

Hi @lynxpda

Thank you very much! I’m a native speaker :slight_smile: I’ll try to follow your advice.
See ya soon.

2 Likes

Great! If you can evaluate the quality of the NLLB v1 data, provide a summary, and wait until the end of February, then perhaps I can try to train the model. It would be interesting to try training a model for a low-resource language.

2 Likes

An alpha version of translation into Kabyle is currently in progress.

@butterflyoffire, could you please evaluate how bad the translation is (this is the first iteration) and what the typical errors are? If it is acceptable, there will be several more iterations of improvement.

English sentences:

Hey, what's up? Long time no see!
I'm so tired, I could sleep for a week.
Can you pass me the remote? I want to change the channel.
Did you hear about the party tonight? It's gonna be epic!
I can't believe she actually said that. Talk about awkward!
Let's grab a bite to eat. I'm starving!
Are you free this weekend? We should hang out.
I'm sorry, I didn't catch what you said. Can you repeat that?
I'm running late, can you give me a ride?
I'm not feeling well, I think I caught a cold.
I'm so excited for the concert tomorrow. It's gonna be amazing!
Can you believe how fast time flies? It feels like yesterday was New Year's.
I can't decide what to wear. Can you help me pick an outfit?
Do you want to go for a walk? The weather is so nice today.
I'm really craving some ice cream. Let's go get some.

Sentences in Kabyle:

Ihi d acu i d-yettbanen? cciṭan mačči d ayen ara teẓreḍ.
Ɛyiɣ mliḥ, zemreɣ ad gneɣ ssmana.
Tzemreḍ ad iyi-d-tesɛeddiḍ lebεid? Bɣiɣ ad beddleɣ tikli
Tesliḍ i tmeɣra tameddit-a? Ad tili d tigejdit!
Ur umineɣ ara d akken tenna-d tidet. Mmeslay ɣef iɣeblan!
As-d ad nečč seksu. Mmuteɣ deg laẓ.
Ad testufuḍ tagara n ddurt-a? Ilaq ad nteffeɣ.
Suref-iyi, ur d-ṭṭifeɣ ara deg wayen i d-tennam. Tzemreḍ ad s-tɛiwdeḍ?
Lemmer ad ternuḍ ad iyi tɛiwneḍ ?
Ur ufiɣ ara iman-iw, ṭṭfeɣ-d asemmiḍ.
Lliɣ yernu mazal bɣiɣ ad iliɣ azekka nni. Aya ad yili yessewham!
Tzemreḍ ad tamneḍ s tɣawla deg yizan? Ilindi amzun d iḍelli !
Ur zmireɣ ara ad tt-fruɣ d acu ara lseɣ. Tzemreḍ ad iyi-tɛawneḍ ad d-tawiḍ llebsa ifazen?
Tebɣam ad teddum ad teddum? Yelha mliḥ lḥal ass-a.
Aql-i ččiɣ cwiṭ n uyefki. Ad nruḥ ad d-nawi kra.
4 Likes

I also encountered several difficulties during training:

1. Model en_kab:

Error when starting training:

lynx@lynx-B650M-PG-Riptide:/mnt/DeepLearning/Locomotive$ ./start.sh
Training English --> Kabyle (1.0)
Sources: 5
 - file://dataset/en-kab/nllb (hash:e74992b | weight:59)
 - file://dataset/en-kab/tatoeba (hash:72e5988 | weight:5)
 - file://dataset/en-kab/dic (hash:52fccc0 | weight:1)
 - file://dataset/en-kab/bible (hash:afc1cb2 | weight:1)
 - file://dataset/en-kab/bt (hash:bb5f4f3 | weight:3)
Traceback (most recent call last):
  File "/mnt/DeepLearning/Locomotive/train.py", line 202, in <module>
    extract_flores_val(config['from']['code'], config['to']['code'], run_dir, da
taset="devtest")
  File "/mnt/DeepLearning/Locomotive/data.py", line 190, in extract_flores_val
    tgt_val = get_flores(tgt_code, dataset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/data.py", line 180, in get_flores
    source = os.path.join(flores_dataset, nllb_langs[lang_code] + f".{dataset}")
                                          ~~~~~~~~~~^^^^^^^^^^^
KeyError: 'kab'
Done!

Here I simply added the language to the data.py mapping manually (see the sketch below).
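
For reference, the fix amounted to a one-line addition. The exact contents of the mapping depend on the Locomotive version; the traceback suggests nllb_langs maps ISO codes to FLORES-200 file names, and kab_Latn is the FLORES-200/NLLB-200 code for Kabyle, so something like this:

# Sketch of the manual fix in Locomotive's data.py: map the ISO code "kab"
# to the FLORES-200 dataset name so get_flores() can resolve it.
nllb_langs["kab"] = "kab_Latn"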

2. Model kab_en:

There is no trained stanza model for Kabyle.

3. The low quality of the training data (and possibly also the test data) makes it difficult to monitor the training process.

In principle, all of these problems can be solved.

2 Likes

The easiest fix would be to use a Stanza model for another similar language. Since Kabyle is written using the Latin alphabet [1], other models for languages using the same character set, like English or Turkish, might work. The Stanza model only needs to recognize sentence boundaries, not translate, so as long as the basic structure of sentences looks similar between the two languages it will probably work.
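
As a quick illustration of that workaround, here is a sketch that runs the English Stanza tokenizer on Kabyle text purely for sentence splitting. The sample sentences are taken from the test output above; whether the boundaries come out right for Kabyle would still need a native speaker’s check.

import stanza

# Use the English tokenize model only for sentence boundary detection on
# Kabyle text (also written in the Latin alphabet).
stanza.download("en", processors="tokenize")
nlp = stanza.Pipeline("en", processors="tokenize")

# Sample Kabyle sentences, copied from the test output earlier in this thread.
text = "Ɛyiɣ mliḥ, zemreɣ ad gneɣ ssmana. Yelha mliḥ lḥal ass-a."

doc = nlp(text)
for sentence in doc.sentences:
    print(sentence.text)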

I’ve been exploring switching from Stanza to Spacy but it looks like Spacy doesn’t support Kabyle either (Spacy supports these languages: Catalan, Chinese, Croatian, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Japanese, Korean, Lithuanian, Macedonian, Multi-language, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Ukrainian).

I’ve also done some experiments on using neural networks trained with OpenNMT-py and run on CTranslate2 to do sentence boundary detection in Argos Translate 2 Beta. The benefit of this approach is that it lets you use one software stack for both splitting sentences and translating. However, as you can see in my experimental results, this approach performs much worse than libraries designed specifically for this type of text processing, like Stanza and Spacy.

In the Argos Translate 2 Beta code I have added better configuration options so that you can deactivate sentence boundary detection. This would let you translate short strings of text without needing to do any sentence splitting.

2 Likes

Thanks for the valuable advice!
Yes, I also considered two options: use a Stanza model for another language written in the Latin alphabet (most likely English, since there is no Stanza model for any language of this family), or train a Stanza model myself (if the first option cannot be used or does not work correctly).

In general, training models for low-resource languages is difficult and largely uncharted territory, unfortunately.

If something works out with this model, as a further improvement I was considering training a multilingual model for the whole language family to benefit from transfer learning, but I’m not yet sure how best to implement this using Locomotive.

2 Likes

Hi lynxpda, maybe this helps:
450k mixed Kabyle sentences translated into English

2 Likes

Hello,

For your information, Muḥend Belkacem has made repos on GitHub gathering clean and uncleaned corpora. He started this work years ago.

Maybe this can help. I contacted him by mail to invite him to join this topic and am waiting for his answer, as he is an expert leading different localization initiatives in FLOSS, starting with Mozilla.

Please check his repos:

1 Corpus GitHub - MohammedBelkacem/corpus-kab: Tuddar, ismawen d imeḍqan

2 Tools GitHub - MohammedBelkacem/KabyleNLP: Natural language processing for the kabyle language

In the repos you can find not only tools for corpus cleaning but also monolingual corpora of varying quality.

Hope he will join the community :slight_smile:

Regards,

3 Likes

Thanks, a nice resource! But at first glance it does not use the Kabyle alphabet; it is a transliteration into plain English/Latin characters.
We could try to reverse-transliterate it into Kabyle, or use it for kab_en training by mixing corpora.

Of course, any increase in the size of the corpus will be useful, thank you!
I’ll try to process it and add it to the corpus for back translation.
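
For what it’s worth, here is a sketch of how such back-translations could be generated with the kab_en alpha model through the Argos Translate Python API; the model file name, the "kab" language code, and the file paths are assumptions, not the exact pipeline I use.

import argostranslate.package
import argostranslate.translate

# Generate synthetic en-kab pairs by back-translating monolingual Kabyle
# sentences with the alpha kab_en model posted later in this thread.
argostranslate.package.install_from_path("translate-kab_en-1_0.argosmodel")

langs = {l.code: l for l in argostranslate.translate.get_installed_languages()}
kab_to_en = langs["kab"].get_translation(langs["en"])

with open("kab.monolingual.txt", encoding="utf-8") as f_in, \
     open("bt.en", "w", encoding="utf-8") as f_en, \
     open("bt.kab", "w", encoding="utf-8") as f_kab:
    for line in f_in:
        kab = line.strip()
        if not kab:
            continue
        # Synthetic English source, human-written Kabyle target.
        f_en.write(kab_to_en.translate(kab) + "\n")
        f_kab.write(kab + "\n")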

2 Likes

If you’re interested, I’m posting early alpha versions of the models for testing:

translate-kab_en-1_0.argosmodel

translate-en_kab-1_0.argosmodel

So far the results are as follows:

EN_KAB
8.5 BLEU
(NLLB-200 3.3B - 6.9 BLEU,
opus2m-2020-08-01 - 2.50 BLEU)

KAB_EN
18.1 BLEU
(NLLB-200 1.3B distilled - 22.8 BLEU
tatoeba-lowest/opus-2020-06-15 - 2.3 BLEU)

I think that after 2 more iterations of back translation I will be able to post the final version (around March 04).

P.S. Language autodetection does not work; the language must be set manually.

4 Likes

Hello @lynxpda :slight_smile:

Thank you very much, we really appreciate your work!
I’ll test the model tonight and I’ll ping our community for more comments :slight_smile:

Thank you very much, once again.
Friendly,

3 Likes

Great, I’ll look forward to the results.
Don’t expect too much from these models. I still hope to improve them, but I’m afraid the improvements won’t be revolutionary; about +2 to 5 BLEU would be a good result.
Unfortunately, there is a lack of data diversity across domains like science, economics, etc., but I’m sure there will be much more over time!
And one more thing: the KAB_EN model, as a side effect, turned out to be multilingual; it should understand transliteration (thanks to @mehdi’s dataset) and Berber.
Language selection follows the NLLB-200-style codes, but this still needs to be checked.

Here is an example configuration file:

{
    "from": {
        "name": "English",
        "code": "en"
    },
    "to": {
        "name": "Kabyle",
        "code": "kab"
    },
    "version": "1.0",
    "sources": [
       {"source": "file://dataset/en-kab/original", "weight": 70},
       {"source": "file://dataset/en-kab/dic", "weight": 15},
       {"source": "file://dataset/en-kab/bt1", "weight": 350, "src_prefix": "<BT>"},
       {"source": "file://dataset/en-kab/bt2", "weight": 350, "src_prefix": "<BT>"},
       {"source": "file://dataset/en-kab/ber", "weight": 50, "src_prefix": "ber_Latn", "tgt_prefix": "eng_Latn"},
       {"source": "file://dataset/en-kab/translit", "weight": 100, "src_prefix": "<TL>"}
    ],
    "batch_size": 2000,
    "accum_count": 27,
    "warmup_steps": 8000,
    "train_steps": 17000,
    "learning_rate": 1,
    "vocab_size": 32000,
    "world_size": 2,
    "gpu_ranks": [0,1],
    "avg_checkpoints": 3,
    "src_seq_length": 200,
    "tgt_seq_length": 200,
    "enc_layers": 6,
    "dec_layers": 6,
    "heads": 8,
    "hidden_size": 512,
    "word_vec_size": 512,
    "transformer_ff": 4096,
    "save_checkpoint_steps": 1000,
    "valid_steps": 2500,
    "num_workers": 6,
    "optim": "pagedadamw8bit",
    "valid_batch_size": 64,
    "bucket_size": 32768,
    "early_stopping": 0,
    "dropout": 0.3
}
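
If it helps to see what the src_prefix fields amount to, here is a standalone sketch (not the actual Locomotive code) of prepending a tag token such as <BT>, or an NLLB-style language code, to every source line so the model can tell synthetic or multilingual data apart from the original pairs. The file paths are made up for the example.

def add_src_prefix(in_path: str, out_path: str, prefix: str) -> None:
    """Prepend a tag token (e.g. "<BT>" or "ber_Latn") to every source line,
    the way the src_prefix fields in the config above are meant to mark
    back-translated or multilingual data. A sketch only, not the actual
    Locomotive implementation."""
    with open(in_path, encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            f_out.write(f"{prefix} {line.lstrip()}")

# Example: tag back-translated data so the model can distinguish it from
# human-translated pairs at training time (hypothetical file paths).
add_src_prefix("dataset/en-kab/bt1/src.en",
               "dataset/en-kab/bt1/src.tagged.en",
               "<BT>")
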
4 Likes

Here is an example of a translation from Wikipedia in Kabyle:

Kabyle:

Urigami d awal i d-yekkan seg tutlayt tajapunit. Yesdukel amyag [oru] = Ḍfes akked yisem [kami] = Akaɣeḍ. Zik, deg tmurt n Japun, ttinin-as i kra yellan d aḍfas : [Urikata neɣ Urigata ] = Taɣara n uḍfas. S tmaziɣt, nesmizeɣ awal ama deg tira ama deg ususru. Nerra [u] deg webdil n [o] acku asekkil ‘’o’’ ur yedda ara akked yilugan n tjerrumt n tutlayt-nneɣ mačči am usekkil ‘’P’’ yufa amḍiq-is . Asekkil ‘’o’’ ur d yemmezg ara akked usekkil ‘’u’’, yettḥaraf-it. Rnu i waya s, yessefk fell-aɣ ad ad nxemmem i usuddem n wawalen nniḍen ara neḥwiǧ sya ɣer sdat.
Amedya n wurigami : Afrux s ukaɣeḍ.

Akka ihi i as-d-nessumer isem n tẓuri-a s tmaziɣt : Urigami. D isem asuf ur yesɛa asegget. Ma nsers-it deg waddad amaruz, ad t-naru akka : wurigami. Asmizzeɣ n yisem n tẓuri isashel-d aslugen-is. Seg neslugen isem n tẓuri, yeldi-d ubrid s asuddem n twacult n wawalen i neḥwaǧ, i yessefken ad ilin deg tutlayt-nneɣ.

English:

Urigami is a word derived from the Japanese language. It combines the verb [oru] = fold with the name [kami] = The tail. In the past, in Japan, it was called for all the treatments: [Urikata or Urigata ] = The quality of treatment. In Berber, I speak both in writing and pronunciation. We made [u] in exchange for [o] because the suffix ‘o’ does not go with the grammar rules of our language rather than the suffix ‘P’ found its place . The suffix ‘o’ is not compatible with the suffix ‘u’, it protects it. Additionally, we have to think about setting up other words that we will need in the future.
An example of origami: A bird with paper.

So we suggested the name of this art in Berber: Urigami. It’s a souvenir name that has no capital. If we put it in the standard, we write it like this: origami. The beauty of the name of art facilitates its rule. From the rule of the name of art, the road was opened with the adaptation of the family of words we need, which should be in our language.

If the translation is even slightly correct and I understand it correctly, then the written language is in an active stage of its development.

P.S. @pierotofy
It may be useful: I attach a modified train.py in which I added the ability to use prefixes for training multilingual models or for marking back-translated data with special tokens. It is perhaps not written very optimally, but it worked for testing the hypothesis.
(lines 87-97, 118-128, 207-238, 242, 322-331)

2 Likes

Argos Translate has support for target prefixes (configured in the metadata.json file) too:

1 Like

According to the test results, the final models showed the following result:

EN_KAB
LT1.0 - 9.4 BLEU (0.6606 COMET-22)
NLLB-200 3.3B - 6.9 BLEU
opus2m-2020-08-01 - 2.50 BLEU

KAB_EN
LT1.0 - 19.9 BLEU (0.6289 COMET-22)
NLLB-200 1.3B distilled - 22.8 BLEU
tatoeba-lowest/opus-2020-06-15 - 2.3 BLEU

The improvements ranged from 0.9 to 1.8 BLEU.
Unfortunately, this is the best result that has been obtained so far.
I think one of the effective options could be collecting corrected translations from the community through a translation editing/feedback form in the LibreTranslate interface.

I also post the entire collected dataset:

kab.all.raw.zip - text only in Kabyle, divided into sentences (needs filtering). Can be used for back translation.
“en-kab.zip\ber” - pure en-ber dataset from various sources, mostly Tatoeba
“en-kab.zip\translit” - en-kab dataset, texts in Kabyle written in English Latin alphabet
“en-kab.zip\original” - pure en-kab dataset from different sources
“en-kab.zip\dic” - en-kab dictionary
“en-kab.zip\bt1” - back translation from kab to en using NLLB

I cleaned up the Kabyle text using this wonderful repository:

Model files and dataset:

3 Likes