New argos model en_eu

urtzai · September 20, 2024, 9:43am

Hi,

I just trained a new Argos model for Basque following this training tutorial and using all the available Opus project en > eu datasets (approx. 12M sentences).

I know that @argosopentech is also interested and working on it, but I wanted to make a first attempt. I also tested the model on some large Wikipedia texts and the results are quite decent. So the goal is to make this first attempt available in LibreTranslate.

If this model is accepted, I will train the eu > en Argos Translate model. The next step will be to translate LibreTranslate itself.

Here is the model: https://file.io/zbq8KBpihZlg

argosopentech · September 20, 2024, 10:43am

Awesome work! I’ll take a look at the model.

I’ve also been working on Basque and finished training an en->eu model last night. I’ll compare our two models but I’m guessing they’re pretty similar.

I uploaded Basque data to Argos Train (Add Basque data · argosopentech/argos-train@f1d0436 · GitHub) (including Wiktionary data) so it should be pretty easy to experiment with training Basque models going forward.

argosopentech · September 20, 2024, 10:44am

If you’re a Basque speaker then translating the LibreTranslate interface using Weblate would be very helpful.

urtzai · September 20, 2024, 11:29am

Yep, I am, and that’s what I want to do. Maybe I can use these models to auto-translate LibreTranslate and check the performance.

And how do you compare two models objectively (because I suppose that you don’t know Basque)?

argosopentech · September 20, 2024, 11:44am

I translate English → Basque and then back translate Basque → English using Google Translate. Then I compare the back translated text to the original text.
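As a rough illustration of the comparison step, here is a minimal sketch that scores how close a back-translated text is to the original using only Python's standard library. The function names are made up for this example, the sample back-translations are invented placeholders (not real model output), and a word-level `difflib` ratio is just one possible similarity measure, not necessarily how the comparison was actually done.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split on whitespace so casing/spacing differences are ignored."""
    return text.lower().split()


def round_trip_similarity(original: str, back_translated: str) -> float:
    """Return a 0..1 word-level similarity between the original and the back translation."""
    matcher = SequenceMatcher(None, normalize(original), normalize(back_translated))
    return matcher.ratio()


# Placeholder texts: in practice these would be the English source and the
# English text obtained after en -> eu -> en for each model being compared.
original = "Basque is classified as a language isolate."
back_a = "Basque is classified as a language isolate."
back_b = "Basque is considered an isolated language."

score_a = round_trip_similarity(original, back_a)
score_b = round_trip_similarity(original, back_b)
print(f"model A: {score_a:.2f}")
print(f"model B: {score_b:.2f}")
```

A higher score only means the round trip preserved more of the wording; errors introduced by the back-translation engine itself are counted too, so it is a rough signal rather than a real quality measure.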

Here are some examples of text translated with my model:

Basque is the only surviving descendant of the Paleo-European languages, predating the arrival of speakers of the Indo-European languages that dominate the European continent today. Basque is spoken by the Basques and other residents of the Basque Country, a region that straddles the westernmost Pyrenees in adjacent parts of northern Spain and southwestern France. Basque is classified as a language isolate. The Basques are indigenous to and primarily inhabit the Basque Country.[7] The Basque language is spoken by 806,000 Basques in all territories. Of these, 93.7% (756,000) are in the Spanish area of the Basque Country and the remaining 6.3% (50,000) are in the French portion.[1]

Euskara da Paleo-europar hizkuntzen ondorengo bakarra, gaur egun Europako kontinentea menderatzen duten hizkuntza indoeuroparren hiztunak iritsi aurretik. Euskaldunek eta Euskal Herriko beste biztanle batzuek euskaraz hitz egiten dute, mendebaldeko Pirinioak zeharkatzen dituen eskualde bat, Espainia iparraldeko eta Frantziako hego-mendebaldeko eskualdeetan. Euskara hizkuntza isolatu gisa sailkatzen da. Euskaldunak indigen eta batez ere Euskal Herrian bizi dira. [7] Euskara 806.000 euskaldunek hitz egiten dute lurralde guztietan. Horietatik %93,7 (%756.000) Euskal AEko espainiar eremuan daude, eta gainerako %6,3 (50.000) Frantziako aldean. [1]

Basque millers traditionally employed a separate number system of unknown origin. In this system the symbols are arranged either along a vertical line or horizontally. On the vertical line the single digits and fractions are usually off to one side, usually at the top. When used horizontally, the smallest units are usually on the right and the largest on the left. As with the Basque system of counting in general, it is vigesimal (base 20). Although it is in theory capable of indicating numbers above 100, most recorded examples do not go above 100. Fractions are relatively common, especially 1⁄2.

Euskal errotariek jatorri ezezaguneko zenbaki-sistema bereizi bat erabiltzen zuten. Sistema honetan ikurrak lerro bertikalean edo horizontalki antolatzen dira. Lerro bertikalean digitu eta frakzio bakarrak alde batera joaten dira normalean, normalean goialdean. Horizontalki erabiltzen denean, unitaterik txikienak eskuinean eta handienak ezkerrean izaten dira. Euskal sistema orokorrean bezala, bigesimala da (20 oinarria). Nahiz eta teorian 100etik gorako zenbakiak adieraz ditzakeen, adibide erregistratu gehienak ez dira 100etik gora. Frakzioak nahiko arruntak dira, batez ere 1⁄2.

Source

urtzai · September 20, 2024, 1:37pm

This looks pretty decent (more than decent, actually) to me.

NicoLe · September 20, 2024, 3:45pm

To have a better idea, you can try translating things that are out of the training data, for instance, news, press articles, or websites that are not translated.

12M sentences is good: if translating sources other than Wikipedia disappoints, there is room to refine the data.

1/ Click the “eye” icons on Opus to display a sample. Quite often you can deduce from the sample whether the corpus is meaningful for training and what data you should filter out of it to improve the model.

2/ The Locomotive project has a whole filter kit to prepare datasets. I have developed a few filters myself for use with high-resource languages, and I have more in the qualification pipeline (multi-step filters that check each sentence pair’s accuracy and also detect and filter other languages). They do not work with every language, so they may not be useful for Euskara.

3/ Then you may want to gain performance with a more advanced model architecture. A data scientist working on the project and I have done some research on the topic, and we came up with some candidates: two excellent ones that require tweaking ctranslate2’s code a little, and a couple of others that do not (at least one; I did not try every combination after finding the superiority of the first two).
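To make the filtering idea in point 2/ more concrete, here is a minimal sketch of very basic sentence-pair filtering. It is not Locomotive's filter code: the function name, the thresholds, and the sample pairs are all assumptions chosen for illustration, and real filters (accuracy checks, language detection) go well beyond this.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def filter_pairs(pairs: Iterable[Pair], min_chars: int = 3,
                 max_ratio: float = 3.0) -> Iterator[Pair]:
    """Yield (source, target) pairs that pass a few simple sanity checks."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop empty or near-empty sides.
        if len(src) < min_chars or len(tgt) < min_chars:
            continue
        # Drop pairs whose lengths differ too much (likely misaligned).
        if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_ratio:
            continue
        # Drop pairs where the target is just a copy of the source (untranslated).
        if src == tgt:
            continue
        # Drop duplicates, ignoring case.
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue
        seen.add(key)
        yield src, tgt


# Hypothetical sample data for illustration only.
sample = [
    ("Basque is classified as a language isolate.",
     "Euskara hizkuntza isolatu gisa sailkatzen da."),
    ("Basque is classified as a language isolate.",
     "Euskara hizkuntza isolatu gisa sailkatzen da."),       # duplicate
    ("Click here", "Click here"),                            # untranslated copy
    ("Yes", ""),                                             # empty target
    ("Fractions are relatively common, especially 1/2.", "Bai."),  # length mismatch
]

kept = list(filter_pairs(sample))
print(f"kept {len(kept)} of {len(sample)} pairs")
```

The thresholds here are arbitrary; they would need tuning per corpus after looking at samples, as suggested in point 1/.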

ArtanisTheOne · September 20, 2024, 6:23pm

I suggest using the Flores200 benchmark, a dataset from Facebook that has ~1000 translated sentences per split (dev and devtest) in a bunch of languages. Every language contains translations of the same source sentences. It has Euskera.
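To show what scoring against reference translations could look like, here is a small sketch of a simplified character n-gram F-score in pure Python. It is loosely modeled on the chrF idea but is not the official implementation (a real evaluation would more likely use an established tool such as sacreBLEU). The function names are made up, and the reference and hypothesis strings are placeholders, not actual Flores200 data.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level score: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Placeholder data: in a real run, `reference` would come from the benchmark
# and each hypothesis from translating the matching English sentence.
reference = "Euskara hizkuntza isolatu gisa sailkatzen da."
hypotheses = {
    "model A": "Euskara hizkuntza isolatu gisa sailkatzen da.",
    "model B": "Euskara hizkuntza isolatua da.",
}

for name, hyp in hypotheses.items():
    print(f"{name}: {simple_chrf(hyp, reference):.3f}")
```

Averaging such a score over all benchmark sentences gives one number per model, which makes two models comparable without needing to read Basque.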

urtzai · September 20, 2024, 9:10pm

So, these are my results:

Euskara da Paleoeuropar hizkuntzen ondorengo bakarra, gaur egun Europako kontinentea menderatzen duten hizkuntza indoeuroparren hiztunen etorrera iragartzen duena. Euskal Herriak eta Euskal Herriko beste biztanle batzuek euskaraz hitz egiten dute, Espainia iparraldeko eta Frantziako hego-mendebaldeko Pirinioak zeharkatzen dituen eskualdea. Euskara hizkuntza isolatu gisa sailkatzen da. Euskaldunak indigenak dira eta batez ere Euskal Herrian bizi dira. [7] Euskara 806.000 euskaldunek hitz egiten dute lurralde guztietan. Horietatik, %93,7 (756.000) Euskal AEko espainiar eremuan daude, eta gainerako %6,3 (50.000) frantsesean daude. [1]

And:

Euskal errotariek jatorri ezezaguneko sistema bereizi bat erabili ohi zuten. Sistema honetan ikurrak lerro bertikalean edo horizontalki antolatzen dira. Lerro bertikalean digitu eta zatiki bakarrak alde batera joaten dira normalean, goian. Horizontalki erabiltzean, unitaterik txikienak eskuinean eta ezkerrean egoten dira. Euskal sistema orokorrean bezala, bigesimala da (20 oinarria). Nahiz eta teoriaren arabera 100etik gorako zenbakiak adieraz ditzakeen, adibide erregistratu gehienak ez dira 100etik gorakoak. Frakzioak nahiko arruntak dira, batez ere 1⁄2.

There is not much difference, but I would clearly say your model is slightly better.

urtzai · September 20, 2024, 10:26pm

I’ve also translated LibreTranslate to Basque. Here is the pull request: Basque localization finished and reviewed on Weblate by urtzai · Pull Request #678 · LibreTranslate/LibreTranslate · GitHub

And here is the translation project: LibreTranslate/App — Basque @ Hosted Weblate

urtzai · September 24, 2024, 9:57am

Hi @argosopentech! I can’t wait to have the first en → eu / eu → en translation models in LibreTranslate.

Do you need help releasing them? How can I help you?

argosopentech · September 24, 2024, 8:09pm

I’m training the eu → en model now. Hopefully it’ll be released within a week or so.

argosopentech · October 2, 2024, 12:40pm

The Basque model is live now!

I also added a model for Galician.

urtzai · October 3, 2024, 10:19am

WOW!! Awesome!! I think the only things left are the Basque language option in LibreTranslate and an update to the translation file to include its name.

I can create a pull request for the PO file, but I’m not sure how to add the language to the menu.