I just trained a new argosmodel for Basque following this training tutorial and using all the Opus project en > eu datasets available (approx. 12M sentences).
I know that @argosopentech is also interested and working on it, but I wanted to make a first attempt. I also tested the model on some large Wikipedia texts and the result is quite decent. So the goal is to make this first version available in LibreTranslate.
If this model is accepted, I will train argostranslate eu > en. The next step will be to translate LibreTranslate itself.
I translate English → Basque, then back-translate Basque → English using Google Translate, and compare the back-translated text with the original.
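If you want a rough number instead of comparing by eye, here is a minimal sketch of that comparison. It assumes you already have the original English text and the back-translated English text as strings; the function name `round_trip_similarity` is just something I made up for illustration, and word-level matching with `difflib` is only a crude proxy, not a real translation metric.

```python
from difflib import SequenceMatcher


def round_trip_similarity(original: str, back_translated: str) -> float:
    """Return a 0..1 word-level similarity between two English texts.

    Hypothetical helper: lowercases and splits on whitespace, then lets
    difflib measure how much of the word sequence is shared.
    """
    a = original.lower().split()
    b = back_translated.lower().split()
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    original = "Basque is classified as a language isolate."
    # Made-up back-translation, only to show the call.
    back = "Basque is classified as an isolated language."
    print(round(round_trip_similarity(original, back), 2))
```

Keep in mind that a high score can also come from errors that cancel out in the round trip, so it is worth reading the Basque output as well.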
Here are some examples of text translated with my model:
Basque is the only surviving descendant of the Paleo-European languages, predating the arrival of speakers of the Indo-European languages that dominate the European continent today. Basque is spoken by the Basques and other residents of the Basque Country, a region that straddles the westernmost Pyrenees in adjacent parts of northern Spain and southwestern France. Basque is classified as a language isolate. The Basques are indigenous to and primarily inhabit the Basque Country.[7] The Basque language is spoken by 806,000 Basques in all territories. Of these, 93.7% (756,000) are in the Spanish area of the Basque Country and the remaining 6.3% (50,000) are in the French portion.[1]
Euskara da Paleo-europar hizkuntzen ondorengo bakarra, gaur egun Europako kontinentea menderatzen duten hizkuntza indoeuroparren hiztunak iritsi aurretik. Euskaldunek eta Euskal Herriko beste biztanle batzuek euskaraz hitz egiten dute, mendebaldeko Pirinioak zeharkatzen dituen eskualde bat, Espainia iparraldeko eta Frantziako hego-mendebaldeko eskualdeetan. Euskara hizkuntza isolatu gisa sailkatzen da. Euskaldunak indigen eta batez ere Euskal Herrian bizi dira. [7] Euskara 806.000 euskaldunek hitz egiten dute lurralde guztietan. Horietatik %93,7 (%756.000) Euskal AEko espainiar eremuan daude, eta gainerako %6,3 (50.000) Frantziako aldean. [1]
Basque millers traditionally employed a separate number system of unknown origin. In this system the symbols are arranged either along a vertical line or horizontally. On the vertical line the single digits and fractions are usually off to one side, usually at the top. When used horizontally, the smallest units are usually on the right and the largest on the left. As with the Basque system of counting in general, it is vigesimal (base 20). Although it is in theory capable of indicating numbers above 100, most recorded examples do not go above 100. Fractions are relatively common, especially 1⁄2.
Euskal errotariek jatorri ezezaguneko zenbaki-sistema bereizi bat erabiltzen zuten. Sistema honetan ikurrak lerro bertikalean edo horizontalki antolatzen dira. Lerro bertikalean digitu eta frakzio bakarrak alde batera joaten dira normalean, normalean goialdean. Horizontalki erabiltzen denean, unitaterik txikienak eskuinean eta handienak ezkerrean izaten dira. Euskal sistema orokorrean bezala, bigesimala da (20 oinarria). Nahiz eta teorian 100etik gorako zenbakiak adieraz ditzakeen, adibide erregistratu gehienak ez dira 100etik gora. Frakzioak nahiko arruntak dira, batez ere 1⁄2.
To get a better idea, you can try translating things that are outside the training data, for instance news, press articles, or websites that have not been translated.
12M sentences is good: if translating sources other than Wikipedia disappoints, there is room to refine the data.
1/ Click the “eye” icons on Opus to display a sample. Quite often you can deduce from the sample whether the corpus is meaningful for training and what data you should filter out of it to improve the model.
2/ The Locomotive project has a whole filter kit to prepare datasets. I have developed a few filters myself for use with large-resource languages, and I have more in the qualification pipeline (multi-step filters that check each sentence pair’s accuracy and also detect and filter other languages). They do not work with every language, so they may not be useful for Euskara.
3/ Then you may want to gain performance with a more advanced model architecture. A data scientist working on the project and I have done some research on the topic and came up with some candidates: two excellent ones that require tweaking ctranslate2’s code a little, and a couple of others that do not (at least one; I did not try every combination after finding the first two superior).
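To give an idea of what a basic filter from step 2 can look like, here is a minimal sketch. It is not one of Locomotive's filters and does not use its API; the function name `filter_pairs` and the thresholds `max_ratio` and `min_chars` are assumptions chosen for illustration. It drops very short segments, untranslated copies, pairs with an implausible length ratio, and exact duplicates.

```python
def filter_pairs(pairs, max_ratio=2.5, min_chars=3):
    """Return the (source, target) pairs that pass a few simple checks.

    Hypothetical filter; the thresholds are illustrative assumptions and
    should be tuned on a sample of the actual corpus.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop empty or very short segments.
        if len(src) < min_chars or len(tgt) < min_chars:
            continue
        # Drop pairs where the target is just a copy of the source.
        if src == tgt:
            continue
        # Drop pairs whose character lengths differ too much.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue
        # Drop case-insensitive duplicates.
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    sample = [
        ("Hello world", "Kaixo mundua"),
        ("Hello world", "Kaixo mundua"),
        ("Same text", "Same text"),
    ]
    print(filter_pairs(sample))
```

Looking at what gets rejected on a few thousand pairs is usually the quickest way to see whether the thresholds are too strict for en > eu.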
I suggest using the Flores200 benchmark, a dataset from Facebook with ~1000 translated sentences per split (dev and devtest) in a bunch of languages. The sentences in every language are translations of the same source sentences. It includes Euskera.
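Once you have the model's output and the reference translations aligned line by line, scoring can be sketched as below. This is a simplified character n-gram F-score in the spirit of chrF, written with the standard library only; it is not the official implementation, so its numbers are not comparable with published scores. The names `simple_chrf` and `score_corpus` are hypothetical, and averaging per-sentence scores is a simplification.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after normalizing whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style score in [0, 1] for one sentence pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text shorter than n characters
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta=2 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def score_corpus(hypotheses, references):
    """Average the per-sentence scores over aligned lists."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must be aligned")
    scores = [simple_chrf(h, r) for h, r in zip(hypotheses, references)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    hyps = ["Euskara hizkuntza isolatu gisa sailkatzen da."]
    refs = ["Euskara hizkuntza isolatu gisa sailkatzen da."]
    print(score_corpus(hyps, refs))
```

Running two models on the same devtest sentences and comparing the averages gives a more objective view than a couple of Wikipedia paragraphs.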
Euskara da Paleoeuropar hizkuntzen ondorengo bakarra, gaur egun Europako kontinentea menderatzen duten hizkuntza indoeuroparren hiztunen etorrera iragartzen duena. Euskal Herriak eta Euskal Herriko beste biztanle batzuek euskaraz hitz egiten dute, Espainia iparraldeko eta Frantziako hego-mendebaldeko Pirinioak zeharkatzen dituen eskualdea. Euskara hizkuntza isolatu gisa sailkatzen da. Euskaldunak indigenak dira eta batez ere Euskal Herrian bizi dira. [7] Euskara 806.000 euskaldunek hitz egiten dute lurralde guztietan. Horietatik, %93,7 (756.000) Euskal AEko espainiar eremuan daude, eta gainerako %6,3 (50.000) frantsesean daude. [1]
And:
Euskal errotariek jatorri ezezaguneko sistema bereizi bat erabili ohi zuten. Sistema honetan ikurrak lerro bertikalean edo horizontalki antolatzen dira. Lerro bertikalean digitu eta zatiki bakarrak alde batera joaten dira normalean, goian. Horizontalki erabiltzean, unitaterik txikienak eskuinean eta ezkerrean egoten dira. Euskal sistema orokorrean bezala, bigesimala da (20 oinarria). Nahiz eta teoriaren arabera 100etik gorako zenbakiak adieraz ditzakeen, adibide erregistratu gehienak ez dira 100etik gorakoak. Frakzioak nahiko arruntak dira, batez ere 1⁄2.
There is not much difference, but I would say your model is slightly better.