Estonian is currently the most requested language by upvotes on my language-requests GitHub poll:
I can train Estonian models, but unfortunately not before the end of February 2024, as my GPUs are booked until then…
I started training this model taking all the recent developments into account, but I have one question:
I have collected about 30M sentence pairs, which seems to be enough to train a high-quality model with 159M parameters.
If model size is not a problem, then in my opinion the quality justifies the size. However, I can try to shrink the model to 86M parameters at some cost in quality.
This should be plenty of data. I think the minimum amount of data needed for a usable translation model is about 2M parallel sentences.
Larger or smaller models are fine; whichever you think works best for that language.
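As a rough sanity check on corpus size, here is a minimal sketch of counting unique sentence pairs in a parallel corpus before training; the file names are hypothetical, and deduplication details will vary by pipeline.

```python
# Count unique sentence pairs in a parallel corpus (file names hypothetical).
# Crawled corpora often contain many duplicates, so raw line counts overstate
# the amount of usable training data.
def count_unique_pairs(src_path: str, tgt_path: str) -> int:
    seen = set()
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            # Hash each pair to keep memory bounded on tens of millions of lines.
            seen.add(hash((s.strip(), t.strip())))
    return len(seen)

print(count_unique_pairs("corpus.en", "corpus.et"))
```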
Thanks for the answer! Then I’ll stick with my 159M model here; it showed better performance than Transformer Big. A little later I will try to compile and share my observations on how different parameters affect model performance, with references to many other studies.
I’m glad to introduce the translation models to the community.
The models use a deep Transformer (“Transformer DEEP”) architecture with 159M parameters, which is about 30% smaller than Transformer Big and almost 2x faster at inference (by my observations comparable to Transformer Base, since the decoder has similar dimensions).
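To make the size/speed trade-off concrete, here is a rough parameter-count sketch comparing a deep-encoder/shallow-decoder layout against Transformer Big. The dimensions, layer counts, and vocabulary size below are my assumptions for illustration, not the model’s published config.

```python
import torch.nn as nn

def transformer_params(d_model: int, ffn: int, enc_layers: int,
                       dec_layers: int, vocab: int = 32000) -> int:
    """Approximate parameter count; assumes a single tied embedding matrix."""
    model = nn.Transformer(
        d_model=d_model, nhead=d_model // 64,
        num_encoder_layers=enc_layers, num_decoder_layers=dec_layers,
        dim_feedforward=ffn, batch_first=True)
    body = sum(p.numel() for p in model.parameters())
    return body + vocab * d_model  # shared/tied embeddings (an assumption)

# Transformer Big: 6+6 layers, d_model=1024, FFN 4096.
print("big :", transformer_params(1024, 4096, 6, 6))
# A guessed deep-encoder/shallow-decoder layout: most parameters sit in the
# encoder, while the small decoder dominates (and speeds up) autoregressive
# inference.
print("deep:", transformer_params(768, 3072, 16, 2))
```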
EN_ET

| Model | BLEU | COMET-22 | Note |
|---|---|---|---|
| LT1.0 | 29.1 | 0.9017 | |
| Google Translate | 31 | 0.918 | |
| eng-est/opus…2022-03-13 | 28.30 | - | |
| facebook/m2m100_1.2B | 24.5 | - | |
ET_EN

| Model | BLEU | COMET-22 | Note |
|---|---|---|---|
| LT1.0 | 38.7 | 0.8903 | |
| Google Translate | 42.4 | 0.8997 | |
| est-eng/opus…2022-03-09 | 38.6 | - | |
| facebook/nllb-200-3.3B | 35.7 | - | |
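For anyone who wants to reproduce scores like these, here is a minimal sketch using sacrebleu and Unbabel’s COMET-22 model. The thread does not say which test set was used, so the sentences below are placeholders.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Placeholder data; a real evaluation would use a held-out test set.
sources = ["See on näide."]
hypotheses = ["This is an example."]
references = ["This is an example."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print(f"COMET-22: {comet.predict(data, batch_size=8, gpus=0).system_score:.4f}")
```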
Back-translation was also used to train the models, with the following monolingual data:
- All of Estonian Wikipedia and two years of news.
- About half of English Wikipedia, one year of news, and news commentary.
In total, this gave about 35M back-translated sentence pairs.
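For illustration, a back-translation step could look like the sketch below: a reverse (et→en) model translates monolingual Estonian into English, and the synthetic (English, Estonian) pairs are added to the en→et training data. This uses Argos Translate’s API for convenience; the actual pipeline and the file names are assumptions, not details from this thread.

```python
import argostranslate.translate

# Back-translation sketch: monolingual Estonian in, synthetic parallel data out.
# File names are hypothetical.
with open("mono.et", encoding="utf-8") as mono, \
     open("synthetic.en", "w", encoding="utf-8") as src_out, \
     open("synthetic.et", "w", encoding="utf-8") as tgt_out:
    for line in mono:
        sentence = line.strip()
        if not sentence:
            continue
        # The machine-translated side becomes the synthetic source; the
        # genuine monolingual sentence stays as the target.
        src_out.write(argostranslate.translate.translate(sentence, "et", "en") + "\n")
        tgt_out.write(sentence + "\n")
```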
To be honest, both models were still training and improving their metrics, albeit more slowly, when they exhausted the GPU time limit.
Models:
This looks great, thank you!
Here’s some demo text I ran through the et->en model:
Estonian Source Text
Samburu loodusreservaat (inglise keeles Samburu National Reserve) on loodusreservaat Keenia keskosas.
Reservaat paikneb piki Ewaso Ng’iro jõe põhjakallast. Vastaskaldal asub Buffalo Springsi reservaat. Samburu reservaat jääb Samburu maakonna alale ja ulatub maakonna lõunapiirini. Reservaat asub Nairobist kirdes umbes 350 km kaugusel.
Reservaati pääseb Nairobi-Isiolo-Marsabiti ja Maralali-Wamba-Isiolo teed mööda. Reservaadi alal on väike lennuväli ning lennukid Nairobi ja Samburu vahel lendavad iga päev. Lend kestab 45 minutit.
Reservaadi pindala on 165 km². Kõrgus merepinnast on 800–1230 meetrit, lääneosa on madalam ja idaosa kõrgem.
Reservaat loodi 1948. See on saanud nime samburute, piirkonnas elava rahva järgi. Värvikaid samburuid võib jõe kaldal tihti näha, kui nad oma loomi jootmas käivad.
English Translation
Samburu National Reserve is a natural reserve in central Kenya.
The reservation is located along the north bank of the Ewaso Ng’iro River. On the opposite shore is the Buffalo Springs Reservation. The Samburu Reservation remains in the Samburu County area and extends to the southern border of the county. The reservation is located about 350 km northeast of Nairobi.
The reservation can be reached by the Nairobi-Isiolo-Marsabit and Maralali-Wamba-Isiolo roads. There is a small airstrip in the reservation area, and planes between Nairobi and Samburu fly daily. The flight lasts 45 minutes.
The reserve covers an area of 165 km2. The elevation is between 800 and 1230 meters above sea level, the western part is lower and the eastern part higher.
The reservation was established in 1948. It is named after the Samburute, a people living in the area. Colorful shamboos can often be seen on the riverside as they water their animals.
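For reference, here is a minimal sketch of running such a demo through Argos Translate, assuming the model is installed from a local .argosmodel package; the file name below is hypothetical.

```python
import argostranslate.package
import argostranslate.translate

# Install the et->en model from a local package file (hypothetical name).
argostranslate.package.install_from_path("translate-et_en-1_9.argosmodel")

text = "Samburu loodusreservaat on loodusreservaat Keenia keskosas."
print(argostranslate.translate.translate(text, "et", "en"))
```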
I had forgotten that we had already uploaded an Estonian model from Opus-MT and didn’t realize it until I uploaded @lynxpda’s model. Whoops! I overwrote the Opus-MT model on Cloudflare with the new Estonian model, so this change should be live now.
Both the Opus-MT model and the new model are version “1.9”, so there’s no change to the package index.
Yes, I also noticed that the Estonian model was already in the index.
In any case, according to the metrics the new model is slightly better than the best OPUS model from 2022, and the index had included a small model from 2020, if I’m not mistaken.
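One way to confirm what the index serves is Argos Translate’s package API, sketched below; the attribute names are my best understanding and worth double-checking. Since both the old and new models are version “1.9”, the listing should look unchanged even though the hosted file differs.

```python
import argostranslate.package

# List the Estonian packages the index currently advertises.
argostranslate.package.update_package_index()
for pkg in argostranslate.package.get_available_packages():
    if "et" in (pkg.from_code, pkg.to_code):
        print(f"{pkg.from_code} -> {pkg.to_code}, v{pkg.package_version}")
```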