How's the English -> Italian model created?

Perhaps I’m missing something obvious, but I’ve noticed that data-index.json (https://github.com/argosopentech/argos-train/blob/master/data-index.json) doesn’t have direct datasets from English (en) to Italian (it).

Have they just not been listed there, even though there’s an English → Italian model available in the argos index?

There are a lot of language models available that don’t have data on the data index because I trained them manually.

I trained all of the original models manually by downloading the data from Opus, concatenating the datasets together with Unix commands, running SentencePiece and OpenNMT, and then manually building the .argosmodel zip archive. As I added more languages, I gradually automated the process, and that automation went into the argos-train repo.
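For readers who want a picture of the concatenation and packaging steps, here is a minimal Python sketch. It is not the actual argos-train code: the function names `concat_parallel` and `build_archive` are hypothetical, the one-sentence-per-line file layout is an assumption, and the archive layout (a single top-level folder) is also an assumption. The SentencePiece and OpenNMT steps are left out.

```python
import zipfile
from pathlib import Path


def read_lines(path):
    """Read a text file as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def concat_parallel(pairs, out_src, out_tgt):
    """Concatenate (source_file, target_file) pairs into two combined files.

    Assumes one sentence per line, with line N of the source file aligned
    to line N of the target file. Returns the number of sentence pairs written.
    """
    total = 0
    with open(out_src, "w", encoding="utf-8") as src_out, \
         open(out_tgt, "w", encoding="utf-8") as tgt_out:
        for src_path, tgt_path in pairs:
            src_lines = read_lines(src_path)
            tgt_lines = read_lines(tgt_path)
            # Refuse misaligned datasets rather than silently shifting pairs.
            if len(src_lines) != len(tgt_lines):
                raise ValueError(f"{src_path} and {tgt_path} are not aligned")
            for src, tgt in zip(src_lines, tgt_lines):
                src_out.write(src + "\n")
                tgt_out.write(tgt + "\n")
                total += 1
    return total


def build_archive(package_dir, archive_path):
    """Zip a directory of model files into one archive.

    The directory itself is kept as the top-level folder in the archive
    (an assumption about the layout, not a confirmed format).
    """
    package_dir = Path(package_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(package_dir.parent).as_posix())
    return archive_path
```

The alignment check is the part worth keeping in any version of this: if the source and target files drift by even one line, every later sentence pair in the training data is wrong.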

As a result, only some of the languages have data packages that can be used programmatically, but I’m gradually moving them all onto the new system.
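To check programmatically whether a language pair has data packages, something like the following sketch could work. It assumes the index is a JSON list of objects with `from_code` and `to_code` fields; those field names, the helper `find_data_packages`, and the sample entries are assumptions for illustration, not copied from data-index.json.

```python
def find_data_packages(index, from_code, to_code):
    """Return the index entries for one translation direction.

    `index` is assumed to be the parsed JSON list, where each entry has
    "from_code" and "to_code" keys (an assumption about the file format).
    """
    return [
        entry
        for entry in index
        if entry.get("from_code") == from_code and entry.get("to_code") == to_code
    ]


# Hypothetical sample entries, for illustration only.
sample_index = [
    {"name": "example-en-es", "from_code": "en", "to_code": "es"},
    {"name": "example-en-fr", "from_code": "en", "to_code": "fr"},
]

print(find_data_packages(sample_index, "en", "it"))  # prints []
```

An empty result only means the pair has no entry in the index; as explained above, a model for that pair may still exist because it was trained manually.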

Makes sense, thanks!

I just added English-Italian data to Argos Train:

I trained a new trial model for Italian (20230820) and compared it to the current production model (1.0).

I think that the biggest difference between these models is that 1.0 uses data from CCMatrix and ParaCrawl, which are large but lower-quality datasets, and 20230820 does not. 20230820 only uses the data in the commit above, so it’s a smaller but, I think, higher-quality dataset.

To me the two models seem to produce translations of similar quality, at least in this example. I don’t speak Italian, but I think I can completely understand the text from either of the translations.

Italian → English

Italian

Gli studi di Jane Goodall hanno dimostrato l’esistenza di una vera e propria cultura nelle comunità di scimpanzé della Tanzania, analoga a quella dei primi appartenenti al genere Homo e Australopithecus. Questi studi hanno consentito di chiarire le differenze fra scimpanzé e bonobo e di identificare entrambe le specie come ominidi (insieme ai gorilla), a differenza dell’orango (primate appartenente ai pongidi); la Goodall è stata fra gli ideatori del Progetto Grandi Scimmie Antropomorfe che mira a ottenere, per i grandi primati, un certo numero di diritti fondamentali riconosciuti a livello internazionale all’uomo, quali il diritto alla vita, alla protezione della libertà individuale e alla protezione dalla tortura. Ha inoltre evidenziato la scoperta dell’uso di utensili da parte degli scimpanzé: la studiosa, infatti, ha scoperto che questi animali sono soliti utilizzare, ad esempio, degli stecchini per “pescare” le termiti all’interno dei loro nidi, le larve e i galagoni dalle cavità dei tronchi d’albero o il miele dagli alveari, o ancora l’utilizzo di pietre per rompere i gusci dei semi più duri.

Studi recenti hanno dimostrato che gli scimpanzé hanno la consapevolezza della morte degli altri individui, non solamente come accade in altri animali molto intelligenti, che spesso appaiono depressi o provano vero dispiacere per la morte di un compagno, di un cucciolo o di un umano (se addomesticati), ma come vera capacità di astrazione e autocoscienza, dimostrata anche dai comportamenti ritualistici documentati. Essi hanno inoltre una morale di gruppo, talvolta comportamenti negativi - come “fare la guerra”, organizzati in piccoli “eserciti” o “bande”, composti dal clan famigliare o dal branco che divengono così una tribù; talvolta indulgere al cannibalismo di membri estranei al gruppo appena uccisi, se particolarmente affamati - del tutto analoghi a quelli umani e ben distinti da quelli di altri animali gregari.

(Source)

1.0

Jane Goodall’s studies have demonstrated the existence of a true culture in the communities of chimpanzees of Tanzania, similar to that of the first belonging to the genus Homo and Australopithecus. These studies have made it possible to clarify the differences between chimpanzees and bonobos and to identify both species as hominids (together with gorillas), unlike the orngo (primate belonging to the pongids); Goodall was among the creators of the Great Anthropomorphic Monkeys Project, which aims to obtain, for the great primates, a number of fundamental rights internationally recognized to man, such as the right to life, the protection of individual freedom and the protection of torture. He also highlighted the discovery of the use of tools by chimpanzees: the scholar, in fact, found that these animals are used, for example, to use stecchini to “weave” the termites within their nests, the larvae and the galagons from the cavities of the trunks of tree or honey from the hives, or even the use of stones to break the shells of the harshest seeds.

Recent studies have shown that chimpanzees have the awareness of the death of other individuals, not only as it happens in other very intelligent animals, which often appear depressed or feel real sorry for the death of a companion, a puppy or a human (if adomesticated), but as a true ability to abstraction and self-consciousness, also demonstrated by documented ritualistic behaviors. They also have a group morale, sometimes negative behaviors - like “make war”, organized in small “exercises” or “bands”, composed by the family clan or the pack that thus become a tribe; sometimes indulge in the cannibalism of foreign members to the newly killed group, if particularly hungry - quite similar to those humans and well distinguished from those of other gregarious animals.

20230820

Jane Goodall’s studies have shown the existence of a true culture in the communities of chimpanzees in Tanzania, similar to that of the first members of the genus Homo and Australopithecus. These studies have made it possible to clarify the differences between chimpanzees and bonobos and to identify both species as ominides (together with gorillas), unlike the orango (primate belonging to the pongids); Goodall was among the designers of the Great Antropomorfe Project that aims to achieve, for the great primates, a number of internationally recognized fundamental rights to humans, such as the right to life, the protection of individual freedom and protection from torture. He also highlighted the discovery of the use of utensils by chimpanzees: The study found that these animals are used, for example, to use stecchini to “fish” thermotes within their nests, the larvae and the galaxies from the cavities of tree logs or honey from the hives, or the use of stones to break the tastes of the harshest seeds.

Recent studies have shown that chimpanzees are aware of the death of other individuals, not only as is the case in other very intelligent animals, who often seem depressed or feel sorry for the death of a partner, a puppy or a human (if domesticated), but as a true abstraction and self-consumption capability, also demonstrated by documented ritualistic behaviors. They also have group morals, sometimes negative behaviors, such as “warfare,” organized in small “exarchs” or “bands”, composed of the family clan or the pack that thus become a tribe. Sometimes to indulge in the cannibalism of members outside the newly killed group, if particularly hungry, who are completely similar to humans and well distinguished from those of other Greek animals.

Native speaker here, here’s my analysis and scoring:

1.0 is slightly more accurate (“hanno dimostrato” → “demonstrated” not “shown”). 20230820 also uses the word “members” which is not in the original text (and wouldn’t be my choice for translation).

1.0 is better, as 20230820 uses the word “ominides”, which does not exist in English. The correct word is “hominids” (used by 1.0).

About equal.

1.0 translates more accurately “Great Anthropomorphic Monkeys Project” (better than “Great Antropomorfe Project” which is wrongish), but 20230820 gets “protection from torture” (correct), rather than “protection of torture” (wrong).

1.0 wins. 20230820 writes “use stecchini to “fish” thermotes within their nests, the larvae and the galaxies” as well as “the use of stones to break the tastes of the harshest seeds” (“break the shells of the harshest seeds” is correct). :laughing:

1.0 is more accurate overall, although it uses “adomesticated”, which is not a word (“domesticated” is correct). 20230820 interprets “autocoscienza” as “self-consumption”, which is totally wrong, whereas 1.0 uses “self-consciousness” (correct).

Both models fail to translate “eserciti” correctly, but 1.0 does a better job at the end with “animali gregari” → “gregarious animals” (correct) instead of “Greek animals” (really wrong).

Overall, 1.0 seems to perform better on this particular text.

English → Italian

Here’s the en->it model:

English

Source

When Buck earned sixteen hundred dollars in five minutes for John Thornton, he made it possible for his master to pay off certain debts and to journey with his partners into the East after a fabled lost mine, the history of which was as old as the history of the country. Many men had sought it; few had found it; and more than a few there were who had never returned from the quest. This lost mine was steeped in tragedy and shrouded in mystery. No one knew of the first man. The oldest tradition stopped before it got back to him. From the beginning there had been an ancient and ramshackle cabin. Dying men had sworn to it, and to the mine the site of which it marked, clinching their testimony with nuggets that were unlike any known grade of gold in the Northland.

But no living man had looted this treasure house, and the dead were dead; wherefore John Thornton and Pete and Hans, with Buck and half a dozen other dogs, faced into the East on an unknown trail to achieve where men and dogs as good as themselves had failed. They sledded seventy miles up the Yukon, swung to the left into the Stewart River, passed the Mayo and the McQuestion, and held on until the Stewart itself became a streamlet, threading the upstanding peaks which marked the backbone of the continent.

John Thornton asked little of man or nature. He was unafraid of the wild. With a handful of salt and a rifle he could plunge into the wilderness and fare wherever he pleased and as long as he pleased. Being in no haste, Indian fashion, he hunted his dinner in the course of the day’s travel; and if he failed to find it, like the Indian, he kept on travelling, secure in the knowledge that sooner or later he would come to it. So, on this great journey into the East, straight meat was the bill of fare, ammunition and tools principally made up the load on the sled, and the time-card was drawn upon the limitless future.

1.0

Quando Buck guadagnò sedicicento dollari in cinque minuti per John Thornton, rese possibile per il suo padrone di pagare alcuni debiti e di viaggiare con i suoi partner in Oriente dopo una mina smarrita, la cui storia era vecchia quanto la storia del paese. Molti uomini lo avevano cercato; pochi lo avevano trovato; e più di pochi erano quelli che non erano mai tornati dalla ricerca. Questa miniera perduta fu immersa nella tragedia e avvolta nel mistero. Nessuno sapeva del primo uomo. La più antica tradizione si fermò prima che gli ritornasse. Fin dall’inizio c’era stato un antico e ramshackle cabina. Gli uomini che morivano avevano giurato, e alla miniera il luogo di cui ha segnato, abbagliando la loro testimonianza con pepite che erano a differenza di qualsiasi grado d’oro conosciuto nel Nordland.

Ma nessun uomo vivente aveva saccheggiato questa casa del tesoro, e i morti erano morti; quindi John Thornton e Pete e Hans, con Buck e mezza dozzina di altri cani, affrontati in Oriente su un sentiero sconosciuto per raggiungere dove uomini e cani come se stessi avevano fallito. Essi slittarono settanta miglia su Yukon, swung a sinistra nel fiume Stewart, passarono il Mayo e la McQuestion, e si tennero fino a quando lo stesso Stewart divenne un ruscello, infilando le vette che segnarono la spina dorsale del continente.

John Thornton chiese a poco uomo o natura. Non aveva paura del selvaggio. Con una manciata di sale e un fucile poteva immergersi nel deserto e andare ovunque gli piaceva e finché gli piaceva. Essendo in nessun fretta, moda indiana, ha cacciato la sua cena nel corso del viaggio della giornata; e se non ha trovato, come l’indiano, ha continuato a viaggiare, sicuro nella conoscenza che prima o poi sarebbe venuto a esso. Così, in questo grande viaggio in Oriente, la carne retta era il disegno di legge di tariffa, munizioni e strumenti principalmente costituito il carico sulla slitta, e la carta del tempo è stato disegnato sul futuro senza limiti.

20230820

Quando Buck ha guadagnato sedicicento dollari in cinque minuti per John Thornton, ha permesso al suo padrone di pagare determinati debiti e di viaggiare con i suoi partner in Oriente dopo una miniera perduta, la cui storia era antica quanto la storia del paese. Molti uomini lo avevano cercato; pochi lo avevano trovato e più di alcuni non erano mai tornati dalla ricerca. Questa miniera perduta è stata ripidata in tragedia e avvolta nel mistero. Nessuno sapeva del primo uomo. La tradizione più antica si è fermata prima di tornare a lui. Fin dall’inizio c’era una cabina antica e ramshackle. Gli uomini che avevano indossato e, a mia volta, il cui sito era contrassegnato, piegavano la loro testimonianza con nuggetti che erano diversi da qualsiasi grado noto di oro nel nord del paese.

Ma nessun uomo vivente aveva saccheggiato questa casa del tesoro e i morti erano morti; dove John Thornton e Pete e Hans, con Buck e mezzo dozzina di altri cani, si trovavano in Oriente su una pista sconosciuta per raggiungere dove uomini e cani erano buoni come se stessi erano falliti. Hanno scavato settanta miglia lungo lo Yukon, a sinistra nel fiume Stewart, hanno superato il Mayo e la McQuestion e si sono tenuti fino a quando lo stesso Stewart non divenne un ruscello, imitando le vette che hanno segnato la spina dorsale del continente.

John Thornton ha chiesto poco di uomo o di natura. Era privo di fango. Con una manciata di sale e un fucile, potrebbe precipitare nel deserto e fare ogni volta che si rallegra e fino a quando si rallegra. Essendo senza fretta, di moda indiana, ha cacciato la cena nel corso del viaggio diurno e se non riesce a trovarla, come l’India, ha continuato a viaggiare, con la consapevolezza che prima o poi sarebbe arrivato. Quindi, in questo grande viaggio verso l’est, la carne retta era il disegno di legge, le munizioni e gli strumenti che costituivano principalmente il carico inguidito e la carta a tempo indeterminato fu utilizzata per il futuro senza limiti.

20230820 is better overall (1.0 gets “mine” in the wrong context; “mina” is an “exploding mine”).

1.0 is better. “Ripidata” is not a word in Italian AFAIK.

Equal.

Equal (both kind of awkward, but it’s a hard one to translate literally).

Both miss “ramshackle” from their dictionary.

1.0 is better. “nuggets” are not “nuggetti” (“pepite” is correct).

1.0 is better. 20230820 fails to capture “as good as themselves” and instead writes “as delicious as themselves”. 1.0 avoids the description of “good” altogether, although the sentence retains the meaning somewhat.

Equal, both have issues.

20230820 is a bit better, but both fail to capture the essence.

1.0 is much better. 20230820 fails by translating: “He was without mud” lol

1.0 is slightly better syntax-wise, although both interpret “wilderness” as “desert” (incorrect).

Equal. 1.0 is a bit more awkward, but 20230820 fails to interpret “Indian” and uses “India” (the country) instead.

Equal, both awkward.

Verdict: 1.0 is slightly better (but not much) than 20230820.
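As a rough cross-check of that verdict, the thirteen English → Italian judgments above can be tallied. This is only a sketch: the labels are one reading of the comments (for example, “Both miss” is counted as a tie), and a plain count ignores how severe each error is.

```python
from collections import Counter

# One label per point in the English → Italian review above, in order.
# "tie" covers the points judged equal, including "Both miss" (an interpretation).
judgments = [
    "20230820", "1.0", "tie", "tie", "tie", "1.0", "1.0",
    "tie", "20230820", "1.0", "1.0", "tie", "tie",
]

tally = Counter(judgments)
for label, count in tally.most_common():
    print(f"{label}: {count}")
```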

Interesting, thanks!

So I think the takeaway is that including large but mediocre-quality datasets improves translation quality a little, but not a lot. For now I’ll leave CCMatrix out of the default dataset, since it slows down training for a small benefit.

I think the best strategy for improving performance is to focus on collecting as much high-quality data as possible.