New German Model Training sponsored by Zammad

Argos Translate has received a $1000 sponsorship from Zammad to train new (and hopefully improved) German-English models for Argos Translate!

I have already uploaded new German-English datasets to data.argosopentech.com:
Add en-de NLLB data · argosopentech/argos-train@2dce6c0 · GitHub
Add more German-English datasets · argosopentech/argos-train@f056682 · GitHub

My plan is to train a new pair of models using Argos Train. I’m going to try using a very large value for train_steps and let the training run a long time. I normally do 50k steps and am going to try for 200k+.
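For context, Argos Train drives OpenNMT-py under the hood, so the step budget ultimately ends up as OpenNMT-py options. A rough sketch of the equivalent flags if you were running OpenNMT-py directly (argos-train generates this config itself, so the values here are only illustrative):

# Sketch only: raise the step budget from the usual 50k and keep periodic
# checkpoints around so intermediate models can be compared later.
onmt_train -config config.yml -train_steps 200000 \
    -valid_steps 5000 -save_checkpoint_steps 5000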

5 Likes

That’s awesome! Wish I could speak German to do some human evaluation.

The new models are being trained!

Datasets:

  • NLLB
  • Paracrawl
  • TildeModel
  • Wikimedia
  • Wikipedia
  • Wiktionary
  • DGT
  • EUBookshop
  • EuroPat
  • OpenSubtitles
  • XLEnt

I tried training for 200k steps and increased the early_stopping param value to try to prevent early stopping (in OpenNMT-py this is the number of validations without improvement before training halts; setting it to 0 disables early stopping entirely). This didn't work well: the de->en model trained for 200k steps would just return the input German text with no translation. I'm guessing it overfit or something. I'm training new models with the current default argos-train settings.

2 Likes

I trained new de->en and en->de models. Unfortunately, they both have the same issue: instead of translating, they just slightly modify the input text. I'm not sure what the cause is.

What does the loss graph look like? I would also do some random sampling on the bitext datasets and check that the sentences are aligned. If you can post the output of the OpenNMT training process, maybe there are some clues there too.

Maybe the process is getting stuck in a local minimum very quickly? In that case, tweaking the learning rate and/or the warmup steps/schedule might help.
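If you want to experiment with that, these are the relevant OpenNMT-py flags (the values below are illustrative starting points, not recommendations):

# Illustrative only: adjust the learning rate and warmup under the noam
# schedule to see whether training escapes the degenerate solution.
onmt_train -config config.yml -learning_rate 2 \
    -warmup_steps 8000 -decay_method noam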

I don’t think Argos Train has functionality to view the loss graph unless it’s part of OpenNMT-py.

If you have the log directory (I think OpenNMT-py saves it by default), you should be able to open it with TensorBoard: How to use TensorBoard with PyTorch — PyTorch Tutorials documentation
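Assuming the run had OpenNMT-py's tensorboard option enabled, something like this should bring the graphs up (the log directory is whatever tensorboard_log_dir is set to in the config; OpenNMT-py only writes these logs if the option was on):

pip install tensorboard
# Point --logdir at the tensorboard_log_dir from your training config
tensorboard --logdir <tensorboard_log_dir> --port 6006
# then open http://localhost:6006 in a browser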

1 Like

I’m examining the NLLB data to try to figure out why the training is failing.

(env) argosopentech@f775282a7f66:~/argos-train$ wget https://data.argosopentech.com/data-nllb-en_de.argosdata
--2025-12-03 15:35:18--  https://data.argosopentech.com/data-nllb-en_de.argosdata
Resolving data.argosopentech.com (data.argosopentech.com)... 194.135.93.195
Connecting to data.argosopentech.com (data.argosopentech.com)|194.135.93.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21173959272 (20G)
Saving to: ‘data-nllb-en_de.argosdata’

data-nllb-en_de.argosdata           100%[================================================================>]  19.72G  25.7MB/s    in 26m 37s 

2025-12-03 16:01:56 (12.6 MB/s) - ‘data-nllb-en_de.argosdata’ saved [21173959272/21173959272]

(env) argosopentech@f775282a7f66:~/argos-train$ unzip data-nllb-en_de.argosdata 
Archive:  data-nllb-en_de.argosdata
   creating: data-nllb-en_de/
  inflating: data-nllb-en_de/metadata.json  
  inflating: data-nllb-en_de/source  
  inflating: data-nllb-en_de/target  
  inflating: data-nllb-en_de/LICENSE  
  inflating: data-nllb-en_de/README 
(env) argosopentech@f775282a7f66:~/argos-train$ shuf -n 25 data-nllb-en_de/source 
I'd love the tutorial!
Well, at this point that says in 2018 and next years, a quick and dutiful farewell is necessary to Cicero, James with his beautiful Bond girls and all the HUMINT that has contributed and contributes to the security of the Nations?
Yet, Russia's foreign minister, Lavrov, said today that "we [the US and Russia] have no ideological differences which make the Cold War inevitable."
noise:queen — i want to break free
Accompanied by a guide from Maastricht Underground, you'll visit the Rijkskluis (national vault) in the marl caves where valuable Dutch art treasures were safely stored during the Second World War.
This chemical is also commonly found in a wide range of foods, smoke, household items and personal care products (21).
It was carried out mostly by the soldiers themselves, under the command of the head of the 20th engineering battalion, Maj. Juliusz Levittoux.
During Easter Week, egg-shaped samovars were put on the table.
In earlier years, it was traditional for the monarch to bestow a knighthood on newly appointed Canadian prime ministers.
Anne Ahira is actually a recognised entrepreneur and effective coach in her own country of Indonesia.
Indeed, "it is precisely at this early age that inclinations to vice or virtue are manifest."
Seven Signs Can Help You Recognize The Path With Heart:
‘Whenever civil resistance has shown the slightest signs of evolving from symbolic action into anything remotely threatening, the crackdown is merciless.
Her song "Not Following" was used by German singer Lena on her debut album My Cassette Player.
This might be our day.”
Google Helpouts are a great option.
11 Now there went with Absalom two hundred men out of Jerusalem that were called, going with simplicity of heart, and knowing nothing of the design.
The scope is to follow the newlyweds’ day from the beginning to the end, capturing those unique moments that would otherwise be lost forever.
Hold the imagery in your mind and open up to it, accepting anything that arises in your mind.
Who You Are Determines What You Write.
(AP) -- A cheese from Ohio has captured the top spot at the U.S. Championship Cheese Contest in Wisconsin.
You fall in love, you lose control."
Phil Spencer is “not a big fan of Xbox One and a Half”
But if I wanted to hide, my hair would be shorter and my nails wouldn’t be red.
He seeks to do evil because he is evil.

I used ChatGPT to generate this script sample.sh:

#!/usr/bin/env bash
# Pick 10 random line numbers from the source file
shuf -i 1-$(wc -l < data-nllb-en_de/source) -n 10 > lines.txt

# Print the source and target lines for each sampled line number
while read -r i; do
    src=$(sed -n "${i}p" data-nllb-en_de/source)
    tgt=$(sed -n "${i}p" data-nllb-en_de/target)
    echo "[$i] SRC: $src"
    echo "     TGT: $tgt"
    echo
done < lines.txt

Which returned this output:

(env) argosopentech@f775282a7f66:~/argos-train$ ./sample.sh
[221586271] SRC: Twenty-Four Preludes for Piano
     TGT: 24 Preludes für Piano
                                                                                                                                             
[14521845] SRC: With a little more luck than we had at Most, we should be successful.”
     TGT: Mit ein wenig mehr Glück als in Most sollte uns das auch gelingen."

[207143241] SRC: Konrad Dissertori: "I served as the vice-chairman of the ‘Verkehrs- und Verschönerungsverein’ and later as the director of the Tourism Association, which I headed until 1993.
     TGT: Ich war stellvertretender Obmann des Verkehrsund Verschönerungsvereines und dann Direktor des Tourismusvereins, den ich bis 1993 geführt habe.

[10369981] SRC: Citizens want to see results, but it remains to be seen whether the new Commission will be able to enact every-thing that Europeans hope for."
     TGT: Die Bürger wollen Ergebnisse sehen, doch es wird sich zeigen, ob die neue Kommission alles umsetzen kann, was sich die Europäer erhoffen."

[245963058] SRC: St Georges is the most beautiful area with a tiny fishing hamlet set on the edge of the Akamas Peninsula and an ideal location for nature lovers and those that enjoy the splendor of sunset views or to dine at a traditional tavern.
     TGT: St. George ist ein kleines Fischerdörfchen am Rande der Akamas-Halbinsel und ein idealer Ort für Naturliebhaber und diejenigen, die die wunderbaren Sonnenuntergänge in einer traditionellen zypriotischen Taverne beim Abendessen genießen möchten.

[74855006] SRC: Malolactic fermentation and aging in French oak barrels of 500 liters
     TGT: malolaktische Fermentierung und Elevations in Eichenfässern Französisch von 500 Litern

[149072163] SRC: You may have a single goal or you may have multiple.
     TGT: Sie können einschichtig oder mehrschichtig sein.

[10825941] SRC: Some people are there who celebrates the new year just because they can party out and enjoy their life.
     TGT: Es gibt Leute, die das neue Jahr feiern, nur weil sie feiern und ihr Leben genießen können.

[235400548] SRC: But the show had its troubling side.
     TGT: Dieses Schauspiel hatte seine erschreckende Seite.

[36737454] SRC: Support for cannabis legalization – both for medicinal uses and for recreational purposes – has grown significantly in recent years.
     TGT: Unterstützung für Cannabis Legalisierung - sowohl für medizinische Anwendungen und für Erholungszwecke - in den letzten Jahren deutlich gewachsen.

This looks like high-quality data to me, although I don't speak German.
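For future reference, a single-pass alternative to the per-line sed lookups, plus a quick sanity check that the two files line up:

# Both files should report the same number of lines
wc -l data-nllb-en_de/source data-nllb-en_de/target

# Sample aligned pairs in one pass (with -n, GNU shuf reservoir-samples,
# so this works even on a 20GB file)
paste data-nllb-en_de/source data-nllb-en_de/target | shuf -n 10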

2 Likes

I trained a new English->German model, and it has the same issue: it just clones the input and returns it untranslated. I'm going to try training one without the NLLB data next.

Edit: I started this training run:

(env) argosopentech@de77ee04ee88:~/argos-train$ git diff
diff --git a/data-index.json b/data-index.json
index 50dd93e..5cb7d5b 100644
--- a/data-index.json
+++ b/data-index.json
@@ -450,17 +450,6 @@
             "http://data.argosopentech.com/data-europarl-en_da.argosdata"
         ]
     },
-    {
-        "name": "NLLB",
-        "type": "data",
-        "from_code": "en",
-        "to_code": "de",
-        "size": 247470736,
-        "reference": "ai.meta.com/research/no-language-left-behind/",
-        "links": [
-            "https://data.argosopentech.com/data-nllb-en_de.argosdata"
-        ]
-    },
     {
         "name": "Wiktionary",
         "type": "data",

Update: Even with the NLLB data removed I’m still getting broken models.

1 Like

I’m trying a new training run removing all of these datasets to see what happens.

1 Like

I tried training one more model, but I don't think it worked very well either. I reviewed all of my training attempts: the English->German models seem to work, but there's an issue with both of the German->English models I tried to train:

  • English → German: seems to work
  • German2: English → German seems to work, but German → English mirrors the input
  • English → German (GermanTryThree): seems to work
  • English → German, no NLLB (GermanTryFour): seems to work
  • English → German, removed the new datasets but kept NLLB (GermanTryFive): seems to work
  • German → English, all datasets (GermanFinalTry): mirrors the input

I’m going to make these models available for download on Google Drive here if anyone else wants to take a look and compare them to the current production models.
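If anyone tries them, here's a minimal way to install and spot-check a downloaded package from the command line (the file name is a placeholder for whatever is in the Drive folder; install_from_path and translate are part of the argostranslate Python API):

python3 - <<'EOF'
import argostranslate.package
import argostranslate.translate

# Placeholder: use the actual .argosmodel file from the Drive folder
argostranslate.package.install_from_path("translate-de_en.argosmodel")

# A broken model tends to echo the German input back instead of translating
print(argostranslate.translate.translate("Der Himmel ist blau.", "de", "en"))
EOF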

1 Like

I thought maybe the issue with the German->English models could be sentence boundary detection, but it looks fine in the logs:

('sentences', ['Schwefel ist ein chemisches Element mit dem Elementsymbol S und der Ordnungszahl 16.', 'Er zählt zu den Chalkogenen in der sechzehnten Gruppe des Periodensystems.', 'In der Häufigkeit der in der Lithosphäre vorkommenden Elemente steht er an 16. Stelle.', 'Elementarer Schwefel ist ein bei 25 °C gelber, nichtmetallischer Feststoff, der eine Vielzahl allotroper Modifikationen bildet.', 'In der unbelebten Natur kommt er sowohl gediegen als auch in Form zahlreicher Mineralien vor, hierin vor allem als Sulfid, Disulfid und Sulfat, seltener als Sulfit.', 'Schwefelverbindungen sind auch Bestandteile aller Pflanzen, Tiere und Menschen, zum Beispiel als essentielle Aminosäuren und Coenzyme.', 'Auch Kohle und Erdöl enthalten daher Schwefelverbindungen.', 'In Mikroorganismen spielt Schwefel auch eine Rolle bei der anaeroben Energiegewinnung.', 'Den größten Teil des elementar gewonnenen oder in Raffinerien erzeugten Schwefels verwendet die chemische Industrie zur Herstellung von Schwefelsäure, einer der technisch wichtigsten und meistproduzierten Grundchemikalien.', 'Als Komponente des sauren Regens besitzen Schwefeloxide und verschiedene Schwefelsäuren erhebliche'])

Here’s a sample of text translated with the first model I trained in this thread and the current production LibreTranslate.com:

Source Wikipedia

Scott David Zolak (born December 13, 1967) is an American broadcaster and former professional football player. He played quarterback in the National Football League (NFL) for nine seasons, primarily with the New England Patriots. Over the course of his career, he played in 55 games, with 7 starts, for the Patriots and Miami Dolphins, completed 124 of 248 passes for 1,314 yards, threw eight touchdowns and seven interceptions, and finished his career with a passer rating of 64.8.

Zammad model

Scott David Zolak (* 13. Dezember 1967 in Los Angeles) ist ein US-amerikanischer Fußballspieler. Er spielte in der National Football League (NFL) für neun Saisons, vor allem mit den New England Patriots. Im Laufe seiner Karriere spielte er in 55 Spielen, mit 7 Starts, für die Patriots und Miami Dolphins, absolvierte 124 von 248 Pässe für 1.314 Yards, warf acht Touchdowns und sieben Abfangen, und beendete seine Karriere mit einer Passer-Bewertung von 64.8.

LibreTranslate.com

Scott David Zolak (* 13. Dezember 1967) ist ein US-amerikanischer Sender und ehemaliger Fußballprofi. Er spielte Quarterback in der National Football League (NFL) für neun Saisons, vor allem mit den New England Patriots. Im Laufe seiner Karriere spielte er in 55 Spielen mit 7 Starts für die Patriots und Miami Dolphins, absolvierte 124 von 248 Pässen für 1.314 Yards, warf acht Touchdowns und sieben Interceptions und beendete seine Karriere mit einer Passer-Bewertung von 64,8.
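For anyone reproducing this comparison, the LibreTranslate output above came from its HTTP API; against a self-hosted instance the request looks roughly like this (libretranslate.com itself requires an API key):

curl -s -X POST http://localhost:5000/translate \
    -H 'Content-Type: application/json' \
    -d '{"q": "Scott David Zolak (born December 13, 1967) is an American broadcaster and former professional football player.", "source": "en", "target": "de"}'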

1 Like

Is this still a problem? From what I read above, I'm trying to understand whether the mirroring happens on every input or only occasionally. What datasets are you using other than NLLB?

I uploaded a new pair of models, trained without the NLLB data, to the Google Drive in the “GermanNoNLLB” folder. This time the German->English model seems to work, but the English->German one is broken.

1 Like