Swahili Support

This is a thread to track Swahili (sw) support in Argos Translate.

I’ve converted and uploaded all of the datasets from Opus for en-sw and am going to try to train an en-sw and a sw-en model with them using Argos Train.

My en->sw model finished training overnight and I’ve started training the sw->en model now. Here’s some sample text from the en->sw model:

English Text (Wikipedia)

Portrayals of scientists were a favourite topic in 17th-century Dutch painting and Vermeer’s oeuvre includes both this astronomer and the slightly later The Geographer. Both are believed to portray the same man, possibly Antonie van Leeuwenhoek. A 2017 study indicated that the canvas for the two works came from the same bolt of material, confirming their close relationship. It has been proposed that Vermeer used a camera obscura as an aid to reconstruct the geometry of the rooms and the objects in his paintings. Both paintings portray the same room and furniture, slightly rearranged.

Swahili translation with Argos Translate en->sw

Portrayals ya wanasayansi walikuwa mada favorite katika mchoro wa Kiholanzi wa karne ya 17 na oeuvre ya Vermeer ni pamoja na astronomia hii na baadaye kidogo Geographer. Wawili hao wanasemekana kuwa na uhusiano wa kimapenzi na Antonie van Leeuwenhoek. Utafiti wa 2017 ulionyesha kuwa turuba za kazi hizo mbili zilitoka kwa bolt moja ya vifaa, kuthibitisha uhusiano wao wa karibu. Imependekezwa kwamba Vermeer alitumia kamera kama msaada wa kujenga upya jiometri ya vyumba na vitu katika uchoraji wake. Picha zote mbili zinaonyesha chumba kimoja na samani, kidogo kilichopangwa.

Back Translation with Google Translate

Portrayals of scientists were a favorite subject in 17th-century Dutch painting, and Vermeer’s oeuvre includes this astronomer and later the Geographer. The two are said to have had a romantic relationship with Antonie van Leeuwenhoek. A 2017 study showed that the canvases for both works came from the same bolt of equipment, confirming their close relationship. It has been suggested that Vermeer used the camera as an aid to reconstructing the geometry of the rooms and objects in his paintings. Both paintings show the same room and furniture, slightly arranged.

Conclusion

The model seems to be decently functional. I also tried some individual words and short phrases which it seemed to handle well.

This is awesome! Looking forward to testing both once they are ready. :clap:

I’ve made several attempts at training a sw->en model but they all just mirror the input text. I don’t know what the issue is. en->sw seemed to work fine.
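One quick way to confirm the mirroring across a test set, rather than by eye, is to measure how close each output is to its input. The following is a minimal sketch using only the standard library; the function names and the 0.9 threshold are assumptions for illustration, not part of Argos Train.

```python
from difflib import SequenceMatcher


def mirror_ratio(source: str, output: str) -> float:
    """Return a 0..1 character-level similarity between input and output."""
    return SequenceMatcher(None, source.lower(), output.lower()).ratio()


def mirrored_fraction(pairs, threshold=0.9):
    """Return the fraction of (source, output) pairs that are near-copies.

    `threshold` is an arbitrary cut-off chosen for this sketch.
    """
    if not pairs:
        return 0.0
    copies = sum(1 for src, out in pairs if mirror_ratio(src, out) >= threshold)
    return copies / len(pairs)


if __name__ == "__main__":
    samples = [
        ("Habari ya asubuhi", "Habari ya asubuhi"),  # output copied the input
        ("Asante sana", "Thank you very much"),      # output is a translation
    ]
    print(mirrored_fraction(samples))
```

A fraction close to 1.0 on sw->en output would point at the model copying its input rather than at a few bad sentences.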

I assume this was with argos-train? If you can share the config details, I can try to take a look.

I’ve started training sw=>en on the same dataset with Locomotive; I’ll post the results. The data looks OK.

python train.py --config en-sw.json --reverse
{
    "from": {
        "name": "English",
        "code": "en"
    },
    "to": {
        "name": "Swahili",
        "code": "sw"
    },
    "filters": [
        "duplicates",
        "digits_mismatch",
        "digits_ratio",
        "nonalphanum_ratio",
        "characters_count_mismatch",
        {"contains": {"words": ["http:", "http :", "&amp", "Âż"]}}
    ],
    "transforms": [
        "first_case_normalize",
        "remove_unpaired_quotes_and_brackets",
        {"remove_chars": {"chars": "|│"}}
    ],
    "version": "1.0",

    "sources": [
        "https://data.argosopentech.com/data-ccaligned-en_sw.argosdata",
        "https://data.argosopentech.com/data-ccmatrix-en_sw.argosdata",
        "https://data.argosopentech.com/data-globalvoices-en_sw.argosdata",
        "https://data.argosopentech.com/data-gnome-en_sw.argosdata",
        "https://data.argosopentech.com/data-nllb-en_sw.argosdata",
        "https://data.argosopentech.com/data-opensubtitles-en_sw.argosdata",
        "https://data.argosopentech.com/data-paracrawl-en_sw.argosdata",
        "https://data.argosopentech.com/data-ted2020-en_sw.argosdata",
        "https://data.argosopentech.com/data-tico19-en_sw.argosdata",
        "https://data.argosopentech.com/data-wikimatrix-en_sw.argosdata",
        "https://data.argosopentech.com/data-wikimedia-en_sw.argosdata",
        "https://data.argosopentech.com/data-xlent-en_sw.argosdata"
    ]
}
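To make the effect of `--reverse` concrete: the log for this run writes to `run/sw_en-1.0` and produces `translate-sw_en-1_0.argosmodel`, which suggests the language pair is swapped before names are derived. The sketch below reproduces only that naming pattern as inferred from the log; `reverse_config`, `run_name` and `package_name` are hypothetical helpers, not Locomotive functions.

```python
import json

# Trimmed copy of the config above; only the fields this sketch needs.
CONFIG = """
{
    "from": {"name": "English", "code": "en"},
    "to": {"name": "Swahili", "code": "sw"},
    "version": "1.0"
}
"""


def reverse_config(config: dict) -> dict:
    """Return a copy of the config with the source and target swapped."""
    reversed_config = dict(config)
    reversed_config["from"], reversed_config["to"] = config["to"], config["from"]
    return reversed_config


def run_name(config: dict) -> str:
    """Build a run directory name such as 'sw_en-1.0' (pattern inferred from the log)."""
    return f"{config['from']['code']}_{config['to']['code']}-{config['version']}"


def package_name(config: dict) -> str:
    """Build a package file name such as 'translate-sw_en-1_0.argosmodel' (inferred)."""
    version = config["version"].replace(".", "_")
    return f"translate-{config['from']['code']}_{config['to']['code']}-{version}.argosmodel"


if __name__ == "__main__":
    reversed_cfg = reverse_config(json.loads(CONFIG))
    print(run_name(reversed_cfg))
    print(package_name(reversed_cfg))
```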

So I’ve run training for 20,000 steps (just a test run) and got:

[2026-06-14 23:39:43,613 INFO] Step 19900/20000; acc: 57.5; ppl:  30.2; xent: 3.4; lr: 0.00106; sents:  114659; bsz: 4105/5343/287; 34400/44776 tok/s;  12522 sec;
[2026-06-14 23:40:28,867 INFO] Step 19950/20000; acc: 57.3; ppl:  30.6; xent: 3.4; lr: 0.00106; sents:  114780; bsz: 4144/5284/287; 36631/46710 tok/s;  12568 sec;
[2026-06-14 23:41:22,686 INFO] Step 20000/20000; acc: 58.0; ppl:  29.6; xent: 3.4; lr: 0.00106; sents:  117410; bsz: 4198/5355/294; 31200/39804 tok/s;  12621 sec;
[2026-06-14 23:41:31,967 INFO] valid stats calculation
                           took: 9.279984474182129 s.
[2026-06-14 23:42:04,234 INFO] The translation of the valid dataset for dynamic scoring
                               took : 32.26653695106506 s.
[2026-06-14 23:42:04,234 INFO] UPDATING VALIDATION BLEU
[2026-06-14 23:42:04,611 INFO] validation BLEU: 20.0751705568148
[2026-06-14 23:42:04,612 INFO] Train perplexity: 31.3108
[2026-06-14 23:42:04,613 INFO] Train accuracy: 56.4431
[2026-06-14 23:42:04,614 INFO] Sentences processed: 3.16746e+07
[2026-06-14 23:42:04,614 INFO] Average bsz: 4165/5368/305
[2026-06-14 23:42:04,615 INFO] Validation perplexity: 25.9221
[2026-06-14 23:42:04,615 INFO] Validation accuracy: 59.3329
[2026-06-14 23:42:04,615 INFO] Decreasing patience: 3/4
[2026-06-14 23:42:04,629 INFO] Saving checkpoint run/sw_en-1.0/opennmt/openmt.model_step_20000.pt
Total checkpoints: 17
Averaging 2 models
Converting to ctranslate2
Writing E:\Locomotive\run\sw_en-1.0\translate-sw_en-1_0.argosmodel
Done!
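
For comparing runs, the step lines in a log like this one can be pulled into numbers with a small regex. This is a sketch that assumes the line layout shown above; `parse_step_line` is a name made up for the example, and the pattern covers only the step, accuracy, perplexity, cross-entropy and learning-rate fields.

```python
import re

# Matches the "Step N/M; acc: ...; ppl: ...; xent: ...; lr: ..." part of a log line.
STEP_RE = re.compile(
    r"Step (?P<step>\d+)/(?P<total>\d+); acc:\s*(?P<acc>[\d.]+); "
    r"ppl:\s*(?P<ppl>[\d.]+); xent:\s*(?P<xent>[\d.]+); lr:\s*(?P<lr>[\d.]+)"
)


def parse_step_line(line: str):
    """Return the training metrics from one step line, or None if it is not one."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "total": int(match["total"]),
        "acc": float(match["acc"]),
        "ppl": float(match["ppl"]),
        "xent": float(match["xent"]),
        "lr": float(match["lr"]),
    }


if __name__ == "__main__":
    sample = (
        "[2026-06-14 23:41:22,686 INFO] Step 20000/20000; acc: 58.0; ppl:  29.6; "
        "xent: 3.4; lr: 0.00106; sents:  117410; bsz: 4198/5355/294; "
        "31200/39804 tok/s;  12621 sec;"
    )
    print(parse_step_line(sample))
```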

Back translation seems to work:

(venv) E:\Locomotive>python eval.py --config en-sw.json --reverse  --cpu
Starting interactive mode
(sw)> Portrayals ya wanasayansi walikuwa mada favorite katika mchoro wa Kiholanzi wa karne ya 17 na oeuvre ya Vermeer ni pamoja na astronomia hii na baadaye kidogo Geographer
(en)> Portrayals of scientists were the favorite topics in the 17th century Dutch diagram and the vermeer of Vermeer includes this astronomy and later Geographer
(sw)>

argosmodel => https://drive.google.com/file/d/1rBVypT_C8lleexxkzWOuPGRzQweCJl7z/view?usp=sharing

Note that I probably wouldn’t use this model as-is; it probably needs more training.

The full config.yml from Locomotive for OpenNMT:

accum_count: 8
accum_steps: 0
adam_beta2: 0.998
attention_dropout: 0.1
batch_size: 6144
batch_type: tokens
bucket_size: 262144
data:
  corpus_1:
    path_src: run/sw_en-1.0/src-train.txt
    path_tgt: run/sw_en-1.0/tgt-train.txt
    transforms:
    - sentencepiece
    - filtertoolong
    weight: 1
  valid:
    path_src: run/sw_en-1.0/src-val.txt
    path_tgt: run/sw_en-1.0/tgt-val.txt
    transforms:
    - sentencepiece
dec_layers: 6
decay_method: rsqrt
decoder_type: transformer
dropout: 0.1
dropout_steps: 0
early_stopping: 4
enc_layers: 6
encoder_type: transformer
gpu_ranks:
- 0
heads: 8
hidden_size: 512
keep_checkpoint: 10
label_smoothing: 0.1
learning_rate: 0.15
max_generator_batches: 2
max_grad_norm: 0
model_dtype: fp16
normalization: tokens
num_worker: 2
optim: adam
param_init: 0
param_init_glorot: true
position_encoding: true
queue_size: 10000
rnn_size: 512
save_checkpoint_steps: 1000
save_data: run/sw_en-1.0/opennmt
save_model: run/sw_en-1.0/opennmt/openmt.model
self_attn_type: scaled-dot
share_decoder_embeddings: true
share_embeddings: true
share_vocab: true
skip_empty_level: silent
src_onmttok_kwargs:
  lang: sw
  mode: none
src_seq_length: 150
src_subword_alpha: 0.0
src_subword_model: run/sw_en-1.0/sentencepiece.model
src_subword_nbest: 1
src_subword_type: sentencepiece
src_vocab: run/sw_en-1.0/opennmt/openmt.vocab
src_vocab_size: 50000
tgt_onmttok_kwargs:
  lang: en
  mode: none
tgt_seq_length: 150
tgt_subword_alpha: 0.0
tgt_subword_model: run/sw_en-1.0/sentencepiece.model
tgt_subword_nbest: 1
tgt_subword_type: sentencepiece
tgt_vocab: run/sw_en-1.0/opennmt/openmt.vocab
tgt_vocab_size: 50000
train_steps: 20000
transformer_ff: 2048
valid_batch_size: 2048
valid_metrics:
- BLEU
valid_steps: 500
warmup_steps: 16000
word_vec_size: 512
world_size: 1
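
As a sanity check on `learning_rate: 0.15`, `decay_method: rsqrt` and `warmup_steps: 16000`, the logged `lr: 0.00106` at step 20000 can be reproduced by hand. The sketch below assumes the rsqrt schedule scales the base rate by `1 / sqrt(max(step, warmup_steps))`; that formula is my reading of the schedule, and it agrees with the log, but treat it as an assumption rather than a quote of the OpenNMT-py source.

```python
import math


def rsqrt_learning_rate(step: int, base_lr: float = 0.15, warmup_steps: int = 16000) -> float:
    """Learning rate under an assumed rsqrt decay: flat during warmup, then 1/sqrt(step)."""
    return base_lr / math.sqrt(max(step, warmup_steps))


if __name__ == "__main__":
    for step in (1000, 16000, 19900, 20000):
        print(step, f"{rsqrt_learning_rate(step):.5f}")
```

Under this assumption the rate sits at about 0.00119 for the first 16,000 steps and has only dropped to about 0.00106 by step 20,000, so a 20,000-step test run spends most of its time at the flat warmup value.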

Environment: Python 3.12 (Windows)

Pip dependencies:

absl-py==2.4.0
annotated-doc==0.0.4
annotated-types==0.7.0
anyio==4.13.0
blinker==1.9.0
blis==1.3.3
catalogue==2.0.10
certifi==2026.5.20
cffi==2.0.0
charset-normalizer==3.4.7
click==8.4.1
cloudpathlib==0.24.0
colorama==0.4.6
confection==1.3.3
ConfigArgParse==1.7.5
cryptography==49.0.0
ctranslate2==4.8.0
cymem==2.0.13
emoji==2.15.0
fastshuffle==1.0.1
fasttext-wheel==0.9.2
filelock==3.29.4
Flask==3.1.3
fsspec==2026.4.0
google-auth==2.54.0
google-auth-oauthlib==1.0.0
grpcio==1.81.1
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
idna==3.18
iso639==0.1.4
itsdangerous==2.2.0
Jinja2==3.1.6
joblib==1.5.3
lxml==6.1.1
Markdown==3.10.2
markdown-it-py==4.2.0
MarkupSafe==3.0.3
mdurl==0.1.2
mock==5.2.0
mpmath==1.3.0
murmurhash==1.0.15
networkx==3.6.1
numpy==1.26.4
oauthlib==3.3.1
OpenNMT-py==3.5.1
packaging==26.2
portalocker==3.2.0
preshed==3.0.13
protobuf==7.35.1
pyahocorasick==2.3.1
pyasn1==0.6.3
pyasn1_modules==0.4.2
pybind11==3.0.4
pycparser==3.0
pydantic==2.13.4
pydantic_core==2.46.4
Pygments==2.20.0
pyonmttok==1.38.1
pywin32==312
PyYAML==6.0.1
RapidFuzz==3.14.5
regex==2026.5.9
removedup==1.0.6
requests==2.31.0
requests-oauthlib==2.0.0
rich==15.0.0
sacrebleu==2.3.1
sacremoses==0.0.53
sentencepiece==0.2.1
setuptools==82.0.1
shellingham==1.5.4
six==1.16.0
smart_open==7.6.1
spacy==3.8.14
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.5.3
stanza==1.10.1
subword-nmt==0.3.8
sympy==1.14.0
tabulate==0.10.0
tensorboard==2.14.0
tensorboard-data-server==0.7.2
thinc==8.3.13
torch==2.2.0+cu118
tqdm==4.68.2
typer==0.26.7
typing-inspection==0.4.2
typing_extensions==4.15.0
urllib3==2.7.0
waitress==3.0.2
wasabi==1.1.3
weasel==1.0.0
Werkzeug==3.1.8
wheel==0.47.0
wrapt==2.2.1
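
For anyone trying to reproduce this environment, a pinned list like the one above can be compared against what is actually installed. This is a minimal sketch using `importlib.metadata` from the standard library; `parse_pins` and `find_mismatches` are hypothetical helper names, and the pin text would be pasted in from the list.

```python
from importlib import metadata


def parse_pins(text: str) -> dict:
    """Parse 'name==version' lines into a {name: version} dict, skipping other lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: dict) -> dict:
    """Return {name: (wanted, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    pins = parse_pins("ctranslate2==4.8.0\nOpenNMT-py==3.5.1\ntorch==2.2.0+cu118\n")
    for name, (wanted, installed) in find_mismatches(pins).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```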

Hopefully this can help troubleshoot the problem in argos-train?

Training should also be redone, since I’ve updated Locomotive’s deduplication logic to match the suggestions in pull request #36 on LibreTranslate/Locomotive by lecoqnicolas ("Advanced filters, transforms, reliable deduplication, versatile SBD, updated NLLB list, eval with pivot and translation memoirs, more custom transformers, byte-fallback").

Thanks for trying this. I think there’s actually a bug in Argos Train where I broke reverse translations. I’m testing a fix now.

Sorry for this,

I’ve got a pair of French-Swahili models if you need them, trained from a bunch of French-Swahili and distilled English-Swahili data.

They’re not so good at translating sports content, but they more or less do the job otherwise. I could also try to train a Swahili<->English model pair from that English-Swahili data.

I found the bug:

The sw->en model seems to be working now. I’m going to try to test it this weekend and potentially publish it.

Swahili (sw) → English (en)

Kila mtu anaweza kuhariri makala yoyote, kutoa makosa ya lugha, kutohoa maneno na kuendeleza na kukuza makala, kwa kuandika kwa ufupi au kwa urefu. Tunakushauri ujiandikishe na kufungua akaunti. Angalia Wikipedia:Mwongozo ili ujifunze unavyoweza kuhariri au kusanifisha ukurasa. Usisite kushiriki; kwa sababu huu ni mradi wa ushirikiano, na hautaisha kamwe. Kamusi elezo itaimarika vizuri zaidi kila wakati. Angalia ukurasa wa jumuia, ukurasa wa jamii, na ule wa makala za msingi za kamusi elezo ili kung’amua unaweza kufanya kazi zipi ili kuboresha Wikipedia.

Everyone can edit any article, make language mistakes, not speak words and develop and promote articles, in short or in length. We encourage you to register and open an account. Check out Wikipedia: A guide to learn how you can edit or edit a page. Don’t hesitate to participate, because this is a collaborative project, and it will never end. You will get better every time. Check out the community page, social page, and those of the basic encyclopedia articles to find out what you can do to improve Wikipedia.

en → sw

On 24 June 2026, two[a] large strike-slip earthquakes affected northwestern and central Venezuela. The epicenters of both earthquakes were in San Felipe, Yaracuy. The first earthquake, which measured Mw 7.2, occurred at 18:04 VET, and was classified as a foreshock. It was followed 39 seconds later by a Mw 7.5 mainshock. The two earthquakes caused widespread damage across the country, particularly in La Guaira and Caracas. At least 920 people were killed, more than 4,500 were injured, and over 50,000 were reported missing. The United States Geological Survey (USGS) Prompt Assessment of Global Earthquakes for Response (PAGER) system predicted the death toll to rise significantly, potentially exceeding 100,000. The mainshock became the strongest in Venezuela since the 1900 San Narciso earthquake.

Tarehe 24 Juni 2026, matetemeko mawili makubwa ya ardhi yaliathiri kaskazini-magharibi na kati ya Venezuela. Matetemeko ya ardhi yalikuwa San Felipe, Yaracuy. Tetemeko la kwanza, ambalo lilipima Mw 7.2, lilitokea saa 18:04 VET, na liliainishwa kama kizuizi. Ilifuatiwa na sekunde 39 na kichwa cha Mw 7.5. Tetemeko hilo la ardhi lilisababisha uharibifu mkubwa nchini kote, hususan katika miji ya La Guaira na Caracas. Takriban watu 920 waliuawa, zaidi ya 4,500 walijeruhiwa, na zaidi ya 50,000 waliripotiwa kutoweka. Uchunguzi wa kijiolojia wa Marekani (USGS) wa haraka wa mfumo wa Global Earthquakes for Response (PAGER) ulitabiri idadi ya vifo kuongezeka kwa kiasi kikubwa, na uwezekano wa kuzidi 100,000. Mji wa San Narciso ulikuwa mkubwa zaidi nchini Venezuela tangu mwaka 1900.

This is the log I get for Sentence Boundary Detection with the latest Argos Translate:

('Splitting sentences using SBD Model: (sw) MiniSBDSentencizer',)
('sentences', ['Tarehe 24 Juni 2026, matetemeko mawili makubwa ya ardhi yaliathiri kaskazini-magharibi na kati ya Venezuela.', 'Matetemeko ya ardhi yalikuwa San Felipe, Yaracuy.', 'Tetemeko la kwanza, ambalo lilipima Mw 7.2, lilitokea saa 18:04 VET, na liliainishwa kama kizuizi.', 'Ilifuatiwa na sekunde 39 na kichwa cha Mw 7.5.', 'Tetemeko hilo la ardhi lilisababisha uharibifu mkubwa nchini kote, hususan katika miji ya La Guaira na Caracas.', 'Takriban watu 920 waliuawa, zaidi ya 4,500 walijeruhiwa, na zaidi ya 50,000 waliripotiwa kutoweka.', 'Uchunguzi wa kijiolojia wa Marekani (USGS) wa haraka wa mfumo wa Global Earthquakes for Response (PAGER) ulitabiri idadi ya vifo kuongezeka kwa kiasi kikubwa, na uwezekano wa kuzidi 100,000.', 'Mji wa San Narciso ulikuwa mkubwa zaidi nchini Venezuela tangu mwaka 1900.'])

Stanza doesn’t support Swahili, so I packaged the Stanza en model disguised as sw in the .argosmodel package. The log looks like it’s using MiniSBD, which is based on Stanza and so probably doesn’t support sw either, so maybe it’s falling back to en too.

Swahili is live!

English Text

Alan Greenspan (March 6, 1926 – June 22, 2026) was an American economist who served as the 13th chair of the Federal Reserve from 1987 to 2006. He worked as a private adviser and provided consulting for firms through his company, Greenspan Associates LLC.

Many have argued that the “easy-money” policies of the Fed during Greenspan’s tenure, including the practice known as the “Greenspan put”, were a leading cause of the dot-com bubble and subprime mortgage crisis (the latter occurring within a year of his leaving the Fed), which, said The Wall Street Journal, “tarnished his reputation”. Yale economist Robert Shiller argues that “once stocks fell, real estate became the primary outlet for the speculative frenzy that the stock market had unleashed”. Greenspan argued that the housing bubble was not a result of low-interest short-term rates but rather a worldwide phenomenon caused by the progressive decline in long-term interest rates – a direct consequence of the relationship between high savings rates in the developing world and its inverse in the developed world.

Translation using new Argos Translate model

Alan Greenspan (Machi 6, 1926 - 22 Juni 2026) alikuwa mwanauchumi wa Marekani ambaye aliwahi kuwa mwenyekiti wa 13 wa Hifadhi ya Shirikisho kutoka 1987 hadi 2006. Alifanya kazi kama mshauri binafsi na kutoa ushauri kwa makampuni kupitia kampuni yake, Greenspan Associates LLC.

Wengi wamehoji kuwa sera za “fedha rahisi” za Fed wakati wa umiliki wa Greenspan, ikiwa ni pamoja na mazoezi inayojulikana kama “Greenspan put”, zilikuwa sababu inayoongoza ya Bubble dot-com na shida ya mikopo ya chini (yaliyotokea ndani ya mwaka wake wa kuondoka Fed), ambayo, alisema Wall Street Journal, “alikuza sifa yake”. Mchumi wa Yale Robert Shiller anasema kwamba “mara tu hisa zilianguka, mali isiyohamishika ikawa msingi wa frenzy ya kubahatisha kwamba soko la hisa lilikuwa limepungua”. Greenspan alisema kuwa Bubble ya makazi sio matokeo ya viwango vya chini vya riba ya muda mfupi lakini ni jambo la kimataifa linalosababishwa na kupungua kwa kasi kwa viwango vya riba ya muda mrefu - matokeo ya moja kwa moja ya uhusiano kati ya viwango vya juu vya akiba katika ulimwengu unaoendelea na kinyume chake katika ulimwengu ulioendelea.

Back Translation with Argos Translate

Alan Greenspan (March 6, 1926 – June 22, 2026) was an American economist who served as the 13th chairman of the Federal Reserve from 1987 to 2006. He worked as a personal mentor and mentored companies through his company, Greenspan Associates LLC.

Many have argued that Fed’s “easy money” policies during Greenspan’s tenure, including practices known as “Greenspan put”, were the leading cause of the dot-com bubble and the low credit problem (which occurred within its year of leaving the Fed), which, said Wall Street Journal, “raised its reputation”. Yale economist Robert Shiller says that “as soon as the stocks fell, real estate became a frenzy based guess that the stock market had decreased.” Greenspan said that the housing bubble is not the result of low-cost interest rates but it is a global phenomenon caused by a rapid decline in high-end interest rates – a direct result of the relationship between higher savings rates in the developing world and vice versa in the developed world.

Yeah, MiniSBD will fall back to English in this case.
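
The fallback described here amounts to a lookup with a default. The sketch below is purely illustrative: `SUPPORTED_SBD_LANGS` is a placeholder set and `resolve_sbd_lang` is a made-up name, so neither reflects MiniSBD’s actual language list or code.

```python
# Placeholder set for illustration only; not MiniSBD's actual language list.
SUPPORTED_SBD_LANGS = {"en", "fr", "de", "es"}


def resolve_sbd_lang(lang_code: str, supported=SUPPORTED_SBD_LANGS, fallback: str = "en") -> str:
    """Return the language whose sentence splitter would be used for `lang_code`."""
    return lang_code if lang_code in supported else fallback


if __name__ == "__main__":
    print(resolve_sbd_lang("sw"))  # not in the placeholder set, so falls back
    print(resolve_sbd_lang("fr"))
```

For the samples in this thread the English fallback happened to split the Swahili text at every sentence-final period, but that is an observation about this one log, not a guarantee for sw in general.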
