Odd translation behavior: repeating words

Hi,

I want to begin this topic by thanking everyone who works on this amazing project! If this is the wrong place to report this issue, please let me know.

Recently I encountered the following problem with libretranslate/argos translation from Portuguese to English:

curl -v -d 'q=quase quase&target=en&source=pt' -X POST myserver…

And the response is:

{"translatedText":"almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost almost"}

I am using the LibreTranslate Docker image (version 1.5.0). What could be causing this issue, and is there a way to fix it? Please let me know if I can provide more information.

Thanks in advance


Thanks for your response!

If I understand correctly, I need to release a more recent model for Portuguese? Based on the Argos Open Tech package index, it seems that Portuguese is still at version 1.0. How do I know if a language uses the Wiktextract data? It seems that Wiktextract isn't used anymore? (I searched for references to generate-wiktionary-data in the repo.)

And if that is right, then I should generate it using argos-train (GitHub - argosopentech/argos-train: Training scripts for Argos Translate).

Is this right? (Sorry, it is my first time looking at this training data; I am still learning about it.)

Also filed this bug that some argosdata is inaccessible: DigitalOcean argosdata files cannot be accessed · Issue #35 · argosopentech/argos-train · GitHub

One other question:

Should I use the tokenized lines for training (the “mono” column)?

I am looking into generating a new model for the PT language to include the ParaCrawl and Wiktionary data (according to master/data-index.json, they are not included).

Thanks for finding this. I just pushed a fix.

Hi,

Thanks for your interest! Not that much effort has gone into the Portuguese model, so it would probably be possible to train a better one. If you do train a better Portuguese model, please contribute it and I can add it to the index. I generally don't use Wiktextract anymore because the data isn't very high quality.

You can train new models with Argos Train or Locomotive.

Selecting data from Opus

Argos Translate uses data from Opus in the “Moses” format.
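Concretely, the “Moses” format is a pair of line-aligned plain-text files, one sentence per line. A minimal reader sketch (the file names follow the Opus naming you can see in training logs, e.g. NLLB.en-pt.en / NLLB.en-pt.pt):

# Line i of the .en file is the translation of line i of the .pt file.
with open("NLLB.en-pt.en", encoding="utf-8") as f_en, \
     open("NLLB.en-pt.pt", encoding="utf-8") as f_pt:
    for en, pt in zip(f_en, f_pt):
        pair = (en.rstrip("\n"), pt.rstrip("\n"))  # one parallel sentence pair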

These are the largest datasets that Opus currently has for English-Portuguese. Here are my recommendations for which datasets to include:

Name                           Tokens   Recommend
NLLB v1                        3.8G     Yes
ParaCrawl v9                   1.6G     Yes
CCMatrix v1                    1.2G     No
CCAligned v1                   654.7M   No
OpenSubtitles v2018            248.9M   Yes
ELRC-EMEA v1                   0.8M     Yes
LinguaTools-WikiTitles v2014   23.0M    Yes
XLEnt v1.2                     18.7M    Yes
DGT v2019                      111.7M   Yes
WikiMatrix v1                  222.4M   No
EUbookshop v2                  179.5M   Yes
TildeMODEL v2018               100.4M   Yes
SciELO v1                      95.4M    Yes
Europarl v8                    61.1M    Yes
Wikipedia v1.0                 44.8M    Yes
JRC-Acquis v3.0                64.8M    Yes
CAPES v1                       39.1M    Yes
EMEA v3                        16.4M    Yes

It could also be worthwhile to train an es->pt model. Currently, if you want to translate Spanish to Portuguese with Argos Translate, it is translated es->en->pt, which isn't ideal because Spanish and Portuguese are very similar languages.
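For reference, here is what the pivot looks like from the Python API (a sketch; it assumes the es->en and en->pt packages are installed):

import argostranslate.translate

# With no direct es->pt package installed, this internally pivots es->en->pt.
print(argostranslate.translate.translate("Hola, ¿cómo estás?", "es", "pt"))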

Thanks for the help! Locomotive sounds very promising for a newbie like me :slight_smile: I tried using it, but I hit the following error:

$ python train.py --config en-pt-config.json --tensorboard
Training English --> Portuguese (1.1)
Sources: 14
Corrupted .zip file, redownloading /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843.zip
https://object.pouta.csc.fi/OPUS-CAPES/v1/moses/en-pt.txt.zip [100%]     
Extracting /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843.zip to /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843
https://object.pouta.csc.fi/OPUS-EMEA/v3/moses/en-pt.txt.zip [100%]     
Extracting /home/bruno/Desktop/Locomotive/cache/0d79d68a9ab037aa16888c445723dd3d.zip to /home/bruno/Desktop/Locomotive/cache/0d79d68a9ab037aa16888c445723dd3d
 - https://object.pouta.csc.fi/OPUS-NLLB/v1/moses/en-pt.txt.zip (hash:7976cb0 | weight:1)
 - https://object.pouta.csc.fi/OPUS-ParaCrawl/v9/moses/en-pt.txt.zip (hash:720f36b | weight:1)
 - https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/moses/en-pt.txt.zip (hash:803d6e9 | weight:1)
 - https://object.pouta.csc.fi/OPUS-ELRC-EMEA/v1/moses/en-pt.txt.zip (hash:e878016 | weight:1)
 - https://object.pouta.csc.fi/OPUS-LinguaTools-WikiTitles/v2014/moses/en-pt.txt.zip (hash:d527c9c | weight:1)
 - https://object.pouta.csc.fi/OPUS-XLEnt/v1.2/moses/en-pt.txt.zip (hash:52c0f5a | weight:1)
 - https://object.pouta.csc.fi/OPUS-EUbookshop/v2/moses/en-pt.txt.zip (hash:7f3b573 | weight:1)
 - https://object.pouta.csc.fi/OPUS-TildeMODEL/v2018/moses/en-pt.txt.zip (hash:cd826c2 | weight:1)
 - https://object.pouta.csc.fi/OPUS-SciELO/v1/moses/en-pt.txt.zip (hash:f48f89e | weight:1)
 - https://object.pouta.csc.fi/OPUS-Europarl/v8/moses/en-pt.txt.zip (hash:2f34d28 | weight:1)
 - https://object.pouta.csc.fi/OPUS-Wikipedia/v1.0/moses/en-pt.txt.zip (hash:debbe41 | weight:1)
 - https://object.pouta.csc.fi/OPUS-JRC-Acquis/v3.0/moses/en-pt.txt.zip (hash:1451206 | weight:1)
 - https://object.pouta.csc.fi/OPUS-CAPES/v1/moses/en-pt.txt.zip (hash:369ede1 | weight:1)
 - https://object.pouta.csc.fi/OPUS-EMEA/v3/moses/en-pt.txt.zip (hash:0d79d68 | weight:1)
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 64.5MB/s]              
2023-12-09 09:14:28 INFO: Downloading these customized packages for language: en (English)...
=======================
| Processor | Package |
-----------------------
| tokenize  | ewt     |
=======================

Downloading http://nlp.stanford.edu/software/stanza/1.1.0/en/tokenize/ewt.pt: 100%|███████████████████████| 631k/631k [00:00<00:00, 8.74MB/s]
2023-12-09 09:14:28 INFO: Finished downloading models and saved to /home/bruno/Desktop/Locomotive/run/en_pt-1.1/stanza.
Downloading flores200 dataset...
Wrote /home/bruno/Desktop/Locomotive/run/en_pt-1.1/src-val.txt
Wrote /home/bruno/Desktop/Locomotive/run/en_pt-1.1/tgt-val.txt
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /home/bruno/Desktop/Locomotive/cache/7976cb079cc3eb3f4bb601122b2511a5/NLLB.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/720f36b53ed6eaca1b7c7ff318ff4ef0/ParaCrawl.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/803d6e9518ca8550b1e9f1be6901f52d/OpenSubtitles.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/e8780165d3f4346430d77b3a6516706e/ELRC-EMEA.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/d527c9c660ca65084ef986260616f531/LinguaTools-WikiTitles.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/52c0f5a30f62a5d3497efa656eacd0d0/XLEnt.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/7f3b57309e253830146f5ab11c02f4d4/EUbookshop.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/cd826c2a5300ea74f33f7d4893646d7e/TildeMODEL.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/f48f89e440549427ea57e582ffa10535/SciELO.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/2f34d28d7a42dd9ff48a65ded8afe0c2/Europarl.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/debbe4100b212af08032982cb5524aa8/Wikipedia.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/14512066e78758d370056541ed29abde/JRC-Acquis.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843/CAPES.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/0d79d68a9ab037aa16888c445723dd3d/EMEA.en-pt.en
  input: /home/bruno/Desktop/Locomotive/cache/7976cb079cc3eb3f4bb601122b2511a5/NLLB.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/720f36b53ed6eaca1b7c7ff318ff4ef0/ParaCrawl.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/803d6e9518ca8550b1e9f1be6901f52d/OpenSubtitles.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/e8780165d3f4346430d77b3a6516706e/ELRC-EMEA.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/d527c9c660ca65084ef986260616f531/LinguaTools-WikiTitles.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/52c0f5a30f62a5d3497efa656eacd0d0/XLEnt.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/7f3b57309e253830146f5ab11c02f4d4/EUbookshop.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/cd826c2a5300ea74f33f7d4893646d7e/TildeMODEL.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/f48f89e440549427ea57e582ffa10535/SciELO.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/2f34d28d7a42dd9ff48a65ded8afe0c2/Europarl.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/debbe4100b212af08032982cb5524aa8/Wikipedia.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/14512066e78758d370056541ed29abde/JRC-Acquis.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843/CAPES.en-pt.pt
  input: /home/bruno/Desktop/Locomotive/cache/0d79d68a9ab037aa16888c445723dd3d/EMEA.en-pt.pt
  input_format: 
  model_prefix: /home/bruno/Desktop/Locomotive/run/en_pt-1.1/sentencepiece
  model_type: UNIGRAM
  vocab_size: 50000
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 1000000
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}
denormalizer_spec {}
trainer_interface.cc(351) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/bruno/Desktop/Locomotive/cache/7976cb079cc3eb3f4bb601122b2511a5/NLLB.en-pt.en
trainer_interface.cc(145) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 2000000 lines

...

trainer_interface.cc(145) LOG(INFO) Loaded 639000000 lines
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/bruno/Desktop/Locomotive/cache/debbe4100b212af08032982cb5524aa8/Wikipedia.en-pt.pt
trainer_interface.cc(145) LOG(INFO) Loaded 640000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 641000000 lines
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/bruno/Desktop/Locomotive/cache/14512066e78758d370056541ed29abde/JRC-Acquis.en-pt.pt
trainer_interface.cc(145) LOG(INFO) Loaded 642000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 643000000 lines
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843/CAPES.en-pt.pt
trainer_interface.cc(145) LOG(INFO) Loaded 644000000 lines
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/bruno/Desktop/Locomotive/cache/0d79d68a9ab037aa16888c445723dd3d/EMEA.en-pt.pt
trainer_interface.cc(145) LOG(INFO) Loaded 645000000 lines
trainer_interface.cc(409) LOG(INFO) Sampled 1000000 sentences from 645435140 sentences.
trainer_interface.cc(414) LOG(INFO) Skipped 591 too long sentences.
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(428) LOG(INFO) Normalizing sentences...
trainer_interface.cc(537) LOG(INFO) all chars count=99032336
trainer_interface.cc(548) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=1451
trainer_interface.cc(559) LOG(INFO) Final character coverage=1
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 999999 sentences.
unigram_model_trainer.cc(222) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(226) LOG(INFO) Extracting frequent sub strings... node_num=48319884
unigram_model_trainer.cc(274) LOG(INFO) Initialized 1001451 seed sentencepieces
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 999999
trainer_interface.cc(608) LOG(INFO) Done! 946659
unigram_model_trainer.cc(564) LOG(INFO) Using 946659 sentences for EM training
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=372312 obj=12.3229 num_tokens=2076375 num_tokens/piece=5.57698
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=329770 obj=9.88506 num_tokens=2087466 num_tokens/piece=6.33007
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=247304 obj=9.86642 num_tokens=2169589 num_tokens/piece=8.77296
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=247138 obj=9.8551 num_tokens=2170183 num_tokens/piece=8.78126
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=185352 obj=9.91244 num_tokens=2284923 num_tokens/piece=12.3275
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=185347 obj=9.89977 num_tokens=2284917 num_tokens/piece=12.3278
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=139010 obj=9.97952 num_tokens=2411614 num_tokens/piece=17.3485
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=139010 obj=9.96461 num_tokens=2411552 num_tokens/piece=17.348
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=104257 obj=10.0664 num_tokens=2545622 num_tokens/piece=24.4168
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=104257 obj=10.0481 num_tokens=2545638 num_tokens/piece=24.417
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=78192 obj=10.1762 num_tokens=2683771 num_tokens/piece=34.3228
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=78192 obj=10.153 num_tokens=2683913 num_tokens/piece=34.3246
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=58644 obj=10.3113 num_tokens=2826789 num_tokens/piece=48.2025
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=58644 obj=10.2812 num_tokens=2826943 num_tokens/piece=48.2052
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=55000 obj=10.318 num_tokens=2858431 num_tokens/piece=51.9715
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=55000 obj=10.3112 num_tokens=2858560 num_tokens/piece=51.9738
trainer_interface.cc(686) LOG(INFO) Saving model: /home/bruno/Desktop/Locomotive/run/en_pt-1.1/sentencepiece.model
trainer_interface.cc(698) LOG(INFO) Saving vocabs: /home/bruno/Desktop/Locomotive/run/en_pt-1.1/sentencepiece.vocab
Wrote /home/bruno/Desktop/Locomotive/run/en_pt-1.1/config.yml
Converting /home/bruno/Desktop/Locomotive/run/en_pt-1.1/sentencepiece.vocab
Traceback (most recent call last):
  File "/home/bruno/Desktop/Locomotive/train.py", line 346, in <module>
    sp_vocab_to_onmt_vocab(sp_vocab_file, onmt_vocab_file)
  File "/home/bruno/Desktop/Locomotive/onmt_tools.py", line 51, in sp_vocab_to_onmt_vocab
    w, c = line.rstrip("\n").split(None, 1)
ValueError: not enough values to unpack (expected 2, got 1)

My config.json looks like this:

{
    "from": {
        "name": "English",
        "code": "en"
    },
    "to": {
        "name": "Portuguese",
        "code": "pt"
    },
    "version": "1.1",
    "sources": [
        "opus://NLLB",
        "opus://ParaCrawl",
        "opus://OpenSubtitles",
        "opus://ELRC-EMEA",
        "opus://LinguaTools-WikiTitles",
        "opus://XLEnt",
        "opus://EUbookshop",
        "opus://TildeMODEL",
        "opus://SciELO",
        "opus://Europarl",
        "opus://Wikipedia",
        "opus://JRC-Acquis",
        "opus://CAPES",
        "opus://EMEA"
    ]   
}

Thanks in advance


You didn’t do anything wrong; this is a known issue that can happen when using SentencePiece.

What I think is happening here is that when the SentencePiece tokenizer runs, it selects the '\r' character as a token. This can then break the parsing of the sentencepiece.vocab file.

Correct sentencepiece.vocab
▁Dia    -8.0681
▁who    -8.06953
▁high   -8.07295
ra      -8.07622
ka      -8.08309
▁He     -8.08323
▁New    -8.08698
▁So     -8.08781
ru      -8.08827
Broken sentencepiece.vocab
▁Dia    -8.0681
▁who    -8.06953
▁high   -8.07295
ra      -8.07622
<carriage return (ascii 13)>  -8.08827
▁He     -8.08323
▁New    -8.08698
▁So     -8.08781
ru      -8.08827

I merged a fix in OpenNMT-py v2 but it’s broken again in v3.

You can run this program to test your sentencepiece.vocab file:

# Check that every line parses as a "<token> <score>" pair.
VOCAB_FILE = 'sentencepiece.vocab'
lines = open(VOCAB_FILE, encoding='utf-8').readlines()
for i, line in enumerate(lines):
    split = line.split()
    assert len(split) == 2, f'Bad line {i}: {line!r}'
    assert split[1][-1].isdigit(), f'Bad score on line {i}: {line!r}'
print(f'Checked {len(lines)} lines')

Reference


Thanks! I removed the offending line from the vocab. For future reference, I also removed the if on line 344 of train.py so that it regenerates the onmt vocab file (if not os.path.isfile(onmt_vocab_file):). Training the model now! I will update here when I have a result.
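For anyone else hitting this, the cleanup amounts to something like the following (a sketch that just drops vocab lines that don't split into a token/score pair):

# Rewrite sentencepiece.vocab without lines whose token is itself
# whitespace (e.g. a bare '\r'), which is what broke the conversion.
with open("sentencepiece.vocab", encoding="utf-8") as fin:
    lines = fin.readlines()
with open("sentencepiece.vocab", "w", encoding="utf-8") as fout:
    for line in lines:
        if len(line.rstrip("\n").split(None, 1)) == 2:
            fout.write(line)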

One note: TensorBoard seems to be empty even though the model is training; it shows “No dashboards are active for the current data set.”


I made a pull request to fix this in Locomotive:

New Implementation

import math

from onmt.constants import DefaultTokens  # OpenNMT-py's special-token names


def sp_vocab_to_onmt_vocab(sp_vocab, onmt_vocab):
    print(f"Converting {sp_vocab}")
    with open(sp_vocab, 'r', encoding="utf-8") as fin:
        with open(onmt_vocab, 'wb') as fout:
            OMIT = (DefaultTokens.UNK, DefaultTokens.BOS, DefaultTokens.EOS)
            for line in fin:
                # Skip lines that don't parse into a (token, log-prob) pair,
                # e.g. when the token is itself whitespace such as '\r'.
                line_and_freq = line.rstrip("\n").split(None, 1)
                if len(line_and_freq) != 2:
                    continue
                w, c = line_and_freq
                if w in OMIT:
                    continue
                # Turn the SentencePiece log-probability into a positive
                # integer pseudo-frequency for the OpenNMT vocab format.
                c = math.exp(float(c)) * 1000000
                c = int(c) + 1
                fout.write(f'{w}\t{c}\n'.encode("utf-8"))
    print(f"Wrote {onmt_vocab}")

Old Implementation

def sp_vocab_to_onmt_vocab(sp_vocab, onmt_vocab):
    print(f"Converting {sp_vocab}")
    with open(sp_vocab, 'r', encoding="utf-8") as fin:
        with open(onmt_vocab, 'wb') as fout:
            OMIT = (DefaultTokens.UNK, DefaultTokens.BOS, DefaultTokens.EOS)
            for line in fin:
                w, c = line.rstrip("\n").split(None, 1)
                if w in OMIT:
                    continue
                c = math.exp(float(c)) * 1000000
                c = int(c) + 1
                fout.write(f'{w}\t{c}\n'.encode("utf-8"))
    print(f"Wrote {onmt_vocab}")

In the old implementation, when the SentencePiece token is itself whitespace (like '\r'), the split returns only one value and the tuple unpacking fails.

This checks that the line was parsed correctly and skips the line otherwise:

line_and_freq = line.rstrip("\n").split(None, 1)
if len(line_and_freq) != 2:
    continue
w, c = line_and_freq

Thanks again! I also want to train a new model for English to Spanish; should I use the same recommended data sources? One more model I would like to improve is the English to Polish one, but it is already at version 1.9; would that be good enough?


I would use essentially the same data sources. I try to use as many different data sources from Opus as possible but exclude some of the smaller ones because they’re not worth the effort. I normally exclude CCMatrix if there are other better options available because it’s very large and of mediocre quality.

One more model I would like to improve is the English to Polish one, but it seems to be at version 1.9 and that would be good enough?

The 1.9 models are from Opus-MT and work very well in my tests. I would recommend focusing on other languages, but if you can make a model better than the Opus-MT ones, that's awesome.

Got it, thanks. I am almost done training en->pt and I have some questions:

  • I accidentally stopped training at step 41000 (out of 50000). When running again, it resumes from step ~9000. Is this correct?
  • From step 41000 I generated a model (using the --inflight flag); the BLEU score was 68.23786, accuracy was around 73, and perplexity around 10. Is this going in the right direction quality-wise?
  • How can I compare with the 1.0 model? (The eval seems to use the training output files rather than the .argosmodel package.)

Thank you

An update: the models have finished training:

The BLEU scores were ~68 (en_pt) and ~58 (pt_en); oddly, the pt_en model seems to perform worse than the en_pt one (I just used the --reverse flag).

When evaluating manually, the error for “quase quase” still persists: it outputs “almost almost almost almost almost almost…” repeated many times.

Trying some samples, it does seem to translate well, but I am not sure how I can compare it with the current model in an objective way.
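For example, I could score both models' output on the same validation set (a sketch using sacrebleu; the out-*.txt file names are placeholders for each model's translations of src-val.txt):

import sacrebleu

# tgt-val.txt is the flores200 reference Locomotive wrote during training.
refs = open("tgt-val.txt", encoding="utf-8").read().splitlines()
for name in ("out-1.0.txt", "out-1.1.txt"):  # hypothetical output files
    hyps = open(name, encoding="utf-8").read().splitlines()
    print(name, sacrebleu.corpus_bleu(hyps, [refs]).score)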

Any feedback or ideas?

Thanks @argosopentech and @pierotofy


Awesome thanks for training these!

These models look good. They didn't work with Argos Translate out of the box, but they worked after I unzipped and re-zipped the model directory. I'm not sure exactly what the issue was or why re-zipping fixed it, but they're working now.

I got this error initially:

Traceback (most recent call last):
  File "/home/pj/Downloads/env/lib/python3.10/site-packages/argostranslategui/gui.py", line 396, in load_languages
    self.languages = translate.load_installed_languages()
  File "/home/pj/Downloads/env/lib/python3.10/site-packages/argostranslate/translate.py", line 636, in load_installed_languages
    return get_installed_languages()
  File "/home/pj/Downloads/env/lib/python3.10/site-packages/argostranslate/translate.py", line 521, in get_installed_languages
    packages = package.get_installed_packages()
  File "/home/pj/Downloads/env/lib/python3.10/site-packages/argostranslate/package.py", line 327, in get_installed_packages
    to_return.append(Package(path))
  File "/home/pj/Downloads/env/lib/python3.10/site-packages/argostranslate/package.py", line 193, in __init__
    raise FileNotFoundError(
FileNotFoundError: Error opening package at /home/pj/.local/share/argos-translate/packages/stanza/metadata.json no metadata.json
Aborted (core dumped)
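The workaround was roughly this (a sketch; model.argosmodel is a placeholder for the actual package file name, and .argosmodel files are plain zip archives):

import os
import shutil
import zipfile

# Extract the package and re-zip the extracted tree; whatever was off
# about the original archive layout, the repacked file installed fine.
with zipfile.ZipFile("model.argosmodel") as z:
    z.extractall("model_unpacked")
shutil.make_archive("model_repacked", "zip", "model_unpacked")
os.replace("model_repacked.zip", "model.argosmodel")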

Here are some translation samples:

en → pt

English Source Text (Wikipedia)

Hector Berlioz (11 December 1803 – 8 March 1869) was a French Romantic composer. His output includes orchestral works such as Harold in Italy, choral pieces including his Requiem and L’enfance du Christ, and works of hybrid genres such as the “dramatic symphony” Roméo et Juliette and the “dramatic legend” La damnation de Faust. Expected to enter medicine, Berlioz defied his family by taking up music, and won the Prix de Rome in 1830. Berlioz married the Irish Shakespearean actress Harriet Smithson, who inspired his first major success, the Symphonie fantastique, in which an idealised depiction of her occurs throughout. His first opera, Benvenuto Cellini, was a failure. The second, the epic Les Troyens, was so large in scale that it was never staged in its entirety during his lifetime. Meeting only occasional success in France as a composer, Berlioz turned to conducting, in which he gained an international reputation. He also wrote musical journalism throughout much of his career.

1.1 (Proposed) Translation

Texto para traduzir de Hector Berlioz (11 de dezembro de 1803 – 8 de março de 1869) foi um compositor francês. Sua saída inclui obras orquestrais como Harold em Itália, peças coral, incluindo seu Requiem e L’enfance du Cristo, e obras de gêneros híbridos como a “fínfoniadramática” Roméo et Juliette e a " lendadramática" La damnation de Faust. Esperava entrar em medicina, Berlioz defiou sua família tomando música e ganhou o Prix de Roma em 1830. Berlioz casou-se com a atriz irlandesa Shakespearean Harriet Smithson, que inspirou seu primeiro grande sucesso, a fantastique Symphonie, na qual ocorre uma representação idealizada de sua atriz. Sua primeira ópera, Benvenuto Cellini, foi uma falha. O segundo, o épico Les Troyens, foi tão grande em escala que nunca foi palco em sua totalidade durante sua vida. Encontro apenas um sucesso ocasional na França como compositor, Berlioz voltou a conduzir, em que ganhou uma reputação internacional. Ele também escreveu jornalismo musical em grande parte de sua carreira.

1.0 (Prod) Translation

Hector Berlioz (11 de dezembro de 1803 - 8 de março de 1869) foi um compositor romântico francês. Sua produção inclui obras orquestrais como Haroldo na Itália, peças corais incluindo sua Requiem e L’enfance du Christ, e obras de gêneros híbridos como a “sinfonia dramática” Roméo et Juliette e a " legenda dramática" La Damnation de Faust. Esperava entrar na medicina, Berlioz desafiou sua família ao tomar música e ganhou o Prix de Roma em 1830. Berlioz casou-se com a atriz irlandesa de Shakespeare Harriet Smithson, que inspirou seu primeiro grande sucesso, a fantasia de Symphonie, na qual uma representação idealizada dela ocorre em toda parte. Sua primeira ópera, Benvenuto Cellini, foi um fracasso. O segundo, o épico Les Troyens, era tão grande em escala que nunca foi encenado em sua totalidade durante sua vida. Reunindo apenas sucesso ocasional na França como compositor, Berlioz virou-se para conduzir, em que ganhou uma reputação internacional. Ele também escreveu jornalismo musical em grande parte de sua carreira.

pt → en

Portuguese Source Text (Wikipedia)

Mohammed Ould Abdel Aziz (Akjoujt, 20 de dezembro de 1956) é um político e foi o 8.º presidente da Mauritânia entre 2009 a 2019.[1] Soldado de carreira e oficial de alta patente, foi destaque durante o golpe em agosto de 2005 que depôs o presidente Maaouya Ould Sid’Ahmed Taya, e liderou o golpe em agosto de 2008, que derrubou o presidente Sidi Ould Cheikh Abdallahi. Após o golpe de 2008, Abdel Aziz tornou-se Presidente do Conselho Superior de Estado como parte do que foi descrito como uma transição política que conduziu a uma nova eleição. Renunciou ao cargo em abril de 2009 para se apresentar como candidato nas eleições presidenciais de julho de 2009, saindo eleito. Foi empossado em 5 de agosto de 2009. Posteriormente, foi reeleito em 2014 e não buscou a reeleição em 2019. Foi sucedido por Mohamed Ould Ghazouani, que assumiu o cargo em 1 de agosto de 2019.

1.1 (Proposed) Translation

Mohammed Ould Abdel Aziz (born December 20, 1956) is a politician, and was the 8th president of Mauritania between 2009 and 2019.[1] High patent career and official soldier, was highlighted during the coup in August 2005 which led President Maaouya Ould Sid’Ahmed Taya, and led the coup in August 2008, which broke down President Sidi Ould Cheikh Abdallahi. After the 2008 coup, Abdel Aziz became President of the Higher Council of State as part of what was described as a political transition that led to a new election. He denounced the position in April 2009 to present himself as a candidate in the presidential elections of July 2009, leaving elected. On 5 August 2009. Subsequently, it was re-elected in 2014 and did not seek re-election in 2019. It was succeeded by Mohamed Ould Ghazouani, who took office on 1 August 2019.

1.0 (Prod) Translation

Mohammed Ould Abdel Aziz (December 20, 1956) is a politician and was the 8th president of Mauritania from 2009 to 2019.[1] High-ranking officer and career soldier, was featured during the coup in August 2005 which deposed President Maaouya Ould Sid’Ahmed Taya, and led the coup in August 2008, which ousted President Sidi Ould. After the 2008 coup, Abdel Aziz became President of the Superior Council of State as part of what was described as a political transition that led to a new election. He resigned in April 2009 to appear as a candidate in the July 2009 presidential election, leaving elected. It was retired on 5 August 2009. He was re-elected in 2014 and did not seek re-election in 2019. It was succeeded by Mohamed Ould Ghazouani, who took office on 1 August 2019.

Side note: I’ve just merged “Workaround for salad” (Pull Request #554 · LibreTranslate/LibreTranslate · GitHub), which should help mitigate this issue for single-word translations across all models.


Thank you @pierotofy and @argosopentech! A few more questions:

  • When I stopped training before step 50000 and then resumed (by executing the train command again), it resumed from step 9000. Is that normal?
  • Does the 1.0 model have a BLEU score and/or accuracy and perplexity for comparison? What is the process now for updating the existing model on the argosopentech website (the one loaded by default by LibreTranslate)?

I’m not sure exactly; this would depend on the interaction between Locomotive and OpenNMT-py.

I don’t have any BLEU scores for the 1.0 model. BLEU scores aren’t very reliable in my experience, so I don’t use them much.

Are you a Portuguese speaker? Do you think the 1.1 model is an improvement over the 1.0 one? I can push a commit to the argospm-index repo to update the prod model.