Thanks for the help! Locomotive sounds very promising for a newbie like me. I tried using it, but I ran into the following error:
$ python train.py --config en-pt-config.json --tensorboard
Training English --> Portuguese (1.1)
Sources: 14
Corrupted .zip file, redownloading /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843.zip
https://object.pouta.csc.fi/OPUS-CAPES/v1/moses/en-pt.txt.zip [100%]
Extracting /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843.zip to /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843
https://object.pouta.csc.fi/OPUS-EMEA/v3/moses/en-pt.txt.zip [100%]
Extracting /home/bruno/Desktop/Locomotive/cache/0d79d68a9ab037aa16888c445723dd3d.zip to /home/bruno/Desktop/Locomotive/cache/0d79d68a9ab037aa16888c445723dd3d
- https://object.pouta.csc.fi/OPUS-NLLB/v1/moses/en-pt.txt.zip (hash:7976cb0 | weight:1)
- https://object.pouta.csc.fi/OPUS-ParaCrawl/v9/moses/en-pt.txt.zip (hash:720f36b | weight:1)
- https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/moses/en-pt.txt.zip (hash:803d6e9 | weight:1)
- https://object.pouta.csc.fi/OPUS-ELRC-EMEA/v1/moses/en-pt.txt.zip (hash:e878016 | weight:1)
- https://object.pouta.csc.fi/OPUS-LinguaTools-WikiTitles/v2014/moses/en-pt.txt.zip (hash:d527c9c | weight:1)
- https://object.pouta.csc.fi/OPUS-XLEnt/v1.2/moses/en-pt.txt.zip (hash:52c0f5a | weight:1)
- https://object.pouta.csc.fi/OPUS-EUbookshop/v2/moses/en-pt.txt.zip (hash:7f3b573 | weight:1)
- https://object.pouta.csc.fi/OPUS-TildeMODEL/v2018/moses/en-pt.txt.zip (hash:cd826c2 | weight:1)
- https://object.pouta.csc.fi/OPUS-SciELO/v1/moses/en-pt.txt.zip (hash:f48f89e | weight:1)
- https://object.pouta.csc.fi/OPUS-Europarl/v8/moses/en-pt.txt.zip (hash:2f34d28 | weight:1)
- https://object.pouta.csc.fi/OPUS-Wikipedia/v1.0/moses/en-pt.txt.zip (hash:debbe41 | weight:1)
- https://object.pouta.csc.fi/OPUS-JRC-Acquis/v3.0/moses/en-pt.txt.zip (hash:1451206 | weight:1)
- https://object.pouta.csc.fi/OPUS-CAPES/v1/moses/en-pt.txt.zip (hash:369ede1 | weight:1)
- https://object.pouta.csc.fi/OPUS-EMEA/v3/moses/en-pt.txt.zip (hash:0d79d68 | weight:1)
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 64.5MB/s]
2023-12-09 09:14:28 INFO: Downloading these customized packages for language: en (English)...
=======================
| Processor | Package |
-----------------------
| tokenize | ewt |
=======================
Downloading http://nlp.stanford.edu/software/stanza/1.1.0/en/tokenize/ewt.pt: 100%|███████████████████████| 631k/631k [00:00<00:00, 8.74MB/s]
2023-12-09 09:14:28 INFO: Finished downloading models and saved to /home/bruno/Desktop/Locomotive/run/en_pt-1.1/stanza.
Downloading flores200 dataset...
Wrote /home/bruno/Desktop/Locomotive/run/en_pt-1.1/src-val.txt
Wrote /home/bruno/Desktop/Locomotive/run/en_pt-1.1/tgt-val.txt
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
input: /home/bruno/Desktop/Locomotive/cache/7976cb079cc3eb3f4bb601122b2511a5/NLLB.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/720f36b53ed6eaca1b7c7ff318ff4ef0/ParaCrawl.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/803d6e9518ca8550b1e9f1be6901f52d/OpenSubtitles.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/e8780165d3f4346430d77b3a6516706e/ELRC-EMEA.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/d527c9c660ca65084ef986260616f531/LinguaTools-WikiTitles.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/52c0f5a30f62a5d3497efa656eacd0d0/XLEnt.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/7f3b57309e253830146f5ab11c02f4d4/EUbookshop.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/cd826c2a5300ea74f33f7d4893646d7e/TildeMODEL.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/f48f89e440549427ea57e582ffa10535/SciELO.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/2f34d28d7a42dd9ff48a65ded8afe0c2/Europarl.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/debbe4100b212af08032982cb5524aa8/Wikipedia.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/14512066e78758d370056541ed29abde/JRC-Acquis.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843/CAPES.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/0d79d68a9ab037aa16888c445723dd3d/EMEA.en-pt.en
input: /home/bruno/Desktop/Locomotive/cache/7976cb079cc3eb3f4bb601122b2511a5/NLLB.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/720f36b53ed6eaca1b7c7ff318ff4ef0/ParaCrawl.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/803d6e9518ca8550b1e9f1be6901f52d/OpenSubtitles.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/e8780165d3f4346430d77b3a6516706e/ELRC-EMEA.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/d527c9c660ca65084ef986260616f531/LinguaTools-WikiTitles.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/52c0f5a30f62a5d3497efa656eacd0d0/XLEnt.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/7f3b57309e253830146f5ab11c02f4d4/EUbookshop.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/cd826c2a5300ea74f33f7d4893646d7e/TildeMODEL.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/f48f89e440549427ea57e582ffa10535/SciELO.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/2f34d28d7a42dd9ff48a65ded8afe0c2/Europarl.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/debbe4100b212af08032982cb5524aa8/Wikipedia.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/14512066e78758d370056541ed29abde/JRC-Acquis.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843/CAPES.en-pt.pt
input: /home/bruno/Desktop/Locomotive/cache/0d79d68a9ab037aa16888c445723dd3d/EMEA.en-pt.pt
input_format:
model_prefix: /home/bruno/Desktop/Locomotive/run/en_pt-1.1/sentencepiece
model_type: UNIGRAM
vocab_size: 50000
self_test_sample_size: 0
character_coverage: 1
input_sentence_size: 1000000
shuffle_input_sentence: 1
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 4192
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 1
split_by_number: 1
split_by_whitespace: 1
split_digits: 0
pretokenization_delimiter:
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 0
required_chars:
byte_fallback: 0
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 0
hard_vocab_limit: 1
use_all_vocab: 0
unk_id: 0
bos_id: 1
eos_id: 2
pad_id: -1
unk_piece: <unk>
bos_piece: <s>
eos_piece: </s>
pad_piece: <pad>
unk_surface: ⁇
enable_differential_privacy: 0
differential_privacy_noise_level: 0
differential_privacy_clipping_threshold: 0
}
normalizer_spec {
name: nmt_nfkc
add_dummy_prefix: 1
remove_extra_whitespaces: 1
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(351) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/bruno/Desktop/Locomotive/cache/7976cb079cc3eb3f4bb601122b2511a5/NLLB.en-pt.en
trainer_interface.cc(145) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 2000000 lines
...
trainer_interface.cc(145) LOG(INFO) Loaded 639000000 lines
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/bruno/Desktop/Locomotive/cache/debbe4100b212af08032982cb5524aa8/Wikipedia.en-pt.pt
trainer_interface.cc(145) LOG(INFO) Loaded 640000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 641000000 lines
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/bruno/Desktop/Locomotive/cache/14512066e78758d370056541ed29abde/JRC-Acquis.en-pt.pt
trainer_interface.cc(145) LOG(INFO) Loaded 642000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 643000000 lines
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/bruno/Desktop/Locomotive/cache/369ede1b2b2c69007995e80f2f5d4843/CAPES.en-pt.pt
trainer_interface.cc(145) LOG(INFO) Loaded 644000000 lines
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/bruno/Desktop/Locomotive/cache/0d79d68a9ab037aa16888c445723dd3d/EMEA.en-pt.pt
trainer_interface.cc(145) LOG(INFO) Loaded 645000000 lines
trainer_interface.cc(409) LOG(INFO) Sampled 1000000 sentences from 645435140 sentences.
trainer_interface.cc(414) LOG(INFO) Skipped 591 too long sentences.
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(428) LOG(INFO) Normalizing sentences...
trainer_interface.cc(537) LOG(INFO) all chars count=99032336
trainer_interface.cc(548) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=1451
trainer_interface.cc(559) LOG(INFO) Final character coverage=1
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 999999 sentences.
unigram_model_trainer.cc(222) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(226) LOG(INFO) Extracting frequent sub strings... node_num=48319884
unigram_model_trainer.cc(274) LOG(INFO) Initialized 1001451 seed sentencepieces
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 999999
trainer_interface.cc(608) LOG(INFO) Done! 946659
unigram_model_trainer.cc(564) LOG(INFO) Using 946659 sentences for EM training
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=372312 obj=12.3229 num_tokens=2076375 num_tokens/piece=5.57698
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=329770 obj=9.88506 num_tokens=2087466 num_tokens/piece=6.33007
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=247304 obj=9.86642 num_tokens=2169589 num_tokens/piece=8.77296
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=247138 obj=9.8551 num_tokens=2170183 num_tokens/piece=8.78126
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=185352 obj=9.91244 num_tokens=2284923 num_tokens/piece=12.3275
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=185347 obj=9.89977 num_tokens=2284917 num_tokens/piece=12.3278
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=139010 obj=9.97952 num_tokens=2411614 num_tokens/piece=17.3485
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=139010 obj=9.96461 num_tokens=2411552 num_tokens/piece=17.348
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=104257 obj=10.0664 num_tokens=2545622 num_tokens/piece=24.4168
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=104257 obj=10.0481 num_tokens=2545638 num_tokens/piece=24.417
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=78192 obj=10.1762 num_tokens=2683771 num_tokens/piece=34.3228
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=78192 obj=10.153 num_tokens=2683913 num_tokens/piece=34.3246
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=58644 obj=10.3113 num_tokens=2826789 num_tokens/piece=48.2025
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=58644 obj=10.2812 num_tokens=2826943 num_tokens/piece=48.2052
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=55000 obj=10.318 num_tokens=2858431 num_tokens/piece=51.9715
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=55000 obj=10.3112 num_tokens=2858560 num_tokens/piece=51.9738
trainer_interface.cc(686) LOG(INFO) Saving model: /home/bruno/Desktop/Locomotive/run/en_pt-1.1/sentencepiece.model
trainer_interface.cc(698) LOG(INFO) Saving vocabs: /home/bruno/Desktop/Locomotive/run/en_pt-1.1/sentencepiece.vocab
Wrote /home/bruno/Desktop/Locomotive/run/en_pt-1.1/config.yml
Converting /home/bruno/Desktop/Locomotive/run/en_pt-1.1/sentencepiece.vocab
Traceback (most recent call last):
  File "/home/bruno/Desktop/Locomotive/train.py", line 346, in <module>
    sp_vocab_to_onmt_vocab(sp_vocab_file, onmt_vocab_file)
  File "/home/bruno/Desktop/Locomotive/onmt_tools.py", line 51, in sp_vocab_to_onmt_vocab
    w, c = line.rstrip("\n").split(None, 1)
ValueError: not enough values to unpack (expected 2, got 1)
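From the traceback, I think the failing line is the one that parses the generated sentencepiece.vocab. As far as I understand, each line of that file is a piece and a score separated by a tab, so splitting on generic whitespace could return a single field if a piece happens to be a whitespace-like character itself. That is only my guess; here is a minimal sketch of the failure mode I have in mind (not Locomotive's actual code, and the "bad" piece is purely hypothetical):

# Sketch of my guess at the failure (hypothetical data, not Locomotive's code).
# sentencepiece .vocab lines look like "piece<TAB>score".
good = "▁the\t-3.4567"
bad = "\u2028\t-12.3456"  # hypothetical piece that is itself a Unicode whitespace char

print(good.rstrip("\n").split(None, 1))  # ['▁the', '-3.4567'] -> unpacks into two values
print(bad.rstrip("\n").split(None, 1))   # ['-12.3456'] -> only one value, so the unpack raises
print(bad.rstrip("\n").split("\t", 1))   # two fields again when splitting on the tab

Splitting on the tab avoids the crash in this sketch, but I don't know whether that is the right fix or whether it would just hide a bad vocabulary entry.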
My config.json looks like this:
{
  "from": {
    "name": "English",
    "code": "en"
  },
  "to": {
    "name": "Portuguese",
    "code": "pt"
  },
  "version": "1.1",
  "sources": [
    "opus://NLLB",
    "opus://ParaCrawl",
    "opus://OpenSubtitles",
    "opus://ELRC-EMEA",
    "opus://LinguaTools-WikiTitles",
    "opus://XLEnt",
    "opus://EUbookshop",
    "opus://TildeMODEL",
    "opus://SciELO",
    "opus://Europarl",
    "opus://Wikipedia",
    "opus://JRC-Acquis",
    "opus://CAPES",
    "opus://EMEA"
  ]
}
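If it helps, I can also check which lines of the generated vocab don't split into two fields; this is the quick check I was going to run (the path is from my run directory, and the approach is just my own debugging idea):

# Quick check (my own idea): print any sentencepiece.vocab line that does not
# split into two whitespace-separated fields, using repr() so invisible
# characters show up.
vocab_path = "/home/bruno/Desktop/Locomotive/run/en_pt-1.1/sentencepiece.vocab"

with open(vocab_path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        if len(line.rstrip("\n").split(None, 1)) != 2:
            print(lineno, repr(line))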
Thanks in advance!