Could someone please explain these parameters and how they affect the model and learning?
onmt_config = {
    'save_data': rel_onmt_dir,
    'src_vocab': f"{rel_onmt_dir}/openmt.vocab",
    'tgt_vocab': f"{rel_onmt_dir}/openmt.vocab",
    'src_vocab_size': config.get('vocab_size', 50000),
    'tgt_vocab_size': config.get('vocab_size', 50000),
    'share_vocab': True,
    'data': corpora,
    'src_subword_type': 'sentencepiece',
    'tgt_subword_type': 'sentencepiece',
    'src_onmttok_kwargs': {
        'mode': 'none',
        'lang': config['from']['code'],
    },
    'tgt_onmttok_kwargs': {
        'mode': 'none',
        'lang': config['to']['code'],
    },
    'src_subword_model': f'{rel_run_dir}/sentencepiece.model',
    'tgt_subword_model': f'{rel_run_dir}/sentencepiece.model',
    'src_subword_nbest': 1,
    'src_subword_alpha': 0.0,
    'tgt_subword_nbest': 1,
    'tgt_subword_alpha': 0.0,
    'src_seq_length': 150,
    'tgt_seq_length': 150,
    'skip_empty_level': 'silent',
    'save_model': f'{rel_onmt_dir}/openmt.model',
    'save_checkpoint_steps': 2500,
    'keep_checkpoint': 10,
    'valid_steps': 2500,
    'train_steps': 100000,
    'early_stopping': 4,
    'bucket_size': 262144,
    'num_worker': 2,
    'world_size': 1,
    'gpu_ranks': [0],
    'batch_type': 'tokens',
    'queue_size': 10000,
    'batch_size': 8192,
    'valid_batch_size': 2048,
    'max_generator_batches': 2,
    'accum_count': 8,
    'accum_steps': 0,
    'model_dtype': 'fp16',
    'optim': 'adam',
    'learning_rate': 0.15,
    'warmup_steps': 16000,
    'decay_method': 'rsqrt',
    'adam_beta2': 0.998,
    'max_grad_norm': 0,
    'label_smoothing': 0.1,
    'param_init': 0,
    'param_init_glorot': True,
    'normalization': 'tokens',
    'encoder_type': 'transformer',
    'decoder_type': 'transformer',
    'position_encoding': True,
    # 'max_relative_positions': 20,
    'enc_layers': 6,
    'dec_layers': 6,
    'heads': 8,
    'hidden_size': 512,
    'rnn_size': 512,
    'word_vec_size': 512,
    'transformer_ff': 2048,
    'dropout_steps': 0,
    'dropout': 0.1,
    'attention_dropout': 0.1,
    'share_decoder_embeddings': True,
    'share_embeddings': True,
    'valid_metrics': ['BLEU'],
}
This is a copy/paste of the OpenNMT documentation. I’m copying it here as a reply so it’s searchable on this Discourse instance.
usage: train.py
Configuration
-config, --config
Path of the main YAML config file.
-save_config, --save_config
Path where to save the config.
Data
-data, --data
List of datasets and their specifications. See examples/*.yaml for further details.
-skip_empty_level, --skip_empty_level
Possible choices: silent, warning, error
Security level when encountering empty examples. silent: silently ignore/skip empty examples; warning: warn when ignoring/skipping an empty example; error: raise an error and stop execution when an empty example is encountered.
Default: “warning”
-transforms, --transforms
Possible choices: bart, terminology, fuzzymatch, filtertoolong, prefix, suffix, insert_mask_before_placeholder, clean, uppercase, switchout, tokendrop, tokenmask, docify, inferfeats, sentencepiece, bpe, onmt_tokenize, inlinetags, normalize
Default transform pipeline to apply to data. Can be specified in each corpus of data to override.
Default: []
-save_data, --save_data
Output base path for objects that will be saved (vocab, transforms, embeddings, …).
-overwrite, --overwrite
Overwrite existing objects if any.
Default: False
-n_sample, --n_sample
Stop after saving this number of transformed samples per corpus. Can be [-1, 0, N>0]. Set to -1 to go through the full corpus, 0 to skip.
Default: 0
-dump_transforms, --dump_transforms
Dump transforms *.transforms.pt to disk. -save_data should be set as saving prefix.
Default: False
Vocab
-src_vocab, --src_vocab
Path to src (or shared) vocabulary file. Format: one <word> or <word> <count> per line.
-tgt_vocab, --tgt_vocab
Path to tgt vocabulary file. Format: one <word> or <word> <count> per line.
-share_vocab, --share_vocab
Share source and target vocabulary.
Default: False
--decoder_start_token, -decoder_start_token
Default decoder start token. For most ONMT models it is <s> (BOS); some Fairseq models require </s> instead.
Default: “<s>”
--default_specials, -default_specials
Default specials used for vocab initialization. UNK, PAD, BOS, EOS will take IDs 0, 1, 2, 3, typically <unk> <blank> <s> </s>.
Default: [‘<unk>’, ‘<blank>’, ‘<s>’, ‘</s>’]
-src_vocab_size, --src_vocab_size
Maximum size of the source vocabulary.
Default: 32768
-tgt_vocab_size, --tgt_vocab_size
Maximum size of the target vocabulary
Default: 32768
-vocab_size_multiple, --vocab_size_multiple
Make the vocabulary size a multiple of this value.
Default: 8
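To make the vocab_size_multiple option concrete, here is a minimal sketch of what "make the vocabulary size a multiple of this value" could mean in practice. It assumes the size is rounded up (by padding), which I have not verified against the OpenNMT code; the helper name pad_vocab_size is made up for illustration.

```python
def pad_vocab_size(size: int, multiple: int = 8) -> int:
    """Round `size` up to the next multiple of `multiple`.

    Assumption: rounding goes upward (padding), never downward.
    """
    if multiple <= 1:
        return size
    return ((size + multiple - 1) // multiple) * multiple


# 50000 is the vocab_size used in the config from the question.
print(pad_vocab_size(50000))
print(pad_vocab_size(50003))
```

Sizes that are multiples of 8 are commonly recommended for fp16 training because they map better onto GPU tensor cores.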
-src_words_min_frequency, --src_words_min_frequency
Discard source words with lower frequency.
Default: 0
-tgt_words_min_frequency, --tgt_words_min_frequency
Discard target words with lower frequency.
Default: 0
Features
-n_src_feats, --n_src_feats
Number of source feats.
Default: 0
-src_feats_defaults, --src_feats_defaults
Default feature values to apply to the source when it is not annotated.
Pruning
–src_seq_length_trunc, -src_seq_length_trunc
Truncate source sequence length.
–tgt_seq_length_trunc, -tgt_seq_length_trunc
Truncate target sequence length.
Embeddings
-both_embeddings, --both_embeddings
Path to the embeddings file to use for both source and target tokens.
-src_embeddings, --src_embeddings
Path to the embeddings file to use for source tokens.
-tgt_embeddings, --tgt_embeddings
Path to the embeddings file to use for target tokens.
-embeddings_type, --embeddings_type
Possible choices: GloVe, word2vec
Type of embeddings file.
Transform/BART
–permute_sent_ratio, -permute_sent_ratio
Permute this proportion of sentences (boundaries defined by [‘.’, ‘?’, ‘!’]) in all inputs.
Default: 0.0
–rotate_ratio, -rotate_ratio
Rotate this proportion of inputs.
Default: 0.0
–insert_ratio, -insert_ratio
Insert this percentage of additional random tokens.
Default: 0.0
–random_ratio, -random_ratio
Instead of using <mask>, use random token this often.
Default: 0.0
–mask_ratio, -mask_ratio
Fraction of words/subwords that will be masked.
Default: 0.0
–mask_length, -mask_length
Possible choices: subword, word, span-poisson
Length of masking window to apply.
Default: “subword”
–poisson_lambda, -poisson_lambda
Lambda for Poisson distribution to sample span length if -mask_length set to span-poisson.
Default: 3.0
–replace_length, -replace_length
Possible choices: -1, 0, 1
When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)
Default: -1
Transform/Terminology
–termbase_path, -termbase_path
Path to a dictionary file with terms.
–src_spacy_language_model, -src_spacy_language_model
Name of the spacy language model for the source corpus.
–tgt_spacy_language_model, -tgt_spacy_language_model
Name of the spacy language model for the target corpus.
–term_corpus_ratio, -term_corpus_ratio
Ratio of corpus to augment with terms.
Default: 0.3
–term_example_ratio, -term_example_ratio
Max terms allowed in an example.
Default: 0.2
–src_term_stoken, -src_term_stoken
The source term start token.
Default: “⦅src_term_start⦆”
–tgt_term_stoken, -tgt_term_stoken
The target term start token.
Default: “⦅tgt_term_start⦆”
–tgt_term_etoken, -tgt_term_etoken
The target term end token.
Default: “⦅tgt_term_end⦆”
–term_source_delimiter, -term_source_delimiter
Any special token used for augmented source sentences. The default is the fuzzy token used in the FuzzyMatch transform.
Default: “⦅fuzzy⦆”
Transform/FuzzyMatching
–tm_path, -tm_path
Path to a flat text TM.
–fuzzy_corpus_ratio, -fuzzy_corpus_ratio
Ratio of corpus to augment with fuzzy matches.
Default: 0.1
–fuzzy_threshold, -fuzzy_threshold
The fuzzy matching threshold.
Default: 70
–tm_delimiter, -tm_delimiter
The delimiter used in the flat text TM.
Default: “ “
–fuzzy_token, -fuzzy_token
The fuzzy token to be added with the matches.
Default: “⦅fuzzy⦆”
–fuzzymatch_min_length, -fuzzymatch_min_length
Min length for TM entries and examples to match.
Default: 4
–fuzzymatch_max_length, -fuzzymatch_max_length
Max length for TM entries and examples to match.
Default: 70
Transform/Filter
–src_seq_length, -src_seq_length
Maximum source sequence length.
Default: 192
–tgt_seq_length, -tgt_seq_length
Maximum target sequence length.
Default: 192
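As a rough illustration of the two length limits above, the sketch below drops a sentence pair when either side is too long. This is only my reading of the option descriptions: the function name keep_pair is hypothetical, the exact comparison used by the real filter may differ slightly, and the filter presumably only runs when filtertoolong is listed in transforms.

```python
def keep_pair(src_tokens, tgt_tokens, src_seq_length=192, tgt_seq_length=192):
    """Return True if both sides fit within their maximum lengths.

    Lengths are counted in tokens after subword segmentation (assumption).
    """
    return len(src_tokens) <= src_seq_length and len(tgt_tokens) <= tgt_seq_length


# With the limits from the question (150 / 150):
short = ["tok"] * 20
long_ = ["tok"] * 151
print(keep_pair(short, short, 150, 150))
print(keep_pair(long_, short, 150, 150))
```

Lowering these limits discards more long training pairs, which saves memory per batch but means the model sees fewer long sentences.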
Transform/Prefix
--src_prefix, -src_prefix
String to prepend to all source examples.
Default: “”
--tgt_prefix, -tgt_prefix
String to prepend to all target examples.
Default: “”
Transform/Suffix
--src_suffix, -src_suffix
String to append to all source examples.
Default: “”
--tgt_suffix, -tgt_suffix
String to append to all target examples.
Default: “”
Default: “”
Transform/InsertMaskBeforePlaceholdersTransform
--response_pattern, -response_pattern
Response pattern to locate the end of the prompt.
Default: “Response : ⦅newline⦆”
Transform/Clean
--src_eq_tgt, -src_eq_tgt
Remove examples where src == tgt.
Default: False
--same_char, -same_char
Remove examples with the same character more than 4 times.
Default: False
--same_word, -same_word
Remove examples with the same word more than 3 times.
Default: False
--scripts_ok, -scripts_ok
List of Unicode scripts accepted.
Default: [‘Latin’, ‘Common’]
--scripts_nok, -scripts_nok
List of Unicode scripts not accepted.
Default: []
–src_tgt_ratio, -src_tgt_ratio
ratio between src and tgt
Default: 2
–avg_tok_min, -avg_tok_min
average length of tokens min
Default: 3
–avg_tok_max, -avg_tok_max
average length of tokens max
Default: 20
–langid, -langid
list of languages accepted
Default: []
Transform/Uppercase
–upper_corpus_ratio, -upper_corpus_ratio
Corpus ratio to apply uppercasing.
Default: 0.01
Transform/SwitchOut
-switchout_temperature, --switchout_temperature
Sampling temperature for SwitchOut, as described in [WPDN18]. A smaller value makes data more diverse.
Default: 1.0
Transform/Token_Drop
-tokendrop_temperature, --tokendrop_temperature
Sampling temperature for token deletion.
Default: 1.0
Transform/Token_Mask
-tokenmask_temperature, --tokenmask_temperature
Sampling temperature for token masking.
Default: 1.0
Transform/Docify
–doc_length, -doc_length
Number of tokens per doc.
Default: 200
–max_context, -max_context
Max context segments.
Default: 1
Transform/InferFeats
–reversible_tokenization, -reversible_tokenization
Possible choices: joiner, spacer
Type of reversible tokenization applied on the tokenizer.
Default: “joiner”
Transform/Subword/Common
Common options shared by all subword transforms, including options to indicate the subword model path, Subword Regularization/BPE-Dropout, and Vocabulary Restriction.
-src_subword_model, --src_subword_model
Path of subword model for src (or shared).
-tgt_subword_model, --tgt_subword_model
Path of subword model for tgt.
-src_subword_nbest, --src_subword_nbest
Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)
Default: 1
-tgt_subword_nbest, --tgt_subword_nbest
Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)
Default: 1
-src_subword_alpha, --src_subword_alpha
Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)
Default: 0
-tgt_subword_alpha, --tgt_subword_alpha
Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)
Default: 0
-src_subword_vocab, --src_subword_vocab
Path to the vocabulary file for src subword. Format: <word> <count> per line.
Default: “”
-tgt_subword_vocab, --tgt_subword_vocab
Path to the vocabulary file for tgt subword. Format: <word> <count> per line.
Default: “”
-src_vocab_threshold, --src_vocab_threshold
Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.
Default: 0
-tgt_vocab_threshold, --tgt_vocab_threshold
Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.
Default: 0
Transform/Subword/ONMTTOK
-src_subword_type, --src_subword_type
Possible choices: none, sentencepiece, bpe
Type of subword model for src (or shared) in pyonmttok.
Default: “none”
-tgt_subword_type, --tgt_subword_type
Possible choices: none, sentencepiece, bpe
Type of subword model for tgt in pyonmttok.
Default: “none”
-src_onmttok_kwargs, --src_onmttok_kwargs
Other pyonmttok options for src in dict string, except subword related options listed earlier.
Default: “{‘mode’: ‘none’}”
-tgt_onmttok_kwargs, --tgt_onmttok_kwargs
Other pyonmttok options for tgt in dict string, except subword related options listed earlier.
Default: “{‘mode’: ‘none’}”
–gpt2_pretok, -gpt2_pretok
Preprocess sentence with byte-level mapping
Default: False
Transform/InlineTags
–tags_dictionary_path, -tags_dictionary_path
Path to a flat term dictionary.
–tags_corpus_ratio, -tags_corpus_ratio
Ratio of corpus to augment with tags.
Default: 0.1
–max_tags, -max_tags
Maximum number of tags that can be added to a single sentence.
Default: 12
–paired_stag, -paired_stag
The format of an opening paired inline tag. Must include the character #.
Default: “⦅ph_#_beg⦆”
–paired_etag, -paired_etag
The format of a closing paired inline tag. Must include the character #.
Default: “⦅ph_#_end⦆”
–isolated_tag, -isolated_tag
The format of an isolated inline tag. Must include the character #.
Default: “⦅ph_#_std⦆”
–src_delimiter, -src_delimiter
Any special token used for augmented src sentences. The default is the fuzzy token used in the FuzzyMatch transform.
Default: “⦅fuzzy⦆”
Transform/Normalize
–src_lang, -src_lang
Source language code
Default: “”
–tgt_lang, -tgt_lang
Target language code
Default: “”
–penn, -penn
Penn substitution
Default: True
–norm_quote_commas, -norm_quote_commas
Normalize quotations and commas
Default: True
–norm_numbers, -norm_numbers
Normalize numbers
Default: True
–pre_replace_unicode_punct, -pre_replace_unicode_punct
Replace unicode punct
Default: False
–post_remove_control_chars, -post_remove_control_chars
Remove control chars
Default: False
Distributed
–gpu_ranks, -gpu_ranks
list of ranks of each process.
Default: []
–world_size, -world_size
total number of distributed processes.
Default: 1
–parallel_mode, -parallel_mode
Possible choices: tensor_parallel, data_parallel
Distributed mode.
Default: “data_parallel”
–gpu_backend, -gpu_backend
Type of torch distributed backend
Default: “nccl”
–gpu_verbose_level, -gpu_verbose_level
Gives more info on each process per GPU.
Default: 0
–master_ip, -master_ip
IP of master for torch.distributed training.
Default: “localhost”
–master_port, -master_port
Port of master for torch.distributed training.
Default: 10000
Model-Embeddings
–src_word_vec_size, -src_word_vec_size
Word embedding size for src.
Default: 500
–tgt_word_vec_size, -tgt_word_vec_size
Word embedding size for tgt.
Default: 500
–word_vec_size, -word_vec_size
Word embedding size for src and tgt.
Default: -1
–share_decoder_embeddings, -share_decoder_embeddings
Use a shared weight matrix for the input and output word embeddings in the decoder.
Default: False
–share_embeddings, -share_embeddings
Share the word embeddings between encoder and decoder. Need to use shared dictionary for this option.
Default: False
--position_encoding, -position_encoding
Use sinusoidal encodings to mark word positions. Necessary for non-RNN style models.
Default: False
–position_encoding_type, -position_encoding_type
Possible choices: SinusoidalInterleaved, SinusoidalConcat
Type of positional encoding. At the moment: Sinusoidal fixed, Interleaved or Concat
Default: “SinusoidalInterleaved”
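The difference between the two position_encoding_type choices is easier to see in code. The sketch below is my own illustration using the standard sinusoidal formula from "Attention Is All You Need"; it assumes "Interleaved" means sin/cos alternate along the vector and "Concat" means all sines come first, then all cosines. The function name is hypothetical and this is not the OpenNMT implementation.

```python
import math


def sinusoidal_position(pos: int, dim: int, interleaved: bool = True) -> list:
    """Sinusoidal encoding of a single position; `dim` must be even."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # One frequency per sin/cos pair.
    freqs = [1.0 / (10000 ** (2 * i / dim)) for i in range(half)]
    sines = [math.sin(pos * f) for f in freqs]
    cosines = [math.cos(pos * f) for f in freqs]
    if interleaved:
        # sin, cos, sin, cos, ...
        out = []
        for s, c in zip(sines, cosines):
            out.extend([s, c])
        return out
    # sin, sin, ..., cos, cos, ...
    return sines + cosines


print(sinusoidal_position(0, 4, interleaved=True))
print(sinusoidal_position(0, 4, interleaved=False))
```

Both layouts carry the same information; the choice mainly matters for compatibility with checkpoints converted from other frameworks.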
-update_vocab, --update_vocab
Update source and target existing vocabularies
Default: False
Model-Embedding Features
–feat_merge, -feat_merge
Possible choices: concat, sum, mlp
Merge action for incorporating features embeddings. Options [concat|sum|mlp].
Default: “concat”
–feat_vec_size, -feat_vec_size
If specified, feature embedding sizes will be set to this. Otherwise, feat_vec_exponent will be used.
Default: -1
--feat_vec_exponent, -feat_vec_exponent
If -feat_vec_size is not set, feature embedding sizes will be set to N^feat_vec_exponent where N is the number of values the feature takes.
Default: 0.7
Model- Task
-model_task, --model_task
Possible choices: seq2seq, lm
Type of task for the model either seq2seq or lm
Default: “seq2seq”
Model- Encoder-Decoder
–model_type, -model_type
Possible choices: text
Type of source model to use. Allows the system to incorporate non-text inputs. Options are [text].
Default: “text”
–model_dtype, -model_dtype
Possible choices: fp32, fp16
Data type of the model.
Default: “fp32”
–encoder_type, -encoder_type
Type of encoder layer to use. Non-RNN layers are experimental. Default options are [rnn|brnn|ggnn|mean|transformer|cnn|transformer_lm].
Default: “rnn”
–decoder_type, -decoder_type
Type of decoder layer to use. Non-RNN layers are experimental. Default options are [rnn|transformer|cnn|transformer].
Default: “rnn”
–freeze_encoder, -freeze_encoder
Freeze parameters in encoder.
Default: False
–freeze_decoder, -freeze_decoder
Freeze parameters in decoder.
Default: False
–layers, -layers
Number of layers in enc/dec.
Default: -1
–enc_layers, -enc_layers
Number of layers in the encoder
Default: 2
–dec_layers, -dec_layers
Number of layers in the decoder
Default: 2
–hidden_size, -hidden_size
Size of rnn hidden states. Overwrites enc_hid_size and dec_hid_size
Default: -1
–enc_hid_size, -enc_hid_size
Size of encoder rnn hidden states.
Default: 500
–dec_hid_size, -dec_hid_size
Size of decoder rnn hidden states.
Default: 500
–cnn_kernel_width, -cnn_kernel_width
Size of windows in the cnn, the kernel_size is (cnn_kernel_width, 1) in conv layer
Default: 3
–layer_norm, -layer_norm
Possible choices: standard, rms
The type of layer normalization in the transformer architecture. Choices are standard or rms. Default to standard
Default: “standard”
–norm_eps, -norm_eps
Layer norm epsilon
Default: 1e-06
–pos_ffn_activation_fn, -pos_ffn_activation_fn
Possible choices: relu, gelu, silu, gated-gelu
The activation function to use in PositionwiseFeedForward layer. Choices are dict_keys([‘relu’, ‘gelu’, ‘silu’, ‘gated-gelu’]). Default to relu.
Default: “relu”
–input_feed, -input_feed
Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.
Default: 1
–bridge, -bridge
Have an additional layer between the last encoder state and the first decoder state
Default: False
–rnn_type, -rnn_type
Possible choices: LSTM, GRU, SRU
The gate type to use in the RNNs
Default: “LSTM”
–context_gate, -context_gate
Possible choices: source, target, both
Type of context gate to use. Do not select for no context gate.
–bridge_extra_node, -bridge_extra_node
Graph encoder bridges only extra node to decoder as input
Default: True
–bidir_edges, -bidir_edges
Graph encoder autogenerates bidirectional edges
Default: True
–state_dim, -state_dim
Number of state dimensions in the graph encoder
Default: 512
–n_edge_types, -n_edge_types
Number of edge types in the graph encoder
Default: 2
–n_node, -n_node
Number of nodes in the graph encoder
Default: 2
–n_steps, -n_steps
Number of steps to advance graph encoder
Default: 2
–src_ggnn_size, -src_ggnn_size
Vocab size plus feature space for embedding input
Default: 0
Model- Attention
–global_attention, -global_attention
Possible choices: dot, general, mlp, none
The attention type to use: dotprod or general (Luong) or MLP (Bahdanau)
Default: “general”
–global_attention_function, -global_attention_function
Possible choices: softmax, sparsemax
Default: “softmax”
–self_attn_type, -self_attn_type
Self attention type in Transformer decoder layer – currently “scaled-dot” or “average”
Default: “scaled-dot”
--max_relative_positions, -max_relative_positions
This setting enables relative position encoding. Two types of encodings are supported: set this to -1 to enable Rotary Embeddings (more info: https://arxiv.org/abs/2104.09864); set this to > 0 (e.g. 16, 32) to use the maximum distance between inputs in relative position representations (more info: https://arxiv.org/pdf/1803.02155.pdf).
Default: 0
--relative_positions_buckets, -relative_positions_buckets
This setting enables relative position bias. More info: https://github.com/google-research/text-to-text-transfer-transformer
Default: 0
–heads, -heads
Number of heads for transformer self-attention
Default: 8
–sliding_window, -sliding_window
sliding window for transformer self-attention
Default: 0
–transformer_ff, -transformer_ff
Size of hidden transformer feed-forward
Default: 2048
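To show how heads, hidden_size and transformer_ff relate to each other, here is a back-of-the-envelope sketch for one standard Transformer layer. The function name and the returned keys are made up for illustration, biases and layer norms are ignored, and the weight counts assume the textbook architecture rather than anything specific to OpenNMT.

```python
def transformer_layer_shapes(hidden_size=512, heads=8, transformer_ff=2048):
    """Rough per-layer sizes for a standard Transformer layer."""
    if hidden_size % heads != 0:
        raise ValueError("hidden_size must be divisible by heads")
    return {
        # Each head attends in a slice of the hidden vector.
        "head_dim": hidden_size // heads,
        # Query, key, value and output projections.
        "attention_weights": 4 * hidden_size * hidden_size,
        # Two projections: hidden -> ff and ff -> hidden.
        "ffn_weights": 2 * hidden_size * transformer_ff,
    }


# Values from the config in the question.
print(transformer_layer_shapes(512, 8, 2048))
```

Increasing hidden_size or transformer_ff adds capacity (and memory use) to every layer, while changing heads only changes how the same hidden vector is split.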
–aan_useffn, -aan_useffn
Turn on the FFN layer in the AAN decoder
Default: False
--add_qkvbias, -add_qkvbias
Add bias to nn.linear of Query/Key/Value in MHA. Note: this will add bias to the output projection layer too.
Default: False
--multiquery, -multiquery
Use MultiQuery attention. Note: https://arxiv.org/pdf/1911.02150.pdf
Default: False
–num_kv, -num_kv
Number of heads for KV in the variant of MultiQuery attention (egs: Falcon 40B)
Default: 0
–add_ffnbias, -add_ffnbias
Add bias to nn.linear of Position_wise FFN
Default: False
--parallel_residual, -parallel_residual
Use parallel residual in the decoder layer. Note: this is used by the GPT-J / Falcon architecture.
Default: False
--shared_layer_norm, -shared_layer_norm
Use a shared layer_norm in parallel residual attention. Note: must be true for Falcon 7B / false for Falcon 40B.
Default: False
Model - Alignment
--lambda_align, -lambda_align
Lambda value for the alignment loss of Garg et al. (2019). For more detailed information, see: https://arxiv.org/abs/1909.02074
Default: 0.0
–alignment_layer, -alignment_layer
Layer number which has to be supervised.
Default: -3
--alignment_heads, -alignment_heads
Number of cross-attention heads per layer to supervise.
Default: 0
–full_context_alignment, -full_context_alignment
Whether alignment is conditioned on full target context.
Default: False
Generator
–copy_attn, -copy_attn
Train copy attention layer.
Default: False
–copy_attn_type, -copy_attn_type
Possible choices: dot, general, mlp, none
The copy attention type to use. Leave as None to use the same as -global_attention.
–generator_function, -generator_function
Possible choices: softmax, sparsemax
Which function to use for generating probabilities over the target vocabulary (choices: softmax, sparsemax)
Default: “softmax”
–copy_attn_force, -copy_attn_force
When available, train to copy.
Default: False
–reuse_copy_attn, -reuse_copy_attn
Reuse standard attention for copy
Default: False
–copy_loss_by_seqlength, -copy_loss_by_seqlength
Divide copy loss by length of sequence
Default: False
–coverage_attn, -coverage_attn
Train a coverage attention layer.
Default: False
–lambda_coverage, -lambda_coverage
Lambda value for coverage loss of See et al (2017)
Default: 0.0
--lm_prior_model, -lm_prior_model
LM model used to train the TM.
–lm_prior_lambda, -lambda_prior_lambda
LM Prior Lambda
Default: 0.0
–lm_prior_tau, -lambda_prior_tau
LM Prior Tau
Default: 1.0
–loss_scale, -loss_scale
For FP16 training, the static loss scale to use. If not set, the loss scale is dynamically computed.
Default: 0
–apex_opt_level, -apex_opt_level
Possible choices: , O0, O1, O2, O3
For FP16 training, the opt_level to use. See https://nvidia.github.io/apex/amp.html#opt-levels.
Default: “”
--zero_out_prompt_loss, -zero_out_prompt_loss
Set the prompt loss to zero. Mostly for LLM finetuning. Will be enabled only if the insert_mask_before_placeholder transform is applied.
Default: False
–use_ckpting, -use_ckpting
Possible choices: ffn, mha, lora
Use gradient checkpointing for those modules.
Default: []
General
–data_type, -data_type
Type of the source input. Options are [text].
Default: “text”
–save_model, -save_model
Model filename (the model will be saved as <save_model>_N.pt where N is the number of steps).
Default: “model”
–save_format, -save_format
Possible choices: pytorch, safetensors
Format to save the model weights
Default: “pytorch”
–save_checkpoint_steps, -save_checkpoint_steps
Save a checkpoint every X steps
Default: 5000
–keep_checkpoint, -keep_checkpoint
Keep X checkpoints (negative: keep all)
Default: -1
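The interaction between train_steps, save_checkpoint_steps and keep_checkpoint can be sketched as follows. This is a simplified model of the behaviour described above (save every X steps, keep the most recent ones); the function name is hypothetical and treating 0 as "keep nothing" is my assumption.

```python
def kept_checkpoints(train_steps, save_checkpoint_steps, keep_checkpoint):
    """Return the step numbers whose checkpoints remain on disk at the end."""
    saved = list(range(save_checkpoint_steps, train_steps + 1, save_checkpoint_steps))
    if keep_checkpoint < 0:
        return saved  # negative: keep all
    if keep_checkpoint == 0:
        return []  # assumption: nothing is kept
    return saved[-keep_checkpoint:]


# Values from the question: 100000 steps, save every 2500, keep 10.
print(kept_checkpoints(100000, 2500, 10))
```

With those values only the last ten checkpoints survive, so the oldest one available for averaging or rollback is 22500 steps before the end of training.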
–lora_layers, -lora_layers
list of layers to be replaced by LoRa layers. ex: [‘linear_values’, ‘linear_query’] cf paper §4.2 https://arxiv.org/abs/2106.09685
Default: []
–lora_embedding, -lora_embedding
replace embeddings with LoRa Embeddings see §5.1
Default: False
–lora_rank, -lora_rank
r=2 successfully tested with NLLB-200 3.3B
Default: 2
–lora_alpha, -lora_alpha
§4.1 https://arxiv.org/abs/2106.09685
Default: 1
–lora_dropout, -lora_dropout
rule of thumb: same value as in main model
Default: 0.0
Reproducibility
–seed, -seed
Set random seed used for better reproducibility between experiments.
Default: -1
Initialization
–param_init, -param_init
Parameters are initialized over uniform distribution with support (-param_init, param_init). Use 0 to not use initialization
Default: 0.1
–param_init_glorot, -param_init_glorot
Init parameters with xavier_uniform. Required for transformer.
Default: False
–train_from, -train_from
If training from a checkpoint then this is the path to the pretrained model’s state_dict.
Default: “”
–reset_optim, -reset_optim
Possible choices: none, all, states, keep_states
Optimization resetter when train_from.
Default: “none”
–pre_word_vecs_enc, -pre_word_vecs_enc
If a valid path is specified, then this will load pretrained word embeddings on the encoder side. See README for specific formatting instructions.
–pre_word_vecs_dec, -pre_word_vecs_dec
If a valid path is specified, then this will load pretrained word embeddings on the decoder side. See README for specific formatting instructions.
–freeze_word_vecs_enc, -freeze_word_vecs_enc
Freeze word embeddings on the encoder side.
Default: False
–freeze_word_vecs_dec, -freeze_word_vecs_dec
Freeze word embeddings on the decoder side.
Default: False
Optimization- Type
–num_workers, -num_workers
pytorch DataLoader num_workers
Default: 2
–batch_size, -batch_size
Maximum batch size for training
Default: 64
–batch_size_multiple, -batch_size_multiple
Batch size multiple for token batches.
Default: 1
–batch_type, -batch_type
Possible choices: sents, tokens
Batch grouping for batch_size. Standard is sents. Tokens will do dynamic batching
Default: “sents”
–normalization, -normalization
Possible choices: sents, tokens
Normalization method of the gradient.
Default: “sents”
–accum_count, -accum_count
Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for Transformer.
Default: [1]
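Since batch_size, batch_type, accum_count and world_size together decide how much data goes into one optimizer update, a quick calculation helps. The sketch below just multiplies them, following the "approximately equivalent to updating batch_size * accum_count batches at once" remark; the function name is made up, and I am assuming each GPU processes its own batch in data-parallel mode.

```python
def units_per_update(batch_size, accum_count, world_size=1):
    """Approximate number of tokens (or sentences) per optimizer update.

    The unit is whatever batch_type selects: 'tokens' or 'sents'.
    """
    return batch_size * accum_count * world_size


# Config from the question: batch_type 'tokens', 8192 * 8 on a single GPU.
print(units_per_update(8192, 8, 1))
```

So the configuration in the question updates the weights roughly once per 65536 tokens. Raising accum_count increases that effective batch without needing more GPU memory, at the cost of fewer updates per hour.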
–accum_steps, -accum_steps
Steps at which accum_count values change
Default: [0]
–valid_steps, -valid_steps
Perform validation every X steps.
Default: 10000
–valid_batch_size, -valid_batch_size
Maximum batch size for validation
Default: 32
–train_steps, -train_steps
Number of training steps
Default: 100000
–single_pass, -single_pass
Make a single pass over the training dataset.
Default: False
–early_stopping, -early_stopping
Number of validation steps without improving.
Default: 0
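Early stopping is essentially a patience counter over validation runs. The sketch below is a generic version of that idea, not the OpenNMT implementation: it tracks a single score, whereas the real criteria (see early_stopping_criteria) may combine several. The function name and arguments are hypothetical.

```python
def stopping_validation(valid_scores, patience, higher_is_better=False):
    """Return the 1-based index of the validation run at which training
    would stop, or None if the patience is never exhausted."""
    best = None
    bad_runs = 0
    for i, score in enumerate(valid_scores, start=1):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best = score
            bad_runs = 0
        else:
            bad_runs += 1
            if bad_runs >= patience:
                return i
    return None


# Perplexity-like scores (lower is better) with early_stopping = 4.
print(stopping_validation([10, 9, 9.5, 9.2, 9.1, 9.3, 8], patience=4))
```

With valid_steps = 2500 and early_stopping = 4, as in the question, training would end after 10000 steps without improvement, possibly well before train_steps is reached.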
–early_stopping_criteria, -early_stopping_criteria
Criteria to use for early stopping.
–optim, -optim
Possible choices: sgd, adagrad, adadelta, adam, sparseadam, adafactor, fusedadam, adamw8bit, pagedadamw8bit, pagedadamw32bit
Optimization method.
Default: “sgd”
–adagrad_accumulator_init, -adagrad_accumulator_init
Initializes the accumulator values in adagrad. Mirrors the initial_accumulator_value option in the tensorflow adagrad (use 0.1 for their default).
Default: 0
–max_grad_norm, -max_grad_norm
If the norm of the gradient vector exceeds this, renormalize it to have the norm equal to max_grad_norm
Default: 5
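Gradient clipping by norm, as described for max_grad_norm, can be sketched in a few lines. This is a plain-Python illustration on a flat list of gradient values; the function name is made up, and treating 0 as "clipping disabled" (the value used in the question) is my understanding rather than something stated in the text above.

```python
import math


def clip_by_norm(grads, max_grad_norm):
    """Rescale `grads` so their L2 norm does not exceed `max_grad_norm`."""
    if max_grad_norm <= 0:
        return list(grads)  # assumption: 0 disables clipping
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_grad_norm:
        return list(grads)
    scale = max_grad_norm / norm
    return [g * scale for g in grads]


print(clip_by_norm([6.0, 8.0], 5))  # norm 10 is scaled down to norm 5
print(clip_by_norm([6.0, 8.0], 0))  # left untouched
```

Clipping limits the size of any single update, which mostly matters for RNNs; Transformer recipes often rely on warmup and normalization instead.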
–dropout, -dropout
Dropout probability; applied in LSTM stacks.
Default: [0.3]
–attention_dropout, -attention_dropout
Attention Dropout probability.
Default: [0.1]
–dropout_steps, -dropout_steps
Steps at which dropout changes.
Default: [0]
–truncated_decoder, -truncated_decoder
Truncated bptt.
Default: 0
–adam_beta1, -adam_beta1
The beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.
Default: 0.9
–adam_beta2, -adam_beta2
The beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow and Keras, i.e. see: https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer or https://keras.io/optimizers/ . Whereas recently the paper “Attention is All You Need” suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.
Default: 0.999
–label_smoothing, -label_smoothing
Label smoothing value epsilon. Probabilities of all non-true labels will be smoothed by epsilon / (vocab_size - 1). Set to zero to turn off label smoothing. For more detailed information, see: https://arxiv.org/abs/1512.00567
Default: 0.0
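The label smoothing description can be turned into a tiny example: the true label gets 1 - epsilon and every other label gets epsilon / (vocab_size - 1). The function below is only an illustration of that sentence (the name is hypothetical); real implementations work on tensors and may also exclude the padding token.

```python
def smoothed_target(true_index, vocab_size, epsilon=0.1):
    """Build the smoothed target distribution for one token."""
    off_value = epsilon / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[true_index] = 1.0 - epsilon
    return dist


# Toy vocabulary of 5 entries, true label at index 2, epsilon from the question.
print(smoothed_target(2, 5, 0.1))
```

Training against this softer target discourages the model from becoming over-confident, which usually costs a little perplexity but tends to help BLEU.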
–average_decay, -average_decay
Moving average decay. Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation: http://www.aclweb.org/anthology/P18-4020 For more detail on Exponential Moving Average: https://en.wikipedia.org/wiki/Moving_average
Default: 0
–average_every, -average_every
Step for moving average. Default is every update, if -average_decay is set.
Default: 1
Optimization- Rate
–learning_rate, -learning_rate
Starting learning rate. Recommended settings: sgd = 1, adagrad = 0.1, adadelta = 1, adam = 0.001
Default: 1.0
–learning_rate_decay, -learning_rate_decay
If update_learning_rate, decay learning rate by this much if steps have gone past start_decay_steps
Default: 0.5
–start_decay_steps, -start_decay_steps
Start decaying every decay_steps after start_decay_steps
Default: 50000
–decay_steps, -decay_steps
Decay every decay_steps
Default: 10000
–decay_method, -decay_method
Possible choices: noam, noamwd, rsqrt, none
Use a custom decay rate.
Default: “none”
–warmup_steps, -warmup_steps
Number of warmup steps for custom decay.
Default: 4000
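To see how learning_rate, decay_method and warmup_steps combine, here is a sketch of the two schedules most relevant to Transformer training. The formulas are written from my recollection of how OpenNMT-py scales the base learning rate (noam as in "Attention Is All You Need", rsqrt as an inverse square root with a flat warmup phase), so please treat them as assumptions and check the source before relying on exact values. The function names are my own.

```python
import math


def noam_scale(step, warmup_steps, model_size):
    """Noam schedule factor: linear warmup, then inverse-sqrt decay (step >= 1)."""
    return model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def rsqrt_scale(step, warmup_steps):
    """Inverse-sqrt factor: constant during warmup, then decays (assumed form)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))


def learning_rate_at(step, learning_rate, warmup_steps,
                     decay_method="rsqrt", model_size=512):
    """Effective learning rate = base learning_rate * schedule factor."""
    if decay_method == "noam":
        return learning_rate * noam_scale(step, warmup_steps, model_size)
    if decay_method == "rsqrt":
        return learning_rate * rsqrt_scale(step, warmup_steps)
    return learning_rate  # 'none': no custom schedule


# Config from the question: learning_rate 0.15, rsqrt, 16000 warmup steps.
for step in (1, 16000, 64000, 100000):
    print(step, learning_rate_at(step, 0.15, 16000, "rsqrt"))
```

Under these assumptions the 0.15 in the question is not the rate Adam actually sees: it is multiplied by a factor below 0.01, which is why such a large value works where the documentation recommends 0.001 for plain Adam.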
Logging
–log_file, -log_file
Output logs to a file under this path.
Default: “”
–log_file_level, -log_file_level
Possible choices: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET, 50, 40, 30, 20, 10, 0
Default: “0”
–verbose, -verbose
Print data loading and statistics for all processes (default: only log the first process shard).
Default: False
–valid_metrics, -valid_metrics
List of names of additional validation metrics
Default: []
–scoring_debug, -scoring_debug
Dump the src/ref/pred of the current batch
Default: False
–dump_preds, -dump_preds
Folder to dump predictions to.
–report_every, -report_every
Print stats at this interval.
Default: 50
–exp_host, -exp_host
Send logs to this crayon server.
Default: “”
–exp, -exp
Name of the experiment for logging.
Default: “”
–tensorboard, -tensorboard
Use tensorboard for visualization during training. Must have the library tensorboard >= 1.14.
Default: False
–tensorboard_log_dir, -tensorboard_log_dir
Log directory for Tensorboard. This is also the name of the run.
Default: “runs/onmt”
--override_opts, -override-opts
Allow overriding some checkpoint opts.
Default: False
Dynamic data
-bucket_size, --bucket_size
A bucket is a buffer of bucket_size examples to pick from the various corpora. The dynamic iterator batches batch_size batches from the bucket and shuffles them.
Default: 262144
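The bucket idea is easier to follow with a toy version: take a bucket of examples, sort them by length so that similar lengths end up together, cut them into batches that respect a token budget, then shuffle the batches. The sketch below is a simplified illustration of that description and not the OpenNMT iterator; the function name, the padded-size rule and the fixed seed are all my own assumptions.

```python
import random


def token_batches(examples, batch_size, seed=0):
    """Group a bucket of token lists into batches of at most `batch_size`
    padded tokens (longest example * number of examples), then shuffle."""
    ordered = sorted(examples, key=len)
    batches, current, longest = [], [], 0
    for ex in ordered:
        new_longest = max(longest, len(ex))
        # Start a new batch if adding this example would exceed the budget.
        if current and new_longest * (len(current) + 1) > batch_size:
            batches.append(current)
            current, new_longest = [], len(ex)
        current.append(ex)
        longest = new_longest
    if current:
        batches.append(current)
    random.Random(seed).shuffle(batches)
    return batches


bucket = [[0] * n for n in (5, 1, 4, 2, 6, 3)]
for batch in token_batches(bucket, batch_size=8):
    print([len(ex) for ex in batch])
```

A larger bucket_size gives the sorter more examples to choose from, so batches contain less padding, at the cost of more memory and a longer wait before the first batch is ready.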
-bucket_size_init, --bucket_size_init
The bucket is initialized with this amount of examples (optional).
Default: -1
-bucket_size_increment, --bucket_size_increment
The bucket size is incremented with this amount of examples (optional).
Default: 0
-prefetch_factor, --prefetch_factor
Number of mini-batches loaded in advance to avoid the GPU waiting during the refilling of the bucket.
Default: 200
Quant options
–quant_layers, -quant_layers
list of layers to be compressed in 4/8bit.
Default: []
–quant_type, -quant_type
Possible choices: bnb_8bit, bnb_FP4, bnb_NF4
Type of compression.
Default: “bnb_8bit”