Train job hyperparameters

The following hyperparameters are used in train jobs for text-based large language models (LLMs). Each entry lists the parameter and its description, followed by its usage notes, runtime impact, and quality impact.

batch_size

Number of data samples processed simultaneously by each worker. The number of RDUs per worker is defined by model_parallel_rdus.

Usage notes: None.

Runtime impact: Larger batches improve training speed but consume more memory.

Quality impact: Larger batches promote gradient stability but may reduce generalization.

debug_mode

Enables or disables debug mode, which generates additional logs to diagnose training issues. Keep disabled unless advised by your SambaNova admin.

Usage notes: None.

Runtime impact: Slows training due to verbose logging.

Quality impact: Aids in troubleshooting, with no direct quality impact.

do_eval

Determines whether checkpoints are evaluated at each evaluation_interval.

Usage notes: Interacts with evaluation_strategy. Requires a validation split in the training dataset.

Runtime impact: Increases runtime due to additional evaluation steps.

Quality impact: Ensures proper monitoring of validation performance during training.

dump_inputs

Writes (dumps) inputs to a file for debugging purposes.

Usage notes: Generates several MB of data per step, slowing down training. Use only if requested by SambaNova for debugging.

Runtime impact: Increases runtime due to I/O operations.

Quality impact: None.

evaluation_interval

Specifies how often, in training steps or epochs, the model is evaluated during training. Requires do_eval to be True.

Usage notes: Interacts with do_eval and evaluation_strategy.

Runtime impact: More frequent evaluation increases training time.

Quality impact: Provides better visibility into training progress.

evaluation_strategy

Strategy for validating the model during training. Determines how often evaluations are triggered.

Usage notes: Enabled only when do_eval is True. Accepts no, steps, or epochs. Use steps for frequent, fine-grained validation or epochs for evaluation after each pass through the dataset.

Runtime impact: Frequent evaluations increase runtime.

Quality impact: Improves oversight during training but may be excessive for stable training runs.

fix_rank_rdu_mapping

Locks rank-to-RDU assignments for data-parallel jobs. Use only for reproducibility on 8-RDU multiples.

Usage notes: Interacts with model_parallel_rdus.

Runtime impact: Minimal runtime impact.

Quality impact: Ensures consistent parallel resource utilization, improving training stability.

grad_accumulation_steps

Number of steps over which gradients are accumulated before a weight update.

Usage notes: Interacts with batch_size and learning_rate. Important in calculating the global batch size (GBS); see the sketch after the hyperparameter list.

Runtime impact: Helps manage memory by splitting large batches, but slows down training time per step.

Quality impact: Simulates larger batch sizes, improving stability, but at the cost of more runtime.

logging_steps

Defines how often (in training steps) logs are recorded during training. These logs typically include metrics such as loss, accuracy, and other relevant training statistics, and are useful for tracking and debugging training progress. For example, setting logging_steps = 10 logs performance metrics every 10 steps.

Usage notes: None.

Runtime impact: Logging too frequently may slow down training, especially on systems with limited I/O bandwidth.

Quality impact: Logging is for monitoring purposes only. While it doesn't directly affect model performance, it can help you spot issues like overfitting or divergence early.

lr_schedule

Specifies the type of learning rate scheduler to use, controlling how the learning rate changes during training. Available values are polynomial_decay_schedule_with_warmup, cosine_schedule_with_warmup, and fixed_lr.

Usage notes: Interacts with learning_rate and warmup_steps; see the scheduler sketch after the hyperparameter list.

Runtime impact: Minimal runtime impact.

Quality impact: Can significantly affect final model quality.

learning_rate

Learning rate to use in the optimizer.

Usage notes: None.

Runtime impact: Minimal direct impact on runtime.

Quality impact: Key to convergence. Too high causes instability; too low slows learning.

max_seq_length

Maximum sequence length for inputs; inputs are truncated or padded to this value.

Usage notes: None.

Runtime impact: Quadratic impact on memory and computation time.

Quality impact: Longer sequences capture more context but increase computational cost.

max_epochs

Number of epochs to run the training job for.
Note: Only available for the E5 large model.

Usage notes: None.

Runtime impact: Directly proportional to training time.

Quality impact: Increasing the number of epochs can improve training, but setting it too high (greater than 1) risks overfitting, since each sample is seen multiple times.

model_parallel_rdus

Sets the number of RDUs allocated per model instance. Enables model parallelism across multiple RDUs.

Usage notes: Important in calculating the global batch size (GBS); see the sketch after the hyperparameter list.

Runtime impact: Can speed up training with proper configuration.

Quality impact: No direct impact on quality.

model_parameter_count

Total number of parameters in the model.

Usage notes: None.

Runtime impact: Larger models require more memory and compute.

Quality impact: More parameters increase expressiveness, with diminishing returns.

num_iterations

Total number of training steps.

Usage notes: Set either num_iterations or num_train_epochs.

Runtime impact: Directly proportional to training time.

Quality impact: More iterations improve training but risk overfitting.

num_train_epochs

Number of epochs to run.

Usage notes: Set either num_iterations or num_train_epochs.

Runtime impact: Directly proportional to training time.

Quality impact: More epochs improve training but risk overfitting.

prompt_loss_weight

Float that scales the loss on prompt tokens.

Usage notes: Interacts with use_token_type_ids.

Runtime impact: Minimal runtime impact unless used extensively.

Quality impact: Affects how well the model learns from prompts.

run_mode

Defines the calculation precision. High throughput uses only BF16, while balanced uses BF16 with FP32 for attention, logits, and skip-add operations to enhance accuracy.

Usage notes: None.

Runtime impact: High throughput offers faster performance, while balanced is slightly slower.

Quality impact: High throughput may result in slightly lower quality and an increased risk of training divergence, whereas balanced tends to offer higher quality and greater stability.

save_optimizer_state

Determines whether to save the optimizer state when saving a checkpoint.

Usage notes: Interacts with save_interval.

Runtime impact: Increases runtime slightly due to extra I/O.

Quality impact: Allows resuming training with consistent optimizer settings. Turning it off reduces checkpoint storage.

save_interval

Frequency of saving model checkpoints.

Usage notes: Interacts with save_optimizer_state and save_strategy.

Runtime impact: More frequent saves increase I/O overhead.

Quality impact: No direct impact on quality.

save_strategy

Specifies when model checkpoints are saved during training. This setting controls how often the model's state is stored, which is critical for recovery, rollback, and analysis.

Usage notes: Accepted values are no, epoch, and steps. no disables checkpoint saving, epoch saves a checkpoint at the end of each training epoch, and steps saves a checkpoint every save_interval steps. Combine with save_interval to define the exact save frequency when using steps or epoch.

Runtime impact: More frequent checkpointing slows training and can consume large amounts of NFS storage. Avoid overly frequent checkpoints unless necessary for your use case, and ensure there is enough storage available if saving frequently.

Quality impact: More frequent checkpointing improves recovery options and allows for easier rollback or performance inspection.

seed

Random seed used for reproducibility.
Note: Only available for the E5 large model.

Usage notes: None.

Runtime impact: No impact on runtime.

Quality impact: Ensures reproducibility.

subsample_eval

Subsample rate for the validation split.

Usage notes: Only used if do_eval is True.

Runtime impact: Reduces evaluation time by using fewer samples.

Quality impact: Trade-off between evaluation speed and accuracy.

subsample_eval_seed

Random seed used for subsampled evaluation.

Usage notes: Requires do_eval to be True and subsample_eval to be set.

Runtime impact: Faster evaluation due to fewer samples.

Quality impact: Ensures consistency and reproducibility of results.

train_n_passages

Number of passages per sample during training.
Note: Only available for the E5 large model.

Usage notes: None.

Runtime impact: Increases runtime as the number of passages increases.

Quality impact: Improves performance on multi-passage tasks.

use_token_type_ids

Specifies whether to use token type IDs for each token when data has been prepared using the Data Prep repository. This hyperparameter is available for backward compatibility.

Usage notes: Set to true when using the Data Prep repository. This setting is also required to enable prompt loss weighting during training.

Runtime impact: Minimal impact on runtime.

Quality impact: None.

vocab_size

Maximum size of the model vocabulary.

Usage notes: None.

Runtime impact: Affects embedding layer size and memory.

Quality impact: A larger vocabulary improves tokenization but increases model size.

warmup_steps

Number of steps over which the learning rate ramps up from 0 to its maximum value.

Usage notes: Has no effect when lr_schedule is fixed_lr.

Runtime impact: Minimal impact on runtime.

Quality impact: Helps stabilize early training.

weight_decay

Sets the weight decay rate in the optimizer. Weight decay is a regularization technique that helps prevent overfitting by penalizing large model weights, ensuring that, especially when the learning rate is high, the model's weights do not change too rapidly. This leads to more stable and generalizable training.

Usage notes: None.

Runtime impact: Minimal runtime impact.

Quality impact: Helps reduce overfitting and improves generalization.

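Several entries above refer to the global batch size (GBS), which depends on batch_size, grad_accumulation_steps, and model_parallel_rdus. The exact formula is not spelled out here, but a common convention is GBS = batch_size × grad_accumulation_steps × number of data-parallel workers, where the number of data-parallel workers is the total RDU count divided by model_parallel_rdus. The sketch below illustrates that convention only; total_rdus is a hypothetical input, not a documented hyperparameter, and your job's actual calculation may differ.

```python
# Hypothetical sketch of how a global batch size (GBS) is commonly derived.
# total_rdus is an illustrative input, not a documented hyperparameter; the
# platform's actual GBS calculation may differ.

def global_batch_size(batch_size: int,
                      grad_accumulation_steps: int,
                      total_rdus: int,
                      model_parallel_rdus: int) -> int:
    # Each data-parallel worker owns model_parallel_rdus RDUs.
    data_parallel_workers = total_rdus // model_parallel_rdus
    # Each worker processes batch_size samples per step and accumulates
    # gradients over grad_accumulation_steps steps before each weight update.
    return batch_size * grad_accumulation_steps * data_parallel_workers

# Example: 8 RDUs total, 4 RDUs per model instance, batch_size 16,
# 2 accumulation steps -> 2 workers * 16 * 2 = 64 samples per weight update.
print(global_batch_size(batch_size=16, grad_accumulation_steps=2,
                        total_rdus=8, model_parallel_rdus=4))
```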
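The lr_schedule, learning_rate, and warmup_steps entries interact: the learning rate ramps up over warmup_steps and then follows the chosen schedule. As a rough illustration under the conventional definition of cosine_schedule_with_warmup (linear warmup from 0 to learning_rate, then cosine decay toward 0), the sketch below shows how the three values combine. It is a generic sketch, not the platform's scheduler implementation.

```python
import math

# Generic sketch of a cosine schedule with linear warmup, using the
# conventional definition; this is not the platform's implementation of
# cosine_schedule_with_warmup.

def lr_at_step(step: int, learning_rate: float,
               warmup_steps: int, num_iterations: int) -> float:
    if step < warmup_steps:
        # Linear ramp from 0 to the peak learning rate.
        return learning_rate * step / max(1, warmup_steps)
    # Cosine decay from the peak learning rate toward 0.
    progress = (step - warmup_steps) / max(1, num_iterations - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example: peak learning rate 1e-4, 100 warmup steps, 1000 total steps.
for step in (0, 50, 100, 500, 1000):
    print(step, round(lr_at_step(step, 1e-4, 100, 1000), 8))
```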
Calculate the number of checkpoints

Checkpoints are saved at regular intervals during training. To find out how many checkpoints will be created, divide the total training steps or epochs by the save_interval.

Based on steps: max_steps / save_interval. If max_steps is not available, use num_iterations / save_interval.

Based on epochs: num_train_epochs / save_interval.

Divide the total amount of work (steps or epochs) by how often you want to save.
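For example, with illustrative values for a step-based run:

```python
# Illustrative values only; substitute your own job's settings.
num_iterations = 2000  # total training steps for a step-based run
save_interval = 500    # save a checkpoint every 500 steps

num_checkpoints = num_iterations // save_interval
print(num_checkpoints)  # 4
```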