Train job hyperparameters
The tables below describe the hyperparameters used in train jobs for text-based large language models (LLMs).
Parameter and Description | Usage Notes | Runtime Impact | Quality Impact |
---|---|---|---|
Number of data samples processed simultaneously by each worker. The number of RDUs per worker is defined by | None | Larger batches improve training speed but consume more memory. | Larger batches promote gradient stability but may reduce generalization. |
Enables/disables debug mode, which generates additional logs to diagnose training issues. Keep disabled unless advised by your SambaNova admin. | None | Slows training due to verbose logging. | Aids in troubleshooting, with no direct quality impact. |
Determines whether checkpoints should be evaluated at each | Interacts with | Increases runtime due to additional evaluation steps. | Ensures proper monitoring of validation performance during training. |
Writes (dumps) inputs to a file for debugging purposes. | Generates several MB of data per step, slowing down training. Use only if requested by SambaNova for debugging. | Increases runtime due to I/O operations. | None |
Specifies the frequency, in training steps/epochs, at which the model is evaluated during training. Requires | Interacts with | More frequent evaluation increases training time. | Provides better visibility into training progress. |
Strategy used to validate the model during training. Determines how often evaluations are triggered. | Enabled only when | Frequent evaluations increase runtime. | Improves oversight during training but may be excessive for stable training runs. |
Locks rank-to-RDU assignments for data-parallel jobs. Use only when reproducibility is required for jobs running on multiples of 8 RDUs. | Interacts with | Minimal runtime impact. | Ensures consistent parallel resource utilization, improving training stability. |
Number of steps to accumulate gradients before a weight update. | Interacts with | Helps manage memory by splitting large batches, but slows down training time per step. | Simulates larger batch sizes, improving stability, at the cost of more runtime. |
Defines how often (in training steps) logs are recorded during training. These logs typically include metrics such as loss, accuracy, and other relevant training statistics, and are useful for tracking and debugging training progress. For example, setting logging_steps = 10 logs performance metrics every 10 steps. | None | Logging too frequently may slow down training, especially on systems with limited I/O bandwidth. | Logging is only for monitoring purposes. While it doesn’t directly affect model performance, it can help you spot issues such as overfitting or divergence early. |
Specifies the type of learning rate scheduler to use, controlling how the learning rate changes during training. Available values are polynomial_decay_schedule_with_warmup, cosine_schedule_with_warmup, and fixed_lr. | Interacts with | Minimal runtime impact. | Can significantly affect final model quality. |
Learning rate to use in the optimizer. | None | Minimal direct impact on runtime. | Key to convergence: too high causes instability; too low slows learning. |
Maximum sequence length for inputs; inputs are truncated or padded to this value. | None | Quadratic impact on memory and computation time. | Longer sequences capture more context, but increase computational cost. |
Number of epochs to run the training job for. | None | Directly proportional to training time. | Running more epochs can improve training, but setting the value too high (greater than 1) risks overfitting, as the model is trained on each sample multiple times. |
Sets the number of RDUs allocated per model instance. Enables model parallelism across multiple RDUs. | Important in calculating global batch size (GBS). | Can speed up training with proper configuration. | No direct impact on quality. |
Total number of parameters in the model. | None | Larger models require more memory and compute. | More parameters increase expressiveness, with diminishing returns. |
Total number of training steps. | User can set either | Directly proportional to training time. | More steps can improve training, but risk overfitting. |
Number of epochs to run. | User can set either | Directly proportional to training time. | More epochs can improve training, but risk overfitting. |
Floating-point value that scales the loss on prompt tokens. | Interacts with | Minimal runtime impact unless used extensively. | Affects how well the model learns from prompts. |
Defines the calculation precision. High throughput uses only BF16, while balanced uses BF16 with FP32 for attention, logits, and skip-add operations to enhance accuracy. | None | High throughput offers faster performance, while balanced is slightly slower. | High throughput may result in slightly lower quality and an increased risk of training divergence, whereas balanced tends to offer higher quality and greater stability. |
Determines whether to save the optimizer state when saving a checkpoint. | Interacts with | Increases runtime slightly due to extra I/O. | Allows resuming training with consistent optimizer settings; turning it off reduces memory use. |
Frequency of saving model checkpoints. | Interacts with | More frequent saves increase I/O overhead. | No direct impact on quality. |
Specifies when model checkpoints are saved during training. This setting controls how often the model’s state is stored, which is critical for recovery, rollback, and analysis. | Combine with | Avoid overly frequent checkpoints unless necessary for your use case. Ensure there is enough storage available if saving frequently. | More frequent checkpointing improves recovery options and allows for easier rollback or performance inspection, but can slow training and consume large amounts of NFS storage. |
Random seed used for reproducibility. | None | No impact on runtime. | Ensures reproducibility. |
Subsample rate for the validation split. | Only used if | Reduces evaluation time by using fewer samples. | Trade-off between evaluation speed and accuracy. |
Random seed used for subsample evaluation. | Requires both | Faster evaluation due to fewer samples. | Ensures consistency and reproducibility of results. |
Number of passages per sample during training. | None | Increases runtime as the number of passages increases. | Improves performance on multi-passage tasks. |
Specifies whether to use token type IDs for each token when data has been prepared using the Data Prep repository. This hyperparameter is available for backward compatibility. | Set to true when using the Data Prep repository. This setting is also required to enable prompt loss weighting during training. | Minimal impact on runtime. | None |
Maximum size of the model vocabulary. | None | Affects embedding layer size and memory. | A larger vocabulary improves tokenization, but increases model size. |
Number of steps over which the learning rate ramps up from 0 to its maximum value. | Has no effect when | Minimal impact on runtime. | Helps stabilize early training. |
Weight decay is a regularization technique used to prevent overfitting by penalizing large model weights. This hyperparameter sets the weight decay rate in the optimizer. It helps ensure that, especially when the learning rate is high, the model’s weights do not change too rapidly, leading to more stable and generalizable training. | None | Minimal runtime impact. | Helps reduce overfitting and improves generalization. |
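
Several of the settings above (the per-worker batch size, the gradient accumulation steps, and the number of parallel workers) jointly determine the global batch size (GBS) that each weight update effectively sees. The sketch below is a minimal illustration of that relationship, assuming the common convention that the effective batch size is the product of the three values; the variable names are placeholders, not exact job parameter names.

```python
# Hypothetical illustration: how per-worker batch size, gradient accumulation,
# and data parallelism combine into an effective (global) batch size.
# Variable names are assumptions for this sketch, not SambaNova parameter names.

batch_size = 8               # samples processed by each worker per step
grad_accumulation_steps = 4  # steps to accumulate gradients before a weight update
data_parallel_workers = 2    # number of data-parallel workers (model replicas)

effective_batch_size = batch_size * grad_accumulation_steps * data_parallel_workers
print(effective_batch_size)  # 64 samples contribute to each weight update
```

This is why increasing the gradient accumulation steps simulates a larger batch without raising per-step memory use, at the price of more compute per weight update.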
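
The learning rate schedule and warmup settings control how the learning rate evolves over the run. As a rough illustration of one of the listed options, cosine_schedule_with_warmup, the sketch below shows a generic cosine-with-warmup curve; it reflects the usual shape of such schedules, not necessarily the exact formula used by the training service.

```python
import math

def cosine_schedule_with_warmup(step, max_lr, warmup_steps, total_steps):
    """Generic cosine schedule with linear warmup (illustrative only)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warmup steps.
        return max_lr * step / max(1, warmup_steps)
    # Cosine decay from max_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example: 100 warmup steps out of 1,000 total, peaking at a learning rate of 1e-5.
for step in (0, 50, 100, 500, 1000):
    print(step, cosine_schedule_with_warmup(step, 1e-5, 100, 1000))
```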
Calculate the number of checkpoints
Checkpoints are saved at regular intervals during training. To find out how many checkpoints will be created, divide the total training steps or epochs by the save_interval.
Run type | How to calculate checkpoints |
---|---|
Based on steps | Divide the total number of training steps by the save interval. |
Based on epochs | Divide the total number of epochs by the save interval. |

In both cases, divide the total amount of work (steps or epochs) by how often you want to save.
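
As a quick sanity check, the sketch below applies that rule in code; total_steps and save_interval are illustrative names standing in for the corresponding job settings.

```python
# Illustrative calculation of how many checkpoints a run produces.
# For epoch-based runs, use the number of epochs in place of steps.

total_steps = 1000    # total training steps configured for the job
save_interval = 100   # how often a checkpoint is saved, in the same unit

num_checkpoints = total_steps // save_interval
print(num_checkpoints)  # 10 checkpoints over the run
```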