feat: Self-Rewarding Algorithm with TRT Support #321
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

### New Features and Optimizations
- Implement Kahneman-Tversky Optimization (KTO).
- Sequence packing is now supported when running SFT with SFTChatDataset.
- Implemented the self-rewarding and meta self-rewarding algorithms.

### Breaking Changes
.. include:: /content/nemo.rsts

Model Alignment by Self-Rewarding Language Models
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

The NeMo Framework supports efficient model alignment using the NeMo Aligner codebase.

All algorithms in NeMo Aligner are compatible with any GPT-based model from Megatron Core (i.e., those with ``mcore_gpt=True`` in the config). For the purposes of this tutorial, we will go through the entire self-rewarding pipeline using the newly released `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. This same tutorial also works for other GPT models (such as LLaMa2) of any size.

For more information, see the original `Self-Rewarding paper <https://arxiv.org/abs/2401.10020>`__ and the `Meta Self-Rewarding paper <https://arxiv.org/abs/2407.19594>`__.
Obtain a Pretrained Model
#########################

To start, we must first get a pretrained model to align. There are two models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes, we will use the smaller 2B model.
.. tab-set::

    .. tab-item:: 2B GPT
        :sync: key1

        #. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``.

        #. Extract the NeMo file to a folder with ``mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint``.

        #. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here <https://github.com/NVIDIA/NeMo/blob/86b198ff93438d454f9c7f3550bcfb7d4e59feab/scripts/nlp_language_modeling/convert_nemo_gpt_to_mcore.py>`__.

           .. code-block:: bash

              python convert_nemo_gpt_to_mcore.py \
                  --in-folder ./model_checkpoint \
                  --out-file ./mcore_gpt.nemo
    .. tab-item:: LLaMa2 7B
        :sync: key2

        #. Download the `Llama 2 7B LLM model and tokenizer <https://huggingface.co/meta-llama/Llama-2-7b>`__ into the models folder.

        #. Convert the LLaMa2 LLM into ``.nemo`` format.

           .. code-block:: bash

              python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
                  --input_name_or_path /path/to/llama --output_path /output_path/mcore_gpt.nemo

After these steps, you should have a file ``mcore_gpt.nemo`` to use in NeMo Aligner.
.. note::
   Megatron Core models use TransformerEngine as a backend, which attempts to find efficient kernels. However, depending on your GPU, it may not always succeed. If you encounter errors related to kernel finding, set these variables at the top of your script:

   .. code-block:: bash

      export NVTE_MASKED_SOFTMAX_FUSION=0
      export NVTE_FLASH_ATTN=0
      export NVTE_FUSED_ATTN=0

   Additionally, TransformerEngine is non-deterministic by default, meaning subsequent runs of self-rewarding training using identical parameters will produce different results, which is not ideal for parameter perturbation.
   Helpfully, TransformerEngine exposes a flag you can set if you want to guarantee deterministic training runs:

   .. code-block:: bash

      export NVTE_ALLOW_NONDETERMINISTIC_ALGO=0
      export NVTE_MASKED_SOFTMAX_FUSION=0
SFT vs Foundational (Base) Model for Self-Rewarding Training
############################################################

Self-rewarding training can be run on either base/foundational models, which have only been trained on autoregressive language prediction tasks and not on instruction-following tasks, or on models that have been Supervised Fine-Tuned (SFT) on instruction-based datasets, similar to DPO/PPO. Both types of models work well with self-rewarding training. If you prefer to start with a SFT model instead of a base model, please see our full guide on how to perform SFT on a Megatron GPT model :ref:`SFT guide <sft>`.
Self-Rewarding Model Training
#############################

Self-rewarding training uses the exact same dataset formatting and files as the NeMo Aligner SFT trainer. Please see the data formatting section of the :ref:`SFT guide <sft>` to understand the data format necessary for self-rewarding training.

Once your data is processed into the correct format, you are ready to begin self-rewarding training. You must start with a pretrained or SFT trained model. For this section, we will use the SFT model trained in the previous step to train the self-rewarding model.
For the purposes of the following sections, we'll assume your training JSONL file is located in ``/path/to/train_sft_format.jsonl`` and your validation JSONL file is located in ``/path/to/valid_sft_format.jsonl``.
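Before launching a run, it can help to sanity-check that every line of your data file parses as standalone JSON. The snippet below is a minimal sketch, not part of NeMo Aligner: the throwaway file, its two lines, and the ``prompt`` field are purely illustrative, so point the loop at your real file and see the :ref:`SFT guide <sft>` for the actual schema.

.. code-block:: bash

   # Create a tiny throwaway JSONL file for illustration; substitute your
   # real /path/to/train_sft_format.jsonl here.
   TRAIN_DATA_PATH="$(mktemp)"
   printf '%s\n' '{"prompt": "hi"}' '{"prompt": "bye"}' > "$TRAIN_DATA_PATH"

   # Every line of a JSONL file must parse as standalone JSON.
   bad=0
   n=0
   while IFS= read -r line; do
     n=$((n + 1))
     printf '%s' "$line" | python3 -m json.tool > /dev/null 2>&1 || bad=$((bad + 1))
   done < "$TRAIN_DATA_PATH"

   echo "checked $n lines, $bad malformed"

If any line is reported as malformed, fix it before training; a single bad line can fail the whole data-loading step.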
Due to some limitations of the NeMo Aligner system and reuse of code files, the parameters for self-rewarding training share the same parameter namespace as SPIN, so these parameters are labeled as ``spin``, but they apply to the self-rewarding algorithm.

For the parameters below, ``model.spin.ref_policy_kl_penalty`` corresponds to the beta parameter in the Self-Rewarding paper, and ``trainer.self_rewarding.max_iterations`` corresponds to the number of iterations.
Self-rewarding training is a very generation-heavy algorithm, with N*k generations per sample in the training data. Therefore, it is highly advisable to enable TRTLLM to significantly speed up training generation times (5-7X speedup).
You can enable TRT by setting ``trainer.self_rewarding.trt_llm.enable=true`` along with ``trainer.self_rewarding.trt_llm.model_type`` (set this to ``gptnext`` for Nemotron models and ``llama`` for the Llama family of models).
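As a sketch, the TRT-related additions to a training command would look like the following (``llama`` here is only an example; choose the value that matches your model family):

.. code-block:: bash

   # Hydra-style overrides appended to the training command to enable TRT generation.
   # model_type: gptnext for Nemotron models, llama for the Llama family.
   trainer.self_rewarding.trt_llm.enable=true \
   trainer.self_rewarding.trt_llm.model_type=llama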
If you want to train using the meta self-rewarding algorithm instead of the original self-rewarding algorithm, you need to set ``model.spin.use_meta_judge=true``. When using meta mode, you also need to set ``model.spin.meta_judge_pcnt``, which controls the maximum percentage of any GBS that can be populated by meta-judge training samples.

If you want to use Length Control (Meta-Self-Rewarding paper, section 2.1, last paragraph), you can set that with ``model.spin.length_control``. This parameter accepts either a scalar or a list whose size equals the number of iterations, where each iteration applies its corresponding length control value. This allows you to create a schedule of different length control values for each iteration. This logic works for both self-rewarding and meta self-rewarding training.

You can also control which variant of DPO loss is used for training using the ``model.spin.preference_loss`` parameter. Valid entries are: ``dpo``, ``scale``, ``rpo_bwd_kl``, ``rpo_fwd_kl``, ``ipo``, and ``rpo_sq``. The default is ``dpo``.
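Putting these options together, a meta self-rewarding run with a length-control schedule and a non-default preference loss might add overrides along these lines to the training command (the specific values are illustrative, not recommendations):

.. code-block:: bash

   # Illustrative extra overrides for a meta self-rewarding run.
   # length_control is a schedule here: one value per iteration (3 iterations).
   model.spin.use_meta_judge=true \
   model.spin.meta_judge_pcnt=0.15 \
   "model.spin.length_control=[0,0,0.1]" \
   model.spin.preference_loss=ipo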
.. tab-set::

    .. tab-item:: Terminal
        :sync: key3

        To run self-rewarding model training on the terminal directly:

        .. code-block:: bash

           export GPFS="/path/to/nemo-aligner-repo"
           export TRAIN_DATA_PATH="/path/to/train_sft_format.jsonl"
           export VALID_DATA_PATH="/path/to/valid_sft_format.jsonl"

           python -u ${GPFS}/examples/nlp/gpt/train_gpt_self_rewarding.py \
              trainer.num_nodes=1 \
              trainer.devices=8 \
              model.micro_batch_size=1 \
              model.global_batch_size=64 \
              pretrained_checkpoint.restore_from_path=/path/to/megatron_gpt_sft.nemo \
              "model.data.train_ds.file_path=${TRAIN_DATA_PATH}" \
              "model.data.validation_ds.file_path=${VALID_DATA_PATH}" \
              exp_manager.create_wandb_logger=false \
              exp_manager.wandb_logger_kwargs.project=spin_training \
              exp_manager.wandb_logger_kwargs.name=spin_training \
              exp_manager.explicit_log_dir=/results \
              ++model.sequence_parallel=false \
              ++model.apply_rope_fusion=false \
              trainer.self_rewarding.max_iterations=3 \
              trainer.self_rewarding.max_epochs=1 \
              model.spin.ref_policy_kl_penalty=0.1 \
              model.spin.use_meta_judge=false \
              model.spin.length_params.max_length=2048 \
              model.data.train_ds.max_seq_length=4096
    .. tab-item:: Slurm
        :sync: key4

        To run self-rewarding model training with Slurm, use the script below. The script uses 4 nodes, but you can change the node count to something different.

        .. code-block:: bash

           #!/bin/bash
           #SBATCH -A <<ACCOUNT NAME>>
           #SBATCH -p <<PARTITION NAME>>
           #SBATCH -N 4
           #SBATCH -t 4:00:00
           #SBATCH -J <<JOB NAME>>
           #SBATCH --ntasks-per-node=8
           #SBATCH --gpus-per-node 8
           #SBATCH --exclusive
           #SBATCH --overcommit

           GPFS="/path/to/nemo-aligner-repo"
           PRETRAINED_CHECKPOINT_NEMO_FILE="/path/to/megatron_gpt_sft.nemo"

           TRAIN_DATA_PATH="/path/to/train_sft_format.jsonl"
           VALID_DATA_PATH="/path/to/valid_sft_format.jsonl"

           PROJECT="<<WANDB PROJECT>>"

           CONTAINER=<<<CONTAINER>>> # use the latest NeMo Training container; Aligner will work there
           MOUNTS="--container-mounts=${GPFS}:${GPFS},${TRAIN_DATA_PATH}:${TRAIN_DATA_PATH},${VALID_DATA_PATH}:${VALID_DATA_PATH},${PRETRAINED_CHECKPOINT_NEMO_FILE}:${PRETRAINED_CHECKPOINT_NEMO_FILE}"

           RESULTS_DIR="/path/to/result_dir"

           OUTFILE="${RESULTS_DIR}/rm-%j_%t.out"
           ERRFILE="${RESULTS_DIR}/rm-%j_%t.err"
           mkdir -p ${RESULTS_DIR}

           read -r -d '' cmd <<EOF
           echo "*******STARTING********" \
           && echo "---------------" \
           && echo "Starting training" \
           && cd ${GPFS} \
           && export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
           && export NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 \
           && export NVTE_MASKED_SOFTMAX_FUSION=0 \
           && export HYDRA_FULL_ERROR=1 \
           && python -u ${GPFS}/examples/nlp/gpt/train_gpt_self_rewarding.py \
              trainer.num_nodes=${SLURM_JOB_NUM_NODES} \
              trainer.devices=8 \
              pretrained_checkpoint.restore_from_path='${PRETRAINED_CHECKPOINT_NEMO_FILE}' \
              "model.data.train_ds.file_path=${TRAIN_DATA_PATH}" \
              "model.data.validation_ds.file_path=${VALID_DATA_PATH}" \
              model.micro_batch_size=1 \
              model.global_batch_size=64 \
              exp_manager.explicit_log_dir=${RESULTS_DIR} \
              exp_manager.create_wandb_logger=True \
              exp_manager.wandb_logger_kwargs.name=${NAME} \
              exp_manager.wandb_logger_kwargs.project=${PROJECT} \
              trainer.self_rewarding.max_iterations=3 \
              trainer.self_rewarding.max_epochs=1 \
              model.spin.ref_policy_kl_penalty=0.1 \
              model.spin.use_meta_judge=false \
              model.spin.length_params.max_length=2048 \
              model.data.train_ds.max_seq_length=4096
           EOF

           srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
           set +x
During self-rewarding training, several metrics are recorded to WandB for you to monitor. The following metrics are specific to self-rewarding:

- ``chosen_lengths``: Average token length of chosen responses (average taken across GBS).
- ``reject_lengths``: Same as above, but for rejected responses.
- ``chosen_generated_rewards``: The average reward (across GBS) generated by the LLM-as-a-judge for chosen responses.
- ``rejected_generated_rewards``: Same as above, but for rejected responses.
- ``rewards_chosen_mean``: See below for a definition of what reward means in this context.
- ``rewards_rejected_mean``: Same as above, but for rejected responses.
- ``bad_samples_per_GBS``: The percentage of samples in a GBS that are excluded from training because of bad output from the LLM-as-a-judge (this could be caused by parse errors, all responses being judged with the same score, etc.).
- ``bad_ends_per_GBS``: Only valid if using TRT. This tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%).
- ``preference_loss``: The raw DPO-variant loss.
- ``sft_loss``: If adding an SFT loss (categorical cross-entropy loss) for the chosen response, you can see that raw loss here.
The ``reward`` in this case is calculated as the difference between model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground truth and generated responses.
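In symbols, this can be sketched as follows, with :math:`\pi_\theta` the policy being trained, :math:`\pi_{\text{ref}}` the frozen reference policy, and :math:`\beta` the ``model.spin.ref_policy_kl_penalty`` (this is just the sentence above restated, under the usual DPO-style convention):

.. math::

   r(x, y) = \beta \left( \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \right)

computed for a response :math:`y` to prompt :math:`x`, for both the ground truth and the generated responses.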
During training, the accuracy should generally be increasing, but don't worry if its absolute value remains low, as it doesn't correlate to finalized MTBench or MMLU scores. It should just be generally increasing.
All metrics will be grouped by either ``train/`` or ``val/`` in WandB, representing whether that metric is from the training or validation set, respectively.
You can also see a table that prints out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.
When it comes to ideal hyperparameters for self-rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data. Therefore, there is no one-size-fits-all parameter set that will work in all cases.
Additionally, self-rewarding training (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
Below are some observations from the NVIDIA Alignment team regarding parameters that we have found to work well:
* ``global_batch_size``: We recommend using 64, and increasing to 128 only for large models (70B+) that are also training with large datasets.
* iterations/epochs: The original paper uses 3 iterations with 1 epoch per iteration. We find this to be sufficient for most use cases.
* learning rate: For SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 and 9e-7 is recommended.
* ``ref_policy_kl_penalty``: We did not see large changes from perturbations to this value. We recommend 0.1 - 0.001.
* ``length_control``: This parameter depends very much on model size and data, but we found good results with [0,0,0.1].
* ``use_meta_judge``: We found stronger results when setting this parameter to ``true``, which is in line with the paper's results.
* ``meta_judge_pcnt``: We recommend not setting this higher than 0.15 (15%). Any higher, and we have observed that the LLM-as-a-judge model starts to output identical scores for every response (always a 5).
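Taken together, these observations suggest a starting point along the following lines for a smaller SFT model. This is a sketch only, not a definitive recipe; in particular, ``model.optim.lr`` is the standard NeMo optimizer learning-rate path and is not set explicitly in the commands above, so treat it as an assumption.

.. code-block:: bash

   # Illustrative starting hyperparameters based on the observations above.
   # model.optim.lr is an assumed config path; verify it against your model config.
   model.global_batch_size=64 \
   trainer.self_rewarding.max_iterations=3 \
   trainer.self_rewarding.max_epochs=1 \
   model.optim.lr=3e-7 \
   model.spin.ref_policy_kl_penalty=0.1 \
   model.spin.use_meta_judge=true \
   model.spin.meta_judge_pcnt=0.15 \
   "model.spin.length_control=[0,0,0.1]"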