
Commit e3d1192

Merge branch 'adithyare/dpo_data_refac' of https://github.com/NVIDIA/NeMo-Aligner into adithyare/dpo_data_refac
2 parents: a76c29a + 613a63a

28 files changed: 2198 additions & 130 deletions

.github/workflows/cicd-main.yml

Lines changed: 1 addition & 0 deletions
@@ -90,6 +90,7 @@ jobs:
       matrix:
         test_case:
           - ppo-llama3-pp2-reshard
+          - reinforce-llama3-pp2-reshard
           - dpo-llama3
           - kd-llama3
           - sft-llama3

.github/workflows/release.yaml

Lines changed: 8 additions & 1 deletion
@@ -20,10 +20,15 @@ on:
         description: Ref (SHA or branch name) to release
         required: true
         type: string
+      dry-run:
+        description: Do not publish a wheel and GitHub release.
+        required: true
+        default: true
+        type: boolean

 jobs:
   release:
-    uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_release_library.yml@v0.12.3
+    uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_release_library.yml@v0.15.0
     with:
       release-ref: ${{ inputs.release-ref }}
       image-name: nemo_aligner_container

@@ -36,8 +41,10 @@ jobs:
       python-package: nemo_aligner
       container-workdir: /opt/NeMo-Aligner
       library-name: NeMo-Aligner
+      dry-run: ${{ inputs.dry-run }}
     secrets:
       TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
       TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
       SLACK_RELEASE_ENDPOINT: ${{ secrets.SLACK_RELEASE_ENDPOINT }}
       PAT: ${{ secrets.PAT }}
+      SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
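
With the new `dry-run` input in place, a release rehearsal could be triggered from the GitHub CLI roughly as follows. This is a sketch: `gh workflow run` with `-f` inputs is standard GitHub CLI usage, but this exact invocation is an assumption, not part of the commit.

    # Hypothetical dry-run trigger for the release workflow above
    gh workflow run release.yaml \
      --repo NVIDIA/NeMo-Aligner \
      -f release-ref=main \
      -f dry-run=true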

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
   durations = timer.consume_durations()
   ```
 - Add code and instructions for replicating Reward Modeling training in HelpSteer2 and HelpSteer2-Preference
+- Implement REINFORCE algorithm.

 ### Breaking Changes
 - Upgrade TRTLLM dependency from v0.10.0 to v0.12.0 and migrate from `GPTSession` cpp runtime to `ModelRunner` python runtime. Please use the latest Dockerfile.

README.md

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,8 @@ The toolkit is currently in it's early stages. We are committed to improving the
 * **Reward Model Training**
 * **Reinforcement Learning from Human Feedback using the [PPO](https://arxiv.org/pdf/1707.06347.pdf) Algorithm**
   * [Llama3-70B-PPO-Chat](https://huggingface.co/nvidia/Llama3-70B-PPO-Chat) aligned with NeMo-Aligner using TRT-LLM.
+* **Reinforcement Learning from Human Feedback using the REINFORCE Algorithm**
+  * [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) aligned with NeMo-Aligner using TRT-LLM.
 * **Direct Preference Optimization** as described in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/pdf/2305.18290)
   * [Llama3-70B-DPO-Chat](https://huggingface.co/nvidia/Llama3-70B-DPO-Chat) aligned with NeMo Aligner.
 * **Self-Play Fine-Tuning (SPIN)** as described in [Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models](https://arxiv.org/pdf/2401.01335)
docs/user-guide/aligner-algo-header.rst

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+.. important::
+   Before starting this tutorial, be sure to review the :ref:`introduction <nemo-aligner-getting-started>` for tips on setting up your NeMo-Aligner environment.
+
+   If you run into any problems, refer to NeMo's `Known Issues page <https://docs.nvidia.com/nemo-framework/user-guide/latest/knownissues.html>`__. The page enumerates known issues and provides suggested workarounds where appropriate.

docs/user-guide/cai.rst

Lines changed: 37 additions & 41 deletions
@@ -1,6 +1,8 @@
 .. include:: /content/nemo.rsts

-.. _model-aligner-cai:
+.. include:: aligner-algo-header.rst
+
+.. _nemo-aligner-cai:

 Constitutional AI: Harmlessness from AI Feedback
 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@ -14,12 +16,12 @@ CAI allows training a harmless, but non-evasive AI assistant that engages with h
 .. _Constitutional AI (CAI): https://arxiv.org/abs/2212.08073

 CAI
-###############
-The basic steps of CAI are described in this section and illustrated in the figure below (`Figure 1 <https://arxiv.org/abs/2212.08073>`_).
+###
+The basic steps of CAI are described in this section and illustrated in the `figure below <nemo-aligner-cai-flow-diagram>`_.

 (Supervised Stage) Critique → Revision → Supervised Learning: The AI generates responses to harmfulness prompts using a helpful-only AI assistant, then critiques and revises its own responses according to a principle in the constitution, and then fine-tunes the original model on the revised responses.

-(RL Stage) AI Comparison Evaluations → Reward Model → Reinforcement Learning: The AI generates pairs of responses to harmfulness prompts using the finetuned model, then evaluates which response is better according to a principle in the constitution, and then trains a reward model based on this dataset of AI preferences and a human helpfulness preferences. The AI then trains with RL using the learned reward model.
+(RL Stage) AI Comparison Evaluations → Reward Model → Reinforcement Learning: The AI generates pairs of responses to harmfulness prompts using the fine-tuned model, then evaluates which response is better according to a principle in the constitution, and then trains a reward model based on this dataset of AI preferences and human helpfulness preferences. The AI then trains with RL using the learned reward model.

 .. image:: ../assets/cai_diagram.png
    :alt: basic steps of the CAI process
@@ -29,25 +31,22 @@ The basic steps of CAI are described in this section and illustrated in the figu
 Critiques, revisions, and AI harmlessness feedback are steered by a small set of principles drawn from a ‘constitution’. The supervised stage significantly improves the initial model. It gives some control over the initial behavior at the start of the RL phase, while addressing potential exploration problems. The RL stage significantly improves performance and reliability.

 Motivation
-###############
+##########
 Constitutional AI motivation refers to designing AI systems in such a way that their objectives and behaviors are guided by a set of predefined rules or principles. It includes the following:

-Scaling supervision: using AI to help humans supervise other AIs more efficiently and effectively, especially for tasks where AI capabilities may exceed human ones.
-
-A harmless but non-evasive assistant: reducing the tension between helpfulness and harmlessness, and avoiding evasive responses that reduce transparency and helpfulness.
-
-Simplicity and transparency: encoding the training goals in a simple list of natural language instructions or principles, and using chain-of-thought reasoning to make AI decision making explicit and understandable.
+- Scaling Supervision: Use AI to assist humans in supervising other AIs more efficiently and effectively, particularly for tasks where AI capabilities may surpass human ones.
+- A Harmless but Non-Evasive Assistant: Minimize the tension between helpfulness and harmlessness, and avoid evasive responses that reduce transparency and helpfulness.
+- Simplicity and Transparency: Encode training goals in a straightforward list of natural language instructions or principles, and employ chain-of-thought reasoning to make AI decision-making explicit and understandable.
+- Reducing Iteration Time: Eliminate the need to collect new human feedback labels when modifying objectives or testing different behaviors.

-Reducing iteration time: obviating the need to collect new human feedback labels when altering the objective or testing different behaviors.
-
-Train a CAI model
-#####################
+Train a CAI Model
+#################

 This section is a step-by-step tutorial that walks you through how to run a full CAI pipeline with a ``Mistral-7B`` LLM model. It includes the following:

-1. Data download and preprocessing.
+1. Download the models and datasets.

-2. Generate responses to harmfulness prompts using a helpful-only AI assistant. Ask the model to critique its response according to a principle in the constitution, and then revise the original response in light of the critique.
+2. Generate and revise responses to harmful prompts creating the SL-CAI dataset. Ask the model to critique its response according to a principle in the constitution, and then revise the original response in light of the critique.

 3. Fine-tune ``Mistral-7B`` with SFT on the revised responses to create a ``Mistral-7B-SL-CAI`` model.

@@ -56,24 +55,22 @@ This section is a step-by-step tutorial that walks you through how to run a full
    b. Formulate each prompt and pair into a multiple choice question, where we ask ``Mixtral-8x7B`` which response is best according to the constitution.
    c. Blend the AI feedback preference dataset (prompts and pairs) with human feedback helpfulness dataset.

-5. Train a Reward Model (RM).
+5. Train the Reward Model (RM).

 6. Fine-tune the ``Mistral-7B-SL-CAI`` with Proximal Policy Optimization (PPO) and the RM to train a ``Mistral-7B-RL-CAI`` model.

 7. Run inference.

-.. note::
-   Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.
-
-   If you run into any problems, refer to NeMo's `Known Issues page <https://docs.nvidia.com/nemo-framework/user-guide/latest/knownissues.html>`__. The page enumerates known issues and provides suggested workarounds where appropriate.
+.. _nemo-aligner-cai-flow-diagram:

 .. image:: ../assets/cai_flow.png

-Step 1: Download models and datasets
-#############################################################################
-1. Download ``Mistral-7B-Instruct`` and ``Mistral-7B`` LLM models from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 and https://huggingface.co/mistralai/Mistral-7B-v0.1 into the models folder.
+Step 1: Download the models and datasets
+########################################
+
+1. Download the ``Mistral-7B-Instruct`` and ``Mistral-7B`` LLM models from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 and https://huggingface.co/mistralai/Mistral-7B-v0.1 into the models folder.

-Then, convert into .nemo format:
+Then, convert them into .nemo format:

 .. code-block:: bash
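
The conversion command itself is collapsed in this diff view. As a rough sketch only: the converter script path and flags below follow NeMo's checkpoint-converter conventions and are assumptions, not taken from this commit.

    # Hypothetical HF-to-NeMo conversion; script path and flags are assumptions
    python /opt/NeMo/scripts/checkpoint_converters/convert_mistral_7b_hf_to_nemo.py \
        --input_name_or_path /models/mistral/Mistral-7B-Instruct-v0.1 \
        --output_path /models/mistral/mistral-7b-Instruct.nemo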
@@ -92,7 +89,7 @@ Step 1: Download models and datasets
 This command will download the dataset to ``/path/to/anthropic_red_team_attempts_train.json``


-3. Download SFT helpfulness dataset:
+3. Download the SFT helpfulness dataset:

 .. code-block:: bash
@@ -101,7 +98,7 @@ Step 1: Download models and datasets
 This command will download the dataset to ``/path/to/nvidia_sft_datablend_v1_train.json``


-4. Download and process preference helpfulness dataset:
+4. Download and process the preference helpfulness dataset:

 .. code-block:: bash
@@ -112,7 +109,7 @@ Step 1: Download models and datasets


 Step 2: Generate and revise responses to harmful prompts creating the SL-CAI dataset
-###################################################################################################
+####################################################################################

 Run an inference server in the background using the following command:
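
The server command is collapsed in this diff view. A minimal sketch of serving a .nemo model for generation, assuming NeMo's general-purpose evaluation server rather than the tutorial's exact script; all paths and values are placeholders:

    # Hypothetical: serve the helpful model in the background for generation
    python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
        gpt_model_file=/models/mistral/mistral-7b-Instruct.nemo \
        trainer.devices=8 trainer.num_nodes=1 \
        server=True &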

@@ -158,16 +155,16 @@ Please wait for the server to be ready before proceeding.
        --apply_chat_template False \
        --response_extract_pattern "[/INST]"

-This will generate an SL-CAI dataset of prompts and revised responses as ``cai_revisions_aligner_chat_template.json``
+This will generate an SL-CAI dataset of prompts and revised responses as ``cai_revisions_aligner_chat_template.json``.

-The few-shot samples should be provided following the template in ``few_shot_samples_example.json`` (filling in the `content` tags, and choosing how many samples to use), and should include a red teaming prompt, a response from the helpful model (e.g. ``Mistral-7B`` in this tutorial), critique and revision requests and responses. An example is shown in the `Anthropic repo <https://github.com/anthropics/ConstitutionalHarmlessnessPaper/blob/main/prompts/CritiqueRevisionFewShotPrompts.json>`_.
+The few-shot samples should be provided following the template in ``few_shot_samples_example.json``. Fill in the `content` tags and choose how many samples to use. The samples should include a red teaming prompt, a response from the helpful model (e.g., ``Mistral-7B`` in this tutorial), critique and revision requests, and responses. An example is shown in the `Anthropic repo <https://github.com/anthropics/ConstitutionalHarmlessnessPaper/blob/main/prompts/CritiqueRevisionFewShotPrompts.json>`_.

-*NOTE: The tokenizer file can be found by extracting the .nemo checkpoint using `tar -xf /models/mistral/mistral-7b-Instruct.nemo`.
-There are 2 tokenizer files that end with `.model` in the model checkpoint and they are the same, so you can use either one for data processing.*
+.. note::
+   The tokenizer file can be found by extracting the .nemo checkpoint using `tar -xf /models/mistral/mistral-7b-Instruct.nemo`. There are two tokenizer files that end with `.model` in the model checkpoint, and they are identical. You can use either one for data processing.


 Step 3: Fine-tune Mistral-7B on the revised responses to create a Mistral-7B-SL-CAI model
-######################################################################################################
+#########################################################################################

 Note that you would need to set up multi-node training run in your cluster env, depending on the type of cluster you use. For details, please refer to https://lightning.ai/docs/pytorch/stable/clouds/cluster.html .
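
Expanding on the tokenizer note in the hunk above, a minimal sketch of locating the tokenizer files; the extraction directory and `ls` pattern are illustrative, not from this commit:

    # .nemo checkpoints are tar archives; extract into a scratch directory
    mkdir -p /models/mistral/extracted
    tar -xf /models/mistral/mistral-7b-Instruct.nemo -C /models/mistral/extracted
    # The two *.model tokenizer files are identical; either works for data processing
    ls /models/mistral/extracted/*.model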

@@ -199,10 +196,9 @@ Note that you would need to set up multi-node training run in your cluster env,


 Step 4: Generate the RL-CAI (preference) dataset for RM and PPO training
-##############################################################################################################
+########################################################################

-The following section runs an inference server with the SL-CAI model that we've previously trained, and queries it with red teaming prompts asking for several responses per prompt.
-The responses will then be ranked by a judge LLM being run from NVIDIA's NGC. An NGC API key can be acquired `here`_.
+The following section runs an inference server with the SL-CAI model that we've previously trained. It queries the server with red teaming prompts, requesting several responses per prompt. These responses will then be ranked by a judge LLM running from NVIDIA's NGC. You can acquire an NGC API key `here`_.

 The following command will run the inference server:
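
The ranking step needs the NGC API key mentioned above to be visible to the process. A sketch, where the variable name `NGC_API_KEY` is an assumption rather than something this commit defines:

    # Hypothetical: expose the NGC API key to the dataset-generation scripts
    export NGC_API_KEY=<your-ngc-api-key>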

@@ -257,8 +253,8 @@ Using a different terminal, run the following command to start the RL-CAI datase
 This command will create the ``rl-cai`` dataset files in the defined output folder with the given output filename prefix.


-Step 5: Train the RM
-#####################
+Step 5: Train the Reward Model (RM)
+###################################

 Run the following command to train the RM:
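
The full training command is collapsed in this diff view. A sketch of what an RM training launch looks like in NeMo-Aligner: the script `train_reward_model.py` ships in the examples tree, but every path and override value here is a placeholder:

    # Hypothetical RM training launch; paths and values are placeholders
    python /opt/NeMo-Aligner/examples/nlp/gpt/train_reward_model.py \
        pretrained_checkpoint.restore_from_path=/models/mistral/mistral-7b-sl-cai.nemo \
        trainer.num_nodes=1 trainer.devices=8 \
        exp_manager.explicit_log_dir=/results/rm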

@@ -285,7 +281,7 @@ Run the following command to train the RM:


 The trained RM checkpoint will be saved to output dir given by ``exp_manager.explicit_log_dir``.

-Step 6: Fine-tune Mistral-7B-SL-CAI with PPO and the RM to train a Mistral-7B-RL-CAI model
+Step 6: Fine-tune the Mistral-7B-SL-CAI with PPO and the RM to train a Mistral-7B-RL-CAI model
 ##############################################################################################
 Run the following command in the background to launch a RM and PPO critic training server:
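
PPO here runs as two cooperating jobs: a background RM/critic server and an actor trainer that talks to it. A sketch of that layout; both script names ship in NeMo-Aligner's examples tree, while all paths and values are placeholders:

    # Terminal 1 (background): serve the RM and PPO critic (sketch)
    python /opt/NeMo-Aligner/examples/nlp/gpt/serve_ppo_critic.py \
        pretrained_checkpoint.restore_from_path=/results/rm/checkpoints/model.nemo \
        exp_manager.explicit_log_dir=/results/critic &

    # Terminal 2: train the PPO actor against the critic server (sketch)
    python /opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_ppo_actor.py \
        pretrained_checkpoint.restore_from_path=/models/mistral/mistral-7b-sl-cai.nemo \
        exp_manager.explicit_log_dir=/results/ppo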

@@ -329,8 +325,8 @@ Run the following command to launch actor training and a reference policy server


 The trained LLM policy checkpoint will be saved to the output dir given by ``exp_manager.explicit_log_dir``.

-Step 7: Inference
-##################
+Step 7: Run inference
+#####################
 To start inference, run an inference server in the background using the following command:

 .. code-block:: bash
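
Once the server is up, it can be queried over HTTP. A sketch assuming the defaults of NeMo's text-generation server; the port and the PUT /generate endpoint are assumptions, so check the server's startup log for the real values:

    # Hypothetical query against the locally served model
    curl -s -X PUT http://localhost:5555/generate \
        -H 'Content-Type: application/json' \
        -d '{"sentences": ["Explain Constitutional AI in one paragraph."], "tokens_to_generate": 128}'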
