1 change: 1 addition & 0 deletions experiments/cicd/cicd_1778628652/_CONFIG
@@ -0,0 +1 @@
eJzdWV1r2zAU_SvFL9ugmCZpkjK2wTY2GOzrYW-lCMW-SbUokpHkrNnof58k27FiyXbahNL2ISX6uufqSveeU-VfJDhX0euTf5HaZKC_RALm0elJtISNacFNBoKsgCk0iG51P5_9hkTJYkmeUdD9zvLaTrYpLa14mlPbNcsJVYRJ08vwyvZZG9YyUbAydi8voy8shZuXxPx9e_bKTE9hli-iq9OT3cGBHZSbFSVsieaCr5CAFVeAUiKiqysDDwqnWGFtmeWUGqSUJOpAt42JXa8bWEeNSGgf240Mn8FGzPIUz_RtUgKvQUgQqJp3h5OakzSlECMpkjjhbE4Wrj8fKpBfFca3ypXG_XuvlHhplr19MWeIC5RQ-cJctU50pm8eEjnT0ALiDCdLvCBsEc-wBNePn3YERHRr77MDhsUit7lmmk3E3cSscq_dhsKLbhNlGrRbuCZScbHpNzLURtquzXHPyRxUVsbvaBfjY9HTfgls9WmEgQK2IGtMc9Mxx1SC9U5dWwvRG1Nb38VwA0mu4xhXfsdlMfMC31LHeoBNEu2H21onO05v97R78vR2Lz80XlW7Rk-8dnXuF9garbGQxX6LhD2sWO9JlnWgw3xJcc6S69D40I7XjjfHR3ZcgBIEAsPnxfJaM5DUnzS2k37zWXH3msMTO8wUlkuJtCHEeAr-tGkxTY_Jbp4_f-JXbLuR8TPYSGf92D83HhPPF-lOOIspTzB1HflqOj6V5eAIbD88mO3Pj8H24wdlextVVBXVO3C-c0x3JvyqgnYGw9EhXlSrEnsod9el2oOoqnT_gY18_8oS3uPeWadvWxrwXXMZ4NAQNOjEAyuZpAcmijpBtnTkmW8wUQ_MoBPFYzUfzRDaYRgFJR5H0g27JZ2rbA6TcnsqG0UUDYiBQak7Kq_CsiYkRgpBQ_kCUVgDbZM0iT4DhlZBJTIuNZON4YLzdLYBHUEpdW1okzfqWgBOM84p-sPFEkJaq1A4ckkyJBVWuURY6TpIlD_1opiq0wRT8hfq85vriikTQTLVo5Imz0Bc2I1Mn8FGOlNy_zx7JCrJfOoa7rrwqe49XBwFSO6O4mhyDHE0fVBxtPM8-xDCyBbgPqpNSJIG6bYs3wG9UFTuzvB6QtAz06s1rGdoMJ1eTIYXk_Ew6GRQZGwZog_hy_fPP4JmHY7xrNf0ct9HJpegAkoxxE09UErk4eesFqLzUH2O69M1k-CdCVCl_3Lms-R9IxkkXB-wk2vvGdo-Aj-OphsFNZ22fmufduYJz5n3A8_g1PnZZNsYlo3e5_tCsjrPtlsTo7JRP49th87dxngfqGp54H_GGmPkmp24jek-GNXyRukdmOiZuYQzcwRn8VmsS9R_wW7TrQ==
1 change: 1 addition & 0 deletions experiments/cicd/cicd_1778628652/_TASKS
@@ -0,0 +1 @@
[["eJzdXG2PnDgS_itRS6fsnUIz_TbdM7o5KXvJ7e1L7iIl-ymKEAPubieAWWMmM7fKfz8bA223C0wDM5mMtDtKQ1FVLttV9ZQL_pxQQtjk8tmfE3aXIv6PCUXbyYtnk8_oTvz6RK692eQrv0CuP6GAZQVthDPGLyuPHZ5P70oOMQnzqLh0neOI4SQTVxM_Lq4JFgVfzFAsuH74MPk5CdHtD1j8vTr7K6c-cI2QXzC98aO8uMC5o4ikzP2buJz6bF8wmfxdDOgfU3SLgpwROk394LO_Q3SKkyDKQ-RxUoZo8uHs4-Tj148vnuliZx3FehQFOEVZb_EzUPzcIp6hjPWXOQdlLiwy0a0fp9GAoS5AsUuL2PQupUQsuSkjcdRT9BIUvbJZmZAocyM_T4I9om5A4pgkvYe_AnU4P02HwZNwDmqx7qRFX5nrQuZH4QsQ80Of-fz5JI-iXuw4rxfSb3jzb-p6XBanboqSDKMb5OBEKOdwjRPm_LHZfMlS5LvMpzvEOtqNoshn-KYY6b6nY7pnpfq5q3tWqp8_u2el-nm7e1aqnx-8Z6X6OcZ7Vqqfn7xnpQY6Uo2ZdKMs59FkYApX8LA60kmIrvPdBHZok-wujnDy2dtSEvOcKiYMeSGmJnnhaiZGSAA3_-RowIDhhNYhDoZmsYKFboEjWaNaFxpHPZBhMfFxDEQ8HvrXfGUy6t8gmiHqVXQnzNQWh2GEpl5Gg2lAki3eqfr8WAl5X8l4U6lytJZfMkZ_EI9dPd8mHqFeEGXPdU8ASE_4KvZonnDRFJWbECe7abVmFVXeyktvy406kY5HEctdRh5zL-KJn8eydaRW7ehmHszftbMoN0Qzhz3PnAi9szOZTwx3dU8z9lW6QGFGr3J4oy2Vf8orzcui8G3P24PD1o8y1M1Rl67SmIAGL2kR3D1ANHphQ5MjB9y6EsoqgcFD8852Dq1rSae3-I-vnayh4IzFN8UZyQ27bU8XKng4TbK7TIw5QD3hQ5CH_umyZr3yEoCTYvPlN7W542wJDZBD-Or5QjmTK0b5nRPGg275iDzuMrOeU-E4gZ-ynCKH-skOXYmpeUvJFkeIvkxxT136AbgjXRyUhFc8BKQ9legH2LgS3ATOjvrp3imWy1VCwr6T0g-flTrw7J6n9FKHwq_3VGI5wsY5sFNT-mFpYMeUvpCfyjXZlNkXNFsShagxm5cknEchviGXV50EOGsTwxogzpVkuzQXEYHiIGvHBcNc_-NIp4uBDPOnj2MgrXG9-4p_TLhAbmpMknpXq7r8Vl0bjgjmgxHBYgxEsHxQRFDZ9OGQgOoSBwMC3dHXjtaYBsXHWmTqLvsE4ZUHb5BdOm-b9FKw95eTModDaIClF1HBDiCA9XsUMuw8lg08lHgy8qxrkWocBDS3IKDaE32so8fqO48ereNFyY1349NMQR_Dxjv4ZGm_dSIS-NFl_a_2_cIdFfNxIiaZ5AnrCzncPYmRS-IEx5FXLJWFG-AgPPzxZuv15ny-OV_NL90soD4L9lkqkrLT9Ot5jnSKfm5MSJLtCRfs_opj7Pw6n66cV_-K_GzvvX7502-vFzxshSFKvDCPU-_MDTiacOXMZu4bccLv_Ddl_NH_IepWJ_6Xbp5Rt5gTN8LXbnrH9iRZTGdzN-TT7pTlg6x-4ETL9DzMegSWqXohTrZQ9eCJlup5wtZmqcvi74l69ANxx2wU53P-TZ2P4yTEqbVzCu0cYbR2s2Q8pVXLHafao35eRbHD0FdHFCt6m2DkKRFsO3il_p0XkR3PqdAW37aDyWET--2jaD2Q9RMYSGuK1H3hPU4wmUU5jVVF3okLv5DrV1w8jrIRAGUbHOwGKM_HAJTrBwWUhV094TFCaciHQ5aVmxp62qMoPz24PmMmOsHJk4Q1QkjdhdowJKd22kORKvTYPY-DoBYWBKVooICozXfuNVuHzPIkQdHhnFtsEi2WDytAnhDLwb4RGcr3hOdGDYE8z6AgL4vPx6OCC9ApoQB3WXXGPJtmmN2Z989lA8weRZF5cy0ZU-SJ7lM_CdsTjIvvfIXVA5mdPYGRtLqQ7tvhMWUY5S4PIsyXs5ZfvPv3--LeCJkFUGQ7MbO4GCOz4GvwQVOLbC-t-7Aphb1_5LTiQntoLtdP7aaNaRIe2qZPJxHS1Rv8hZe3ZhjiXZguMmTIMGTo0cK-0DbmahWhxKLmfN4pHhcxyezfKcPR0PyulHKIbmbPkghsI4kpg6RprUN8HEmSFnHHyRmXlpxRSlbKQMNAtlkGOrUiE6IU8SQkCTDSErlhtfGOidyhCwvO5A5nFHA2d6jpN7UGM1qMC07l0G2KKC7iBQ6bcrrGZFOmdH4QiNpZY1LnU75lOFg3CTZaXarGLsdkFwUZw2DpquzQICE0yFnZm8H87HPm8aF6RSeNSSeNGaS5pBL0ANVCp9qlOUAkDburiBoErnSqBoHSwH8QaGjSujGKgXub6l6LnhcaTQDRzKVxhYNA0ATPZ-X9JOOOAIMk5TK9LfoogfvSpjsKTd9c2jLDu8QHQMN8deCd4RuIe7lCKYfHwF1pQsK3vxeD0zTfVAMsa9g45psVoLs4oqtq3camPDsi5PsXoJJ29cOw2Dh-5PFNxHcIQ-BGlyZWCssGhTTyXjAgO5QgkkNUy2pF8LSR-3M_yynUpqQ4hQr6GzTS7pWrN26vyzclVMdrEG10ojuv8MAmXekeCA32wgRJSknQsvWWcgq--Jhn2dyneFsOEXaU5KlIYwB6ORkxSTCzUM5hSq-WBTwjpyZDPF54EV8SkKteVtNX8uSXcQBSrkpuO3jDLuW8JIh9IRRwOMt1XevnQSuNfAapvCmrAFHkoYwPi1OFXsa4ny8H2QbdZ0_mDZTZk38FpXsG9Jiwu-104HWZ-I2A4FfDXz8Z5f2T2cO-gCJPB6oE-uFwfJUstxoEeDvGsHCVVbc3Qh2a6gwGVdptnxpgiZQ5uQXCnbWCljqvN1VTU3pbCeCkosYRWDAkdy2zdH551D1IzMbp8GgfYHO5psQ4wyo2NVACKiklRrIJuBaNSO1SFLwFzpACtcY49dKQmyFQ5AO2MZ0tL8_O-H-WUhSGGyMF8LNImLUyLqGjyVkHjRYZm3YZxwDUkKZBz6HzcoRjm4VxZDiarAJlGqI0HDzIhkeIulnSGBbcWSzIIflQEQWqNxhz9GXdL-0bpagIQHzHmnGtrNAoKBhPUAAKKqsSg9dvVdwABFRljeEyDgUSIF4XpZGhMuoKi7kxqNVD2vcDBX2krMwMZV7Vd2DTiMrOKMaRJSIg2-Zwd6iAssJkMK9rS7ZN7aeizNG-s5VCFbRY1RKV9fWIm4BOMXGTGxxinm7xbIxQypwoil2KOHmGLmfT-dTia4zCWItesiRmf_UASJi1ctkIm1GtvZnrAaq6DV4fcCnP3FBVEc9uJ6CrTCvwDX4_46hcCHl6pVA4WNxx3RGGF7Li2GodqHfNTEyL0mQ7H-Wg2nherV3a52rdwkDWNa3-YSsm43P7djRqpeaw4Srp4NPEpuKroUBD2dWiwKIdDDfWcs01e1yb
tQgWb8O3r1qzLGwXeigIW8Sft4-7tdJs-pZDjXlwxFbL1ZAb0gvVQ8UBlW9gdLsx0sG6dG7iQVk0Hyqgrr2D7aFV1b1Dc-g0y6-nnywlG72SD7RfNNTwh26LlsOBcboMVu1dBkp7wbBu1N7tBU1dBcPejejYVQCd5ctjLPiguTqkrYwH9xLs_SSEPoCwVDsVGrsI-BqATpPkiVQa5Tsx1IYeAhHGRXszQHB0SggeJcoTQtkA3no29WS-gjB78p9B6L6PHsnZlPhf5AeK7F_4z-GHUG2vl3Q8hBrlkwezh_3mgfxUd9dFoBwRnnzsZD9GGeUYAjxTGVDMbChdVl6-A4ZSTvjMJK8IBr0ORKo40nQsZzN3I5Y8hCAzPWRmVgUskwLI3E6zFAXZ1Jdfn6rWycs0fSe4TH__z7vff3zz8_v3r1-B4ysjnZltySDXczrrEAmskTI69szalOg6EOcCHb4dXuWCV0n1vtYY2eI5mC1y7uIy5xPIdi3tW_-zKoGclz8O35Cd1WFW_VHRWb_qWZ5Hmd9wrEUu1B9LTf5cFblQfyy7yK8e174YU7NYqXK1H-eaEprcc_XHuosS1ePwW4Y1r40mc6neutBm4ayL0Op5_f2DeoBrTdhK46_Pead5XmlD1FolapH6-DQzzjQLzzpNbcWgCowzsbgFkTgLFudG07Mpd-T_B4lEPFY=", "eJzdV11P2zAU_StVXmASaUjaTQyNSV0BUfWLQfdUVZZJ3MabY2eJw6gQ_32225Q2sWs2eNleqsa-Pude-_jYfnQyxrhz2nh0-DJF4o-Toblz1HB-oKX8ysMMpxz4zpNoY3ffUchzFU5wLpu3Rj5DpMs1SMKigqimuwITjmkuWylMVJuEULiYo0SiTqdOj0bo4RDL37PjdyL6GZUgqEDvISlUg-timhbcjSCHDU9kCnkY5ykMkSebZGwKeayQnU-y0M9NmC3y6fHMmT3Njhq7dL6VjhVc8eGsQsfmc4IpAjGOIkRBziFHuZnf1_IHVv4EPrg5-ukSRBsn_sfAzBBoGVpWBp42TsyoLS1q2543Qy7ai9xWyDMpGcShWr3TBi0IMY0Q4aKLFylBr1ShwrDK0Lm6BMPx-cUAdPvXE0cvHmdy05uMR6Db6V5dgPPeTT0uKMGuxsOLener7J6M-xejer-aa2c4GILbfu8a9Ea3k85gUI97r-IGnW-j7hW47d70ZM6a6ZVlRzh87U6WENUp7KPloTCRs4OdmTuwiMWL5y5hISTe11-IegljNI8Zh9jr4wS7_aD53j2_JDCPtWJC9H5aIdxodpNQdZWsOYU4jDyeYc6oG8IwRmbuGnadfr34L2MV02FhLOG0REpGNqa92CuEOnhVgzYS38xSg6qz7SjZRpUuecyomW8XTOc8ptOw9JsnA_bKldSGCt54Q5kyfAu_MxhDrcJNzJb5vq7OF5qvzMTkuZjKo9fktKuzQm-zasH0Dosoz5Ypw5SbvDWpd3xYdWwmaZ_ftv5xeWwKaf8HhcjhEbwTauYZvEdZjjJQxv3ByTgXtz-CmiDPwmbI6BwvtvP5UpJMSo5hmUpF_x3Os0M57OxgTgHLQEjyiudp2ClKGMgKqqG-VZd4Z-WpW-hicxSJkDqQn1UKnfMFezA4XOyHWOvejBCL9wDLlnaQtrPHsd90YeTKbJ5Ab6OD7qrFvObS7GxHXMiShFEPwQVBLS8qkhSUbxCZeNNwPVr5aG0FVhb6V9eE0n11q7pfDusHZG2gcGW7ArTjSsu2FXJnvj0-u34NP7HAziHJkfbU1IGtFWavNNin9d14i43pjnSRmmwWOCEraPVp7-88sfztt8LmI9iJC7a7Wtsf7fWH1Wv93S3ny_xkHBZXOlHkcfO4KWr5DXQjLSI="]]
1 change: 1 addition & 0 deletions experiments/cicd/cicd_1778628652/_VERSION
@@ -0,0 +1 @@
0.9.0
198 changes: 198 additions & 0 deletions experiments/cicd/cicd_1778628652/__main__.py
@@ -0,0 +1,198 @@
"""NMM Sandbox CI/CD orchestrator — internal alternative to Model-Optimizer/tools/launcher/launch.py.

Shares core logic (dataclasses, executor builders, run loop) with the public launcher
via modules/Model-Optimizer/tools/launcher/core.py. This file adds internal cluster factories,
CI batch mode (job_yaml), test_level filtering, and internal defaults.
"""

import getpass
import os
import sys
sys.path.insert(0, '/usr/local/lib/python3.12/site-packages')
import warnings

import nemo_run as run

# Add the launcher to sys.path so we can import core.py
sys.path.insert(1, os.path.join(os.path.dirname(__file__), "tools", "launcher"))

from core import SandboxPipeline, SandboxTask, run_jobs, set_slurm_config_type, register_factory, get_default_env # noqa: E402
from slurm_config import (  # noqa: E402
    SlurmConfig,
    slurm_factory,
)

set_slurm_config_type(SlurmConfig)

# Register the slurm factory so task_configs YAMLs can reference it by name
register_factory("slurm_factory", slurm_factory)

# ---------------------------------------------------------------------------
# nmm-sandbox-specific configuration
# ---------------------------------------------------------------------------

EXPERIMENT_TITLE = "cicd"
DEFAULT_SLURM_ENV, DEFAULT_LOCAL_ENV = get_default_env(EXPERIMENT_TITLE)


def _ensure_launcher_nvrx_install() -> None:
    """Idempotently rewrite the Model-Optimizer launcher's service_utils.sh
    util_install_extra_dep so it (a) installs nvidia-resiliency-ext from
    HEAD (the container's pinned version privatized get_write_results_queue;
    HEAD keeps the public alias), (b) falls back to SLURM_LOCALID when
    OMPI/PMIX vars aren't set so only one rank per node installs, and
    (c) uses a /tmp marker as a barrier so other ranks wait. We keep
    Model-Optimizer at upstream main, so patch the working-tree file at
    startup — the PatternPackager ships the patched version.
    """
    path = os.path.join(
        os.path.dirname(os.path.abspath(__file__)),
        "modules/Model-Optimizer/tools/launcher/common/service_utils.sh",
    )
    if not os.path.exists(path):
        return
    with open(path) as f:
        content = f.read()
    if "nmm_extra_dep_installed" in content:
        return  # Already patched.
    rank_old = (
        "mpi_rank=${PMIX_RANK:-$native_mpi_rank}\n"
        "mpi_local_rank=${PMIX_LOCAL_RANK:-$native_mpi_local_rank}"
    )
    rank_new = (
        "mpi_rank=${PMIX_RANK:-${native_mpi_rank:-${SLURM_PROCID:-0}}}\n"
        "mpi_local_rank=${PMIX_LOCAL_RANK:-${native_mpi_local_rank:-${SLURM_LOCALID:-0}}}"
    )
    func_old = (
        "function util_install_extra_dep {\n"
        " if [[ \"$mpi_local_rank\" -eq 0 ]]; then\n"
        " pip install diskcache\n"
        " fi\n"
        "}"
    )
    func_new = (
        "function util_install_extra_dep {\n"
        " local _marker=/tmp/.nmm_extra_dep_installed\n"
        " if [[ -f \"$_marker\" ]]; then\n"
        " return 0\n"
        " fi\n"
        " if [[ \"$mpi_local_rank\" -eq 0 ]]; then\n"
        " pip install diskcache\n"
        " local _nvrx_dir\n"
        " _nvrx_dir=\"$(mktemp -d)/nvidia-resiliency-ext\"\n"
        " git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext \"${_nvrx_dir}\" \\\n"
        " && pip install \"${_nvrx_dir}\"\n"
        " touch \"$_marker\"\n"
        " else\n"
        " local _waited=0\n"
        " while [[ ! -f \"$_marker\" && $_waited -lt 600 ]]; do\n"
        " sleep 1\n"
        " _waited=$((_waited + 1))\n"
        " done\n"
        " fi\n"
        "}"
    )
    if rank_old not in content or func_old not in content:
        return  # Upstream layout changed; don't patch blindly.
    content = content.replace(rank_old, rank_new, 1).replace(func_old, func_new, 1)
    with open(path, "w") as f:
        f.write(content)


_ensure_launcher_nvrx_install()


packager = run.PatternPackager(
    include_pattern=[
        "modelopt/*",
        "modelopt_recipes/*",
        "tests/*",
        "examples/*",
        "pyproject.toml",
        "tools/launcher/common/*",
        "tools/launcher/examples/*",
        "tools/*",
    ],
    relative_path=[
        os.getcwd(),  # modelopt/*
        os.getcwd(),  # modelopt_recipes/*
        os.getcwd(),  # tests/*
        os.getcwd(),  # examples/*
        os.getcwd(),  # pyproject.toml
        os.getcwd(),  # tools/launcher/common/*
        os.getcwd(),  # tools/launcher/examples/*
        os.getcwd(),  # tools/*
    ],
)

MODELOPT_SRC_PATH = os.path.join(os.getcwd(), "modelopt")


# ---------------------------------------------------------------------------
# Entrypoint
# ---------------------------------------------------------------------------


@run.cli.entrypoint
def cicd(
    job_name: str = "01_job",
    job_dir: str = os.environ.get(
        "SLURM_JOB_DIR",
        "/lustre/fsw/portfolios/coreai/users/{}/experiments".format(getpass.getuser()),
    ),
    task: SandboxTask = None,
    pipeline: SandboxPipeline = None,
    hf_local: str = None,
    user: str = getpass.getuser(),
    identity: str = None,
    test_level: int = 0,
    detach: bool = False,
) -> None:
    """NMM Sandbox CI/CD orchestrator.

    Args:
        job_name: Name of the job.
        job_dir: Directory of the job.
        task: Single task config (task=@<yaml>); wrapped in a one-task pipeline.
        pipeline: Pipeline config (pipeline=@<yaml>); run as-is.
        hf_local: Local HF assets path; when set, results go to ./local_experiments.
        user: User name for SSH tunnel.
        identity: Identity file for SSH tunnel.
        test_level: Test level used to filter tasks in CI.
        detach: Submit and detach instead of waiting for completion.
    """
if "NEMORUN_HOME" not in os.environ:
warnings.warn(
"NEMORUN_HOME is not set. Run 'source .sandbox_credentials.sh' to set it. "
"Defaulting to current working directory."
)
run.config.set_nemorun_home(os.environ.get("NEMORUN_HOME", os.getcwd()))

if hf_local is not None:
job_dir = os.getcwd() + "/local_experiments"

job_table = {}

if task is not None:
job_table[job_name] = SandboxPipeline(tasks=[task])
elif pipeline is not None:
job_table[job_name] = pipeline
else:
print("No task or pipeline provided. Use task=@<yaml> or pipeline=@<yaml>.")
print("For multi-job YAML files, use: bash tools/run_job_yaml.sh <yaml> [args...]")
return

run_jobs(
job_table=job_table,
hf_local=hf_local,
user=user,
identity=identity,
job_dir=job_dir,
packager=packager,
default_slurm_env=DEFAULT_SLURM_ENV,
default_local_env=DEFAULT_LOCAL_ENV,
experiment_title=EXPERIMENT_TITLE,
detach=detach,
test_level=test_level,
modelopt_src_path=MODELOPT_SRC_PATH,
)


if __name__ == "__main__":
run.cli.main(cicd)
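
# Example invocations (illustrative; the YAML paths are placeholders and the
# key=value syntax follows nemo_run's CLI):
#   python __main__.py task=@tools/launcher/examples/task.yaml
#   python __main__.py pipeline=@tools/launcher/examples/pipeline.yaml test_level=1
# For multi-job YAML files, use the wrapper script instead:
#   bash tools/run_job_yaml.sh <yaml> [args...]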
@@ -0,0 +1,89 @@
# Model Optimizer Benchmark Reference

This document summarizes performance and accuracy measurements of [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) for a few popular models.
The benchmarks in the following tables are provided as reference points and **should not be considered the peak
performance** Model Optimizer can deliver. All performance numbers are measured with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) or [TensorRT](https://developer.nvidia.com/tensorrt-getting-started).

## 1. Post-training quantization (PTQ) for LLMs

### 1.1 Performance

Config: H200, nvidia-modelopt v0.21.1, TensorRT-LLM v0.15, latency measured with [trtllm-bench](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md#for-non-gh200-systems-1).
Inference speedups are relative to the BF16 baseline. **Speedup is normalized by GPU count**.

> Benchmark scenario: Input tokens 2048, output tokens 128. Real performance may vary based on the target use cases and the flags used to build the TensorRT-LLM engine.

> Memory saving is not reported here as TensorRT-LLM occupies all the remaining available GPU memory for KV caching.

> If GPU memory is the bottleneck, lower-bit quantization may yield a better GPU-count-normalized throughput gain at a smaller TP.

| | | BF16 (8B:TP1, 70B:TP2) | | FP8 (TP1) | | |INT4 AWQ (TP1)| | |W4A8 AWQ (TP1)| |
|:------------:|:----------:|:----------------------:|:-:|:------------:|:-------:|:-:|:------------:|:-------:|:-:|:------------:|:-------:|
| Model | Batch Size | Tokens/sec | | Tokens/sec | Speedup | | Tokens/sec | Speedup | | Tokens/sec | Speedup |
| Llama3.1-8B | 1 | 173.80 | | 245.03 | 1.41x | | 231.75 | 1.33x | | 239.70 | 1.38x |
| | 8 | 803.11 | | 1,051.17 | 1.31x | | 599.72 | 0.75x | | 801.72 | 1.00x |
| | 64 | 1,679.74 | | 2,190.93 | 1.30x | | 1,392.78 | 0.83x | | 1,930.86 | 1.15x |
| Llama3.1-70B | 1 | 45.81 | | 43.46 | 1.90x | | 44.10 | 1.93x | | 46.31 | 2.02x |
| | 8 | 182.61 | | 182.07 | 1.99x | | 93.98 | 1.03x | | 140.02 | 1.53x |
| | 64 | 401.50 | | 420.64 | 2.10x | | 176.68 | 0.88x | | 345.43 | 1.72x |
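
For orientation, a minimal PTQ sketch with the `modelopt.torch.quantization` API is shown below. The checkpoint name and `calib_dataloader` are placeholders; see the [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) examples for the full, supported flow.

```python
# Minimal PTQ sketch (placeholders: checkpoint name, calib_dataloader).
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype="auto", device_map="auto"
)

def forward_loop(model):
    # Run a few hundred calibration batches so the inserted quantizers can
    # collect activation statistics.
    for batch in calib_dataloader:  # user-supplied dataloader
        model(**batch)

# Swap in mtq.INT4_AWQ_CFG or mtq.W4A8_AWQ_BETA_CFG for the other columns.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```

The quantized model is then exported to a TensorRT-LLM checkpoint and built into an engine, which is what trtllm-bench measures above.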

### 1.2 Accuracy

The table below shows the MMLU accuracy loss in percent relative to the BF16 baseline.
Config: H100, nvidia-modelopt v0.21.1, TensorRT-LLM v0.15.
FP8 is typically the go-to choice for H100; the 4-bit AWQ methods are recommended when GPU memory is a constraint.
More benchmarks with earlier versions of Model Optimizer can be found in this [TensorRT-LLM README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#benchmark).

| Model | MMLU loss FP8 |MMLU loss INT4 AWQ|MMLU loss W4A8 AWQ|
|:-----------------------:|:-------------:|:----------------:|:----------------:|
| Llama3.1-8B (instruct) | 1.50% | 5.66% | 6.00% |
| Llama3.1-70B (instruct) | 0.38% | 1.07% | 1.20% |

## 2. PTQ for Stable Diffusion

The following table shows the inference latency and speedup of INT8 and FP8 quantization on the Stable Diffusion XL 1.0 base model, relative to the FP16 baseline.
Config: image resolution 1024×1024, 30 steps, TensorRT v9.3, num-warmup-runs=1, batch size=1.

| GPU | INT8 Latency (ms) | FP8 Latency (ms) | Speedup (INT8 vs. FP16) | Speedup (FP8 vs. FP16) |
|:--------------:|:-----------------:|:----------------:|:------------------------:|:-----------------------:|
| RTX 6000 Ada | 2,479.19 | 2,441.16 | 1.43x | 1.45x |
| RTX 4090 | 2,058.11 | 2,161.38 | 1.20x | 1.14x |
| L40S | 2,338.88 | 2,167.82 | 1.25x | 1.35x |
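
As a rough sketch (not the exact benchmark setup), quantizing the SDXL UNet uses the same `mtq.quantize` call; the prompt list and step count below are illustrative.

```python
# Sketch: INT8 PTQ of the SDXL UNet (calibration prompts are placeholders).
import torch
import modelopt.torch.quantization as mtq
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def forward_loop(unet):
    # Run a few prompts end to end so the UNet quantizers see realistic
    # activations across the denoising steps (the unet arg is already wired
    # into the pipeline, so we drive it through pipe()).
    for prompt in ["a photo of an astronaut riding a horse on mars"]:
        pipe(prompt, num_inference_steps=30)

pipe.unet = mtq.quantize(pipe.unet, mtq.INT8_DEFAULT_CFG, forward_loop)
```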

## 3. Quantization-aware training

The table below compares the validation loss of quantization-aware training (QAT) against PTQ for a Llama 2 7B model, using nvidia-modelopt v0.11.0.
The baseline is fine-tuned on the target dataset. We use INT4 to showcase that QAT better preserves model accuracy at low precision. Since QAT is only a short fine-tuning run on top of PTQ, it can recover accuracy at low training cost, which matters for accuracy-sensitive generative AI applications at ultra-low precisions, e.g. 4-bit weights and activations on the [NVIDIA Blackwell platform](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/).

| Method | Dataset | Val loss - BF16 Baseline | Val loss - PTQ | Val loss - QAT (lower is better) |
|:----------------------------:|:--------------------:|:------------------------:|:--------------:|:--------------:|
| INT4 Weight, FP16 Activation | samsum | 1.036 | 1.059 | **1.044** |
| INT4 Weight, INT8 Activation | samsum | 1.036 | 3.321 | **1.294** |
| INT4 Weight, FP16 Activation | databricks-dolly-15k | 1.151 | 1.305 | **1.172** |
| INT4 Weight, INT8 Activation | databricks-dolly-15k | 1.151 | 2.313 | **1.640** |
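
For reference, QAT uses the same entry point as PTQ: calibrate first, then keep training the quantized model. A minimal sketch follows, assuming `model`, `forward_loop`, and a `train_dataloader` as in the PTQ sketch above; the config name and hyperparameters are illustrative.

```python
# QAT sketch (assumes model/forward_loop/train_dataloader; lr is illustrative).
import torch
import modelopt.torch.quantization as mtq

# Calibrate first: INT4 block-wise weight quantization, FP16 activations.
model = mtq.quantize(model, mtq.INT4_BLOCKWISE_WEIGHT_ONLY_CFG, forward_loop)

# The quantized model is still a regular torch.nn.Module, so QAT is just
# ordinary fine-tuning on top of it, typically a small fraction of the
# original training schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for batch in train_dataloader:  # user-supplied dataloader
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```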

## 4. Sparsity

### 4.1 Performance

The table shows the inference speedup of a sparsified Llama 2 70B model over the baseline dense model at different batch sizes.
The benchmark with batch_size=896 is part of [MLPerf Inference v4.0](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/).
Config: NVIDIA H100 80GB GPU. FP8, TP=1, PP=1 for all sparsified models. The dense model needs TP=2 due to larger weight sizes.

| Batch Size | Inference speedup (compared to the FP8 dense model) |
|:----------:|:---------------------------------------------------:|
| 32 | 1.62x |
| 64 | 1.52x |
| 128 | 1.35x |
| 896 | 1.30x |
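
For context, 2:4 sparsification in Model Optimizer is a one-call transform. A minimal sketch is below; the mode and config keys follow the Model Optimizer sparsity examples, and `calib_dataloader` is again a placeholder.

```python
# 2:4 sparsification sketch (config keys per the sparsity examples).
import modelopt.torch.sparsity as mts

model = mts.sparsify(
    model,
    mode="sparsegpt",  # one-shot, data-driven 2:4 pruning
    config={
        "data_loader": calib_dataloader,      # calibration data
        "collect_func": lambda batch: batch,  # unpack a batch for forward()
    },
)
```

The sparsified model is then fine-tuned as usual to recover accuracy, as the next table shows.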

### 4.2 Accuracy

We recommend combining sparsity with fine-tuning to avoid accuracy degradation.
The following table compares the validation loss of a sparsified Llama 2 70B with and without fine-tuning. Fine-tuning and validation are done on the Open-Orca dataset.

| Method | Validation loss (lower is better) |
|:--------------------------------:|:---------------------------------:|
| FP8 (baseline) | 0.721 |
| FP8 + SparseGPT, no fine-tuning | 2.724 |
| FP8 + Sparsity, with fine-tuning | **1.01** |