A modern, no-throw C++ library for high-performance GPU inference of CNN models with NVIDIA TensorRT — with optional zero-copy Python bindings.
tensorrt_cpp_api turns an ONNX model into a cached, optimized TensorRT engine and runs it with
a small, leak-free API: name-keyed tensors at the boundary, caller-owned CUDA streams, explicit
host/device transfers, and a Status/Result<T> error model — no exceptions, no OpenCV or
TensorRT types in the public headers. It targets TensorRT ≥ 10 (built to the TensorRT 11
surface), CUDA 12, C++20, Linux.

```cpp
#include <cstdio>
#include <span>

#include <tensorrt_cpp_api/all.h>
using namespace trtcpp;

int main() {
    // Build an FP16 engine from ONNX, or load it from the on-disk cache if one is already current.
    BuildOptions opt;
    opt.precision = Precision::kFp16;
    opt.engineCacheDir = "engines";
    auto engine = EngineBuilder{}.buildAndLoad("model.onnx", opt);
    if (!engine) {
        std::fprintf(stderr, "%s\n", engine.status().message().c_str());
        return 1;
    }

    Stream stream;  // owns a CUDA stream; use Stream::wrap(existingHandle) to run on yours instead
    auto input = Tensor::allocate(DType::kFloat32, Shape{1, 3, 640, 640}, Device::kCuda).value();
    // ... fill `input` (e.g. via the fused preproc kernel) ...

    auto output = engine->inferSingle({{engine->inputNames().front(), input.view()}}, stream);
    if (!output) return 1;

    auto host = output->toHost(stream).value();  // explicit D2H + sync; never implicit
    std::span<const float> scores = host.as<float>().value();
    // ... post-process `scores` ...
}
```

- Engine cache that's actually safe. Build-or-load keyed by ONNX content hash + build options + TensorRT version + GPU UUID, with a JSON sidecar and atomic writes. A stale cache (changed model, options, driver, or GPU) is detected and rebuilt instead of silently misused.
- Dynamic shapes, done right. Per-input min/opt/max optimization profiles; `-1`-aware `Shape`; one optimization profile per execution context for concurrent dynamic-shape inference.
- Concurrency. `EnginePool` leases execution contexts for multi-stream inference; every call runs on a caller-provided `Stream`. The engine is thread-compatible; the pool is thread-safe.
- No leaky abstractions. No `nvinfer1`, OpenCV, or spdlog types in any public header (PImpl + version-gated, generated `build_config.h`). Consumers need TensorRT at runtime, not compile time.
- Quantization without surprises. `Precision::kFp16`/`kInt8Qdq`/`kFp8`…; precision is version-aware and never a silent no-op (it errors clearly when a mode isn't achievable).
- Optional fused preprocessing (`tensorrt_cpp_api::preproc`): one CUDA kernel does letterbox-resize → BGR↔RGB → per-channel normalize → HWC→NCHW → cast, no intermediate buffers.
- Optional zero-copy Python bindings (`trtcpp`): feed CuPy / PyTorch / Numba GPU arrays in and get them back via `__cuda_array_interface__` / DLPack, with no host round-trips and the GIL released during inference. See `examples/python`.
- Installable. `cmake --install` produces a `find_package(tensorrt_cpp_api)`-consumable package.
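
To make the cache-keying idea from the first bullet concrete, here is a minimal standalone sketch, assuming a simple FNV-1a hash as a stand-in for whatever hash the library really uses. `CacheInputs`, `fnv1a`, `cacheKey`, and `cacheIsCurrent` are hypothetical names for illustration only, not part of the library's API, and the sketch does not cover the JSON sidecar or atomic writes.

```cpp
#include <cstdint>
#include <initializer_list>
#include <string>
#include <string_view>

// Hypothetical bundle of everything that should invalidate a cached engine.
struct CacheInputs {
    std::string onnxBytes;     // raw ONNX file content
    std::string buildOptions;  // serialized build options, e.g. "fp16"
    std::string trtVersion;    // TensorRT version string
    std::string gpuUuid;       // UUID of the GPU the engine was built for
};

// 64-bit FNV-1a, used here only as a simple illustrative hash.
inline std::uint64_t fnv1a(std::string_view data,
                           std::uint64_t h = 14695981039346656037ull) {
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// Chain every field into one key, so a change to any of them changes the key.
inline std::uint64_t cacheKey(const CacheInputs& in) {
    std::uint64_t h = 14695981039346656037ull;
    for (const std::string* field :
         {&in.onnxBytes, &in.buildOptions, &in.trtVersion, &in.gpuUuid}) {
        h = fnv1a(*field, h);
        // Mix in the field length so adjacent fields do not simply concatenate.
        h = fnv1a(std::to_string(field->size()), h);
    }
    return h;
}

// A cached engine is reusable only if the stored key matches the current inputs.
inline bool cacheIsCurrent(std::uint64_t storedKey, const CacheInputs& now) {
    return storedKey == cacheKey(now);
}
```

In this sketch the stored key would live next to the engine file (the role the JSON sidecar plays in the library); on a mismatch the caller rebuilds instead of loading.
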
Single-stream inference latency on an RTX 3080 Laptop GPU (preallocated, zero-copy enqueue
loop — examples/benchmark), TensorRT 10:
| Model | Precision | Latency | Throughput |
|---|---|---|---|
| YOLOv8n | FP16 | 1.07 ms | 937 inf/s |
| YOLOv8n | FP32 | 2.00 ms | 499 inf/s |
| MobileNetV2 | FP16 | 0.31 ms | 3199 inf/s |
Inference time is TensorRT-bound — it is the enqueueV3 cost of the engine itself, so the wrapper
adds no measurable inference overhead. The library's work is everything around that call:
zero-copy name-keyed IO with no per-call allocations or nested-vector copies, a stream-ordered
allocator, and the no-throw Status/Result API. The Python bindings run the same path within
~13% of C++ (examples/python/benchmark_parity.py).
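
For readers who want to reproduce numbers of this kind, the following is a minimal timing-loop sketch, not the code in `examples/benchmark`. It assumes a warm-up phase followed by a timed loop, and reports mean latency plus the throughput that mean implies (roughly 1000 divided by the latency in milliseconds, which is about what the table shows). `BenchResult` and `benchmark` are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical result type: mean wall-clock latency and the implied throughput.
struct BenchResult {
    double meanLatencyMs;
    double throughputPerSec;
};

// Time `iters` calls of `run` after `warmup` untimed calls.
// Whether `run` synchronizes its CUDA stream determines what is being measured.
template <typename Fn>
BenchResult benchmark(Fn&& run, std::size_t warmup, std::size_t iters) {
    if (iters == 0) return {0.0, 0.0};

    for (std::size_t i = 0; i < warmup; ++i) run();  // let clocks and caches settle

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) run();
    const auto stop = std::chrono::steady_clock::now();

    const double totalMs =
        std::chrono::duration<double, std::milli>(stop - start).count();
    const double meanMs = totalMs / static_cast<double>(iters);
    return {meanMs, meanMs > 0.0 ? 1000.0 / meanMs : 0.0};
}
```

A caller would pass a lambda that performs one inference on preallocated tensors, for example one wrapping the `inferSingle` call from the quickstart.
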
TensorRT and CUDA are not bundled; they are provided by the system or an external install. In brief:

```sh
cmake -S . -B build -DTRT_CPP_API_BUILD_PREPROC=ON   # add -DTensorRT_DIR=<root> for a tarball
cmake --build build -j$(nproc)
cmake --install build --prefix /opt/trtcpp
```

Then in a downstream project:

```cmake
find_package(tensorrt_cpp_api REQUIRED)
target_link_libraries(myapp PRIVATE tensorrt_cpp_api::tensorrt_cpp_api tensorrt_cpp_api::preproc)
```

Python: `pip install .` (builds the `trtcpp` wheel via scikit-build-core). Full details (apt vs tarball TensorRT, build options, Python) are in `docs/install.md`.
examples/ has four runnable reference programs, each consuming the installed package:
classification (ImageNet top-5), detection (YOLOv8n + NMS), segmentation (DeepLabV3),
and a zero-copy Python demo with a C++/Python perf-parity benchmark. examples/download_models.sh
fetches the models.
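
As an illustration of the post-processing step the classification example implies, here is a minimal top-k sketch over a flat score buffer. It is an assumption about what "top-5" involves, not the code shipped in `examples/`; `topK` is a hypothetical helper and works with any indexable container, including the `std::span<const float>` from the quickstart.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Return the indices of the k largest scores, highest score first.
// `Scores` is any container with size() and operator[] (vector, span, array).
template <typename Scores>
std::vector<std::size_t> topK(const Scores& scores, std::size_t k) {
    const std::size_t n = scores.size();
    k = std::min(k, n);

    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});  // 0, 1, ..., n-1

    // Only the first k positions need to be sorted.
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });
    idx.resize(k);
    return idx;
}
```

Calling `topK(scores, 5)` yields class indices that can then be mapped to label names.
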
- Quickstart & core concepts
- Installation
- Upgrading from v6
- API reference: `doxygen Doxyfile` (HTML in `docs/api/`)
This library is the inference backend for YOLOv8-TensorRT-CPP and YOLOv9-TensorRT-CPP (object detection, segmentation, pose).
Linux, CUDA 12, TensorRT ≥ 10, CNN-style vision models. Windows and LLM/transformer-specific features are out of scope.
Issues and PRs welcome. Install the hooks with pre-commit install (clang-format + cmake-format);
CI runs the build, the CPU test suite, sanitizers, and a Python wheel build. If this project helps
you, a ⭐ is appreciated — connect on LinkedIn.
- Loic Tetrel 💻
- thomaskleiven 💻
- WiCyn 💻
This project follows the all-contributors specification.
See LICENSE. Version history is in CHANGELOG.md.