
Conversation

@b-pass
Collaborator

@b-pass b-pass commented Jan 13, 2026

Description

pybind11 leaks three PyTypeObjects (metaclass, function_record, static_property), roughly 1 KB each, when an interpreter or sub-interpreter is finalized. These should be released during finalization so that simply creating and destroying interpreters does not leak memory.

This fixes #5794 (Python itself leaks more memory than this, but we can fix what we're doing at least...)

If a pybind11 module is loaded into a non-pybind11-owned sub-interpreter, then the pybind11 internals are leaked when the sub-interpreter shuts down. This PR also fixes that, by having the internals destruct and free during the interpreter finalization.

Suggested changelog entry:

  • Deallocate pybind11 internals during (sub-)interpreter shutdown to avoid memory leaks.

📚 Documentation preview 📚: https://pybind11--5958.org.readthedocs.build/

shutdown can safely DECREF Python objects owned by the internals.
@b-pass b-pass requested a review from oremanj January 13, 2026 01:45
@oremanj
Collaborator

oremanj commented Jan 13, 2026

I think for shared internals you should go ahead and deallocate the internals struct, but not the pointer to it, from the capsule destructor. That is, call get_pp()->reset() followed by unref() on the pp_manager, but don't call destroy(). (For local internals, go ahead and call destroy().) This reduces the leak for the main interpreter to 8 bytes (the std::unique_ptr<internals> itself, but not its contents), and means you can straightforwardly put internals cleanup code in the destructor of struct internals.

The internals state dict is cleared very late in interpreter shutdown - AFAICT it's the last place that calls user-controlled code where the Python C API is available. C-level Py_AtExit handlers are later but those aren't allowed to call into Python. Step 24 in this wonderful list, which I wish had a wider audience than a random GitHub comment: hudson-trading/pymetabind#2 (comment) (that list is for main interpreter shutdown, but I believe subinterpreters do a subset in basically the same order).

If anyone is trying to access internals after the state dict is cleared, it's overwhelmingly likely that whatever they're going to do with the result is invalid anyway.

If somehow someone does manage to ask for internals after they've been destroyed, there are two possibilities for what will occur, and I think both are acceptable.

  • If they are accessing the internals through a pp_manager that has had unref() called -- that is, if they are accessing local internals, or shared internals from the extension module that originally created them -- they will create and initialize a new internals struct and add a new entry for it to the interpreter state dict. If this occurred before the state dict was destroyed (the only possibility for that I can think of is if it's done from the destructor of a different entry in the state dict), then the new internals will promptly be destroyed again as PyDict_Clear continues its work. If it occurred after the state dict was destroyed, it will fail because the state dict doesn't exist - that's fine since you're really not supposed to run anything that touches the Python C API after that point.
  • If they are accessing through a different pp_manager (a different extension module's reference to the same shared internals), they will see that the std::unique_ptr<internals> (which is still allocated) holds null, and fix that by creating and initializing a new internals struct but not registering it in the interpreter state dict. This has an outside chance of creating a split-brain scenario where the extension modules in this category share one internals and the extension module in the previous category has a different internals. But everything is shutting down anyway, so it will be difficult for anyone to observe the mildly confusing results.

If we were OK with extension modules in the second category above crashing (since they hold a pointer to a deallocated std::unique_ptr<internals> in their pp_manager) we could go ahead and destroy() the pp_manager in the capsule destructor. But I think eight bytes' leak is a fair price to pay to allow them to not crash.

The perfect solution would be to store inside internals an intrusive linked list of all the pp_managers that refer to it. Then upon internals destruction we could automatically unref() all of them, so that it would be safe to deallocate the internals pointer too. This eliminates the leak and the split-brain possibility, but adding the list head would require an internals version bump. Maybe leave a TODO comment with a link to this thread.

@b-pass b-pass changed the title Free PyTypes owned by internals during interpreter finalization Destruct internals during interpreter finalization Jan 14, 2026
@b-pass
Collaborator Author

b-pass commented Jan 14, 2026

Ok, I have updated the title, description, and code to delete the internals itself in the capsule destructor. I didn't have it call unref, so currently @oremanj's second case applies. If something manages to use pybind11 during finalization after the internals capsule is destroyed, it will re-create (and subsequently might leak) its own new internals. This is an exceedingly rare case, and even if it happens it won't result in a crash.

If something triggers internals to be created during finalization, it might end up being destroyed after finalization, and we don't want to do the DECREF at that point; we need the leaky behavior.
// In odd finalization scenarios it might end up running after the interpreter has
// completely shut down. In that case, we should not decref these objects because
// pymalloc is gone.
if (Py_IsInitialized() != 0) {
Collaborator

This check defeats the point. Py_IsInitialized begins returning false quite early in Py_Finalize, well before the state dict is cleared, so adding this check will bring your leak back (at least for the main interpreter).

internals can only be freed by a DECREF on the capsule (in which case its destructor runs at a time when it's safe to DECREF things) or by an explicit call to internals_pp_manager::destroy(). Do any of the latter happen at times when it's not safe to DECREF things?

Collaborator Author

@b-pass b-pass Jan 14, 2026

If internals is re-created during finalization, it will be set in the unique_ptr, and the current pybind11 finalization code will subsequently call destroy (after Py_Finalize has completed) and delete the re-created version.

We could, of course, change the pybind11 finalization process to not behave this way....

// destructed when going out of scope here, so the destructor will be called
// immediately, which will also free the storage.
- /*destructor=*/[](void *ptr) -> void { delete static_cast<Payload *>(ptr); });
+ /*destructor=*/dtor);
Collaborator

@oremanj oremanj Jan 14, 2026

This is not a semantics-preserving change.

There are two possibilities for when the capsule gets destroyed:

  • It could be destroyed almost immediately, if another thread inserted the same key before we did. See the comment on line 639. In that case, no one else has seen our newly-allocated payload, so we should delete it.
  • It could be destroyed at interpreter finalization. That's when custom destruction logic might come into play.

If two threads concurrently try to create internals, your use of the final dtor from the start means that one of them will leak its unique_ptr. Not a terrible outcome, but easy enough to avoid.

Suggest you change the dtor parameter to dtor_if_inserted. Create the capsule initially with a destructor that deallocates the payload. If insertion succeeds and dtor_if_inserted is not null, then swap the capsule destructor to be dtor_if_inserted instead. Effectively, the old clear_destructor parameter is a special case of the dtor_if_inserted semantics: dtor_if_inserted is a no-op lambda if clear_destructor is true, or is null (i.e. continue with the original dtor that deletes the payload) if clear_destructor is false.

Collaborator Author

Ok, I believe the latest change addresses this.

@henryiii henryiii self-requested a review as a code owner January 15, 2026 19:59
@rwgk
Collaborator

rwgk commented Jan 17, 2026

I'm trying to get my head around the current state of this PR, starting with:

Cursor (Claude 4.5 Opus) generated, with minor manual edits:


Problem Being Solved

pybind11 leaks three PyTypeObjects (metaclass, function_record, static_property) when an interpreter or sub-interpreter is finalized (~1KB each). This PR fixes the leak by properly releasing these during interpreter shutdown.

Joshua's (@oremanj) Concerns and Their Resolution

1. Py_IsInitialized() check defeats the point

  • Joshua noted that Py_IsInitialized() returns false early in Py_Finalize, before the state dict is cleared.
  • Resolved: The code now uses is_interpreter_alive() which checks Py_IsInitialized() || Py_IsFinalizing(). During finalization, Py_IsFinalizing() is true, so the cleanup runs.

2. Use Py_CLEAR instead of Py_XDECREF + set to null

  • Resolved: The current code uses Py_CLEAR().

3. Capsule destructor race condition

  • Joshua pointed out that if two threads concurrently create internals, one would leak its unique_ptr.
  • Resolved: The capsule is created with a simple delete destructor, and only swapped to internals_shutdown after successful insertion.

4. Don't destroy the unique_ptr itself (main architectural suggestion)

  • Joshua suggested calling pp->reset() to destroy contents but not deleting the unique_ptr itself (8 bytes), since other modules may still hold references to it.
  • Resolved: The internals_shutdown() function does exactly this - calls pp->reset() but leaves the unique_ptr allocated.

Current Implementation

The implementation:

  1. Adds destructors to both internals and local_internals that call Py_CLEAR() on their PyType pointers
  2. Uses is_interpreter_alive() to safely check if Python APIs can still be called
  3. Properly handles the capsule destructor race condition
  4. Accepts an 8-byte leak (the unique_ptr shell) to avoid crashes when other modules hold references

Assessment

The PR looks ready to merge. All critical concerns have been addressed. The remaining 8-byte leak per interpreter is an acceptable tradeoff for safety, as Joshua explicitly endorsed. The last commit (by Henry Schreiner) adds a CI cleanup step, suggesting maintainer involvement.

@rwgk
Collaborator

rwgk commented Jan 17, 2026

EDIT: Please see #5961 for the continuation of the leak investigation


Cursor (Claude 4.5 Opus) generated:


Subinterpreter Memory Leak Investigation

I ran tests to measure memory usage with repeated subinterpreter create/destroy cycles, comparing pure C API against pybind11.

Summary of Results

Test                                               Per iteration   Notes
Pure C API (Py_NewInterpreter)                     ~3-6 kB         Simple create/destroy
Pure C API (Py_NewInterpreterFromConfig + PyRun)   ~4 kB           With config matching pybind11's defaults
pybind11 upstream/master                           ~1,698 kB       Before this PR
pybind11 PR #5958                                  ~1,697 kB       After this PR

Key Finding

The massive leak (~1.7 MB per subinterpreter cycle) is NOT primarily from Python itself. Pure C API subinterpreter create/destroy cycles leak only ~4-6 kB, while pybind11 leaks approximately 400x more.

This PR correctly fixes pybind11's leak of the three PyTypeObjects (~3 kB), but there appears to be a much larger leak somewhere else in pybind11's subinterpreter handling that this PR does not address.

Raw Test Output

Pure C API (simple Py_NewInterpreter):

Iterations: 1000
RSS before: 8704 kB
RSS after:  11776 kB
Increase:   3072 kB
Per iter:   3.07 kB

Pure C API (with PyInterpreterConfig + PyRun_SimpleString):

Iterations: 500
RSS before: 8704 kB
RSS after:  10744 kB
Increase:   2040 kB
Per iter:   4.08 kB

pybind11 upstream/master:

Iterations: 1000
RSS before: 11264 kB
RSS after:  1709568 kB
Increase:   1698304 kB
Per iter:   1698.3 kB

pybind11 PR #5958:

Iterations: 1000
RSS before: 11264 kB
RSS after:  1708032 kB
Increase:   1696768 kB
Per iter:   1696.77 kB

Test Environment

  • Python: 3.14 (default/GIL build, commit df793163d58)
  • pybind11 upstream/master: commit e44aae2 ("chore: bump CMake max policy to 4.2 (chore: bump CMake max policy to 4.2 #5944)")
  • pybind11 PR branch: commit 5f9c30b ("Add cleanup step to CI workflow")
  • Compiler: g++ -std=c++20 -O0 -g (pybind11) / gcc (pure C)
  • Platform: Ubuntu 24.04 (WSL2)
  • Memory measurement: /proc/self/status VmRSS

Test Code

pybind11 Test (measure_subinterpreter_leak.cpp)

// Memory leak measurement for subinterpreter create/destroy cycles
//
// Compile:
//   g++ -std=c++20 -O0 -g \
//       -I<pybind11>/include \
//       -I<python>/include/python3.14 \
//       -o measure_leak measure_subinterpreter_leak.cpp \
//       -L<python>/lib -Wl,-rpath,<python>/lib \
//       -lpython3.14 -lpthread -ldl -lutil
//
// Run:
//   ./measure_leak 500

#include <pybind11/embed.h>

#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
#include <pybind11/subinterpreter.h>
#endif

#include <iostream>
#include <fstream>
#include <string>

namespace py = pybind11;

long get_rss_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {
            // Format: "VmRSS:     12345 kB"
            long rss = 0;
            for (char c : line) {
                if (c >= '0' && c <= '9') {
                    rss = rss * 10 + (c - '0');
                }
            }
            return rss;
        }
    }
    return -1;
}

int main(int argc, char* argv[]) {
    int iterations = 10000;
    if (argc > 1) {
        iterations = std::stoi(argv[1]);
    }

    py::scoped_interpreter guard{};

#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
    long rss_before = get_rss_kb();
    
    for (int i = 0; i < iterations; ++i) {
        {
            py::scoped_subinterpreter ssi;
        }
    }
    
    long rss_after = get_rss_kb();
    
    std::cout << "Iterations: " << iterations << "\n";
    std::cout << "RSS before: " << rss_before << " kB\n";
    std::cout << "RSS after:  " << rss_after << " kB\n";
    std::cout << "Increase:   " << (rss_after - rss_before) << " kB\n";
    std::cout << "Per iter:   " << (double)(rss_after - rss_before) / iterations << " kB\n";
#else
    std::cout << "Subinterpreter support: NO\n";
#endif

    return 0;
}

Pure C API Test - Simple (measure_leak_pure_c.c)

// Pure C API subinterpreter leak measurement (no pybind11)
// Uses simple Py_NewInterpreter/Py_EndInterpreter
//
// Compile:
//   gcc -o measure_leak_pure_c measure_leak_pure_c.c \
//       -I<python>/include/python3.14 \
//       -L<python>/lib -Wl,-rpath,<python>/lib \
//       -lpython3.14 -lpthread -ldl -lutil
//
// Run:
//   ./measure_leak_pure_c 500

#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long get_rss_kb() {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return -1;
    
    char line[256];
    long rss = -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            rss = 0;
            for (char *p = line; *p; p++) {
                if (*p >= '0' && *p <= '9') {
                    rss = rss * 10 + (*p - '0');
                }
            }
            break;
        }
    }
    fclose(f);
    return rss;
}

int main(int argc, char *argv[]) {
    int iterations = 500;
    if (argc > 1) {
        iterations = atoi(argv[1]);
    }

    // Initialize main interpreter
    Py_Initialize();
    
    PyThreadState *main_tstate = PyThreadState_Get();
    
    long rss_before = get_rss_kb();
    
    for (int i = 0; i < iterations; i++) {
        // Create subinterpreter
        PyThreadState *sub_tstate = Py_NewInterpreter();
        if (!sub_tstate) {
            fprintf(stderr, "Failed to create subinterpreter at iteration %d\n", i);
            break;
        }
        
        // Destroy subinterpreter
        Py_EndInterpreter(sub_tstate);
        
        // Switch back to main
        PyThreadState_Swap(main_tstate);
    }
    
    long rss_after = get_rss_kb();
    
    printf("Iterations: %d\n", iterations);
    printf("RSS before: %ld kB\n", rss_before);
    printf("RSS after:  %ld kB\n", rss_after);
    printf("Increase:   %ld kB\n", rss_after - rss_before);
    printf("Per iter:   %.2f kB\n", (double)(rss_after - rss_before) / iterations);
    
    Py_Finalize();
    return 0;
}

Pure C API Test - With Config (measure_leak_pure_c_active.c)

This version uses Py_NewInterpreterFromConfig with settings matching pybind11's defaults, and runs a simple Python statement in each subinterpreter.

// Pure C API subinterpreter leak measurement - with PyInterpreterConfig
// More comparable to pybind11's scoped_subinterpreter
//
// Compile:
//   gcc -o measure_leak_pure_c_active measure_leak_pure_c_active.c \
//       -I<python>/include/python3.14 \
//       -L<python>/lib -Wl,-rpath,<python>/lib \
//       -lpython3.14 -lpthread -ldl -lutil
//
// Run:
//   ./measure_leak_pure_c_active 500

#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long get_rss_kb() {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return -1;
    
    char line[256];
    long rss = -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            rss = 0;
            for (char *p = line; *p; p++) {
                if (*p >= '0' && *p <= '9') {
                    rss = rss * 10 + (*p - '0');
                }
            }
            break;
        }
    }
    fclose(f);
    return rss;
}

int main(int argc, char *argv[]) {
    int iterations = 500;
    if (argc > 1) {
        iterations = atoi(argv[1]);
    }

    // Initialize main interpreter
    Py_Initialize();
    
    PyThreadState *main_tstate = PyThreadState_Get();
    
    long rss_before = get_rss_kb();
    
    for (int i = 0; i < iterations; i++) {
        // Create subinterpreter with config (like pybind11 does)
        PyInterpreterConfig cfg = {
            .use_main_obmalloc = 0,
            .allow_fork = 0,
            .allow_exec = 0,
            .allow_threads = 1,
            .allow_daemon_threads = 0,
            .check_multi_interp_extensions = 1,
            .gil = PyInterpreterConfig_OWN_GIL,
        };
        
        PyThreadState *sub_tstate;
        PyStatus status = Py_NewInterpreterFromConfig(&sub_tstate, &cfg);
        if (PyStatus_Exception(status)) {
            fprintf(stderr, "Failed to create subinterpreter at iteration %d\n", i);
            break;
        }
        
        // We're now on the subinterpreter - run a tiny bit of code
        PyRun_SimpleString("x = 1");
        
        // Destroy subinterpreter
        Py_EndInterpreter(sub_tstate);
        
        // Switch back to main
        PyThreadState_Swap(main_tstate);
    }
    
    long rss_after = get_rss_kb();
    
    printf("Iterations: %d\n", iterations);
    printf("RSS before: %ld kB\n", rss_before);
    printf("RSS after:  %ld kB\n", rss_after);
    printf("Increase:   %ld kB\n", rss_after - rss_before);
    printf("Per iter:   %.2f kB\n", (double)(rss_after - rss_before) / iterations);
    
    Py_Finalize();
    return 0;
}

Analysis

The ~400x difference between pure C API (~4 kB/iter) and pybind11 (~1.7 MB/iter) suggests the leak is in pybind11's subinterpreter infrastructure, not in Python itself.

Looking at pybind11/subinterpreter.h, each scoped_subinterpreter cycle does:

  1. subinterpreter::create() - Creates the subinterpreter and calls detail::get_internals() to initialize pybind11's per-interpreter state
  2. subinterpreter_scoped_activate - Creates a new PyThreadState, switches to it, and stores it in internals.tstate
  3. Destructor - Calls Py_EndInterpreter(), then destroy() on both internals_pp_manager and local_internals_pp_manager

The get_internals() call creates the internals struct which includes:

  • static_property_type (PyTypeObject)
  • default_metaclass (PyTypeObject)
  • instance_base (PyObject)
  • Various maps and containers

This PR fixes the leak of the PyTypeObjects (~3 kB), but the remaining ~1.7 MB leak suggests something else in the internals infrastructure is not being properly cleaned up.

Possible areas to investigate:

  • The various maps in internals (registered_types_cpp, registered_types_py, etc.)
  • The instance_base object
  • Thread-local storage structures
  • Something in local_internals

Conclusion

This PR is correct and should be merged - it fixes the leak it claims to fix. However, there's a much larger leak in pybind11's subinterpreter handling that warrants separate investigation.

Collaborator

@rwgk rwgk left a comment

@b-pass I'd say go ahead and merge this PR, it looks solid. (It'd be great if you could adopt the suggested expanded comment.)

I'll try to find out more about the "massive leak" as measured with the Cursor-generated code (#5958 (comment)). If you see anything suspicious in the measurements, please let me know.

@rwgk
Collaborator

rwgk commented Jan 18, 2026

@b-pass I asked Cursor (GPT-5.2 Codex Extra High) to run leak measurements for this PR, similar to what it did before under PR #5961. The test code and measurements are below.


Measurement: PR 5958 (75bd2925) with use_main_obmalloc=1

Repo state

  • Branch: b-pass→pytype-leaked-in-finalize
  • Commit: 75bd2925 (“Avoid recreating internals during type deallocation at shutdown.”)
  • Workspace: /wrk/forked/pybind11
  • Python: /wrk/cpython_installs/v3.14.2_df793163d58_default/bin/python3

Test program (C++ source)

Saved as /tmp/measure_subinterpreter_leak_use_main_obmalloc.cpp:

// Memory leak measurement using use_main_obmalloc=1

#include <pybind11/embed.h>

#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
#include <pybind11/subinterpreter.h>
#endif

#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

namespace py = pybind11;

static long get_rss_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {
            long rss = 0;
            for (char c : line) {
                if (c >= '0' && c <= '9') {
                    rss = rss * 10 + (c - '0');
                }
            }
            return rss;
        }
    }
    return -1;
}

int main(int argc, char *argv[]) {
    int iterations = 100;
    if (argc > 1) {
        iterations = std::stoi(argv[1]);
    }

    py::scoped_interpreter guard{};

#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
    PyInterpreterConfig cfg;
    std::memset(&cfg, 0, sizeof(cfg));
    cfg.allow_threads = 1;
    cfg.check_multi_interp_extensions = 1;
    cfg.gil = PyInterpreterConfig_OWN_GIL;
    cfg.use_main_obmalloc = 1;

    long rss_before = get_rss_kb();
    for (int i = 0; i < iterations; ++i) {
        {
            py::scoped_subinterpreter ssi(cfg);
        }
    }
    long rss_after = get_rss_kb();

    std::cout << "Iterations: " << iterations << "\n";
    std::cout << "RSS before: " << rss_before << " kB\n";
    std::cout << "RSS after:  " << rss_after << " kB\n";
    std::cout << "Increase:   " << (rss_after - rss_before) << " kB\n";
    std::cout << "Per iter:   " << (double)(rss_after - rss_before) / iterations << " kB\n";
#else
    std::cout << "Subinterpreter support: NO\n";
#endif

    return 0;
}

Build command

g++ -std=c++20 -O0 -g \
  -I/wrk/forked/pybind11/include \
  -I/wrk/cpython_installs/v3.14.2_df793163d58_default/include/python3.14 \
  -o /tmp/measure_leak_pybind11_use_main_obmalloc \
  /tmp/measure_subinterpreter_leak_use_main_obmalloc.cpp \
  -L/wrk/cpython_installs/v3.14.2_df793163d58_default/lib \
  -Wl,-rpath,/wrk/cpython_installs/v3.14.2_df793163d58_default/lib \
  -lpython3.14 -lpthread -ldl -lutil

Run command

/tmp/measure_leak_pybind11_use_main_obmalloc 100

Output (observed)

Iterations: 100
RSS before: 12232 kB
RSS after:  15052 kB
Increase:   2820 kB
Per iter:   28.2 kB

@rwgk
Collaborator

rwgk commented Jan 18, 2026

Follow-on measurement: revert 75bd2925 (everything else unchanged)

This is the same use_main_obmalloc=1 measurement as in the previous comment, but with commit 75bd292 reverted (i.e., measured at 75bd2925^ / 2848fd6).

Output (observed)

Iterations: 100
RSS before: 12232 kB
RSS after:  15224 kB
Increase:   2992 kB
Per iter:   29.92 kB

@rwgk
Collaborator

rwgk commented Jan 18, 2026

TL;DR (written by rwgk): It looks like use_main_obmalloc=1 vs default obmalloc is by far the biggest source of the leak. Full Cursor-generated details below.


Measurement: PR 5958 (75bd2925) with default use_main_obmalloc

Repo state

  • Branch: b-pass→pytype-leaked-in-finalize
  • Commit: 75bd2925 (“Avoid recreating internals during type deallocation at shutdown.”)
  • Workspace: /wrk/forked/pybind11
  • Python: /wrk/cpython_installs/v3.14.2_df793163d58_default/bin/python3

Test program (C++ source)

Saved as /tmp/measure_subinterpreter_leak_default_obmalloc.cpp:

// Memory leak measurement with default use_main_obmalloc (not set to 1)

#include <pybind11/embed.h>

#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
#include <pybind11/subinterpreter.h>
#endif

#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

namespace py = pybind11;

static long get_rss_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {
            long rss = 0;
            for (char c : line) {
                if (c >= '0' && c <= '9') {
                    rss = rss * 10 + (c - '0');
                }
            }
            return rss;
        }
    }
    return -1;
}

int main(int argc, char *argv[]) {
    int iterations = 100;
    if (argc > 1) {
        iterations = std::stoi(argv[1]);
    }

    py::scoped_interpreter guard{};

#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
    PyInterpreterConfig cfg;
    std::memset(&cfg, 0, sizeof(cfg));
    cfg.allow_threads = 1;
    cfg.check_multi_interp_extensions = 1;
    cfg.gil = PyInterpreterConfig_OWN_GIL;
    // Note: use_main_obmalloc left at default (0)

    long rss_before = get_rss_kb();
    for (int i = 0; i < iterations; ++i) {
        {
            py::scoped_subinterpreter ssi(cfg);
        }
    }
    long rss_after = get_rss_kb();

    std::cout << "Iterations: " << iterations << "\n";
    std::cout << "RSS before: " << rss_before << " kB\n";
    std::cout << "RSS after:  " << rss_after << " kB\n";
    std::cout << "Increase:   " << (rss_after - rss_before) << " kB\n";
    std::cout << "Per iter:   " << (double)(rss_after - rss_before) / iterations << " kB\n";
#else
    std::cout << "Subinterpreter support: NO\n";
#endif

    return 0;
}

Build command

g++ -std=c++20 -O0 -g \
  -I/wrk/forked/pybind11/include \
  -I/wrk/cpython_installs/v3.14.2_df793163d58_default/include/python3.14 \
  -o /tmp/measure_leak_pybind11_default_obmalloc \
  /tmp/measure_subinterpreter_leak_default_obmalloc.cpp \
  -L/wrk/cpython_installs/v3.14.2_df793163d58_default/lib \
  -Wl,-rpath,/wrk/cpython_installs/v3.14.2_df793163d58_default/lib \
  -lpython3.14 -lpthread -ldl -lutil

Run command

/tmp/measure_leak_pybind11_default_obmalloc 100

Output (observed)

Iterations: 100
RSS before: 12228 kB
RSS after:  183012 kB
Increase:   170784 kB
Per iter:   1707.84 kB

Collaborator

@rwgk rwgk left a comment

I believe this PR is about as much as we can/should do to reduce the leakage, even though the impact is small compared to the default obmalloc behavior.

My vote is to merge this PR asap.

I'll see if I can make a CPython reproducer to demonstrate the vast difference in obmalloc default vs use_main_obmalloc=1 behavior.

@b-pass
Collaborator Author

b-pass commented Jan 18, 2026

I've been using KDE heaptrack (with PYTHONMALLOC=malloc) to look at leaks. With the latest changes here, it seems that something is holding a ref to the default_metaclass member of internals ... but everything else appears to be cleaned up properly now. So I'll merge it.

@b-pass b-pass merged commit da6e071 into pybind:master Jan 18, 2026
87 checks passed
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Jan 18, 2026
@rwgk
Collaborator

rwgk commented Jan 18, 2026

@b-pass

I have interesting/surprising new observations I wanted to show here clearly, without going into interpretations quite yet.

Results (non-heaptrack)

All runs used the same binary built against the default CPython 3.14 branch install, and the same two runtime args:
use_main_obmalloc (0/1) and iterations.

use_main_obmalloc   iterations   RSS before   RSS after   Increase   Per iter
0                   10           9944 kB      11820 kB    1876 kB    187.60 kB
1                   10           9912 kB      12872 kB    2960 kB    296.00 kB
0                   100          9948 kB      11680 kB    1732 kB    17.32 kB
1                   100          9900 kB      12920 kB    3020 kB    30.20 kB
0                   1000         9948 kB      11680 kB    1732 kB    1.73 kB
1                   1000         9944 kB      12984 kB    3040 kB    3.04 kB

I also ran the same six measurements under heaptrack and saw very similar results (same overall trend and comparable per-iteration magnitudes). I omitted those numbers here for simplicity.


Environment

  • CPython source commit: 23e3c0ae867cca0130e441e776c9955b9027c510 (3.14 branch)
  • Install used: /wrk/cpython_installs/3.14_branch_23e3c0ae867_default
  • Compiler: gcc -O0 -g
  • Platform: Linux (RSS from /proc/self/status)

Reproducer source (/tmp/cpython_obmalloc_repro.c)

// CPython subinterpreter leak repro: use_main_obmalloc (runtime arg)
#include <Python.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long get_rss_kb(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) {
        return -1;
    }

    char line[256];
    long rss = -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            rss = 0;
            for (char *p = line; *p; p++) {
                if (*p >= '0' && *p <= '9') {
                    rss = rss * 10 + (*p - '0');
                }
            }
            break;
        }
    }
    fclose(f);
    return rss;
}

int main(int argc, char *argv[]) {
    assert(argc == 3);
    int use_main_obmalloc = atoi(argv[1]);
    int iterations = atoi(argv[2]);
    assert(use_main_obmalloc == 0 || use_main_obmalloc == 1);
    assert(iterations >= 0);

    Py_Initialize();
    PyThreadState *main_tstate = PyThreadState_Get();

    long rss_before = get_rss_kb();

    for (int i = 0; i < iterations; i++) {
        PyInterpreterConfig cfg;
        memset(&cfg, 0, sizeof(cfg));
        cfg.allow_fork = 0;
        cfg.allow_exec = 0;
        cfg.allow_threads = 1;
        cfg.allow_daemon_threads = 0;
        cfg.check_multi_interp_extensions = 1;
        cfg.gil = PyInterpreterConfig_OWN_GIL;
        cfg.use_main_obmalloc = use_main_obmalloc;

        PyThreadState *sub_tstate = NULL;
        PyStatus status = Py_NewInterpreterFromConfig(&sub_tstate, &cfg);
        if (PyStatus_Exception(status)) {
            fprintf(stderr, "Failed to create subinterpreter at iteration %d\n", i);
            break;
        }

        // Execute a tiny statement to exercise allocator.
        PyRun_SimpleString("x = 1");

        Py_EndInterpreter(sub_tstate);
        PyThreadState_Swap(main_tstate);
    }

    long rss_after = get_rss_kb();

    printf("Iterations: %d\n", iterations);
    printf("RSS before: %ld kB\n", rss_before);
    printf("RSS after:  %ld kB\n", rss_after);
    printf("Increase:   %ld kB\n", rss_after - rss_before);
    printf("Per iter:   %.2f kB\n", (double)(rss_after - rss_before) / iterations);

    Py_Finalize();
    return 0;
}

Build command (default build)

gcc -O0 -g -o /tmp/cpython_obmalloc_repro_default \
  /tmp/cpython_obmalloc_repro.c \
  -I/wrk/cpython_installs/3.14_branch_23e3c0ae867_default/include/python3.14 \
  -L/wrk/cpython_installs/3.14_branch_23e3c0ae867_default/lib \
  -Wl,-rpath,/wrk/cpython_installs/3.14_branch_23e3c0ae867_default/lib \
  -lpython3.14 -lpthread -ldl -lutil

Run commands (default build)

/tmp/cpython_obmalloc_repro_default 0 10
/tmp/cpython_obmalloc_repro_default 1 10
/tmp/cpython_obmalloc_repro_default 0 100
/tmp/cpython_obmalloc_repro_default 1 100
/tmp/cpython_obmalloc_repro_default 0 1000
/tmp/cpython_obmalloc_repro_default 1 1000

@rwgk
Collaborator

rwgk commented Jan 18, 2026

Similar to the previous comment, but with a freethreaded build of the exact same CPython sources

I don't want to explore this dimension (default vs freethreaded) at the moment, but just report the measurements for potentially future reference.

Results (non-heaptrack, freethreaded)

All runs used the same binary built against the freethreaded CPython 3.14 branch install, and the same two runtime args:
use_main_obmalloc (0/1) and iterations.

use_main_obmalloc  iterations  RSS before  RSS after   Increase    Per iter
0                  10          11508 kB    44724 kB    33216 kB    3321.60 kB
1                  10          11472 kB    39524 kB    28052 kB    2805.20 kB
0                  100         11492 kB    339560 kB   328068 kB   3280.68 kB
1                  100         11504 kB    286168 kB   274664 kB   2746.64 kB
0                  1000        11496 kB    3288020 kB  3276524 kB  3276.52 kB
1                  1000        11496 kB    2752216 kB  2740720 kB  2740.72 kB

Environment

  • CPython source commit: 23e3c0ae867cca0130e441e776c9955b9027c510 (3.14 branch)
  • Install used: /wrk/cpython_installs/3.14_branch_23e3c0ae867_freethreaded
  • Compiler: gcc -O0 -g
  • Platform: Linux (RSS from /proc/self/status)

Reproducer source

Same /tmp/cpython_obmalloc_repro.c as in the default-build measurements (no changes).


Build command (freethreaded)

gcc -O0 -g -o /tmp/cpython_obmalloc_repro_freethreaded \
  /tmp/cpython_obmalloc_repro.c \
  -I/wrk/cpython_installs/3.14_branch_23e3c0ae867_freethreaded/include/python3.14t \
  -L/wrk/cpython_installs/3.14_branch_23e3c0ae867_freethreaded/lib \
  -Wl,-rpath,/wrk/cpython_installs/3.14_branch_23e3c0ae867_freethreaded/lib \
  -lpython3.14t -lpthread -ldl -lutil

Run commands (freethreaded)

/tmp/cpython_obmalloc_repro_freethreaded 0 10
/tmp/cpython_obmalloc_repro_freethreaded 1 10
/tmp/cpython_obmalloc_repro_freethreaded 0 100
/tmp/cpython_obmalloc_repro_freethreaded 1 100
/tmp/cpython_obmalloc_repro_freethreaded 0 1000
/tmp/cpython_obmalloc_repro_freethreaded 1 1000

@rwgk
Collaborator

rwgk commented Jan 18, 2026

pybind11 subinterpreter RSS measurements (default build)

Rationale

We now have a CPython repro (cpython_obmalloc_repro.c) here that shows how RSS growth scales with
use_main_obmalloc and iteration count. To isolate pybind11-specific effects, I mirrored
that workflow using pybind11::scoped_subinterpreter with the same two runtime args and the same
tiny per‑interpreter workload. This keeps the workload aligned while exercising pybind11’s
subinterpreter path.


Results (non-heaptrack)

All runs used the same binary built against the default CPython 3.14 branch install, and the same two runtime args:
use_main_obmalloc (0/1) and iterations.

use_main_obmalloc  iterations  RSS before  RSS after   Increase    Per iter
0                  10          12296 kB    30484 kB    18188 kB    1818.8 kB
1                  10          12304 kB    14940 kB    2636 kB     263.6 kB
0                  100         12208 kB    183232 kB   171024 kB   1710.2 kB
1                  100         12296 kB    15140 kB    2844 kB     28.4 kB
0                  1000        12304 kB    1711404 kB  1699100 kB  1699.1 kB
1                  1000        12268 kB    16756 kB    4488 kB     4.5 kB

Reproducer source (/tmp/pybind11_obmalloc_repro.cpp)

// pybind11 subinterpreter RSS repro: use_main_obmalloc (runtime arg)
#include <pybind11/embed.h>

#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
#include <pybind11/subinterpreter.h>
#endif

#include <cassert>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

namespace py = pybind11;

static long get_rss_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {
            long rss = 0;
            for (char c : line) {
                if (c >= '0' && c <= '9') {
                    rss = rss * 10 + (c - '0');
                }
            }
            return rss;
        }
    }
    return -1;
}

int main(int argc, char *argv[]) {
    assert(argc == 3);
    int use_main_obmalloc = std::stoi(argv[1]);
    int iterations = std::stoi(argv[2]);
    assert(use_main_obmalloc == 0 || use_main_obmalloc == 1);
    assert(iterations >= 0);

    py::scoped_interpreter guard{};

#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
    PyInterpreterConfig cfg;
    std::memset(&cfg, 0, sizeof(cfg));
    cfg.allow_fork = 0;
    cfg.allow_exec = 0;
    cfg.allow_threads = 1;
    cfg.allow_daemon_threads = 0;
    cfg.check_multi_interp_extensions = 1;
    cfg.gil = PyInterpreterConfig_OWN_GIL;
    cfg.use_main_obmalloc = use_main_obmalloc;

    long rss_before = get_rss_kb();

    for (int i = 0; i < iterations; ++i) {
        py::scoped_subinterpreter ssi(cfg);
        PyRun_SimpleString("x = 1");
    }

    long rss_after = get_rss_kb();

    std::cout << "Iterations: " << iterations << "\n";
    std::cout << "RSS before: " << rss_before << " kB\n";
    std::cout << "RSS after:  " << rss_after << " kB\n";
    std::cout << "Increase:   " << (rss_after - rss_before) << " kB\n";
    std::cout << "Per iter:   " << (double)(rss_after - rss_before) / iterations << " kB\n";
#else
    std::cout << "Subinterpreter support: NO\n";
#endif

    return 0;
}

Environment

  • pybind11: current master in /wrk/forked/pybind11 (da6e071)
  • CPython source commit: 23e3c0ae867cca0130e441e776c9955b9027c510 (3.14 branch)
  • Install used: /wrk/cpython_installs/3.14_branch_23e3c0ae867_default
  • Compiler: g++ -std=c++20 -O0 -g
  • Platform: Linux (RSS from /proc/self/status)

Build command

g++ -std=c++20 -O0 -g -o /tmp/pybind11_obmalloc_repro \
  /tmp/pybind11_obmalloc_repro.cpp \
  -I/wrk/forked/pybind11/include \
  -I/wrk/cpython_installs/3.14_branch_23e3c0ae867_default/include/python3.14 \
  -L/wrk/cpython_installs/3.14_branch_23e3c0ae867_default/lib \
  -Wl,-rpath,/wrk/cpython_installs/3.14_branch_23e3c0ae867_default/lib \
  -lpython3.14 -lpthread -ldl -lutil

Run commands

/tmp/pybind11_obmalloc_repro 0 10
/tmp/pybind11_obmalloc_repro 1 10
/tmp/pybind11_obmalloc_repro 0 100
/tmp/pybind11_obmalloc_repro 1 100
/tmp/pybind11_obmalloc_repro 0 1000
/tmp/pybind11_obmalloc_repro 1 1000

@rwgk
Collaborator

rwgk commented Jan 18, 2026

pybind11 subinterpreter RSS measurements with PYTHONMALLOC=malloc

Purpose: sanity check whether the large RSS growth shown before is tied to pymalloc/obmalloc behavior by forcing CPython to use the system allocator.

Environment

  • pybind11: current master in /wrk/forked/pybind11 (da6e071)
  • CPython source commit: 23e3c0ae867cca0130e441e776c9955b9027c510 (3.14 branch)
  • Install used: /wrk/cpython_installs/3.14_branch_23e3c0ae867_default
  • Compiler: g++ -std=c++20 -O0 -g
  • Platform: Linux (RSS from /proc/self/status)
  • Reproducer: /tmp/pybind11_obmalloc_repro.cpp (exactly as before)
  • Build command: exactly as before

Run commands

PYTHONMALLOC=malloc /tmp/pybind11_obmalloc_repro 0 1000
PYTHONMALLOC=malloc /tmp/pybind11_obmalloc_repro 1 1000

Results (non-heaptrack)

use_main_obmalloc  iterations  RSS before  RSS after  Increase  Per iter
0                  1000        12288 kB    16964 kB   4676 kB   4.68 kB
1                  1000        12320 kB    16660 kB   4340 kB   4.34 kB

Note

With PYTHONMALLOC=malloc, the two modes converge to ~4–5 kB/iter, which is much closer to the pure‑C measurements.

@rwgk
Collaborator

rwgk commented Jan 18, 2026

FYI: I just gave the prompt below to Cursor (still using GPT-5.2 Codex Extra High). Let's see what it can do.


Now my mind is changed again, back to: there is probably a leaky path in cpython

How can we demonstrate that in a pure C reproducer?

Could you please try hard to work out a reproducer, conceptually converging cpython_obmalloc_repro.cpp and pybind11_obmalloc_repro.cpp, by reducing the latter and expanding the cpython_obmalloc_repro.cpp?

Always run with 1000 iterations, to be sure we're not getting misled by noise.

I'm guessing the super-safe approach would be to run four combinations for each new reproducer version: (PYTHONMALLOC=malloc no, yes) x (use_main_obmalloc 0, 1)

But probably, it's reasonable to only do that at certain critical decision points (or when backtracking), and to mostly work with just "PYTHONMALLOC=malloc no" and use_main_obmalloc=0

Could you please try to handle this as independently as possible, until you have a pure cpython reproducer? Each time you stop, could you please write ~/progress_$(date "+%Y-%m-%d+%H%M%S").md for me to read?

@rwgk
Collaborator

rwgk commented Jan 19, 2026

The prompt reported before was not very successful. Cursor was trying to expand cpython_obmalloc_repro.c, i.e. it had to guess what might trigger the leakage. When it stopped (pretty quickly) I decided to redirect it with the following prompt. Cursor worked on this for many iterations, but did not need any further directions, only confirmations to proceed. I'll post the pure C reproducer in the next comment.


I believe the most certain path to success will be to reduce, rather than to guess. I.e., to start from pybind11_obmalloc_repro.cpp

Conceptually:

  • remove everything from pybind11 that's not needed for pybind11_obmalloc_repro.cpp
  • replace pybind11-based client code with Python C API code

Expected intermediate result: a large pybind11_obmalloc_repro.cpp that no longer needs pybind11, but still needs C++

Continue by looking for these opportunities at each step:

  • reduce use of C++ features (replace with C)
  • prune the reproducer in general

Theoretically this must be possible, although taken to the extreme it'd be just inlining stdc++ library code and pybind11 code into the reproducer. However, I'd expect that a lot of pruning will be possible on the go. The end-game will be pruning-pruning-pruning going hand-in-hand with emulating C++ features in C. If I did this manually, I'd keep an eye out for safe shortcuts, although in my experience often shortcut guesses don't work out, and it's best to backtrack quickly to the systematic inline-and-prune strategy.

Could you please try this strategy?


@rwgk
Collaborator

rwgk commented Jan 19, 2026

pybind11_obmalloc_repro_like_inlined.c (pure-C reproducer)

This reproducer is a pure C reduction of the pybind11_obmalloc_repro.cpp posted in a previous comment.

It keeps the same high-level loop and subinterpreter config, but replaces all pybind11 usage with Python C-API calls and inlines only the minimal pybind11 internals needed to preserve the behavior.

How this was produced (from pybind11_obmalloc_repro.cpp)

  1. Start from the original pybind11 repro

    • Loop over iterations and call py::scoped_subinterpreter(cfg) with
      cfg.use_main_obmalloc = {0,1}, then run a tiny workload
      (PyRun_SimpleString("x = 1")).
  2. Replace pybind11 client code with raw C API

    • Directly use Py_NewInterpreterFromConfig, Py_EndInterpreter,
      PyThreadState_New, PyThreadState_Swap, and explicit thread-state
      management around interpreter creation/use/destruction.
  3. Inline the minimal pybind11 internals

    • Implement a tiny internals struct and a get_internals() function.
    • Store an internals* pointer-to-pointer in the interpreter state dict
      via a PyCapsule (mirrors pybind11’s internals_pp_manager behavior).
    • Create the same two heap types that pybind11 creates:
      • pybind11_type (metaclass created via PyType_Type.tp_alloc)
      • pybind11_object (base type created via metaclass->tp_alloc)
    • The key lifecycle detail preserved: call destroy_internals_pp() after
      Py_EndInterpreter (this is the combination that keeps the large RSS
      growth when use_main_obmalloc=0).
  4. Remove pybind11 headers and C++ features

    • No pybind11 headers or C++ stdlib are required.
    • The file compiles as plain C (gcc -x c ...), with only Python headers
      and libc.

Results (1000 iterations)

Measured using pybind11_obmalloc_repro_like_inlined_build_and_run.sh
(gcc -x c -O0 -g ...), on the default CPython 3.14 build at commit
23e3c0ae867 (same installation used in previous measurements).

PYTHONMALLOC  use_main_obmalloc  RSS before (kB)  RSS after (kB)  Increase (kB)  Per iter (kB)
default       0                  9920             1705580         1695660        1695.66
default       1                  9956             14184           4228           4.23
malloc        0                  9972             14492           4520           4.52
malloc        1                  9936             14140           4204           4.20

pybind11_obmalloc_repro_like_inlined.c

pybind11_obmalloc_repro_like_inlined_build_and_run.sh

@rwgk
Collaborator

rwgk commented Jan 19, 2026

@b-pass, could you please glance through the reproducer attached to my previous comment (pybind11_obmalloc_repro_like_inlined.c)? Could you please let me know any suggestions? I want to create a cpython issue for it.

@rwgk
Collaborator

rwgk commented Jan 19, 2026

User prompt:

"I think by chance I saw a message about the pybind11 metaclass while you worked on the reduction. IIUC it was something to the effect that using the default metaclass reduced the leakage to the 4kB range, did I catch that correctly?"

Assistant response:

"Short answer: close, but not 4 kB.

When I switched make_default_metaclass() to return the default metaclass (PyType_Type, i.e., no heap-allocated metaclass), the leak collapsed from ~1.7 MB/iter to ~1.7 kB/iter (baseline-ish), even with instance_base creation still enabled. Restoring the heap metaclass brought the ~1.7 MB/iter leak back.

So the “metaclass switch” effect is real, but the observed drop was to ~1.7 kB/iter, not ~4 kB/iter. The ~4 kB/iter numbers are from use_main_obmalloc=1 / PYTHONMALLOC=malloc."


using instance_map = std::unordered_multimap<const void *, instance *>;

inline bool is_interpreter_alive() {
Collaborator

I am sad to report that (AFAICT) this is still not quite correct.

  • Py_IsInitialized and Py_IsFinalizing check whether the whole runtime is initialized/finalizing, not whether a specific interpreter is. It can be invalid to DECREF an object from a particular interpreter even though the runtime is still active.
  • Py_IsFinalizing remains true even after Py_Finalize returns. It does not reset to false unless Py_Initialize is later called again.

I think the correct solution is an internals::leak_detach() method which will set the relevant members to NULL, without decref'ing them, so that a later destructor invocation won't call into the CPython API. Then internals_pp_manager::destroy() can call that method if the internals_pp still contains a valid pointer, before destroying it.


Labels

needs changelog Possibly needs a changelog entry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: Simply creating subinterpreter leaks memory

5 participants