Skip to content

MDEV-21423 - lock-free trx_sys get performance regression cause by lf_find and ut_delay#5043

Open
svoj wants to merge 1 commit intoMariaDB:mainfrom
svoj:pr-main-MDEV-21423
Open

MDEV-21423 - lock-free trx_sys get performance regression cause by lf_find and ut_delay#5043
svoj wants to merge 1 commit intoMariaDB:mainfrom
svoj:pr-main-MDEV-21423

Conversation

@svoj
Copy link
Copy Markdown
Contributor

@svoj svoj commented May 5, 2026

TBD

@svoj svoj requested a review from dr-m May 5, 2026 22:10
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new rw_trx_ids_t class to manage read-write transaction IDs and serialization numbers, moving them from the hash elements into a centralized vector within the transaction system. This involves significant updates to transaction registration, deregistration, and snapshotting logic across InnoDB. Feedback highlights a critical thread-safety issue where the ids vector is accessed without a read lock during potential reallocations, which could lead to memory corruption. Additionally, the use of memset to initialize a synchronization primitive was identified as unsafe and should be removed in favor of the standard initialization call.

Comment thread storage/innobase/include/trx0sys.h
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/log/log0log.cc Outdated
@svoj svoj force-pushed the pr-main-MDEV-21423 branch from 3b998f0 to 6d2f280 Compare May 6, 2026 07:11
{
ut_ad(trx->rw_trx_ids_slot != std::numeric_limits<uint32_t>::max());
ut_ad(trx->id < trx->no);
latch.rd_lock(SRW_LOCK_CALL);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid adding PERFORMANCE_SCHEMA instrumentation for this? That would remove some conditional branches in frequently executed code. It could make a measurable performance impact.

#4784 is normally using the type srw_lock_low for log_sys.latch. Given that this latch is private, I don’t think there is a need to use a different type in debug builds.

Comment on lines +866 to +872
trx_id_t snapshot_ids(trx_ids_t &view_ids,
const trx_id_t max_trx_id) const noexcept
{
trx_id_t min_trx_no{max_trx_id};
view_ids.clear();
latch.rd_lock(SRW_LOCK_CALL);
view_ids.reserve(ids.size());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::vector::reserve can throw std::bad_alloc. This would not be the only place where we’d happily crash when running out of memory. I’m not familiar with what would happen when a noexcept function fails to handle an exception. I guess we’d just crash. But would the diagnostics for that be acceptable? Could you test this with an artificially bloated allocation and a tight ulimit -m?

The parameters and the return value are not documented with Doxygen comments.

Comment on lines +897 to +907
void deregister_rw(const trx_t *trx) noexcept
{
ut_ad(trx->rw_trx_ids_slot != std::numeric_limits<uint32_t>::max());
latch.rd_lock(SRW_LOCK_CALL);
ut_ad(ids[trx->rw_trx_ids_slot].id != TRX_ID_MAX);
ut_ad(ids[trx->rw_trx_ids_slot].id == trx->id);
ut_ad(ids[trx->rw_trx_ids_slot].no == trx->no);
ids[trx->rw_trx_ids_slot].id= TRX_ID_MAX;
ids[trx->rw_trx_ids_slot].no= TRX_ID_MAX;
latch.rd_unlock();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I reviewed the code that GCC 16 -O2 generated for an earlier version of this patch, the address calculation for the assignments was being duplicated. I would suggest to rewrite this as follows:

  void deregister_rw(const trx_t *trx) noexcept
  {
    latch.rd_lock();
    ut_ad(trx->rw_trx_ids_slot < ids.size());
    rw_trx_id &trx_ids= ids[trx->rw_trx_ids_slot];
    ut_ad(trx_ids.id == trx->id);
    ut_ad(trx_ids.no == trx->no);
    trx_ids.id= TRX_ID_MAX;
    trx_ids.no= TRX_ID_MAX;
    latch.rd_unlock();
  }

Comment on lines +844 to +850
struct rw_trx_id
{
Atomic_relaxed<trx_id_t> id{TRX_ID_MAX};
Atomic_relaxed<trx_id_t> no{TRX_ID_MAX};
trx_t *trx;
rw_trx_id(trx_t *t): trx(t) {}
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data members need to be documented with Doxygen comments. Do we strictly need all 3 fields, or are some of them a cache that allows us to avoid dereferencing the trx? The comment should explain when trx is dereferenced.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trx is needed for deregister_trx() to efficiently update trx->rw_trx_ids_slot of moved transaction, that's it. I will document it separately.

Comment on lines +887 to +894
void register_rw(const trx_t *trx) noexcept
{
ut_ad(trx->rw_trx_ids_slot != std::numeric_limits<uint32_t>::max());
ut_ad(trx->no == TRX_ID_MAX);
latch.rd_lock(SRW_LOCK_CALL);
ut_ad(ids[trx->rw_trx_ids_slot].id == TRX_ID_MAX);
ut_ad(ids[trx->rw_trx_ids_slot].no == TRX_ID_MAX);
ids[trx->rw_trx_ids_slot].id= trx->id;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we replace the first assertion with a tighter one, inside the critical section?

ut_ad(trx->rw_trx_ids_slot < ids.size());

Can we also assert the following inside the critical section?

ut_ad(&ids[trx->rw_trx_ids_slot] == trx);

Comment on lines 1152 to 1157
void assign_new_trx_no(trx_t *trx)
{
trx->rw_trx_hash_element->no= get_new_trx_id_no_refresh();
trx->no= get_new_trx_id_no_refresh();
rw_trx_ids.assign_new_trx_no(trx);
refresh_rw_trx_hash_version();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, the field trx_t::no is duplicating a field in an element of rw_trx_ids.ids. We can just pass the data directly by value, not via a redundant data member of trx:

rw_trx_ids.assign_new_trx_no(trx, get_new_trx_id_no_refresh());

In addition to this, trx_t::write_serialisation_history() would have to be modified so that it’d pass the commit number to trx_purge_add_undo_to_history() by value.

Comment on lines -1092 to 1182
while ((arg.m_id= get_rw_trx_hash_version()) != get_max_trx_id())
while ((max_trx_id= get_rw_trx_hash_version()) != get_max_trx_id())
ut_delay(1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ut_delay() is always being invoked with the parameter 1. Could we merge it with LF_BACKOFF()? For some reason, the latter is missing HMT_low() and HMT_medium(). The impact of this would be good to test on POWER.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LF_BACKOFF() and ut_delay() are a mix of a warm and a soft. I'd say ut_delay() is more correct on POWER, because MY_RELAX_CPU() on POWER is a regular __builtin_ppc_get_timebase() instruction, which isn't anywhere similar to Intel PAUSE.

A good POWER implementation would probably make use of Hardware Transactional Memory (HTM) surrounded with HMT_low() / HMT_medium().

Less good, but still alright would be:

HMT_low();
while ((max_trx_id= get_rw_trx_hash_version()) != get_max_trx_id()) /* no-op */;
HMT_medium();

Or even:

if ((max_trx_id= get_rw_trx_hash_version()) != get_max_trx_id())
{
  HMT_low();
  while ((max_trx_id= get_rw_trx_hash_version()) != get_max_trx_id()) /* no-op */;
  HMT_medium();
}

It is evident that both LF_BACKOFF() and ut_delay() were not properly tested on POWER. I don't like performing another blind attempt within the scope of this task.

Comment on lines -662 to +683
srw_spin_mutex mutex;
mutable srw_spin_mutex mutex;
#ifdef UNIV_DEBUG
/** The owner of mutex (0 if none); protected by mutex */
std::atomic<pthread_t> mutex_owner{0};
mutable std::atomic<pthread_t> mutex_owner{0};
#endif /* UNIV_DEBUG */
public:
void mutex_init() { mutex.init(); }
void mutex_destroy() { mutex.destroy(); }

/** Acquire the mutex */
void mutex_lock()
void mutex_lock() const
{
ut_ad(!mutex_is_owner());
mutex.wr_lock();
assert(!mutex_owner.exchange(pthread_self(),
std::memory_order_relaxed));
}
/** Release the mutex */
void mutex_unlock()
void mutex_unlock() const
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These trx_t member functions are no longer being used by the patch. Hence, these modifications could be removed, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants