MDEV-21423 - lock-free trx_sys get performance regression cause by lf_find and ut_delay#5043
MDEV-21423 - lock-free trx_sys get performance regression cause by lf_find and ut_delay#5043svoj wants to merge 1 commit intoMariaDB:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new rw_trx_ids_t class to manage read-write transaction IDs and serialization numbers, moving them from the hash elements into a centralized vector within the transaction system. This involves significant updates to transaction registration, deregistration, and snapshotting logic across InnoDB. Feedback highlights a critical thread-safety issue where the ids vector is accessed without a read lock during potential reallocations, which could lead to memory corruption. Additionally, the use of memset to initialize a synchronization primitive was identified as unsafe and should be removed in favor of the standard initialization call.
…_find and ut_delay TBD
| { | ||
| ut_ad(trx->rw_trx_ids_slot != std::numeric_limits<uint32_t>::max()); | ||
| ut_ad(trx->id < trx->no); | ||
| latch.rd_lock(SRW_LOCK_CALL); |
There was a problem hiding this comment.
Can we avoid adding PERFORMANCE_SCHEMA instrumentation for this? That would remove some conditional branches in frequently executed code. It could make a measurable performance impact.
#4784 is normally using the type srw_lock_low for log_sys.latch. Given that this latch is private, I don’t think there is a need to use a different type in debug builds.
| trx_id_t snapshot_ids(trx_ids_t &view_ids, | ||
| const trx_id_t max_trx_id) const noexcept | ||
| { | ||
| trx_id_t min_trx_no{max_trx_id}; | ||
| view_ids.clear(); | ||
| latch.rd_lock(SRW_LOCK_CALL); | ||
| view_ids.reserve(ids.size()); |
There was a problem hiding this comment.
std::vector::reserve can throw std::bad_alloc. This would not be the only place where we’d happily crash when running out of memory. I’m not familiar with what would happen when a noexcept function fails to handle an exception. I guess we’d just crash. But would the diagnostics for that be acceptable? Could you test this with an artificially bloated allocation and a tight ulimit -m?
The parameters and the return value are not documented with Doxygen comments.
| void deregister_rw(const trx_t *trx) noexcept | ||
| { | ||
| ut_ad(trx->rw_trx_ids_slot != std::numeric_limits<uint32_t>::max()); | ||
| latch.rd_lock(SRW_LOCK_CALL); | ||
| ut_ad(ids[trx->rw_trx_ids_slot].id != TRX_ID_MAX); | ||
| ut_ad(ids[trx->rw_trx_ids_slot].id == trx->id); | ||
| ut_ad(ids[trx->rw_trx_ids_slot].no == trx->no); | ||
| ids[trx->rw_trx_ids_slot].id= TRX_ID_MAX; | ||
| ids[trx->rw_trx_ids_slot].no= TRX_ID_MAX; | ||
| latch.rd_unlock(); | ||
| } |
There was a problem hiding this comment.
When I reviewed the code that GCC 16 -O2 generated for an earlier version of this patch, the address calculation for the assignments was being duplicated. I would suggest to rewrite this as follows:
void deregister_rw(const trx_t *trx) noexcept
{
latch.rd_lock();
ut_ad(trx->rw_trx_ids_slot < ids.size());
rw_trx_id &trx_ids= ids[trx->rw_trx_ids_slot];
ut_ad(trx_ids.id == trx->id);
ut_ad(trx_ids.no == trx->no);
trx_ids.id= TRX_ID_MAX;
trx_ids.no= TRX_ID_MAX;
latch.rd_unlock();
}| struct rw_trx_id | ||
| { | ||
| Atomic_relaxed<trx_id_t> id{TRX_ID_MAX}; | ||
| Atomic_relaxed<trx_id_t> no{TRX_ID_MAX}; | ||
| trx_t *trx; | ||
| rw_trx_id(trx_t *t): trx(t) {} | ||
| }; |
There was a problem hiding this comment.
The data members need to be documented with Doxygen comments. Do we strictly need all 3 fields, or are some of them a cache that allows us to avoid dereferencing the trx? The comment should explain when trx is dereferenced.
There was a problem hiding this comment.
trx is needed for deregister_trx() to efficiently update trx->rw_trx_ids_slot of moved transaction, that's it. I will document it separately.
| void register_rw(const trx_t *trx) noexcept | ||
| { | ||
| ut_ad(trx->rw_trx_ids_slot != std::numeric_limits<uint32_t>::max()); | ||
| ut_ad(trx->no == TRX_ID_MAX); | ||
| latch.rd_lock(SRW_LOCK_CALL); | ||
| ut_ad(ids[trx->rw_trx_ids_slot].id == TRX_ID_MAX); | ||
| ut_ad(ids[trx->rw_trx_ids_slot].no == TRX_ID_MAX); | ||
| ids[trx->rw_trx_ids_slot].id= trx->id; |
There was a problem hiding this comment.
Could we replace the first assertion with a tighter one, inside the critical section?
ut_ad(trx->rw_trx_ids_slot < ids.size());Can we also assert the following inside the critical section?
ut_ad(&ids[trx->rw_trx_ids_slot] == trx);| void assign_new_trx_no(trx_t *trx) | ||
| { | ||
| trx->rw_trx_hash_element->no= get_new_trx_id_no_refresh(); | ||
| trx->no= get_new_trx_id_no_refresh(); | ||
| rw_trx_ids.assign_new_trx_no(trx); | ||
| refresh_rw_trx_hash_version(); | ||
| } |
There was a problem hiding this comment.
As far as I can tell, the field trx_t::no is duplicating a field in an element of rw_trx_ids.ids. We can just pass the data directly by value, not via a redundant data member of trx:
rw_trx_ids.assign_new_trx_no(trx, get_new_trx_id_no_refresh());In addition to this, trx_t::write_serialisation_history() would have to be modified so that it’d pass the commit number to trx_purge_add_undo_to_history() by value.
| while ((arg.m_id= get_rw_trx_hash_version()) != get_max_trx_id()) | ||
| while ((max_trx_id= get_rw_trx_hash_version()) != get_max_trx_id()) | ||
| ut_delay(1); |
There was a problem hiding this comment.
ut_delay() is always being invoked with the parameter 1. Could we merge it with LF_BACKOFF()? For some reason, the latter is missing HMT_low() and HMT_medium(). The impact of this would be good to test on POWER.
There was a problem hiding this comment.
LF_BACKOFF() and ut_delay() are a mix of a warm and a soft. I'd say ut_delay() is more correct on POWER, because MY_RELAX_CPU() on POWER is a regular __builtin_ppc_get_timebase() instruction, which isn't anywhere similar to Intel PAUSE.
A good POWER implementation would probably make use of Hardware Transactional Memory (HTM) surrounded with HMT_low() / HMT_medium().
Less good, but still alright would be:
HMT_low();
while ((max_trx_id= get_rw_trx_hash_version()) != get_max_trx_id()) /* no-op */;
HMT_medium();
Or even:
if ((max_trx_id= get_rw_trx_hash_version()) != get_max_trx_id())
{
HMT_low();
while ((max_trx_id= get_rw_trx_hash_version()) != get_max_trx_id()) /* no-op */;
HMT_medium();
}
It is evident that both LF_BACKOFF() and ut_delay() were not properly tested on POWER. I don't like performing another blind attempt within the scope of this task.
| srw_spin_mutex mutex; | ||
| mutable srw_spin_mutex mutex; | ||
| #ifdef UNIV_DEBUG | ||
| /** The owner of mutex (0 if none); protected by mutex */ | ||
| std::atomic<pthread_t> mutex_owner{0}; | ||
| mutable std::atomic<pthread_t> mutex_owner{0}; | ||
| #endif /* UNIV_DEBUG */ | ||
| public: | ||
| void mutex_init() { mutex.init(); } | ||
| void mutex_destroy() { mutex.destroy(); } | ||
|
|
||
| /** Acquire the mutex */ | ||
| void mutex_lock() | ||
| void mutex_lock() const | ||
| { | ||
| ut_ad(!mutex_is_owner()); | ||
| mutex.wr_lock(); | ||
| assert(!mutex_owner.exchange(pthread_self(), | ||
| std::memory_order_relaxed)); | ||
| } | ||
| /** Release the mutex */ | ||
| void mutex_unlock() | ||
| void mutex_unlock() const |
There was a problem hiding this comment.
These trx_t member functions are no longer being used by the patch. Hence, these modifications could be removed, right?
TBD