DAOS-18633 rebuild: global barrier for ec agg#17861
DAOS-18633 rebuild: global barrier for ec agg#17861wangshilong wants to merge 5 commits intomasterfrom
Conversation
|
Errors are Unable to load ticket data |
Pause EC agg globally before triggering rebuild and set rebuild fence after we confirmed all ec agg has been paused. After rebuild, we don't need trigger full scan for ec agg any more, because we gurantee there is no ec aggregation after rebuild fence. Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
1be479b to
af424f3
Compare
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
| } | ||
|
|
||
| if (!dpc->spc_no_storage) { | ||
| dpc->spc_ec_agg_pause_gate = d_hlc_get(); |
There was a problem hiding this comment.
Instead of acquiring a new HLC, can't we use the 'rebuild version' (pool version) to identify this rebuild?
There was a problem hiding this comment.
rebuild_version is not safe, as retied might reuse same rebuild_version.
| dpc->spc_ec_agg_pause_gate = d_hlc_get(); | ||
| D_INFO(DF_RB ": set local EC agg pause gate token " DF_U64 "\n", DP_RB_RPT(rpt), | ||
| dpc->spc_ec_agg_pause_gate); | ||
| ds_cont_child_wait_ec_agg_pause(dpc, rebuild_wait_ec_pause); |
There was a problem hiding this comment.
This could lead to RPC timeout.
See agg_rate_ctl(), it should check the 'spc_ec_agg_pause_gate' as well? BTW, "rebuild_ver" looks a better name.
There was a problem hiding this comment.
Problem is we don't guarantee rebuild_ver always bump for retrying rebuild. I don't want to change that Semantics. Or I bump it locally to ensure it increase every time.
There was a problem hiding this comment.
adding spc_ec_agg_pause_gate check to ds_pool_is_rebuilding() can make it affect agg_rate_ctl().
| /* phase-2: broadcast scan RPC to all targets */ | ||
| D_INFO(DF_RB ": start rebuild scan\n", DP_RB_RGT(*p_rgt)); | ||
| rc = rebuild_scan_broadcast(pool, *p_rgt, &task->dst_tgts, task->dst_new_layout_version, | ||
| task->dst_rebuild_op, RB_SCAN_START); |
There was a problem hiding this comment.
Given that we turned to use two rounds of scan broadcast, getting the global 'rebuild epoch' could be easier now, @liuxuezhao ?
There was a problem hiding this comment.
right, I also thought that, I can check some details, may push another PR or another commit to this PR later (will confirm with shilong which one is better).
src/rebuild/rebuild_internal.h
Outdated
| rebuild_tgt_prepare_pause(crt_rpc_t *rpc, uint64_t *stable_epoch); | ||
|
|
||
| int | ||
| rebuild_tgt_cancel_pause(crt_rpc_t *rpc); |
There was a problem hiding this comment.
rebuild_tgt_stop_agg()/rebuild_tgt_resume_agg() sounds more straightforward.
src/container/srv_target.c
Outdated
| } else { | ||
| D_ALLOC(snapshots, (cont->sc_snapshots_nr + 1) * | ||
| sizeof(daos_epoch_t)); | ||
| D_ALLOC(snapshots, (cont->sc_snapshots_nr + 1) * sizeof(daos_epoch_t)); |
There was a problem hiding this comment.
as rebuild_fence removed now, so looks need not + 1, for both L382 and 385
| D_DEBUG(DB_MD, DF_UUID "[%d]: wait for pausing EC aggregation\n", | ||
| DP_UUID(pool_child->spc_uuid), dss_get_module_info()->dmi_tgt_id); | ||
| D_INFO(DF_UUID "[%d]: wait for local EC aggregation to pause\n", | ||
| DP_UUID(pool_child->spc_uuid), dss_get_module_info()->dmi_tgt_id); |
There was a problem hiding this comment.
looks fine to use INFO, then L976 can remove
src/rebuild/srv.c
Outdated
| dpc->spc_rebuild_end_hlc = d_hlc_get(); | ||
| if (rpt->rt_ec_pause_token == dpc->spc_ec_agg_pause_gate) { | ||
| dpc->spc_ec_agg_pause_gate = 0; | ||
| dpc->spc_rebuild_end_hlc = d_hlc_get(); |
There was a problem hiding this comment.
spc_rebuild_end_hlc is no usage now, looks fine to remove it also.
| dpc->spc_ec_agg_pause_gate = d_hlc_get(); | ||
| D_INFO(DF_RB ": set local EC agg pause gate token " DF_U64 "\n", DP_RB_RPT(rpt), | ||
| dpc->spc_ec_agg_pause_gate); | ||
| ds_cont_child_wait_ec_agg_pause(dpc, rebuild_wait_ec_pause); |
There was a problem hiding this comment.
adding spc_ec_agg_pause_gate check to ds_pool_is_rebuilding() can make it affect agg_rate_ctl().
| /* phase-2: broadcast scan RPC to all targets */ | ||
| D_INFO(DF_RB ": start rebuild scan\n", DP_RB_RGT(*p_rgt)); | ||
| rc = rebuild_scan_broadcast(pool, *p_rgt, &task->dst_tgts, task->dst_new_layout_version, | ||
| task->dst_rebuild_op, RB_SCAN_START); |
There was a problem hiding this comment.
right, I also thought that, I can check some details, may push another PR or another commit to this PR later (will confirm with shilong which one is better).
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
|
Replace by #17941 |
Pause EC agg globally before triggering rebuild and set rebuild fence after we confirmed all ec agg has been paused.
After rebuild, we don't need trigger full scan for ec agg any more, because we gurantee there is no ec aggregation after rebuild fence.
Steps for the author:
After all prior steps are complete: