DAOS-18891 object: retry if vos_update_end return -DER_AGAIN #18245

Open
Nasf-Fan wants to merge 1 commit into master from Nasf-Fan/DAOS-18891

@Nasf-Fan
Contributor

On the server side, an update operation may yield the CPU between vos_update_begin() and the matching vos_update_end(). During that interval, the object held via vos_update_begin() may be evicted by others, for example by another failed modification against the same object shard, or by eviction under md-on-ssd mode. vos_update_end() checks for this case and returns -DER_AGAIN instead of -DER_TX_RESTART to notify the caller, and the caller must then retry the update instead of failing out.

The patch also initializes some local variables in the object module to avoid random corruption when handling some failure cases.
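
The retry behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the patch: every `sk_`-prefixed name, the `SK_ERR_AGAIN` value, and the `evictions_left` counter are hypothetical stand-ins for vos_update_begin(), vos_update_end(), -DER_AGAIN, and the eviction race.

```c
/* Hypothetical error value for illustration; not the real DAOS code. */
#define SK_ERR_AGAIN (-1)

/* Stand-in for an I/O handle. evictions_left simulates how many times
 * the held object gets evicted before the update can complete. */
struct sk_ioh {
    int evictions_left;
    int begin_calls;
};

/* Stand-in for vos_update_begin(): (re)acquires the object. */
static int sk_update_begin(struct sk_ioh *ioh)
{
    ioh->begin_calls++;
    return 0;
}

/* Stand-in for vos_update_end(): reports SK_ERR_AGAIN when the object
 * was evicted while the caller had yielded. */
static int sk_update_end(struct sk_ioh *ioh, int err)
{
    if (err != 0)
        return err;
    if (ioh->evictions_left > 0) {
        ioh->evictions_left--;
        return SK_ERR_AGAIN;
    }
    return 0;
}

/* Stand-in for a cooperative yield such as ABT_thread_yield(). */
static void sk_yield(void)
{
}

/* Retry the whole begin/end pair while the end step asks for a retry. */
static int sk_update_with_retry(struct sk_ioh *ioh)
{
    int rc;

again:
    rc = sk_update_begin(ioh);
    /* ... data transfer would happen here and may yield the CPU ... */
    rc = sk_update_end(ioh, rc);
    if (rc == SK_ERR_AGAIN) {
        sk_yield();
        goto again;
    }
    return rc;
}
```

With `evictions_left` set to 2, the sketch calls the begin step three times before it returns 0, which mirrors "retry update instead of fail out".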

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

github-actions Bot commented May 14, 2026

Ticket title is 'osa/online_extend.py:OSAOnlineExtend.test_osa_online_extend_drain_after_rebuild - DER_TX_RESTART(-2025)'
Status is 'In Review'
Labels: 'ci_master_weekly,weekly_test'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-18891

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18891 branch from b48fc43 to 3746ad2 Compare May 15, 2026 03:38
@Nasf-Fan Nasf-Fan marked this pull request as ready for review May 16, 2026 08:18
@Nasf-Fan Nasf-Fan requested review from a team as code owners May 16, 2026 08:18
@Nasf-Fan Nasf-Fan requested review from NiuYawei and liuxuezhao May 16, 2026 08:19
@Nasf-Fan
Contributor Author

Ping reviewers, thanks!

Comment thread src/object/srv_obj.c Outdated
end:
    rc = vos_update_end(ioh, ioc.ioc_map_ver, dkey, rc, &ioc.ioc_io_size, NULL);
    if (rc) {
        if (rc == -DER_AGAIN) {
Contributor

Retrying on -DER_AGAIN is not accurate, because -DER_AGAIN can also be returned from other error paths, for example vos_obj_acquire() and vos_dtx_validation().

What about DER_OVERLOAD_RETRY or a new error code?

Contributor Author

No, it should be -DER_AGAIN, because object eviction needs no special handling compared with -DER_AGAIN from vos_obj_acquire() or other paths: just yield and retry.

Contributor

Is it ok to retry when vos_dtx_validation() returns -DER_AGAIN?

Contributor Author

Yes. In that case the DTX has been aborted by a race, for example because the original RPC timed out and the leader aborted it. When -DER_AGAIN is returned to the leader, the leader retries. If the related DTX has already been processed via a subsequently resent RPC, the retry can detect that.

So, at least in theory, it can be retried.

Comment thread src/object/srv_obj.c Outdated
if (rc == -DER_AGAIN) {
    uint64_t now = daos_gettime_coarse();

    if (now - ts > 30) {
Contributor

Print a warning if the retry is still happening after 30 seconds? That doesn't make sense to me. Why not log a warning after a certain number of retries?

Contributor Author

The number of retries depends on system load and scheduling. For example, retrying 100 times may take 10 seconds or may take 1 minute, which is not easy to control. A time-based warning is more controllable.
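
To make the trade-off concrete, here is a small sketch of a time-based warning throttle modeled on the `now - ts > 30` check in the diff excerpt. The function names, the simulated clock, and the `SK_WARN_INTERVAL` constant are assumptions for illustration; the real code reads the time from daos_gettime_coarse().

```c
#include <stdbool.h>
#include <stdint.h>

/* Seconds between warnings; mirrors the 30 in the diff excerpt. */
#define SK_WARN_INTERVAL 30

/* Returns true when a warning is due, at most once per interval.
 * *last_ts is advanced only when a warning fires. */
static bool sk_should_warn(uint64_t now, uint64_t *last_ts)
{
    if (now - *last_ts > SK_WARN_INTERVAL) {
        *last_ts = now;
        return true;
    }
    return false;
}

/* Simulates a retry loop in which every retry costs secs_per_retry
 * seconds of wall-clock time, and counts the warnings emitted. */
static unsigned sk_count_warnings(unsigned retries, uint64_t secs_per_retry)
{
    uint64_t start = 1000; /* arbitrary clock origin */
    uint64_t ts = start;
    unsigned warnings = 0;

    for (unsigned i = 1; i <= retries; i++) {
        uint64_t now = start + i * secs_per_retry;

        if (sk_should_warn(now, &ts))
            warnings++;
    }
    return warnings;
}
```

In this simulation, 100 retries that all finish within the same second produce no warning, while 3 retries that take 60 seconds each produce 3, so the warning rate follows elapsed time rather than retry count.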

Comment thread src/object/srv_obj.c Outdated
    ts = now;
}

ABT_thread_yield();
Contributor

It's not necessary to call this function.

Contributor Author

-DER_AGAIN can also be returned in other cases where someone may still hold a reference against the troubled object, so yielding gives them a chance to release that reference. In addition, to be safe, yielding keeps the system from being blocked if something goes wrong and -DER_AGAIN is returned repeatedly. So I prefer to keep ABT_thread_yield().
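
The argument for yielding can be illustrated with a toy model. In the sketch below, which assumes a cooperative scheduler, the only way another task can drop its reference is for the retrying task to yield. All names are hypothetical, and `sk_yield()` merely simulates what a real yield such as ABT_thread_yield() would allow other tasks to do.

```c
#include <stdbool.h>

/* Hypothetical error value for illustration; not the real DAOS code. */
#define SK_ERR_AGAIN (-1)

/* Toy object: other_refs counts references held by other tasks. */
struct sk_obj {
    int other_refs;
};

/* Simulated yield: one other task gets to run and drops one reference. */
static void sk_yield(struct sk_obj *obj)
{
    if (obj->other_refs > 0)
        obj->other_refs--;
}

/* The operation cannot proceed while anyone else holds a reference. */
static int sk_try_op(const struct sk_obj *obj)
{
    return obj->other_refs > 0 ? SK_ERR_AGAIN : 0;
}

/* Bounded retry loop, with or without a yield between attempts. */
static int sk_retry(struct sk_obj *obj, bool do_yield, int max_attempts)
{
    int rc = SK_ERR_AGAIN;

    for (int i = 0; i < max_attempts && rc == SK_ERR_AGAIN; i++) {
        rc = sk_try_op(obj);
        if (rc == SK_ERR_AGAIN && do_yield)
            sk_yield(obj);
    }
    return rc;
}
```

Without the yield, the loop in this model spins through every attempt and still ends with `SK_ERR_AGAIN`; with the yield, the other holders drain their references and the operation succeeds.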

Comment thread src/object/srv_obj_migrate.c Outdated
if (rc == 0)
    rc = rc1;

if (rc == -DER_AGAIN) {
Contributor

Just to confirm: there are some callers of vos_obj_update(), which calls vos_obj_update_ex() -> vos_update_end(). Is it possible to get -DER_AGAIN from that vos_update_end(), and does the error code need to be handled there as well?

Contributor Author

Right. I will go through the code and enhance related logic.

Contributor

I suppose this will be refined in a follow-up PR? Thanks.

NiuYawei previously approved these changes May 20, 2026
liuxuezhao previously approved these changes May 20, 2026
@Nasf-Fan Nasf-Fan dismissed stale reviews from liuxuezhao and NiuYawei via e0d2a2d May 20, 2026 07:17
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18891 branch from 3746ad2 to e0d2a2d Compare May 20, 2026 07:17
@github-actions github-actions Bot added the priority Ticket has high priority (automatically managed) label May 20, 2026
@Nasf-Fan Nasf-Fan requested review from NiuYawei and liuxuezhao May 20, 2026 07:28
entry->ae_cur_stripe.as_hi_epoch, 0, VOS_OF_CRIT,
&entry->ae_dkey, 1, &iod, iod_csums, &sgl);

OBJ_CHECK_EAGAIN(rc, ts, "vos_obj_update", entry->ae_oid, again1);
Contributor

For vos_obj_update() and vos_obj_array_remove(), could the retry just happen inside vos_obj_update_ex(), so that the callers do not all need to change?

Contributor Author

In theory, that is possible, but I am afraid that explicitly yielding inside VOS may not be good practice. That is why I put the retries inside the object module, in a unified form.
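
For readers unfamiliar with this pattern, a caller-side retry macro might look roughly like the sketch below. It is not the definition of OBJ_CHECK_EAGAIN from the patch, which this page does not show: `SK_CHECK_EAGAIN`, its parameter list, and the stubs around it are hypothetical, and the time-based warning is left out to keep the example short.

```c
/* Hypothetical error value for illustration; not the real DAOS code. */
#define SK_ERR_AGAIN (-1)

static int sk_yield_count;

/* Stand-in for a cooperative yield such as ABT_thread_yield(). */
static void sk_yield(void)
{
    sk_yield_count++;
}

/* Hypothetical caller-side retry macro: on SK_ERR_AGAIN, yield and jump
 * back to the given label. The name argument would feed a warning
 * message in a fuller version; it is unused here. */
#define SK_CHECK_EAGAIN(rc, name, label)        \
    do {                                        \
        (void)(name);                           \
        if ((rc) == SK_ERR_AGAIN) {             \
            sk_yield();                         \
            goto label;                         \
        }                                       \
    } while (0)

static int sk_fail_budget;

/* Stub update that fails with SK_ERR_AGAIN sk_fail_budget times. */
static int sk_obj_update(void)
{
    if (sk_fail_budget > 0) {
        sk_fail_budget--;
        return SK_ERR_AGAIN;
    }
    return 0;
}

/* The yield stays in the caller; the storage-layer stub never yields. */
static int sk_caller(void)
{
    int rc;

again:
    rc = sk_obj_update();
    SK_CHECK_EAGAIN(rc, "sk_obj_update", again);
    return rc;
}
```

Keeping the yield in the macro, outside the stubbed storage call, reflects the design choice stated above: the lower layer only reports the condition, and the object-level caller decides when to yield and retry.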

Comment thread src/object/srv_obj.c Outdated
recx.rx_idx = (oer->er_stripenum * recx.rx_nr) | PARITY_INDICATOR;
rc = vos_obj_array_remove(ioc.ioc_coc->sc_hdl, oer->er_oid, &oer->er_epoch_range, dkey,
&iod->iod_name, &recx);
OBJ_CHECK_EAGAIN(rc, ts, "vos_obj_array_remove", oer->er_oid, again);
Contributor

Can this retry from "remove_parity" here?

Contributor Author

I will fix it.

Comment thread src/object/srv_obj.c Outdated
&oea->ea_epoch_range, dkey,
&iod->iod_name, &recx);

OBJ_CHECK_EAGAIN(rc1, ts, "vos_obj_array_remove", oea->ea_oid, again);
Contributor

Same here: it seems only the vos_obj_array_remove() call needs to be retried?
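
The point about retry scope can be shown with a small sketch: placing the retry label directly before the failing call avoids repeating earlier steps. All names below are hypothetical stand-ins, and the yield between attempts is omitted for brevity.

```c
/* Hypothetical error value for illustration; not the real DAOS code. */
#define SK_ERR_AGAIN (-1)

/* Counters that record how often each step runs. */
struct sk_ctx {
    int prepare_calls;
    int remove_calls;
    int again_budget; /* how many times the remove step asks for a retry */
};

/* Stand-in for the setup work that precedes the remove call. */
static void sk_prepare(struct sk_ctx *c)
{
    c->prepare_calls++;
}

/* Stand-in for vos_obj_array_remove(). */
static int sk_array_remove(struct sk_ctx *c)
{
    c->remove_calls++;
    if (c->again_budget > 0) {
        c->again_budget--;
        return SK_ERR_AGAIN;
    }
    return 0;
}

/* The label sits after the setup, so only the remove step is repeated. */
static int sk_remove_with_narrow_retry(struct sk_ctx *c)
{
    int rc;

    sk_prepare(c);
retry_remove:
    rc = sk_array_remove(c);
    if (rc == SK_ERR_AGAIN)
        goto retry_remove;
    return rc;
}
```

Here the setup runs once however many times the remove step is retried; jumping back to a label above `sk_prepare()` would repeat it on every attempt.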

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18891 branch from ab09902 to c4dc277 Compare May 21, 2026 02:39
liuxuezhao previously approved these changes May 21, 2026
@Nasf-Fan Nasf-Fan requested a review from gnailzenh May 22, 2026 03:00
@Nasf-Fan
Contributor Author

Ping reviewers! Thanks!

NiuYawei previously approved these changes May 25, 2026
@NiuYawei NiuYawei dismissed stale reviews from liuxuezhao and themself via 302ecce May 25, 2026 02:15
NiuYawei previously approved these changes May 25, 2026
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18891 branch from 302ecce to ae50963 Compare May 25, 2026 02:24
@Nasf-Fan Nasf-Fan requested review from NiuYawei and liuxuezhao May 25, 2026 02:25
liuxuezhao previously approved these changes May 25, 2026
On the server side, an update operation may yield the CPU between
vos_update_begin() and the matching vos_update_end(). During that
interval, the object held via vos_update_begin() may be evicted by
others, for example by another failed modification against the same
object shard, or by eviction under md-on-ssd mode. vos_update_end()
checks for this case and returns -DER_AGAIN instead of
-DER_TX_RESTART to notify the caller, and the caller must then retry
the update instead of failing out.

The patch also initializes some local variables in the object module
to avoid random corruption when handling some failure cases.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
Labels

priority Ticket has high priority (automatically managed)

3 participants