DAOS-18891 object: retry if vos_update_end return -DER_AGAIN by Nasf-Fan · Pull Request #18245 · daos-stack/daos

Nasf-Fan · 2026-05-14T05:57:24Z

On the server side, an update operation may yield the CPU between the paired vos_update_begin() and vos_update_end() calls. During that interval, the object held via vos_update_begin() may be evicted by others, for example by another failed modification against the same object shard, or by eviction under md-on-ssd mode. vos_update_end() therefore checks for this case and returns -DER_AGAIN instead of -DER_TX_RESTART to notify the caller, and the caller then needs to retry the update instead of failing out.

The patch also initializes some local variables in the object module to avoid random corruption when handling some failure cases.

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions · 2026-05-14T05:57:42Z

Ticket title is 'osa/online_extend.py:OSAOnlineExtend.test_osa_online_extend_drain_after_rebuild - DER_TX_RESTART(-2025)'
Status is 'In Review'
Labels: 'ci_master_weekly,weekly_test'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-18891

Nasf-Fan · 2026-05-19T05:14:36Z

Ping reviewers, thanks!

NiuYawei · 2026-05-19T06:20:05Z

 end:
 	rc = vos_update_end(ioh, ioc.ioc_map_ver, dkey, rc, &ioc.ioc_io_size, NULL);
 	if (rc) {
+		if (rc == -DER_AGAIN) {


It's not accurate to retry on the -DER_AGAIN error; -DER_AGAIN could also be returned from some other error code paths, for example vos_obj_acquire(), vos_dtx_validation(), etc.

What about DER_OVERLOAD_RETRY or a new error code?

No, it should be -DER_AGAIN, because object eviction needs no special handling compared with -DER_AGAIN from vos_obj_acquire() or elsewhere: just yield and retry.

Is it ok to retry when vos_dtx_validation() returns -DER_AGAIN?

Yes, because the DTX has been aborted by a race: for example, the original RPC timed out and the leader aborted it. Then, when -DER_AGAIN is returned to the leader, the leader will retry. If the related DTX has already been processed via a subsequent resent RPC, that can be detected on retry.

So, at least, in theory, it can be retried.

NiuYawei · 2026-05-19T06:38:22Z

+		if (rc == -DER_AGAIN) {
+			uint64_t now = daos_gettime_coarse();
+
+			if (now - ts > 30) {


Print a warning if a retry happens after 30 seconds? That doesn't make sense to me. Why don't we log a warning after a certain number of retries?

The retry count depends on system load and scheduling: for example, retrying 100 times may take 10 seconds or may take 1 minute, which is not easy to control. A time-based warning is more controllable.

NiuYawei · 2026-05-19T06:39:31Z

+				ts = now;
+			}
+
+			ABT_thread_yield();


It's not necessary to call this function.

-DER_AGAIN can also be returned in other cases where someone may still hold a reference against the troubled object, so yielding gives them a chance to release that reference. On the other hand, to be safe, yielding avoids blocking the system even if something goes wrong and -DER_AGAIN is returned repeatedly. So I prefer to keep ABT_thread_yield().

liuxuezhao · 2026-05-19T07:04:31Z

 	if (rc == 0)
 		rc = rc1;

+	if (rc == -DER_AGAIN) {


Just to confirm: there are some callers of vos_obj_update(), which calls vos_obj_update_ex() -> vos_update_end(). Is it possible to get DER_AGAIN from that vos_update_end(), and does the error code need to be handled there?

Right. I will go through the code and enhance related logic.

I suppose it will be refined in a following PR? Thanks.

liuxuezhao · 2026-05-20T07:34:05Z

 				    entry->ae_cur_stripe.as_hi_epoch, 0, VOS_OF_CRIT,
 				    &entry->ae_dkey, 1, &iod, iod_csums, &sgl);
+
+		OBJ_CHECK_EAGAIN(rc, ts, "vos_obj_update", entry->ae_oid, again1);


For vos_obj_update() and vos_obj_array_remove(), can it just retry inside vos_obj_update_ex(), so that it need not be changed everywhere?

In theory, that is possible, but I am afraid that explicitly yielding inside VOS may not be good practice. That is why I put the retries inside the object module for unification.
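
The definition of OBJ_CHECK_EAGAIN is not shown in this thread. Judging only from its call sites and the earlier diff hunks (timestamp check, yield, retry), a macro of that shape might look like the following sketch. The names `SK_CHECK_EAGAIN`, `sk_now` and `sk_yield` are all hypothetical, the object ID argument is omitted, and a fake clock replaces daos_gettime_coarse() so the behavior is deterministic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder error value for this sketch only. */
#define SK_DER_AGAIN 1

static uint64_t sk_now;  /* fake coarse clock, in seconds */
static int sk_warnings;  /* number of warnings emitted */
static int sk_yields;    /* number of yields performed */

/* Stand-in for ABT_thread_yield(); each yield "costs" 20 fake seconds. */
static void sk_yield(void)
{
	sk_yields++;
	sk_now += 20;
}

/* On -SK_DER_AGAIN: warn at most once per 30 seconds, yield, then jump back to label. */
#define SK_CHECK_EAGAIN(rc, ts, name, label)                                  \
	do {                                                                  \
		if ((rc) == -SK_DER_AGAIN) {                                  \
			if (sk_now - (ts) > 30) {                             \
				fprintf(stderr, "%s keeps returning -DER_AGAIN\n", \
					(name));                              \
				sk_warnings++;                                \
				(ts) = sk_now;                                \
			}                                                     \
			sk_yield();                                           \
			goto label;                                           \
		}                                                             \
	} while (0)

/* Stand-in for a VOS call that fails a given number of times. */
static int sk_fake_update(int *failures_left)
{
	if (*failures_left > 0) {
		(*failures_left)--;
		return -SK_DER_AGAIN;
	}
	return 0;
}

/* Object-module style caller: the retry policy lives here, not inside the callee. */
static int sk_caller(int failures)
{
	uint64_t ts = sk_now;
	int rc;

	assert(failures >= 0);
again:
	rc = sk_fake_update(&failures);
	SK_CHECK_EAGAIN(rc, ts, "sk_fake_update", again);
	return rc;
}
```

Keeping the yield in a caller-side macro, as in this sketch, leaves the callee free of explicit scheduling decisions, which is the concern raised in the reply above.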

liuxuezhao · 2026-05-20T07:39:28Z

 	recx.rx_idx = (oer->er_stripenum * recx.rx_nr) | PARITY_INDICATOR;
 	rc = vos_obj_array_remove(ioc.ioc_coc->sc_hdl, oer->er_oid, &oer->er_epoch_range, dkey,
 				  &iod->iod_name, &recx);
+	OBJ_CHECK_EAGAIN(rc, ts, "vos_obj_array_remove", oer->er_oid, again);


Can this retry from "remove_parity" here?

I will fix it.

liuxuezhao · 2026-05-20T07:40:24Z

 				  &oea->ea_epoch_range, dkey,
 				  &iod->iod_name, &recx);
+
+	OBJ_CHECK_EAGAIN(rc1, ts, "vos_obj_array_remove", oea->ea_oid, again);


Same here; it seems only vos_obj_array_remove() needs to be retried?

Nasf-Fan · 2026-05-22T15:19:47Z

Ping reviewers! Thanks!

On the server side, an update operation may yield the CPU between the paired vos_update_begin() and vos_update_end() calls. During that interval, the object held via vos_update_begin() may be evicted by others, for example by another failed modification against the same object shard, or by eviction under md-on-ssd mode. vos_update_end() therefore checks for this case and returns -DER_AGAIN instead of -DER_TX_RESTART to notify the caller, and the caller then needs to retry the update instead of failing out. The patch also initializes some local variables in the object module to avoid random corruption when handling some failure cases. Signed-off-by: Fan Yong <fan.yong@hpe.com>

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18891 branch from b48fc43 to 3746ad2 Compare May 15, 2026 03:38

Nasf-Fan marked this pull request as ready for review May 16, 2026 08:18

Nasf-Fan requested review from a team as code owners May 16, 2026 08:18

Nasf-Fan requested review from NiuYawei and liuxuezhao May 16, 2026 08:19

NiuYawei reviewed May 19, 2026

liuxuezhao reviewed May 19, 2026

NiuYawei previously approved these changes May 20, 2026

liuxuezhao previously approved these changes May 20, 2026

Nasf-Fan dismissed stale reviews from liuxuezhao and NiuYawei via e0d2a2d May 20, 2026 07:17

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18891 branch from 3746ad2 to e0d2a2d Compare May 20, 2026 07:17

github-actions Bot added the priority label (Ticket has high priority; automatically managed) May 20, 2026

Nasf-Fan requested review from NiuYawei and liuxuezhao May 20, 2026 07:28

liuxuezhao reviewed May 20, 2026

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18891 branch from ab09902 to c4dc277 Compare May 21, 2026 02:39

liuxuezhao previously approved these changes May 21, 2026

Nasf-Fan requested a review from gnailzenh May 22, 2026 03:00

NiuYawei previously approved these changes May 25, 2026

NiuYawei dismissed stale reviews from liuxuezhao and themself via 302ecce May 25, 2026 02:15

NiuYawei previously approved these changes May 25, 2026

Nasf-Fan dismissed NiuYawei’s stale review via ae50963 May 25, 2026 02:24

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18891 branch from 302ecce to ae50963 Compare May 25, 2026 02:24

Nasf-Fan requested review from NiuYawei and liuxuezhao May 25, 2026 02:25

liuxuezhao previously approved these changes May 25, 2026

Nasf-Fan dismissed liuxuezhao’s stale review via 8296fcb May 25, 2026 02:47

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18891 branch from ae50963 to 8296fcb Compare May 25, 2026 02:47

NiuYawei approved these changes May 25, 2026

liuxuezhao approved these changes May 25, 2026
