fix(#10875): sentinel wait for transitions infodoc to be valid by witash · Pull Request #11135 · medic/cht-core

witash · 2026-06-03T09:12:57Z

Another attempt to resolve the api/sentinel race.
Add invalid_rev to infodocs for synchronous transitions to mark that transitions are in progress, and the infodoc is not yet valid for that revision.
after transitions are run and the document is saved, the infodoc is saved again, with transitions and with invalid_rev removed.

A previous version also added a valid_rev with explicitly the revision that the infodoc is valid for, but removed it since it's never used; the abscence of an invalid_rev implies that the infodoc is valid for the current revision.
If saving the medic doc fails, it clears invalid_rev from the infodoc to allow retry.

There is no cleanup for other stale infodocs where invalid_rev is never cleared, but this can only happen if there's an error other than 409 saving the infodoc the second time, or if its interrupted between saving the medic doc and the infodoc.

To actually fix the race, if sentinel finds an infodoc with invalid_rev set (doesn't actually use the rev, just the presence of the flag), it waits 100ms and then tries again until the infodoc had invalid_rev removed, for a maximum of 5 times. after 5 retries, it logs an error and continues to the next change without running transitions.

invalid_rev is only set by processDocs (synchronous changes from api), and only processChanges (from sentinel) will wait for it. It also applies only to transitions. This is to avoid adding extra db chatter where it isn't really needed.

A previous commit also add invalid_rev and waiting to processChanges, but removed because it does not actually fix anything; sentinel does not support concurrency, and there isn't currently a case where api should wait for sentinel to finish.

dianabarsan · 2026-06-03T10:42:55Z

This one sounds like a pretty sweet solution! nice work!

witash · 2026-06-08T09:27:58Z

I think this is ready for review.
Note that there are some changes to that appear unrelated but were necessary for code quality

witash · 2026-06-08T10:23:02Z

last minute changed from invalid_rev to transitions_started with a timestamp, since invalid_rev will always be null, _rev isn't set by the time the infoDoc is saved. It still uses the presence of the field to decide whether to wait, rather than the value, but adding the timestamp just for information/debugging (hopefully there will never be a need to debug).

dianabarsan

Nice work! Some comments below.

dianabarsan · 2026-06-09T10:49:22Z

+    return await db.sentinel.get(getInfoDocId(id));
  } catch (err) {
-    if (err.status !== 404) {
+    if (err.status !== 404 || !fallback) {


I think this would be easier to understand if reversed:

Suggested change

if (err.status !== 404 || !fallback) {

if (err.status === 404 && fallback) {

return fallback;

}

throw err;

There also seems to be a behavior change here, where if fallback is not passed, the 404 would throw compared to before when it did not throw. Is there a reason for this shift?

reversed the condition for readability.

the intention was to not change existing behavior for the existing functions, but allow the new functions to raise errors if the infodoc doesn't exist instead of silently creating it.

But it's more consistent with the existing functions to silently create a blank infoDoc, so changed to do that instead.

dianabarsan · 2026-06-09T10:56:16Z

+      });
+    });
+
+    it('retries on 409 conflict', () => {


Thanks for adding this test.
I believe that we now have infinite retries? If someone adds a limit to the number of retries, this test would still pass. I think it would be useful to capture the unlimited retry count in this test somehow. MAybe adding 100 calls would suffice?

stubbed db.sentinel.put to reject with 409 on the first 100 calls, then resolve on the 101st

dianabarsan · 2026-06-09T12:01:14Z

    .then(doc => {
      change.doc = doc;
-      return infodoc.get(change).then(infoDoc => {
+      return getConsistentInfoDoc(change, MAX_INFODOC_WAIT).then(infoDoc => {


This is the central change in the PR , but I don't see any test covering getConsistentInfoDoc itself — i.e. sentinel waiting while transitions_started is set and proceeding once it clears, and the 5-retry exhaustion path that logs the warning and skips. The new tests all cover the API write side and the infodoc plumbing. Could you add a processChange test that exercises the wait/retry/skip directly?

added 'should wait for transitions_started to clear before processing the change' and 'should skip processing the change when transitions_started never clears'.
used fake time which makes them more complicated but hopefully avoids flakiness

witash · 2026-06-24T09:24:07Z

hi @dianabarsan please re-review, I added a commit with some new changes;

There is another related, but not the same, race condition that is pre-existing and the changes in timing apparently now trigger it reliably in tests.

It's a little complicated, here is claude's summary:

The infodocs integration test finds and migrates data from the medic infodoc failed reliably: it expected a migrated initial_replication_date of 1000 but got a fresh "now" timestamp.

Root cause was a race between two unsynchronized writers to the same sentinel infodoc:

API (recordDocumentWrite, fired async after the doc write) creates the infodoc with initial_replication_date = now.

Sentinel (resolveInfoDocs) migrates the legacy medic-db infodoc, setting initial_replication_date to the original value.

When sentinel's migrating bulkUpdate conflicted with API's write, the conflict handler adopted the fresh "now" value, discarding the migrated date.

So its related to migrating legacy infodocs that were in the medic, which I think means the racing code and tests can just be removed? Because we can expect all infodocs to be in the sentinel db by now, so is there really still a need to check the medic db just in case?

there are still two writes to the same infodoc, but without the conflicting data from legacy infodocs, that doesn't matter; so no need to overcomplicate the fix.

this isn't the same to 10875, but I think it's related enough to also include the fix in this branch to get the tests to pass.

dianabarsan

Thanks for the changes. Looking good!. I left some comments, but nothing major.

dianabarsan · 2026-06-25T14:16:34Z

-            }
+      missingTransitions.forEach(infoDoc => {
+        const change = changeForInfoDoc(infoDoc._id);
+        infoDoc.transitions = infoDoc.transitions || change.doc?.transitions;


The history of the transitions changes was:

2.x - transitions were saved on the document itself (doc.transitions)

3.x - migrated from storing transitions on document to storing transitions on infodoc, and stored in medic database

3.x - infodoc transition from medic database to medic-sentinel database

So here, where you're deleting the migration of infodoc from medic to medic-sentinel but keeping the possibility of having transitions saved on the doc itself is like removing the middle migration and keeping the first one. If we assume that we are past infodocs in medic database, that we can absolutely assume that transitions are no longer properties of medic documents.

We either delete all or none.

dianabarsan · 2026-06-25T14:20:24Z

+  const modify = infoDoc => {
+    infoDoc.transitions = change.info?.transitions || {};
+    // Clear the in-progress marker in the same write that commits the transitions (API branch only)
+    if (clearStarted) {


If nobody triggers this (so somehow API messes up and crashes or some other thing) and the transitions_started remains on the document, this will mean that it will permanently be skipped by Sentinel.

Can we add a fallback where sentinel checks the time when transitions_started was set, and if it's over some prescribed interval, it will delete it itself and resume transition processing for this doc.

dianabarsan · 2026-06-25T14:22:47Z

+  const modify = infoDoc => {
+    infoDoc.transitions = change.info?.transitions || {};
+    // Clear the in-progress marker in the same write that commits the transitions (API branch only)
+    if (clearStarted) {


It feels a bit redundant that we have a clearStarted parameter but also a clearTransitionsStarted function, both deleting the same property. is it possible for this function to call clearTransitionsStarted instead of this annex to the modify function?

I saw that just inlining the next call generates an extra round trip to the database. I asked claude about it and:

The cleaner option — drop the param without reusing clearTransitionsStarted and without an extra write: have saveTransitions always delete the marker. This is safe at all three callers: the two non-API callers have no marker, so delete is a harmless no-op; the API caller gets exactly today's behavior. It's strictly better than both alternatives — one write, no param, and it enforces a clean invariant ("transitions committed ⇒ no in-progress marker"). It even opportunistically clears a stale marker if one ever reaches this path, which slightly mitigates the crash-poisoning gap from finding #2. The one tradeoff: saveTransitions now always touches transitions_started, so it's coupled to the marker concept even on the sentinel path. That coupling is mild and arguably correct — but if witash wants saveTransitions to stay marker-agnostic, then keeping the explicit clearStarted param is the more honest expression of intent, and "redundant" isn't quite the right framing since the param is what folds two mutations into a single write. Worth raising as a suggestion either way; I'd lead with the unconditional-delete version.

Maybe worth considering.

dianabarsan · 2026-06-25T14:33:54Z

+
+      return lib.saveTransitions(change, true).then(() => {
+        const saved = db.sentinel.put.args[0][0];
+        // transitions are written and the mid-write marker removed; nothing else is touched


~~claude~~ captain obvious has spoken

dianabarsan · 2026-06-25T14:36:13Z

+        assert.deepEqual(saved, {
+          _id: 'some-info',
+          doc_id: 'some',
+          transitions_started: saved.transitions_started,
+        });


This is a weird assertion :D asserting a property of a doc is equal to a property of a doc.

We could use .excludingEvery here and omitting the transitions_started assert specifically. or even better, we should mock time and assert exactly on the format we expect.

dianabarsan · 2026-06-25T14:37:25Z

+      const put = sinon.stub(db.sentinel, 'put');
+      // conflict on the first 100 attempts, succeed on the 101st - a retry limit below this would fail
+      for (let i = 0; i < 100; i++) {
+        put.onCall(i).rejects({ status: 409 });


I believe you don't need this for loop. You can just say: put.rejects() and add the put.onCall(100).resolves(); and it will just work.

dianabarsan · 2026-06-25T14:47:23Z

      });
    });
+
+    // fake timers so the 100ms retry waits don't cost real time or flake near mocha's timeout;


Please remove this comment.

dianabarsan · 2026-06-25T14:49:54Z

+      });
+      // one extra tick past MAX_RETRIES is harmless and guarantees the change settles either way
+      for (let i = 0; i <= MAX_RETRIES; i++) {
+        await clock.tickAsync(RETRY_INTERVAL);


To make the test complete, please add assertions of numbers of calls after every tick. Otherwise three calls can be chained within the same tick and this test will not detect it.

dianabarsan · 2026-06-25T14:51:10Z

+        // the first write only marks the infodoc as mid-write, with no transitions yet
+        let startedSave = infodocSaves.find(args => args[0].transitions_started)[0];
+        assert.isString(startedSave.transitions_started);
+        assert.deepEqual(startedSave, {


Please use excludingEvery (or excluding) here instead of asserting equality of the same object property. Applies to the other test as well.
Ideally please use fakeTimers everywhere and actually assert on the value of transitions_started.

witash added 5 commits April 23, 2026 13:22

fix(#10875): race condition for sync transitions

656eb5f

fix(#10875): adding valid/invalid rev to infodoc

3ef5980

fix(#10875): refactoring

3de6f0d

fix(#10875): readability

2add51a

fix(#10875): removing valid_rev

800b60f

witash mentioned this pull request Jun 8, 2026

fix(#10875): race condition for sync transitions #10910

Closed

witash added 2 commits June 8, 2026 12:14

fix(#10875): sonar and small fiddles

5def568

Merge branch 'master' into 10875-minimal-locking

5b419a4

witash requested a review from dianabarsan June 8, 2026 09:27

witash commented Jun 8, 2026

View reviewed changes

Comment thread shared-libs/infodoc/src/infodoc.js

witash commented Jun 8, 2026

View reviewed changes

Comment thread shared-libs/transitions/src/transitions/index.js

fix(#10875): changing from invalid_rev to transitions_started

08acb4c

dianabarsan requested changes Jun 9, 2026

View reviewed changes

witash and others added 7 commits June 10, 2026 11:59

fix(#10875): adding tests, small changes to infodoc

ffc737b

fix(#10875): sonar

34e75b0

fix(#10875): fix other race condition

98c7671

Merge branch 'master' into 10875-minimal-locking

604abdf

fix(#10875): removing legacy infodoc migration to fix tests

395278b

Merge branch 'master' into 10875-minimal-locking

feecb0b

fix(#10875): Sonar

aa6947f

witash requested a review from dianabarsan June 24, 2026 09:24

dianabarsan requested changes Jun 25, 2026

View reviewed changes

-    if (err.status !== 404 || !fallback) {
+if (err.status === 404 && fallback) {
+  return fallback;
+}
+throw err;

Uh oh!

Uh oh!

Conversation

witash commented Jun 3, 2026

Uh oh!

dianabarsan commented Jun 3, 2026

Uh oh!

witash commented Jun 8, 2026

Uh oh!

Uh oh!

Uh oh!

witash commented Jun 8, 2026

Uh oh!

dianabarsan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

witash commented Jun 24, 2026

Uh oh!

dianabarsan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants