fix(#10875): sentinel wait for transitions infodoc to be valid#11135
fix(#10875): sentinel wait for transitions infodoc to be valid#11135witash wants to merge 15 commits into
Conversation
|
This one sounds like a pretty sweet solution! nice work! |
|
I think this is ready for review. |
|
last minute changed from |
dianabarsan
left a comment
There was a problem hiding this comment.
Nice work! Some comments below.
| return await db.sentinel.get(getInfoDocId(id)); | ||
| } catch (err) { | ||
| if (err.status !== 404) { | ||
| if (err.status !== 404 || !fallback) { |
There was a problem hiding this comment.
I think this would be easier to understand if reversed:
| if (err.status !== 404 || !fallback) { | |
| if (err.status === 404 && fallback) { | |
| return fallback; | |
| } | |
| throw err; |
There also seems to be a behavior change here, where if fallback is not passed, the 404 would throw compared to before when it did not throw. Is there a reason for this shift?
There was a problem hiding this comment.
reversed the condition for readability.
the intention was to not change existing behavior for the existing functions, but allow the new functions to raise errors if the infodoc doesn't exist instead of silently creating it.
But it's more consistent with the existing functions to silently create a blank infoDoc, so changed to do that instead.
| }); | ||
| }); | ||
|
|
||
| it('retries on 409 conflict', () => { |
There was a problem hiding this comment.
Thanks for adding this test.
I believe that we now have infinite retries? If someone adds a limit to the number of retries, this test would still pass. I think it would be useful to capture the unlimited retry count in this test somehow. MAybe adding 100 calls would suffice?
There was a problem hiding this comment.
stubbed db.sentinel.put to reject with 409 on the first 100 calls, then resolve on the 101st
| .then(doc => { | ||
| change.doc = doc; | ||
| return infodoc.get(change).then(infoDoc => { | ||
| return getConsistentInfoDoc(change, MAX_INFODOC_WAIT).then(infoDoc => { |
There was a problem hiding this comment.
This is the central change in the PR , but I don't see any test covering getConsistentInfoDoc itself — i.e. sentinel waiting while transitions_started is set and proceeding once it clears, and the 5-retry exhaustion path that logs the warning and skips. The new tests all cover the API write side and the infodoc plumbing. Could you add a processChange test that exercises the wait/retry/skip directly?
There was a problem hiding this comment.
added 'should wait for transitions_started to clear before processing the change' and 'should skip processing the change when transitions_started never clears'.
used fake time which makes them more complicated but hopefully avoids flakiness
|
hi @dianabarsan please re-review, I added a commit with some new changes; There is another related, but not the same, race condition that is pre-existing and the changes in timing apparently now trigger it reliably in tests. It's a little complicated, here is claude's summary:
So its related to migrating legacy infodocs that were in the medic, which I think means the racing code and tests can just be removed? Because we can expect all infodocs to be in the sentinel db by now, so is there really still a need to check the medic db just in case? there are still two writes to the same infodoc, but without the conflicting data from legacy infodocs, that doesn't matter; so no need to overcomplicate the fix. this isn't the same to 10875, but I think it's related enough to also include the fix in this branch to get the tests to pass. |
dianabarsan
left a comment
There was a problem hiding this comment.
Thanks for the changes. Looking good!. I left some comments, but nothing major.
| } | ||
| missingTransitions.forEach(infoDoc => { | ||
| const change = changeForInfoDoc(infoDoc._id); | ||
| infoDoc.transitions = infoDoc.transitions || change.doc?.transitions; |
There was a problem hiding this comment.
The history of the transitions changes was:
- 2.x - transitions were saved on the document itself (doc.transitions)
- 3.x - migrated from storing transitions on document to storing transitions on infodoc, and stored in medic database
- 3.x - infodoc transition from medic database to medic-sentinel database
So here, where you're deleting the migration of infodoc from medic to medic-sentinel but keeping the possibility of having transitions saved on the doc itself is like removing the middle migration and keeping the first one. If we assume that we are past infodocs in medic database, that we can absolutely assume that transitions are no longer properties of medic documents.
We either delete all or none.
| const modify = infoDoc => { | ||
| infoDoc.transitions = change.info?.transitions || {}; | ||
| // Clear the in-progress marker in the same write that commits the transitions (API branch only) | ||
| if (clearStarted) { |
There was a problem hiding this comment.
If nobody triggers this (so somehow API messes up and crashes or some other thing) and the transitions_started remains on the document, this will mean that it will permanently be skipped by Sentinel.
Can we add a fallback where sentinel checks the time when transitions_started was set, and if it's over some prescribed interval, it will delete it itself and resume transition processing for this doc.
| const modify = infoDoc => { | ||
| infoDoc.transitions = change.info?.transitions || {}; | ||
| // Clear the in-progress marker in the same write that commits the transitions (API branch only) | ||
| if (clearStarted) { |
There was a problem hiding this comment.
It feels a bit redundant that we have a clearStarted parameter but also a clearTransitionsStarted function, both deleting the same property. is it possible for this function to call clearTransitionsStarted instead of this annex to the modify function?
There was a problem hiding this comment.
I saw that just inlining the next call generates an extra round trip to the database. I asked claude about it and:
The cleaner option — drop the param without reusing clearTransitionsStarted and without an extra write: have saveTransitions always delete the marker.
This is safe at all three callers: the two non-API callers have no marker, so delete is a harmless no-op; the API caller gets exactly today's behavior. It's strictly better than both alternatives — one write, no param, and it enforces a clean invariant ("transitions committed ⇒ no in-progress marker"). It even opportunistically clears a stale
marker if one ever reaches this path, which slightly mitigates the crash-poisoning gap from finding #2.
The one tradeoff: saveTransitions now always touches transitions_started, so it's coupled to the marker concept even on the sentinel path. That coupling is mild and arguably correct — but if witash wants saveTransitions to stay marker-agnostic, then keeping the explicit clearStarted param is the more honest expression of intent, and "redundant"
isn't quite the right framing since the param is what folds two mutations into a single write. Worth raising as a suggestion either way; I'd lead with the unconditional-delete version.
Maybe worth considering.
|
|
||
| return lib.saveTransitions(change, true).then(() => { | ||
| const saved = db.sentinel.put.args[0][0]; | ||
| // transitions are written and the mid-write marker removed; nothing else is touched |
There was a problem hiding this comment.
claude captain obvious has spoken
| assert.deepEqual(saved, { | ||
| _id: 'some-info', | ||
| doc_id: 'some', | ||
| transitions_started: saved.transitions_started, | ||
| }); |
There was a problem hiding this comment.
This is a weird assertion :D asserting a property of a doc is equal to a property of a doc.
We could use .excludingEvery here and omitting the transitions_started assert specifically. or even better, we should mock time and assert exactly on the format we expect.
| const put = sinon.stub(db.sentinel, 'put'); | ||
| // conflict on the first 100 attempts, succeed on the 101st - a retry limit below this would fail | ||
| for (let i = 0; i < 100; i++) { | ||
| put.onCall(i).rejects({ status: 409 }); |
There was a problem hiding this comment.
I believe you don't need this for loop. You can just say: put.rejects() and add the put.onCall(100).resolves(); and it will just work.
| }); | ||
| }); | ||
|
|
||
| // fake timers so the 100ms retry waits don't cost real time or flake near mocha's timeout; |
There was a problem hiding this comment.
Please remove this comment.
| }); | ||
| // one extra tick past MAX_RETRIES is harmless and guarantees the change settles either way | ||
| for (let i = 0; i <= MAX_RETRIES; i++) { | ||
| await clock.tickAsync(RETRY_INTERVAL); |
There was a problem hiding this comment.
To make the test complete, please add assertions of numbers of calls after every tick. Otherwise three calls can be chained within the same tick and this test will not detect it.
| // the first write only marks the infodoc as mid-write, with no transitions yet | ||
| let startedSave = infodocSaves.find(args => args[0].transitions_started)[0]; | ||
| assert.isString(startedSave.transitions_started); | ||
| assert.deepEqual(startedSave, { |
There was a problem hiding this comment.
Please use excludingEvery (or excluding) here instead of asserting equality of the same object property. Applies to the other test as well.
Ideally please use fakeTimers everywhere and actually assert on the value of transitions_started.
Another attempt to resolve the api/sentinel race.
Add invalid_rev to infodocs for synchronous transitions to mark that transitions are in progress, and the infodoc is not yet valid for that revision.
after transitions are run and the document is saved, the infodoc is saved again, with transitions and with invalid_rev removed.
A previous version also added a valid_rev with explicitly the revision that the infodoc is valid for, but removed it since it's never used; the abscence of an invalid_rev implies that the infodoc is valid for the current revision.
If saving the medic doc fails, it clears invalid_rev from the infodoc to allow retry.
There is no cleanup for other stale infodocs where invalid_rev is never cleared, but this can only happen if there's an error other than 409 saving the infodoc the second time, or if its interrupted between saving the medic doc and the infodoc.
To actually fix the race, if sentinel finds an infodoc with invalid_rev set (doesn't actually use the rev, just the presence of the flag), it waits 100ms and then tries again until the infodoc had invalid_rev removed, for a maximum of 5 times. after 5 retries, it logs an error and continues to the next change without running transitions.
invalid_revis only set byprocessDocs(synchronous changes from api), and onlyprocessChanges(from sentinel) will wait for it. It also applies only to transitions. This is to avoid adding extra db chatter where it isn't really needed.A previous commit also add invalid_rev and waiting to
processChanges, but removed because it does not actually fix anything; sentinel does not support concurrency, and there isn't currently a case where api should wait for sentinel to finish.