Problem
PipelineInterpreterStateUpdater.reloadAndSave() can silently replace a valid pipeline state with an empty one. This happens because MongoDbRuleService.loadAll() and MongoDbPipelineService.loadAll() catch MongoException and return Collections.emptySet() instead of propagating the error. A transient MongoDB hiccup during any event-triggered reload (startup, shutdown, or normal operation) produces a structurally valid but semantically empty PipelineInterpreter.State — zero pipelines, zero stream connections. Messages processed in this window bypass all pipeline rules and land in the default stream.
The issue is exacerbated during shutdown because ClusterEventPeriodical (which can trigger reloads via serverEventBus) keeps running until after the process buffer is drained, and during startup because unconsumed cluster events trigger reloads during the MongoDB contention window.
reloadAndSave() has no exception handling, no validation of the resulting state, and no retry mechanism. PipelineInterpreter.process() and IlluminateMessageProcessor.process() have no null check on getLatestState().
Proposed Fix
Four independent changes, ordered by impact:
1. Let MongoException propagate from loadAll()
Remove the try/catch(MongoException) in MongoDbRuleService.loadAll() and MongoDbPipelineService.loadAll() (and similar methods: loadBySourcePattern, loadAllByTitle, loadAllByScope, loadNamed). Let exceptions propagate so that buildState() fails loudly instead of producing empty state. MongoDbPipelineStreamConnectionsService.loadAll() already does not catch — this makes all three services consistent.
Other callers (REST resources, content packs, migrations) then see the exception directly on a transient MongoDB failure (REST callers get a 500), which is correct and preferable to silently empty results.
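To make the difference concrete, here is a minimal sketch of the two behaviours. It does not use the real Graylog services: `RuleLoadSketch`, `StoreException` (standing in for the unchecked `MongoException`) and the `Supplier`-based query are all hypothetical names chosen for illustration.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a MongoDB-backed rule service; not Graylog's real classes.
class RuleLoadSketch {
    // Stand-in for com.mongodb.MongoException, which is also an unchecked exception.
    static class StoreException extends RuntimeException {
        StoreException(String message) { super(message); }
    }

    // Current behaviour: a failed query is indistinguishable from "no rules exist".
    static Collection<String> loadAllSwallowing(Supplier<List<String>> query) {
        try {
            return query.get();
        } catch (StoreException e) {
            return Collections.emptySet();
        }
    }

    // Proposed behaviour: the failure reaches the caller, so a state build fails loudly.
    static Collection<String> loadAllPropagating(Supplier<List<String>> query) {
        return query.get();
    }
}
```

With a query that throws, the first variant returns an empty collection and the second rethrows, which is what lets a caller such as buildState() tell "database unavailable" apart from "nothing configured".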
2. Migrate state reload to a SystemJob with retry
New PipelineInterpreterStateReloadJob following the PipelineMetadataUpdateJob pattern:
- On success: sets new state via stateUpdater.updateState(newState), returns SystemJobResult.success()
- On exception: logs warning, returns SystemJobResult.withRetry(1s, unlimited)
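The success and retry branches could look roughly like the sketch below. It deliberately avoids the real SystemJob API: `StateReloadJobSketch`, its nested `Result` type, and the `Supplier`/`Consumer` parameters are hypothetical stand-ins, and the state is reduced to a `String`.

```java
import java.time.Duration;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical reduction of the proposed job; names and types are illustrative only.
class StateReloadJobSketch {
    // Simplified job result: either done, or "run me again after retryDelay".
    static final class Result {
        final boolean success;
        final Duration retryDelay;
        private Result(boolean success, Duration retryDelay) {
            this.success = success;
            this.retryDelay = retryDelay;
        }
        static Result ok() { return new Result(true, Duration.ZERO); }
        static Result retryAfter(Duration delay) { return new Result(false, delay); }
    }

    private final Supplier<String> buildState;  // may throw on transient store errors
    private final Consumer<String> updateState; // publishes a freshly built state

    StateReloadJobSketch(Supplier<String> buildState, Consumer<String> updateState) {
        this.buildState = buildState;
        this.updateState = updateState;
    }

    Result execute() {
        try {
            // buildState runs first; if it throws, updateState is never called
            // and the previously published state stays in place.
            updateState.accept(buildState.get());
            return Result.ok();
        } catch (RuntimeException e) {
            System.err.println("WARN: pipeline state reload failed, will retry: " + e);
            return Result.retryAfter(Duration.ofSeconds(1));
        }
    }
}
```

The point of the sketch is that a failed build leaves the old state untouched and only asks the scheduler for another attempt.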
PipelineInterpreterStateUpdater changes:
- Event handlers submit the system job via SystemJobManager instead of scheduling on daemonScheduler
- Constructor keeps synchronous initial load (system scheduler may not be ready during Guice injection), but swaps order: load state before registering on serverEventBus to close the existing race window where an async event can trigger a reload before the initial load completes
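The constructor ordering can be illustrated with a small sketch. The `Bus` class here is a hypothetical minimal event bus, not the real serverEventBus, and `StartupOrderSketch` only records the order of operations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "load first, register second"; not the real updater class.
class StartupOrderSketch {
    // Minimal stand-in for an event bus that calls every registered handler on post().
    static class Bus {
        private final List<Runnable> handlers = new ArrayList<>();
        void register(Runnable handler) { handlers.add(handler); }
        void post() { handlers.forEach(Runnable::run); }
    }

    final List<String> log = new ArrayList<>();
    private volatile String state;

    StartupOrderSketch(Bus bus) {
        // 1. Synchronous initial load: no handler is registered yet,
        //    so no event can trigger a reload before this completes.
        state = "initial";
        log.add("initial load");
        // 2. Only now become reachable for events.
        bus.register(this::onChange);
    }

    private void onChange() {
        log.add("reload after " + state);
        state = "reloaded";
    }
}
```

Because registration happens last, any event-triggered reload always observes a completed initial load.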
3. Guard against empty-state overwrite
updateState() refuses to replace a non-empty state with an empty one (logs at WARN). Defense-in-depth for any remaining path where buildState() succeeds with empty data.
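A minimal sketch of such a guard, assuming the state can report whether it is empty. `EmptyStateGuardSketch`, the reduced `State` type, and the boolean return value are hypothetical and only serve the illustration.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical reduction of the updater; only the guard logic is shown.
class EmptyStateGuardSketch {
    // Reduced state: only the pipeline ids matter for the emptiness check.
    static final class State {
        final Set<String> pipelineIds;
        State(Set<String> pipelineIds) { this.pipelineIds = pipelineIds; }
        boolean isEmpty() { return pipelineIds.isEmpty(); }
    }

    private final AtomicReference<State> latest = new AtomicReference<>();

    // Returns true if the new state was published, false if it was rejected.
    boolean updateState(State next) {
        State current = latest.get();
        if (current != null && !current.isEmpty() && next.isEmpty()) {
            System.err.println("WARN: refusing to replace non-empty pipeline state with an empty one");
            return false;
        }
        latest.set(next);
        return true;
    }

    State getLatestState() { return latest.get(); }
}
```

One consequence of this sketch is that a legitimate removal of the last pipeline would also be rejected, so a real implementation may need a way to tell that case apart.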
4. Null safety in PipelineInterpreter.process() and IlluminateMessageProcessor.process()
If getLatestState() returns null, pass messages through unprocessed and log a warning instead of throwing a NullPointerException. Last-resort defense only.
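As a rough sketch of the pass-through behaviour: the class below is hypothetical, models messages as a `List<String>`, and models the state as a function over that list, none of which matches the real processor signatures.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Hypothetical reduction of a message processor; types are simplified for illustration.
class NullSafeProcessSketch {
    // Supplies the latest state, modelled as a function over the message list; may supply null.
    private final Supplier<UnaryOperator<List<String>>> latestState;

    NullSafeProcessSketch(Supplier<UnaryOperator<List<String>>> latestState) {
        this.latestState = latestState;
    }

    List<String> process(List<String> messages) {
        UnaryOperator<List<String>> state = latestState.get();
        if (state == null) {
            // Last-resort defense: hand the messages back unchanged instead of failing.
            System.err.println("WARN: no pipeline state available, passing messages through unprocessed");
            return messages;
        }
        return state.apply(messages);
    }
}
```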
Priority
Medium-high. The silent degradation path exists on any event-triggered reload when MongoDB has a transient error — not just during startup/shutdown. #24382 reduces the probability during startup by not replaying historic cluster events, but does not close the underlying hole. The fix is small and low-risk.
Backport Consideration
Changes 1, 3, and 4 are trivially backportable (~15 lines changed across 4 files, no new dependencies). Change 2 depends on the SystemJob infrastructure and would need to be simplified for older branches by keeping the daemonScheduler but adding try/catch with retry scheduling directly in reloadAndSave().
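For the simplified backport of change 2, the retry could be scheduled directly from reloadAndSave(). The sketch below assumes a plain `ScheduledExecutorService` in place of daemonScheduler; the class name, the `Supplier`-based state builder, and the configurable delay are hypothetical.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch of retry-on-failure without the SystemJob infrastructure.
class ReloadWithRetrySketch {
    private final ScheduledExecutorService scheduler;
    private final Supplier<String> buildState; // may throw on transient store errors
    private final long retryDelayMillis;
    private final AtomicReference<String> latest = new AtomicReference<>();

    ReloadWithRetrySketch(ScheduledExecutorService scheduler,
                          Supplier<String> buildState,
                          long retryDelayMillis) {
        this.scheduler = scheduler;
        this.buildState = buildState;
        this.retryDelayMillis = retryDelayMillis;
    }

    void reloadAndSave() {
        try {
            // Only a successfully built state replaces the current one.
            latest.set(buildState.get());
        } catch (RuntimeException e) {
            System.err.println("WARN: pipeline state reload failed, retrying: " + e);
            // Unlimited retries with a fixed delay, until a build succeeds.
            scheduler.schedule(this::reloadAndSave, retryDelayMillis, TimeUnit.MILLISECONDS);
        }
    }

    String getLatestState() { return latest.get(); }
}
```

Catching inside the task matters here because an exception that escapes a scheduled task is not rethrown to the submitter, so without the try/catch the failure would go unnoticed and no retry would happen.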
Related
ClusterEventPeriodical refactor (reduces probability during startup, but does not address the underlying exception handling)