Harden pipeline interpreter state reload against transient MongoDB errors #25750

@patrickmann

Description

Problem

PipelineInterpreterStateUpdater.reloadAndSave() can silently replace a valid pipeline state with an empty one. This happens because MongoDbRuleService.loadAll() and MongoDbPipelineService.loadAll() catch MongoException and return Collections.emptySet() instead of propagating the error. A transient MongoDB hiccup during any event-triggered reload (startup, shutdown, or normal operation) produces a structurally valid but semantically empty PipelineInterpreter.State — zero pipelines, zero stream connections. Messages processed in this window bypass all pipeline rules and land in the default stream.

The issue is exacerbated during shutdown because ClusterEventPeriodical (which can trigger reloads via serverEventBus) keeps running until after the process buffer is drained, and during startup because unconsumed cluster events trigger reloads during the MongoDB contention window.

reloadAndSave() has no exception handling, no validation of the resulting state, and no retry mechanism. Neither PipelineInterpreter.process() nor IlluminateMessageProcessor.process() null-checks the result of getLatestState().

Proposed Fix

Four independent changes, ordered by impact:

1. Let MongoException propagate from loadAll()

Remove the try/catch(MongoException) in MongoDbRuleService.loadAll() and MongoDbPipelineService.loadAll() (and similar methods: loadBySourcePattern, loadAllByTitle, loadAllByScope, loadNamed). Let exceptions propagate so that buildState() fails loudly instead of producing empty state. MongoDbPipelineStreamConnectionsService.loadAll() already does not catch — this makes all three services consistent.

Other callers (REST resources, content packs, migrations) get a 500 on transient MongoDB failure, which is correct and preferable to silently empty results.
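As a minimal illustration of change 1 (not the actual Graylog code; the service bodies and the `MongoException` stand-in are simplified), the difference between the current swallow-and-return-empty pattern and the proposed propagation:

```java
import java.util.Collections;
import java.util.Set;

// Simplified sketch of change 1. "MongoException" is a stand-in for
// com.mongodb.MongoException; the fetch logic is hypothetical.
public class LoadAllSketch {
    static class MongoException extends RuntimeException {}

    // Current behavior: a transient error is swallowed and the caller
    // receives a structurally valid but empty set.
    static Set<String> loadAllCurrent(boolean mongoAvailable) {
        try {
            return fetchFromMongo(mongoAvailable);
        } catch (MongoException e) {
            return Collections.emptySet(); // silent data loss for buildState()
        }
    }

    // Proposed behavior: remove the catch so buildState() fails loudly and
    // the previous (valid) state stays in place.
    static Set<String> loadAllProposed(boolean mongoAvailable) {
        return fetchFromMongo(mongoAvailable);
    }

    private static Set<String> fetchFromMongo(boolean mongoAvailable) {
        if (!mongoAvailable) {
            throw new MongoException();
        }
        return Set.of("pipeline-a", "pipeline-b");
    }
}
```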

2. Migrate state reload to a SystemJob with retry

New PipelineInterpreterStateReloadJob following the PipelineMetadataUpdateJob pattern:

  • On success: sets new state via stateUpdater.updateState(newState), returns SystemJobResult.success()
  • On exception: logs warning, returns SystemJobResult.withRetry(1s, unlimited)

PipelineInterpreterStateUpdater changes:

  • Event handlers submit the system job via SystemJobManager instead of scheduling on daemonScheduler
  • Constructor keeps the synchronous initial load (the system scheduler may not be ready during Guice injection) but swaps the order: load state before registering on serverEventBus, closing the existing race window where an async event could trigger a reload before the initial load completes
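A rough sketch of the job's control flow (the result type, updater, and buildState() supplier here are stand-ins, not the real SystemJob interfaces):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical approximation of PipelineInterpreterStateReloadJob.
// The real job would return SystemJobResult.success() / withRetry(...).
public class StateReloadJobSketch {
    enum JobResult { SUCCESS, RETRY }

    static final class StateUpdater {
        final AtomicReference<Object> latest = new AtomicReference<>();
        void updateState(Object newState) { latest.set(newState); }
    }

    final StateUpdater updater;
    final Supplier<Object> buildState; // may throw on transient MongoDB errors

    StateReloadJobSketch(StateUpdater updater, Supplier<Object> buildState) {
        this.updater = updater;
        this.buildState = buildState;
    }

    // On success: swap in the new state. On any exception: log a warning and
    // ask the job manager to retry; the previous valid state stays in place.
    JobResult execute() {
        try {
            updater.updateState(buildState.get());
            return JobResult.SUCCESS;
        } catch (RuntimeException e) {
            System.err.println("State reload failed, will retry: " + e);
            return JobResult.RETRY;
        }
    }
}
```

The key property: a failed reload never publishes anything, so the interpreter keeps serving the last good state until a retry succeeds.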

3. Guard against empty-state overwrite

updateState() refuses to replace a non-empty state with an empty one (logs at WARN). Defense-in-depth for any remaining path where buildState() succeeds with empty data.
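A minimal sketch of the guard, assuming a simplified State with only the fields needed here (the real PipelineInterpreter.State carries pipelines and stream connections):

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical guard for change 3; State is a stand-in for
// PipelineInterpreter.State.
public class EmptyStateGuardSketch {
    record State(Set<String> pipelines) {
        boolean isEmpty() { return pipelines.isEmpty(); }
    }

    final AtomicReference<State> latest = new AtomicReference<>();

    // Refuse to replace a non-empty state with an empty one.
    void updateState(State newState) {
        State current = latest.get();
        if (newState.isEmpty() && current != null && !current.isEmpty()) {
            System.err.println("WARN: refusing to replace non-empty pipeline state with an empty one");
            return;
        }
        latest.set(newState);
    }
}
```

Note that a legitimately empty initial state (e.g. a fresh install with no pipelines) is still accepted; only the non-empty-to-empty transition is blocked.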

4. Null safety in PipelineInterpreter.process() and IlluminateMessageProcessor.process()

If getLatestState() returns null, pass messages through with a warning log instead of throwing an NPE. Last-resort defense only.
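The shape of the guard, heavily simplified (messages are modeled as a list of strings and the state as an Object; the real processors operate on Graylog's Messages type):

```java
import java.util.List;

// Hypothetical sketch of the null guard for change 4.
public class NullStateGuardSketch {
    static List<String> process(List<String> messages, Object latestState) {
        if (latestState == null) {
            // Last-resort defense: pass messages through untouched rather
            // than throwing a NullPointerException.
            System.err.println("WARN: pipeline state not available, passing messages through unprocessed");
            return messages;
        }
        // ... normal pipeline evaluation would happen here ...
        return messages;
    }
}
```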

Priority

Medium-high. The silent degradation path exists on any event-triggered reload when MongoDB has a transient error — not just during startup/shutdown. #24382 reduces the probability during startup by not replaying historic cluster events, but does not close the underlying hole. The fix is small and low-risk.

Backport Consideration

Changes 1, 3, and 4 are trivially backportable (~15 lines changed across 4 files, no new dependencies). Change 2 depends on the SystemJob infrastructure and would need to be simplified for older branches by keeping the daemonScheduler but adding try/catch with retry scheduling directly in reloadAndSave().
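For older branches, the simplified variant of change 2 could look roughly like this (names and the retry delay are illustrative; the real code would keep its existing daemonScheduler wiring):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical backport shape: keep the daemonScheduler but wrap
// reloadAndSave() in try/catch and reschedule itself on failure.
public class BackportRetrySketch {
    final ScheduledExecutorService daemonScheduler;
    final Runnable reload; // may throw on transient MongoDB errors
    final CountDownLatch done = new CountDownLatch(1);

    BackportRetrySketch(ScheduledExecutorService scheduler, Runnable reload) {
        this.daemonScheduler = scheduler;
        this.reload = reload;
    }

    void reloadAndSave() {
        try {
            reload.run();
            done.countDown();
        } catch (RuntimeException e) {
            // Retry after a short delay instead of leaving stale or empty
            // state behind (the real backport might use 1s, as in change 2).
            daemonScheduler.schedule(this::reloadAndSave, 10, TimeUnit.MILLISECONDS);
        }
    }
}
```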
