Harden pipeline interpreter state reload against transient MongoDB errors #25750

@patrickmann

Description

Problem

PipelineInterpreterStateUpdater.reloadAndSave() can silently replace a valid pipeline state with an empty one. This happens because MongoDbRuleService.loadAll() and MongoDbPipelineService.loadAll() catch MongoException and return Collections.emptySet() instead of propagating the error. A transient MongoDB hiccup during any event-triggered reload (startup, shutdown, or normal operation) produces a structurally valid but semantically empty PipelineInterpreter.State — zero pipelines, zero stream connections. Messages processed in this window bypass all pipeline rules and land in the default stream.

The issue is exacerbated during shutdown because ClusterEventPeriodical (which can trigger reloads via serverEventBus) keeps running until after the process buffer is drained, and during startup because unconsumed cluster events trigger reloads during the MongoDB contention window.

reloadAndSave() has no exception handling, no validation of the resulting state, and no retry mechanism. Neither PipelineInterpreter.process() nor IlluminateMessageProcessor.process() null-checks the result of getLatestState().

Proposed Fix

Four independent changes, ordered by impact:

1. Let MongoException propagate from loadAll()

Remove the try/catch(MongoException) in MongoDbRuleService.loadAll() and MongoDbPipelineService.loadAll() (and similar methods: loadBySourcePattern, loadAllByTitle, loadAllByScope, loadNamed). Let exceptions propagate so that buildState() fails loudly instead of producing empty state. MongoDbPipelineStreamConnectionsService.loadAll() already does not catch — this makes all three services consistent.

Other callers (REST resources, content packs, migrations) get a 500 on transient MongoDB failure, which is correct and preferable to silently empty results.
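As a minimal illustration of change 1 (not the actual Graylog code; the service bodies and the `MongoException` stand-in are simplified), the difference between the current swallow-and-return-empty pattern and the proposed propagation:

```java
import java.util.Collections;
import java.util.Set;

// Simplified sketch of change 1. "MongoException" is a stand-in for
// com.mongodb.MongoException; the fetch logic is hypothetical.
public class LoadAllSketch {
    static class MongoException extends RuntimeException {}

    // Current behavior: a transient error is swallowed and the caller
    // receives a structurally valid but empty set.
    static Set<String> loadAllCurrent(boolean mongoAvailable) {
        try {
            return fetchFromMongo(mongoAvailable);
        } catch (MongoException e) {
            return Collections.emptySet(); // silent data loss for buildState()
        }
    }

    // Proposed behavior: remove the catch so buildState() fails loudly and
    // the previous (valid) state stays in place.
    static Set<String> loadAllProposed(boolean mongoAvailable) {
        return fetchFromMongo(mongoAvailable);
    }

    private static Set<String> fetchFromMongo(boolean mongoAvailable) {
        if (!mongoAvailable) {
            throw new MongoException();
        }
        return Set.of("pipeline-a", "pipeline-b");
    }
}
```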

2. Migrate state reload to a SystemJob with retry

New PipelineInterpreterStateReloadJob following the PipelineMetadataUpdateJob pattern:

  • On success: sets new state via stateUpdater.updateState(newState), returns SystemJobResult.success()
  • On exception: logs warning, returns SystemJobResult.withRetry(1s, unlimited)

PipelineInterpreterStateUpdater changes:

  • Event handlers submit the system job via SystemJobManager instead of scheduling on daemonScheduler
  • Constructor keeps the synchronous initial load (the system scheduler may not be ready during Guice injection) but swaps the order: load state before registering on serverEventBus, closing the existing race window where an async event could trigger a reload before the initial load completes
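A rough sketch of the job's control flow (the result type, updater, and buildState() supplier here are stand-ins, not the real SystemJob interfaces):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical approximation of PipelineInterpreterStateReloadJob.
// The real job would return SystemJobResult.success() / withRetry(...).
public class StateReloadJobSketch {
    enum JobResult { SUCCESS, RETRY }

    static final class StateUpdater {
        final AtomicReference<Object> latest = new AtomicReference<>();
        void updateState(Object newState) { latest.set(newState); }
    }

    final StateUpdater updater;
    final Supplier<Object> buildState; // may throw on transient MongoDB errors

    StateReloadJobSketch(StateUpdater updater, Supplier<Object> buildState) {
        this.updater = updater;
        this.buildState = buildState;
    }

    // On success: swap in the new state. On any exception: log a warning and
    // ask the job manager to retry; the previous valid state stays in place.
    JobResult execute() {
        try {
            updater.updateState(buildState.get());
            return JobResult.SUCCESS;
        } catch (RuntimeException e) {
            System.err.println("State reload failed, will retry: " + e);
            return JobResult.RETRY;
        }
    }
}
```

The key property: a failed reload never publishes anything, so the interpreter keeps serving the last good state until a retry succeeds.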

3. Guard against empty-state overwrite

updateState() refuses to replace a non-empty state with an empty one (logs at WARN). Defense-in-depth for any remaining path where buildState() succeeds with empty data.
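A minimal sketch of the guard, assuming a simplified State with only the fields needed here (the real PipelineInterpreter.State carries pipelines and stream connections):

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical guard for change 3; State is a stand-in for
// PipelineInterpreter.State.
public class EmptyStateGuardSketch {
    record State(Set<String> pipelines) {
        boolean isEmpty() { return pipelines.isEmpty(); }
    }

    final AtomicReference<State> latest = new AtomicReference<>();

    // Refuse to replace a non-empty state with an empty one.
    void updateState(State newState) {
        State current = latest.get();
        if (newState.isEmpty() && current != null && !current.isEmpty()) {
            System.err.println("WARN: refusing to replace non-empty pipeline state with an empty one");
            return;
        }
        latest.set(newState);
    }
}
```

Note that a legitimately empty initial state (e.g. a fresh install with no pipelines) is still accepted; only the non-empty-to-empty transition is blocked.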

4. Null safety in PipelineInterpreter.process() and IlluminateMessageProcessor.process()

If getLatestState() returns null, pass messages through with a warning log instead of throwing an NPE. Last-resort defense only.
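The shape of the guard, heavily simplified (messages are modeled as a list of strings and the state as an Object; the real processors operate on Graylog's Messages type):

```java
import java.util.List;

// Hypothetical sketch of the null guard for change 4.
public class NullStateGuardSketch {
    static List<String> process(List<String> messages, Object latestState) {
        if (latestState == null) {
            // Last-resort defense: pass messages through untouched rather
            // than throwing a NullPointerException.
            System.err.println("WARN: pipeline state not available, passing messages through unprocessed");
            return messages;
        }
        // ... normal pipeline evaluation would happen here ...
        return messages;
    }
}
```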

Priority

Medium-high. The silent degradation path exists on any event-triggered reload when MongoDB has a transient error — not just during startup/shutdown. #24382 reduces the probability during startup by not replaying historic cluster events, but does not close the underlying hole. The fix is small and low-risk.

Backport Consideration

Changes 1, 3, and 4 are trivially backportable (~15 lines changed across 4 files, no new dependencies). Change 2 depends on the SystemJob infrastructure and would need to be simplified for older branches by keeping the daemonScheduler but adding try/catch with retry scheduling directly in reloadAndSave().
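For older branches, the simplified variant of change 2 could look roughly like this (names and the retry delay are illustrative; the real code would keep its existing daemonScheduler wiring):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical backport shape: keep the daemonScheduler but wrap
// reloadAndSave() in try/catch and reschedule itself on failure.
public class BackportRetrySketch {
    final ScheduledExecutorService daemonScheduler;
    final Runnable reload; // may throw on transient MongoDB errors
    final CountDownLatch done = new CountDownLatch(1);

    BackportRetrySketch(ScheduledExecutorService scheduler, Runnable reload) {
        this.daemonScheduler = scheduler;
        this.reload = reload;
    }

    void reloadAndSave() {
        try {
            reload.run();
            done.countDown();
        } catch (RuntimeException e) {
            // Retry after a short delay instead of leaving stale or empty
            // state behind (the real backport might use 1s, as in change 2).
            daemonScheduler.schedule(this::reloadAndSave, 10, TimeUnit.MILLISECONDS);
        }
    }
}
```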
