
Persist zones #550

Open
ximon18 wants to merge 56 commits into main from persist-zones

Conversation

@ximon18
Member

@ximon18 ximon18 commented Mar 26, 2026

Status

Zone edits (e.g. via file edit and zone reload, or due to changes received via XFR) are persisted for both loaded and signed zones. On application restart, persisted zones are restored, and cascade zone status and dig AXFR appear to work as they did prior to application shutdown.

The code has had an initial review, which led to some desired changes:

  • @bal-e will extend the zonedata crate interface to enable moving some of the changes to Cascade itself.
  • Invoke save_now() instead of mark_dirty(). Persisting zone data to disk is not enough: if a power outage occurred between calling mark_dirty() and publication of the zone, the paths to the persisted files (which are recorded in state) would not have been written to disk, so save_now() should be invoked instead. We may actually want to invoke save_now() in the publication server itself to ensure all published-zone-related state is persisted, but that is out of scope for this PR. Update: It's not clear to me that this is actually needed.
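The ordering argument above can be sketched as follows. This is a minimal mock, not Cascade's actual API: ZoneState, record_path, save_now and restorable_after_crash are illustrative names, and the on-disk write is modelled as a flag.

```rust
// Hypothetical sketch: why recording persisted file paths must be
// followed by a synchronous state flush (save_now) before the zone
// can be considered durable across a power outage.

#[derive(Default)]
struct ZoneState {
    persisted_paths: Vec<String>, // paths to snapshot/delta files
    flushed: bool,                // has the state itself hit disk?
}

impl ZoneState {
    /// Record a newly persisted data file; the state is now dirty.
    fn record_path(&mut self, path: &str) {
        self.persisted_paths.push(path.to_string());
        self.flushed = false;
    }

    /// Synchronously write the state to disk (mocked here as a flag).
    fn save_now(&mut self) {
        self.flushed = true;
    }

    /// A restart can only restore the zone if the recorded paths
    /// survived the crash, i.e. if the state was flushed after the
    /// last path was recorded.
    fn restorable_after_crash(&self) -> bool {
        self.flushed
    }
}

fn main() {
    let mut state = ZoneState::default();
    // mark_dirty() alone would leave the state unflushed here:
    state.record_path("/var/lib/cascade/zones/example.org.snapshot");
    assert!(!state.restorable_after_crash());
    // save_now() makes the recorded paths crash-safe before publication.
    state.save_now();
    assert!(state.restorable_after_crash());
    println!("state flushed: {}", state.flushed);
}
```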

tl;dr

  • Extend src/persistence/persist.rs to write snapshots and deltas to disk (currently in DNS AXFR/IXFR wire format).
  • Extend src/persistence/restore.rs to restore zones by loading the zone snapshot into a zone replacer and passing the zone diffs one at a time to a zone patcher.
  • Extend src/persistence/ to store generated/restored diffs for use by IXFR out.
  • Extend ZoneState with two vecs of paths to persisted snapshot and delta files.
  • Extend the CLI to show the zone is restoring in cascade zone status output.
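The restore flow from the second bullet above can be illustrated with toy types; the real zone replacer and zone patcher operate on zone trees parsed from AXFR/IXFR wire format, not the string sets used here.

```rust
// Illustrative sketch of restore: load the snapshot into a "replacer",
// then feed the persisted deltas one at a time to a "patcher".

use std::collections::BTreeSet;

type Record = String;

/// Replaces the entire zone contents from a snapshot.
fn replace_from_snapshot(snapshot: &[Record]) -> BTreeSet<Record> {
    snapshot.iter().cloned().collect()
}

/// A delta removes old records and adds new ones, like an IXFR diff.
struct Delta {
    removed: Vec<Record>,
    added: Vec<Record>,
}

/// Applies one delta in place, as a zone patcher would.
fn patch(zone: &mut BTreeSet<Record>, delta: &Delta) {
    for r in &delta.removed {
        zone.remove(r);
    }
    for r in &delta.added {
        zone.insert(r.clone());
    }
}

fn main() {
    let snapshot: Vec<Record> =
        vec!["example.org. SOA serial=1".into(), "www A 192.0.2.1".into()];
    let mut zone = replace_from_snapshot(&snapshot);
    let deltas = [Delta {
        removed: vec!["example.org. SOA serial=1".into()],
        added: vec!["example.org. SOA serial=2".into(), "mail A 192.0.2.2".into()],
    }];
    // Deltas are applied one at a time, in serial order.
    for d in &deltas {
        patch(&mut zone, d);
    }
    assert!(zone.contains("example.org. SOA serial=2"));
    println!("restored {} records", zone.len());
}
```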

Known issues:

  • No support (yet) for condensing/compacting/purging old persisted deltas.
  • Binary snapshot/delta format is not inspectable by operators. Could add a `cascade debug` subcommand to render a snapshot/delta in XFR presentation format.
  • There's no "scheduling" or spreading out of zone restoration activity, as there is for refreshing; restoration happens as soon as possible.
  • There are no metrics for the restoration process (yet).
  • Snapshots and deltas are loaded entirely into memory, while they could be processed in a streaming manner instead.
  • The persistence location is hard-coded.
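For the first known issue above, one possible shape of a purge step (hypothetical, not implemented in this PR): once a snapshot exists at some serial, deltas whose range ends at or before that serial are never needed for restore, while the contiguous chain after it (e.g. 3..4, 4..5) must be kept intact for IXFR out.

```rust
// Hypothetical purge of obsolete persisted deltas; names and types
// are illustrative, not Cascade's actual persistence API.

/// A persisted delta file covering `from_serial`..`to_serial`.
#[derive(Debug, Clone, PartialEq)]
struct DeltaFile {
    from_serial: u32,
    to_serial: u32,
}

/// Keep only deltas that start at or after the snapshot's serial;
/// earlier ones are unreachable from the snapshot on restore.
fn purge_obsolete(snapshot_serial: u32, deltas: &[DeltaFile]) -> Vec<DeltaFile> {
    deltas
        .iter()
        .filter(|d| d.from_serial >= snapshot_serial)
        .cloned()
        .collect()
}

fn main() {
    let deltas = vec![
        DeltaFile { from_serial: 1, to_serial: 2 },
        DeltaFile { from_serial: 2, to_serial: 3 },
        DeltaFile { from_serial: 3, to_serial: 4 },
    ];
    // After a snapshot at serial 3, the 1..2 and 2..3 deltas are obsolete.
    let kept = purge_obsolete(3, &deltas);
    assert_eq!(kept, vec![DeltaFile { from_serial: 3, to_serial: 4 }]);
    println!("kept {} delta(s)", kept.len());
}
```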

Alternative snapshot/delta storage formats

A choice would need to be driven by requirements:

  • Do we need to support IXFR out at the loaded review nameserver? If so, is just one delta enough? Actually, do we even need to keep a loaded diff at all? A review hook would only need it for the next received zone update, not after a restart.
  • What do we tune for? Memory usage? Persistence speed? Restoration speed?

Some ideas:

  • Presentation XFR format: not done in this PR due to the pain of having to convert new base to old base to render in presentation format; also, parsing DNS records in presentation format is currently only properly supported by domain/cascade for entire zones, though one could use the "treat a single RR line as a zonefile" hack that the Stelline code uses.
  • (De)serialize Cascade's DiffData to JSON/CBOR/whatever. This would tie the storage format to the internal format: changing the internal format would require writing additional code to handle storage format migration, or supporting parsing the old format as well as the new one. A pro of this approach is being able to store additional metadata with the snapshot/delta, if needed.
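The migration concern in the second idea above can be sketched like this. The struct is hypothetical (Cascade's real DiffData is richer) and a real implementation would use serde rather than hand-rolling JSON; the point is that an explicit format_version field in the stored form is what makes later internal format changes survivable.

```rust
// Hand-rolled JSON serialization of a DiffData-like struct, stdlib
// only, purely to illustrate embedding a format version for migration.

struct DiffData {
    from_serial: u32,
    to_serial: u32,
    removed: Vec<String>,
    added: Vec<String>,
}

/// Render a list of records as a JSON array of strings.
fn quote_list(items: &[String]) -> String {
    let quoted: Vec<String> = items.iter().map(|s| format!("\"{}\"", s)).collect();
    format!("[{}]", quoted.join(","))
}

/// Serialize with an explicit format version so a future reader can
/// dispatch on it and migrate old stored diffs.
fn to_json(d: &DiffData) -> String {
    format!(
        "{{\"format_version\":1,\"from_serial\":{},\"to_serial\":{},\"removed\":{},\"added\":{}}}",
        d.from_serial,
        d.to_serial,
        quote_list(&d.removed),
        quote_list(&d.added)
    )
}

fn main() {
    let d = DiffData {
        from_serial: 1,
        to_serial: 2,
        removed: vec!["www A 192.0.2.1".into()],
        added: vec!["www A 192.0.2.9".into()],
    };
    let json = to_json(&d);
    assert!(json.contains("\"format_version\":1"));
    println!("{}", json);
}
```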

  • If you are changing Rust code or integration tests (Cargo.*, crates/, etc/, integration-tests/, src/):
    • Did you run the integration tests with act through the act-wrapper (as described in TESTING.md)?

@ximon18 ximon18 added this to the 0.1.0-beta1 milestone Mar 26, 2026
@ximon18 ximon18 added the enhancement New feature or request label Mar 26, 2026
@ximon18 ximon18 linked an issue Mar 30, 2026 that may be closed by this pull request
Contributor

@bal-e bal-e left a comment


@ximon18 and I had a discussion while I was writing this review, and we decided that I will add commits changing the PR to implement a new approach for the zone data storage state machine. I've left these comments here for us to come back to once I'm done; I also didn't review everything, so there will be more to come.

Comment thread crates/zonedata/src/storage/states.rs Outdated
Comment on lines +51 to +56

/// The index of the current loaded instance.
pub(super) curr_loaded_index: bool,

/// The index of the current signed instance.
pub(super) curr_signed_index: bool,
Contributor

Are these fields necessary, if this state is only used at the very beginning (where they would both be 0/false)?

Comment thread crates/zonedata/src/storage/states.rs Outdated
pub(super) curr_signed_index: bool,
}

pub struct UninitializedStorage {
Contributor

This type (and RestoringPersistedStorage below) need documentation, similar to the other types in this file. I'm happy to write it by adding to the PR.

assert!(
old_reviewer.loaded_index == self.curr_loaded_index,
"'old_reviewer' does not point to the current instance",
"'old_reviewer' does not point to the current instance: {} vs:{}",
Contributor

{} vs:{} -> {} vs {}?

let viewer = unsafe { ZoneViewer::new(self.data.clone(), true, true) };

let signed = unsafe { &*self.data.signed[0].get() };
let serial = signed.soa.as_ref().unwrap().rdata.serial;
Contributor

I don't think the serial number should be extracted here. Perhaps we need methods on LoadedZoneBuilt and SignedZoneBuilt that give you readers so you can observe the data, or just getters for the serial?

Comment on lines +335 to +338
assert!(
Arc::ptr_eq(&signed_built.data, &self.data),
"'built' is for a different zone"
);
Contributor

You should also check loaded_built. And the assert message should clarify which parameter it is referring to.

Comment thread src/zone/storage.rs Outdated
Comment on lines +110 to +116
let (transition, state) = if let ZoneDataStorage::Uninitialized(s) = state {
let s = s.initialize();
t.move_to(ZoneDataStorage::Passive(s));
transition(&mut self.state.storage.machine)
} else {
(t, state)
};
Contributor

I'd prefer the above commented-out initialize() function over this silent automatic conversion. Especially since it should only occur at startup and we can handle it outside this function.

Contributor

With the current setup, we might actually want to do this check in other places, but it is hard to tell.

Comment thread src/zone/storage.rs Outdated
Comment on lines +224 to +227
let (s, loaded_reviewer, signed_reviewer, viewer, serial) =
s.finish(loaded_built, signed_built);
self.state.storage.loaded_reviewer = loaded_reviewer;
self.state.storage.signed_reviewer = signed_reviewer;
Contributor

This code does not correctly handle the reviewer types. They should be passed back into the storage state machine, so it can guarantee that viewers only exist for the permitted 0/1 instances. We should discuss what the right adjustments to the storage state machine's flow are.

Comment thread crates/zonedata/src/storage/mod.rs Outdated
Comment on lines +33 to +36
/// The zone is in an empty initial uninitialized state pending possible
/// restoration from persisted state or initialization directly to an
/// empty passive state.
Uninitialized(UninitializedStorage),
Contributor

It might be better to omit this state entirely and only have a Restoring state, which can then decide to proceed with or skip restoring. Restoring might already need that functionality, e.g. if the previously persisted instances could not be found.
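A minimal sketch of this suggestion (hypothetical and heavily simplified relative to the real ZoneDataStorage): a single Restoring state whose transition either restores from persisted files or, when nothing was persisted or it could not be found, skips straight to Passive, making a separate Uninitialized state unnecessary.

```rust
// Hypothetical state machine shape with Restoring subsuming the
// Uninitialized case; names and fields are illustrative only.

enum Storage {
    Restoring { persisted_paths: Vec<String> },
    Passive { serial: Option<u32> },
}

impl Storage {
    /// Every zone starts in Restoring, even if nothing was persisted.
    fn new(persisted_paths: Vec<String>) -> Self {
        Storage::Restoring { persisted_paths }
    }

    /// Proceed with restoration, or skip it when there is nothing
    /// (or nothing usable) to restore.
    fn proceed(self) -> Self {
        match self {
            Storage::Restoring { persisted_paths } if persisted_paths.is_empty() => {
                // Nothing to restore: behave like a fresh, empty zone.
                Storage::Passive { serial: None }
            }
            Storage::Restoring { persisted_paths } => {
                // Stand-in for actually loading the snapshot + deltas.
                let _ = persisted_paths;
                Storage::Passive { serial: Some(1) }
            }
            other => other,
        }
    }
}

fn main() {
    let fresh = Storage::new(vec![]).proceed();
    assert!(matches!(fresh, Storage::Passive { serial: None }));
    let restored = Storage::new(vec!["example.org.snapshot".into()]).proceed();
    assert!(matches!(restored, Storage::Passive { serial: Some(_) }));
    println!("ok");
}
```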

Comment thread crates/zonedata/src/storage/mod.rs Outdated
Uninitialized(UninitializedStorage),

/// The zone is being restored from persisted state.
Restoring(RestoringPersistedStorage),
Contributor

The enum variant name should match the underlying name, for consistency. I think the underlying type should be RestoringStorage.

Comment thread crates/zonedata/src/storage/mod.rs Outdated
@@ -86,9 +94,6 @@ pub enum ZoneDataStorage {
impl ZoneDataStorage {
/// Construct a new [`ZoneDataStorage`].
pub fn new() -> (Self, LoadedZoneReviewer, SignedZoneReviewer, ZoneViewer) {
Contributor

Perhaps we want to change this function so it no longer returns reviewers? That might be easier all around.

ximon18 added 28 commits May 8, 2026 13:36
Currently fails due to an unrelated TSIG bug causing dig to fail.
- Merges in the ixfr-out and full-signer-update-state branches as these
were needed for local testing, but should be obsoleted / synced with
main when PRs #631 ("Save last serial and key tags in zone state") and
#605 ("Add IXFR out support") get merged.
- Extends the persist-restore system test to cover more cases. Should
perhaps be split out into separate smaller tests.
- Actually adds restored diffs to the zone storage to be served by IXFR
out. Will need updating to match the changes/fixes that have since been
made in PR #605.
- Fixes an issue where persisted diffs from multiple SOA serials, e.g.
1..2..3 would be condensed on restore so that only a single IXFR diff
from 1..3 would be available instead of two diffs from 1..2 and 2..3
being available.
- Clears the set of known persisted data file paths for a zone if any of
those files are missing or cannot be parsed during restoration.
As `step.if` doesn't seem to be honoured by nektos/act.
Not via TCP but via single SOA response, per RFC 1995. Also adds a
system test using dig.
The signed diff is not available when the loaded diff is restored but
should be stored with the corresponding loaded diff, so make the signed
diff an Option to be made Some as soon as it is restored.
- Don't hold the zone state lock while restoring as it blocks the zone
status command.
- Log restoring started at INFO level as well as restoring complete.
- Only log restoring complete if restoring actually happened.
- Log the zone being restored at INFO level, not just TRACE level.
Otherwise zone status of a restored zone doesn't show that it is
published.
Validated with the ixfr-out system test from PR #650.
@ximon18 ximon18 marked this pull request as ready for review May 12, 2026 14:53

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Zone persistence

3 participants