HBASE-29435 Add MapreduceRestoreSnapshotHelper.java to guard against accidental data loss#8248
Conversation
7f5b2a8 to
498e80b
Compare
| // Guard: ensure restoreDir is not under production data directory (HBASE-29435) | ||
| MapReduceArchiveGuard.validateNotProductionDataPath(rootDir, restoreDir); | ||
|
|
||
| RestoreSnapshotHelper.copySnapshotForScanner(conf, fs, rootDir, restoreDir, snapshotName); |
There was a problem hiding this comment.
Right, these are the APIs that are being used across mapreduce module and server module (for the Snapshot procedure code), lets separate these as part of this change
… data Introduce MapreduceRestoreSnapshotHelper, a MapReduce-local fork of RestoreSnapshotHelper, so the MR module can layer additional safety validation on snapshot restoration without affecting Master/server-side restore and clone procedures, which legitimately operate against the production hbase.rootdir. In copySnapshotForScanner(), the path-prefix check is tightened to also reject the exact-match case (restoreDir == rootDir), and the error message is rewritten to call out the data-loss risk, the remediation (use a temporary directory outside hbase.rootdir), and HBASE-29435 for traceability. The guard now also logs at ERROR level before throwing so ops can grep blocked attempts in MR job logs. Migrate all hbase-mapreduce call sites of RestoreSnapshotHelper.copySnapshotForScanner to the new MR-local helper: - TableSnapshotInputFormatImpl.setInput() - MultiTableSnapshotInputFormatImpl.setInput() - VerifyReplication.restoreSnapshotForPeerCluster() - TestTableSnapshotInputFormat (the test that directly invoked the helper) CompactionTool is intentionally left untouched as it operates on production data by design. Add TestMapreduceRestoreSnapshotHelper covering the migration and validation logic. The hbase-mapreduce module now depends on hbase-procedure's test-jar so MR tests can use ProcedureTestingUtility. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
498e80b to
2c76178
Compare
| * Restore the on-disk table to a specified snapshot state. | ||
| * @return the set of regions touched by the restore operation | ||
| */ | ||
| public MapreduceRestoreSnapshotHelper.RestoreMetaChanges restoreHdfsRegions() throws IOException { |
There was a problem hiding this comment.
We dont need these RestoreMetaChanges for map reduce right? Are we using it somewhere ?
| private StoreFileInfo restoreStoreFile(final Path familyDir, final RegionInfo regionInfo, | ||
| final SnapshotProtos.SnapshotRegionManifest.StoreFile storeFile, final boolean createBackRef, | ||
| final StoreFileTracker tracker) throws IOException { | ||
| String hfileName = storeFile.getName(); | ||
| StoreFileInfo info = null; | ||
| if (HFileLink.isHFileLink(hfileName) || StoreFileInfo.isMobFileLink(hfileName)) { | ||
| HFileLink hfileLink = tracker.createFromHFileLink(hfileName, createBackRef); | ||
| info = new StoreFileInfo(conf, fs, new Path(familyDir, hfileName), hfileLink); | ||
| return info; | ||
| } else if (StoreFileInfo.isReference(hfileName)) { | ||
| return restoreReferenceFile(familyDir, regionInfo, storeFile, tracker); | ||
| } else { | ||
| HFileLink hfileLink = tracker.createAndCommitHFileLink(regionInfo.getTable(), | ||
| regionInfo.getEncodedName(), hfileName, createBackRef); | ||
| return new StoreFileInfo(conf, fs, new Path(familyDir, HFileLink | ||
| .createHFileLinkName(regionInfo.getTable(), regionInfo.getEncodedName(), hfileName)), | ||
| hfileLink); | ||
| } | ||
| } |
There was a problem hiding this comment.
I understand that the jira mentions to have a new set of helper classes, but on the second thoughts core logics like these should still be same across mapreduce and procedures, I think it is good to have both MapReduceRestoreSnapshotHelper, RestoreSnapshotHelper
but lets not copy core logics and call the RestoreSnapshotHelper itself from MapReduceRestoreSnapshotHelper and send in the MapReduceHFileArchiver as a parameter to the methods which involve HFileArchiving. @SwaraliJoshi
sairampola
left a comment
There was a problem hiding this comment.
Thanks for tackling this — the failure mode is real and nasty (silent, TTL-delayed production data loss), and the diagnosis of the uncovered exact-match case is correct. A few things I'd want addressed before this is mergeable, grouped by severity.
Blocking (CI will fail on each independently)
- Missing Apache license header —
hbase-mapreduce/.../util/MapreduceHFileArchiver.javastarts directly atpackage …; the ASF header block was dropped.apache-rat:checkwill fail. (MapreduceRestoreSnapshotHelperkept its header.) - Missing
@InterfaceAudienceannotation —MapreduceHFileArchiveris a public class with no audience annotation; the originalHFileArchiverhas@InterfaceAudience.Privateand checkstyle enforces it. The import was dropped too. - Line length > 100 cols — several lines in both new files exceed the limit (e.g.
MapreduceRestoreSnapshotHelperL170, L330, L782, L823;MapreduceHFileArchiverL42, L107, L178), mostly from expanding nested-type names to fully-qualified form.mvn spotless:apply+ dropping the redundant self-qualifier will fix it. - Unused imports in the new test —
TestMapreduceRestoreSnapshotHelperhas several unused imports (MetaTableAccessor,SnapshotManifest,MonitoredTask,HRegion,FSTableDescriptors,WALSplitUtil,ForeignExceptionDispatcher,MergeTableRegionsProcedure,Mockito), which fails checkstyleUnusedImports.
Design / correctness
-
Most of the new validation is unreachable, and the
HFileArchiverfork may be unnecessary. IncopySnapshotForScanner, the existing check already rejects the sub-directory case:if (restoreDir.toUri().getPath().startsWith(rootDir.toUri().getPath() + "/")) { throw new IllegalArgumentException("Restore directory cannot be a sub directory ..."); }
The new block re-tests that same
startsWith(rootPath + "/")condition, so that half is dead — the only genuinely new behavior isrestorePath.equals(rootPath)(exact match). Suggest collapsing the two blocks into one guard with the better data-loss message.More broadly: the guard runs before the helper is constructed, and
MapreduceHFileArchiverstill archives/deletes exactly as the original does. Since the dangerous archival of production HFiles is only reachable whenrestoreDir == rootDir(or under it) — which the guard now rejects up front — it's not clear the ~570-line archiver fork buys any additional safety. Givenhbase-mapreducealready depends onhbase-server(so the originalHFileArchiver/RestoreSnapshotHelperare on the classpath), could the fix be just the guard on the MR call path, droppingMapreduceHFileArchiverentirely? Forking ~1,400 lines of security-critical code that will silently diverge from upstream is a significant maintenance cost, and I'd want a strong reason before taking it on. -
Guard comparison is authority-blind, weakest on the
VerifyReplicationpeer path. The checks usetoUri().getPath(), which strips scheme + authority. For the peer-cluster flow (different NameNode, operator-supplied location args) safety rests entirely on the filesystem-URI equality check; trailing slashes and..traversal can also slip past a rawstartsWith. The originalHFileArchiver.archiveRecoveredEditsassertedfsscheme consistency before archiving — that's not carried here. Suggest comparingfs.makeQualified(restoreDir)vsfs.makeQualified(rootDir)(normalizes + includes authority) and asserting the passedfsmatchesrootDir's filesystem. -
The headline behavior isn't tested. The only
assertThrowscoversSnapshotTTLExpiredException(a pre-existing path). There's no assertion that the new guard throws forrestoreDir == rootDiror a true sub-dir.testNoHFileLinkInRootDirusesrestoreDir = "/hbase/.tmp-restore"against a deepgetDefaultRootDirPath(), so it exercises the pass-through branch, not the throw. These can be fast unit tests (the guard throws before any FS work):- exact match (
restoreDir == rootDir) throws - sub-dir (
new Path(rootDir, "data/.tmp")) throws - sibling dir is allowed (negative contract)
- different filesystem throws
Asserting on a distinguishing substring of each message keeps the cases from passing on the wrong branch.
- exact match (
-
PR description doesn't match the diff. It states
hbase-mapreduce/pom.xmlwas changed to add anhbase-proceduretest-jar "so MR tests can useProcedureTestingUtility," butpom.xmlisn't in the diff, the test never referencesProcedureTestingUtility, and it compiles against the current pom. The "Files Changed" table also omitsMapreduceHFileArchiver.java. Worth correcting so reviewers know the real surface area. -
Leftover dead code from the copy.
archiveExecutorinMapreduceHFileArchiveris now an orphaned field (its users were removed).copySnapshotForScannerispublicbut returns the now-privateRestoreMetaChanges, and all call sites discard it — returningvoidwould be cleaner. The test also carries several unused private helpers (createAndAssertSnapshot,flipCompactions,getMasterProcedureExecutor,createSnapshotMock,verifyRestore,getReferredToFile).
Suggestions
testCopyExpiredSnapshotForScannersleeps up to ~60s of wall-clock in aMediumTeststest, which risks exceeding the category timeout on a busy CI agent. Injecting a manualEnvironmentEdgewould make it deterministic and sub-second.LOG.error(message)immediately beforethrow new IllegalArgumentException(message)double-logs the same text; the sibling checks throw without it.- A couple of
fs.delete(...)boolean returns are ignored in the restore tree — low impact (throwaway dir), but aWARNonfalsewould match surrounding conventions.
What's good
- The fail-fast guard logs at ERROR and throws with a clear remediation + JIRA reference before any filesystem mutation.
- The archiver's core data-loss guarantee (per-file failure →
failuresqueue →FailedArchiveException) is preserved intact through the fork. TestTableSnapshotInputFormat.testReadFromRestoredSnapshotViaMRis a solid contract test (restore → scan → succeed, then delete restore dir → assert failure).
Overall: the safety goal looks achievable with a small, well-tested guard; I'd lean toward minimizing (or avoiding) the fork and adding direct tests for the throw paths. Thanks again for catching this.
Problem
When a MapReduce snapshot-scanning job is misconfigured with restoreDir pointing to (or under) the production HBase data directory (hbase.rootdir/data/), RestoreSnapshotHelper.copySnapshotForScanner() creates HFileLink reference files directly inside the live data directory. When the MR job ends and cleanup triggers HFileArchiver, production HFiles get moved into the archive, where HFileCleaner (default TTL: 5 minutes) permanently deletes them — causing silent, irreversible data loss.
The existing same-filesystem and sub-directory checks in RestoreSnapshotHelper did not catch the exact-match case (restoreDir == rootDir), and the resulting error message gave no hint about the data-loss risk.
JIRA HBASE-29435
Current Impact (Without Fix)
A single misconfigured restoreDir parameter (e.g., set to hbase.rootdir or hbase.rootdir/data/...) can cause permanent production data loss.
The failure mode is silent — no exception, no warning; data simply disappears after the HFileCleaner TTL elapses.
Every MapReduce snapshot-scanning path is exposed: TableSnapshotInputFormat, MultiTableSnapshotInputFormat, and VerifyReplication (peer cluster).
Stricter validation cannot simply be added to RestoreSnapshotHelper itself — Master-side procedures (RestoreSnapshotProcedure, CloneSnapshotProcedure) legitimately invoke that helper against the production hbase.rootdir, and tightening it would break those flows.
Solution
MapReduce-local fork of the helper. Introduce MapreduceRestoreSnapshotHelper in hbase-mapreduce, a deliberate copy of RestoreSnapshotHelper. This isolation lets us layer MR-specific safety validation without affecting Master / server-side restore and clone procedures, which legitimately operate against the production hbase.rootdir.
Tighter path validation in copySnapshotForScanner(). Reject not just the sub-directory case but also the exact-match case (restoreDir == rootDir):
The error message now spells out the data-loss risk, the remediation, and the JIRA for traceability. An LOG.error(message) before the throw makes blocked attempts grep-friendly in MR job logs.
Intentionally left untouched: CompactionTool — operates on production data by design.
Impact After Fix
Files Changed