[#10474] docs: Add concurrency-control design for multi-node Gravitino (TreeLock)#11888

Open: yuqi1129 wants to merge 1 commit into apache:main from yuqi1129:docs/treelock-concurrency-design

yuqi1129 (Contributor) commented Jul 2, 2026

What changes were proposed in this pull request?

Adds a design document (design-docs/treelock-necessity-and-concurrency-design.md, with rendered diagrams) that:

  • Explains what TreeLock protects today and why it is per-JVM, so it stops protecting anything across servers under HA.
  • Shows that every catalog metadata write touches two stores (the external catalog and the Gravitino store) with no shared transaction. For external-backed catalogs, the external system is the source of truth and the Gravitino store repairs itself on read (import).
  • Compares two directions — (1) remove the lock and move correctness into the shared database (OCC + conditional inserts) vs (2) a cross-node lock — on correctness, performance, maintainability, and operational cost, and reviews how comparable systems solve the same dual-write problem.
  • Concludes with Direction 1 (remove/shrink TreeLock to a small in-process lock) and a phased plan.

Rendered diagrams are embedded under design-docs/images/treelock-*.png.

Why are the changes needed?

#10474 asks for a concrete proposal on TreeLock's limitations for HA. This document provides the analysis and a recommended direction before any code change.

Related to #10474

Does this PR introduce any user-facing change?

No. Documentation only.

How was this patch tested?

Not applicable — documentation only. Diagrams were rendered from their .mmd sources with mermaid-cli and embedded as PNGs.

…avitino

Analyzes what TreeLock protects today, why it breaks under HA, and compares
two directions (remove the lock + DB-level OCC vs. a cross-node lock). Concludes
with the remove/shrink direction and a phased plan. Includes rendered diagrams
under design-docs/images and their .mmd sources under design-docs/diagrams.
Copilot AI review requested due to automatic review settings July 2, 2026 14:30

Copilot AI (Contributor) left a comment

Pull request overview

Adds a new design document analyzing why the current in-process TreeLock does not provide cross-node safety in HA deployments, compares “DB-driven correctness (OCC + conditional inserts)” vs “distributed locking”, and recommends Direction 1 (move correctness to the shared database, then shrink TreeLock to a small per-entity in-process helper).

Changes:

  • Added a detailed HA concurrency-control design doc for TreeLock limitations and the dual-write model (external catalog + Gravitino store).
  • Documented a phased proposal centered on OCC retries scoped to the relational store layer, always-increasing versioning, and parent-aliveness conditional inserts.
  • Embedded/linked rendered diagrams to illustrate write flows, import/read-repair, and tradeoffs.
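
The two database-side mechanisms named in this overview, OCC retries on an always-increasing version and conditional inserts that check the parent is still alive, can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: `VersionedRow`, `SketchStore`, and `OccWriter` are hypothetical names, not Gravitino classes, and each `synchronized` method stands in for a single atomic SQL statement against the shared database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical row: a value, a version that only ever increases, and a soft-delete flag. */
final class VersionedRow {
  final String value;
  final long version;
  final boolean deleted;

  VersionedRow(String value, long version, boolean deleted) {
    this.value = value;
    this.version = version;
    this.deleted = deleted;
  }
}

/** In-memory stand-in for the shared store; each synchronized method models one atomic statement. */
final class SketchStore {
  private final Map<String, VersionedRow> rows = new HashMap<>();

  synchronized VersionedRow get(String key) {
    return rows.get(key);
  }

  /** Unconditional insert, used only to seed a top-level row in this sketch. */
  synchronized void putRoot(String key, String value) {
    rows.put(key, new VersionedRow(value, 1, false));
  }

  /** Models: UPDATE ... SET value = ?, version = version + 1 WHERE key = ? AND version = ?. */
  synchronized boolean updateIfVersion(String key, long expectedVersion, String newValue) {
    VersionedRow current = rows.get(key);
    if (current == null || current.deleted || current.version != expectedVersion) {
      return false; // another writer got there first, or the row is gone
    }
    rows.put(key, new VersionedRow(newValue, expectedVersion + 1, false));
    return true;
  }

  /** Models: INSERT ... WHERE the parent row exists and is not deleted. */
  synchronized boolean insertIfParentAlive(String parentKey, String key, String value) {
    VersionedRow parent = rows.get(parentKey);
    if (parent == null || parent.deleted || rows.containsKey(key)) {
      return false;
    }
    rows.put(key, new VersionedRow(value, 1, false));
    return true;
  }

  /** Soft delete that still bumps the version, so stale writers fail their version check. */
  synchronized boolean softDelete(String key) {
    VersionedRow current = rows.get(key);
    if (current == null || current.deleted) {
      return false;
    }
    rows.put(key, new VersionedRow(current.value, current.version + 1, true));
    return true;
  }
}

/** OCC retry loop: re-read, re-apply the change, and retry when the version check fails. */
final class OccWriter {
  static boolean updateWithRetry(
      SketchStore store, String key, UnaryOperator<String> change, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      VersionedRow seen = store.get(key);
      if (seen == null || seen.deleted) {
        return false; // nothing left to update
      }
      if (store.updateIfVersion(key, seen.version, change.apply(seen.value))) {
        return true;
      }
    }
    return false; // gave up after maxAttempts conflicts
  }
}
```

In this sketch no lock is shared between writers: a writer holding a stale version simply loses its conditional update and retries, and an insert under a deleted parent is rejected by the store itself. That property is what lets the check work across servers, since it depends only on the shared store and not on any per-JVM state.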

```
/metalake/cat/db/t1 READ (parent write-locked) (parent write-locked)
```

This is built on `LockManager`, which keeps an in-memory tree of `TreeLockNode`s (each one wraps a `ReentrantReadWriteLock`), plus reference counting, a background thread that removes unused nodes, and another background thread that checks for deadlocks inside the same JVM. It is about 700 lines of code in total.
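
As a rough illustration of the structure described above, the following sketch keeps an in-memory tree of nodes, each wrapping a `ReentrantReadWriteLock`. It is a simplification under stated assumptions, not the `LockManager` code: `SketchLockNode` and `SketchTreeLock` are hypothetical names, the rule "read-lock every ancestor, then lock the target in the requested mode" is assumed here, and reference counting, node cleanup, and deadlock detection are omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical, simplified stand-in for a lock-tree node. */
final class SketchLockNode {
  final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  final Map<String, SketchLockNode> children = new ConcurrentHashMap<>();

  /** Returns the child with this name, creating it on first use. */
  SketchLockNode child(String name) {
    return children.computeIfAbsent(name, n -> new SketchLockNode());
  }
}

/** Hypothetical tree lock: the whole tree lives on this JVM's heap. */
final class SketchTreeLock {
  final SketchLockNode root = new SketchLockNode();

  /**
   * Read-locks every ancestor from the root down, then locks the target in the
   * requested mode. Returns the held locks so the caller can release them.
   */
  Deque<Lock> lock(String path, boolean write) {
    List<String> names = new ArrayList<>();
    for (String part : path.split("/")) {
      if (!part.isEmpty()) {
        names.add(part);
      }
    }
    Deque<Lock> held = new ArrayDeque<>();
    SketchLockNode node = root;
    for (int i = 0; i <= names.size(); i++) {
      boolean isTarget = (i == names.size());
      Lock l = (isTarget && write) ? node.lock.writeLock() : node.lock.readLock();
      l.lock();
      held.push(l); // most recently acquired lock sits at the head
      if (!isTarget) {
        node = node.child(names.get(i));
      }
    }
    return held;
  }

  /** Releases in reverse acquisition order: target first, root last. */
  void unlock(Deque<Lock> held) {
    while (!held.isEmpty()) {
      held.pop().unlock();
    }
  }
}
```

Because every `SketchLockNode` is an ordinary heap object, a second server process would build its own independent tree, so a write lock taken on one node of the cluster blocks nothing on another. That is the per-JVM limitation the design document is concerned with.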

### Not every catalog has an external system that decides the winner

Whether the external system can act as the judge depends on the catalog. In the code this is the `managedStorage` capability (`Capability` in `core/.../connector/capability/Capability.java`; the default returns "managed" only for functions). The catalogs split into two groups:
Comment on lines +98 to +99
![Two-store write and the crash gap](images/treelock-two-store-write.png)

github-actions Bot commented Jul 2, 2026

Code Coverage Report

| Scope | Coverage | Status |
| --- | --- | --- |
| Overall Project | 67.48% | 🟢 |
| Files changed | No Java source files changed | - |

| Module | Coverage | Status |
| --- | --- | --- |
| aliyun | 1.72% | 🔴 |
| api | 46.82% | 🟢 |
| authorization-common | 85.96% | 🟢 |
| aws | 42.04% | 🟢 |
| azure | 2.47% | 🔴 |
| catalog-common | 10.4% | 🔴 |
| catalog-fileset | 80.23% | 🟢 |
| catalog-glue | 66.91% | 🟢 |
| catalog-hive | 79.42% | 🟢 |
| catalog-jdbc-clickhouse | 80.2% | 🟢 |
| catalog-jdbc-common | 44.22% | 🟢 |
| catalog-jdbc-doris | 80.28% | 🟢 |
| catalog-jdbc-hologres | 54.03% | 🟢 |
| catalog-jdbc-mysql | 79.23% | 🟢 |
| catalog-jdbc-oceanbase | 80.91% | 🟢 |
| catalog-jdbc-postgresql | 82.29% | 🟢 |
| catalog-jdbc-starrocks | 78.51% | 🟢 |
| catalog-kafka | 77.01% | 🟢 |
| catalog-lakehouse-generic | 58.53% | 🟢 |
| catalog-lakehouse-hudi | 79.1% | 🟢 |
| catalog-lakehouse-iceberg | 85.86% | 🟢 |
| catalog-lakehouse-paimon | 84.25% | 🟢 |
| catalog-model | 77.72% | 🟢 |
| cli | 44.51% | 🟢 |
| client-java | 78.01% | 🟢 |
| common | 50.17% | 🟢 |
| core | 82.59% | 🟢 |
| filesystem-hadoop3 | 77.3% | 🟢 |
| flink | 0.0% | 🔴 |
| flink-common | 47.09% | 🟢 |
| flink-runtime | 0.0% | 🔴 |
| gcp | 14.12% | 🔴 |
| hadoop-auth | 66.67% | 🟢 |
| hadoop-common | 12.7% | 🔴 |
| hive-metastore-common | 53.29% | 🟢 |
| iceberg-common | 58.3% | 🟢 |
| iceberg-rest-server | 74.01% | 🟢 |
| idp-basic | 85.71% | 🟢 |
| integration-test-common | 0.0% | 🔴 |
| jobs | 66.17% | 🟢 |
| lance-common | 20.81% | 🔴 |
| lance-rest-server | 64.84% | 🟢 |
| lineage | 53.02% | 🟢 |
| optimizer | 83.24% | 🟢 |
| optimizer-api | 21.95% | 🔴 |
| server | 85.96% | 🟢 |
| server-common | 74.62% | 🟢 |
| spark | 28.57% | 🔴 |
| spark-common | 46.01% | 🟢 |
| trino-connector | 40.29% | 🟢 |
