diff --git a/docs/src/partitioning-spec.md b/docs/src/partitioning-spec.md index 9f8471fe3..9cedfb8c8 100644 --- a/docs/src/partitioning-spec.md +++ b/docs/src/partitioning-spec.md @@ -211,16 +211,14 @@ The column name is `partition_field_{i}` where `{i}` is the partition field's `f This naming convention avoids potential conflicts with user-defined column names. When a new partition spec version is defined, the `__manifest` table schema is updated accordingly to include any new partition columns. -| Column | Type | Description | -|------------------------------|----------|-----------------------------------------------------------------------------| -| `object_id` | `string` | Full namespace path with `$` separator (existing) | -| `object_type` | `string` | `"namespace"` or `"table"` (existing) | -| `metadata` | `string` | JSON-encoded metadata/properties (existing) | -| `read_version` | `uint64` | Table version for reads (optional, see [Transaction](#transaction)) | -| `read_branch` | `string` | Table branch for reads (optional, see [Transaction](#transaction)) | -| `read_tag` | `string` | Table tag for reads (optional, see [Transaction](#transaction)) | -| `partition_field_{field_id}` | `` | Partition value for the field (nullable, inherited from parent namespaces) | -| ... | ... | Additional partition field columns as needed | +| Column | Type | Description | +|------------------------------|----------|----------------------------------------------------------------------------| +| `object_id` | `string` | Full namespace path with `$` separator (existing) | +| `object_type` | `string` | `"namespace"` or `"table"` (existing) | +| `metadata` | `string` | JSON-encoded metadata/properties (existing) | +| `read_version` | `uint64` | Table version for reads (optional, see [Transaction](#transaction)) | +| `partition_field_{field_id}` | `` | Partition value for the field (nullable, inherited from parent namespaces) | +| ... | ... | Additional partition field columns as needed | Partition values are inherited from parent namespaces - each row has all partition values from its ancestors. See [Appendix C: Manifest Table Example](#appendix-c-manifest-table-example) for a complete example. @@ -297,43 +295,67 @@ This design ensures backward compatibility while enabling partition strategy evo Operations within a single partition table are ACID-compliant according to the Lance table specification. Each partition is an independent Lance table, so reads and writes to a single partition follow standard Lance transaction semantics. -### Multi-Partition Transaction +### Multi-Partition Transaction (Weak) By default, operations across multiple partitions have weaker guarantees: - **Writes across partitions are not atomic or consistent**: A write that affects multiple partitions may partially succeed, leaving some partitions updated while others are not. - **Reads across partitions are not isolated**: A read spanning multiple partitions may observe different versions of each partition, leading to inconsistent views. -To enable stronger transactional guarantees across partitions, the `__manifest` table can optionally include `read_version`, `read_branch`, and `read_tag` columns for a table. -These columns record which version of each partition table to read. +In multi-partition transaction weak mode, write operations are directly committed to the main branch of each table. It means users can always see the fresh state of each leaf +partition table without additional information from partitioned namespace. -#### Read Behavior +Users need to handle writes across partitions carefully because there is no ACID guarantees. One way is to use idempotent write like `insert overwrite` then retry +for whatever error. Another way is writing partitions one by one. + +**Summary** +* Read Behavior: Readers should always read the latest version from the main branch. + +* Write Behavior: Writers should always commit to the main branch. + +* Conflict Resolution: No conflict resolution for cross-partition writes. + +### Multi-Partition Transaction (Strong) + +To enable stronger transactional guarantees across partitions, the `__manifest` table can optionally include `read_version` column for a table. +The `read_version` records which version of each partition table to read. -Users should specify one of the following combinations: +In strong mode, write operations will use detached commit to each table. A detached commit is invisible unless +the version is set, it makes sure the intermediate state of a transaction remains invisible. Users need to first get `read_version` from partitioned +namespace, then read the leaf partition table using the current version. -1. **`read_version` only**: Read the specified version from the main branch. -2. **`read_branch` + `read_version`**: Read the specified version from the specified branch. -3. **`read_tag` only**: Read the version referenced by the specified tag. +The `__manifest` table records all the versions of each partitioned table. A version of a partitioned table is recorded +as an entry, with `object_id` in format `table_id${version}` and `object_type` value `table_version`. -When all columns are NULL or not present, readers should read the latest version from the main branch. +Partitioned namespace uses `merge_insert` to commit a transaction, which makes sure the commit is atomic. + +#### Read Behavior + +1. Transaction starts, recording the current version of `__manifest` table as snapshot id(S0). +2. Search `__manifest` table to collect current `read_version`s for the partition tables to read. +3. Read the current version from the partition table. #### Commit Behavior Multi-partition transactions are guarded by commits against the `__manifest` table. A typical multi-partition write follows this pattern: -1. Write data to each affected partition table independently -2. Atomically update the `read_version` (and optionally `read_branch` or `read_tag`) of all affected partitions in a single `__manifest` commit +1. Transaction starts, recording the current version of `__manifest` table as snapshot id(S0). +2. Write data to each affected partition table independently +3. Get current version of `__manifest` table as snapshot id(S1), detect/resolve conflicts if `S1` is not `S0`. +4. Atomically update the `read_version` from `S1` to `S2` in a single `__manifest` commit. This should be done by `merge_insert`. This ensures all-or-nothing visibility of changes across partitions. #### Conflict Resolution -If concurrent commits have been committed to `__manifest` since the transaction began, the implementation must either: - -1. Rebase the current commit onto the latest `__manifest` version and retry the commit, or -2. Fail the current commit and return an error to the caller - -Implementations are responsible for ensuring the appropriate conflict detection and resolution strategy to guarantee ACID semantics during multi-partition transactions. +1. Based on `read_version` timeline to fetch the changes between S0 and S1. +2. For each updated table in S2: + * collect the transactions between S0 to S1; + * detect/resolve conflicts between the collected transactions and S2; +3. Commit the partitioned namespace if all conflicts are resolved. +4. Otherwise + * Rebase the current commit onto the latest `__manifest` version and retry the commit, or + * Fail the current commit and return an error to the caller. ## Appendices @@ -431,15 +453,18 @@ The namespaces (`v1`, `v1$k7m2n9p4q8r5s3t6`, etc.) are tracked in the `__manifes The `__manifest` table for a partitioned namespace with partition fields `event_date` (v1), `event_year` (v2) and `country` (v2), showing entries from both spec versions: -| object_id | object_type | metadata | read_version | read_branch | read_tag | partition_field_event_date | partition_field_event_year | partition_field_country | -|-----------------------------------------------|-------------|----------|--------------|-------------|----------|----------------------------|----------------------------|-------------------------| -| v1 | namespace | {} | NULL | NULL | NULL | NULL | NULL | NULL | -| v1$k7m2n9p4q8r5s3t6 | namespace | {} | NULL | NULL | NULL | 2025-12-10 | NULL | NULL | -| v1$k7m2n9p4q8r5s3t6$dataset | table | {} | 5 | NULL | NULL | 2025-12-10 | NULL | NULL | -| v2 | namespace | {} | NULL | NULL | NULL | NULL | NULL | NULL | -| v2$e9f0g1h2i3j4k5l6 | namespace | {} | NULL | NULL | NULL | NULL | 2025 | NULL | -| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4 | namespace | {} | NULL | NULL | NULL | NULL | 2025 | US | -| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | table | {} | 3 | NULL | NULL | NULL | 2025 | US | +| object_id | object_type | metadata | read_version | partition_field_event_date | partition_field_event_year | partition_field_country | +|------------------------------------------------|---------------|----------|----------------------|----------------------------|----------------------------|-------------------------| +| v1 | namespace | {} | NULL | NULL | NULL | NULL | +| v1$k7m2n9p4q8r5s3t6 | namespace | {} | NULL | 2025-12-10 | NULL | NULL | +| v1$k7m2n9p4q8r5s3t6$dataset | table | {} | NULL | 2025-12-10 | NULL | NULL | +| v1$k7m2n9p4q8r5s3t6$dataset$1 | table_version | {} | 1 | NULL | NULL | NULL | +| v1$k7m2n9p4q8r5s3t6$dataset$2 | table_version | {} | 11120734598765432152 | NULL | NULL | NULL | +| v2 | namespace | {} | NULL | NULL | NULL | NULL | +| v2$e9f0g1h2i3j4k5l6 | namespace | {} | NULL | NULL | 2025 | NULL | +| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4 | namespace | {} | NULL | NULL | 2025 | US | +| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | table | {} | NULL | NULL | 2025 | US | +| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset$1 | table_version | {} | 1 | NULL | NULL | NULL | Note: The root namespace properties (`partition_spec_v1`, `partition_spec_v2`, `schema`) are stored in the `__manifest` table's metadata, not as a row. The `object_id` uses `$` as the namespace path separator. Partition columns use the naming convention `partition_field_{field_id}` where `{field_id}` is the partition field's string identifier. Partition values are inherited from parent namespaces. When retrieving properties via API, partition values are converted to `partition. = ` entries. @@ -459,7 +484,7 @@ WHERE event_date = '2025-12-10' AND country = 'US' The engine translates this to the following `__manifest` DataFusion query plan to examine related partition tables. ```sql -SELECT object_id, location, read_version, read_branch, read_tag +SELECT object_id, location, read_version FROM __manifest WHERE object_type = 'table' AND ( @@ -480,14 +505,14 @@ One example way to perform such substitution is: This query returns: -| object_id | location | read_version | read_branch | read_tag | -|----------------------------------------------|-------------------------------------------------------|--------------|-------------|----------| -| v1$k7m2n9p4q8r5s3t6$dataset | b4a3c2d1_v1$k7m2n9p4q8r5s3t6$dataset | 5 | NULL | NULL | -| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | aabbccdd_v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | 3 | NULL | NULL | +| object_id | location | read_version | +|----------------------------------------------|-------------------------------------------------------|----------------------| +| v1$k7m2n9p4q8r5s3t6$dataset | b4a3c2d1_v1$k7m2n9p4q8r5s3t6$dataset | 18446744073709551615 | +| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | aabbccdd_v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | 16045690984833335022 | - For partition spec v1, the `country = 'US'` filter cannot be pushed to partition pruning (v1 has no `country` partition), so it must be applied during the table scan - For partition spec v2, both filters are pushed down: `partition_field_event_year = 2025` (computed from `year(event_date)`) and `partition_field_country = 'US'` -- The engine reads each table at the version specified by `read_version`, `read_branch`, or `read_tag` for consistent snapshot reads +- The engine reads each table at the version specified by `read_version` for consistent snapshot reads ### Appendix E: Runtime Namespace Properties Example