Skip to content

[Python] Support tree-model TsFiles in TsFileDataFrame#816

Open
Young-Leo wants to merge 5 commits into
apache:developfrom
Young-Leo:tsdf-tree
Open

[Python] Support tree-model TsFiles in TsFileDataFrame#816
Young-Leo wants to merge 5 commits into
apache:developfrom
Young-Leo:tsdf-tree

Conversation

@Young-Leo

Copy link
Copy Markdown
Contributor

Summary

This PR teaches the Python TsFileDataFrame to read tree-model TsFiles
(in addition to the existing table-model support) and, in the same series of
commits, slims the underlying dataset index so it scales to wide / sparse
schemas without paying for phantom (device, field) cells.

The user-facing dataset surface (__len__, list_timeseries, __getitem__,
.loc, aligned reads) is unchanged for both models — tree-model files
just become loadable through the same API.

What changed

Tree-model support

  • Detect the model kind at reader open. An empty table-schema map ⇒ tree
    model; otherwise table model.
  • For tree files, synthesize one virtual TableEntry:
    • table name = the shared root segment of every device path
    • tag columns = _col_1 .. _col_N (one positional column per remaining
      path segment, padded with None for shorter devices)
    • fields = union of measurements across all devices
  • Per-device measurement ownership is preserved by registering only the
    (device_id, field_idx) pairs that are actually written on disk in
    series_stats_by_ref, so the dataset never advertises phantom series.
  • Tree-model reads route through query_table_on_tree with client-side
    device filtering. This works around two cwrapper limitations on the
    current native build (query_tree_by_row rejects multi-segment device
    paths, and successive query_table_on_tree calls leak duplicate col_*
    columns); both are documented inline at the cwrapper boundary.
  • Tree-mode rendering: drop the leading table column, use _col_i
    headers, print None tag cells as "None", and surface the model
    marker in the repr header.
  • Mixing table-model and tree-model files in one load set is rejected
    with a clear error.

Dataset index slim-down

  • SeriesStats becomes a NamedTuple (~120 B vs. the previous ~360 B
    per-series dict).
  • _DerivedCache removed; lookups computed lazily on top of existing
    state.
  • Per-reader device_refs: List[List[DeviceRef]] collapsed into a
    pre-aggregated device_time_bounds: List[Tuple[Optional[int], Optional[int]]], so _query_aligned reads bounds in O(1).
  • Drop the redundant series_ref_set (use series_ref_map keys).
  • Phase 6: unify table/tree semantics so the table-model branch no
    longer pads series_stats_by_ref with empty placeholders for
    schema-declared but never-written cells. The dataset view is now
    strictly “real devices × real fields” in both models.
  • Cleanup: rename _LogicalIndex_DataFrameCatalog, shorten the five
    internal field names (devices / device_index / device_time_bounds
    / series / series_shards), inline the now-trivial
    iter_owned_series_refs wrapper.

New public API

  • TsFileDataFrame.model — read-only model marker ("table" or "tree").
  • TsFileDataFrame.list_timeseries_metadata() — per-series metadata as a
    flat tabular view (works identically for both models).

Compatibility

  • No changes to the existing dataset surface. Existing user code that
    loads table-model TsFiles continues to work without modification.
  • No changes to the on-disk format, the cwrapper, or the C++/Java sides.
  • SeriesStats integer fields tighten from Optional[int] to int. The
    surrounding get_series_info_by_ref still exposes them as the existing
    dict shape, so callers do not see an API change.

Memory impact

Two benches were used because the wins land in different shapes of
schema.

Bench A — 30 k devices × 1 field, full density

Step Tracked size Δ
baseline 81.20 MB
SeriesStats NamedTuple 70.67 MB −10.53 MB
_DerivedCache removal 59.82 MB −10.85 MB
device_time_bounds aggregation 56.40 MB −3.42 MB
series_ref_set removal 54.40 MB −2.00 MB
drop phantom (device, field) cells 54.40 MB 0

End-to-end: 81.20 → 54.40 MB (−33 %). Dropping phantom cells brings
nothing here because every device writes the single declared field;
there are no skipped cells to prune.

Bench B — 5 k devices × 5 fields × 60 % density (sparse / wide)

Component Before After
series_ref_map 15.93 MB 9.55 MB
series_stats_by_ref 7.53 MB 4.51 MB
tracked total 26.50 MB 16.30 MB

Dropping phantom cells alone brings −38 % on this fixture; the
sparser and wider the schema, the larger the win.

Testing

  • python -m pytest python/tests/test_tsfile_dataset.py → 41 / 41 pass.
  • Four new tree-model tests cover: metadata + repr layout, single-series
    read, multi-field aligned read, list_timeseries_metadata column
    shape, and the mixed-model load rejection.
  • One new sparse-schema test (test_dataset_omits_table_model_phantom_series_for_skipped_cells)
    proves Tablet-skipped cells stay out of list_timeseries, __len__,
    series_ref_map, and that tsdf[skipped_path] raises KeyError.

@JackieTien97 JackieTien97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: [Python] Support tree-model TsFiles in TsFileDataFrame

Overall this is a well-structured PR with solid test coverage. The tree-model integration, phantom-series elimination, and index slim-down are all clean and well-motivated. I have a few concerns — one potential correctness issue, one performance concern, and several nits. See inline comments.

Comment thread python/tsfile/dataset/dataframe.py Outdated
_OVERLAP_ROW_CHUNK_SIZE = 256

MODEL_TABLE = "table"
MODEL_TREE = "tree"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MODEL_TABLE and MODEL_TREE are defined identically in both reader.py:37-38 and dataframe.py:54-55. The dataframe module already imports from reader (from .reader import ...). Consider defining these constants once in reader.py (or metadata.py) and importing them in dataframe.py to avoid drift.

The test file imports MODEL_TABLE from reader, so both definitions are actively used — easy to get them out of sync in the future.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

not timeline_statistic.has_statistic
or timeline_statistic.row_count <= 0
):
if not statistic.has_statistic or statistic.row_count <= 0:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filter condition changed from checking timeline_statistic.has_statistic to checking statistic.has_statistic. This is a semantic shift: the old code filtered based on the timeline (timestamp axis), while the new code filters on the value statistic.

Are there cases where statistic.has_statistic is False but timeline_statistic.has_statistic is True, or vice versa? If the two can diverge, this changes which series are visible. The docstring says "iff its native statistic block is populated" but the old behavior was gating on timeline_statistic.

Also, line 388 now reads timeline_statistic unconditionally after the statistic check passes. If a measurement has statistic.has_statistic=True but timeline_statistic.has_statistic=False (or timeline_statistic is None), lines 393-395 would raise. Worth a defensive check or a note explaining why this can't happen.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional, not a regression. We gate on the value statistic (non-null row_count), not the timeline, so phantom cells are excluded.

  • Concrete case: In table bench(v1, v2, v3), device d1 writes only v1/v2 (skips v3). For (d1, v3), value row_count == 0 but timeline row_count == 1 (shares d1's timeline). A timeline gate would surface a phantom all-NaN bench.d1.v3; the value gate drops it.
  • Note: has_statistic is True for the phantom too, so row_count is the real discriminator. This is pinned by test_dataset_omits_table_model_phantom_series_for_skipped_cells (asserts len == 4 and bench.d1.v3KeyError).

On the unconditional timeline_statistic read: It can't raise. The native binding (timeseries_statistic_c_to_py) always returns a TimeseriesStatistic (never None) with row_count/start/end populated even when has_statistic is False. Skipped the defensive guard since that branch is unreachable, and added a comment explaining both points.

timestamps = []
values = []
skipped = 0
with self._reader.query_table_on_tree([field_name]) as result_set:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both _read_series_by_row_tree and _read_arrow_tree do a full query_table_on_tree scan over the entire file and then client-side filter by target_path_segments. For a file with many devices, this means reading every row to extract data for a single device.

This is acknowledged in the PR description as a cwrapper limitation workaround, which is fine for now — but worth flagging for future readers: this is O(total_rows) per device, so an aligned read across N devices in the same file becomes O(N × total_rows). The comments at the call-site document the "why" well; a brief # O(total_rows) — see PR #816 for cwrapper limitations here would also help future profilers find the hot path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed, still O(total_rows) per device (aligned read over N devices = O(N × total_rows)) — and yes, intentional for this PR. Added PERF markers at both query_table_on_tree call sites so the hot path is greppable.

Root cause is a cwrapper limitation: per-device tree pushdown does exist (query_timeseries, query_tree_by_row), but query_table_on_tree leaks stale col_i path columns across queries on a reused reader, so we deliberately do a single full scan + client-side device filter (hence the all_col_indices[-expected_path_len:] trim).

Follow-up options (tracked separately, not in this PR):

  1. Switch to query_tree_by_row/query_timeseries once the cwrapper stale-state bug is fixed → O(device_rows).
  2. A client-side win that needs no cwrapper change — for aligned reads, scan once and bucket rows by device path → O(total_rows) for the whole aligned read instead of per device.

return list(self.iter_series_paths())

@property
def series_count(self) -> int:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetadataCatalog.series_count (metadata.py:148) still computes the cross-product of all (device, field) combos, but the reader's series_count property (line 111) now uses len(series_stats_by_ref) which counts only physically-present series.

For sparse schemas these two diverge. Tests at test_tsfile_dataset.py:688 and :789 still call reader.catalog.series_count and happen to pass because those fixtures are fully dense, but this is a latent inconsistency.

Consider either: (a) updating MetadataCatalog.series_count to also return len(self.series_stats_by_ref), or (b) deprecating it in favor of the reader property.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with (a): MetadataCatalog.series_count now returns len(self.series_stats_by_ref) (physically-present series, matching len(tsdf)/series_paths), the reader property delegates to it as the single source of truth, and I added a sparse-schema regression assertion so the dense-fixture blind spot is covered.

parent_shards = parent._index.series_shards
subset_shards = {ref: parent_shards[ref] for ref in subset_refs}
obj._index = _DataFrameCatalog(
model=parent._index.model,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_from_subset creates a new _DataFrameCatalog but does not propagate tables_with_sparse_tag_values or sparse_device_indices_by_compressed_path. If a view (created by slicing or boolean indexing) later calls _resolve_series_name on a series whose device has sparse tags, the lookup will fail because the compressed-path index is empty.

This was also missing in the old _LogicalIndex code, so it's a pre-existing gap — but since this PR refactors _from_subset, it's worth noting. A one-line fix:

tables_with_sparse_tag_values=parent._index.tables_with_sparse_tag_values,
sparse_device_indices_by_compressed_path=parent._index.sparse_device_indices_by_compressed_path,

@Young-Leo Young-Leo Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gap was already closed by #848 ("fix tag null segsegv"), which removed the sparse_device_indices_by_compressed_path / tables_with_sparse_tag_values mechanism entirely and switched resolution to position-preserving SeriesPath keys with a single direct device_index lookup. Those two fields no longer exist on the catalog, so the suggested lines would AttributeError.

This PR only renamed _LogicalIndex_DataFrameCatalog and already propagates the full device_index (plus devices/table_entries) into _from_subset, so views resolve sparse-tag series correctly. Verified on an interior-null-tag device (sensors.\N.alpha.temperature): both a slice view and a boolean-mask view resolve and read it.

device_key, table_entry, _ = self._get_series_components(series_ref)
field_stats = self._cache.field_stats[series_ref]
# Aggregate per-shard timeline stats lazily on demand for this series.
field_stats = _build_field_stats(self._index.series_shards[series_ref])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_build_field_stats is now called lazily on every _build_series_info invocation instead of being precomputed once in _DerivedCache. This means:

  • list_timeseries_metadata() calls _build_series_info for every series, each of which calls _build_field_stats. For N series across K shards, this is O(N×K) reader calls.
  • __repr__ and __getitem__(column_name) also go through _build_series_info.

The old approach precomputed these stats once at load time. The PR description frames this as removing _DerivedCache, but the trade-off is that repeated access patterns (e.g., interactive exploration calling repr() then list_timeseries_metadata() then column access) now recompute stats from scratch each time.

For typical usage with moderate series counts this is fine. For wide schemas (the PR's own benchmark cites 5k devices × 5 fields), it could add up. Worth noting in a comment or considering a @functools.lru_cache on the hot path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a real hotspot: _build_field_stats reads the already-precomputed series_stats_by_ref in memory (no I/O). repr isn't O(N) either — it previews ~20 rows. The only O(N) paths are list_timeseries_metadata() / full-column access, where the pandas frame build dominates the stats anyway.

So I'll skip the cache (extra memory + invalidation/view-sharing complexity for little gain) and just add a comment noting the lazy trade-off — happy to memoize later if a wide-schema profile ever flags it.

]
# Only the trailing expected_path_len col_i cells are genuine; the
# leading duplicates are stale from prior queries on this reader.
col_indices = all_col_indices[-expected_path_len:]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "stale col_ columns from prior queries" workaround (all_col_indices[-expected_path_len:]) appears in both _read_series_by_row_tree (line 556) and _read_arrow_tree (line 747). This is a fragile assumption about cwrapper internal state.

If the cwrapper behavior changes (e.g., fixed to not leak duplicate columns, or leaks in a different pattern), this slicing logic will silently break. Consider at minimum an assertion:

assert len(col_indices) == expected_path_len, (
    f"Expected {expected_path_len} col_ columns, got {len(col_indices)} "
    f"(total {len(all_col_indices)})"
)

This would fail fast if the assumption is violated rather than silently returning wrong data.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the assertion at both tree-read sites — fails fast if the cwrapper emits fewer col_ columns than expected (stale heuristic would otherwise silently read wrong-device data). Comment notes it guards the count only; a leak where genuine columns aren't trailing still needs value-level checking.

return f"TsFileDataFrame({total} time series, {len(self._paths)} files)\n"

return (
f"TsFileDataFrame({model_marker} model, {total} time series, "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: PEP 8 expects two blank lines between top-level methods in a class. _repr_header ends at line 1059 and __repr__ starts at line 1060 — missing a blank line separator.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already has one blank line between _repr_header and __repr__ — nothing missing. PEP 8 wants one blank line between methods (two is for top-level defs), and Black/spotless enforces this in CI, so spacing is guaranteed. Likely a stale line number.

writer.write_row_record(
RowRecord(
"root.ln.wf01.wt01",
t,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test suite covers single tree-model files, but there's no test for loading multiple tree-model files into one TsFileDataFrame. The multi-file merge path (via _register_reader + device_time_bounds aggregation) has tricky edge cases for tree-model:

  • Two files with the same device but different measurement subsets — do we get the union of fields? Do time bounds merge correctly?
  • Two files with different max_depth — does the tag padding still work correctly?

These are the same merge paths that table-model tests cover, but tree-model's synthetic table construction introduces new failure modes around the union-fields and padding logic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — implemented it. Tree shards now union instead of rejecting: _merge_tree_table_entries unions fields and grows tags to the deepest file (shallower devices pad with null), and each series is keyed by the global field index resolved by measurement name (shards keep reader-local indices), so measurements never mis-map. Added multi-file tests for identical-merge, field-union, and differing-depth. Table-model still rejects incompatible schemas.

continue
current_root = full_segments[0]
if root_name is None:
root_name = current_root

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation rejects tree files with multiple root segments (root_name != current_root). In practice, IoTDB tree-model files virtually always use root as the single root. However, the synthetic table uses the root segment as the table_name, so if a file ever had devices under multiple roots, the virtual table would be ambiguous.

The error message is clear and this is the right behavior — just noting that I'd expect this to never actually fire in practice. If it does, it signals a seriously malformed file rather than a normal use case.

@ColinLeeo ColinLeeo self-requested a review June 16, 2026 02:02
Squash of the tsdf-tree branch: tree-model support plus dataset catalog memory and indexing improvements.

- support tree-model TsFiles in TsFileDataFrame
- shrink the dataframe catalog memory footprint
- drop table-model phantom cells
- inline iter_owned_series_refs
- rename _LogicalIndex to _DataFrameCatalog
- dedup repeated series specifiers in tsdf.loc
Comment on lines +248 to +259
def _cache_metadata_tree_model(self):
"""Build the in-memory catalog by synthesizing one virtual table.

Tree TsFiles have no schema, so we materialize a single
``TableEntry``: table name = the shared root segment, tag columns
= ``_col_1..._col_{N_max}`` (one per remaining path segment),
fields = union of measurements across devices. Per-device
ownership is preserved by registering only the
``(device_id, field_idx)`` pairs actually written on disk in
``series_stats_by_ref``.
"""
metadata_groups = self._reader.get_timeseries_metadata(None)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In table mode, non-numeric types are filtered out, but the tree path doesn't do this, so the two modes are inconsistent. We should keep the logic consistent — or revisit why table mode
filtered non-numeric types in the first place.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed non-numeric types are supportable in principle — a DataFrame can hold heterogeneous columns. For now I kept it simple and consistent by filtering them in both models. Introducing non-numeric support uniformly (heterogeneous, pandas-style columns) makes sense as a follow-up; I'll track it separately.

- metadata: define MODEL_TABLE/MODEL_TREE once and import them in reader
  and dataframe instead of redefining the same literals in both modules,
  removing the drift risk between duplicate definitions.
- metadata: MetadataCatalog.series_count now returns the count of
  physically-present series (len(series_stats_by_ref)) instead of the
  device x declared-field cross-product, so it matches len(tsdf) and the
  reader property for sparse schemas; TsFileSeriesReader.series_count
  delegates to it as the single source of truth.
- reader: document why _metadata_field_stats gates on the value
  statistic's non-null row_count (not the timeline) and why the
  timeline_statistic reads below cannot raise.
- reader: mark the tree-model full-scan reads as O(total_rows) per device
  (cwrapper per-device pushdown limitation) so the hot path is easy to
  locate when profiling.
- reader: fail fast with an assertion if the cwrapper stale col_ leak
  pattern changes, instead of silently slicing the wrong path columns and
  returning wrong-device data.
- tests: add a sparse-schema regression assertion locking series_count to
  the physically-present count.
- Tree TsFiles synthesize their schema per file, so loading several into one
  TsFileDataFrame previously raised "Incompatible schema" whenever the files
  differed in measurement subset or device depth. Widen the dataframe-level
  synthetic table to the union instead: _merge_tree_table_entries unions the
  field columns (existing order first) and grows the tag columns to the
  deepest file's depth so shallower devices pad with nulls.
- Key each logical series by the global field index resolved by measurement
  name against the merged table, while each shard keeps its reader-local
  field index for the read path. This keeps per-(device, field) ownership
  exact and avoids mis-mapping measurements across files. For table model the
  global and local indices always coincide, so behavior is unchanged
  (incompatible table schemas are still rejected).
- tests: add multi-file tree coverage -- identical-structure merge (shards and
  time bounds), field union across files, and differing-depth union with tag
  padding.
The dataset surface is numeric (values are read as float64), and table mode
already drops non-numeric (STRING/TEXT) fields from its schema. Tree mode did
not, so a string/text measurement was surfaced as a series that then crashed on
read (float() on a string). Apply the numeric-type filter once in
_metadata_field_stats so both models behave consistently -- this also makes the
existing tree-model "measurements that pass the numeric filter" comment accurate.

- reader: skip measurements whose data_type is not numeric in
  _metadata_field_stats (shared by table and tree paths); table mode is
  unaffected since it already filters at the schema level.
- tests: add a tree-model regression test asserting a STRING measurement is
  dropped (not listed, raises KeyError) while the numeric one reads back.
Comment on lines 758 to 765

# PERF: O(total_rows) full tree scan filtered to one device
# client-side; aligned reads over N devices are O(N * total_rows).
# See _read_series_by_row_tree / PR #816 for the cwrapper limitation
# that blocks per-device pushdown (query_timeseries). Hot path.
with self._reader.query_table_on_tree(
field_columns, start_time, end_time
) as result_set:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comment, add a todo and fixme pls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants