Compact categorical cut layout to observed codes, harden distributed CatContainer#12182
fingoldo wants to merge 1 commit into dmlc:master
Conversation
A polars Categorical built against a primed StringCache holds a few unique strings at very large physical codes. AddCategories / MakeCuts emitted a dense [0..max_observed_code] cut layout per categorical feature, blowing cut_values to O(max_code) and triggering STATUS_ACCESS_VIOLATION on Windows when the saved-tree bitfield was sized off cuts.MaxCategory()+1. Store one cut per observed code; size MaxNumBinPerFeat() and saved-tree cat_bits via per-feature observed range. Add a content-hash check (AllreduceDigestAndCheck) so distributed runs surface divergent ref dictionaries before training. Mutex-guard CatContainer Sort/Copy and round-trip is_ref/sorted across Save/Load. Bench (primer=500k, n_real_cats=16, 8 boost rounds, identical build): CPU train 5.42s -> 0.28s (19.5x), GPU train 6.78s -> 0.19s (36x), peak RSS ~33% lower during training and ~39% during construct on both devices. Predictions byte-identical to master across both regimes. Equivalent to master in dense-codes regime (typical sklearn LabelEncoder or pd.factorize() output). Closes dmlc#12177
Can we just enforce a dense encoding? Making changes across the code base is quite fragile.
I did not think about that; it might be an even better way. I got too excited with my current approach, it seems. I will have more time to implement and benchmark your suggestion by the end of this week. Can you approve running the rest of the CI workflows in the meantime (there was one test I forgot to run locally that uncovered a bug; it should be fixed by now)?
Thank you for the reply. We will go through the CI when we have a solution to test (it's approval per commit); I don't think we will merge the PR in its current design.
- `cat_container_hash` + `AllreduceDigestAndCheck`: cross-worker check that ref dictionaries agree (surfaces divergent refs at construction instead of silent mistraining).
- `ValidateCatStrArrayOffsets`: host-side Arrow offset validation in the categorical adapter (front==0, monotonic, in-range).
- `CatContainer` `Sort`/`Copy`/`Save`/`Load` mutex-guarding for thread safety.
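A minimal pure-Python sketch of the content-hash idea behind the digest check (the real `cat_container_hash` implementation is C++ and also mixes a second independent digest; only the FNV-1a constants below are standard, everything else is illustrative):

```python
def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a over a byte string."""
    h = 0xCBF29CE484222325                            # FNV offset basis
    for b in data:
        h ^= b
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2^64
    return h

def dict_digest(categories):
    # Order matters: the same strings in a different order must not collide,
    # since workers must agree on codes, not just on the set of names.
    return fnv1a_64("\x00".join(categories).encode())

# An allreduce-style agreement check: every worker must hold the same digest.
workers = [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"]]
digests = [dict_digest(d) for d in workers]
print(len(set(digests)) == 1)  # False: the third worker diverges and is flagged
```

In the real check the comparison happens via an allreduce rather than gathering all digests on one node, but the failure condition is the same: any worker whose dictionary bytes differ produces a different digest.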
Closes #12177
Summary
A polars Categorical built against a primed StringCache holds a few unique strings at very large physical codes (e.g. observed code 2_526_058 with only 51 unique values, per the original reproducer).
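To make the two code regimes concrete, here is a small pure-Python sketch (illustrative only; neither class is a real polars or XGBoost API): a shared cache primed with unrelated strings hands out large physical codes to a handful of values, while a factorize-style dense encoding keeps the max code at n_unique - 1.

```python
class ToyStringCache:
    """Toy global cache: codes reflect insertion order across ALL frames."""
    def __init__(self):
        self._codes = {}

    def encode(self, s):
        return self._codes.setdefault(s, len(self._codes))

def factorize(values):
    """Dense, per-frame encoding in the spirit of pd.factorize."""
    mapping = {}
    return [mapping.setdefault(v, len(mapping)) for v in values]

cache = ToyStringCache()
for i in range(500_000):           # prime the cache with unrelated strings
    cache.encode(f"primer_{i}")

sparse = [cache.encode(s) for s in ("a", "b", "c")]
dense = factorize(["a", "b", "c"])

print(sparse)  # [500000, 500001, 500002]: 3 uniques, huge physical codes
print(dense)   # [0, 1, 2]: max code == n_unique - 1
# A dense [0..max_observed_code] cut layout would allocate 500_003 cuts for
# `sparse`, versus 3 under a compact one-cut-per-observed-code layout.
```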
`AddCategories`/`MakeCuts` emitted a dense `[0..max_observed_code]` cut layout per categorical feature, blowing `cut_values` to O(max_code) and triggering a silent process kill on Windows (STATUS_ACCESS_VIOLATION, exit code 3221226505/0xC0000005) when the saved-tree bitfield was sized off `cuts.MaxCategory()+1`. This PR stores one cut per observed code and propagates per-feature compact sizing through
`MaxNumBinPerFeat()` and the saved-tree path.

What changed
- `src/common/quantile.{cc,cu}`: `AddCategories` writes one cut per observed code; the column-wise empty-worker path emits a placeholder cut so downstream sizing never sees the `-1.f` sentinel.
- `src/data/gradient_index.h`: `MaxNumBinPerFeat()` returns the widest per-feature bin range scanned over `cut.Ptrs()` (the correct upper bound under compact non-dense layouts).
- `src/tree/{hist/evaluate_splits.h,updater_gpu_hist.cu}`: saved-tree `cat_bits` sized via the new `common::SizeCatBitsForMaxCode` using the per-feature observed max code.
- `src/data/cat_container_hash.{h,cuh}`: dual independent 64-bit content digest (FNV-1a + secondary mixing). Lets distributed runs detect divergent ref dictionaries before training. The device sibling folds bytes on-device.
- `src/data/quantile_dmatrix.{cc,cu}`: digest gate via `AllreduceRefAgreement` + `AllreduceDigestAndCheck`; CPU and GPU paths share allreduce shape for this gate.
- `src/data/cat_container.{h,cc,cu}`: `Sort()`/`Copy()`/`Save()`/`Load()` are mutex-guarded and idempotent via a persisted `sorted_flag`; back-compat defaults preserved.
- `src/data/adapter.h`, `device_adapter.cuh`: `CachedRefMapping` aliases the reference DMatrix's `CatContainer` instead of allocating a fresh device copy on every dispatch.

Behavior change
`MaxNumBinPerFeat()` now scans `cut.Ptrs()` per feature instead of using `cut.MaxCategory()+1`. It is equal to the old value in dense-codes layouts, and strictly smaller (and correct) under the compact layouts produced by this PR.

Performance
Sweep `primer_size` in {1k, 50k, 200k, 500k}, `n_real_cats=16`, 2048 rows, 8 boost rounds, `tree_method=hist`. Same `xgboost.dll` build for both branches. Headline timings at primer=500k: CPU train 5.42s -> 0.28s (19.5x), GPU train 6.78s -> 0.19s (36x).

Peak RSS at primer=500k drops ~33% during training and ~39% during construct on both devices (CPU train: 115 -> 76 MB; GPU train: 116 -> 79 MB; CPU construct: 100 -> 60 MB; GPU construct: 99 -> 61 MB).
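The per-feature sizing described under the behavior change above reduces to a scan over adjacent cut pointers. A sketch assuming a CSR-style pointer array (illustrative names, not XGBoost's actual signature):

```python
def max_num_bin_per_feat(cut_ptrs):
    # cut_ptrs[f]..cut_ptrs[f+1] delimit feature f's cut values (CSR-style),
    # so the widest per-feature bin range is the max adjacent difference.
    return max(b - a for a, b in zip(cut_ptrs, cut_ptrs[1:]))

# Compact layout: three features with 51, 2, and 7 observed categories.
print(max_num_bin_per_feat([0, 51, 53, 60]))  # 51, not max_code + 1
```

Under a dense layout the same call would have been bounded by the global max category, which is what made `cut_values` balloon in the primed-cache case.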
`inplace_predict()` output is byte-identical between master and this PR for every primer size on both CPU and CUDA. In the dense-codes regime (`max_observed_code == n_unique_observed - 1`, typical sklearn `LabelEncoder` or `pd.factorize()` output) the new path produces the same `cut_values` as master; equivalent in performance and bit-for-bit results.

Testing

- `testxgboost --gtest_filter='Quantile.AddCategories*:CatContainer.*:CatContainerHash.*'`: 20/20 pass.
- `pytest tests/python/test_with_polars.py`: 15/16 pass. The single failure is `test_polars_basic[QuantileDMatrix]` on the boolean-column path, reproducible on clean `origin/master` (e0d3dfd5) with a byte-identical error: pre-existing, not introduced here.
- `pytest tests/python-gpu/test_gpu_with_polars.py`: 4/4 pass on my old GTX 1050 Ti.
- `pytest tests/test_distributed/test_with_dask/test_with_dask.py::test_categorical_ref_quantile_dmatrix`: pass on a dask `LocalCluster`.
- `tests/test_distributed/test_gpu_with_dask/test_gpu_with_dask.py::test_categorical_ref_quantile_dmatrix`: GPU mirror authored to match the CPU version's structure but not exercised locally; the suite imports `cudf` at module top and RAPIDS has no Windows wheel.

Pre-existing concerns surfaced (not addressed here)
- `src/tree/gpu_hist/evaluator.cu`: `node_categorical_storage_size_` still sized by the global `MaxCategory()+1`. Per-feature evaluator buffers would close the residual runtime memory hit.
- `cat_container.cu` `CopyTo` and `quantile.cu` `CopyTo`/`PruneImpl`/`FixError` use the legacy default stream while the surrounding pipeline runs on `ctx->CUDACtx()->Stream()`.
- `cpu_impl::MakeSketches` calls `SyncFeatureType` (an extra `Allreduce`) with no GPU sibling: a pre-existing CPU/GPU pipeline-shape gap; this PR's digest gate itself is shape-symmetric across CPU and GPU branches.
- `proxy_dmatrix.cuh` `DispatchAny`'s "dummy return value" branch via `std::any_cast<CudfAdapter>` throws `std::bad_any_cast` on the `type_error` sentinel path; latent (today no callers opt in).
- `bst_cat_t n_total_cats_`: read unlocked from `HasCategorical()` while `Copy()` writes it under `sort_mu_`. Aligned int32 reads make this benign on x86/ARM but TSAN-flaggable.
- `MaxNumBinPerFeat()` is now O(n_features) per call; `gradient_index.cc::ResizeIndex` already caches per call. A class-level cache would amortize across all callers.
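For scale, the residual cost in the first item above can be sketched with the bitfield arithmetic (an illustrative mirror of what a helper like `common::SizeCatBitsForMaxCode` must compute; the 32-bit word width is an assumption here, not taken from the source):

```python
BITS_PER_WORD = 32  # assumed word width for the categorical bitfield

def cat_bits_words(max_code):
    # One bit per category code 0..max_code, rounded up to whole words.
    return max_code // BITS_PER_WORD + 1

print(cat_bits_words(50))         # 2 words for 51 observed categories
print(cat_bits_words(2_526_058))  # 78940 words when sized off the global max code
```

Sizing evaluator buffers per feature instead of off the global max code would shrink that last figure back toward the per-feature observed range, which is exactly the gap the first concern describes.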