
Compact categorical cut layout to observed codes, harden distributed CatContainer #12182

Open

fingoldo wants to merge 1 commit into dmlc:master from fingoldo:fix-categorical-sparse-codes

Conversation

@fingoldo commented on Apr 25, 2026

Closes #12177

Summary

A polars Categorical built against a primed StringCache can hold a few unique strings at very large physical codes (e.g. an observed code of 2_526_058 with only 51 unique values, per the original reproducer). AddCategories / MakeCuts emitted a dense [0..max_observed_code] cut layout per categorical feature, blowing up cut_values to O(max_code) and triggering a silent process kill on Windows (STATUS_ACCESS_VIOLATION, exit code 0xC0000005) when the saved-tree bitfield was sized off cuts.MaxCategory()+1.

This PR stores one cut per observed code and propagates per-feature compact sizing through MaxNumBinPerFeat() and the saved-tree path.
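For intuition, here is a minimal, self-contained sketch (not xgboost's actual API; the numbers mirror the reproducer) of how the two layouts size the cut array for one categorical feature:

```cpp
// Standalone illustration (not xgboost's API): how the two layouts size the
// cut array for one categorical feature whose observed codes are sparse.
#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

int main() {
  // 51 unique values, one of them at a huge physical code -- the
  // primed-StringCache shape from the reproducer in #12177.
  std::set<std::int64_t> observed;
  for (std::int64_t c = 0; c < 50; ++c) observed.insert(c);
  observed.insert(2526058);

  // Old (dense) layout: one cut per code in [0, max_observed_code].
  std::int64_t const max_code = *observed.rbegin();
  std::int64_t const dense_cuts = max_code + 1;  // O(max_code) floats

  // New (compact) layout: one cut per observed code only.
  std::vector<std::int64_t> const compact(observed.begin(), observed.end());

  std::cout << "dense layout:   " << dense_cuts << " cuts\n";      // 2526059
  std::cout << "compact layout: " << compact.size() << " cuts\n";  // 51
  return 0;
}
```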

What changed

  • src/common/quantile.{cc,cu}: AddCategories writes one cut per observed code; the column-wise empty-worker path emits a placeholder cut so downstream sizing never sees the -1.f sentinel.
  • src/data/gradient_index.h: MaxNumBinPerFeat() returns the widest per-feature bin range scanned over cut.Ptrs() (correct upper bound under compact non-dense layouts).
  • src/tree/{hist/evaluate_splits.h,updater_gpu_hist.cu}: saved-tree cat_bits sized via the new common::SizeCatBitsForMaxCode using the per-feature observed max code.
  • src/data/cat_container_hash.{h,cuh}: a dual, independent 64-bit content digest (FNV-1a plus a secondary mixing pass) lets distributed runs detect divergent ref dictionaries before training; a device sibling folds the bytes on-device (a sketch of the primitive follows after this list).
  • src/data/quantile_dmatrix.{cc,cu}: digest gate via AllreduceRefAgreement + AllreduceDigestAndCheck; CPU and GPU paths share allreduce shape for this gate.
  • src/data/cat_container.{h,cc,cu}: Sort()/Copy()/Save()/Load() are mutex-guarded and idempotent via a persisted sorted_ flag; back-compat defaults preserved.
  • src/data/adapter.h, device_adapter.cuh: CachedRefMapping aliases the reference DMatrix's CatContainer instead of allocating a fresh device copy on every dispatch.
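The digest primitive is simple to show in isolation. Below is a minimal sketch of a 64-bit FNV-1a fold over dictionary bytes, as referenced from the cat_container_hash item above; the PR's actual digest pairs this with a second, independently mixed hash (not reproduced here), and the helper names are illustrative, not the PR's:

```cpp
// Illustrative only: the 64-bit FNV-1a fold that a content digest can build
// on. The PR's secondary mixing pass is not reproduced here.
#include <cstdint>
#include <initializer_list>
#include <string_view>

constexpr std::uint64_t kFnvBasis = 14695981039346656037ULL;  // FNV-1a offset basis
constexpr std::uint64_t kFnvPrime = 1099511628211ULL;         // FNV-1a 64-bit prime

std::uint64_t Fnv1a64(std::string_view bytes, std::uint64_t h = kFnvBasis) {
  for (unsigned char b : bytes) {
    h ^= b;
    h *= kFnvPrime;
  }
  return h;
}

// Digest a whole dictionary by chaining entries. The length prefix keeps
// {"ab", "c"} and {"a", "bc"} from colliding.
std::uint64_t DigestDict(std::initializer_list<std::string_view> entries) {
  std::uint64_t h = kFnvBasis;
  for (std::string_view e : entries) {
    std::uint64_t n = e.size();
    h = Fnv1a64({reinterpret_cast<char const*>(&n), sizeof n}, h);
    h = Fnv1a64(e, h);
  }
  return h;
}
```

In the distributed gate, each worker computes such a digest over its ref dictionary and the workers agree only if an allreduce shows identical values everywhere (e.g. min == max across workers); a mismatch then fails at construction instead of silently mistraining.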

Behavior change

MaxNumBinPerFeat() now scans cut.Ptrs() per feature instead of returning cut.MaxCategory()+1. The result is equal to the old value for dense-codes layouts and strictly smaller (and correct) under the compact layouts this PR produces.
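A minimal sketch of that per-feature scan, assuming a CSR-style pointer array of the kind cut.Ptrs() is described as returning (entry f is the offset of feature f's first cut):

```cpp
// Sketch of the per-feature scan over CSR-style cut pointers: feature f owns
// ptrs[f + 1] - ptrs[f] bins, so the widest range is the max adjacent gap.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

std::uint32_t MaxNumBinPerFeatSketch(std::vector<std::uint32_t> const& ptrs) {
  std::uint32_t widest = 0;
  for (std::size_t f = 1; f < ptrs.size(); ++f) {
    widest = std::max(widest, ptrs[f] - ptrs[f - 1]);
  }
  return widest;
}
// Dense layout gives a categorical feature a range of max_code + 1; the
// compact layout gives it the count of observed codes -- never larger.
```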

Performance

Swept primer_size over {1k, 50k, 200k, 500k} with n_real_cats=16, 2048 rows, 8 boosting rounds, tree_method=hist; the same xgboost.dll build was used on both sides:

| primer | CPU train, master -> ours | GPU train, master -> ours |
| ------ | ------------------------- | ------------------------- |
| 50k    | 0.52s -> 0.04s (12.9x)    | 0.76s -> 0.09s (8.5x)     |
| 500k   | 5.42s -> 0.28s (19.5x)    | 6.78s -> 0.19s (36x)      |

Peak RSS at primer=500k drops ~33% during training and ~39% during construction on both devices (CPU train: 115 -> 76 MB; GPU train: 116 -> 79 MB; CPU construct: 100 -> 60 MB; GPU construct: 99 -> 61 MB). inplace_predict() output is byte-identical between master and this PR for every primer size on both CPU and CUDA.

In the dense-codes regime (max_observed_code == n_unique_observed - 1, typical of sklearn LabelEncoder or pd.factorize() output), the new path produces the same cut_values as master and is equivalent in performance with bit-for-bit identical results.

Testing

  • testxgboost --gtest_filter='Quantile.AddCategories*:CatContainer.*:CatContainerHash.*': 20/20 pass.
  • pytest tests/python/test_with_polars.py: 15/16 pass. The single failure is test_polars_basic[QuantileDMatrix] on the boolean-column path, reproducible on clean origin/master (e0d3dfd5) with byte-identical error - pre-existing, not introduced here.
  • pytest tests/python-gpu/test_gpu_with_polars.py: 4/4 pass on my old GTX 1050 Ti.
  • pytest tests/test_distributed/test_with_dask/test_with_dask.py::test_categorical_ref_quantile_dmatrix: pass on dask LocalCluster.
  • tests/test_distributed/test_gpu_with_dask/test_gpu_with_dask.py::test_categorical_ref_quantile_dmatrix: GPU mirror authored to match the CPU version's structure but not exercised locally - the suite imports cudf at module top and RAPIDS has no Windows wheel.

Pre-existing concerns surfaced (not addressed here)

  • src/tree/gpu_hist/evaluator.cu: node_categorical_storage_size_ still sized by global MaxCategory()+1. Per-feature evaluator buffers would close the residual runtime memory hit.
  • cat_container.cu CopyTo and quantile.cu CopyTo/PruneImpl/FixError use the legacy default stream while the surrounding pipeline runs on ctx->CUDACtx()->Stream().
  • cpu_impl::MakeSketches calls SyncFeatureType (an extra Allreduce) with no GPU sibling - pre-existing CPU/GPU pipeline-shape gap; this PR's digest gate itself is shape-symmetric across CPU and GPU branches.
  • proxy_dmatrix.cuh DispatchAny's "dummy return value" branch via std::any_cast<CudfAdapter> throws std::bad_any_cast on the type_error sentinel path; latent today, as no callers opt in.
  • bst_cat_t n_total_cats_: read unlocked from HasCategorical() while Copy() writes it under sort_mu_. Aligned-int32 reads make this benign on x86/ARM but TSAN-flaggable.
  • MaxNumBinPerFeat() is now O(n_features) per call; gradient_index.cc::ResizeIndex already caches per call. A class-level cache would amortize across all callers.
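On the last point, a hypothetical sketch of the class-level cache idea; the class and member names are invented here and nothing like this is in the PR:

```cpp
// Hypothetical (not part of this PR; names invented): memoize the
// O(n_features) scan at class level and invalidate when cuts are rebuilt.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

class CutsWithCachedMaxBin {
 public:
  void SetPtrs(std::vector<std::uint32_t> ptrs) {
    ptrs_ = std::move(ptrs);
    cached_.reset();  // any rebuild invalidates the cached answer
  }
  std::uint32_t MaxNumBinPerFeat() const {
    if (!cached_) {
      std::uint32_t widest = 0;
      for (std::size_t f = 1; f < ptrs_.size(); ++f) {
        widest = std::max(widest, ptrs_[f] - ptrs_[f - 1]);
      }
      cached_ = widest;
    }
    return *cached_;
  }

 private:
  std::vector<std::uint32_t> ptrs_;
  mutable std::optional<std::uint32_t> cached_;
};
```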

A polars Categorical built against a primed StringCache holds a few
unique strings at very large physical codes. AddCategories / MakeCuts
emitted a dense [0..max_observed_code] cut layout per categorical
feature, blowing cut_values to O(max_code) and triggering
STATUS_ACCESS_VIOLATION on Windows when the saved-tree bitfield was
sized off cuts.MaxCategory()+1.

Store one cut per observed code; size MaxNumBinPerFeat() and saved-tree
cat_bits via per-feature observed range. Add a content-hash check
(AllreduceDigestAndCheck) so distributed runs surface divergent ref
dictionaries before training. Mutex-guard CatContainer Sort/Copy and
round-trip is_ref/sorted across Save/Load.

Bench (primer=500k, n_real_cats=16, 8 boost rounds, identical build):
CPU train 5.42s -> 0.28s (19.5x), GPU train 6.78s -> 0.19s (36x), peak
RSS ~33% lower during training and ~39% during construct on both
devices. Predictions byte-identical to master across both regimes.
Equivalent to master in dense-codes regime (typical sklearn
LabelEncoder or pd.factorize() output).

Closes dmlc#12177
@fingoldo force-pushed the fix-categorical-sparse-codes branch from 61267ca to 9928b41 on April 26, 2026 at 04:30
@trivialfis (Member) commented

Can we just enforce a dense encoding? Making changes across the code base is quite fragile.

@fingoldo (Author) commented

> Can we just enforce a dense encoding? Making changes across the code base is quite fragile.

I did not think about that; it might be an even better way. I got too excited with my current approach, it seems. I will have more time to implement and benchmark your suggestion by the end of this week. Can you approve running the rest of the CI workflows in the meantime? (There was one test I forgot to run locally that uncovered a bug; it should be fixed by now.)

@trivialfis (Member) commented

Thank you for the reply. We will go through the CI when we have a solution to test (approval is per commit). I don't think we will merge the PR in its current design.

@fingoldo (Author) commented

> Thank you for the reply. We will go through the CI when we have a solution to test (approval is per commit). I don't think we will merge the PR in its current design.

  1. To make sure I implement the right thing - the approach you have in mind is to auto-compact categorical codes at C++ ingestion (e.g. columnar.cc::GetArrowDictionary / CudfAdapter ctor) so xgboost transparently handles sparse input while keeping the "dense codes inside" invariant. Did I read that right?

  2. A few pieces of the current diff are independent of the sparse-codes design and don't need to be discarded with it. Are you OK if I spin them off as small focused follow-up PRs once the main one is rewritten? I think they can be valuable.

  • cat_container_hash + AllreduceDigestAndCheck - a cross-worker check that ref dictionaries agree (surfaces divergent refs at construction instead of silent mistraining)

  • ValidateCatStrArrayOffsets - host-side Arrow offset validation in the categorical adapter (front==0, monotonic, in-range)

  • CatContainer Sort/Copy/Save/Load mutex-guarding for thread-safety
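For the mutex-guarding piece, a minimal sketch of the double-checked guard pattern the PR description refers to (member names invented; the real container is also described as persisting the sorted flag across Save()/Load()):

```cpp
// Illustrative pattern only (names invented): a double-checked, mutex-guarded,
// idempotent Sort. Persisting the flag across Save()/Load() is not shown.
#include <algorithm>
#include <atomic>
#include <mutex>
#include <string>
#include <vector>

class CatContainerSketch {
 public:
  void Sort() {
    if (sorted_.load(std::memory_order_acquire)) return;  // fast path
    std::lock_guard<std::mutex> lock{mu_};
    if (sorted_.load(std::memory_order_relaxed)) return;  // lost the race
    std::sort(categories_.begin(), categories_.end());
    sorted_.store(true, std::memory_order_release);
  }

 private:
  std::mutex mu_;
  std::vector<std::string> categories_;
  std::atomic<bool> sorted_{false};
};
```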
