Compact categorical cut layout to observed codes, harden distributed CatContainer#12182
fingoldo wants to merge 1 commit into dmlc:master
Conversation
A polars Categorical built against a primed StringCache holds a few unique strings at very large physical codes. AddCategories / MakeCuts emitted a dense [0..max_observed_code] cut layout per categorical feature, blowing cut_values to O(max_code) and triggering STATUS_ACCESS_VIOLATION on Windows when the saved-tree bitfield was sized off cuts.MaxCategory()+1. Store one cut per observed code; size MaxNumBinPerFeat() and saved-tree cat_bits via per-feature observed range. Add a content-hash check (AllreduceDigestAndCheck) so distributed runs surface divergent ref dictionaries before training. Mutex-guard CatContainer Sort/Copy and round-trip is_ref/sorted across Save/Load. Bench (primer=500k, n_real_cats=16, 8 boost rounds, identical build): CPU train 5.42s -> 0.28s (19.5x), GPU train 6.78s -> 0.19s (36x), peak RSS ~33% lower during training and ~39% during construct on both devices. Predictions byte-identical to master across both regimes. Equivalent to master in dense-codes regime (typical sklearn LabelEncoder or pd.factorize() output). Closes dmlc#12177
Can we just enforce a dense encoding? Making changes across the code base is quite fragile.
I did not think about that; it might be an even better way. I got too excited with my current approach, it seems. I will have more time to implement and benchmark your suggestion by the end of this week. Can you approve running the rest of the CI workflows in the meantime (there was one test I forgot to run locally that uncovered a bug; it should be fixed by now)?
Thank you for the reply. We will go through the CI when we have a solution to test (it's approval per commit); I don't think we will merge the PR in its current design.
- `cat_container_hash` + `AllreduceDigestAndCheck`: cross-worker check that ref dictionaries agree (surfaces divergent refs at construction instead of silent mistraining).
- `ValidateCatStrArrayOffsets`: host-side Arrow offset validation in the categorical adapter (front==0, monotonic, in-range).
- `CatContainer` `Sort`/`Copy`/`Save`/`Load` mutex-guarding for thread safety.
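A minimal pure-Python sketch of the content-hash idea behind the digest check (the real `cat_container_hash` implementation is C++ and also mixes a second independent digest; only the FNV-1a constants below are standard, everything else is illustrative):

```python
def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a over a byte string."""
    h = 0xCBF29CE484222325                            # FNV offset basis
    for b in data:
        h ^= b
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2^64
    return h

def dict_digest(categories):
    # Order matters: the same strings in a different order must not collide,
    # since workers must agree on codes, not just on the set of names.
    return fnv1a_64("\x00".join(categories).encode())

# An allreduce-style agreement check: every worker must hold the same digest.
workers = [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"]]
digests = [dict_digest(d) for d in workers]
print(len(set(digests)) == 1)  # False: the third worker diverges and is flagged
```

In the real check the comparison happens via an allreduce rather than gathering all digests on one node, but the failure condition is the same: any worker whose dictionary bytes differ produces a different digest.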
Closes #12177
Summary
A polars Categorical built against a primed StringCache holds a few unique strings at very large physical codes (e.g. observed code 2_526_058 with only 51 unique values, per the original reproducer).
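To make the two code regimes concrete, here is a small pure-Python sketch (illustrative only; neither class is a real polars or XGBoost API): a shared cache primed with unrelated strings hands out large physical codes to a handful of values, while a factorize-style dense encoding keeps the max code at n_unique - 1.

```python
class ToyStringCache:
    """Toy global cache: codes reflect insertion order across ALL frames."""
    def __init__(self):
        self._codes = {}

    def encode(self, s):
        return self._codes.setdefault(s, len(self._codes))

def factorize(values):
    """Dense, per-frame encoding in the spirit of pd.factorize."""
    mapping = {}
    return [mapping.setdefault(v, len(mapping)) for v in values]

cache = ToyStringCache()
for i in range(500_000):           # prime the cache with unrelated strings
    cache.encode(f"primer_{i}")

sparse = [cache.encode(s) for s in ("a", "b", "c")]
dense = factorize(["a", "b", "c"])

print(sparse)  # [500000, 500001, 500002]: 3 uniques, huge physical codes
print(dense)   # [0, 1, 2]: max code == n_unique - 1
# A dense [0..max_observed_code] cut layout would allocate 500_003 cuts for
# `sparse`, versus 3 under a compact one-cut-per-observed-code layout.
```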
`AddCategories`/`MakeCuts` emitted a dense `[0..max_observed_code]` cut layout per categorical feature, blowing `cut_values` to O(max_code) and triggering a silent process kill on Windows (STATUS_ACCESS_VIOLATION, exit code 3221226505/0xC0000005) when the saved-tree bitfield was sized off `cuts.MaxCategory()+1`. This PR stores one cut per observed code and propagates per-feature compact sizing through
`MaxNumBinPerFeat()` and the saved-tree path.

What changed
- `src/common/quantile.{cc,cu}`: `AddCategories` writes one cut per observed code; the column-wise empty-worker path emits a placeholder cut so downstream sizing never sees the `-1.f` sentinel.
- `src/data/gradient_index.h`: `MaxNumBinPerFeat()` returns the widest per-feature bin range scanned over `cut.Ptrs()` (the correct upper bound under compact non-dense layouts).
- `src/tree/{hist/evaluate_splits.h,updater_gpu_hist.cu}`: saved-tree `cat_bits` sized via the new `common::SizeCatBitsForMaxCode` using the per-feature observed max code.
- `src/data/cat_container_hash.{h,cuh}`: dual independent 64-bit content digest (FNV-1a + secondary mixing). Lets distributed runs detect divergent ref dictionaries before training. The device sibling folds bytes on-device.
- `src/data/quantile_dmatrix.{cc,cu}`: digest gate via `AllreduceRefAgreement` + `AllreduceDigestAndCheck`; CPU and GPU paths share allreduce shape for this gate.
- `src/data/cat_container.{h,cc,cu}`: `Sort()`/`Copy()`/`Save()`/`Load()` are mutex-guarded and idempotent via a persisted `sorted_flag`; back-compat defaults preserved.
- `src/data/adapter.h`, `device_adapter.cuh`: `CachedRefMapping` aliases the reference DMatrix's `CatContainer` instead of allocating a fresh device copy on every dispatch.

Behavior change
`MaxNumBinPerFeat()` now scans `cut.Ptrs()` per feature instead of using `cut.MaxCategory()+1`. It is equal to the old value in dense-codes layouts, and strictly smaller (and correct) under the compact layouts produced by this PR.

Performance
Sweep `primer_size` in {1k, 50k, 200k, 500k}, `n_real_cats=16`, 2048 rows, 8 boost rounds, `tree_method=hist`. Same `xgboost.dll` build for both branches. Headline timings at primer=500k: CPU train 5.42s -> 0.28s (19.5x), GPU train 6.78s -> 0.19s (36x).

Peak RSS at primer=500k drops ~33% during training and ~39% during construct on both devices (CPU train: 115 -> 76 MB; GPU train: 116 -> 79 MB; CPU construct: 100 -> 60 MB; GPU construct: 99 -> 61 MB).
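The per-feature sizing described under the behavior change above reduces to a scan over adjacent cut pointers. A sketch assuming a CSR-style pointer array (illustrative names, not XGBoost's actual signature):

```python
def max_num_bin_per_feat(cut_ptrs):
    # cut_ptrs[f]..cut_ptrs[f+1] delimit feature f's cut values (CSR-style),
    # so the widest per-feature bin range is the max adjacent difference.
    return max(b - a for a, b in zip(cut_ptrs, cut_ptrs[1:]))

# Compact layout: three features with 51, 2, and 7 observed categories.
print(max_num_bin_per_feat([0, 51, 53, 60]))  # 51, not max_code + 1
```

Under a dense layout the same call would have been bounded by the global max category, which is what made `cut_values` balloon in the primed-cache case.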
`inplace_predict()` output is byte-identical between master and this PR for every primer size on both CPU and CUDA. In the dense-codes regime (`max_observed_code == n_unique_observed - 1`, typical sklearn `LabelEncoder` or `pd.factorize()` output) the new path produces the same `cut_values` as master; equivalent in performance and bit-for-bit results.

Testing

- `testxgboost --gtest_filter='Quantile.AddCategories*:CatContainer.*:CatContainerHash.*'`: 20/20 pass.
- `pytest tests/python/test_with_polars.py`: 15/16 pass. The single failure is `test_polars_basic[QuantileDMatrix]` on the boolean-column path, reproducible on clean `origin/master` (e0d3dfd5) with a byte-identical error: pre-existing, not introduced here.
- `pytest tests/python-gpu/test_gpu_with_polars.py`: 4/4 pass on my old GTX 1050 Ti.
- `pytest tests/test_distributed/test_with_dask/test_with_dask.py::test_categorical_ref_quantile_dmatrix`: pass on a dask `LocalCluster`.
- `tests/test_distributed/test_gpu_with_dask/test_gpu_with_dask.py::test_categorical_ref_quantile_dmatrix`: GPU mirror authored to match the CPU version's structure but not exercised locally; the suite imports `cudf` at module top and RAPIDS has no Windows wheel.

Pre-existing concerns surfaced (not addressed here)
- `src/tree/gpu_hist/evaluator.cu`: `node_categorical_storage_size_` still sized by the global `MaxCategory()+1`. Per-feature evaluator buffers would close the residual runtime memory hit.
- `cat_container.cu` `CopyTo` and `quantile.cu` `CopyTo`/`PruneImpl`/`FixError` use the legacy default stream while the surrounding pipeline runs on `ctx->CUDACtx()->Stream()`.
- `cpu_impl::MakeSketches` calls `SyncFeatureType` (an extra `Allreduce`) with no GPU sibling: a pre-existing CPU/GPU pipeline-shape gap; this PR's digest gate itself is shape-symmetric across CPU and GPU branches.
- `proxy_dmatrix.cuh` `DispatchAny`'s "dummy return value" branch via `std::any_cast<CudfAdapter>` throws `std::bad_any_cast` on the `type_error` sentinel path; latent (today no callers opt in).
- `bst_cat_t n_total_cats_`: read unlocked from `HasCategorical()` while `Copy()` writes it under `sort_mu_`. Aligned int32 reads make this benign on x86/ARM but TSAN-flaggable.
- `MaxNumBinPerFeat()` is now O(n_features) per call; `gradient_index.cc::ResizeIndex` already caches per call. A class-level cache would amortize across all callers.
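For scale, the residual cost in the first item above can be sketched with the bitfield arithmetic (an illustrative mirror of what a helper like `common::SizeCatBitsForMaxCode` must compute; the 32-bit word width is an assumption here, not taken from the source):

```python
BITS_PER_WORD = 32  # assumed word width for the categorical bitfield

def cat_bits_words(max_code):
    # One bit per category code 0..max_code, rounded up to whole words.
    return max_code // BITS_PER_WORD + 1

print(cat_bits_words(50))         # 2 words for 51 observed categories
print(cat_bits_words(2_526_058))  # 78940 words when sized off the global max code
```

Sizing evaluator buffers per feature instead of off the global max code would shrink that last figure back toward the per-feature observed range, which is exactly the gap the first concern describes.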