
fix: pandas 3.0 compatibility and ADOPT optimizer health-check (#374)#403

Merged
Demirrr merged 3 commits into develop from feature/374-pandas3-compat
Apr 29, 2026

Conversation


@Demirrr Demirrr commented Apr 29, 2026

  • Fix entity/relation CSV loading: use a per-column dtype dict instead of dtype=str, so that pandas>=3.0 does not coerce the integer row index to strings (dicee/static_funcs.py)
  • Lift the pandas version cap from <=2.3.3 to >=2.1.0 in setup.py
  • Fix the ADOPT optimizer: guard the _cuda_graph_capture_health_check() call with a hasattr fallback for cross-version PyTorch compatibility
  • Add regression tests (tests/test_regression_pandas3.py):
    • unit tests for integer-keyed entity/relation vocab loading
    • a test documenting the root cause of the dtype=str index coercion
    • an end-to-end training + KGE load cycle asserting integer indices

Demirrr added 3 commits April 29, 2026 08:23
- Fix entity/relation CSV loading: use per-column dtype dict instead of
  dtype=str to prevent pandas>=3.0 from coercing the integer row-index
  to strings (dicee/static_funcs.py)
- Lift pandas version cap from <=2.3.3 to >=2.1.0 in setup.py
- Fix ADOPT optimizer: call _cuda_graph_capture_health_check() with
  hasattr fallback for cross-version PyTorch compatibility
- Add regression tests (tests/test_regression_pandas3.py):
  - unit tests for integer-keyed entity/relation vocab loading
  - a test documenting the root cause of the dtype=str index coercion
  - an end-to-end training + KGE load cycle asserting integer indices
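The hasattr guard for the optimizer health check mentioned above can be sketched as below. The class is a minimal stand-in, not the actual ADOPT implementation; only the guarded call pattern is the point:

```python
# Sketch of the cross-version guard (illustrative optimizer, not dicee's ADOPT).
# Optimizer._cuda_graph_capture_health_check() is a private PyTorch method
# that is absent in older releases, so the call is made only if it exists.
import torch


class AdoptLike(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3):
        super().__init__(params, defaults={"lr": lr})

    @torch.no_grad()
    def step(self, closure=None):
        # Guarded call: a no-op on PyTorch versions that lack the method.
        if hasattr(self, "_cuda_graph_capture_health_check"):
            self._cuda_graph_capture_health_check()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    # Plain gradient step as a placeholder for ADOPT's update.
                    p.add_(p.grad, alpha=-group["lr"])
        return None
```

The hasattr check keeps the optimizer importable and runnable on both old and new PyTorch without pinning a version.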
pandas 3.0.2 uses StringDtype(na_value=nan) instead of object when
dtype=str is passed to read_csv. The test now checks that the index
is not an integer dtype rather than asserting a specific string dtype.
When a prior test causes a CUDA kernel crash, torch.cuda.is_available()
continues to return True but the runtime is corrupt. Lightning's
isolate_rng -> _collect_rng_states calls torch.cuda.get_rng_state_all()
whenever is_available() is True, even with accelerator='cpu', which
re-triggers the broken init and raises RuntimeError before training starts.

Changes in dicee/trainer/dice_trainer.py:
- Add _cuda_is_usable(): probes actual CUDA init via get_device_name(0)
  to detect a broken runtime without relying on is_available()
- Add _disable_cuda_in_process(): patches torch.cuda.is_available to
  return False so Lightning's RNG isolation skips CUDA state collection
- PL Trainer init: call _cuda_is_usable() before constructing the
  Trainer; if CUDA is broken, call _disable_cuda_in_process() and force
  accelerator='cpu'
- Guard get_device_name() diagnostic print in DICE_Trainer.__init__
  with try/except to prevent crash on startup

Fixes test_swa.py::TestSWA::test_k_vs_all_ema (torch.AcceleratorError:
CUDA error: unspecified launch failure)
@Demirrr Demirrr merged commit ac64984 into develop Apr 29, 2026
3 checks passed
@Demirrr Demirrr deleted the feature/374-pandas3-compat branch April 29, 2026 11:43