GitHub - mdefrance/AutoCarver: Automatic optimal discretization pipeline

AutoCarver automates supervised feature discretization (binning) to maximize statistical association with your target — using Tschuprow's T or Cramér's V — and validates the chosen bins against a held-out dev set. It supports binary classification, multiclass classification, and regression, and is widely used for credit scoring, fraud detection, and risk modeling.

Install

pip install autocarver

Quick Start

Binary classification on the Titanic dataset:

import pandas as pd
from sklearn.model_selection import train_test_split
from AutoCarver import BinaryCarver, Features

# 1. Load data
url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
data = pd.read_csv(url)
target = "Survived"

# 2. Train / dev split, stratified on the target
train, dev = train_test_split(data, test_size=0.33, random_state=42, stratify=data[target])

# 3. Declare features by type
features = Features(
    categoricals=["Sex"],
    quantitatives=["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"],
    ordinals={"Pclass": ["1", "2", "3"]},
)

# 4. Fit the carver (dev set drives the robustness checks)
carver = BinaryCarver(features=features, min_freq=0.05, max_n_mod=5)
train_processed = carver.fit_transform(train, train[target], X_dev=dev, y_dev=dev[target])
dev_processed = carver.transform(dev)

# 5. Inspect the carved buckets, target rate, and association
print(carver.summary)

# 6. Persist for later use
carver.save("titanic_carver.json")
# carver = BinaryCarver.load("titanic_carver.json")

For multiclass classification use MulticlassCarver; for regression use ContinuousCarver — the API is identical. To pre-select features by target association and inter-feature redundancy, pipe the carved output through ClassificationSelector or RegressionSelector.
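
The Quick Start passes a dev set so that the carved buckets can be checked for robustness. The sketch below illustrates the kind of check described in this README (same buckets, enough rows per bucket, same target-rate ordering on train and dev) in plain Python. It is not AutoCarver's internal implementation: `bucket_stats` and `is_robust` are hypothetical helpers written for illustration only, and the exact criteria the library applies may differ.

```python
from collections import defaultdict


def bucket_stats(buckets, targets):
    """Return {bucket: (frequency, target_rate)} for a binary 0/1 target."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for bucket, y in zip(buckets, targets):
        counts[bucket] += 1
        positives[bucket] += y
    n = len(buckets)
    return {b: (counts[b] / n, positives[b] / counts[b]) for b in counts}


def is_robust(train_stats, dev_stats, min_freq):
    """Hypothetical check: True if both samples agree on the bucketing.

    Assumed criteria (illustrative, not the library's exact rules):
    1. the same buckets appear in train and dev,
    2. every bucket holds at least min_freq of the rows in each sample,
    3. buckets sorted by target rate come out in the same order.
    """
    if set(train_stats) != set(dev_stats):
        return False
    for stats in (train_stats, dev_stats):
        if any(freq < min_freq for freq, _ in stats.values()):
            return False

    def ranking(stats):
        # Sort by target rate; break ties on the label for determinism.
        return sorted(stats, key=lambda b: (stats[b][1], str(b)))

    return ranking(train_stats) == ranking(dev_stats)


# Toy data: three buckets "a", "b", "c" of one carved feature.
train = bucket_stats(list("aaabbbcccc"), [0, 0, 1, 0, 1, 1, 1, 1, 1, 0])
dev_same_order = bucket_stats(list("aabbcc"), [0, 0, 0, 1, 1, 1])
dev_flipped = bucket_stats(list("aabbcc"), [1, 1, 0, 1, 0, 0])

print(is_robust(train, dev_same_order, min_freq=0.05))
print(is_robust(train, dev_flipped, min_freq=0.05))
```

In this toy data the first dev sample ranks the buckets like the training sample, while the second reverses the ranking, which is the "target rates flip" situation a dev set is meant to catch.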

Why AutoCarver?

Optimal supervised binning — maximizes Tschuprow's T (default) or Cramér's V between each feature and the target instead of relying on hand-tuned quantiles.
Robust to data drift — every candidate bin combination is validated on a dev set, rejecting any whose target rates flip or whose buckets fall below min_freq.
Interpretable buckets — human-readable boundaries you can audit, document, and ship to a scorecard.
Dimensionality reduction — groups under-represented modalities and caps bins per feature (max_n_mod), which is especially useful before one-hot encoding.
Feature pre-selection — ClassificationSelector / RegressionSelector rank features by target association and filter on inter-feature correlation.
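
For readers unfamiliar with the two association measures named above, here is a minimal, dependency-free sketch of their textbook definitions. Both are derived from the Pearson chi-square statistic of the contingency table between a binned feature and the target: Cramér's V divides by n * min(r - 1, c - 1), and Tschuprow's T divides by n * sqrt((r - 1) * (c - 1)), where r and c are the numbers of distinct bins and target classes. The function names are illustrative and are not part of AutoCarver's API.

```python
import math
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-square of the contingency table of two label sequences.

    Returns (chi2, number of distinct xs, number of distinct ys).
    """
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    chi2 = 0.0
    for row, row_n in row_totals.items():
        for col, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (observed[(row, col)] - expected) ** 2 / expected
    return chi2, len(row_totals), len(col_totals)


def cramers_v(xs, ys):
    """Cramér's V: sqrt(chi2 / (n * min(r - 1, c - 1)))."""
    chi2, r, c = chi_square(xs, ys)
    denom = len(xs) * min(r - 1, c - 1)
    return math.sqrt(chi2 / denom) if denom else 0.0


def tschuprows_t(xs, ys):
    """Tschuprow's T: sqrt(chi2 / (n * sqrt((r - 1) * (c - 1))))."""
    chi2, r, c = chi_square(xs, ys)
    denom = len(xs) * math.sqrt((r - 1) * (c - 1))
    return math.sqrt(chi2 / denom) if denom else 0.0


# Toy example: a feature carved into three bins against a binary target.
bins = ["low", "low", "mid", "mid", "high", "high"]
target = [0, 0, 0, 1, 1, 1]
print(cramers_v(bins, target), tschuprows_t(bins, target))
```

Because sqrt((r - 1) * (c - 1)) is never smaller than min(r - 1, c - 1), Tschuprow's T is never larger than Cramér's V on the same table, and the two coincide when the table is square.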

Documentation

Full reference, tutorials, and end-to-end notebook examples on ReadTheDocs.
