Quick start

This repository contains the code for the paper "Efficient Content-based Recommendation Model Training via Noise-aware Coreset Selection" (WWW'26), available on arXiv (arXiv:2601.10067).

Setup

Create a virtual environment

# Create a virtual environment with your preferred environment manager
# (venv is used here)
python -m venv env

# Activate environment
source env/bin/activate

# Install required packages
pip install -e '.[dev]'

Download the data and run preprocessing:

# Download Criteo dataset
wget https://go.criteo.net/criteo-research-kaggle-display-advertising-challenge-dataset.tar.gz -O dataset/criteo/criteo-research-kaggle-display-advertising-challenge-dataset.tar.gz
tar xvf dataset/criteo/criteo-research-kaggle-display-advertising-challenge-dataset.tar.gz -C dataset/criteo/

# Get data splits
zstd --decompress bin/splits.zst.zst -o bin/splits.zst
zstd --decompress bin/splits.zst -o dataset/criteo/splits.bin

# Run preprocessing
python scripts/preprocess.py criteo
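
Before starting the preprocessing command above, it can help to confirm that the download and the decompressed splits ended up where the commands put them. The following is a minimal sketch, not part of the repository: `EXPECTED_INPUTS` and `missing_inputs` are hypothetical names, and the two paths are simply copied from the commands in this section. Whether `scripts/preprocess.py` needs other files is not stated here, so the list is an assumption.

```python
from pathlib import Path
from typing import Iterable, List

# Paths copied from the download and zstd commands above (assumed layout).
EXPECTED_INPUTS = [
    "dataset/criteo/criteo-research-kaggle-display-advertising-challenge-dataset.tar.gz",
    "dataset/criteo/splits.bin",
]


def missing_inputs(paths: Iterable[str], root: str = ".") -> List[str]:
    """Return the entries of `paths` that are not regular files under `root`."""
    base = Path(root)
    return [p for p in paths if not (base / p).is_file()]
```

Calling `missing_inputs(EXPECTED_INPUTS)` from the repository root returns an empty list when both files are in place, and otherwise lists the paths that still need to be downloaded or decompressed.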

Run main algorithm

Run step 1 -- Coreset selection

python scripts/train_choose_selflc_v5.py --arch dcnv2 --dataset criteo --batch_size 8192 --data_size 0.01 --n_split 3

Run step 2 -- Denoise

python scripts/denoise.py --arch dcnv2 --dataset criteo --data_path outputs/dcnv2-criteo-0.01-v2-ablation

Run retraining

python scripts/train_subset.py --arch dcnv2 \
     --dataset criteo \
     --subset_path outputs/dcnv2-criteo-0.01-v2-ablation/hyperparam-test.pth \
     --loss selflc \
     --batch_size 8192 \
     --weight_decay 5e-4

Note: The best weight decay differs by data size. See src/const.py for the weight decay search results.
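
The three commands above share several arguments, so when switching architecture, dataset, or data size it can be convenient to build them in one place. The sketch below is illustrative only and not part of the repository: `build_pipeline_commands` is a hypothetical helper that uses only the scripts and flags shown above. The README does not say how the output directory or the subset file are named, so both are passed in explicitly and default to the values from the example commands. The weight decay default is the one used above for data size 0.01.

```python
from typing import List


def build_pipeline_commands(
    arch: str = "dcnv2",
    dataset: str = "criteo",
    data_size: str = "0.01",
    # Assumption: output location as it appears in the example commands.
    output_dir: str = "outputs/dcnv2-criteo-0.01-v2-ablation",
    subset_file: str = "hyperparam-test.pth",
    batch_size: int = 8192,
    n_split: int = 3,
    weight_decay: str = "5e-4",  # depends on data size, see src/const.py
) -> List[List[str]]:
    """Return argv lists for coreset selection, denoising, and retraining."""
    select = [
        "python", "scripts/train_choose_selflc_v5.py",
        "--arch", arch,
        "--dataset", dataset,
        "--batch_size", str(batch_size),
        "--data_size", data_size,
        "--n_split", str(n_split),
    ]
    denoise = [
        "python", "scripts/denoise.py",
        "--arch", arch,
        "--dataset", dataset,
        "--data_path", output_dir,
    ]
    retrain = [
        "python", "scripts/train_subset.py",
        "--arch", arch,
        "--dataset", dataset,
        "--subset_path", f"{output_dir}/{subset_file}",
        "--loss", "selflc",
        "--batch_size", str(batch_size),
        "--weight_decay", weight_decay,
    ]
    return [select, denoise, retrain]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order, so that a failure in one step stops the later ones. Printing the lists first with `shlex.join(cmd)` gives a dry run.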

Citations

If you find this repo helpful, please cite the paper below:

@article{tran2026efficient,
  title={Efficient Content-based Recommendation Model Training via Noise-aware Coreset Selection},
  author={Tran, Hung Vinh and Chen, Tong and Wen, Hechuan and Nguyen, Quoc Viet Hung and Cui, Bin and Yin, Hongzhi},
  journal={arXiv preprint arXiv:2601.10067},
  year={2026}
}

Acknowledgement

This code is based on:
