Could you please provide some clarification on the differences between, and how to choose between, `xgboost_ray.train` + `xgboost_ray.RayDMatrix` and `ray.train.xgboost.XGBoostTrainer` + `ray.data.Dataset`?
My use case is running Ray Tune on Azure Databricks, which operates on Spark. According to the Databricks docs, one creates a Ray Cluster using the Ray on Spark API, and creates a Ray Dataset from Parquet files.
Below are the questions I would like clarification on. Any help you could provide would be greatly appreciated.
Data
According to the `README.md`, one can create a `RayDMatrix` from either Parquet files or a Ray `Dataset`:
> ### Data sources
>
> The following data sources can be used with a `RayDMatrix` object.
>
> | Type | Centralized loading | Distributed loading |
> |------------------------------------------------------------------|---------------------|---------------------|
> | Numpy array | Yes | No |
> | Pandas dataframe | Yes | No |
> | Single CSV | Yes | No |
> | Multi CSV | Yes | Yes |
> | Single Parquet | Yes | No |
> | Multi Parquet | Yes | Yes |
> | [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html) | Yes | Yes |
> | [Petastorm](https://github.com/uber/petastorm) | Yes | Yes |
> | [Dask dataframe](https://docs.dask.org/en/latest/dataframe.html) | Yes | Yes |
> | [Modin dataframe](https://modin.readthedocs.io/en/latest/) | Yes | Yes |

(`xgboost_ray/README.md`, lines 450 to 465 at commit e904925)
So if using `xgboost_ray`, should I
- create a Ray `Dataset` from Parquet files, then create a `RayDMatrix` from that `Dataset`, or
- create the `RayDMatrix` directly from Parquet files?
Training
Should I use Ray Tune with `XGBoostTrainer` or with `xgboost_ray.train`, running on this Ray-on-Spark cluster?
I also intend to implement CV with early stopping. Since `tune-sklearn` is now deprecated, I understand that I'll need to implement this myself. As explained in ray-project/ray#21848 (comment), this can be done with `ray.tune.stopper.TrialPlateauStopper`. But according to #301 we can also use XGBoost's native `xgb.callback.EarlyStopping`. Which approach would you recommend? Can `TrialPlateauStopper` be used with `xgboost_ray`?
Thank you very much for any help you can offer.