15 changes: 14 additions & 1 deletion src/graphnet/models/easy_model.py
@@ -36,6 +36,7 @@ def __init__(
scheduler_class: Optional[type] = None,
scheduler_kwargs: Optional[Dict] = None,
scheduler_config: Optional[Dict] = None,
log_train_loss_on_step: bool = False,
Collaborator

@christianlocatelli christianlocatelli May 26, 2026

The variable could be renamed to also_log_train_loss_per_step. This would immediately clarify that it is an additional option for logging the per-batch loss under a different key.

Suggested change
- log_train_loss_on_step: bool = False,
+ also_log_train_loss_per_step: bool = False,

It could also be useful to add a docstring explaining the arguments of __init__(), especially also_log_train_loss_per_step.

    """
    Args:
        also_log_train_loss_per_step:
            If `True`, logs an additional per-batch metric (`train_loss_step`)
            alongside the existing per-epoch metric (`train_loss`). This can
            be useful for debugging training instabilities or monitoring
            convergence within long epochs.
    """

) -> None:
"""Construct `StandardModel`."""
# Base class constructor
@@ -52,6 +53,7 @@ def __init__(
self._scheduler_class = scheduler_class
self._scheduler_kwargs = scheduler_kwargs or dict()
self._scheduler_config = scheduler_config or dict()
self._log_train_loss_on_step = log_train_loss_on_step
Collaborator

@christianlocatelli christianlocatelli May 26, 2026

Suggested change
- self._log_train_loss_on_step = log_train_loss_on_step
+ self._also_log_train_loss_per_step = also_log_train_loss_per_step


self.validate_tasks()

@@ -243,15 +245,26 @@ def training_step(
if isinstance(train_batch, Data):
train_batch = [train_batch]
loss = self.shared_step(train_batch, batch_idx)
batch_size = self._get_batch_size(train_batch)
self.log(
"train_loss",
loss,
batch_size=self._get_batch_size(train_batch),
batch_size=batch_size,
prog_bar=True,
on_epoch=True,
on_step=False,
Collaborator

Then expose the logging settings as arguments of the class. This removes the duplicated self.log call, so we no longer have to define a separate batch_size variable.

Suggested change
- on_epoch=True,
- on_step=False,
+ on_epoch=log_on_epoch,
+ on_step=log_on_step,

Collaborator Author

Makes sense, but how do we want to handle logging of the validation loss? To me, log_on_epoch and log_on_step sound like they apply to both the validation and the training loss, which I think could also be valid (personally, I log the training loss on both epoch and step, and the validation loss only on epoch). As long as we agree on something, I think either way is fine.

Collaborator

@Aske-Rosted Aske-Rosted May 29, 2026

I think it is fine to have the logging of the validation and training loss behave in the same way. In principle, we could separate the arguments for validation and training, but I think that would be a few too many arguments, and we would be moving towards cases where people should just create their own PyTorch Lightning callbacks.
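
The single-call approach agreed on in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: `LoggingSketch`, `_log_loss`, the recording `log` stub, and the `val_loss` key are hypothetical names, and the class stands in for the real model so the snippet runs without PyTorch Lightning installed.

```python
from typing import Any, Dict, List


class LoggingSketch:
    """Hypothetical stand-in for the model; not the real GraphNeT class."""

    def __init__(
        self, log_on_epoch: bool = True, log_on_step: bool = False
    ) -> None:
        self._log_on_epoch = log_on_epoch
        self._log_on_step = log_on_step
        # Records what would have been passed to the logger.
        self.calls: List[Dict[str, Any]] = []

    def log(self, name: str, value: float, **kwargs: Any) -> None:
        # Stub that only records the call; in the real model this would be
        # the `self.log` method provided by the Lightning module.
        self.calls.append({"name": name, "value": value, **kwargs})

    def _log_loss(self, name: str, loss: float, batch_size: int) -> None:
        # One call driven by the two flags, so no second `self.log` block
        # and no separate `batch_size` variable are needed.
        self.log(
            name,
            loss,
            batch_size=batch_size,
            prog_bar=True,
            on_epoch=self._log_on_epoch,
            on_step=self._log_on_step,
            sync_dist=True,
        )

    def training_step(self, loss: float, batch_size: int) -> float:
        self._log_loss("train_loss", loss, batch_size)
        return loss

    def validation_step(self, loss: float, batch_size: int) -> float:
        # Same flags as training, as proposed in the thread.
        self._log_loss("val_loss", loss, batch_size)
        return loss


model = LoggingSketch(log_on_epoch=True, log_on_step=True)
model.training_step(0.5, batch_size=32)
model.validation_step(0.4, batch_size=32)
```

Sharing the two flags keeps the constructor small; anyone who needs different behaviour for training and validation would fall back to a custom callback, as suggested above.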

sync_dist=True,
)
if self._log_train_loss_on_step:
self.log(
"train_loss_step",
loss,
batch_size=batch_size,
prog_bar=False,
on_epoch=False,
on_step=True,
sync_dist=True,
Collaborator

@christianlocatelli christianlocatelli May 26, 2026

This might be computationally expensive with sync_dist=True.
The loss would be synced across GPUs on every batch, which quickly adds up when there are many batches. The docstring at the top should perhaps clarify that training might be slowed down. This log call could also default to sync_dist=False.

Suggested change
- if self._log_train_loss_on_step:
- self.log(
- "train_loss_step",
- loss,
- batch_size=batch_size,
- prog_bar=False,
- on_epoch=False,
- on_step=True,
- sync_dist=True,
+ if self._also_log_train_loss_per_step:
+ self.log(
+ "train_loss_step",
+ loss,
+ batch_size=batch_size,
+ prog_bar=False,
+ on_epoch=False,
+ on_step=True,
+ sync_dist=True,

Collaborator Author

Good point! Let's set sync_dist to False as the default.
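
To make the cost argument concrete, here is a rough sketch that counts how many step-level sync requests the extra log call would issue. It rests on the simplifying assumption raised in the review, namely one cross-GPU reduction per batch when on_step=True and sync_dist=True; epoch-level syncing is deliberately not counted. `SyncCountingSketch`, `sync_per_step`, and `count_step_syncs` are hypothetical names, and the `log` method is a stub rather than Lightning's implementation.

```python
from typing import Any


class SyncCountingSketch:
    """Hypothetical stand-in that counts step-level sync requests."""

    def __init__(
        self,
        also_log_train_loss_per_step: bool = False,
        sync_per_step: bool = False,  # assumed extra knob, not part of the PR
    ) -> None:
        self._also_log_train_loss_per_step = also_log_train_loss_per_step
        self._sync_per_step = sync_per_step
        self.sync_requests = 0

    def log(
        self,
        name: str,
        value: float,
        *,
        on_step: bool,
        sync_dist: bool,
        **kwargs: Any,
    ) -> None:
        # Simplifying assumption: every step-level call with sync_dist=True
        # costs one cross-device reduction.
        if on_step and sync_dist:
            self.sync_requests += 1

    def training_step(self, loss: float, batch_size: int) -> float:
        # Existing per-epoch metric, unchanged.
        self.log(
            "train_loss",
            loss,
            batch_size=batch_size,
            on_epoch=True,
            on_step=False,
            sync_dist=True,
        )
        # Optional per-batch metric under a different key.
        if self._also_log_train_loss_per_step:
            self.log(
                "train_loss_step",
                loss,
                batch_size=batch_size,
                prog_bar=False,
                on_epoch=False,
                on_step=True,
                sync_dist=self._sync_per_step,
            )
        return loss


def count_step_syncs(num_batches: int, sync_per_step: bool) -> int:
    """Run a fake epoch and return the number of step-level sync requests."""
    model = SyncCountingSketch(
        also_log_train_loss_per_step=True, sync_per_step=sync_per_step
    )
    for _ in range(num_batches):
        model.training_step(0.1, batch_size=32)
    return model.sync_requests
```

Under this assumption the number of sync requests grows linearly with the number of batches when syncing is enabled for the per-step call and stays at zero otherwise, which is the motivation for defaulting to sync_dist=False there.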

)

current_lr = self.trainer.optimizers[0].param_groups[0]["lr"]
self.log("lr", current_lr, prog_bar=True, on_step=True)