About experiment details #1

@Dutch-voyage

Hi, this is a great extension for reproducing R1. From the experiment results shown in the wandb report, it seems that the full set of hyperparameters has not been made public.
Would you mind sharing them all?
Additionally, there is an interesting claim about the choice of RL algorithm: as stated, REINFORCE++ is more stable than GRPO. In the wandb report, however, GRPO achieves the highest test score. In my personal experience, REINFORCE++ and GRPO do not differ much on this task. The kk-logic data is highly similar across samples, so computing the reward mean and std within a group rather than across the batch makes no significant difference. Any ideas?
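
To make the group-versus-batch point concrete, here is a minimal sketch of the two normalization schemes. It is a simplification for illustration, not this repository's code: it only standardizes scalar rewards and leaves out KL penalties, clipping, and token-level credit assignment. The function names and the reward values are hypothetical.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style: standardize each prompt's rewards within its own group.

    rewards has shape (num_prompts, samples_per_prompt).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def batch_normalized_advantages(rewards, eps=1e-6):
    """Batch-level: standardize with statistics over every sample in the batch."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def max_gap(rewards):
    """Largest absolute difference between the two advantage estimates."""
    diff = group_normalized_advantages(rewards) - batch_normalized_advantages(rewards)
    return float(np.abs(diff).max())

# Hypothetical rewards: every prompt has the same success rate, so each group's
# statistics coincide with the batch statistics.
similar = np.array([[1.0, 0.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0, 1.0]])

# Hypothetical rewards: one easy prompt and one hard prompt, so group and batch
# statistics diverge.
mixed = np.array([[1.0, 1.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])

gap_similar = max_gap(similar)
gap_mixed = max_gap(mixed)
print(f"similar prompts: {gap_similar:.4f}, mixed prompts: {gap_mixed:.4f}")
```

When the prompts have similar difficulty, the two schemes produce essentially the same advantages. They separate only when per-prompt reward statistics differ, which is the situation where the choice between them would be expected to matter.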
