Hi, great work on this extension for reproducing R1. Judging from the experiment results in the wandb report, the full set of hyperparameters does not seem to be public.
Would you mind sharing all of them?
Additionally, there is an interesting claim about the choice of RL algorithm. As stated, reinforce++ is more stable than GRPO. In the wandb report, however, GRPO achieves the highest test score. In my own experience, reinforce++ and GRPO do not differ much on this task: the kk-logic data is highly similar across samples, so computing the reward mean and std within a group or over the whole batch does not make a significant difference. Any ideas?
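
To make the group-versus-batch point concrete, here is a minimal sketch of the two normalizations. It assumes the only difference being compared is the scope over which the reward mean and std are computed (per prompt group versus the whole batch); the actual implementations may differ in other ways. The function names and reward values are made up for illustration and are not taken from the repo or the wandb report.

```python
import numpy as np


def group_normalized_advantages(rewards, eps=1e-6):
    """Normalize rewards within each prompt group (group-level mean/std).

    rewards: array of shape (num_prompts, samples_per_prompt).
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def batch_normalized_advantages(rewards, eps=1e-6):
    """Normalize rewards over the whole batch (batch-level mean/std)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Hypothetical rewards: every prompt has the same reward distribution,
# which is the "highly similar data" case.
similar = np.array([[1.0, 0.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0, 1.0]])

# Hypothetical rewards: one easy prompt and one hard prompt, so the
# per-group statistics differ from the batch statistics.
mixed = np.array([[1.0, 1.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])

similar_match = np.allclose(group_normalized_advantages(similar),
                            batch_normalized_advantages(similar))
mixed_match = np.allclose(group_normalized_advantages(mixed),
                          batch_normalized_advantages(mixed))
print("similar prompts, same advantages:", similar_match)
print("mixed prompts, same advantages:", mixed_match)
```

With these made-up rewards, the two normalizations give the same advantages when every group has the same reward statistics and diverge once prompt difficulty varies. If kk-logic is close to the first case, that would be consistent with the two algorithms behaving similarly here.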
