About experiment details

Hi, great extension for reproducing R1. From the experiment results shown in the wandb report, it seems that the full set of hyperparameters is not public.
Would you mind sharing them all?
Additionally, there is an interesting claim about the choice of RL algorithm: as stated, REINFORCE++ is more stable than GRPO. In the wandb report, however, GRPO achieves the highest test score. In my own experience, REINFORCE++ and GRPO do not differ much on this task. Because the kk-logic data is highly homogeneous, computing the reward mean and std per **group** or per **batch** makes no significant difference. Any ideas?
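To make the group-versus-batch point concrete, here is a minimal sketch of what I mean. It assumes rewards arranged as one row per prompt and one column per sampled response, and it ignores the KL term and any token-level details, so it is not the exact code of either algorithm. The function names `group_normalize` and `batch_normalize` and the toy reward values are made up for illustration.

```python
import numpy as np

def group_normalize(rewards, eps=1e-6):
    """GRPO-style: normalize each prompt's rewards by that prompt's own mean/std."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def batch_normalize(rewards, eps=1e-6):
    """Batch-level: normalize all rewards by the mean/std of the whole batch."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards, shape (num_prompts, samples_per_prompt).
# Similar prompts: every group has the same reward mean and std.
similar = np.array([[1.0, 0.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0, 1.0]])
# Dissimilar prompts: one easy prompt and one hard prompt.
dissimilar = np.array([[1.0, 1.0, 1.0, 0.0],
                       [0.0, 0.0, 0.0, 1.0]])

for name, r in [("similar", similar), ("dissimilar", dissimilar)]:
    gap = np.abs(group_normalize(r) - batch_normalize(r)).max()
    print(f"{name}: max |group - batch| advantage gap = {gap:.3f}")
```

When per-prompt reward statistics match the batch statistics, as in the first toy case, the two normalizations give the same advantages; they only diverge when prompt difficulty varies, as in the second case.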

![Image](https://github.com/user-attachments/assets/929b20cb-b0d3-4abe-b9ea-6aa43acde849)
