Skip to content

Conversation

@emergenz
Copy link
Contributor

@emergenz emergenz commented May 2, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Currently, ppo_epochs is used for both the actor, as well as the critic. While this is the case for many PPO implementations (https://github.com/EdanToledo/Stoix, https://github.com/openai/baselines), there is no inherent reason why this should be the case. This is probably due to historic reasons, since OpenAI's PPO implementation had a single loss for both actor and critic (https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py; inherent coupling of num_epochs between actor and critic).

This PR simply decouples the actor and critic epochs.

API

The only API change is that I renamed critic.ppo_epochs to critic.epochs to reflect the new behaviour.

Additional Info.

  • Training: [FSDP, Megatron]
  • Inference: []

Checklist Before Submitting

Copy link
Collaborator

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the existing config still allows custom epoch num for critic via setting critic.ppo_epochs. does it not work for you?

@emergenz
Copy link
Contributor Author

emergenz commented May 5, 2025

It does. I suppose these changes are just to make it more obvious to the user that this is possible. For context, the current code says self.config.ppo_epochs for the critic epochs. At first glance, this looks like the epochs are coupled between actor and critic.

Maybe I should rename config.critic.epochs to config.critic.critic_epochs to make the decoupling even more obvious.

Also, the default config actually does couple the two by linking the critic epochs to the PPO epochs. Of course one can just change that, but I would argue that there is no reason for this default coupling to begin with.

If you think that the default config should retain the coupling for 'backwards compatibility' with other well-known PPO implementations, then I would suggest that we just change the naming of critic.ppo_epochs to something else and leave the default config as it was before.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants