
initial skeleton#2376

Closed
daniellepintz wants to merge 1 commit into main from dp/reinforcement_learning

Conversation

@daniellepintz
Contributor

No description provided.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 12, 2026
Member

@joecummings left a comment


Stopping on comments here - essentially just make everything even simpler. Flat structure, no complex environments, minimal printing.

)

@endpoint
def evaluate_zorplex(self, num_samples: int = 10, seed: int = 42) -> dict:
Member


Put this in the controller code IMO

Member


combine all this task specific stuff into one zorplex file


@felipemello1 Feb 12, 2026


I am tempted to say that we should delete all the Zorplex code and just make it GSM8K: single turn, no tool calls or env or anything like that.

is_correct: bool


class TaskSpec(ABC):
Member


No need to generalize here - we only have one TaskSpec



@dataclass
class Trajectory:


If we agree to do single-turn + GSM8K, this can all be much simpler; maybe bring it over from forge?

Contributor


Would like to see an absolute minimal version.

Also, @allenwang28 told us there shouldn't be Trajectory, there should only be Episodes. Please educate.

Contributor

@tianyu-l left a comment


Agreed on the trimming side. Let's aim for a 200-line PR, instead of a 2000+ line one.

Contributor


Please justify the importance of this folder. If possible, replace with something absolutely minimal.




import torch
import torch.nn.functional as F

from transformers import AutoModelForCausalLM, AutoTokenizer
Contributor


Can we not depend on transformers?


we could depend only on tokenizers, which titan also uses, and have the generator be vLLM, as in titan

@allenwang28
Contributor

Thanks @daniellepintz for taking a first stab at this! I want to share more thoughts on scoping - not really about code quality, but on what the right size is for landing this quickly, then iterating from there.

The goal of this skeleton, as I see it:

  • Something we're totally fine with throwing away in 2 months
  • Useful enough that we can A/B test changes against it (e.g., what are the learning-dynamics differences with and without the unified model?)
  • Simple enough that someone unfamiliar with Forge/Monarch or the intricacies of async RL can review it in one sitting
  • Not so toy that everything works by accident, but not so complex that it takes days to run to validate.

But before that, I would start with evaluation, not training. The first thing to get right isn't the RL loop — it's confirming a concrete hypothesis about what's learnable.

My hypothesis: a model like Qwen 2.5 0.5B can already do lookups consistently — it knows how to call LOOKUP[word] and use the result. But it doesn't emit [ANSWER] tags reliably. That gap — "can use tools but can't format a final answer" — should be learnable through RL without needing SFT first, because the behavior is simple and the reward signal is unambiguous.
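The hypothesized gap could be checked mechanically in the eval. A minimal sketch, assuming the LOOKUP[word] tool-call syntax and [ANSWER]...[/ANSWER] tags described in this thread (the exact Zorplex formats here are assumptions, not the actual implementation):

```python
import re

# Hypothetical scorer for the "can use tools but can't format an answer" gap.
# The tag and tool-call syntax are assumptions taken from this discussion.
def score_completion(completion: str, expected: str) -> dict:
    used_tool = bool(re.search(r"LOOKUP\[\w+\]", completion))
    match = re.search(r"\[ANSWER\](.*?)\[/ANSWER\]", completion, re.DOTALL)
    formatted = match is not None
    correct = formatted and match.group(1).strip() == expected
    # Reward only the unambiguous signal: a correctly formatted, correct answer.
    return {
        "used_tool": used_tool,
        "formatted": formatted,
        "reward": 1.0 if correct else 0.0,
    }

print(score_completion("LOOKUP[blarn] -> 7\n[ANSWER]7[/ANSWER]", "7"))
```

If the hypothesis holds, eval runs should show `used_tool` high and `formatted` low before any RL.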

So the first milestone is: set up an evaluation framework, run Zorplex tasks at the simplest difficulty, and confirm that the model actually has this gap. If it does, that's our starting point — RL to teach answer formatting on a task where the model already has the underlying capability. Then we incrementally up the difficulty. (Another note - this naturally opens a door for curriculum learning as well down the line)

We ran into exactly this problem with GSM8K before: the model was already trained on it, so it was impossible to tell whether RL was actually teaching anything or just reformatting answers the model already knew. Zorplex avoids this because the values are random and arbitrary — the model literally cannot know the answers without using the tool.

A note on task selection more broadly: this is partially a research question. There's real work (DeepSeek R1, etc.) showing that you often need SFT first to teach format compliance, then RL to improve reasoning within that format. We probably won't find the perfect task immediately, and we should keep iterating on difficulty and task design as we go. But we don't need the perfect task to land the skeleton — we need one task where we have a testable hypothesis and a clean eval. Zorplex SimpleLookup + answer formatting is that task.

Then the skeleton becomes a single file:

  1. Zorplex SimpleLookup spec
  2. Single-turn
  3. REINFORCE training step, the simplest math that works
  4. Sync loop only (generate, train, sync weights, repeat)
  5. Weight sync is only done via load_state_dict through DCP. It's slow but it's correct, with zero extra infra
  6. Console logging of loss + reward + accuracy per step
  7. Held-out evaluation every N steps with a fixed seed, printing example trajectories so a human can see what the model is doing.
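For step 3, the "simplest math that works" could be a literal REINFORCE objective. A pure-Python sketch of the math only (the real training step would operate on torch tensors so gradients flow; the names here are illustrative, not from the PR):

```python
import math

# REINFORCE for a single sampled completion:
#   loss = -(R - b) * sum_t log pi(a_t | s_t)
# where R is the episode reward and b an optional baseline.
def reinforce_loss(token_logprobs: list[float], reward: float, baseline: float = 0.0) -> float:
    return -(reward - baseline) * sum(token_logprobs)

# A rewarded completion pushes its tokens' probabilities up when minimized;
# a reward below the baseline flips the sign and suppresses them.
loss = reinforce_loss([math.log(0.5), math.log(0.25)], reward=1.0)
```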

What we add in future PRs
These are valuable, but I'm proposing we land them in isolated PRs where the review questions are focused and answerable against the skeleton baseline:

Feature → review questions it answers:

  • Multi-turn agentic generation → Can the model learn to chain multiple tool calls and combine results?
  • Service / ServiceRegistry / everything in monarch_utils → How do we handle Monarch actor pools?
  • RDMA circular buffer weight sync, building up to torchstore → Does RDMA give the same results as state_dict, and what is the speedup?
  • Async loop + ReplayBuffer → Does async converge to the same accuracy as sync? What is the throughput gain?
  • Additional task specs / difficulty tiers → How does difficulty scaling affect learning dynamics?
  • Timeline plotting → Lands naturally with the async PR

Thoughts?

@felipemello1
Copy link

felipemello1 commented Feb 13, 2026

My only comments would be about (1) and (3)

(1) IMO we should use something that is in other frameworks, so we can compare. Do you know if any other framework uses Zorplex?

I like that we can see the curves from prime-rl here: https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/reverse_text

But, like you said, they do SFT before, which makes the example less interesting for a skeleton.

I also think that the tool_call, env abstractions, etc., can add some unnecessary overhead atm. But I can be convinced.

(3) I looked at the losses a lot, and I think that simple REINFORCE will take much longer to converge and debug. Being able to check entropy, clip ratio, and all the other metrics is also very helpful.

IMO we should just copy the DAPO/GRPO loss from forge. It's low-hanging fruit and shouldn't add complexity to the skeleton.

(7) extra: Eval might add extra complexity that we don't need atm. Looking at rewards may be good enough.

@allenwang28
Contributor

@felipemello1

(1) IMO we should use something that is in other frameworks, so we can compare. Do you know if any other framework uses Zorplex?

Fair point, AFAIK no one uses Zorplex 😁 You're right that it's useful to have a comparison point that we can sanity-check against. Maybe something like Zorplex already exists and we should consider using that.

But on the broader point of sanity-checking correctness: yes, it will be required, but perhaps we can plug it in later, after the skeleton implementation.

IMO we should just copy the DAPO/GRPO loss from forge. It's low-hanging fruit and shouldn't add complexity to the skeleton

That's fair, but a few thoughts:

  1. Ideally the task we start with should be easy enough that REINFORCE can apply. If it's not, we probably need to consider a different task
  2. GRPO builds on REINFORCE and DAPO builds on GRPO, so the upgrade route should be clear

In other words, we should do these things, but IMO as fast follows after the basic skeleton
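The upgrade route above can be made concrete: GRPO keeps REINFORCE's reward-weighted log-prob objective but replaces the raw reward with a group-normalized advantage over G samples of the same prompt. A simplified sketch (the actual forge loss also involves clipping and other details not shown here):

```python
def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Normalize rewards within one prompt's sample group:
    #   A_i = (r_i - mean) / (std + eps)
    # REINFORCE is recovered by using the raw reward directly instead.
    mean = sum(group_rewards) / len(group_rewards)
    std = (sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# Two correct and two incorrect samples: correct ones get positive advantage.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```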

@felipemello1

Maybe something like Zorplex already exists and we should consider using that.

I think it's worth checking, and if we can get away with not having to think about env/tools for the basic skeleton, I think it can make the PR simpler; we can discuss tool/env abstractions in a follow-up, IMO.

in other words, we should do these things but IMO as fast follows after the basic skeleton

agreed

@daniellepintz
Contributor Author

Honestly, I really like the Zorplex task. I think it is very clear and easily understandable, and I like that there is no way the model could have memorized it beforehand. I also like defining it right in the source code rather than using something external, because again I think it makes it more understandable for users.

@daniellepintz
Contributor Author

@allenwang28 thanks for your comments! Agree with everything, and it makes sense to separate out the skeleton and the subsequent PRs as you mentioned

@tianyu-l
Contributor

@allenwang28 thanks a lot for helping understand the selection of tasks!

After studying the new PR #2381, I realized that even the in-house Zorplex tasks assume HF transformers model protocols (such as model.generate, tokenizer.apply_chat_template), which seems to break the unified-model assumption today. cc @wwwjn

@daniellepintz
Contributor Author

Closing in favor of #2381



5 participants