
initial skeleton#2376

Closed
daniellepintz wants to merge 1 commit into main from dp/reinforcement_learning

Conversation

@daniellepintz
Contributor

No description provided.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 12, 2026
Member

@joecummings left a comment


Stopping on comments here - essentially just make everything even simpler. Flat structure, no complex environments, minimal printing.

)

@endpoint
def evaluate_zorplex(self, num_samples: int = 10, seed: int = 42) -> dict:
Member


Put this in the controller code IMO

Member


combine all this task specific stuff into one zorplex file


@felipemello1 Feb 12, 2026


I am tempted to say that we should delete all the Zorplex code and just make it GSM8K: single turn, no tool calls or env or anything like that.

is_correct: bool


class TaskSpec(ABC):
Member


No need to generalize here - we only have one TaskSpec



@dataclass
class Trajectory:


If we agree to do single-turn + GSM8K, this can all be much simpler; maybe bring it over from forge?

Contributor


Would like to see an absolute minimal version.

Also, @allenwang28 told us there shouldn't be Trajectory, there should only be Episodes. Please educate.

Contributor

@tianyu-l left a comment


Agreed on the trimming side. Let's aim for a 200-line PR, instead of a 2000+ line one.

Contributor


Please justify the importance of this folder. If possible, replace with something absolutely minimal.




import torch
import torch.nn.functional as F

from transformers import AutoModelForCausalLM, AutoTokenizer
Contributor


Can we not depend on transformers?


we could depend only on tokenizers, which titan also uses, and have the generator be vLLM, as in titan

@allenwang28
Contributor

Thanks @daniellepintz for taking a first stab at this! I want to share more thoughts on scoping - not really about code quality, but on what the right size is for landing this quickly, then iterating from there.

The goal of this skeleton, as I see it:

  • Something we're totally fine with throwing away in 2 months
  • Useful enough that we can A/B test changes against it (e.g., what are the learning-dynamics differences with and without the unified model?)
  • Simple enough that someone unfamiliar with Forge/Monarch or the intricacies of async RL can review it in one sitting
  • Not so toy that everything works by accident, but not so complex that it takes days to run to validate.

But before that, I would start with evaluation, not training. The first thing to get right isn't the RL loop — it's confirming a concrete hypothesis about what's learnable.

My hypothesis: a model like Qwen 2.5 0.5B can already do lookups consistently — it knows how to call LOOKUP[word] and use the result. But it doesn't emit [ANSWER] tags reliably. That gap — "can use tools but can't format a final answer" — should be learnable through RL without needing SFT first, because the behavior is simple and the reward signal is unambiguous.
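The hypothesized gap could be checked mechanically in the eval. A minimal sketch, assuming the LOOKUP[word] tool-call syntax and [ANSWER]...[/ANSWER] tags described in this thread (the exact Zorplex formats here are assumptions, not the actual implementation):

```python
import re

# Hypothetical scorer for the "can use tools but can't format an answer" gap.
# The tag and tool-call syntax are assumptions taken from this discussion.
def score_completion(completion: str, expected: str) -> dict:
    used_tool = bool(re.search(r"LOOKUP\[\w+\]", completion))
    match = re.search(r"\[ANSWER\](.*?)\[/ANSWER\]", completion, re.DOTALL)
    formatted = match is not None
    correct = formatted and match.group(1).strip() == expected
    # Reward only the unambiguous signal: a correctly formatted, correct answer.
    return {
        "used_tool": used_tool,
        "formatted": formatted,
        "reward": 1.0 if correct else 0.0,
    }

print(score_completion("LOOKUP[blarn] -> 7\n[ANSWER]7[/ANSWER]", "7"))
```

If the hypothesis holds, eval runs should show `used_tool` high and `formatted` low before any RL.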

So the first milestone is: set up an evaluation framework, run Zorplex tasks at the simplest difficulty, and confirm that the model actually has this gap. If it does, that's our starting point — RL to teach answer formatting on a task where the model already has the underlying capability. Then we incrementally up the difficulty. (Another note - this naturally opens a door for curriculum learning as well down the line)

We ran into exactly this problem with GSM8K before: the model was already trained on it, so it was impossible to tell whether RL was actually teaching anything or just reformatting answers the model already knew. Zorplex avoids this because the values are random and arbitrary — the model literally cannot know the answers without using the tool.

A note on task selection more broadly: this is partially a research question. There's real work (DeepSeek R1, etc.) showing that you often need SFT first to teach format compliance, then RL to improve reasoning within that format. We probably won't find the perfect task immediately, and we should keep iterating on difficulty and task design as we go. But we don't need the perfect task to land the skeleton — we need one task where we have a testable hypothesis and a clean eval. Zorplex SimpleLookup + answer formatting is that task.

Then the skeleton becomes a single file:

  1. Zorplex SimpleLookup spec
  2. Single-turn
  3. REINFORCE training step, the simplest math that works
  4. Sync loop only (generate, train, sync weights, repeat)
  5. Weight sync is only done via load_state_dict through DCP. It's slow but it's correct, with zero extra infra
  6. Console logging of loss + reward + accuracy per step
  7. Held-out evaluation every N steps with a fixed seed, printing example trajectories so a human can see what the model is doing.
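For step 3, the "simplest math that works" could be a literal REINFORCE objective. A pure-Python sketch of the math only (the real training step would operate on torch tensors so gradients flow; the names here are illustrative, not from the PR):

```python
import math

# REINFORCE for a single sampled completion:
#   loss = -(R - b) * sum_t log pi(a_t | s_t)
# where R is the episode reward and b an optional baseline.
def reinforce_loss(token_logprobs: list[float], reward: float, baseline: float = 0.0) -> float:
    return -(reward - baseline) * sum(token_logprobs)

# A rewarded completion pushes its tokens' probabilities up when minimized;
# a reward below the baseline flips the sign and suppresses them.
loss = reinforce_loss([math.log(0.5), math.log(0.25)], reward=1.0)
```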

What we add in future PRs
These are valuable, but I'm proposing we land them in isolated PRs where the review questions are focused and answerable against the skeleton baseline:

Feature → review questions it answers:

  • Multi-turn agentic generation → Can the model learn to chain multiple tool calls and combine results?
  • Service / ServiceRegistry / everything in monarch_utils → How do we handle Monarch actor pools?
  • RDMA circular buffer weight sync, building up to torchstore → Does RDMA give the same results as state_dict, and what is the speedup?
  • Async loop + ReplayBuffer → Does async converge to the same accuracy as sync? What is the throughput gain?
  • Additional task specs / difficulty tiers → How does difficulty scaling affect learning dynamics?
  • Timeline plotting → Lands naturally with the async PR

Thoughts?

@felipemello1
Copy link

felipemello1 commented Feb 13, 2026

My only comments would be about (1) and (3)

(1) IMO we should use something that is in other frameworks, so we can compare. Do you know if any other framework uses Zorplex?

I like that we can see the curves from prime-rl here: https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/reverse_text

But, like you said, they do SFT before, which makes the example less interesting for a skeleton.

I also think that the tool_call, env abstractions, etc., can add some unnecessary overhead atm. But I can be convinced.

(3) I looked at the losses a lot, and I think that simple REINFORCE will take much longer to converge and debug. Being able to check entropy, clip ratio, and all the other metrics is also very helpful.

IMO we should just copy the DAPO/GRPO loss from forge. It's low-hanging fruit and shouldn't add complexity to the skeleton.

(7) extra: Eval might add extra complexity that we don't need atm. Looking at rewards may be good enough.

@allenwang28
Contributor

@felipemello1

(1) IMO we should use something that is in other frameworks, so we can compare. Do you know if any other framework uses Zorplex?

Fair point, AFAIK no one uses Zorplex 😁 You're right that it's useful to have a comparison point that we can sanity-check against. Maybe something like Zorplex already exists and we should consider using that.

But on the broader point of sanity-checking correctness: yes, it will be required, but perhaps we can plug it in later, after the skeleton implementation.

IMO we should just copy the DAPO/GRPO loss from forge. It's low-hanging fruit and shouldn't add complexity to the skeleton

That's fair, but a few thoughts:

  1. Ideally the task we start with should be easy enough that REINFORCE can apply. If it's not, we probably need to consider a different task
  2. GRPO builds on REINFORCE and DAPO builds on GRPO, so the upgrade route should be clear

In other words, we should do these things, but IMO as fast follows after the basic skeleton
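The upgrade route above can be made concrete: GRPO keeps REINFORCE's reward-weighted log-prob objective but replaces the raw reward with a group-normalized advantage over G samples of the same prompt. A simplified sketch (the actual forge loss also involves clipping and other details not shown here):

```python
def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Normalize rewards within one prompt's sample group:
    #   A_i = (r_i - mean) / (std + eps)
    # REINFORCE is recovered by using the raw reward directly instead.
    mean = sum(group_rewards) / len(group_rewards)
    std = (sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# Two correct and two incorrect samples: correct ones get positive advantage.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```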

@felipemello1

Maybe something like Zorplex already exists and we should consider using that.

I think it's worth checking, and if we can get away with not having to think about env/tools for the basic skeleton, I think it can make the PR simpler; we can discuss tool/env abstractions in a follow-up, IMO.

in other words, we should do these things but IMO as fast follows after the basic skeleton

agreed

@daniellepintz
Contributor Author

Honestly, I really like the Zorplex task. I think it is very clear and easily understandable, and I like that there is no way the model could have memorized it beforehand. I also like defining it right in the source code rather than using something external, because again I think it makes it more understandable for users.

@daniellepintz
Contributor Author

@allenwang28 thanks for your comments! Agree with everything, and it makes sense to separate out the skeleton and the subsequent PRs as you mentioned

@tianyu-l
Contributor

@allenwang28 thanks a lot for helping understand the selection of tasks!

After studying the new PR #2381, I realized that even the in-house Zorplex tasks assume HF transformers model protocols (such as model.generate, tokenizer.apply_chat_template), which seems to break the unified-model assumption today. cc @wwwjn

@daniellepintz
Contributor Author

Closing in favor of #2381



5 participants