
Conversation

@zkpranav commented Jun 19, 2025

Related to: unslothai/unsloth-zoo#172

Avoids materializing the entire logit matrix for the ref, old, and new policies' log-probability calculations by using CCE with no reduction:

```python
selective_log_softmax(e @ c.T, index) == -cce(e, c, index, reduction="none")
```

The default invocation of linear_cross_entropy applies gradient filtering, which can be turned off by setting filter_eps to -inf.
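A minimal sanity check of this identity (and of disabling the filtering) might look like the sketch below; the shapes and the bf16 tolerance are illustrative, and it assumes trl's selective_log_softmax helper and the cut-cross-entropy package are available:

```python
import torch
from trl.trainer.utils import selective_log_softmax
from cut_cross_entropy import linear_cross_entropy

B, T, D, V = 2, 8, 64, 256
e = torch.randn(B, T, D, device="cuda", dtype=torch.bfloat16)  # hidden states
c = torch.randn(V, D, device="cuda", dtype=torch.bfloat16)     # lm_head weight
index = torch.randint(0, V, (B, T), device="cuda")             # target token ids

logps_ref = selective_log_softmax((e @ c.T).float(), index)
# filter_eps=-inf turns off CCE's gradient filtering, per the note above.
logps_cce = -linear_cross_entropy(e, c, index, reduction="none",
                                  impl="cce", filter_eps=float("-inf"))

print(torch.dist(logps_ref, logps_cce.float()))
assert torch.allclose(logps_ref, logps_cce.float(), atol=1e-2)  # loose bf16 tolerance
```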

```
num_generations = 8
num_iterations = 4
batch_size = 8
unsloth_num_chunks = 4
max_prompt_length = 512
max_completion_length = 1024
vocab_size = 128256
```

[Plots: GPU memory usage and loss]

Reduces VRAM usage by around 15-20%, though the memory usage should be lower still with CCE. Moreover, for larger values of batch_size, max_completion_length, and vocab_size, the difference will be much more pronounced.

Other changes:

  1. Modifies _get_per_token_logps to accept a batch_size (https://github.com/huggingface/trl/blob/5206c927f6bb161e45114531b0bca8286acfeada/trl/trainer/grpo_trainer.py#L853) and removes calc_logprob_flag (a rough sketch follows this list).
  2. Computes logps in compute_loss (before calling into UnslothEfficientGRPO), ensuring a consistent interface with HF.
  3. Removes the explicit computation of ref logps, since HF does that now (https://github.com/huggingface/trl/blob/5206c927f6bb161e45114531b0bca8286acfeada/trl/trainer/grpo_trainer.py#L1292).
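A rough sketch of the resulting chunked logp computation (simplified from the actual diff; the names and the dummy-token handling follow the snippet quoted later in the thread):

```python
import torch
from cut_cross_entropy import linear_cross_entropy

def get_per_token_logps(model, lm_head, input_ids, attention_mask,
                        logits_to_keep, batch_size):
    # Process rows in sub-batches of `batch_size` so the full (B, T, V)
    # logit matrix is never materialized.
    all_logps = []
    for start in range(0, input_ids.size(0), batch_size):
        ids = input_ids[start : start + batch_size]
        mask = attention_mask[start : start + batch_size]
        hidden = model(input_ids=ids, attention_mask=mask,
                       logits_to_keep=logits_to_keep + 1).logits
        # Append a dummy token so shapes line up; its logp is dropped below.
        labels = torch.cat(
            (ids[:, -logits_to_keep:],
             torch.zeros((ids.size(0), 1), dtype=ids.dtype, device=ids.device)),
            dim=-1,
        )
        logps = -linear_cross_entropy(hidden.to(lm_head.dtype), lm_head,
                                      labels, reduction="none", impl="cce")
        all_logps.append(logps[:, :-1])  # exclude the last (dummy) position
    return torch.cat(all_logps, dim=0)
```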

@danielhanchen (Contributor)

Wait, I thought we didn't materialize logits but folded it into a torch.compile kernel @Datta0 @pluesclues

```python
hidden_states = model(input_ids=input_ids_batch, attention_mask=attention_mask_batch, logits_to_keep=logits_to_keep + 1).logits
# Add a dummy input_id at the end. The last logp is excluded.
input_ids_batch = torch.cat((input_ids_batch[:, -logits_to_keep:], torch.zeros((batch_size, 1), dtype=input_ids_batch.dtype, device=input_ids_batch.device)), dim=-1)
logps = -1 * linear_cross_entropy(hidden_states.to(dtype=lm_head.dtype), lm_head, input_ids_batch, reduction="none", impl="cce")
```
@Datta0 (Collaborator) commented Jun 20, 2025

Um, why do we need cross entropy in get_per_token_logps?

@pluesclues (Contributor) commented Jun 20, 2025

Apparently these return logprobs and are equivalent to selective_log_softmax? But I am not sure we want to return logprobs this way because, as @danielhanchen said, we folded that computation into a torch.compile kernel. I also question whether the memory saved here actually comes from cut cross entropy rather than from the chunked concatenation of the hidden states. I am currently at work, but we can check later whether chunking the hidden states alone conserves a similar amount of memory.

Contributor

Another thing I see: according to this person's post, there also seems to be some speedup. Instead of materializing the logits outside of here, we could put linear_cross_entropy in place of the code in selective_log_softmax, so we get both the speedup and the memory savings and do not materialize logits outside of the kernel.
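Concretely, the swap being suggested might look like this sketch (the call shape mirrors selective_log_softmax(hidden @ lm_head.T, index); treat it as an assumption about the surrounding code):

```python
from cut_cross_entropy import linear_cross_entropy

def selective_log_softmax_cce(hidden_states, lm_head, index):
    # Equivalent to selective_log_softmax(hidden_states @ lm_head.T, index),
    # but CCE computes the per-token logps without materializing the logits.
    return -linear_cross_entropy(
        hidden_states.to(lm_head.dtype), lm_head, index,
        reduction="none", impl="cce",
    )
```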

@zkpranav (Author)

@pluesclues That would work too. The only reason I did it this way is to ensure consistency with HF. That being said, we may, at some point, need to write a custom kernel anyway to run fused operations on the logit matrix chunk. Currently, the implementation in HF scales the logits with temperature before computing logps (https://github.com/huggingface/trl/blob/4c92de00001379ceedaf073512ce4df5da304d08/trl/trainer/grpo_trainer.py#L871).
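For reference, since the logits are linear in the hidden states, HF's temperature scaling can be folded into the CCE inputs without materializing the logit chunk; a sketch of the idea:

```python
from cut_cross_entropy import linear_cross_entropy

def per_token_logps_with_temperature(hidden_states, lm_head, index, temperature):
    # logits / temperature == (hidden_states / temperature) @ lm_head.T,
    # so dividing the hidden states reproduces HF's scaled log-softmax.
    scaled = (hidden_states / temperature).to(lm_head.dtype)
    return -linear_cross_entropy(scaled, lm_head, index,
                                 reduction="none", impl="cce")
```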

Contributor

Okay, I just tested this method inside of the kernel, and it's as I suspected: we cannot use linear_cross_entropy, which is a torch.compile kernel in itself, inside of a torch.compile kernel. I confirmed this by running `ref = -1 * linear_cross_entropy(ref_hidden_states_j[:, :-1, :].to(dtype=lm_head.dtype), lm_head, input_ids_j, reduction="none", impl="cce")` right before accumulate_chunk outside the kernel, and also calling the same line inside the kernel. Outside the kernel it works just fine; inside, it seems to break. I still haven't tested the speedup on my machine yet, but so far it looks like we can either merge this or change the way we calculate logprobs to exactly how CCE does it in their kernel.
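A minimal repro sketch of the nesting issue described above (shapes are illustrative; assumes the cut-cross-entropy package and a CUDA device):

```python
import torch
from cut_cross_entropy import linear_cross_entropy

B, T, D, V = 2, 16, 64, 128
hidden = torch.randn(B, T, D, device="cuda", dtype=torch.bfloat16)
lm_head = torch.randn(V, D, device="cuda", dtype=torch.bfloat16)
labels = torch.randint(0, V, (B, T), device="cuda")

# Outside any torch.compile region this works fine.
logps = -linear_cross_entropy(hidden, lm_head, labels, reduction="none", impl="cce")

@torch.compile
def inner(hidden, lm_head, labels):
    # linear_cross_entropy is itself a compiled kernel, and nesting it inside
    # another torch.compile region is what breaks per the comment above.
    return -linear_cross_entropy(hidden, lm_head, labels, reduction="none", impl="cce")

# inner(hidden, lm_head, labels)  # expected to fail as described above
```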

@zkpranav (Author) commented Jun 20, 2025

About the memory savings I reported: I believe the way I profiled them does not give an accurate account. I am only logging the peak memory allocated throughout a training step, clearing it at the beginning. This approach fails to account for the memory allocated for the old and ref policies, since they are computed and cached outside the new-policy update loop, i.e. only when _step % num_iterations == 0. I expected much higher memory savings. I would appreciate some help with this.
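For context, the profiling described above amounts to something like this sketch (an assumption about the setup, not code from the PR):

```python
import torch

def log_peak_memory(step: int, step_fn) -> None:
    # Reset the peak counter at the start of the step, run the step, then read
    # the high-water mark. This misses allocations made before the reset, e.g.
    # the cached old/ref-policy logps, which is the blind spot described above.
    torch.cuda.reset_peak_memory_stats()
    step_fn()
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"step {step}: peak allocated {peak_gib:.2f} GiB")
```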

Moreover, I would like to confirm that UNSLOTH_USE_NEW_MODEL being set to 0 should be interpreted as the pathway to UnslothEfficientGRPO, as is the case in the current implementation.
Also, UNSLOTH_RETURN_HIDDEN_STATES is set to 1 before executing the forward pass in _get_per_token_logps but is never reset to its original value, creating an unintended side effect. This happens in a couple of places. Would it not be better to reset it, as sketched below?
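A sketch of the reset being suggested, wrapping the forward pass so the flag is always restored (the helper name is hypothetical):

```python
import os

def forward_with_hidden_states(run_forward):
    # Hypothetical wrapper: set the flag for the forward pass, then restore
    # the original value (or remove the key) to avoid the global side effect.
    previous = os.environ.get("UNSLOTH_RETURN_HIDDEN_STATES")
    os.environ["UNSLOTH_RETURN_HIDDEN_STATES"] = "1"
    try:
        return run_forward()
    finally:
        if previous is None:
            del os.environ["UNSLOTH_RETURN_HIDDEN_STATES"]
        else:
            os.environ["UNSLOTH_RETURN_HIDDEN_STATES"] = previous
```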

Collaborator

Do you have the wandb plot of memory usage over time (as tracked by trl/wandb itself) for the run?

@zkpranav (Author)

[Plot: GPU memory usage]

@zkpranav (Author)

This is a much smaller run with double the batch size. The CCE version completes its 4 training steps in 7 mins, whereas the current implementation OOMs on my machine after 12 mins.
In this case, the amount of memory saved is roughly 25%.

```
batch_size = 16
unsloth_num_chunks = 4
```

[Plot: GPU memory usage (OOM run)]

@zkpranav (Author)

These are the loss and grad_norm graphs for the current version and CCE with no gradient filtering.

[Plots: loss and grad_norm]

@zkpranav (Author)

I believe this is relevant to:
#2752
#2736
#2702

@danielhanchen (Contributor)

@zkpranav I'm actually quite surprised selective_log_softmax(e @ c.T, index) == -cce(e, c, index, reduction="none") is equivalent :) Is it possible to write a simple Python test to verify this? That would be cool. It would also be good to check with torch.allclose or torch.dist().

Also, the loss plots look mostly the same; I'm assuming the differences come from the generation dynamics. Can you set the temperature to a small number for GRPO, say 0.001, with top_k = 1 and seed = 3407 or something? The main issue is that the losses aren't exactly matching.
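Something like the following settings sketch could pin the generation down (field names follow trl's GRPOConfig; treat the exact names as assumptions to check against the installed version):

```python
from trl import GRPOConfig

config = GRPOConfig(
    temperature=0.001,  # near-greedy sampling to suppress generation noise
    top_k=1,
    seed=3407,
)
```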

@zkpranav (Author) commented Jun 23, 2025

@danielhanchen Sure, I'll write the tests and profile with a range of temperature and top_k values. I believe the differences in the loss values are a direct result of the differences in reward and KL divergence. I have seeded the runs, but perhaps I missed something? I could also try beta = 0.0 to isolate the issue.

I am also confused as to how the reward values diverge, albeit slightly. The changes in this PR only affect what comes after the reward and advantage calculation.

[Plot: reward curves]

@danielhanchen (Contributor)

Oh, for GRPO temperature = 1.0, so sampling is done. Seeding won't work since vLLM and other systems are not fully deterministic; it's best to change temperature = 0.001 and top_k = 1 to at least approximate mostly non-sampled generation.

@pluesclues (Contributor)

@danielhanchen I also checked the selective_log_softmax function against CCE, and the logits generally matched up with slight differences, which I assume is part of the reason the losses do not match exactly, along with other non-deterministic parts of the system.

@danielhanchen (Contributor)

Oh interesting
