
@SamuelBarryCS commented Sep 2, 2025

What:

  • Fix the wrong number of tokens per GPU used for loss normalization in trainer.py, so that the mean token loss is no longer scaled down by the number of GPUs when n_gpu > 1 (see #37474).

Test performed:

  • On top of the updated trainer.py, I'm pushing 3 temporary files that are meant to be deleted after review and before merging: test_simple_demo.py (a test script), and trainer_old.py / trainer_new.py (copies of trainer.py before / after my commits, with extra logging for testing purposes).
  • Ran the test with 1 and 2 GPUs to reproduce the results from "Trainer.training_step incorrectly normalizes mean token loss when n_gpu > 1" (#37474); the output is as follows:

1 GPU - both implementations behave correctly:


Found 1 different GPUs
----------------------------------------
Testing OLD Trainer
----------------------------------------
Starting OLD trainer...
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 0}.
{'loss': 10.4322, 'grad_norm': 11.936039924621582, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 9.8546, 'grad_norm': 8.70962905883789, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 0.4988, 'train_samples_per_second': 60.144, 'train_steps_per_second': 4.01, 'train_loss': 10.143393516540527, 'epoch': 2.0}

----------------------------------------
Testing NEW Trainer
----------------------------------------
Starting NEW trainer...
{'loss': 10.4322, 'grad_norm': 11.936039924621582, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 9.8546, 'grad_norm': 8.70962905883789, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 0.2013, 'train_samples_per_second': 149.029, 'train_steps_per_second': 9.935, 'train_loss': 10.143393516540527, 'epoch': 2.0}

2 GPUs - the previous implementation (i.e. the "old trainer") incorrectly normalizes the loss, whereas the new one fixes it:


Found 2 different GPUs
----------------------------------------
Testing OLD Trainer
----------------------------------------
Starting OLD trainer...
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 0}.
{'loss': 5.2161, 'grad_norm': 5.968019962310791, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 4.9273, 'grad_norm': 4.354814529418945, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 1.6086, 'train_samples_per_second': 37.3, 'train_steps_per_second': 1.243, 'train_loss': 5.071696758270264, 'epoch': 2.0}

----------------------------------------
Testing NEW Trainer
----------------------------------------
Starting NEW trainer...
{'loss': 10.4322, 'grad_norm': 11.936039924621582, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 9.8546, 'grad_norm': 8.70962905883789, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 0.1692, 'train_samples_per_second': 354.581, 'train_steps_per_second': 11.819, 'train_loss': 10.143393516540527, 'epoch': 2.0}

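For context on the 2-GPU numbers (5.2161 ≈ 10.4322 / 2): as described in #37474, when n_gpu > 1 the DataParallel path ends up dividing the mean token loss by the number of GPUs one extra time, because each replica's loss is already normalized by the global token count before the per-replica losses are averaged. The snippet below is only a minimal, self-contained sketch of that arithmetic with made-up numbers; it is not the actual trainer.py change:

```python
import torch

# Hypothetical per-replica summed token losses and a global token count, chosen only so the
# result mirrors the 2-GPU logs above; they are not taken from the real test run.
per_replica_loss_sums = torch.tensor([52.161, 52.161])  # sum of token losses on each replica
num_items_in_batch = 10                                  # total tokens across the whole batch

# Old behavior: each replica divides its summed loss by the *global* token count, and then
# loss.mean() averages the replicas, dividing by n_gpu a second time.
old_loss = (per_replica_loss_sums / num_items_in_batch).mean()

# Intended behavior: total summed loss over the total token count, independent of n_gpu.
new_loss = per_replica_loss_sums.sum() / num_items_in_batch

print(old_loss.item(), new_loss.item())  # ~5.216 vs ~10.432, matching the logs above
```
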
How to review:

  • Read the diff of trainer.py (ignore trainer_old.py & trainer_new.py, which are only there for testing purposes)
  • Run test_simple_demo.py on 1 and then on n > 1 GPUs to confirm the results above (see the commands sketched below)
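
A possible way to drive the two runs (an assumption about how the demo is launched, not something prescribed by the PR): restrict the visible devices so the non-distributed DataParallel path (n_gpu > 1) is only taken in the second run, e.g. `CUDA_VISIBLE_DEVICES=0 python test_simple_demo.py` for the 1-GPU case and `CUDA_VISIBLE_DEVICES=0,1 python test_simple_demo.py` for the 2-GPU case.

For reviewers who want to see roughly what such a demo does before running it, here is a hypothetical stand-in for test_simple_demo.py (the real file in the PR takes precedence; the model name, dataset, and hyperparameters below are illustrative assumptions, not copied from it):

```python
# Hypothetical minimal demo: fine-tune a tiny random causal LM for two epochs and log the
# loss per epoch, so the reported loss can be compared between 1-GPU and multi-GPU runs.
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

print(f"Found {torch.cuda.device_count()} different GPUs")

model_name = "hf-internal-testing/tiny-random-LlamaForCausalLM"  # assumed tiny test model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A tiny fixed dataset so both runs see identical data and the losses are directly comparable.
texts = ["hello world, this is a loss-normalization demo"] * 16
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"]), remove_columns=["text"]
)

args = TrainingArguments(
    output_dir="/tmp/loss_norm_demo",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    logging_strategy="epoch",
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

With two visible GPUs and the old trainer, the per-epoch loss printed by a script like this would be roughly half of the 1-GPU value; with the fixed trainer the two should match.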

TODO / Next:

  • N/A

@SamuelBarryCS SamuelBarryCS marked this pull request as ready for review September 2, 2025 05:33
@SamuelBarryCS (Author)

Hey @SunMarc - can you please review when you get the time?
Lint & tests will probably fail because of the 3 temporary files, but you can still review the implementation / testing; I will fix these details afterwards.
Thanks!

@SamuelBarryCS SamuelBarryCS changed the title [WIP] Fix issue of wrong number of tokens per GPUs affecting loss normalization Fix issue of wrong number of tokens per GPUs affecting loss normalization Sep 2, 2025
@SamuelBarryCS SamuelBarryCS changed the title Fix issue of wrong number of tokens per GPUs affecting loss normalization Fix issue of wrong number of tokens per GPUs affecting loss normalization in trainer.py Sep 2, 2025