
@SamuelBarryCS commented Sep 2, 2025

What:

  • Fix the wrong number of tokens per GPU used for loss normalization in trainer.py, so that the mean token loss is no longer scaled down by the number of GPUs when n_gpu > 1 (see #37474).

Test performed:

  • On top of the updated trainer.py, I'm pushing 3 temporary files that are meant to be deleted after review and before merging: test_simple_demo.py (a test script), and trainer_old.py / trainer_new.py (copies of trainer.py before / after my commits, with extra logging for testing purposes).
  • Ran the test with 1 and 2 GPUs to reproduce the results from "Trainer.training_step incorrectly normalizes mean token loss when n_gpu > 1" (#37474); the output is as follows:

1 GPU - both implementations behave correctly:


Found 1 different GPUs
----------------------------------------
Testing OLD Trainer
----------------------------------------
Starting OLD trainer...
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 0}.
{'loss': 10.4322, 'grad_norm': 11.936039924621582, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 9.8546, 'grad_norm': 8.70962905883789, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 0.4988, 'train_samples_per_second': 60.144, 'train_steps_per_second': 4.01, 'train_loss': 10.143393516540527, 'epoch': 2.0}

----------------------------------------
Testing NEW Trainer
----------------------------------------
Starting NEW trainer...
{'loss': 10.4322, 'grad_norm': 11.936039924621582, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 9.8546, 'grad_norm': 8.70962905883789, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 0.2013, 'train_samples_per_second': 149.029, 'train_steps_per_second': 9.935, 'train_loss': 10.143393516540527, 'epoch': 2.0}

2 GPUs - the previous implementation (i.e. the "old trainer") incorrectly normalizes the loss, whereas the new one fixes it:


Found 2 different GPUs
----------------------------------------
Testing OLD Trainer
----------------------------------------
Starting OLD trainer...
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 0}.
{'loss': 5.2161, 'grad_norm': 5.968019962310791, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 4.9273, 'grad_norm': 4.354814529418945, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 1.6086, 'train_samples_per_second': 37.3, 'train_steps_per_second': 1.243, 'train_loss': 5.071696758270264, 'epoch': 2.0}

----------------------------------------
Testing NEW Trainer
----------------------------------------
Starting NEW trainer...
{'loss': 10.4322, 'grad_norm': 11.936039924621582, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 9.8546, 'grad_norm': 8.70962905883789, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 0.1692, 'train_samples_per_second': 354.581, 'train_steps_per_second': 11.819, 'train_loss': 10.143393516540527, 'epoch': 2.0}

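For context on the 2-GPU numbers (5.2161 ≈ 10.4322 / 2): as described in #37474, when n_gpu > 1 the DataParallel path ends up dividing the mean token loss by the number of GPUs one extra time, because each replica's loss is already normalized by the global token count before the per-replica losses are averaged. The snippet below is only a minimal, self-contained sketch of that arithmetic with made-up numbers; it is not the actual trainer.py change:

```python
import torch

# Hypothetical per-replica summed token losses and a global token count, chosen only so the
# result mirrors the 2-GPU logs above; they are not taken from the real test run.
per_replica_loss_sums = torch.tensor([52.161, 52.161])  # sum of token losses on each replica
num_items_in_batch = 10                                  # total tokens across the whole batch

# Old behavior: each replica divides its summed loss by the *global* token count, and then
# loss.mean() averages the replicas, dividing by n_gpu a second time.
old_loss = (per_replica_loss_sums / num_items_in_batch).mean()

# Intended behavior: total summed loss over the total token count, independent of n_gpu.
new_loss = per_replica_loss_sums.sum() / num_items_in_batch

print(old_loss.item(), new_loss.item())  # ~5.216 vs ~10.432, matching the logs above
```
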
How to review:

  • Read the diff of trainer.py (ignore trainer_old.py & trainer_new.py, which are only there for testing purposes)
  • Run test_simple_demo.py on 1 and then on n > 1 GPUs to confirm the results above (see the commands sketched below)
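
A possible way to drive the two runs (an assumption about how the demo is launched, not something prescribed by the PR): restrict the visible devices so the non-distributed DataParallel path (n_gpu > 1) is only taken in the second run, e.g. `CUDA_VISIBLE_DEVICES=0 python test_simple_demo.py` for the 1-GPU case and `CUDA_VISIBLE_DEVICES=0,1 python test_simple_demo.py` for the 2-GPU case.

For reviewers who want to see roughly what such a demo does before running it, here is a hypothetical stand-in for test_simple_demo.py (the real file in the PR takes precedence; the model name, dataset, and hyperparameters below are illustrative assumptions, not copied from it):

```python
# Hypothetical minimal demo: fine-tune a tiny random causal LM for two epochs and log the
# loss per epoch, so the reported loss can be compared between 1-GPU and multi-GPU runs.
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

print(f"Found {torch.cuda.device_count()} different GPUs")

model_name = "hf-internal-testing/tiny-random-LlamaForCausalLM"  # assumed tiny test model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A tiny fixed dataset so both runs see identical data and the losses are directly comparable.
texts = ["hello world, this is a loss-normalization demo"] * 16
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"]), remove_columns=["text"]
)

args = TrainingArguments(
    output_dir="/tmp/loss_norm_demo",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    logging_strategy="epoch",
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

With two visible GPUs and the old trainer, the per-epoch loss printed by a script like this would be roughly half of the 1-GPU value; with the fixed trainer the two should match.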

TODO / Next:

  • N/A

@SamuelBarryCS SamuelBarryCS marked this pull request as ready for review September 2, 2025 05:33
@SamuelBarryCS (Author)

Hey @SunMarc - can you please review when you get the time?
Lint & tests will probably fail because of the 3 temporary files, but you can still review the implementation / testing; I will fix these details afterwards.
Thanks!

@SamuelBarryCS SamuelBarryCS changed the title [WIP] Fix issue of wrong number of tokens per GPUs affecting loss normalization Fix issue of wrong number of tokens per GPUs affecting loss normalization Sep 2, 2025
@SamuelBarryCS SamuelBarryCS changed the title Fix issue of wrong number of tokens per GPUs affecting loss normalization Fix issue of wrong number of tokens per GPUs affecting loss normalization in trainer.py Sep 2, 2025