tcbegley / rl
This project forked from pytorch/rl
A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.
License: MIT License
Shouldn't this be transposed?
i.e. self.transformer.wte.weight = torch.t(self.lm_head.weight)
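As a quick sanity check on the shapes (this assumes GPT-2-style `nn.Embedding` / `nn.Linear` modules, not necessarily the exact ones in this repo): `nn.Linear(n_embd, vocab_size)` stores its weight as `(out_features, in_features)`, so both weights are already `(vocab_size, n_embd)` and weight tying assigns one to the other directly. Note also that `torch.t` returns a plain tensor, so assigning a transposed view to a `Parameter` attribute would raise a `TypeError`.

```python
import torch.nn as nn

vocab_size, n_embd = 50257, 768  # illustrative GPT-2 sizes
wte = nn.Embedding(vocab_size, n_embd)               # token embedding
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # output head

# Both weights have shape (vocab_size, n_embd), so GPT-2 style weight
# tying assigns one Parameter to the other with no transpose:
assert wte.weight.shape == lm_head.weight.shape
wte.weight = lm_head.weight
```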
create_loss_estimator
create_datasets / create_dataloaders
Collate
create_*_memmaps
speculative
Print accuracy in reward training loop evaluation
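For pairwise reward data, accuracy is naturally the fraction of pairs where the chosen answer outscores the rejected one; a minimal sketch (the function name is hypothetical, not from the repo):

```python
import torch

def reward_accuracy(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> float:
    # Fraction of pairs where the chosen answer outscores the rejected one.
    return (chosen_scores > rejected_scores).float().mean().item()

# 2 of 3 pairs ranked correctly
reward_accuracy(torch.tensor([2.0, 0.5, 1.0]), torch.tensor([1.0, 1.5, 0.0]))
```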
See here
This is probably fine: when computing validation metrics we don't perform a full pass over the validation set, we sample from it instead, so some randomness is actually desirable to avoid always validating on the same subset of the validation data.
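A minimal sketch of that sampling setup, assuming a standard torch `DataLoader` (the dataset here is a toy stand-in):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Toy stand-in for a validation set of 1000 examples.
val_data = TensorDataset(torch.arange(1000))

# RandomSampler with replacement draws a fresh random subset on each pass,
# so repeated evaluations don't always see the same examples.
sampler = RandomSampler(val_data, replacement=True, num_samples=64)
val_loader = DataLoader(val_data, batch_size=16, sampler=sampler)

batches = list(val_loader)  # 4 batches of 16 sampled examples
```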
The reward model is trained on proposed answers to a prompt, which come in pairs: one marked as chosen, the other as rejected. The reward model should output a high score for the chosen answer and a low score for the rejected one.
It seems tricky to come up with a clean programming pattern for this using tensorclasses. Ideally it would be nice to represent the data using a tensorclass, and use TensorDictModule to perform a single forward pass on the data.
We have a tensorclass roughly of the form
@tensorclass
class Data:
    prompt: torch.Tensor
    chosen: torch.Tensor
    rejected: torch.Tensor
We need to do two forward passes, subtract the results and backpropagate. So we end up doing something roughly like this
chosen_loss = model(batch.prompt, batch.chosen)
rejected_loss = model(batch.prompt, batch.rejected)
loss = -torch.sigmoid(chosen_loss - rejected_loss)
which doesn't make use of TensorDictModule. One possibility would be to do something like
chosen_model = TensorDictModule(model, ["prompt", "chosen"], ["chosen_loss"])
rejected_model = TensorDictModule(model, ["prompt", "rejected"], ["rejected_loss"])
chosen_model(batch)
rejected_model(batch)
loss = -torch.sigmoid(batch.chosen_loss - batch.rejected_loss)
We could even combine these into a single call with TensorDictSequential. The only problem is that this feels more complicated and harder to follow.
Similarly, we could combine the forward passes on chosen and rejected examples into a single forward pass by adding a flag indicating the sign to use for each example when aggregating the scores, but that too becomes more complex and harder to follow.
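For reference, a rough self-contained sketch of the single-forward-pass idea, implemented by concatenating the pairs along the batch dimension rather than an explicit sign flag (the `model` here is a stand-in scoring function, not the repo's actual reward model):

```python
import torch

def pairwise_loss_single_pass(model, prompt, chosen, rejected):
    # Stack chosen and rejected responses along the batch dimension so the
    # reward model only runs once, then split the scores back out to form
    # the same pairwise loss as in the two-pass snippet.
    batch_size = prompt.shape[0]
    prompts = torch.cat([prompt, prompt], dim=0)
    responses = torch.cat([chosen, rejected], dim=0)
    scores = model(prompts, responses)  # shape (2 * batch_size,)
    chosen_scores, rejected_scores = scores.split(batch_size, dim=0)
    return -torch.sigmoid(chosen_scores - rejected_scores).mean()
```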
Scripts currently use print statements for logging; it would be nice to use the logging module instead, so that logs can be redirected and configured more easily.
Currently, reward model training starts at the iteration number of the transformer checkpoint, which is odd. The reward model training loop should count iterations from 0 when the reward model is trained from scratch, regardless of how many iterations the transformer was trained for; only when loading a reward model from a checkpoint should we resume from that checkpoint's iteration number.
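The intended rule can be sketched with a small hypothetical helper (the name and the checkpoint's `"iteration"` key are assumptions for illustration, not the repo's actual format):

```python
from typing import Optional

def starting_iteration(reward_checkpoint: Optional[dict]) -> int:
    # Count from 0 when training the reward model from scratch; resume
    # the count only from a reward-model checkpoint, never from the
    # transformer's checkpoint.
    if reward_checkpoint is None:
        return 0
    return reward_checkpoint.get("iteration", 0)
```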
At the moment we have lots of options that are either redundant or unused; we should streamline them and update the comments accordingly.
Example: the dataset choice should be deprecated, as should the model choice.
Single config for entire pipeline?
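One possible shape for a single pipeline-wide config, sketched with stdlib dataclasses (the class and field names are hypothetical, not existing repo options):

```python
from dataclasses import dataclass, field

@dataclass
class StageConfig:
    # Per-stage hyperparameters shared by the transformer and reward loops.
    lr: float = 3e-4
    iterations: int = 10_000

@dataclass
class PipelineConfig:
    # One top-level object passed to every script in the pipeline.
    transformer: StageConfig = field(default_factory=StageConfig)
    reward: StageConfig = field(default_factory=StageConfig)

cfg = PipelineConfig(reward=StageConfig(lr=1e-4))
```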
The attention mask can be recovered from the input ids by doing something like
attention_mask = (input_ids != pad_token_id).to(torch.int64)
so we could potentially reduce memory footprint in this way at the cost of some extra computation each iteration. We should benchmark and see what's best.
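A tiny worked example of that recovery (the pad id of 0 is assumed here for illustration):

```python
import torch

pad_token_id = 0  # assumed pad id for illustration
input_ids = torch.tensor([[5, 17, 3, 0, 0],
                          [8, 2, 0, 0, 0]])
attention_mask = (input_ids != pad_token_id).to(torch.int64)
# tensor([[1, 1, 1, 0, 0],
#         [1, 1, 0, 0, 0]])
```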