
Comments (3)

andreped commented on May 26, 2024

I've added a temporary fix for this, which essentially catches when this happens, and restarts training from the previous state, keeping all model history and whatnot.

Need a proper fix for this in sapai/sapai-gym.
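A minimal sketch of the catch-and-restart idea described above, not the actual code in train_agent.py. The names `train_with_restarts`, `run_training`, and `latest_checkpoint` are hypothetical: `run_training` stands in for whatever starts or resumes training, and `latest_checkpoint` for whatever returns the most recently saved model state.

```python
from typing import Callable, Optional


def train_with_restarts(
    run_training: Callable[[Optional[str]], None],
    latest_checkpoint: Callable[[], Optional[str]],
    max_restarts: int = 5,
) -> int:
    """Run training, resuming from the latest checkpoint after a crash.

    Both callables are hypothetical placeholders. Returns the number of
    restarts that were needed before training finished.
    """
    restarts = 0
    resume_from: Optional[str] = None  # None means "start from scratch"
    while True:
        try:
            run_training(resume_from)
            return restarts
        except Exception as exc:  # broad on purpose: the env raises a plain Exception
            if restarts >= max_restarts:
                raise  # give up and surface the original error
            restarts += 1
            resume_from = latest_checkpoint()
            print(f"Training crashed ({exc!r}); restart {restarts} from {resume_from}")
```

Capping the number of restarts keeps a deterministic failure from looping forever, and re-raising on the last attempt preserves the original traceback.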

from super-ml-pets.

andreped commented on May 26, 2024

As I assumed all errors were coming from sapai-gym, I added a fix to catch all errors happening there:
andreped/sapai-gym@7443f36

However, to my surprise, when running a regular training (now without the try/except loop in the main training script train_agent.py), I got an error from within sb3. This is more challenging to solve, and I'm not really sure what is causing it. See the traceback below, raised after about 250k steps:

Traceback (most recent call last):
  File ".\main.py", line 28, in <module>
    train_with_masks(ret)
  File "C:\Users\andrp\workspace\super-ml-pets\src\train_agent.py", line 60, in train_with_masks
    model.learn(total_timesteps=ret.nb_steps, callback=checkpoint_callback)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 579, in learn
    self.train()
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 439, in train
    values, log_prob, entropy = self.policy.evaluate_actions(
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\common\maskable\policies.py", line 280, in evaluate_actions
    distribution.apply_masking(action_masks)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 152, in apply_masking
    self.distribution.apply_masking(masks)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 62, in apply_masking
    super().__init__(logits=logits)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\torch\distributions\categorical.py", line 64, in __init__
    super(Categorical, self).__init__(batch_shape, validate_args=validate_args)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\torch\distributions\distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter probs (Tensor of shape (64, 213)) of distribution MaskableCategorical(probs: torch.Size([64, 213]), logits: torch.Size([64, 213])) to satisfy the constraint Simplex(), but found invalid values:
tensor([[4.9590e-11, 2.1976e-10, 6.1887e-01,  ..., 3.3524e-13, 4.5890e-12,
         5.3164e-14],
        [1.4266e-06, 8.7648e-10, 1.3233e-06,  ..., 1.5695e-07, 2.9451e-08,
         1.5212e-07],
        [2.2623e-06, 2.3994e-09, 5.3787e-07,  ..., 3.9735e-08, 2.8777e-09,
         2.6170e-08],
        ...,
        [1.6828e-12, 4.9032e-04, 9.5983e-13,  ..., 1.7402e-13, 1.9223e-13,
         5.6725e-14],
        [4.7819e-10, 7.7589e-03, 7.8509e-18,  ..., 6.4911e-11, 8.8994e-12,
         8.3013e-11],
        [3.6789e-08, 1.2760e-07, 4.7924e-16,  ..., 8.6682e-09, 8.6489e-10,
         3.7913e-08]], grad_fn=<SoftmaxBackward0>)


andreped commented on May 26, 2024

A seemingly random exception occurs after training for thousands of steps:

Exception: get_idx < pet-hedgehog 10-1 status-honey-bee 2-1 > not found

What is causing this?

