
puyuan1996 commented on August 22, 2024

Hello, thanks for your question.

I replaced the policy-related parameters of our hopper_onppo_config with the settings you gave and ran for 3M env steps. Here is the raw result:
[image: training result curve]
Although the performance is poor with these hyperparameters, the error you mentioned does not appear in this setting.

I suspect there may be abnormal data in the ppo_batch in your setting. The NaN values are possibly caused by abnormally large or abnormally small values of pi_new/pi_old in the PPO update formula.
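
For reference, here is a rough sketch of where pi_new/pi_old enters the loss (illustrative only, not the exact ding.rl_utils implementation):

import torch

# Illustrative clipped-surrogate sketch (not the exact DI-engine code).
# ratio = pi_new(a|s) / pi_old(a|s), computed in log space; extreme values of
# logp_new - logp_old make exp() overflow and propagate inf/nan into the loss.
def ppo_clip_loss(logp_new, logp_old, adv, clip_ratio=0.2):
    ratio = torch.exp(logp_new - logp_old)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * adv
    return -torch.min(surr1, surr2).mean()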

You can instrument the code as below, then run cityflow_ppo_continuous_train.py in debug mode to save and analyze the ppo_batch when the error occurs on this line:

try:
    ppo_loss, ppo_info = ppo_error_continuous(ppo_batch, self._clip_ratio)
except Exception as error:
    # Dump the offending batch to disk so it can be inspected offline.
    print(error, ppo_batch)
    torch.save(ppo_batch, 'ppo_batch.pt')
    raise

After ppo_batch.pt has been saved when the error occurs, you can inspect the ppo_batch carefully.
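
For example, a rough check over the saved batch could look like this (ppo_batch is assumed to be the ppo_data namedtuple used by ppo_error_continuous):

import torch

batch = torch.load('ppo_batch.pt')
# Walk over the fields of the saved namedtuple and report NaN and value range;
# dict fields such as logit_new/logit_old hold 'mu' and 'sigma' tensors.
for name, value in zip(batch._fields, batch):
    items = value.items() if isinstance(value, dict) else [('', value)]
    for sub_name, t in items:
        if isinstance(t, torch.Tensor):
            print(name, sub_name, 'nan:', torch.isnan(t).any().item(),
                  'min:', t.min().item(), 'max:', t.max().item())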

Thanks a lot.

zyz109429 commented on August 22, 2024

First of all, thank you for your reply.
I did the debugging as you suggested.
The caught exception is as follows.

[image: exception screenshot]

The printed information is as follows.

ppo_data(logit_new={'mu': tensor([[nan],[nan],[nan],...,[nan]], grad_fn=), 'sigma': tensor([[nan], [nan], [nan],...,[nan]], grad_fn=)}, logit_old={'mu': tensor([-0.6285, -0.5731, -0.6455, 0.8511, -0.6911, 0.6962, -0.4182, -0.5283,
-0.6198, 0.9046, -0.1926, -0.5818, 0.9112, -0.5774, 0.0645, 0.8799,
0.8856, -0.6692, -0.5422, 0.2381, 0.9047, 0.8866, -0.5679, 0.5986,
-0.3940, -0.6046, -0.5193, -0.6381, 0.9134, -0.3108, 0.9075, 0.9092,
-0.6426, -0.5828, 0.9051, -0.6339, -0.6349, -0.6433, -0.6034, -0.5630,
-0.5207, -0.6074, -0.6682, -0.6652, -0.7154, 0.7598, -0.1082, -0.0081,
-0.5290, 0.8938, 0.9072, -0.6852, 0.7760, -0.2126, 0.8408, -0.6748,
-0.6174, 0.6603, 0.8826, -0.6609, -0.3863, -0.4872, -0.6193, 0.8938]), 'sigma': tensor([1.4908, 1.3997, 1.5247, 0.3084, 1.5925, 0.4297, 1.2325, 1.3328, 1.4403,
0.2572, 0.9868, 1.3993, 0.2489, 1.3731, 0.7672, 0.2819, 0.2811, 1.5416,
1.3511, 0.6743, 0.2589, 0.2757, 1.3921, 0.4840, 1.1544, 1.4665, 1.3378,
1.4687, 0.2475, 1.0555, 0.2545, 0.2508, 1.5072, 1.3612, 0.2557, 1.4653,
1.4874, 1.4856, 1.4224, 1.3854, 1.2687, 1.4344, 1.5471, 1.5305, 1.6533,
0.3765, 0.8946, 0.8425, 1.3026, 0.2663, 0.2548, 1.5716, 0.3727, 0.9790,
0.3151, 1.5678, 1.4493, 0.4493, 0.2809, 1.5410, 1.1080, 1.2272, 1.4482,
0.2673])}, action=tensor([ 1.3408, 0.1602, -3.9134, 0.9856, -0.2720, 1.4314, -0.4599, -1.1067,
0.9704, 1.3853, -2.0676, -1.4295, 1.0197, 1.2636, -0.8234, 0.8861,
0.2818, -3.0889, 0.1927, -0.4374, 0.8012, 0.8854, -2.3425, -0.1133,
2.1956, -2.5130, -0.3467, -1.0016, 1.2977, 0.3969, 0.5102, 0.9858,
-2.2762, -3.7371, 0.8547, 0.5875, 1.0124, 1.4402, 1.2027, 0.0245,
-2.7108, -1.3852, -0.1114, 0.8502, -0.8981, 0.6419, 0.6401, -0.3337,
-2.1801, 1.1617, 0.7595, -2.1792, 0.2434, 0.3868, 0.2773, -0.6104,
1.0241, 0.6774, 0.3616, 0.9796, 0.3465, 0.0860, 0.4484, 0.5196]), value_new=tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
grad_fn=), value_old=tensor([-0.9685, -0.4812, -0.6246, -2.7388, -2.0899, -1.3862, -0.4165, -0.4572,
-1.7662, -2.7320, -0.8122, -0.5687, -2.7295, -1.9923, -2.6184, -2.8475,
-2.3100, -2.1565, -0.6518, -2.5187, -2.7433, -2.8006, -0.4143, -2.5838,
-1.4480, -0.4806, -0.3680, -2.8206, -2.7685, -2.2257, -2.7590, -2.5800,
-0.9456, -1.8938, -2.8282, -1.5848, -0.9681, -1.6505, -0.7479, -0.4279,
-2.8415, -0.8131, -1.6537, -2.5903, -2.4278, -2.6943, -2.6982, -1.3150,
-1.6957, -2.8198, -2.7288, -2.2993, -2.0520, -1.0157, -2.7231, -1.8825,
-0.7513, -2.7403, -2.8357, -1.6993, -1.9129, -1.6852, -2.2528, -2.6801]), adv=tensor([ 0.6305, 0.2643, 0.6371, 0.6466, 1.9546, -0.4648, -1.2747, -0.9353,
-1.0359, -0.5286, -1.3012, 0.3276, -0.2845, -0.3729, -1.1896, 0.9560,
2.1039, 0.2346, -1.3772, 1.6108, -0.5274, 1.1824, -0.1134, -0.4184,
-1.9192, -1.6429, -0.5152, 2.0295, -0.7275, 0.6804, -0.5729, -0.9296,
1.2292, -0.2330, 0.1495, -1.8284, -0.4702, -1.7449, 1.9015, 0.8298,
1.9774, -0.3908, 0.4063, 1.0299, -0.0164, 0.3812, -0.1739, -0.8045,
0.4500, -0.5867, 0.4400, 0.1079, 1.4858, -1.0783, 0.8624, -1.0333,
-0.1328, 0.1841, -0.0777, 0.5778, -0.2049, -0.6412, 0.4944, -0.2176]), return_=tensor([-0.6951, -0.3260, -0.3491, -2.4602, -1.3890, -1.4665, -0.7583, -0.6894,
-2.0309, -2.8329, -1.1625, -0.3931, -2.7516, -2.0430, -2.9327, -2.4690,
-1.5608, -2.0109, -1.0267, -1.9288, -2.8437, -2.3490, -0.3811, -2.6490,
-1.9979, -0.9413, -0.4646, -2.0955, -2.9336, -1.9362, -2.8741, -2.8103,
-0.4789, -1.8992, -2.7101, -2.1054, -1.0501, -2.1441, -0.0641, -0.0902,
-2.1332, -0.8695, -1.4527, -2.1879, -2.3633, -2.5014, -2.6846, -1.5049,
-1.4806, -2.9394, -2.5169, -2.1947, -1.5024, -1.2941, -2.3748, -2.1463,
-0.7243, -2.6110, -2.7910, -1.4429, -1.9093, -1.8224, -2.0234, -2.6806]), weight=None).

According to the data and the error message, it is suggested that I modify the backward call to total_loss.backward(retain_graph=True).

I hope to get your further guidance. Thank you very much!

puyuan1996 commented on August 22, 2024

Hello,

Could you save the model self._learn_model and the input batch data when the error first occurs after this line:

 output = self._learn_model.forward(batch['obs'], mode='compute_actor_critic')

something like this:

try:
    ppo_loss, ppo_info = ppo_error_continuous(ppo_batch, self._clip_ratio)
except Exception as error:
    torch.save(ppo_batch, 'ppo_batch.pt')                # save the ppo batch
    torch.save(batch, 'input_batch.pt')                  # save the input batch
    torch.save(self._state_dict_learn(), 'ckpt.pt')      # save the model parameters
    raise

Then we can analyze whether the model parameters or the input data contain NaN values.
If the input has NaN values, we should check the environment.
If the model parameters have NaN values, the error may be due to abnormal gradients.
If neither has NaN, we can load the model and pass the input batch in again to verify whether the output produces NaN values.
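
A rough standalone check could look like the following (the 'model' key inside ckpt.pt and the dict layout of input_batch.pt are assumptions based on the snippet above):

import torch

ckpt = torch.load('ckpt.pt')
input_batch = torch.load('input_batch.pt')

# 1. Check the saved model parameters (assuming self._state_dict_learn()
#    stores the network weights under the 'model' key).
for key, param in ckpt['model'].items():
    if param.is_floating_point() and torch.isnan(param).any():
        print('nan in parameter:', key)

# 2. Check the input observations for nan values.
print('nan in obs:', torch.isnan(input_batch['obs']).any().item())

# 3. If both are clean, load the weights back and run the forward pass again
#    inside the policy to see whether the output itself becomes nan, e.g.:
# self._learn_model.load_state_dict(ckpt['model'])
# output = self._learn_model.forward(input_batch['obs'], mode='compute_actor_critic')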

Thanks.

zyz109429 commented on August 22, 2024

Using the method you provided, I found the following: input_batch has no NaN values.
In ppo_batch, logit_new is all NaN while logit_old is not, and the model parameters are all NaN.
So the error may indeed be due to abnormal gradients.

I've tried other activation functions and gradient clipping, and it didn't work. In addition, I have also tried the DDPG and SAC algorithms, and their rewards gradually converge as expected. But during training, the reward of the PPO algorithm stays around its initial value, and it seems that nothing is being learned.
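
The gradient clipping I tried was roughly along these lines (the hook point in the learner and the max_norm value here are only illustrative):

import torch

# Illustrative only: clip the global gradient norm before the optimizer step.
self._optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(self._learn_model.parameters(), max_norm=0.5)
self._optimizer.step()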

I hope to get your further guidance again. Thanks.

puyuan1996 commented on August 22, 2024

Hello,

Have you normalized the observations given by this cityflow environment? What are the current maximum and minimum values of the obs and the reward?
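
If not, one common fix is to rescale obs and reward with running statistics before they reach the policy; a generic sketch (not cityflow-specific or taken from DI-engine) is:

import numpy as np

# Generic running mean/std normalizer for observations or rewards
# (illustrative; update() expects a batch of shape (N, dim)).
class RunningNorm:
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps

    def update(self, x):
        batch_mean, batch_var, n = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        tot = self.count + n
        self.mean = self.mean + delta * n / tot
        m_a = self.var * self.count
        m_b = batch_var * n
        self.var = (m_a + m_b + delta ** 2 * self.count * n / tot) / tot
        self.count = tot

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)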

In order to reproduce your error on my side, could you provide the complete main function file cityflow_ppo_continuous_train.py?

To confirm that everything other than the env is the same: have you made any changes to the original PPO algorithm to adapt it to your environment? Or did you only specify the relevant hyperparameters in this file, with the PPO code itself being the original DI-engine version from the latest branch?

Does the error occur at a fixed number of iterations? If so, after how many iterations?

Thanks a lot.
