
philtabor / youtube-code-repository

854 stars · 20 watchers · 479 forks · 43.1 MB

Repository for most of the code from my YouTube channel

Python 51.15% Jupyter Notebook 48.85%
reinforcement-learning monte-carlo-methods qlearning-algorithm convolutional-neural-networks sarsa


youtube-code-repository's Issues

D3QN for Multiple Action Selection

Hello,
I reused the dueling double DQN for a problem in which multiple actions should be selected, i.e., the output of the choose_action function is a vector of actions drawn from the range [0, n_actions).
How can I change the code to get non-repetitive actions (i.e., an action vector whose entries are all different)? A sketch of one option is given below.
My code is accessible at https://github.com/Nazanin-87/D3QN_TF.git
I appreciate any help in solving this issue.
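
A hedged sketch of one option (an assumption about one way to do this, not code from the repo): if the dueling head produces a row of action values of shape (1, n_actions), torch.topk returns the k highest-valued actions, which are distinct by construction. This is purely greedy, so an exploration strategy would still need to be added on top.

    import torch as T

    advantages = T.randn(1, 10)                 # stand-in for the dueling head output
    k = 3
    _, actions = T.topk(advantages, k, dim=1)   # k highest-valued actions, all distinct
    print(actions)                              # e.g. tensor([[7, 2, 5]])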

Outdated argument critic network DDPG

Hi,
First of all, thank you for the code. It seems the n_actions argument was removed in a previous update (ba997f7). I believe it should be removed from these two calls as well to make the code work again:

self.critic = CriticNetwork(n_actions=n_actions, name='critic')

self.target_critic = CriticNetwork(n_actions=n_actions, name='target_critic')

tensorflow version

@philtabor, thanks a lot for the videos.

Can you please let me know which TensorFlow version you are using? I am running into versioning issues:
kernel_initializer=tf.variance_scaling_initializer(scale=2))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
TypeError: conv2d() got an unexpected keyword argument 'input'
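
This looks like TF 1.x-style code running on a TF 2.x install. A hedged workaround sketch (an assumption about the cause, not a confirmed fix): either pin a 1.x release (e.g. pip install tensorflow==1.15) or run the code through the v1 compatibility shim, under which the initializer from the traceback still resolves.

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    init = tf.variance_scaling_initializer(scale=2.0)   # resolves in the v1 namespace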

Using observation space dimension size for the Actor and Critic models.

Hi, @philtabor !

Thanks for the awesome code and video!
Those really help me to study and understand reinforcement learning.

I found that the actor and critic models are not using the observation space dimension as their input_dims.
Shouldn't input_dims match the size of the observation space?

What do you think?

I made pull request #39 about this.
I really appreciate your review!
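
A minimal sketch of the suggestion (assuming the networks take an input_dims argument like the other models in the repo; the constructor calls in the comments are hypothetical, only to show where the value would go):

    import gym

    env = gym.make('CartPole-v1')
    input_dims = env.observation_space.shape   # e.g. (4,) for CartPole
    n_actions = env.action_space.n
    # actor = ActorNetwork(n_actions=n_actions, input_dims=input_dims)    # hypothetical
    # critic = CriticNetwork(input_dims=input_dims)                       # hypothetical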

Little issue in ddqn_keras.py

Hi Phil, there is a small issue in the method update_network_parameters(self). I think it should be:
self.q_target.set_weights(self.q_eval.get_weights())

instead of:
self.q_target.model.set_weights(self.q_eval.model.get_weights())

Thanks for your code and your videos! These are really useful for me.
Jose.

Purpose of Passing New Frame into State Memory with Previous Action

Hi Phil, huge fan of your work. I have two questions regarding the policy gradients TensorFlow code for Space Invaders:

1. In reinforce_cnn_tf.py, in the choose_action function, there is this line:

probabilities = self.sess.run(self.actions, feed_dict={self.input: observation})[0]

Here the 0 selects the first of the 4 probability distributions. If that is the case, then your actions are taken based on the first frame, i.e. the 0th observation of stacked_frames. Is that right?

2. Assuming my first assumption is right, there is a line in the main_tf_reinforce_space_invaders.py file:

observation, reward, done, info = env.step(action)
observation = preprocess(observation)
stacked_frames = stack_frames(stacked_frames, observation, stack_size)
agent.store_transition(observation, action, reward)   # <-- this one

Here the new observation is stored together with an action that was taken based on the 0th observation in the stacked frames. If this is the case, why does this work while training the agent? Are the probability distributions, when the observations are fed in, different from the labels?
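
A small shape check relevant to question 1 (this rests on an assumption: the observation is fed in as a single batched stack of frames, so the network output has shape (1, n_actions) and the [0] strips the batch dimension rather than selecting one of the stacked frames):

    import numpy as np

    batched_probs = np.random.rand(1, 6)   # stand-in for sess.run output: (batch, n_actions)
    probabilities = batched_probs[0]       # removes the batch dimension, not a frame
    print(probabilities.shape)             # (6,)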

agent.learn is called with action but the function signature does not include it

Traceback (most recent call last):
File "main.py", line 42, in
agent.learn(observation, action, reward, observation_, done)
TypeError: learn() takes 5 positional arguments but 6 were given

change:
def learn(self, state, reward, state_, done):

to

def learn(self, state, action, reward, state_, done):

in the actor critic code.

DQN

File "/home/../aichess/main.py", line 13, in
agent = Agent(
File "/home/../aichess/engines/dqn.py", line 114, in init
self.q_eval = DeepQNetwork(self.lr, self.n_actions,
TypeError: DeepQNetwork.init() got multiple values for argument 'input_dims'

Suggested fix: TypeError in actor_critic tensorflow2 main.py

In the following file:
Youtube-Code-Repository/ReinforcementLearning/PolicyGradient/actor_critic/tensorflow2/main.py

Running the program returns:
File "actor_critic.py", line 157, in
agent.learn(observation, action, reward, observation_, done)
TypeError: learn() takes 5 positional arguments but 6 were given

I think this could be fixed by changing the following line:
agent.learn(observation, action, reward, observation_, done)

to:
agent.learn(observation, reward, observation_, done)

Sorry for the suggestion format, first time suggesting fixes :D

Issue with critic target in PPO

In the line used to define the returns, we use the GAE + values as the target for the critic to learn. Is this correct?

My intuition says no -- the target we are training towards does not represent the true value function; should the target for value of the current state not be the observed reward + value at the next state?

Thanks!
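
A minimal numeric sketch of why GAE + values is a reasonable critic target (an assumption about the intent of the repo's PPO code, not a claim that it is the only correct choice): with lambda = 1 the GAE advantage reduces to the discounted return-to-go minus V(s), so adding V(s) back gives an ordinary return regression target; with lambda < 1 it is the lambda-return, which interpolates between the one-step TD target reward + V(s') mentioned above (lambda = 0) and the full return (lambda = 1).

    import numpy as np

    gamma, lam = 0.99, 1.0
    rewards = np.array([1.0, 1.0, 1.0])
    values  = np.array([0.5, 0.4, 0.3, 0.0])        # V(s_0..s_2) plus bootstrap V(s_3)

    deltas = rewards + gamma * values[1:] - values[:-1]
    adv, running = np.zeros(3), 0.0
    for t in reversed(range(3)):
        running = deltas[t] + gamma * lam * running
        adv[t] = running

    returns = adv + values[:-1]                      # the critic target in question
    print(returns)                                   # equals the discounted return-to-go here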

[Question] Replay memory

def store_transition(self, state, action, reward, state_, terminal):

In the TensorFlow deep Q-learning file 'dqn_tf.py' you are using a replay memory.
I have a question about using it.

If I understand it right, the storage is never cleared?
So the function store_transition keeps saving states across epochs?
And in epoch 2 it is possible that a state from epoch 1 is used for training.
So it never gets cleared as long as the program is running.
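
That matches how buffers of this kind usually behave. A minimal, self-contained sketch (assuming the repo's buffer follows the common "index = counter % size" pattern): nothing is ever explicitly cleared; once the buffer is full, the oldest transitions are simply overwritten, so transitions from earlier episodes can indeed be sampled later.

    import numpy as np

    class ReplayBuffer:
        def __init__(self, max_size, input_dims):
            self.mem_size = max_size
            self.mem_cntr = 0
            self.state_memory = np.zeros((max_size, input_dims), dtype=np.float32)

        def store_transition(self, state):
            index = self.mem_cntr % self.mem_size    # wraps: oldest entry gets overwritten
            self.state_memory[index] = state
            self.mem_cntr += 1

    buf = ReplayBuffer(max_size=3, input_dims=2)
    for step in range(5):
        buf.store_transition(np.full(2, step, dtype=np.float32))
    print(buf.state_memory)    # slots 0 and 1 now hold steps 3 and 4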

SAC custom env

I get this error:


ValueError Traceback (most recent call last)
in
26 score = 0
27 while not done:
---> 28 action = agent.choose_action(observation)
29 observation_, reward, done, info = env.step(action)
30 score += reward

in choose_action(self, observation)
23 def choose_action(self, observation):
24 state = T.Tensor([observation]).to(self.actor.device)
---> 25 actions, _ = self.actor.sample_normal(state, reparameterize=False)
26
27 return actions.cpu().detach().numpy()[0]

in sample_normal(self, state, reparameterize)
38 def sample_normal(self, state, reparameterize=True):
39 mu, sigma = self.forward(state)
---> 40 probabilities = Normal(mu, sigma)
41
42 if reparameterize:

~\Anaconda3\lib\site-packages\torch\distributions\normal.py in __init__(self, loc, scale, validate_args)
48 else:
49 batch_shape = self.loc.size()
---> 50 super(Normal, self).__init__(batch_shape, validate_args=validate_args)
51
52 def expand(self, batch_shape, _instance=None):

~\Anaconda3\lib\site-packages\torch\distributions\distribution.py in __init__(self, batch_shape, event_shape, validate_args)
54 if not valid.all():
55 raise ValueError(
---> 56 f"Expected parameter {param} "
57 f"({type(value).__name__} of shape {tuple(value.shape)}) "
58 f"of distribution {repr(self)} "

ValueError: Expected parameter loc (Tensor of shape (1, 1, 1)) of distribution Normal(loc: tensor([[[nan]]], device='cuda:0', grad_fn=), scale: tensor([[[nan]]], device='cuda:0', grad_fn=)) to satisfy the constraint Real(), but found invalid values:
tensor([[[nan]]], device='cuda:0', grad_fn=)

Any idea how to fix this?

PPO pytorch implementation question

Hi,
Thank you so much for this guide - it is extremely clear and easy to follow! This isn't a bug, but there are a few questions I have. The first:
Why is the 'values' tensor sent to the actor device (referring to this line):

values = T.tensor(values).to(self.actor.device)

The values tensor is not used exclusively by the actor; it is used by both the actor and the critic, since it feeds the MSE loss for the critic and is then added to the actor loss to form the total loss.

And second:
why did you not include a KL divergence term in your implementation? Was there a specific reason?

thank you so much again!

td3

File "main.py", line 11, in <module> n_actions=env.action_space.shape[0]) IndexError: tuple index out of range

Pendulum TF2 maybe a bug found

Hey.

I've been launching your Pendulum TF2 project, and it only ran after I changed lines 23 and 25 of ddpg_tf2.py from
self.critic = CriticNetwork(n_actions=n_actions, name='critic')
self.target_critic = CriticNetwork(n_actions=n_actions, name='target_critic')
to
self.critic = CriticNetwork(name='critic')
self.target_critic = CriticNetwork(name='target_critic')
I believe that's a bug?

Do some of the variables require gradient?

Thanks for sharing your code! There is a problem that bothers me.
For example, in the PyTorch SAC code, do we need to use with torch.no_grad() or detach() when computing value_target and q_hat?
value_target = critic_value - log_probs # line 96
q_hat = self.scale*reward + self.gamma*value_ # line 116

I think they need to stop gradient computation.

value_target = value_target.detach()

q_hat = q_hat.detach()

The author of the TD3 algorithm uses with torch.no_grad() to compute target_Q:
https://github.com/sfujim/TD3/blob/master/TD3.py
110 with torch.no_grad():
And the author of the SAC algorithm uses tf.stop_gradient() to compute value_target and q_hat:
https://github.com/haarnoja/sac/blob/master/sac/algos/sac.py
256 ys = tf.stop_gradient( self.scale_reward * self._rewards_ph + (1 - self._terminals_ph) * self._discount * vf_next_target_t ) # N
330 self._vf_t - tf.stop_gradient(min_log_target - log_pi + policy_prior_log_probs) )**2)

Could you give me some suggestions about this problem, please?
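
For reference, a minimal sketch of the questioner's suggestion in PyTorch, using dummy tensors in place of the repo's critic_value, log_probs, reward and value_ (so this only demonstrates the mechanics, not the repo's actual code):

    import torch as T

    critic_value = T.randn(4, requires_grad=True)
    log_probs    = T.randn(4, requires_grad=True)
    reward       = T.randn(4)
    value_       = T.randn(4, requires_grad=True)
    scale, gamma = 2.0, 0.99

    with T.no_grad():                      # targets carry no gradient
        value_target = critic_value - log_probs
        q_hat = scale * reward + gamma * value_

    print(value_target.requires_grad, q_hat.requires_grad)   # False False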

ValueError: The parameter loc has invalid values

I've downloaded your code and made the following small changes:
-removed all loading/checkpointing/saving functions/calls
-switched the gym environment to env = gym.make("InvertedPendulum-v2")

After some training (variable amount of time before error occurs) I get the following bug:
File "C:\Users\john\Desktop\project\Clone\sac_torch.py", line 32, in choose_action actions, _ = self.actor.sample_normal(state, reparameterize=False) File "C:\Users\john\Desktop\project\Clone\networks.py", line 105, in sample_normal probabilities = Normal(mu, sigma) File "C:\Users\john\anaconda3\lib\site-packages\torch\distributions\normal.py", line 50, in __init__ super(Normal, self).__init__(batch_shape, validate_args=validate_args) File "C:\Users\john\anaconda3\lib\site-packages\torch\distributions\distribution.py", line 53, in __init__ raise ValueError("The parameter {} has invalid values".format(param)) ValueError: The parameter loc has invalid values

I printed out mu and sigma and see that immediately before the error they have become nan:
tensor([[nan]], device='cuda:0', grad_fn=<AddmmBackward>) tensor([[nan]], device='cuda:0', grad_fn=<ClampBackward1>)
(This appears to be occurring during a forward pass, not buffer sampling, due to the tensor being 1 dimensional)

Thanks again for the quick reply in your video!

Critic loss calculation

Hi @philtabor

Thank you for the great tutorials. I am following "ppo_torch.py" (code link) and its tutorial, and one thing I did not understand is why we are not storing the next_state and using it to generate a critic value, which would then be used for calculating the loss function. Don't we need the next state in PPO? Am I missing something?

main_keras_dqn_lunar_lander: first env.reset() returns an array plus an empty dict

On the first env.reset() call, a tuple of the array and an empty dict is returned, and this empty dict screws up the rest of the code.
Is that a new addition to the gym library or a code problem?

(array([ 0.00469818, 1.3994393 , 0.47585568, -0.5102643 , -0.0054372 , -0.10778829, 0. , 0. ], dtype=float32), {})
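
This is a gym API change rather than a code problem: newer gym releases (0.26 and later, and gymnasium) return an (observation, info) tuple from env.reset(), while older releases return just the observation. A minimal, version-agnostic unpacking sketch (CartPole is used here only to avoid the Box2D dependency; the same applies to LunarLander-v2):

    import gym

    env = gym.make('CartPole-v1')
    reset_out = env.reset()
    observation = reset_out[0] if isinstance(reset_out, tuple) else reset_out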

does not start on latest python

Hello, I'm trying to run your example lunar_lander.py but it does not start. I have the following error:
Traceback (most recent call last):
File "D:/ai/DeepQLearning/lunar_lander.py", line 9, in
env = gym.make('LunarLander-v2')
File "D:\ai\DeepQLearning\venv\lib\site-packages\gym\envs\registration.py", line 156, in make
return registry.make(id, **kwargs)
File "D:\ai\DeepQLearning\venv\lib\site-packages\gym\envs\registration.py", line 101, in make
env = spec.make(**kwargs)
File "D:\ai\DeepQLearning\venv\lib\site-packages\gym\envs\registration.py", line 72, in make
cls = load(self._entry_point)
File "D:\ai\DeepQLearning\venv\lib\site-packages\gym\envs\registration.py", line 18, in load
fn = getattr(mod, attr_name)
AttributeError: module 'gym.envs.box2d' has no attribute 'LunarLander'
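
A hedged diagnostic sketch (an assumption: this AttributeError is most commonly caused by a missing Box2D dependency, which prevents gym from building its LunarLander class):

    # If this import fails, install the extra, e.g.:  pip install gym[box2d]
    import Box2D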

A2C with experience replay

Hello @philtabor ,

When you attempt to use experience replay in an actor-critic setting, it looks to me as if only the critic part is trained (gradients propagated), while the actor part, which comes from log_probs stored in a numpy array, cannot backpropagate gradients. However, IMHO the actual problem is more general: since the policy is supposed to be evolving, it does not make sense to store the results of an older, worse policy. The log_probs need to be recomputed in the learn function the same way as the outputs of the critic network; a sketch of that is below.
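
A minimal, self-contained sketch of the point above (assuming a categorical policy, with a toy linear layer standing in for the actor): recomputing the log-probs from stored states keeps them attached to the current policy parameters, unlike log-prob values stored as numpy arrays.

    import torch as T
    import torch.nn.functional as F

    policy = T.nn.Linear(4, 2)                 # toy stand-in for the actor network
    states = T.randn(8, 4)                     # replayed states
    actions = T.randint(0, 2, (8,))            # replayed actions
    advantages = T.randn(8)                    # stand-in advantages

    log_probs = F.log_softmax(policy(states), dim=1)[T.arange(8), actions]
    actor_loss = -(log_probs * advantages).mean()
    actor_loss.backward()                      # gradients reach the policy parameters
    print(policy.weight.grad is not None)      # True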

Issue: ValueError when converting observation to tensor (PPO, TF2)

(array([-0.02680779, 0.00466264, -0.02511859, -0.04842809], dtype=float32), {})
Traceback (most recent call last):
File "main.py", line 31, in
action, prob, val = agent.choose_action(observation)
File "D:\AI\PPO\agent.py", line 41, in choose_action
state = tf.convert_to_tensor([observation],dtype=tf.float32)
File "C:\Users\Buster.conda\envs\PPO\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\Buster.conda\envs\PPO\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Can't convert non-rectangular Python sequence to Tensor.

Error when I changed dueling_ddqn_torch.py to get multiple discrete actions

I want to implement a dueling double DQN algorithm for selecting multiple discrete actions. Since the existing dueling_ddqn_torch.py code is for choosing a single action, I should modify it. But when I changed the choose_action function of Agent to get multiple actions, I got the following error:
IndexError: tensors used as indices must be long, byte or bool tensors
The full traceback of the error is:
Traceback (most recent call last):
File "C:/Users/Desktop/D3QN.py", line 339, in <module>
agent.learn()
File "C:/Users/Desktop/D3QN.py", line 157, in learn
q_pred = T.add(Vs, (As - As.mean(dim=1, keepdim=True)))[indices, actions]
IndexError: tensors used as indices must be long, byte or bool tensors
The whole code is attached.
dueling_ddqn_torch.zip
I would be grateful if anyone could help me solve this error.
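
A minimal sketch of the usual cause of this IndexError (an assumption: the action batch sampled from the replay buffer is stored as floats, so it cannot be used for tensor indexing until it is cast to long):

    import torch as T

    q_values = T.randn(4, 6)                   # batch of 4 states, 6 actions
    indices = T.arange(4)
    actions = T.tensor([1.0, 0.0, 5.0, 2.0])   # float dtype reproduces the IndexError
    # q_values[indices, actions]               # -> tensors used as indices must be long ...
    actions = actions.long()                   # or store actions as integers in the buffer
    print(q_values[indices, actions])          # works once the dtype is long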

OSError: Unable to create file (unable to open file: name = 'home/ak/Desktop/DDPG/DDPG/tensorflow2/Mir robot/tmp/ddpg/actor_ddpg.h5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 242)

I am getting this error while training. Any idea on how to solve this?
... saving models ...
Traceback (most recent call last):
File "/home/ak/Desktop/DDPG/DDPG/tensorflow2/Mir robot/main_ddpg.py", line 63, in
agent.save_models()
File "/home/ak/Desktop/DDPG/DDPG/tensorflow2/Mir robot/ddpg_tf2.py", line 56, in save_models
self.actor.save_weights(self.actor.checkpoint_file)
File "/home/ak/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 2217, in save_weights
with h5py.File(filepath, 'w') as f:
File "/home/ak/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 424, in init
fid = make_fid(name, mode, userblock_size,
File "/home/ak/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 196, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 116, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = 'home/ak/Desktop/DDPG/DDPG/tensorflow2/Mir robot/tmp/ddpg/actor_ddpg.h5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 242)
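
Two things stand out in the error (both assumptions drawn only from the message itself): the path begins with 'home/ak/...' rather than '/home/ak/...', so it is resolved relative to the current working directory, and errno 2 suggests the 'tmp/ddpg' checkpoint directory does not exist yet, since save_weights will not create missing folders. A minimal sketch of the usual fix:

    import os

    chkpt_dir = os.path.join('tmp', 'ddpg')
    os.makedirs(chkpt_dir, exist_ok=True)                   # create the folder before saving
    checkpoint_file = os.path.join(chkpt_dir, 'actor_ddpg.h5')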

simple_dqn_tf2.py Doesn't allow for multiple return actions

If you try to change the n_actions parameter, then when the model tries to learn it will fail:

164/164 [==============================] - 0s 998us/step
164/164 [==============================] - 0s 887us/step
[[[nan nan nan ... nan nan nan]]

 [[nan nan nan ... nan nan nan]]

 [[nan nan nan ... nan nan nan]]

 ...

 [[nan nan nan ... nan nan nan]]

 [[nan nan nan ... nan nan nan]]

 [[nan nan nan ... nan nan nan]]] [   0    1    2 ... 5245 5246 5247] [list([2, 2, 5]) list([2, 1, 6]) list([3, 0, 6]) ... list([3, 0, 7])
 list([3, 8, 5]) list([3, 0, 3])]
Traceback (most recent call last):
  File "main.py", line 30, in <module>
    agent.learn()
  File "simple_dqn_tf2.py", line 95, in learn
    self.gamma * np.max(q_next, axis=1)*dones
ValueError: operands could not be broadcast together with shapes (5248,82) (5248,)

This definitely has to do with the shape of the stored action. I'm just not sure how to fix it.

5248 = n_actions * batch_size
82 = n_actions

Pytorch - question.

@philtabor,
This is an intriguing implementation of PPO2. It is simple and it converges for cartpole quicker than any other I have seen. Taking a basic definition of "convergence" as 10 episodes in a row at total reward=max reward (200), this converges in ~230 episodes.

I tested a version with all PyTorch functions converted to identical TensorFlow 2.3 functions, adding two gradient tapes to the .learn() function. It doesn't converge nearly as well. Do you have any idea why? Is it a characteristic of PyTorch that makes this implementation so successful?

Apologies if I have posted this twice, I am new to github.

Malfunctioning in simple_dqn_torch.py

Hello.

I had some problems regarding the DeepQNetwork implementation using PyTorch.

I ran the code shown in your YouTube video and got this error:

~/PROJECTS/PYTORCH_TUTORIAL/main_DQN_file.py in <module>
     35             brain.store_transition(observation, action, reward, observation_, done)
     36 
---> 37             brain.learn()
     38             observation = observation_
     39         scores.append(score)

~/PROJECTS/PYTORCH_TUTORIAL/simple_DQN.py in learn(self)
    123             print("Q_Target slice: ",q_target[batch_index,actions_random])
    124             q_target[batch_index, action_indices] = reward_batch + \
--> 125                 self.gamma*T.max(q_next,dim=1)[0]*terminal_batch
    126 
    127             self.epsilon = self.epsilon*self.eps_dec if self.epsilon > \

IndexError: The shape of the mask [64] at index 0 does not match the shape of the indexed tensor [64, 4] at index 1

This error appears when the action indices are calculated using the dot operator.

When I use the np.argmax function instead, the whole network works properly.

Have you encountered this type of problem?

Model never learns the game

Hi. I was following your YouTube tutorial on the actor-critic method in continuous action space (lunar lander). However, despite having the same code, my model almost never scores higher than zero, never mind reaching anywhere near 200, even after a significant number of episodes. The code is here:
https://github.com/6opoDuJIo/RL_Playground/blob/master/lunar_lander.py
And part of the log file is :

episode 6265 score -92.32 average score -178.79
episode 6266 score -17.88 average score -177.76
episode 6267 score -119.38 average score -176.25
episode 6268 score -104.23 average score -173.38
episode 6269 score -83.28 average score -172.56
episode 6270 score -146.12 average score -172.68
episode 6271 score -126.20 average score -173.07
episode 6272 score -226.61 average score -172.13
episode 6273 score -245.62 average score -173.37
episode 6274 score -105.59 average score -171.55
episode 6275 score -141.94 average score -173.44
episode 6276 score -301.27 average score -175.40
episode 6277 score -82.96 average score -175.56
episode 6278 score -134.57 average score -175.98
episode 6279 score -51.25 average score -174.88
episode 6280 score -81.76 average score -174.06
episode 6281 score -227.78 average score -173.70
episode 6282 score -386.15 average score -175.98
episode 6283 score -297.21 average score -177.16
episode 6284 score -422.21 average score -180.37
episode 6285 score -140.92 average score -180.87
episode 6286 score -236.97 average score -180.38
episode 6287 score -119.24 average score -179.11
episode 6288 score -76.24 average score -179.02
episode 6289 score -85.39 average score -176.80
episode 6290 score -131.07 average score -178.32
episode 6291 score -110.64 average score -179.56
episode 6292 score -150.60 average score -179.94
episode 6293 score -68.53 average score -179.51
episode 6294 score -184.71 average score -179.02
episode 6295 score -263.88 average score -180.30
episode 6296 score -287.41 average score -182.61
episode 6297 score -98.54 average score -181.06
episode 6298 score -82.03 average score -180.98
episode 6299 score -284.01 average score -181.44
episode 6300 score -88.97 average score -180.43
episode 6301 score -102.73 average score -178.80
episode 6302 score -179.52 average score -180.30
episode 6303 score -222.17 average score -181.51
episode 6304 score -246.87 average score -182.72
episode 6305 score -331.83 average score -184.92
episode 6306 score -361.46 average score -187.54
episode 6307 score -89.69 average score -187.46
episode 6308 score -27.86 average score -187.13
episode 6309 score -135.48 average score -184.37
episode 6310 score -115.25 average score -182.98

Error in Saving models

Hi, I was trying to save the model (lunar lander YouTube tutorial) but I'm not able to. I tried adding agent.save_model() in the file main_tf2_dqn_lunar_lander.py, but then it gives the error below:

Weights for model sequential have not yet been created. Weights are created when the Model is first called on inputs or build() is called with an input_shape.
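
A minimal sketch of one workaround (an assumption: the Sequential network has never been called or built before saving, which is exactly what the Keras error message describes). Building the model with its input shape, or running a dummy forward pass, creates the weights so they can be saved; the layer sizes and file name below are placeholders, not the repo's values.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(4),
    ])
    model.build(input_shape=(None, 8))          # or run a dummy forward pass: model(tf.zeros((1, 8)))
    model.save_weights('dqn_lunar_lander.h5')   # weights now exist and can be saved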

Error in store_transition in PyTorch DQNs

In "main_torch_dqn_lunar_lander_2020.py" file

--> self.state_memory[index] = state
It says
"ValueError: setting an array element with a sequence. The requested array would exceed the maximum number of dimension of 1"

When I alter a few things to get rid of this error, I run into another error.
Could you help me out?

Your Python code does not start.

Hello, I'm trying to run your lunar lander code, 'main_torch_dqn_lunar_lander.py' in the 'archive' folder, but it does not start. The following is the error. Thank you.
File "C:\Users\ys-th\Desktop\Spring2022\main_torch_dqn_lunar_lander.py", line 15, in
env = gym.make('LunarLander-v2')

File "C:\Users\ys-th\miniconda3\lib\site-packages\gym\envs\registration.py", line 676, in make
return registry.make(id, **kwargs)

File "C:\Users\ys-th\miniconda3\lib\site-packages\gym\envs\registration.py", line 490, in make
versions = self.env_specs.versions(namespace, name)

File "C:\Users\ys-th\miniconda3\lib\site-packages\gym\envs\registration.py", line 220, in versions
self._assert_name_exists(namespace, name)

File "C:\Users\ys-th\miniconda3\lib\site-packages\gym\envs\registration.py", line 271, in _assert_name_exists
self._assert_namespace_exists(namespace)

File "C:\Users\ys-th\miniconda3\lib\site-packages\gym\envs\registration.py", line 268, in _assert_namespace_exists
raise error.NamespaceNotFound(message)

NamespaceNotFound: Namespace None does not exist.

Personal Question

First of all thanks for the useful videos!
Second, I had a personal doubt and I am only posting here as I have been trying to fix it for the last couple of days. My apologies if this isn't acceptable.

I followed the DQN PyTorch 2020 tutorial, which has LunarLander as the environment. I tried running it for CartPole as well, but I am getting an error; I'll attach a picture of it. It works for a few iterations and then fails. I have changed the action space and the input dimensions accordingly.
Thank you.
[attached image: doubt]

@philtabor

[TensorFlow2] Critic Loss Calculation for actor_critic

If I understand correctly, the code in tensorflow2/actor_critic.py implements the One-step Actor-Critic (episodic) algorithm given on page 332 of RLbook2020 by Sutton/Barto (picture given below).

[image: One-step Actor-Critic (episodic) pseudocode, RLbook2020 p. 332]

Here we can see that the critic parameters w are updated only using the gradient of the value function for the current state S, which is represented as grad(V(S, w)) in the pseudocode shown above. The update skips the gradient of the value function for the next state S'. This can again be seen in the pseudocode above: there is no grad(V(S', w)) present in the update rule for the critic parameters w.

In the code given below, including state_value_, _ = self.actor_critic(state_) (L43) inside the GradientTape would result in grad(V(S', w)) appearing in the update for w, which contradicts the pseudocode shown above.

reward = tf.convert_to_tensor(reward, dtype=tf.float32)  # not fed to NN
with tf.GradientTape(persistent=True) as tape:
    state_value, probs = self.actor_critic(state)
    state_value_, _ = self.actor_critic(state_)
    state_value = tf.squeeze(state_value)
    state_value_ = tf.squeeze(state_value_)

Please let me know if there are some gaps in my understanding!
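
A minimal, self-contained sketch of one way to follow the pseudocode exactly: wrap the next-state value in tf.stop_gradient (or compute it outside the tape) so grad(V(S', w)) never enters the critic update. The toy value network and variable names here are stand-ins, not the repo's actual actor_critic model.

    import tensorflow as tf

    value_net = tf.keras.Sequential([tf.keras.layers.Dense(1)])   # toy value head
    state = tf.random.normal((1, 4))
    state_ = tf.random.normal((1, 4))
    reward, gamma = 1.0, 0.99

    with tf.GradientTape() as tape:
        state_value = tf.squeeze(value_net(state))
        state_value_ = tf.stop_gradient(tf.squeeze(value_net(state_)))   # V(S', w) as a constant
        delta = reward + gamma * state_value_ - state_value
        critic_loss = delta ** 2

    grads = tape.gradient(critic_loss, value_net.trainable_variables)    # only grad(V(S, w))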
