
philtabor / youtube-code-repository

854 stars · 20 watchers · 479 forks · 43.1 MB

Repository for most of the code from my YouTube channel

Python 51.15% Jupyter Notebook 48.85%
reinforcement-learning monte-carlo-methods qlearning-algorithm convolutional-neural-networks sarsa


youtube-code-repository's Issues

D3QN for Multiple Action Selection

Hello,
I reused the dueling double DQN for a problem in which multiple actions should be selected, i.e., the output of the choose_action function is a vector of actions drawn from the range [0, n_actions).
How can I change the code to get non-repetitive actions (i.e., an action vector whose entries are all different)? A sketch of one option is given below.
My code is accessible at https://github.com/Nazanin-87/D3QN_TF.git
I appreciate any help in solving this issue.
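
A hedged sketch of one option (an assumption about one way to do this, not code from the repo): if the dueling head produces a row of action values of shape (1, n_actions), torch.topk returns the k highest-valued actions, which are distinct by construction. This is purely greedy, so an exploration strategy would still need to be added on top.

    import torch as T

    advantages = T.randn(1, 10)                 # stand-in for the dueling head output
    k = 3
    _, actions = T.topk(advantages, k, dim=1)   # k highest-valued actions, all distinct
    print(actions)                              # e.g. tensor([[7, 2, 5]])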

Outdated argument critic network DDPG

Hi,
First of all, thank you for the code. It seems the n_actions argument was removed in a previous update (ba997f7). I believe it should be removed from these two calls as well to make the code work again:

self.critic = CriticNetwork(n_actions=n_actions, name='critic')

self.target_critic = CriticNetwork(n_actions=n_actions, name='target_critic')

tensorflow version

@philtabor, thanks a lot for the videos.

Can you please let me know which TensorFlow version you are using? I am running into versioning issues:
kernel_initializer=tf.variance_scaling_initializer(scale=2))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
TypeError: conv2d() got an unexpected keyword argument 'input'
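
This looks like TF 1.x-style code running on a TF 2.x install. A hedged workaround sketch (an assumption about the cause, not a confirmed fix): either pin a 1.x release (e.g. pip install tensorflow==1.15) or run the code through the v1 compatibility shim, under which the initializer from the traceback still resolves.

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    init = tf.variance_scaling_initializer(scale=2.0)   # resolves in the v1 namespace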

Using observation space dimension size for the Actor and Critic models.

Hi, @philtabor !

Thanks for the awesome code and video!
Those really help me to study and understand reinforcement learning.

I found that the actor and critic models are not using the observation space dimension as their input_dims.
Shouldn't input_dims match the size of the observation space?

What do you think?

I made pull request #39 about this.
I really appreciate your review!
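
A minimal sketch of the suggestion (assuming the networks take an input_dims argument like the other models in the repo; the constructor calls in the comments are hypothetical, only to show where the value would go):

    import gym

    env = gym.make('CartPole-v1')
    input_dims = env.observation_space.shape   # e.g. (4,) for CartPole
    n_actions = env.action_space.n
    # actor = ActorNetwork(n_actions=n_actions, input_dims=input_dims)    # hypothetical
    # critic = CriticNetwork(input_dims=input_dims)                       # hypothetical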

Little issue in ddqn_keras.py

Hi Phil, there is a small issue in the method update_network_parameters(self). I think it should be:
self.q_target.set_weights(self.q_eval.get_weights())

instead of:
self.q_target.model.set_weights(self.q_eval.model.get_weights())

Thanks for your code and your videos! These are really useful for me.
Jose.

Purpose of Passing New Frame into State Memory with Previous Action

Hi Phil, huge fan of your work. I have two questions regarding the policy gradients TensorFlow code for Space Invaders:

1. In reinforce_cnn_tf.py, in the choose_action function, there is this line:

probabilities = self.sess.run(self.actions, feed_dict={self.input: observation})[0]

Here the 0 selects the first of the 4 probability distributions. If that is the case, then your actions are taken based on the first frame, i.e. the 0th observation of stacked_frames. Is that right?

2. Assuming my first assumption is right, there is a line in the main_tf_reinforce_space_invaders.py file:

observation, reward, done, info = env.step(action)
observation = preprocess(observation)
stacked_frames = stack_frames(stacked_frames, observation, stack_size)
agent.store_transition(observation, action, reward)   # <-- this one

Here the new observation is stored together with an action that was taken based on the 0th observation in the stacked frames. If this is the case, why does this work while training the agent? Are the probability distributions, when the observations are fed in, different from the labels?
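
A small shape check relevant to question 1 (this rests on an assumption: the observation is fed in as a single batched stack of frames, so the network output has shape (1, n_actions) and the [0] strips the batch dimension rather than selecting one of the stacked frames):

    import numpy as np

    batched_probs = np.random.rand(1, 6)   # stand-in for sess.run output: (batch, n_actions)
    probabilities = batched_probs[0]       # removes the batch dimension, not a frame
    print(probabilities.shape)             # (6,)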

agent.learn is called with action but the function signature does not include it

Traceback (most recent call last):
File "main.py", line 42, in
agent.learn(observation, action, reward, observation_, done)
TypeError: learn() takes 5 positional arguments but 6 were given

change:
def learn(self, state, reward, state_, done):

to

def learn(self, state, action, reward, state_, done):

in the actor critic code.

DQN

File "/home/../aichess/main.py", line 13, in
agent = Agent(
File "/home/../aichess/engines/dqn.py", line 114, in init
self.q_eval = DeepQNetwork(self.lr, self.n_actions,
TypeError: DeepQNetwork.init() got multiple values for argument 'input_dims'

Suggested fix: TypeError in actor_critic tensorflow2 main.py

In the following file:
Youtube-Code-Repository/ReinforcementLearning/PolicyGradient/actor_critic/tensorflow2/main.py

Running the program returns:
File "actor_critic.py", line 157, in
agent.learn(observation, action, reward, observation_, done)
TypeError: learn() takes 5 positional arguments but 6 were given

I think this could be fixed by changing the following line:
agent.learn(observation, action, reward, observation_, done)

to:
agent.learn(observation, reward, observation_, done)

Sorry for the suggestion format, first time suggesting fixes :D

Issue with critic target in PPO

In the line used to define the returns, we use the GAE + values as the target for the critic to learn. Is this correct?

My intuition says no -- the target we are training towards does not represent the true value function; should the target for value of the current state not be the observed reward + value at the next state?

Thanks!
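
A minimal numeric sketch of why GAE + values is a reasonable critic target (an assumption about the intent of the repo's PPO code, not a claim that it is the only correct choice): with lambda = 1 the GAE advantage reduces to the discounted return-to-go minus V(s), so adding V(s) back gives an ordinary return regression target; with lambda < 1 it is the lambda-return, which interpolates between the one-step TD target reward + V(s') mentioned above (lambda = 0) and the full return (lambda = 1).

    import numpy as np

    gamma, lam = 0.99, 1.0
    rewards = np.array([1.0, 1.0, 1.0])
    values  = np.array([0.5, 0.4, 0.3, 0.0])        # V(s_0..s_2) plus bootstrap V(s_3)

    deltas = rewards + gamma * values[1:] - values[:-1]
    adv, running = np.zeros(3), 0.0
    for t in reversed(range(3)):
        running = deltas[t] + gamma * lam * running
        adv[t] = running

    returns = adv + values[:-1]                      # the critic target in question
    print(returns)                                   # equals the discounted return-to-go here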

[Question] Replay memory

def store_transition(self, state, action, reward, state_, terminal):

In the TensorFlow deep Q-learning file 'dqn_tf.py' you are using a replay memory.
I have a question about using it.

If I understand it right, the storage is never cleared?
So the function store_transition keeps saving states across epochs?
And in epoch 2 it is possible that a state from epoch 1 is used for training.
So it never gets cleared as long as the program is running.
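
That matches how buffers of this kind usually behave. A minimal, self-contained sketch (assuming the repo's buffer follows the common "index = counter % size" pattern): nothing is ever explicitly cleared; once the buffer is full, the oldest transitions are simply overwritten, so transitions from earlier episodes can indeed be sampled later.

    import numpy as np

    class ReplayBuffer:
        def __init__(self, max_size, input_dims):
            self.mem_size = max_size
            self.mem_cntr = 0
            self.state_memory = np.zeros((max_size, input_dims), dtype=np.float32)

        def store_transition(self, state):
            index = self.mem_cntr % self.mem_size    # wraps: oldest entry gets overwritten
            self.state_memory[index] = state
            self.mem_cntr += 1

    buf = ReplayBuffer(max_size=3, input_dims=2)
    for step in range(5):
        buf.store_transition(np.full(2, step, dtype=np.float32))
    print(buf.state_memory)    # slots 0 and 1 now hold steps 3 and 4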

SAC custom env

I get this error:


ValueError Traceback (most recent call last)
in
26 score = 0
27 while not done:
---> 28 action = agent.choose_action(observation)
29 observation_, reward, done, info = env.step(action)
30 score += reward

in choose_action(self, observation)
23 def choose_action(self, observation):
24 state = T.Tensor([observation]).to(self.actor.device)
---> 25 actions, _ = self.actor.sample_normal(state, reparameterize=False)
26
27 return actions.cpu().detach().numpy()[0]

in sample_normal(self, state, reparameterize)
38 def sample_normal(self, state, reparameterize=True):
39 mu, sigma = self.forward(state)
---> 40 probabilities = Normal(mu, sigma)
41
42 if reparameterize:

~\Anaconda3\lib\site-packages\torch\distributions\normal.py in __init__(self, loc, scale, validate_args)
48 else:
49 batch_shape = self.loc.size()
---> 50 super(Normal, self).__init__(batch_shape, validate_args=validate_args)
51
52 def expand(self, batch_shape, _instance=None):

~\Anaconda3\lib\site-packages\torch\distributions\distribution.py in __init__(self, batch_shape, event_shape, validate_args)
54 if not valid.all():
55 raise ValueError(
---> 56 f"Expected parameter {param} "
57 f"({type(value).__name__} of shape {tuple(value.shape)}) "
58 f"of distribution {repr(self)} "

ValueError: Expected parameter loc (Tensor of shape (1, 1, 1)) of distribution Normal(loc: tensor([[[nan]]], device='cuda:0', grad_fn=), scale: tensor([[[nan]]], device='cuda:0', grad_fn=)) to satisfy the constraint Real(), but found invalid values:
tensor([[[nan]]], device='cuda:0', grad_fn=)

Any idea how to fix this?

PPO pytorch implementation question

Hi,
Thank you so much for this guide - it is extremely clear and easy to follow! This isn't a bug, but there are a few questions I have. The first:
Why is the 'values' tensor sent to the actor device (referring to this line):

values = T.tensor(values).to(self.actor.device)

The values tensor is not used exclusively by the actor; it is used by both the actor and the critic, since it feeds the MSE loss for the critic and is then added to the actor loss to form the total loss.

And second:
why did you not include a KL divergence term in your implementation? Was there a specific reason?

thank you so much again!

td3

File "main.py", line 11, in <module> n_actions=env.action_space.shape[0]) IndexError: tuple index out of range

Pendulum TF2 maybe a bug found

Hey.

I've been launching your Pendulum TF2 project, and it only ran after I changed lines 23 and 25 of ddpg_tf2.py from
self.critic = CriticNetwork(n_actions=n_actions, name='critic')
self.target_critic = CriticNetwork(n_actions=n_actions, name='target_critic')
to
self.critic = CriticNetwork(name='critic')
self.target_critic = CriticNetwork(name='target_critic')
I believe that's a bug?

Do some of the variables require gradient?

Thanks for sharing your code! There is a problem that bothers me.
For example, in the PyTorch SAC code, do we need to use with torch.no_grad() or detach() when computing value_target and q_hat?
value_target = critic_value - log_probs # line 96
q_hat = self.scale*reward + self.gamma*value_ # line 116

I think they need to stop gradient computation.

value_target = value_target.detach()

q_hat = q_hat.detach()

The author of the TD3 algorithm uses with torch.no_grad() to compute target_Q:
https://github.com/sfujim/TD3/blob/master/TD3.py
110 with torch.no_grad():
And the author of the SAC algorithm uses tf.stop_gradient() to compute value_target and q_hat:
https://github.com/haarnoja/sac/blob/master/sac/algos/sac.py
256 ys = tf.stop_gradient( self.scale_reward * self._rewards_ph + (1 - self._terminals_ph) * self._discount * vf_next_target_t ) # N
330 self._vf_t - tf.stop_gradient(min_log_target - log_pi + policy_prior_log_probs) )**2)

Could you give me some suggestions about this problem, please?
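
For reference, a minimal sketch of the questioner's suggestion in PyTorch, using dummy tensors in place of the repo's critic_value, log_probs, reward and value_ (so this only demonstrates the mechanics, not the repo's actual code):

    import torch as T

    critic_value = T.randn(4, requires_grad=True)
    log_probs    = T.randn(4, requires_grad=True)
    reward       = T.randn(4)
    value_       = T.randn(4, requires_grad=True)
    scale, gamma = 2.0, 0.99

    with T.no_grad():                      # targets carry no gradient
        value_target = critic_value - log_probs
        q_hat = scale * reward + gamma * value_

    print(value_target.requires_grad, q_hat.requires_grad)   # False False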

ValueError: The parameter loc has invalid values

I've downloaded your code and made the following small changes:
-removed all loading/checkpointing/saving functions/calls
-switched the gym environment to env = gym.make("InvertedPendulum-v2")

After some training (variable amount of time before error occurs) I get the following bug:
File "C:\Users\john\Desktop\project\Clone\sac_torch.py", line 32, in choose_action actions, _ = self.actor.sample_normal(state, reparameterize=False) File "C:\Users\john\Desktop\project\Clone\networks.py", line 105, in sample_normal probabilities = Normal(mu, sigma) File "C:\Users\john\anaconda3\lib\site-packages\torch\distributions\normal.py", line 50, in __init__ super(Normal, self).__init__(batch_shape, validate_args=validate_args) File "C:\Users\john\anaconda3\lib\site-packages\torch\distributions\distribution.py", line 53, in __init__ raise ValueError("The parameter {} has invalid values".format(param)) ValueError: The parameter loc has invalid values

I printed out mu and sigma and see that immediately before the error they have become nan:
tensor([[nan]], device='cuda:0', grad_fn=<AddmmBackward>) tensor([[nan]], device='cuda:0', grad_fn=<ClampBackward1>)
(This appears to be occurring during a forward pass, not buffer sampling, due to the tensor being 1 dimensional)

Thanks again for the quick reply in your video!

Critic loss calculation

Hi @philtabor

Thank you for the great tutorials. I am following "ppo_torch.py" (code link) and its tutorial, and one thing I did not understand is why we are not storing the next_state and using it to generate a critic value, which would then be used for calculating the loss function. Don't we need the next state in PPO? Am I missing something?

main_keras_dqn_lunar_lander: first env.reset() returns an array plus an empty dict

On the first env.reset() call, a tuple of the array and an empty dict is returned, and this empty dict screws up the rest of the code.
Is that a new addition to the gym library or a code problem?

(array([ 0.00469818, 1.3994393 , 0.47585568, -0.5102643 , -0.0054372 , -0.10778829, 0. , 0. ], dtype=float32), {})
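
This is a gym API change rather than a code problem: newer gym releases (0.26 and later, and gymnasium) return an (observation, info) tuple from env.reset(), while older releases return just the observation. A minimal, version-agnostic unpacking sketch (CartPole is used here only to avoid the Box2D dependency; the same applies to LunarLander-v2):

    import gym

    env = gym.make('CartPole-v1')
    reset_out = env.reset()
    observation = reset_out[0] if isinstance(reset_out, tuple) else reset_out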

does not start on latest python

Hello, I'm trying to run your example lunar_lander.py but it does not start. I have the following error:
Traceback (most recent call last):
File "D:/ai/DeepQLearning/lunar_lander.py", line 9, in
env = gym.make('LunarLander-v2')
File "D:\ai\DeepQLearning\venv\lib\site-packages\gym\envs\registration.py", line 156, in make
return registry.make(id, **kwargs)
File "D:\ai\DeepQLearning\venv\lib\site-packages\gym\envs\registration.py", line 101, in make
env = spec.make(**kwargs)
File "D:\ai\DeepQLearning\venv\lib\site-packages\gym\envs\registration.py", line 72, in make
cls = load(self._entry_point)
File "D:\ai\DeepQLearning\venv\lib\site-packages\gym\envs\registration.py", line 18, in load
fn = getattr(mod, attr_name)
AttributeError: module 'gym.envs.box2d' has no attribute 'LunarLander'
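
A hedged diagnostic sketch (an assumption: this AttributeError is most commonly caused by a missing Box2D dependency, which prevents gym from building its LunarLander class):

    # If this import fails, install the extra, e.g.:  pip install gym[box2d]
    import Box2D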

A2C with experience replay

Hello @philtabor ,

When you attempt to use experience replay in an actor-critic setting, it looks to me as if only the critic part is trained (gradients propagated), while the actor part, which comes from log_probs stored in a numpy array, cannot backpropagate gradients. However, IMHO the actual problem is more general: since the policy is supposed to be evolving, it does not make sense to store the results of an older, worse policy. The log_probs need to be recomputed in the learn function the same way as the outputs of the critic network; a sketch of that is below.
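
A minimal, self-contained sketch of the point above (assuming a categorical policy, with a toy linear layer standing in for the actor): recomputing the log-probs from stored states keeps them attached to the current policy parameters, unlike log-prob values stored as numpy arrays.

    import torch as T
    import torch.nn.functional as F

    policy = T.nn.Linear(4, 2)                 # toy stand-in for the actor network
    states = T.randn(8, 4)                     # replayed states
    actions = T.randint(0, 2, (8,))            # replayed actions
    advantages = T.randn(8)                    # stand-in advantages

    log_probs = F.log_softmax(policy(states), dim=1)[T.arange(8), actions]
    actor_loss = -(log_probs * advantages).mean()
    actor_loss.backward()                      # gradients reach the policy parameters
    print(policy.weight.grad is not None)      # True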

Issue: ValueError when converting observation to tensor (PPO, TF2)

(array([-0.02680779, 0.00466264, -0.02511859, -0.04842809], dtype=float32), {})
Traceback (most recent call last):
File "main.py", line 31, in
action, prob, val = agent.choose_action(observation)
File "D:\AI\PPO\agent.py", line 41, in choose_action
state = tf.convert_to_tensor([observation],dtype=tf.float32)
File "C:\Users\Buster.conda\envs\PPO\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\Buster.conda\envs\PPO\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Can't convert non-rectangular Python sequence to Tensor.

Error when I changed dueling_ddqn_torch.py to get multiple discrete actions

I want to implement a dueling double DQN algorithm for selecting multiple discrete actions. Since the existing dueling_ddqn_torch.py code is for choosing a single action, I should modify it. But when I changed the choose_action function of Agent to get multiple actions, I got the following error:
IndexError: tensors used as indices must be long, byte or bool tensors
The full traceback of the error is:
Traceback (most recent call last):
File "C:/Users/Desktop/D3QN.py", line 339, in <module>
agent.learn()
File "C:/Users/Desktop/D3QN.py", line 157, in learn
q_pred = T.add(Vs, (As - As.mean(dim=1, keepdim=True)))[indices, actions]
IndexError: tensors used as indices must be long, byte or bool tensors
The whole code is attached.
dueling_ddqn_torch.zip
I would be grateful if anyone could help me solve this error.
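
A minimal sketch of the usual cause of this IndexError (an assumption: the action batch sampled from the replay buffer is stored as floats, so it cannot be used for tensor indexing until it is cast to long):

    import torch as T

    q_values = T.randn(4, 6)                   # batch of 4 states, 6 actions
    indices = T.arange(4)
    actions = T.tensor([1.0, 0.0, 5.0, 2.0])   # float dtype reproduces the IndexError
    # q_values[indices, actions]               # -> tensors used as indices must be long ...
    actions = actions.long()                   # or store actions as integers in the buffer
    print(q_values[indices, actions])          # works once the dtype is long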

OSError: Unable to create file (unable to open file: name = 'home/ak/Desktop/DDPG/DDPG/tensorflow2/Mir robot/tmp/ddpg/actor_ddpg.h5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 242)

I am getting this error while training. Any idea on how to solve this?
... saving models ...
Traceback (most recent call last):
File "/home/ak/Desktop/DDPG/DDPG/tensorflow2/Mir robot/main_ddpg.py", line 63, in
agent.save_models()
File "/home/ak/Desktop/DDPG/DDPG/tensorflow2/Mir robot/ddpg_tf2.py", line 56, in save_models
self.actor.save_weights(self.actor.checkpoint_file)
File "/home/ak/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 2217, in save_weights
with h5py.File(filepath, 'w') as f:
File "/home/ak/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 424, in init
fid = make_fid(name, mode, userblock_size,
File "/home/ak/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 196, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 116, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = 'home/ak/Desktop/DDPG/DDPG/tensorflow2/Mir robot/tmp/ddpg/actor_ddpg.h5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 242)
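
Two things stand out in the error (both assumptions drawn only from the message itself): the path begins with 'home/ak/...' rather than '/home/ak/...', so it is resolved relative to the current working directory, and errno 2 suggests the 'tmp/ddpg' checkpoint directory does not exist yet, since save_weights will not create missing folders. A minimal sketch of the usual fix:

    import os

    chkpt_dir = os.path.join('tmp', 'ddpg')
    os.makedirs(chkpt_dir, exist_ok=True)                   # create the folder before saving
    checkpoint_file = os.path.join(chkpt_dir, 'actor_ddpg.h5')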

simple_dqn_tf2.py Doesn't allow for multiple return actions

If you try to change the n_actions parameter, then when the model tries to learn it will fail:

164/164 [==============================] - 0s 998us/step
164/164 [==============================] - 0s 887us/step
[[[nan nan nan ... nan nan nan]]

 [[nan nan nan ... nan nan nan]]

 [[nan nan nan ... nan nan nan]]

 ...

 [[nan nan nan ... nan nan nan]]

 [[nan nan nan ... nan nan nan]]

 [[nan nan nan ... nan nan nan]]] [   0    1    2 ... 5245 5246 5247] [list([2, 2, 5]) list([2, 1, 6]) list([3, 0, 6]) ... list([3, 0, 7])
 list([3, 8, 5]) list([3, 0, 3])]
Traceback (most recent call last):
  File "main.py", line 30, in <module>
    agent.learn()
  File "simple_dqn_tf2.py", line 95, in learn
    self.gamma * np.max(q_next, axis=1)*dones
ValueError: operands could not be broadcast together with shapes (5248,82) (5248,)

This definitely has to do with the shape of the stored action. I'm just not sure how to fix it.

5248 = n_actions * batch_size
82 = n_actions

Pytorch - question.

@philtabor,
This is an intriguing implementation of PPO2. It is simple and it converges for cartpole quicker than any other I have seen. Taking a basic definition of "convergence" as 10 episodes in a row at total reward=max reward (200), this converges in ~230 episodes.

I tested a version with all PyTorch functions converted to identical TensorFlow 2.3 functions, adding two gradient tapes to the .learn() function. It doesn't converge nearly as well. Do you have any idea why? Is it a characteristic of PyTorch that makes this implementation so successful?

Apologies if I have posted this twice, I am new to github.

Malfunctioning in simple_dqn_torch.py

Hello.

I had some problems regarding the DeepQNetwork implementation using PyTorch.

I ran the code shown in your YouTube video and got this error:

~/PROJECTS/PYTORCH_TUTORIAL/main_DQN_file.py in <module>
     35             brain.store_transition(observation, action, reward, observation_, done)
     36 
---> 37             brain.learn()
     38             observation = observation_
     39         scores.append(score)

~/PROJECTS/PYTORCH_TUTORIAL/simple_DQN.py in learn(self)
    123             print("Q_Target slice: ",q_target[batch_index,actions_random])
    124             q_target[batch_index, action_indices] = reward_batch + \
--> 125                 self.gamma*T.max(q_next,dim=1)[0]*terminal_batch
    126 
    127             self.epsilon = self.epsilon*self.eps_dec if self.epsilon > \

IndexError: The shape of the mask [64] at index 0 does not match the shape of the indexed tensor [64, 4] at index 1

This error appears when the action indices are calculated using the dot operator.

When I use the np.argmax function instead, the whole network works properly.

Have you encountered this type of problem?

Model never learns the game

Hi. I was following your YouTube tutorial on the actor-critic method in continuous action space (lunar lander). However, despite having the same code, my model almost never scores higher than zero, never mind reaching anywhere near 200, even after a significant number of episodes. The code is here:
https://github.com/6opoDuJIo/RL_Playground/blob/master/lunar_lander.py
And part of the log file is :

episode 6265 score -92.32 average score -178.79
episode 6266 score -17.88 average score -177.76
episode 6267 score -119.38 average score -176.25
episode 6268 score -104.23 average score -173.38
episode 6269 score -83.28 average score -172.56
episode 6270 score -146.12 average score -172.68
episode 6271 score -126.20 average score -173.07
episode 6272 score -226.61 average score -172.13
episode 6273 score -245.62 average score -173.37
episode 6274 score -105.59 average score -171.55
episode 6275 score -141.94 average score -173.44
episode 6276 score -301.27 average score -175.40
episode 6277 score -82.96 average score -175.56
episode 6278 score -134.57 average score -175.98
episode 6279 score -51.25 average score -174.88
episode 6280 score -81.76 average score -174.06
episode 6281 score -227.78 average score -173.70
episode 6282 score -386.15 average score -175.98
episode 6283 score -297.21 average score -177.16
episode 6284 score -422.21 average score -180.37
episode 6285 score -140.92 average score -180.87
episode 6286 score -236.97 average score -180.38
episode 6287 score -119.24 average score -179.11
episode 6288 score -76.24 average score -179.02
episode 6289 score -85.39 average score -176.80
episode 6290 score -131.07 average score -178.32
episode 6291 score -110.64 average score -179.56
episode 6292 score -150.60 average score -179.94
episode 6293 score -68.53 average score -179.51
episode 6294 score -184.71 average score -179.02
episode 6295 score -263.88 average score -180.30
episode 6296 score -287.41 average score -182.61
episode 6297 score -98.54 average score -181.06
episode 6298 score -82.03 average score -180.98
episode 6299 score -284.01 average score -181.44
episode 6300 score -88.97 average score -180.43
episode 6301 score -102.73 average score -178.80
episode 6302 score -179.52 average score -180.30
episode 6303 score -222.17 average score -181.51
episode 6304 score -246.87 average score -182.72
episode 6305 score -331.83 average score -184.92
episode 6306 score -361.46 average score -187.54
episode 6307 score -89.69 average score -187.46
episode 6308 score -27.86 average score -187.13
episode 6309 score -135.48 average score -184.37
episode 6310 score -115.25 average score -182.98

Error in Saving models

Hi, I was trying to save the model (lunar lander YouTube tutorial) but I'm not able to. I tried adding agent.save_model() in the file main_tf2_dqn_lunar_lander.py, but then it gives the error below:

Weights for model sequential have not yet been created. Weights are created when the Model is first called on inputs or build() is called with an input_shape.
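
A minimal sketch of one workaround (an assumption: the Sequential network has never been called or built before saving, which is exactly what the Keras error message describes). Building the model with its input shape, or running a dummy forward pass, creates the weights so they can be saved; the layer sizes and file name below are placeholders, not the repo's values.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(4),
    ])
    model.build(input_shape=(None, 8))          # or run a dummy forward pass: model(tf.zeros((1, 8)))
    model.save_weights('dqn_lunar_lander.h5')   # weights now exist and can be saved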

Error in store_transition in PyTorch DQNs

In "main_torch_dqn_lunar_lander_2020.py" file

--> self.state_memory[index] = state
It says
"ValueError: setting an array element with a sequence. The requested array would exceed the maximum number of dimension of 1"

When I alter a few things to get rid of this error, I run into another error.
Could you help me out?

Your Python code does not start.

Hello, I'm trying to run your lunar lander code, 'main_torch_dqn_lunar_lander.py' in the 'archive' folder, but it does not start. The following is the error. Thank you.
File "C:\Users\ys-th\Desktop\Spring2022\main_torch_dqn_lunar_lander.py", line 15, in
env = gym.make('LunarLander-v2')

File "C:\Users\ys-th\miniconda3\lib\site-packages\gym\envs\registration.py", line 676, in make
return registry.make(id, **kwargs)

File "C:\Users\ys-th\miniconda3\lib\site-packages\gym\envs\registration.py", line 490, in make
versions = self.env_specs.versions(namespace, name)

File "C:\Users\ys-th\miniconda3\lib\site-packages\gym\envs\registration.py", line 220, in versions
self._assert_name_exists(namespace, name)

File "C:\Users\ys-th\miniconda3\lib\site-packages\gym\envs\registration.py", line 271, in _assert_name_exists
self._assert_namespace_exists(namespace)

File "C:\Users\ys-th\miniconda3\lib\site-packages\gym\envs\registration.py", line 268, in _assert_namespace_exists
raise error.NamespaceNotFound(message)

NamespaceNotFound: Namespace None does not exist.

Personal Question

First of all thanks for the useful videos!
Second, I had a personal doubt and I am only posting here as I have been trying to fix it for the last couple of days. My apologies if this isn't acceptable.

I followed the DQN PyTorch 2020 tutorial, which has LunarLander as the environment. I tried running it for CartPole as well, but I am getting an error; I'll attach a picture of it. It works for a few iterations and then fails. I have changed the action space and the input dimensions accordingly.
Thank you.
[attached image: doubt]

@philtabor

[TensorFlow2] Critic Loss Calculation for actor_critic

If I understand correctly, the code in tensorflow2/actor_critic.py implements the One-step Actor-Critic (episodic) algorithm given on page 332 of RLbook2020 by Sutton/Barto (picture given below).

[image: One-step Actor-Critic (episodic) pseudocode, RLbook2020 p. 332]

Here we can see that the critic parameters w are updated only using the gradient of the value function for the current state S, which is represented as grad(V(S, w)) in the pseudocode shown above. The update skips the gradient of the value function for the next state S'. This can again be seen in the pseudocode above: there is no grad(V(S', w)) present in the update rule for the critic parameters w.

In the code given below, including state_value_, _ = self.actor_critic(state_) (L43) inside the GradientTape would result in grad(V(S', w)) appearing in the update for w, which contradicts the pseudocode shown above.

reward = tf.convert_to_tensor(reward, dtype=tf.float32)  # not fed to NN
with tf.GradientTape(persistent=True) as tape:
    state_value, probs = self.actor_critic(state)
    state_value_, _ = self.actor_critic(state_)
    state_value = tf.squeeze(state_value)
    state_value_ = tf.squeeze(state_value_)

Please let me know if there are some gaps in my understanding!
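
A minimal, self-contained sketch of one way to follow the pseudocode exactly: wrap the next-state value in tf.stop_gradient (or compute it outside the tape) so grad(V(S', w)) never enters the critic update. The toy value network and variable names here are stand-ins, not the repo's actual actor_critic model.

    import tensorflow as tf

    value_net = tf.keras.Sequential([tf.keras.layers.Dense(1)])   # toy value head
    state = tf.random.normal((1, 4))
    state_ = tf.random.normal((1, 4))
    reward, gamma = 1.0, 0.99

    with tf.GradientTape() as tape:
        state_value = tf.squeeze(value_net(state))
        state_value_ = tf.stop_gradient(tf.squeeze(value_net(state_)))   # V(S', w) as a constant
        delta = reward + gamma * state_value_ - state_value
        critic_loss = delta ** 2

    grads = tape.gradient(critic_loss, value_net.trainable_variables)    # only grad(V(S, w))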
