
rl-adventure's Introduction

DQN Adventure: from Zero to State of the Art

This is an easy-to-follow, step-by-step Deep Q-Learning tutorial with clean, readable code.

The deep reinforcement learning community has made several independent improvements to the DQN algorithm. This tutorial presents the latest extensions to DQN in the following order:

  1. Playing Atari with Deep Reinforcement Learning [arxiv] [code]
  2. Deep Reinforcement Learning with Double Q-learning [arxiv] [code]
  3. Dueling Network Architectures for Deep Reinforcement Learning [arxiv] [code]
  4. Prioritized Experience Replay [arxiv] [code]
  5. Noisy Networks for Exploration [arxiv] [code]
  6. A Distributional Perspective on Reinforcement Learning [arxiv] [code]
  7. Rainbow: Combining Improvements in Deep Reinforcement Learning [arxiv] [code]
  8. Distributional Reinforcement Learning with Quantile Regression [arxiv] [code]
  9. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation [arxiv] [code]
  10. Neural Episodic Control [arxiv] [code]

Environments

First, I recommend using small test problems so you can run experiments quickly. Then you can move on to environments with larger observation spaces.

  • CartPole - a classic RL environment that can be solved on a single CPU
  • Atari Pong - the easiest Atari environment; it converges in only ~1 million frames, compared with other Atari games that need > 40 million
  • Other Atari games - adjust the hyperparameters: target network update frequency = 10K, replay buffer size = 1M (see the sketch below)
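A minimal sketch of the hyperparameter changes the last item refers to (the names below are illustrative, not the notebooks' exact variables):

# settings suggested above for the harder Atari games
TARGET_NETWORK_UPDATE_FREQ = 10_000    # sync the target network every 10K frames
REPLAY_BUFFER_SIZE = 1_000_000         # keep the last 1M transitions in the replay buffer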

If you get stuck…

  • Remember, you are not stuck unless you have spent more than a week on a single algorithm. It is perfectly normal not to have all the required mathematics and CS background up front. For example, you will need the fundamentals of measure theory and statistics (especially the Wasserstein metric and quantile regression), statistical inference (importance sampling), and data structures (segment tree and k-dimensional tree).
  • Carefully go through the paper. Try to identify the problem the authors are solving and understand the high-level idea of the approach, then read the code (skipping the proofs), and only afterwards go over the mathematical details and proofs.

Best RL courses

  • David Silver's course link
  • Berkeley deep RL link
  • Practical RL link

rl-adventure's People

Contributors

cehnegaitne, higgsfield, xjohn600


rl-adventure's Issues

cuda tensor instead of int in 1.dqn

Hi! Thanks for the great tutorials!
I had an issue with class DQN(nn.Module): in the act method, the line
action = q_value.max(1)[1].data[0]
seemed to return a CUDA tensor, which env.step naturally couldn't take as input.
I replaced it with
action = int(q_value.max(1)[1].data[0].cpu().int().numpy())
and it works for me.
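For reference, a rough sketch of the whole act method with this conversion in place (not the notebook's exact code; the Variable wrapping from older PyTorch is omitted, and num_actions is assumed to be stored on the model):

def act(self, state, epsilon):
    # epsilon-greedy action selection
    if random.random() > epsilon:
        state = torch.FloatTensor(state).unsqueeze(0).cuda()
        q_value = self.forward(state)
        # unwrap the argmax (a CUDA tensor) into a plain Python int for env.step
        action = int(q_value.max(1)[1].data[0].cpu().int().numpy())
        # on PyTorch >= 0.4, action = q_value.max(1)[1].item() does the same thing
    else:
        action = random.randrange(self.num_actions)
    return action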

Environment/dependencies

Hi all,

I am currently trying to run the 'quantile regression dqn' notebook, but it breaks in the training stage at line
loss = compute_td_loss(batch_size).
At some point I realised I have no idea whether it could be due to my environment, and I wasn't able to verify this from the documentation.

Could someone please report which Python/PyTorch versions were successfully tested? I'd be grateful to be able to put this nice code to work!

Best regards,

Jan

RL-Adventure/3.dueling dqn.ipynb missing forward?

def compute_td_loss(batch_size):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    state      = Variable(torch.FloatTensor(np.float32(state)))
    next_state = Variable(torch.FloatTensor(np.float32(next_state)))
    action     = Variable(torch.LongTensor(action))
    reward     = Variable(torch.FloatTensor(reward))
    done       = Variable(torch.FloatTensor(done))

    q_values      = current_model(state)
    next_q_values = target_model(next_state)

    q_value          = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    next_q_value     = next_q_values.max(1)[0]
    expected_q_value = reward + gamma * next_q_value * (1 - done)
    
    loss = (q_value - expected_q_value.detach()).pow(2).mean()
        
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    return loss

No forward() call?

Edit: sorry, I found my issue was caused by Variable, not a missing forward(). The model works without calling forward() explicitly and the result is the same, so this can be closed.
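For context, calling a module directly goes through nn.Module.__call__, which dispatches to forward(), so current_model(state) and current_model.forward(state) compute the same thing. A minimal standalone example:

import torch
import torch.nn as nn

net = nn.Linear(4, 2)            # any nn.Module behaves the same way
x = torch.randn(1, 4)
# __call__ runs hooks and then dispatches to forward(), so the outputs match
assert torch.equal(net(x), net.forward(x))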

About Distributional DQN--projection_distribution

I am confused about Distributional DQN. Why is next_dist multiplied by the support in the projection_distribution function? My model learned poorly after using it. I would appreciate it if you could give me an answer in your spare time!

development

Hello, thanks for sharing your code.
I want to implement Dueling DQN for MountainCar. Can you suggest anything?
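For what it's worth, a minimal dueling-head sketch for MountainCar-v0 (2-dimensional observations, 3 discrete actions; the layer sizes are arbitrary, not taken from the notebooks):

import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    def __init__(self, num_inputs=2, num_actions=3, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(num_inputs, hidden), nn.ReLU())
        # separate streams: state value V(s) and action advantages A(s, a)
        self.value = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_actions))

    def forward(self, x):
        x = self.feature(x)
        value = self.value(x)
        advantage = self.advantage(x)
        # subtract the mean advantage so V and A stay identifiable
        return value + advantage - advantage.mean(dim=1, keepdim=True)

Note that MountainCar's reward is sparse, so the exploration schedule usually matters at least as much as the architecture.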

batch_size for DQN

Thanks for your code, which works very well.
The original DQN (1.dqn.ipynb) sets batch_size=32, which is inefficient on a GPU, so I changed it to 256, but the performance got worse. Does anyone have experience with this?

batch_size=32: [screenshot]

batch_size=256: [screenshot]

Or should I run more experiments with batch_size=256, or try more random seeds?
Thanks again.

frame_stack default to False

Hello, is it correct that frame_stack in wrap_deepmind is never used? Can you get velocity information if you only pass one frame at a time?
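For reference, enabling it would presumably look something like this (the frame_stack argument name is taken from the question; the exact signature of the repo's wrap_deepmind and the env id may differ):

env = make_atari("PongNoFrameskip-v4")
env = wrap_deepmind(env, frame_stack=True)   # stack recent frames (typically 4) so velocity is observable
env = wrap_pytorch(env)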

The update in DQN

Hi,

I have a question about your implementation of DQN, which is supposed to sync the target Q-network with the current Q-network every C steps. I only see this update in your implementation of DDQN. Can you please tell me why it is this way?

From my point of view, your implementation of DDQN is actually DQN.
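For reference, the periodic sync I would expect looks roughly like this (a minimal sketch, assuming current_model and target_model as in the notebooks; C is a placeholder for the update interval):

def update_target(current_model, target_model):
    # copy the online network's weights into the target network
    target_model.load_state_dict(current_model.state_dict())

# inside the training loop:
if frame_idx % C == 0:
    update_target(current_model, target_model)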


Best,
Yuxuan

ModuleNotFoundError: No module named 'common'

I can't install the package:

ModuleNotFoundError Traceback (most recent call last)
in
----> 1 from common.wrappers import make_atari, wrap_deepmind, wrap_pytorch

ModuleNotFoundError: No module named 'common'
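A likely cause is running the notebook outside the cloned repository: common appears to be a directory of helper modules shipped alongside the notebooks, not a pip-installable package. A minimal workaround sketch (the path is a placeholder, not a real location):

import sys
sys.path.append("/path/to/RL-Adventure")   # placeholder: wherever the repository was cloned
from common.wrappers import make_atari, wrap_deepmind, wrap_pytorch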

Distributional Reinforcement Learning with Quantile Regression

Hi, what does the "u" mean in the following code snippet? It seems that "u" is not defined in the code. Thanks!

huber_loss = 0.5 * u.abs().clamp(min=0.0, max=k).pow(2)
huber_loss += k * (u.abs() - u.abs().clamp(min=0.0, max=k))
quantile_loss = (tau - (u < 0).float()).abs() * huber_loss
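In the QR-DQN paper, u denotes the TD error between a target quantile and a predicted quantile (u = T theta_j - theta_i), which is exactly what this snippet feeds into the quantile Huber loss. A minimal sketch of how it could be defined earlier in the loss function (the variable names here are assumptions, not the notebook's exact code):

# assumed shapes: expected_quant (target quantiles) and quant (predicted quantiles), both [batch, num_quants]
u = expected_quant - quant   # TD errors consumed by the quantile Huber loss above
# whether the pairing is element-wise or over all target/prediction pairs depends on the notebook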

Error in projection_distribution (Distributional DQN) ?

Hi,

I have a question regarding the projection_distribution method. It seems that when you project back onto the support/bins, at the lines

proj_dist.view(-1).index_add_(0, (l + offset).view(-1), (next_dist * (u.float() - b)).view(-1)) 
proj_dist.view(-1).index_add_(0, (u + offset).view(-1), (next_dist * (b - l.float()) ).view(-1))

the distribution next_dist is already scaled by the support because of the line
next_dist = target_model(next_state).data.cpu() * support
It seems this should not be the case: it results in the final projected distribution not summing to one. It seems one should instead do something like

next_dist_raw = target_model(next_state).data.cpu()
next_dist = next_dist_raw * support
next_action = next_dist.sum(2).max(1)[1]
next_action = next_action.unsqueeze(1).unsqueeze(1).expand(next_dist.size(0), 1, next_dist.size(2))
next_dist = next_dist.gather(1, next_action).squeeze(1)
next_dist_raw = next_dist_raw.gather(1, next_action).squeeze(1)
proj_dist.view(-1).index_add_(0, (l + offset).view(-1), (next_dist_raw * (u.float() - b)).view(-1))
proj_dist.view(-1).index_add_(0, (u + offset).view(-1), (next_dist_raw * (b - l.float()) ).view(-1))

This results in a distribution that contains the same amount of mass as the original one.

Thank you,
Lucas

Licensing

Hello, I plan to use your DQN code for my bachelor's thesis and will of course reference where I got the code from. Is there any further acceptable-use policy on your code?

Error - possibly due to "Variable()" ?

Hi, many thanks for sharing the code.

I have experienced an error running 1.dqn straight out of the box. The error message shown after I run the 12th code cell is below.

My computer is running PyTorch 0.4.1, and I suspect the error is due to a change in the Variable API (used in cells 8 and 10, for example). If so, has anyone updated the code for the latest PyTorch (0.4.1)?

Any ideas would be appreciated! Thanks in advance!


Error message after cell 12:


/home/USER/anaconda3/envs/RL/lib/python3.7/site-packages/ipykernel_launcher.py:2: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.


AssertionError Traceback (most recent call last)
in ()
12 action = model.act(state, epsilon)
13
---> 14 next_state, reward, done, _ = env.step(action)
15 replay_buffer.push(state, action, reward, next_state, done)
16

~/anaconda3/envs/RL/lib/python3.7/site-packages/gym/wrappers/time_limit.py in step(self, action)
29 def step(self, action):
30 assert self._episode_started_at is not None, "Cannot call env.step() before calling reset()"
---> 31 observation, reward, done, info = self.env.step(action)
32 self._elapsed_steps += 1
33

~/anaconda3/envs/RL/lib/python3.7/site-packages/gym/envs/classic_control/cartpole.py in step(self, action)
52
53 def step(self, action):
---> 54 assert self.action_space.contains(action), "%r (%s) invalid"%(action, type(action))
55 state = self.state
56 x, x_dot, theta, theta_dot = state

AssertionError: tensor(0) (<class 'torch.Tensor'>) invalid
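The assertion says env.step received a torch.Tensor instead of an int, so the fix from the "cuda tensor instead of int" issue above should apply here as well. A minimal sketch of the training-loop change (not a tested patch):

action = model.act(state, epsilon)
if isinstance(action, torch.Tensor):
    action = int(action.item())          # unwrap the tensor into a plain Python int
next_state, reward, done, _ = env.step(action)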

Error in Priority Update for Prioritized Replay

It looks like you're updating the priorities in the replay buffer using the weighted, squared TD error:

loss  = (q_value - expected_q_value.detach()).pow(2) * weights
prios = loss + 1e-5
replay_buffer.update_priorities(indices, prios.data.cpu().numpy())

However, the algorithm in the original paper updates the priority using only the absolute value of the (unweighted) TD error. I believe this is a mistake in the implementation.
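A minimal sketch of what the paper's update would look like here, keeping the notebook's variable names (an assumption on my part, not a tested patch):

td_error = q_value - expected_q_value.detach()
loss  = (td_error.pow(2) * weights).mean()   # importance-sampling weights only affect the loss
prios = td_error.abs() + 1e-5                # priorities come from the unweighted |TD error|
replay_buffer.update_priorities(indices, prios.data.cpu().numpy())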
