Hi Kai, In the Rainbow paper, the evaluation procedure is described

Is the evaluation procedure different?

kaixhin commented on August 22, 2024

Is the evaluation procedure different?

from rainbow.

Comments (8)

guydav commented on August 22, 2024

Another question which I'll tack on here -- the default value for the target-update parameter is 8000, which matches Table 1 in the Rainbow paper, which reports it as 32K frames.

Do you have a sense of why the data-efficient Rainbow paper, in Table 2 (Appendix E), reports the update period for both Rainbow and the data-efficient Rainbow as being every 2000 updates?

Kaixhin commented on August 22, 2024

Ah, honestly, using a fixed number of episodes is something I came up with (it makes sense, keeps statistics easy, and also works across other environments), and I completely overlooked that evaluation detail. My default, 10 episodes, could run for at most ~2x as long as DeepMind's procedure, but I feel like 5 episodes may be too few to get a good estimate. So I'm in favour of keeping what I've done, but maybe noting in the README that this differs from the original procedure. What do you think?

As noted in the footnote of the data-efficient paper, the target network update period is reported with respect to online network updates. The online network is only updated every 4 steps in the original, so a period of 2000 updates corresponds to 8000 steps for the target network update.
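To make the unit conversion concrete, here is a small arithmetic sketch. It uses the numbers quoted in this thread (2000 updates, one online update every 4 steps, 8000 steps, 32K frames) and assumes each agent step spans 4 emulator frames because of action repetition. The constant and function names are made up for illustration and are not taken from the repository.

```python
# Numbers quoted in this thread; names are illustrative only.
TARGET_UPDATE_IN_ONLINE_UPDATES = 2000  # as reported in the data-efficient Rainbow paper
STEPS_PER_ONLINE_UPDATE = 4             # the online network is updated every 4 agent steps
FRAMES_PER_STEP = 4                     # assumption: each action is repeated for 4 frames


def updates_to_steps(online_updates, steps_per_update):
    """Convert a period measured in online-network updates to agent steps."""
    return online_updates * steps_per_update


def steps_to_frames(steps, frames_per_step):
    """Convert agent steps to emulator frames under action repetition."""
    return steps * frames_per_step


target_update_steps = updates_to_steps(
    TARGET_UPDATE_IN_ONLINE_UPDATES, STEPS_PER_ONLINE_UPDATE
)
target_update_frames = steps_to_frames(target_update_steps, FRAMES_PER_STEP)

print(target_update_steps)   # 8000
print(target_update_frames)  # 32000
```

So the three figures (2000 updates, 8000 steps, 32K frames) describe the same period in different units.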

guydav commented on August 22, 2024

I think that makes sense. I don't know why you'd evaluate over a fixed number of frames rather than episodes. You could make a TODO to eventually implement their evaluation procedure too? It wouldn't be hard at all but might take a bit of someone's time.
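For anyone picking up that TODO, the difference between the two evaluation styles can be sketched roughly as follows. This is a minimal illustration, not the repository's code and not DeepMind's exact procedure: `ToyEnv`, `evaluate_episodes` and `evaluate_step_budget` are hypothetical names, and discarding an episode that is cut off by the budget is an assumption made only for this sketch.

```python
import random
from statistics import mean


class ToyEnv:
    """Hypothetical stand-in for an Atari environment (reward 1 per step)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.remaining = 0

    def reset(self):
        # Each episode lasts a random number of steps between 5 and 20.
        self.remaining = self.rng.randint(5, 20)
        return 0

    def step(self, action):
        self.remaining -= 1
        return 0, 1.0, self.remaining == 0  # state, reward, done


def evaluate_episodes(env, policy, num_episodes):
    """Average return over a fixed number of complete episodes."""
    returns = []
    for _ in range(num_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, reward, done = env.step(policy(state))
            total += reward
        returns.append(total)
    return mean(returns)


def evaluate_step_budget(env, policy, max_steps):
    """Average return over episodes completed within a fixed step budget.

    Assumption: an episode interrupted by the budget is discarded.
    """
    returns, steps_used = [], 0
    while steps_used < max_steps:
        state, done, total = env.reset(), False, 0.0
        while not done and steps_used < max_steps:
            state, reward, done = env.step(policy(state))
            total += reward
            steps_used += 1
        if done:
            returns.append(total)
    return mean(returns) if returns else float("nan")


always_noop = lambda state: 0
print(evaluate_episodes(ToyEnv(seed=1), always_noop, num_episodes=10))
print(evaluate_step_budget(ToyEnv(seed=1), always_noop, max_steps=200))
```

With a fixed episode count the sample size is known in advance, whereas with a fixed budget the number of completed episodes depends on how long the agent survives.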

Another realization regarding the target-update bit. These 32K frames are not actually 32K unique frames, right? If I understand correctly, we repeat every action 4 times, add the max pool of the last two frames of that action to the state (and drop the oldest frame from the state), and pass the 4-frame state buffer to the agent.

In other words, every pair of consecutive agent actions shares three of the four frames in the environment state, correct?

(Also, which paper does the max pool of the last two frames of the action repetition come from, if any? I'm just trying to trace all of these implementation details.)
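The preprocessing described in this comment can be sketched as below. It is a toy illustration under stated assumptions, not the repository's wrapper: `CountingEmulator` and `StackedEnv` are hypothetical names, and a "frame" here is a single integer so that the overlap between consecutive states is easy to see. For real image frames the pooling would be an element-wise maximum over pixels.

```python
from collections import deque


class CountingEmulator:
    """Hypothetical emulator whose 'frame' is just an increasing integer."""

    def __init__(self):
        self.t = 0

    def next_frame(self, action):
        self.t += 1
        return self.t


class StackedEnv:
    """Action repeat, max pool over the last two frames, 4-frame state buffer."""

    def __init__(self, emulator, repeat=4, history=4):
        self.emulator = emulator
        self.repeat = repeat
        # State buffer starts out filled with blank (zero) frames.
        self.stack = deque([0] * history, maxlen=history)

    def step(self, action):
        last_two = deque(maxlen=2)
        for _ in range(self.repeat):
            last_two.append(self.emulator.next_frame(action))
        # Scalar max here; element-wise max for real image frames.
        pooled = max(last_two)
        # Appending drops the oldest entry because of maxlen.
        self.stack.append(pooled)
        return tuple(self.stack)


env = StackedEnv(CountingEmulator())
states = [env.step(action=0) for _ in range(5)]
print(states[3])  # (4, 8, 12, 16)
print(states[4])  # (8, 12, 16, 20)
```

Consecutive states share three of their four entries, while the emulator has advanced by four frames per agent step.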

Kaixhin commented on August 22, 2024

OK, I've added a TODO with 2188966.

You are correct. I can't remember where this was first reported in a paper, but I've spent years trying to replicate DeepMind's results, and I did a lot of digging into their Atari wrapper repositories and the DQN source code released with the Nature paper to get these implementation details.

guydav commented on August 22, 2024

Thanks for the clarification. There appears to be so much voodoo around the implementation details that it is quite hard to know when you can trust your results.

It's interesting that the original Rainbow paper frames it as updating every 32K frames, which, while strictly true, is far fewer than that in actual game frames given the overlap.

guydav commented on August 22, 2024

@Kaixhin -- it seems that most papers also evaluate on either no-op starts or human starts. Did you ever take a stab at implementing either?

Kaixhin commented on August 22, 2024

No-op starts are still used during evaluation. I haven't tried human starts, and I have no idea where you would get them from (presumably they are internal to DeepMind).

guydav commented on August 22, 2024

Ah, I see, in env.reset(). That makes sense.
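As a rough illustration of what a no-op start inside a reset method can look like, here is a minimal sketch. It is not the repository's code: `ToyEmulator`, `NoopStartEnv`, the no-op action index of 0 and the cap of 30 no-ops are all assumptions chosen for the example.

```python
import random

NOOP_ACTION = 0  # assumption: action index 0 does nothing


class ToyEmulator:
    """Hypothetical emulator that only counts how many frames were played."""

    def __init__(self):
        self.frames = 0

    def reset(self):
        self.frames = 0

    def act(self, action):
        self.frames += 1


class NoopStartEnv:
    """Plays a random number of no-op actions at the start of each episode."""

    def __init__(self, emulator, max_noops=30, seed=None):
        self.emulator = emulator
        self.max_noops = max_noops  # assumed cap, not taken from the thread
        self.rng = random.Random(seed)

    def reset(self):
        self.emulator.reset()
        # Draw between 0 and max_noops (inclusive) no-op actions.
        noops = self.rng.randrange(self.max_noops + 1)
        for _ in range(noops):
            self.emulator.act(NOOP_ACTION)
        return noops


env = NoopStartEnv(ToyEmulator(), max_noops=30, seed=0)
print(env.reset())  # number of no-ops played before the agent takes over
```

Because the randomisation lives in the reset method, training and evaluation episodes both begin from slightly different starting states without any extra evaluation code.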
