
npfl122's People

Contributors

dburian, dependabot[bot], dunzel, fassty, filipbartek, foxik, georgeus19, gldkslfmsd, husarma, iamwave, jiribenes, karryanna, kszabova, mafi412, matezzzz, okarlicek, patrikvalkovic, pecimuth, peskaf, peter-grajcar, petrroll, pixelneo, qwedux, rattko, simonmandlik, uhlajs, vastlik, vvolhejn


npfl122's Issues

Lecture 1 / slide 14: Fixed learning rate vs alpha_n

The slide is titled "Fixed learning rate", but alpha_n appears in the convergence condition. I suppose we could keep decreasing the learning rate and still call it fixed, but maybe this could be explained better in the slides?
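For context, the convergence condition in question is presumably the Robbins-Monro one: sum of alpha_n diverges while sum of alpha_n^2 stays finite. A minimal sketch (my own illustration, not from the slides) comparing a truly fixed rate with a decaying one:

```python
def partial_sums(alphas):
    """Return (sum of alpha_n, sum of alpha_n^2) over the given sequence."""
    return sum(alphas), sum(a * a for a in alphas)

N = 100_000
fixed = [0.1] * N                             # fixed learning rate
decaying = [1 / n for n in range(1, N + 1)]   # alpha_n = 1/n

s_fixed, sq_fixed = partial_sums(fixed)
s_dec, sq_dec = partial_sums(decaying)

# A fixed rate makes both sums grow linearly, so sum alpha_n^2 < inf fails.
# alpha_n = 1/n diverges in the first sum but its squares stay bounded
# (below pi^2 / 6, roughly 1.645), satisfying both Robbins-Monro conditions.
print(sq_fixed)  # grows linearly with N
print(sq_dec)    # stays bounded
```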

optimal policy distribution

(slide screenshot omitted)

Is pi*(s) a probability distribution, or does it return just the best action? Slide 5 says that a policy pi computes a probability distribution...
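For what it's worth, the two views can be reconciled: a deterministic optimal policy is still a distribution, just a degenerate one that puts all mass on the argmax action. A minimal sketch (the names are mine, not from the slides):

```python
def greedy_policy(q_values):
    """Express pi*(.|s) as a degenerate distribution over actions:
    probability 1 on the argmax of the action values, 0 elsewhere."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]
```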

PopArt - is 'n' really unnormalized?

The slides say that the value estimate v is normalized with respect to an unnormalized value predictor n. Isn't it actually the other way round?

The paper says: In order to normalise both baseline and policy gradient updates, we first parameterise the value estimate v as the linear transformation of a suitably normalised value prediction n.
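As I read the paper, v is the unnormalized output obtained by linearly transforming the normalized prediction n. A minimal sketch of that parameterisation (the variable names sigma and mu for the scale and shift are my assumption):

```python
def popart_value(n, sigma, mu):
    """Unnormalized value estimate v as a linear transformation
    of the suitably normalized value prediction n: v = sigma * n + mu."""
    return sigma * n + mu

def normalize_target(g, sigma, mu):
    """Map an observed return G into the normalized space
    that the predictor n is trained in."""
    return (g - mu) / sigma
```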

Possible typo

In Lecture 4, slides 18 and 19, shouldn't we initialize action-value-function weights instead of just value-function weights?

Parameter to change for UCB in MultiArmedBandits seems wrong

ucb: perform UCB search with confidence level c and computing the value function using averaging. Plot results for c=1 and ε∈{1/128,1/64,1/32,1/16,1/8}

But according to the slides, the ε parameter is not used in UCB at all, since the ε-greedy strategy is replaced by an argmax over the UCB scores. Perhaps we should be varying c instead?
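For reference, the UCB rule as I understand it from the slides picks a_t = argmax_a [Q(a) + c * sqrt(ln t / N(a))], with no ε anywhere. A minimal sketch (my own, not the assignment template):

```python
import math

def ucb_action(q, n, t, c):
    """Pick the action maximizing Q(a) + c * sqrt(ln t / N(a)).
    Unvisited actions (N(a) == 0) get an infinite bonus, so they
    are all tried before any confidence bound matters."""
    def score(a):
        if n[a] == 0:
            return math.inf
        return q[a] + c * math.sqrt(math.log(t) / n[a])
    return max(range(len(q)), key=score)
```

Note the only free parameter here is c, which is why plotting over ε makes no difference for this method.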

Off-by-one error in TD error definitions

While working on the trace_algorithms assignment I ran into the following confusion:

(screenshot from 2022-02-07 omitted)
source of the definition

I think that instead of R_{t+1} there should be R_i.

But when I substitute the given formula into the sums from the overview, the rewards that are taken into account come out shifted by one.

On Piazza you mention the following:

If an episode ends, only the value function of the V(next_state) “after the episode” is not used; all other value functions in TD errors are computed. So all TD errors are unchanged, except for the last one, which is just R_T - V(S_T)

source of the quote. Which does not match the slides.

I would expect R_t and V_t to be multiplied by the same power of gamma.

EDIT: I am happy to prepare a PR, I just wanted to verify first that it really is wrong.
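To make the indexing concrete, here is the n-step return as I read it from the slides; note this encodes one particular reading of the indices, which is exactly what is being disputed above:

```python
def n_step_return(rewards, gamma, v_next):
    """n-step return G_{t:t+n} = sum_{i=1}^{n} gamma^(i-1) * R_{t+i}
                                 + gamma^n * V(S_{t+n}).
    `rewards` holds R_{t+1}, ..., R_{t+n}; `v_next` is V(S_{t+n}),
    taken as 0 when the episode has already ended."""
    g = v_next
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_error(rewards, gamma, v_state, v_next):
    """n-step TD error delta_t = G_{t:t+n} - V(S_t)."""
    return n_step_return(rewards, gamma, v_next) - v_state
```

With n = 1 and an episode ending at T, this reduces to R_T - V(S_T), matching the Piazza quote.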

[06/slides] Wrong/misleading pseudocode for REINFORCE

When studying the materials in slides for my diploma thesis I ran into a possible error or misleading formulation in the pseudocode for REINFORCE on slide https://ufal.mff.cuni.cz/~straka/courses/npfl122/2223/slides/?06#22

(slide screenshot omitted)

There is, in my opinion, an error on the last line: the $\gamma^t$ is missing. The text below the image even says "removing $\gamma^t$ from the update of $\theta$". However, I'd argue that the update rule is then only valid in the non-discounted case, where $\gamma=1$. Let me explain.

Consider the definition of the on-policy distribution for infinite-horizon trajectories (not given in Sutton's book, which defines it only for finite-horizon non-discounted tasks).

$$ \mu(s) = \frac{\eta(s)}{\sum_{s^\prime} \eta(s^\prime)}, $$

where

$$ \eta(s) = h(s) + \sum_{s^\prime} \eta(s^\prime) \sum_{a} \gamma \pi(a|s^\prime)p(s|s^\prime,a). $$

When I expand the recursion I get

$$ \eta(s) = h(s) + \gamma \sum_{s^\prime, a} h(s^\prime) \pi(a|s^\prime) p(s|s^\prime,a) + \gamma^2 \sum_{s^\prime, a} \pi(a|s^\prime) p(s|s^\prime,a) \sum_{s^{\prime\prime}, a^\prime} h(s^{\prime\prime}) \pi(a^\prime|s^{\prime\prime}) p(s^\prime|s^{\prime\prime},a^\prime) + \cdots $$

so I get the term $\gamma^t P(s_0 \rightarrow s_t \text{ in t steps})$ that is then used in the policy gradient theorem proof.

Now, when I calculate the expectation under the on-policy distribution $\mu$ to get the policy gradient for REINFORCE, the $\gamma^t$ is not present, as it is already included in the probability $\mu(s)$. However, when I estimate the expectation from returns computed over a sampled trajectory, I believe I need to include the $\gamma^t$ term again; otherwise the estimate will be biased. Or is my reasoning wrong here?

To be clearer, what I mean is that it is true that:

$$ \nabla_{\theta} J(\theta) \propto E_{s\sim\mu} E_{a\sim\pi} \left[ G_t \nabla_{\theta} \log \pi(a|s;\theta) \right] $$

but the corresponding update rule when estimating the expectations should be

$$ \theta = \theta + \alpha \gamma^t G_t \nabla_{\theta} \log \pi(a_t|s_t;\theta) $$
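Numerically, the update rule I am proposing would look like this (my own sketch; the scalar `log_pi_grads` are stand-ins for the real gradients $\nabla_\theta \log \pi(a_t|s_t;\theta)$):

```python
def reinforce_updates(returns, log_pi_grads, alpha, gamma):
    """Per-step REINFORCE increments alpha * gamma^t * G_t * grad_t,
    i.e. the update rule with the gamma^t factor kept.
    `returns[t]` is G_t; `log_pi_grads[t]` stands in for the
    log-policy gradient at step t."""
    return [alpha * (gamma ** t) * g * grad
            for t, (g, grad) in enumerate(zip(returns, log_pi_grads))]
```

With gamma = 1 the gamma^t factor vanishes, which is why the slide's rule and this one agree in the non-discounted case.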

Using f-string instead of `str.format()`

Hello, I would like to propose an enhancement to the current set of templates. I believe it would be worth the effort to start using f-strings, a feature available since Python 3.6. It would improve the readability of the code and make it shorter and more concise.

I would be willing to create a pull request for all of the already published assignments which would implement this transition to f-strings.
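To illustrate the proposed change (a toy example, not taken from the templates):

```python
episode, reward = 17, 123.456

# Current style in the templates (str.format):
old = "Episode {}: mean reward {:.2f}".format(episode, reward)

# Proposed f-string equivalent (Python >= 3.6):
new = f"Episode {episode}: mean reward {reward:.2f}"

print(old == new)  # the two produce identical output
```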

S^+ not explained in code

In the code for value iteration (lecture 2, slide 11) and Sarsa (lecture 3, slide 13), S^+ is used but not explained. Does it stand for reachable states only? (And could that be written on the slides? :))
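If the notation follows Sutton and Barto's convention, S^+ usually denotes the set of all states including terminal ones, while the sweep itself only needs to update the nonterminal states (terminal states keep value 0). A minimal value-iteration sketch under that reading (the MDP encoding is my own, not the slides'):

```python
def value_iteration(transitions, terminal, gamma=0.9, sweeps=100):
    """transitions[s][a] = (reward, next_state); `terminal` marks S^+ \\ S."""
    # V is defined on S^+ (all states, terminal included) ...
    v = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        for s in transitions:
            if s in terminal:
                continue  # ... but only nonterminal states are updated.
            v[s] = max(r + gamma * v[s2] for r, s2 in transitions[s].values())
    return v
```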
