
npfl122's People

Contributors

dburian, dependabot[bot], dunzel, fassty, filipbartek, foxik, georgeus19, gldkslfmsd, husarma, iamwave, jiribenes, karryanna, kszabova, mafi412, matezzzz, okarlicek, patrikvalkovic, pecimuth, peskaf, peter-grajcar, petrroll, pixelneo, qwedux, rattko, simonmandlik, uhlajs, vastlik, vvolhejn


npfl122's Issues

Lecture 1 / slide 14: Fixed learning rate vs alpha_n

The slide is titled "Fixed learning rate", but alpha_n appears in the convergence condition. I suppose we could keep decreasing the learning rate and still call it fixed, but maybe this could be explained better in the slides?
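For context, the convergence condition in question is presumably the Robbins-Monro one: sum of alpha_n diverges while sum of alpha_n^2 stays finite. A minimal sketch (my own illustration, not from the slides) comparing a truly fixed rate with a decaying one:

```python
def partial_sums(alphas):
    """Return (sum of alpha_n, sum of alpha_n^2) over the given sequence."""
    return sum(alphas), sum(a * a for a in alphas)

N = 100_000
fixed = [0.1] * N                             # fixed learning rate
decaying = [1 / n for n in range(1, N + 1)]   # alpha_n = 1/n

s_fixed, sq_fixed = partial_sums(fixed)
s_dec, sq_dec = partial_sums(decaying)

# A fixed rate makes both sums grow linearly, so sum alpha_n^2 < inf fails.
# alpha_n = 1/n diverges in the first sum but its squares stay bounded
# (below pi^2 / 6, roughly 1.645), satisfying both Robbins-Monro conditions.
print(sq_fixed)  # grows linearly with N
print(sq_dec)    # stays bounded
```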

optimal policy distribution

(slide screenshot omitted)

Is pi*(s) a probability distribution, or does it return just the best action? Slide 5 says that a policy pi computes a probability distribution...
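For what it's worth, the two views can be reconciled: a deterministic optimal policy is still a distribution, just a degenerate one that puts all mass on the argmax action. A minimal sketch (the names are mine, not from the slides):

```python
def greedy_policy(q_values):
    """Express pi*(.|s) as a degenerate distribution over actions:
    probability 1 on the argmax of the action values, 0 elsewhere."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]
```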

PopArt - is 'n' really unnormalized?

The slides say that the value estimate v is normalized with respect to an unnormalized value predictor n. Isn't it actually the other way round?

The paper says: In order to normalise both baseline and policy gradient updates, we first parameterise the value estimate v as the linear transformation of a suitably normalised value prediction n.
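As I read the paper, v is the unnormalized output obtained by linearly transforming the normalized prediction n. A minimal sketch of that parameterisation (the variable names sigma and mu for the scale and shift are my assumption):

```python
def popart_value(n, sigma, mu):
    """Unnormalized value estimate v as a linear transformation
    of the suitably normalized value prediction n: v = sigma * n + mu."""
    return sigma * n + mu

def normalize_target(g, sigma, mu):
    """Map an observed return G into the normalized space
    that the predictor n is trained in."""
    return (g - mu) / sigma
```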

Possible typo

In Lecture 4, slides 18 and 19, shouldn't we initialize action-value-function weights instead of just value-function weights?

Parameter to change for UCB in MultiArmedBandits seems wrong

ucb: perform UCB search with confidence level c and computing the value function using averaging. Plot results for c=1 and ε∈{1/128,1/64,1/32,1/16,1/8}

But according to the slides, the ε parameter is not used in UCB at all, since the ε-greedy strategy is replaced by an argmax over the UCB scores. Perhaps we should be varying c instead?
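For reference, the UCB rule as I understand it from the slides picks a_t = argmax_a [Q(a) + c * sqrt(ln t / N(a))], with no ε anywhere. A minimal sketch (my own, not the assignment template):

```python
import math

def ucb_action(q, n, t, c):
    """Pick the action maximizing Q(a) + c * sqrt(ln t / N(a)).
    Unvisited actions (N(a) == 0) get an infinite bonus, so they
    are all tried before any confidence bound matters."""
    def score(a):
        if n[a] == 0:
            return math.inf
        return q[a] + c * math.sqrt(math.log(t) / n[a])
    return max(range(len(q)), key=score)
```

Note the only free parameter here is c, which is why plotting over ε makes no difference for this method.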

Off-by-one error in TD error definitions

While working on the trace_algorithms assignment I ran into the following confusion:

(screenshot from 2022-02-07 omitted)
source of the definition

I think that instead of R_{t+1} there should be R_i.

But when I substitute the given formula into the sums from the overview, the rewards that are taken into account come out shifted by one.

On Piazza you mention the following:

If an episode ends, only the value function of the V(next_state) “after the episode” is not used; all other value functions in TD errors are computed. So all TD errors are unchanged, except for the last one, which is just R_T - V(S_T)

source of the quote. Which does not match the slides.

I would expect R_t and V_t to be multiplied by the same power of gamma.

EDIT: I am happy to prepare a PR, I just wanted to verify first that it really is wrong.
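To make the indexing concrete, here is the n-step return as I read it from the slides; note this encodes one particular reading of the indices, which is exactly what is being disputed above:

```python
def n_step_return(rewards, gamma, v_next):
    """n-step return G_{t:t+n} = sum_{i=1}^{n} gamma^(i-1) * R_{t+i}
                                 + gamma^n * V(S_{t+n}).
    `rewards` holds R_{t+1}, ..., R_{t+n}; `v_next` is V(S_{t+n}),
    taken as 0 when the episode has already ended."""
    g = v_next
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_error(rewards, gamma, v_state, v_next):
    """n-step TD error delta_t = G_{t:t+n} - V(S_t)."""
    return n_step_return(rewards, gamma, v_next) - v_state
```

With n = 1 and an episode ending at T, this reduces to R_T - V(S_T), matching the Piazza quote.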

[06/slides] Wrong/misleading pseudocode for REINFORCE

When studying the materials in slides for my diploma thesis I ran into a possible error or misleading formulation in the pseudocode for REINFORCE on slide https://ufal.mff.cuni.cz/~straka/courses/npfl122/2223/slides/?06#22

(slide screenshot omitted)

There is, in my opinion, an error on the last line: the $\gamma^t$ is missing. The text below the image even says "removing $\gamma^t$ from the update of $\theta$". However, I'd argue that the update rule is then only valid in the non-discounted case, where $\gamma=1$. Let me explain.

Consider the definition of the on-policy distribution for infinite-horizon trajectories (not given in Sutton's book, which defines it only for finite-horizon non-discounted tasks).

$$ \mu(s) = \frac{\eta(s)}{\sum_{s^\prime} \eta(s^\prime)}, $$

where

$$ \eta(s) = h(s) + \sum_{s^\prime} \eta(s^\prime) \sum_{a} \gamma \pi(a|s^\prime)p(s|s^\prime,a). $$

When I expand the recursion I get

$$ \eta(s) = h(s) + \gamma \sum_{s^\prime, a} h(s^\prime) \pi(a|s^\prime) p(s|s^\prime,a) + \gamma^2 \sum_{s^\prime, a} \pi(a|s^\prime) p(s|s^\prime,a) \sum_{s^{\prime\prime}, a^\prime} h(s^{\prime\prime}) \pi(a^\prime|s^{\prime\prime}) p(s^\prime|s^{\prime\prime},a^\prime) + \cdots $$

so I get the term $\gamma^t P(s_0 \rightarrow s_t \text{ in t steps})$ that is then used in the policy gradient theorem proof.

Now, when I calculate the expectation under the on-policy distribution $\mu$ to get the policy gradient for REINFORCE, the $\gamma^t$ is not present, as it is already included in the probability $\mu(s)$. However, when I estimate the expectation from returns computed over a sampled trajectory, I believe I need to include the $\gamma^t$ term again; otherwise the estimate will be biased. Or is my reasoning wrong here?

To be clearer, what I mean is that it is true that:

$$ \nabla_{\theta} J(\theta) \propto E_{s\sim\mu} E_{a\sim\pi} \left[ G_t \nabla_{\theta} \log \pi(a|s;\theta) \right] $$

but the corresponding update rule when estimating the expectations should be

$$ \theta = \theta + \alpha \gamma^t G_t \nabla_{\theta} \log \pi(a_t|s_t;\theta) $$
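Numerically, the update rule I am proposing would look like this (my own sketch; the scalar `log_pi_grads` are stand-ins for the real gradients $\nabla_\theta \log \pi(a_t|s_t;\theta)$):

```python
def reinforce_updates(returns, log_pi_grads, alpha, gamma):
    """Per-step REINFORCE increments alpha * gamma^t * G_t * grad_t,
    i.e. the update rule with the gamma^t factor kept.
    `returns[t]` is G_t; `log_pi_grads[t]` stands in for the
    log-policy gradient at step t."""
    return [alpha * (gamma ** t) * g * grad
            for t, (g, grad) in enumerate(zip(returns, log_pi_grads))]
```

With gamma = 1 the gamma^t factor vanishes, which is why the slide's rule and this one agree in the non-discounted case.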

Using f-string instead of `str.format()`

Hello, I would like to propose an enhancement to the current set of templates. I believe it would be worth the effort to start using f-strings, a feature available since Python 3.6. It would improve the readability of the code and make it shorter and more concise.

I would be willing to create a pull request for all of the already published assignments which would implement this transition to f-strings.
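To illustrate the proposed change (a toy example, not taken from the templates):

```python
episode, reward = 17, 123.456

# Current style in the templates (str.format):
old = "Episode {}: mean reward {:.2f}".format(episode, reward)

# Proposed f-string equivalent (Python >= 3.6):
new = f"Episode {episode}: mean reward {reward:.2f}"

print(old == new)  # the two produce identical output
```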

S^+ not explained in code

In the code for value iteration (lecture 2, slide 11) and Sarsa (lecture 3, slide 13), S^+ is used but not explained. Does it stand for reachable states only? (And could that be written on the slides? :))
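If the notation follows Sutton and Barto's convention, S^+ usually denotes the set of all states including terminal ones, while the sweep itself only needs to update the nonterminal states (terminal states keep value 0). A minimal value-iteration sketch under that reading (the MDP encoding is my own, not the slides'):

```python
def value_iteration(transitions, terminal, gamma=0.9, sweeps=100):
    """transitions[s][a] = (reward, next_state); `terminal` marks S^+ \\ S."""
    # V is defined on S^+ (all states, terminal included) ...
    v = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        for s in transitions:
            if s in terminal:
                continue  # ... but only nonterminal states are updated.
            v[s] = max(r + gamma * v[s2] for r, s2 in transitions[s].values())
    return v
```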
