jlm429 / bettermdptools
2nd paragraph says (emphasis added):
Here, blackjack.convert_state_obs changes the 3-tuple into a discrete space with 290 states by concatenating player states 0-28 (hard 4-21 & soft 12-21) with dealer states 0-9 (2-9, ten, ace).
This appears to be a typo: the total is 280 states, not 290, because player states range from 0 to 27 (not 0 to 28): 18 states for hard 4-21 (indices 0-17) plus 10 states for soft 12-21 (indices 18-27), giving 28 player states × 10 dealer states = 280.
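The arithmetic above can be checked with a short sketch. Note that `player_index` and `state_index` are illustrative names for the counting described here, not the library's actual `convert_state_obs` implementation:

```python
# Sketch of the indexing arithmetic described above; these helper
# names are illustrative, not bettermdptools' convert_state_obs.
def player_index(total, usable_ace):
    if usable_ace:                # soft 12-21 -> indices 18-27
        return 18 + (total - 12)
    return total - 4              # hard 4-21 -> indices 0-17

def state_index(player_idx, dealer_idx):
    # dealer showing 2-9, ten, ace -> indices 0-9
    return player_idx * 10 + dealer_idx

# highest state: soft 21 vs. ace -> 27*10 + 9 = 279, so 280 states total
print(state_index(player_index(21, True), 9) + 1)  # 280
```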
bettermdptools/algorithms/planner.py
Line 127 in 12163cf
The "value_iteration()" works perfectly.
However, "policy_iteration()" appears to get stuck on line 135 of planner.py, in "policy_evaluation()", because the break statement is never reached:
if np.max(np.abs(prev_V - V)) < theta:
break
This appears to happen because the Taxi-v3 environment (link) gives a -1 reward for each step taken, which prevents the convergence condition on line 135 from being satisfied.
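A toy sweep shows how per-step -1 rewards can keep the condition from triggering. This is a minimal sketch, not the planner's code: a single non-terminal state with a -1 self-loop and gamma=1, where each Bellman backup subtracts 1, so max|prev_V - V| stays at 1.0 forever and any small theta is never satisfied:

```python
import numpy as np

# One non-terminal state, reward -1 per step, gamma = 1, under a policy
# that never terminates. Each sweep lowers V by exactly 1, so the
# max-abs-difference convergence test never fires.
gamma = 1.0
V = np.zeros(1)
for sweep in range(5):
    prev_V = V.copy()
    V = -1.0 + gamma * prev_V            # Bellman backup for the -1 self-loop
    print(sweep, np.max(np.abs(prev_V - V)))  # always 1.0
```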
Code to reproduce:
import gymnasium as gym
from bettermdptools.algorithms.planner import Planner
large_mdp = gym.make('Taxi-v3', render_mode=None)
observation, info = large_mdp.reset(seed=555) # passenger at green, destination at yellow
V, V_track, pi = Planner(large_mdp.unwrapped.P).policy_iteration()
To bypass the infinite loop and reach a result identical to value iteration, I increased theta and n_iters. This works, but it seems against the spirit of policy iteration:
import gymnasium as gym
from bettermdptools.algorithms.planner import Planner
large_mdp = gym.make('Taxi-v3', render_mode=None)
observation, info = large_mdp.reset(seed=555) # passenger at green, destination at yellow
V, V_track, pi = Planner(large_mdp.unwrapped.P).policy_iteration(n_iters=10000, theta=10000)
Reading up on policy evaluation, it seems that convergence requires the value function (within policy_evaluation()) to be monotonically increasing, and the Taxi-v3 environment doesn't satisfy that requirement because of its -1 step rewards.
I wonder if a different convergence criterion could be used in "policy_evaluation()", perhaps np.isclose with a relative difference?
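One possible shape for that criterion, as a sketch only (the function name and tolerances here are illustrative, not the library's API): use np.allclose, whose relative term scales with the magnitude of V, so large negative values accumulated from -1 step rewards can still pass the test:

```python
import numpy as np

# Sketch of a relative-tolerance convergence test for policy_evaluation().
# np.allclose checks |prev_V - V| <= atol + rtol * |V| elementwise, so the
# tolerance grows with the magnitude of the values.
def converged(prev_V, V, rtol=1e-8, atol=1e-10):
    return np.allclose(prev_V, V, rtol=rtol, atol=atol)

prev_V = np.array([-1000.0, -500.0])
V = prev_V * (1 + 1e-9)          # tiny *relative* change
print(converged(prev_V, V))      # True, though the absolute diff is ~1e-6
```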
Pygame errors out on linux ("libGL error: MESA-LOADER: failed to open iris").
See: https://stackoverflow.com/questions/72110384/libgl-error-mesa-loader-failed-to-open-iris
Solution 2 from that answer works after editing its first line to:
cd /home/$USER/anaconda3/envs/$ENV/lib
where:
$USER is your username
$ENV is the name of your conda env.
See the deprecation warnings. Run Q-learning on the wrapped env in the plot examples instead of on the .env attribute, which is the environment underneath the first wrapper.
... but the current implementation starts every evaluation with zero utility values.
A trivial code update sped things up by 1.5x in several experiments.
# replace 8x8 with sqrt of data length
https://github.com/jlm429/bettermdptools/blob/master/examples/plots.py#L39
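The suggested change could look something like the sketch below (a hypothetical helper, not the code at the link above): derive the heat-map grid shape from the length of V instead of hard-coding 8x8, assuming a square map:

```python
import numpy as np

# Sketch: infer the plot grid shape from the state-value array length
# rather than hard-coding (8, 8). Assumes the map is square.
def grid_shape(V):
    side = int(np.sqrt(len(V)))
    assert side * side == len(V), "state space is not a square grid"
    return (side, side)

print(grid_shape(np.zeros(64)))   # (8, 8) for FrozenLake8x8
print(grid_shape(np.zeros(16)))   # (4, 4) for the 4x4 map
```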
When I try to instantiate the Blackjack class, I receive the following error:
AttributeError: 'Blackjack' object has no attribute '_P'
Which appears to stem from the missing file blackjack-envP:
Pickle load failed. Check path C:\Users\diuan.conda\envs\py3.8\lib\site-packages\examples\blackjack-envP
The suspicion is that the pip install doesn't ship this file alongside the others in the examples folder, since this file in particular does not have a .py extension.
Make sure the site-packages directory has the correct name. Run the following command to locate the site-packages folder.
python -c "import site; print(site.getsitepackages())"
Also see:
https://stackoverflow.com/questions/43485569/installed-a-package-with-anaconda-cant-import-in-python
Using the example tutorial code from the README:
import gymnasium as gym
from bettermdptools.algorithms.planner import Planner
from bettermdptools.utils.plots import Plots
# make gym environment
frozen_lake = gym.make('FrozenLake8x8-v1', render_mode=None)
# run VI
V, V_track, pi = Planner(frozen_lake.P).value_iteration()
#plot state values
size=(8,8)
Plots.values_heat_map(V, "Frozen Lake\nValue Iteration State Values", size)
I get the following warning:
UserWarning: WARN: env.P to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.P` for environment variables or `env.get_wrapper_attr('P')` that will search the reminding wrappers.
I know it's just a warning, but the README example should probably be updated to use the modern wrapper access pattern.
Add the imports to the README:
from examples.plots import Plots
from examples.grid_search import GridSearch
epsilon_decay = [.4, .7, .9]
iters = [500, 5000, 50000]
GridSearch.Q_learning_grid_search(frozen_lake.env, epsilon_decay, iters)
#plot state values
frozen_lake = gym.make('FrozenLake8x8-v1', render_mode=None)
V, V_track, pi = Planner(frozen_lake.env.P).value_iteration()
Plots.grid_values_heat_map(V, "State Values")