sawcordwell / pymdptoolbox Goto Github PK
View Code? Open in Web Editor NEWMarkov Decision Process (MDP) Toolbox for Python
License: BSD 3-Clause "New" or "Revised" License
Markov Decision Process (MDP) Toolbox for Python
License: BSD 3-Clause "New" or "Revised" License
Using the forest example from the quickstart guide and modifying the max_iter does not seem to work.
It looks like it always defaults to 39 and stops.
import mdptoolbox.example
P, R = mdptoolbox.example.forest()
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9, max_iter=100)
vi.run()
vi.policy
print vi.V
print vi.max_iter
(5.051970000000001, 8.291970000000001, 12.291970000000001)
39
I am experimenting with number of iterations to see how the algorithm is performing and it would be very beneficial to collect values at each iteration
Update:
Setting MDP to verbose shows that only 4 iterations were executed. So what does 39 means here?
Iteration V-variation
1 4.0
2 2.43
3 0.81
4 0.0
Iterating stopped, epsilon-optimal policy found.
(5.051970000000001, 8.291970000000001, 12.291970000000001)
39
I tried to use this library for the policy iteration. I used this code:
P = np.array([[[0.25, 0.25, 0.5],
[0.75, 0, 0.25],
[0.5, 0.5, 0]],
[[0, 0.25, 0.75],
[0.25, 0, 0.75],
[0.25, 0.25, 0.5]]]) # A * S * S
R = np.array([[0.55, 0.75], [1, 0.8], [1.2, 1]]) # S * A
pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.000000001, [0, 0, 0])
pi.run()
print(pi.policy)
I want to solve this problem actually without discounting, and if this is not possible with a very small discount but whether I use 1, 0.000000001 or 0.999999 I get the same result which is wrong since we solved this exercise in my class and the results there are correct. What am I doing wrong?
The policies that the LP class gives are incorrect in (every?) case. This may be because the class used to call cvxopt with the glpk option but now it uses the native cvxopt algorithms, so it may need different options now?
Dose it cannot be trained? After reading the document, i have no opinion about how to train this model.
Somebody can tell me how to train it here?
These are useful for checking policies in cases other than policy iteration algorithms, so they should be exported under mdptoolbox top level module.
I'm fairly unfamiliar with how pip works, however I guess the newer versions only allow installing of certain release number formats. so the default "pip install <>" isn't working.
This SO question/answer talks about the format/convention that is necessary for the newer pip installs.
Pre-release Versions
Starting with v1.4, pip will only install stable versions as specified by PEP426 by default. If a version cannot be parsed as a compliant PEP426 version then it is assumed to be a pre-release.
If a Requirement specifier includes a pre-release or development version (e.g. >=0.0.dev0) then pip will allow pre-release and development versions for that requirement. This does not include the != flag.
The pip install command also supports a --pre flag that will enable installing pre-releases and development releases.
So I finally tried
pip install --pre pymdptoolbox
and it worked
It seems that all the algorithms require that you pass a transition probability table and reward vector, however most of the usefullness of algorithms such as QLearning relies on the fact that it doesn't need these values to estimate policies.
Is this by design? A good update to the library would be to enable model-free learning, because most of the time you don't know the model, you have to simulate it. This would make it much more useful to more people.
The _boundIter
method causes the algorithm to stop before it has reached the epsilon-optimal value function. Steps to reproduce:
import mdptoolbox.example
P, R = mdptoolbox.example.forest()
vigs = mdptoolbox.mdp.ValueIterationGS(P, R, 0.96)
vigs.setVerbose()
vigs.run()
What is expected: _boundIter
should compute something that the algorithm can finish before, or it should not be used.
I'm using the backward induction (Finite Horizon).
When I put this fh = mdptoolbox.mdp.FiniteHorizon(P, R, d, T, skip_check=True)
in my code, I get the following error:
TypeError: init() got an unexpected keyword argument 'skip_check'
Is it a bug or am I not skipping the check the right way?
Thanks
Many of the doctests are failing when they get run on each module. The package will probabliy have to be restructured somewhat because I can't get Python 3 to run the doctests as in at the moment.
In many relevant MDPs, some actions are unavailable in some states. An example is the "Recycling Robot" (Example 3.3 in Sutton & Barto). In short, a robot cannot recharge its battery when the battery is fully charged.
In pymdptoolbox, however, it's necessary to define a probability of state transition for all actions and all states. If an action is not available in a certain state, these probabilities have no meaning. Setting them to 0 is inadequate and causes errors, because the state transition matrix needs to have row sums of 1 for each action.
Another possibility would be to assign a probability of 1 and reward of 0 (or less) to "remaining in the same state", but this seems a bit artificial, since the action isn't actually available.
Do you have any ideas how to circumvent this problem in a more elegant way?
I need some advice for with the recycling robot mdp problem, even though the robot can't charge at high level I add it so the transition matrix be stochastic and also for the search reward array do I put both states rewards as 5? because we can get the -3 while searching in low.
In my case the reward for search is 5 and 2 for wait, alpha is 0.8 and beta is 0.4.
`
#Recycling Robot MDP
import mdptoolbox as mdpT
import numpy as np;
from numpy.linalg import matrix_power
from numpy import linalg as LA
S=2 #number of states High & Low
A=3 #number of possible actions
w=0 # wait
s=1 #search
r = 2 # recharge
#discount factor
gamma=0.9
#transitions array
#---------------------
P = np.zeros((A, S, S))
#transition matrix for wait
P[w] = [[1,0],[0,1]]
#transition matrix for search
P[s] =[[0.8,0.2], [0.6,0.4]]
#transition matrix for recharge
P[r] =[[1,0], [1,0]]
#Reward array
R = np.zeros((S,A))
R[:,w]=[1,1]
R[:,s]=[5,5]
R[:,r]=[0,0]
print("P=\n", P)
print("R=\n", R)
#finding optimal policy using policy iteration
prob =mdpT.mdp.PolicyIteration(P,R,gamma)
prob.run()
print("Optimal policy:",prob.policy)
print("Optimal values:",prob.V)
`
I was trying to understand the source code of boundIter() function which computes a bound for the number of iterations as per Proposition 6.6.5 in “Markov Decision Processes”, M. L. Puterman, 1994:
Pymdptoolbox: https://github.com/sawcordwell/pymdptoolbox/blob/master/src/mdptoolbox/mdp.py
Although the Hajnal measure (k) is defined by Theorem 6.6.6, and its implementation code is clear, I can’t understand the overall meaning behind the formula:
It’s very difficult for me to relate the equation used to calculate maximum iteration to the formula used in the code:
max_iter = (log((epsilon * (1 - discount) / discount) / span) / log(discount * k))
Can you please explain that formula to me, so that I can understand why the logarithm is used and how Hajnal measure (k) can be used to calculate the maximum iteration?
Your Support is much appreciated!
assert self.max_iter >= 10000, "'n_iter' should be greater than 10000."
This line in QLearning is inconvenient and arbitrary. It should be either removed or turned into a warning.
I am interested in hacking these classes, for instance, such that they can iteratively learn as the Reward matrix changes. As such, small values of this value are necessary.
Hi,
Is there an example to test-use the LP algo in the documentation? I was able to run all the examples in the docs but couldn't find any references to the LP, which I believe is still beta-testing? I wouldn't mind using it even if results are not 100% correct.
Thanks,
Gui
Code:
import mdptoolbox.example
P, R = mdptoolbox.example.small()
S = [i for i in range(len(R))]
A = [i for i in range(len(R[0]))]
gamma = 0.99
PI = mdptoolbox.mdp.PolicyIteration(P, R, gamma)
PI.run()
print("Optimal policy for policy iteration: {}".format([PI.policy[s] for s in S]))
print("Optimal value for policy iteration: {}".format([PI.V[s] for s in S]))
VI = mdptoolbox.mdp.ValueIteration(P, R, gamma)
VI.run()
print("Optimal policy for value iteration: {}".format( [ VI.policy[s] for s in S]))
print("Optimal value for value iteration: {}".format( [VI.V[s] for s in S]))
Output:
Optimal policy for policy iteration: [1, 0]
Optimal value for policy iteration: [392.29910714285614, 386.16071428571325]
Optimal policy for value iteration: [1, 0]
Optimal value for value iteration: [150.86673787890174, 144.72830416664837]
Also for QLearning, PolicyInterationModified, ValueIterationGS, all of them are different.
Just downloaded Anaconda and installed mdptoolbox. However, import failed due to API of mdptoolbox having been compiled against API version 9 of numpy and Anaconda has version 7 of numpy. Recommendation?
use it helps me solve my problem easily!
Is there a type of user manual for this toolbox?
The following code runs fine with dense matrices, but fails in different ways when sparse matrices are used (i.e. when the lines P = list(map(csr_matrix, P))
and/or R = list(map(csr_matrix, R))
are uncommented):
import mdptoolbox
import numpy as np
from scipy.sparse import csr_matrix
P = [None] * 2
P[0] = np.array([
[0.0, 0.0, 0.6, 0.4, 0.0],
[0.0, 0.0, 0.0, 0.0, 1.0],
[0.0, 0.0, 1.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 1.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 1.0]
])
P[1] = np.array([
[0.0, 0.4, 0.0, 0.0, 0.6],
[0.0, 1.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 1.0],
[0.0, 0.0, 0.0, 0.0, 1.0],
[0.0, 0.0, 0.0, 0.0, 1.0]
])
R = [None] * 2
R[0] = np.zeros((5, 5))
R[1] = np.array([
[0, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]
])
# P = list(map(csr_matrix, P))
# R = list(map(csr_matrix, R))
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
vi.run()
assert vi.policy[0] == 1
pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
pi.run()
assert pi.policy[0] == 1
With sparse P
and dense R
:
Traceback (most recent call last):
File "example.py", line 34, in <module>
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
File "mdptoolbox/mdp.py", line 1300, in __init__
self._boundIter(epsilon)
File "mdptoolbox/mdp.py", line 1345, in _boundIter
null, value = self._bellmanOperator()
File "mdptoolbox/mdp.py", line 232, in _bellmanOperator
Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)
ValueError: setting an array element with a sequence.
With dense P
and sparse R
:
Traceback (most recent call last):
File "mdptoolbox/mdp.py", line 279, in _computePR
if R.ndim == 1:
AttributeError: 'list' object has no attribute 'ndim'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "example.py", line 34, in <module>
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
File "mdptoolbox/mdp.py", line 1288, in __init__
MDP.__init__(self, transitions, reward, discount, epsilon, max_iter)
File "mdptoolbox/mdp.py", line 187, in __init__
self._computePR(transitions, reward)
File "mdptoolbox/mdp.py", line 291, in _computePR
for aa in range(self.A))
File "mdptoolbox/mdp.py", line 291, in <genexpr>
for aa in range(self.A))
AttributeError: 'NotImplementedType' object has no attribute 'sum'
With sparse P
and sparse R
:
mdptoolbox/mdp.py:1348: RuntimeWarning: divide by zero encountered in double_scalars
getSpan(value - Vprev) ) / _math.log(self.discount * k))
Traceback (most recent call last):
File "example.py", line 34, in <module>
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
File "mdptoolbox/mdp.py", line 1300, in __init__
self._boundIter(epsilon)
File "mdptoolbox/mdp.py", line 1351, in _boundIter
self.max_iter = int(_math.ceil(max_iter))
OverflowError: cannot convert float infinity to integer
I'm using Python 3.4.0, scipy 0.14.1, numpy 1.9.1 and mdptoolbox 5af7d2a.
The functionality is slightly differet in several places when a discount rate of 1 is used, and tests need to be written to cover these cases, as shown by #5 .
pymdptoolbox/src/mdptoolbox/util.py
Line 159 in 7c96789
Due to either a bug or feature in numpy (I'm running numpy 1.14.0 and python 3.5.2), the above line returns a matrix of differences instead of a vector. This results in a memory error for relatively small vectors (in my case matrix.shape = (97200, 97200) and I have 40GB RAM). I think the fix is easy, just force the second term to be a proper column vector:
(_np.abs(matrix.sum(axis=1) - _np.ones((matrix.shape[0],1))))
The epsilon obtained is 1-1 / (log (n + 2)), and the update Q uses 1 / sqrt (n + 2)。
Seems like a good choice, is there any basis for doing so?
Due to the way that scipy works, if the rewards are given in sparse format they are at the moment converted to dense arrays. Specifically sparse.multiply(dense)
will result in a dense numpy.matrix
and sparse.sum(1)
will result in a numpy.matrix
.
It may be dersirable to have a way of keeping them sparse.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.