sawcordwell / pymdptoolbox Goto Github PK

View Code? Open in Web Editor NEW

518.0 518.0 252.0 759 KB

Markov Decision Process (MDP) Toolbox for Python

License: BSD 3-Clause "New" or "Revised" License

Shell 0.14% Python 99.86%

pymdptoolbox's People

Contributors

Stargazers

Watchers

Forkers

rgaonkar iinonaoki silgon yarden lenovor darcy0511 jethrotan lucosax pfjob09 amoliu lutz1013 snazz2001 r0k3 cosmoharrigan danielrich drvinceknight jasyoung315 nasa exg77 mccabejm davidshinn jameskyle yywxenia pdthill maxwellrebo lstavr aliciawyy laurentperrinet jyprksh001 pwaila castley mshvartsman imanojkumar blueyyc0 bo-yang jenifferyingyiwu exception4u zsh1313 tima04 catchbears samgoree h3dema bionicles grseb9s ianmadlenya egmeek pythonai wolfenfeld haoybl liyanonline ianchute yswhynot michaeljyt ryanpeach xinhandi julian-ramos pshwu nguyentu1602 vishalbelsare vincentami shellstreetlabs leofelipefrohlich hurryon rn0311 jinhuajian reiisky movermeyer baptisteamato loganwendholt zwc662 ifgi-sil jolyzhang nbarbosa-git waykole zhangzhangzf zrajabi wenkangwei zhaosv cocolico14 nunofernandes-plight schweinebauch sealyu cpimentel10 crypto-jeronimo stjordanis cooperbaerseth khiemb rajaramkuberan alexandroid000 sachet-mittal wenliyuan davidbenton hj-dojo jspromisel gp1702 exclaminator ziyuezhang oiwen jjyy2333 pierrefg

pymdptoolbox's Issues

max_iteration parameters is not respected in the ValueIteration Algorithm

Using the forest example from the quickstart guide and modifying the max_iter does not seem to work.

It looks like it always defaults to 39 and stops.

import mdptoolbox.example
P, R = mdptoolbox.example.forest()
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9, max_iter=100)
vi.run()
vi.policy
print vi.V
print vi.max_iter

(5.051970000000001, 8.291970000000001, 12.291970000000001)
39

I am experimenting with number of iterations to see how the algorithm is performing and it would be very beneficial to collect values at each iteration

Update:

Setting MDP to verbose shows that only 4 iterations were executed. So what does 39 means here?


  Iteration     V-variation
    1         4.0
    2         2.43
    3         0.81
    4         0.0
Iterating stopped, epsilon-optimal policy found.
(5.051970000000001, 8.291970000000001, 12.291970000000001)
39

Question about undiscounted model

I tried to use this library for the policy iteration. I used this code:

P = np.array([[[0.25, 0.25, 0.5],
[0.75, 0, 0.25],
[0.5, 0.5, 0]],

          [[0,    0.25, 0.75],
           [0.25, 0,    0.75],
           [0.25, 0.25, 0.5]]]) # A * S * S

R = np.array([[0.55, 0.75], [1, 0.8], [1.2, 1]]) # S * A

pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.000000001, [0, 0, 0])
pi.run()
print(pi.policy)

I want to solve this problem actually without discounting, and if this is not possible with a very small discount but whether I use 1, 0.000000001 or 0.999999 I get the same result which is wrong since we solved this exercise in my class and the results there are correct. What am I doing wrong?

Linear programming class is broken

The policies that the LP class gives are incorrect in (every?) case. This may be because the class used to call cvxopt with the glpk option but now it uses the native cvxopt algorithms, so it may need different options now?

how to train it?

Dose it cannot be trained? After reading the document, i have no opinion about how to train this model.
Somebody can tell me how to train it here?

Move evalPolicy* functions to be functions of the util module

These are useful for checking policies in cases other than policy iteration algorithms, so they should be exported under mdptoolbox top level module.

pip install issues

I'm fairly unfamiliar with how pip works, however I guess the newer versions only allow installing of certain release number formats. so the default "pip install <>" isn't working.
This SO question/answer talks about the format/convention that is necessary for the newer pip installs.

http://stackoverflow.com/questions/18230956/could-not-find-a-version-that-satisfies-the-requirement-pytz

Pre-release Versions

Starting with v1.4, pip will only install stable versions as specified by PEP426 by default. If a version cannot be parsed as a compliant PEP426 version then it is assumed to be a pre-release.

If a Requirement specifier includes a pre-release or development version (e.g. >=0.0.dev0) then pip will allow pre-release and development versions for that requirement. This does not include the != flag.

The pip install command also supports a --pre flag that will enable installing pre-releases and development releases.

So I finally tried

pip install --pre pymdptoolbox

and it worked

Please release a new version with the latest changes

Model-free algorithms depend on model

It seems that all the algorithms require that you pass a transition probability table and reward vector, however most of the usefullness of algorithms such as QLearning relies on the fact that it doesn't need these values to estimate policies.

Is this by design? A good update to the library would be to enable model-free learning, because most of the time you don't know the model, you have to simulate it. This would make it much more useful to more people.

ValueIterationGS _boundIter is incorrect

The _boundIter method causes the algorithm to stop before it has reached the epsilon-optimal value function. Steps to reproduce:

import mdptoolbox.example
P, R = mdptoolbox.example.forest()
vigs = mdptoolbox.mdp.ValueIterationGS(P, R, 0.96)
vigs.setVerbose()
vigs.run()

What is expected: _boundIter should compute something that the algorithm can finish before, or it should not be used.

skip_check

I'm using the backward induction (Finite Horizon).
When I put this fh = mdptoolbox.mdp.FiniteHorizon(P, R, d, T, skip_check=True) in my code, I get the following error:

TypeError: init() got an unexpected keyword argument 'skip_check'

Is it a bug or am I not skipping the check the right way?

Thanks

Make sure that doctests are run when testing on Travis CI

Many of the doctests are failing when they get run on each module. The package will probabliy have to be restructured somewhat because I can't get Python 3 to run the doctests as in at the moment.

MDP where not all actions are always available

In many relevant MDPs, some actions are unavailable in some states. An example is the "Recycling Robot" (Example 3.3 in Sutton & Barto). In short, a robot cannot recharge its battery when the battery is fully charged.

In pymdptoolbox, however, it's necessary to define a probability of state transition for all actions and all states. If an action is not available in a certain state, these probabilities have no meaning. Setting them to 0 is inadequate and causes errors, because the state transition matrix needs to have row sums of 1 for each action.

Another possibility would be to assign a probability of 1 and reward of 0 (or less) to "remaining in the same state", but this seems a bit artificial, since the action isn't actually available.

Do you have any ideas how to circumvent this problem in a more elegant way?

Recycling robot MDP

I need some advice for with the recycling robot mdp problem, even though the robot can't charge at high level I add it so the transition matrix be stochastic and also for the search reward array do I put both states rewards as 5? because we can get the -3 while searching in low.
In my case the reward for search is 5 and 2 for wait, alpha is 0.8 and beta is 0.4.
`
#Recycling Robot MDP
import mdptoolbox as mdpT
import numpy as np;
from numpy.linalg import matrix_power
from numpy import linalg as LA

S=2 #number of states High & Low
A=3 #number of possible actions
w=0 # wait
s=1 #search
r = 2 # recharge

#discount factor
gamma=0.9

#transitions array
#---------------------
P = np.zeros((A, S, S))
#transition matrix for wait
P[w] = [[1,0],[0,1]]
#transition matrix for search
P[s] =[[0.8,0.2], [0.6,0.4]]
#transition matrix for recharge
P[r] =[[1,0], [1,0]]

#Reward array
R = np.zeros((S,A))
R[:,w]=[1,1]
R[:,s]=[5,5]
R[:,r]=[0,0]

print("P=\n", P)
print("R=\n", R)

#finding optimal policy using policy iteration

prob =mdpT.mdp.PolicyIteration(P,R,gamma)
prob.run()
print("Optimal policy:",prob.policy)
print("Optimal values:",prob.V)
`

Stopping Criteria || pymdptoolbox || boundIter() function

I was trying to understand the source code of boundIter() function which computes a bound for the number of iterations as per Proposition 6.6.5 in “Markov Decision Processes”, M. L. Puterman, 1994:

Pymdptoolbox: https://github.com/sawcordwell/pymdptoolbox/blob/master/src/mdptoolbox/mdp.py

Although the Hajnal measure (k) is defined by Theorem 6.6.6, and its implementation code is clear, I can’t understand the overall meaning behind the formula:

It’s very difficult for me to relate the equation used to calculate maximum iteration to the formula used in the code:

max_iter = (log((epsilon * (1 - discount) / discount) / span) / log(discount * k))

Can you please explain that formula to me, so that I can understand why the logarithm is used and how Hajnal measure (k) can be used to calculate the maximum iteration?

Your Support is much appreciated!

Implement own exception class

Improper Assertion Statement.

assert self.max_iter >= 10000, "'n_iter' should be greater than 10000."

This line in QLearning is inconvenient and arbitrary. It should be either removed or turned into a warning.

I am interested in hacking these classes, for instance, such that they can iteratively learn as the Reward matrix changes. As such, small values of this value are necessary.

Linear Programming algo

Hi,
Is there an example to test-use the LP algo in the documentation? I was able to run all the examples in the docs but couldn't find any references to the LP, which I believe is still beta-testing? I wouldn't mind using it even if results are not 100% correct.
Thanks,
Gui

Why the values of policy iteration and the values of value iteration are different?

Code:

import mdptoolbox.example

P, R = mdptoolbox.example.small()                                     
S = [i for i in range(len(R))]                                                  
A = [i for i in range(len(R[0]))]                                               
gamma = 0.99 
PI = mdptoolbox.mdp.PolicyIteration(P, R, gamma)
PI.run()                                                                        
print("Optimal policy for policy iteration: {}".format([PI.policy[s] for s in S]))                
print("Optimal value for policy iteration: {}".format([PI.V[s] for s in S]))      
VI = mdptoolbox.mdp.ValueIteration(P, R, gamma)
VI.run()
print("Optimal policy for value iteration: {}".format( [ VI.policy[s]  for s in S]))
print("Optimal value for value iteration: {}".format( [VI.V[s] for s in S]))

Output:

Optimal policy for policy iteration: [1, 0]
Optimal value for policy iteration: [392.29910714285614, 386.16071428571325]
Optimal policy for value iteration: [1, 0]
Optimal value for value iteration: [150.86673787890174, 144.72830416664837]

Also for QLearning, PolicyInterationModified, ValueIterationGS, all of them are different.

Numpy Version

Just downloaded Anaconda and installed mdptoolbox. However, import failed due to API of mdptoolbox having been compiled against API version 9 of numpy and Anaconda has version 7 of numpy. Recommendation?

thanks

use it helps me solve my problem easily!

User guide?

Is there a type of user manual for this toolbox?

Errors with sparse transition and/or reward matrices

The following code runs fine with dense matrices, but fails in different ways when sparse matrices are used (i.e. when the lines P = list(map(csr_matrix, P)) and/or R = list(map(csr_matrix, R)) are uncommented):

import mdptoolbox
import numpy as np
from scipy.sparse import csr_matrix

P = [None] * 2
P[0] = np.array([
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0]
])
P[1] = np.array([
    [0.0, 0.4, 0.0, 0.0, 0.6],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0]
])

R = [None] * 2
R[0] = np.zeros((5, 5))
R[1] = np.array([
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0]
])

# P = list(map(csr_matrix, P))
# R = list(map(csr_matrix, R))

vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
vi.run()
assert vi.policy[0] == 1

pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
pi.run()
assert pi.policy[0] == 1

With sparse P and dense R:

Traceback (most recent call last):
  File "example.py", line 34, in <module>
    vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
  File "mdptoolbox/mdp.py", line 1300, in __init__
    self._boundIter(epsilon)
  File "mdptoolbox/mdp.py", line 1345, in _boundIter
    null, value = self._bellmanOperator()
  File "mdptoolbox/mdp.py", line 232, in _bellmanOperator
    Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)
ValueError: setting an array element with a sequence.

With dense P and sparse R:

Traceback (most recent call last):
  File "mdptoolbox/mdp.py", line 279, in _computePR
    if R.ndim == 1:
AttributeError: 'list' object has no attribute 'ndim'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example.py", line 34, in <module>
    vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
  File "mdptoolbox/mdp.py", line 1288, in __init__
    MDP.__init__(self, transitions, reward, discount, epsilon, max_iter)
  File "mdptoolbox/mdp.py", line 187, in __init__
    self._computePR(transitions, reward)
  File "mdptoolbox/mdp.py", line 291, in _computePR
    for aa in range(self.A))
  File "mdptoolbox/mdp.py", line 291, in <genexpr>
    for aa in range(self.A))
AttributeError: 'NotImplementedType' object has no attribute 'sum'

With sparse P and sparse R:

mdptoolbox/mdp.py:1348: RuntimeWarning: divide by zero encountered in double_scalars
  getSpan(value - Vprev) ) / _math.log(self.discount * k))
Traceback (most recent call last):
  File "example.py", line 34, in <module>
    vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
  File "mdptoolbox/mdp.py", line 1300, in __init__
    self._boundIter(epsilon)
  File "mdptoolbox/mdp.py", line 1351, in _boundIter
    self.max_iter = int(_math.ceil(max_iter))
OverflowError: cannot convert float infinity to integer

I'm using Python 3.4.0, scipy 0.14.1, numpy 1.9.1 and mdptoolbox 5af7d2a.

Unit tests for undiscounted MDPs required

The functionality is slightly differet in several places when a discount rate of 1 is used, and tests need to be written to cover these cases, as shown by #5 .

Absolute difference is unnecessarily space-inefficient

pymdptoolbox/src/mdptoolbox/util.py

Line 159 in 7c96789

absdiff = (_np.abs(matrix.sum(axis=1) - _np.ones(matrix.shape[0])))

Due to either a bug or feature in numpy (I'm running numpy 1.14.0 and python 3.5.2), the above line returns a matrix of differences instead of a vector. This results in a memory error for relatively small vectors (in my case matrix.shape = (97200, 97200) and I have 40GB RAM). I think the fix is easy, just force the second term to be a proper column vector:

(_np.abs(matrix.sum(axis=1) - _np.ones((matrix.shape[0],1))))

Why changed the epsilon in Q-learning and the way to update Q, is this better?

The epsilon obtained is 1-1 / (log (n + 2)), and the update Q uses 1 / sqrt (n + 2)。
Seems like a good choice, is there any basis for doing so?

Sparse rewards are converted to dense arrays

Due to the way that scipy works, if the rewards are given in sparse format they are at the moment converted to dense arrays. Specifically sparse.multiply(dense) will result in a dense numpy.matrix and sparse.sum(1) will result in a numpy.matrix.

It may be dersirable to have a way of keeping them sparse.

Solution for basic grid world example

I'm trying to model the canonical grid world example, but I'm having a lot of trouble figuring out how exactly the inputs should be defined. I keep getting action/state count exceptions or math domain exceptions.

What are the proper P and R inputs for the following problem?