
pymdptoolbox's People

Contributors

sawcordwell, thejat, yasserglez


pymdptoolbox's Issues

max_iter parameter is not respected in the ValueIteration algorithm

Using the forest example from the quickstart guide and modifying the max_iter does not seem to work.

It looks like it always defaults to 39 and stops.

import mdptoolbox.example
P, R = mdptoolbox.example.forest()
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9, max_iter=100)
vi.run()
vi.policy
print(vi.V)
print(vi.max_iter)

(5.051970000000001, 8.291970000000001, 12.291970000000001)
39

I am experimenting with the number of iterations to see how the algorithm performs, and it would be very beneficial to collect values at each iteration (see the sketch at the end of this issue).

Update:

Setting the MDP to verbose shows that only 4 iterations were executed. So what does 39 mean here?


  Iteration     V-variation
    1         4.0
    2         2.43
    3         0.81
    4         0.0
Iterating stopped, epsilon-optimal policy found.
(5.051970000000001, 8.291970000000001, 12.291970000000001)
39
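
One way to collect the value function at every iteration, as requested above, is to run the Bellman backup manually instead of relying on ValueIteration's internal loop. Below is a minimal sketch (not a toolbox feature; it uses a simple max-norm stopping test rather than the toolbox's span-based one, so the iteration counts will not match exactly). For what it's worth, the 39 reported above appears to be the theoretical upper bound computed by _boundIter (see the "Stopping Criteria" issue further down), not the number of iterations actually executed.

```python
import numpy as np
import mdptoolbox.example

P, R = mdptoolbox.example.forest()   # P: (A, S, S), R: (S, A)
gamma = 0.9
V = np.zeros(R.shape[0])
history = []                         # value estimates after each iteration

for it in range(100):
    # Bellman backup: Q[a, s] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
    Q = np.array([R[:, a] + gamma * P[a].dot(V) for a in range(P.shape[0])])
    V_new = Q.max(axis=0)
    history.append(V_new.copy())
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(len(history))   # iterations actually performed
print(history[-1])    # final value estimates
```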

Question about undiscounted model

I tried to use this library for policy iteration. I used this code:

import numpy as np
import mdptoolbox.mdp

P = np.array([[[0.25, 0.25, 0.5],
               [0.75, 0.0,  0.25],
               [0.5,  0.5,  0.0]],

              [[0.0,  0.25, 0.75],
               [0.25, 0.0,  0.75],
               [0.25, 0.25, 0.5]]])  # A * S * S

R = np.array([[0.55, 0.75], [1, 0.8], [1.2, 1]])  # S * A

pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.000000001, [0, 0, 0])
pi.run()
print(pi.policy)

I actually want to solve this problem without discounting, or, if that is not possible, with a very small discount. But whether I use 1, 0.000000001 or 0.999999 I get the same result, which is wrong, since we solved this exercise in my class and the results there are correct. What am I doing wrong?
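
One way to check the library's output against the hand-computed class solution is to evaluate a candidate policy exactly by solving the linear system (I - gamma * P_pi) V = R_pi. This is a generic sketch, not a pymdptoolbox feature, and `evaluate` is a hypothetical helper. Note that with gamma = 1 the matrix I - P_pi is singular for a recurrent chain, which is why a truly undiscounted (average-reward) criterion needs a different algorithm than discounted policy iteration.

```python
import numpy as np

P = np.array([[[0.25, 0.25, 0.5],
               [0.75, 0.0,  0.25],
               [0.5,  0.5,  0.0]],
              [[0.0,  0.25, 0.75],
               [0.25, 0.0,  0.75],
               [0.25, 0.25, 0.5]]])               # A x S x S
R = np.array([[0.55, 0.75], [1, 0.8], [1.2, 1]])  # S x A

def evaluate(policy, gamma):
    """Exact value of a fixed policy: V = (I - gamma * P_pi)^-1 R_pi."""
    n = len(policy)
    P_pi = np.array([P[policy[s], s] for s in range(n)])
    R_pi = np.array([R[s, policy[s]] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Compare against the policies and values from the class exercise.
print(evaluate([0, 0, 0], 0.999))
print(evaluate([1, 1, 1], 0.999))
```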

Linear programming class is broken

The policies that the LP class gives are incorrect in (every?) case. This may be because the class used to call cvxopt with the glpk option but now it uses the native cvxopt algorithms, so it may need different options now?

how to train it?

Can it not be trained? After reading the documentation, I still have no idea how to train this model.
Can somebody tell me how to train it here?
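
As a possible starting point (a sketch using the quickstart forest example, not an official training recipe): "training" in this toolbox means running one of the solvers on a model, and QLearning is the closest thing to a training loop. Note that it still needs P and R, as discussed in the "Model-free algorithms depend on model" issue below.

```python
import mdptoolbox.example
import mdptoolbox.mdp

# Build the quickstart forest model and "train" a Q-learning agent on it.
P, R = mdptoolbox.example.forest()
ql = mdptoolbox.mdp.QLearning(P, R, 0.96)
ql.run()

print(ql.policy)   # learned policy
print(ql.Q)        # learned Q-table
```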

pip install issues

I'm fairly unfamiliar with how pip works; however, I guess the newer versions only allow installing certain release number formats, so the default "pip install <>" isn't working.
This SO question/answer talks about the format/convention that is necessary for the newer pip installs.

http://stackoverflow.com/questions/18230956/could-not-find-a-version-that-satisfies-the-requirement-pytz

Pre-release Versions

Starting with v1.4, pip will only install stable versions as specified by PEP426 by default. If a version cannot be parsed as a compliant PEP426 version then it is assumed to be a pre-release.

If a Requirement specifier includes a pre-release or development version (e.g. >=0.0.dev0) then pip will allow pre-release and development versions for that requirement. This does not include the != flag.

The pip install command also supports a --pre flag that will enable installing pre-releases and development releases.

So I finally tried

pip install --pre pymdptoolbox 

and it worked

Model-free algorithms depend on model

It seems that all the algorithms require that you pass a transition probability table and a reward vector; however, most of the usefulness of algorithms such as QLearning relies on the fact that they don't need these values to estimate policies.

Is this by design? A good update to the library would be to enable model-free learning, because most of the time you don't know the model; you have to simulate it. This would make it much more useful to more people.
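
For illustration, a hypothetical sketch of the kind of model-free interface this issue is asking for: tabular Q-learning driven only by a `step(state, action) -> (next_state, reward)` simulator, with no access to P or R. None of this is part of pymdptoolbox; the names and defaults are made up for the example.

```python
import numpy as np

def q_learning(step, n_states, n_actions, episodes=1000, horizon=100,
               gamma=0.95, alpha=0.1, epsilon=0.1, seed=0):
    """Tabular Q-learning against a simulator, returning (policy, Q)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = int(rng.integers(n_states))           # random start state
        for _ in range(horizon):
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))  # explore
            else:
                a = int(np.argmax(Q[s]))          # exploit
            s_next, reward = step(s, a)
            # standard Q-learning update
            Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q.argmax(axis=1), Q
```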

ValueIterationGS _boundIter is incorrect

The _boundIter method causes the algorithm to stop before it has reached the epsilon-optimal value function. Steps to reproduce:

import mdptoolbox.example
P, R = mdptoolbox.example.forest()
vigs = mdptoolbox.mdp.ValueIterationGS(P, R, 0.96)
vigs.setVerbose()
vigs.run()

Expected behaviour: _boundIter should compute a bound within which the algorithm can actually reach its stopping criterion, or it should not be used.

skip_check

I'm using the backward induction (Finite Horizon).
When I put this fh = mdptoolbox.mdp.FiniteHorizon(P, R, d, T, skip_check=True) in my code, I get the following error:

TypeError: __init__() got an unexpected keyword argument 'skip_check'

Is it a bug or am I not skipping the check the right way?

Thanks

MDP where not all actions are always available

In many relevant MDPs, some actions are unavailable in some states. An example is the "Recycling Robot" (Example 3.3 in Sutton & Barto). In short, a robot cannot recharge its battery when the battery is fully charged.

In pymdptoolbox, however, it's necessary to define a probability of state transition for all actions and all states. If an action is not available in a certain state, these probabilities have no meaning. Setting them to 0 is inadequate and causes errors, because the state transition matrix needs to have row sums of 1 for each action.

Another possibility would be to assign a probability of 1 and reward of 0 (or less) to "remaining in the same state", but this seems a bit artificial, since the action isn't actually available.

Do you have any ideas how to circumvent this problem in a more elegant way?
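
One workaround that keeps every transition matrix stochastic is sketched below (not an official feature of pymdptoolbox, using the recycling robot as the example): give the unavailable action a probability-1 self-loop and a reward so low that no optimal policy can ever select it.

```python
import numpy as np
import mdptoolbox.mdp

# Recycling-robot-style setup: 2 states (high, low), 3 actions
# (wait, search, recharge); "recharge" is unavailable in "high".
P = np.zeros((3, 2, 2))
P[0] = [[1.0, 0.0], [0.0, 1.0]]   # wait
P[1] = [[0.8, 0.2], [0.6, 0.4]]   # search
P[2] = [[1.0, 0.0], [1.0, 0.0]]   # recharge; row 0 is a dummy self-loop

R = np.zeros((2, 3))
R[:, 0] = [1, 1]                  # wait
R[:, 1] = [5, 5]                  # search
R[:, 2] = [-1e6, 0]               # huge penalty on the unavailable action

pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
pi.run()
print(pi.policy)                  # "recharge" is never chosen in "high"
```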

Recycling robot MDP

I need some advice with the recycling robot MDP problem. Even though the robot can't recharge at the high level, I added that transition so the transition matrix stays stochastic. Also, for the search reward array, do I put both states' rewards as 5? Because we can get the -3 while searching in low.
In my case the reward for search is 5 and 2 for wait; alpha is 0.8 and beta is 0.4.
```python
# Recycling Robot MDP
import mdptoolbox as mdpT
import numpy as np

S = 2  # number of states: High & Low
A = 3  # number of possible actions
w = 0  # wait
s = 1  # search
r = 2  # recharge

# discount factor
gamma = 0.9

# transitions array
# -----------------
P = np.zeros((A, S, S))
# transition matrix for wait
P[w] = [[1, 0], [0, 1]]
# transition matrix for search
P[s] = [[0.8, 0.2], [0.6, 0.4]]
# transition matrix for recharge
P[r] = [[1, 0], [1, 0]]

# reward array (search = 5, wait = 2, recharge = 0)
R = np.zeros((S, A))
R[:, w] = [2, 2]
R[:, s] = [5, 5]
R[:, r] = [0, 0]

print("P=\n", P)
print("R=\n", R)

# finding the optimal policy using policy iteration
prob = mdpT.mdp.PolicyIteration(P, R, gamma)
prob.run()
print("Optimal policy:", prob.policy)
print("Optimal values:", prob.V)
```
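
On the reward question: in Sutton & Barto's formulation the -3 rescue penalty only occurs when searching in the low state and running the battery flat, so one way to fold it into an S x A reward matrix is to use the expected immediate reward rather than 5 for both states. A sketch under that assumption (so R[:, s] would be [5, 0.2] rather than [5, 5]):

```python
# Expected immediate reward for "search", assuming the -3 penalty applies
# only when searching in "low" and the battery runs flat (prob. 1 - beta).
r_search = 5
beta = 0.4

R_search_high = r_search                             # never runs flat in high
R_search_low = beta * r_search + (1 - beta) * (-3)   # 0.4*5 + 0.6*(-3) = 0.2

print(R_search_high, R_search_low)
```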

Stopping Criteria || pymdptoolbox || boundIter() function

I was trying to understand the source code of the _boundIter() function, which computes a bound on the number of iterations as per Proposition 6.6.5 in “Markov Decision Processes”, M. L. Puterman, 1994:

[screenshot: Proposition 6.6.5 from Puterman]

Pymdptoolbox: https://github.com/sawcordwell/pymdptoolbox/blob/master/src/mdptoolbox/mdp.py

[screenshot: the _boundIter implementation in mdp.py]

Although the Hajnal measure (k) is defined by Theorem 6.6.6, and its implementation code is clear, I can’t understand the overall meaning behind the formula:

[screenshot: the iteration-bound formula, reproduced as code below]

It’s very difficult for me to relate the equation used to calculate the maximum number of iterations to the formula used in the code:

max_iter = (log((epsilon * (1 - discount) / discount) / span) / log(discount * k))

Can you please explain that formula to me, so that I can understand why the logarithm is used and how Hajnal measure (k) can be used to calculate the maximum iteration?

Your Support is much appreciated!
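
For reference, here is a sketch of where the logarithm comes from, reconstructed from the standard span-contraction argument behind bounds of this kind; it is consistent with the code line quoted above, but it is not taken verbatim from Puterman's text or from the library's documentation.

```latex
% Value iteration contracts in the span semi-norm with factor (gamma * k),
% where k is the Hajnal/ergodicity coefficient:
\operatorname{sp}(V^{n+1} - V^{n}) \le (\gamma k)^{n} \operatorname{sp}(V^{1} - V^{0})
% The algorithm stops once the span falls below the epsilon-optimality threshold:
\operatorname{sp}(V^{n+1} - V^{n}) < \frac{\epsilon (1 - \gamma)}{\gamma}
% Forcing the contraction bound under that threshold and solving for n
% (log(gamma * k) < 0, so dividing flips the inequality) gives
n > \frac{\log\!\left(\dfrac{\epsilon (1 - \gamma) / \gamma}{\operatorname{sp}(V^{1} - V^{0})}\right)}{\log(\gamma k)}
% which is exactly
% max_iter = log((epsilon * (1 - discount) / discount) / span) / log(discount * k)
```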

Improper Assertion Statement.

assert self.max_iter >= 10000, "'n_iter' should be greater than 10000."

This line in QLearning is inconvenient and arbitrary. It should be either removed or turned into a warning.

I am interested in hacking these classes, for instance so that they can learn iteratively as the reward matrix changes. For that, small values of this parameter are necessary.
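
One possible change along the lines suggested (a sketch of a patch inside QLearning.__init__, not the library's current behaviour):

```python
import warnings

# Instead of:
#     assert self.max_iter >= 10000, "'n_iter' should be greater than 10000."
# warn and continue, so the class can be run in short bursts, e.g. re-run
# repeatedly as the reward matrix changes:
if self.max_iter < 10000:
    warnings.warn("'n_iter' is below 10000; the Q-value estimates may be "
                  "noisy with so few samples.", RuntimeWarning)
```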

Linear Programming algo

Hi,
Is there an example of using the LP algorithm in the documentation? I was able to run all the examples in the docs but couldn't find any references to LP, which I believe is still in beta testing? I wouldn't mind using it even if the results are not 100% correct.
Thanks,
Gui
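
For what it's worth, a minimal sketch of calling the LP solver on the quickstart forest example; it assumes cvxopt is installed, and (per the "Linear programming class is broken" issue above) the returned policy may be unreliable.

```python
import mdptoolbox.example
import mdptoolbox.mdp

P, R = mdptoolbox.example.forest()
lp = mdptoolbox.mdp.LP(P, R, 0.9)   # requires cvxopt
lp.run()

print(lp.policy)
print(lp.V)
```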

Why are the values from policy iteration and value iteration different?

Code:

import mdptoolbox.example

P, R = mdptoolbox.example.small()                                     
S = [i for i in range(len(R))]                                                  
A = [i for i in range(len(R[0]))]                                               
gamma = 0.99 
PI = mdptoolbox.mdp.PolicyIteration(P, R, gamma)
PI.run()                                                                        
print("Optimal policy for policy iteration: {}".format([PI.policy[s] for s in S]))                
print("Optimal value for policy iteration: {}".format([PI.V[s] for s in S]))      
VI = mdptoolbox.mdp.ValueIteration(P, R, gamma)
VI.run()
print("Optimal policy for value iteration: {}".format( [ VI.policy[s]  for s in S]))
print("Optimal value for value iteration: {}".format( [VI.V[s] for s in S]))

Output:

Optimal policy for policy iteration: [1, 0]
Optimal value for policy iteration: [392.29910714285614, 386.16071428571325]
Optimal policy for value iteration: [1, 0]
Optimal value for value iteration: [150.86673787890174, 144.72830416664837]

The same happens for QLearning, PolicyIterationModified and ValueIterationGS: their values are all different.
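
Both solvers return the same policy, and the two value vectors differ by almost exactly the same constant in every state (about 241.43), which is consistent with ValueIteration stopping at a span-based, epsilon-optimal estimate while PolicyIteration evaluates its final policy exactly. Below is a sketch (not a toolbox feature) that reproduces PolicyIteration's numbers by evaluating the returned policy directly.

```python
import numpy as np
import mdptoolbox.example

P, R = mdptoolbox.example.small()
gamma = 0.99
policy = (1, 0)   # the policy both solvers returned

# Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi.
n = len(policy)
P_pi = np.array([P[policy[s], s] for s in range(n)])
R_pi = np.array([R[s, policy[s]] for s in range(n)])
V_exact = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

print(V_exact)    # should match PolicyIteration's values
```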

Numpy Version

I just downloaded Anaconda and installed mdptoolbox. However, the import failed because mdptoolbox was compiled against API version 9 of numpy, while Anaconda ships API version 7 of numpy. Any recommendation?

thanks

Using it helped me solve my problem easily!

User guide?

Is there any kind of user manual for this toolbox?

Errors with sparse transition and/or reward matrices

The following code runs fine with dense matrices, but fails in different ways when sparse matrices are used (i.e. when the lines P = list(map(csr_matrix, P)) and/or R = list(map(csr_matrix, R)) are uncommented):

import mdptoolbox
import numpy as np
from scipy.sparse import csr_matrix

P = [None] * 2
P[0] = np.array([
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0]
])
P[1] = np.array([
    [0.0, 0.4, 0.0, 0.0, 0.6],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0]
])

R = [None] * 2
R[0] = np.zeros((5, 5))
R[1] = np.array([
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0]
])

# P = list(map(csr_matrix, P))
# R = list(map(csr_matrix, R))

vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
vi.run()
assert vi.policy[0] == 1

pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
pi.run()
assert pi.policy[0] == 1

With sparse P and dense R:

Traceback (most recent call last):
  File "example.py", line 34, in <module>
    vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
  File "mdptoolbox/mdp.py", line 1300, in __init__
    self._boundIter(epsilon)
  File "mdptoolbox/mdp.py", line 1345, in _boundIter
    null, value = self._bellmanOperator()
  File "mdptoolbox/mdp.py", line 232, in _bellmanOperator
    Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)
ValueError: setting an array element with a sequence.

With dense P and sparse R:

Traceback (most recent call last):
  File "mdptoolbox/mdp.py", line 279, in _computePR
    if R.ndim == 1:
AttributeError: 'list' object has no attribute 'ndim'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example.py", line 34, in <module>
    vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
  File "mdptoolbox/mdp.py", line 1288, in __init__
    MDP.__init__(self, transitions, reward, discount, epsilon, max_iter)
  File "mdptoolbox/mdp.py", line 187, in __init__
    self._computePR(transitions, reward)
  File "mdptoolbox/mdp.py", line 291, in _computePR
    for aa in range(self.A))
  File "mdptoolbox/mdp.py", line 291, in <genexpr>
    for aa in range(self.A))
AttributeError: 'NotImplementedType' object has no attribute 'sum'

With sparse P and sparse R:

mdptoolbox/mdp.py:1348: RuntimeWarning: divide by zero encountered in double_scalars
  getSpan(value - Vprev) ) / _math.log(self.discount * k))
Traceback (most recent call last):
  File "example.py", line 34, in <module>
    vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
  File "mdptoolbox/mdp.py", line 1300, in __init__
    self._boundIter(epsilon)
  File "mdptoolbox/mdp.py", line 1351, in _boundIter
    self.max_iter = int(_math.ceil(max_iter))
OverflowError: cannot convert float infinity to integer

I'm using Python 3.4.0, scipy 0.14.1, numpy 1.9.1 and mdptoolbox 5af7d2a.

Absolute difference is unnecessarily space-inefficient

absdiff = (_np.abs(matrix.sum(axis=1) - _np.ones(matrix.shape[0])))

Due to either a bug or a feature in numpy (I'm running numpy 1.14.0 and Python 3.5.2), the above line returns a matrix of differences instead of a vector. This results in a memory error for relatively small matrices (in my case matrix.shape = (97200, 97200) and I have 40GB RAM). I think the fix is easy: just force the second term to be a proper column vector:

(_np.abs(matrix.sum(axis=1) - _np.ones((matrix.shape[0],1))))
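
A small demonstration of the behaviour being described (a sketch; it reproduces the shape blow-up when the transition matrix is a scipy sparse matrix or numpy.matrix, whose .sum(axis=1) returns an (n, 1) matrix rather than a 1-D vector):

```python
import numpy as np
from scipy.sparse import csr_matrix

matrix = csr_matrix(np.eye(4))

# (4, 1) matrix minus a length-4 vector broadcasts to a (4, 4) matrix ...
bad = np.abs(matrix.sum(axis=1) - np.ones(matrix.shape[0]))
# ... while subtracting a proper (4, 1) column keeps it (4, 1).
good = np.abs(matrix.sum(axis=1) - np.ones((matrix.shape[0], 1)))

print(bad.shape)    # (4, 4) -- quadratic memory for large n
print(good.shape)   # (4, 1)
```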

Sparse rewards are converted to dense arrays

Due to the way that scipy works, if the rewards are given in sparse format they are at the moment converted to dense arrays. Specifically sparse.multiply(dense) will result in a dense numpy.matrix and sparse.sum(1) will result in a numpy.matrix.

It may be desirable to have a way of keeping them sparse.

Solution for basic grid world example

I'm trying to model the canonical grid world example, but I'm having a lot of trouble figuring out how exactly the inputs should be defined. I keep getting action/state count exceptions or math domain exceptions.

What are the proper P and R inputs for the following problem?

[figure: grid world MDP diagram]
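
For illustration only (the figure is not reproduced here, so this is a generic sketch rather than the exact grid in the image): a deterministic 2 x 2 grid world with four actions and an absorbing goal state, showing the shapes pymdptoolbox expects, i.e. P of shape (A, S, S) with every row summing to 1 and R of shape (S, A).

```python
import numpy as np
import mdptoolbox.mdp

# States numbered 0 1 / 2 3, goal = state 3 (absorbing).
# Actions: 0=up, 1=down, 2=left, 3=right; moves off the grid stay in place.
# Reward 1 for any action taken in the goal state, 0 elsewhere (one simple
# convention among several).
S, A, GOAL = 4, 4, 3

def move(state, action):
    row, col = divmod(state, 2)
    if action == 0:   row = max(row - 1, 0)
    elif action == 1: row = min(row + 1, 1)
    elif action == 2: col = max(col - 1, 0)
    elif action == 3: col = min(col + 1, 1)
    return row * 2 + col

P = np.zeros((A, S, S))
R = np.zeros((S, A))
for a in range(A):
    for s in range(S):
        if s == GOAL:
            P[a, s, s] = 1.0            # absorbing goal state
            R[s, a] = 1.0
        else:
            P[a, s, move(s, a)] = 1.0   # deterministic transition

vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
vi.run()
print(vi.policy)
```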
