tensorforce / tensorforce
Tensorforce: a TensorFlow library for applied reinforcement learning
License: Apache License 2.0
Using Tensorforce as a library from an application that itself uses the logging module creates a conflict on logging handlers (multiple handlers are added).
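A minimal sketch of the kind of handler guard that avoids this (illustrative, not the current Tensorforce code):

import logging

logger = logging.getLogger('tensorforce')
# Only attach a handler if none is configured yet, so the library does not
# stack additional handlers on top of the application's own setup:
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    logger.addHandler(handler)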
Fresh install. Command from http://tensorforce.readthedocs.io/en/latest/#quick-start:
python examples/openai_gym.py CartPole-v0 -a TRPOAgent -c examples/configs/trpo_agent.json -n examples/configs/trpo_network.json
Gives:
[2017-07-24 22:51:58,560] Making new env: CartPole-v0
Traceback (most recent call last):
File "examples/openai_gym.py", line 121, in <module>
main()
File "examples/openai_gym.py", line 70, in main
agent = agents[args.agent](config=agent_config)
File "/home/tensorforce/tensorforce/agents/batch_agent.py", line 50, in __init__
super(BatchAgent, self).__init__(config)
File "/home/tensorforce/tensorforce/agents/agent.py", line 143, in __init__
self.model = self.__class__.model(config)
File "/home/tensorforce/tensorforce/models/trpo_model.py", line 54, in __init__
super(TRPOModel, self).__init__(config)
File "/home/tensorforce/tensorforce/models/policy_gradient_model.py", line 81, in __init__
self.baseline = Baseline.from_config(config=config.baseline)
File "/home/tensorforce/tensorforce/core/baselines/baseline.py", line 43, in from_config
predefined=tensorforce.core.baselines.baselines
File "/home/tensorforce/tensorforce/util.py", line 123, in get_object
return obj(**full_kwargs)
TypeError: __init__() takes at least 2 arguments (1 given)
obj (from util.py:119) is <class 'tensorforce.core.baselines.mlp.MLPBaseline'>, kwargs is None, and full_kwargs is {}. MLPBaseline's __init__ indeed takes at least 2 arguments.
Other models should be able to be used with the distributed runner where sensible.
On Docker, this just hangs:
Step 7/8 : RUN pip install tensorforce[tf] -e .
---> Running in 55d5d05d7049
Obtaining file:///code/tensorforce
Hi,
Unless the goal is not to support TensorFlow with GPU, I would recommend moving the tensorflow requirement to extras_require. I have seen this pattern in both Sonnet and tensor2tensor.
For example:
from setuptools import setup

# Optional TensorFlow backends; extras names must not contain spaces:
extra_packages = {
    'tensorflow': ['tensorflow>=1.0.1'],
    'tensorflow-gpu': ['tensorflow-gpu>=1.0.1']
}

install_requires = [
    'numpy',
    'six',
    'scipy',
    'pillow',
    'pytest'
]

setup_requires = ['numpy', 'recommonmark', 'mistune']

setup(name='tensorforce',
      version='0.2',
      description='Reinforcement learning for TensorFlow',
      url='http://github.com/reinforceio/tensorforce',
      author='reinforce.io',
      author_email='[email protected]',
      license='Apache 2.0',
      packages=['tensorforce'],
      install_requires=install_requires,
      extras_require=extra_packages,  # the setup() keyword is extras_require, not extra_requires
      setup_requires=setup_requires,
      zip_safe=False)
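With extras in place, users can then pick a backend at install time, e.g. pip install tensorforce[tensorflow] or pip install tensorforce[tensorflow-gpu].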
Regards,
Pedro
P.S. I will spend my weekend understanding Tensorforce. Great work!
Currently, not all iterables seem to work in agent.act(); e.g. a tuple is expected, and an ndarray of the correct shape can cause a TensorFlow freeze without any error message.
Act needs to either:
Currently, a configuration contains additional default and internal values after the initialization of an agent. This should not be the case; instead, the agent could, for instance, create a copy of the configuration before modifying it.
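A minimal sketch of the copy-first approach (illustrative; the Configuration class and its default() call are taken from the tracebacks on this page):

import copy

class Agent(object):
    def __init__(self, config):
        # Work on a private copy so the caller's configuration object is not
        # polluted with defaults and internal values:
        config = copy.deepcopy(config)
        config.default(self.__class__.default_config)
        self.config = config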
It's not the linear decay based on the remaining timesteps that I was expecting.
self.epsilon -= ((self.epsilon - self.epsilon_final) / self.epsilon_timesteps) * timestep
So over 100 steps it takes about 30-40 steps to get "close" to epsilon_final.
There is potential for an optional decay mode.
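For reference, a decay that is linear in the elapsed fraction of epsilon_timesteps might look like this (a sketch of the expected behavior; epsilon_initial is an assumed attribute for the starting value):

def current_epsilon(self, timestep):
    # Linear interpolation from epsilon_initial to epsilon_final,
    # reaching epsilon_final exactly at epsilon_timesteps:
    fraction = min(float(timestep) / self.epsilon_timesteps, 1.0)
    return self.epsilon_initial + fraction * (self.epsilon_final - self.epsilon_initial)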
The Agent API needs to allow passing in a batch of experiences to update from, for use cases where data is collected in a way where feeding it sample by sample to Tensorforce isn't needed or creates too much I/O.
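Something along these lines, perhaps (a purely hypothetical signature sketching the requested API; no such method exists yet):

# Hypothetical batched counterpart to observe():
agent.import_observations(
    states=states,        # N states
    actions=actions,      # N actions
    rewards=rewards,      # N rewards
    terminals=terminals   # N terminal flags
)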
Currently, Replay.get_batch returns the samples as a contiguous range of the original sequence of experiences. I'd like to get batch data in which each sample is picked from memory at random, to remove sampling bias. I would like to add an option to change the sampling strategy in Replay.get_batch.
See #59
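A minimal sketch of the uniform-random option proposed above (assuming self.observations holds the stored experiences, as in the prioritized replay traceback further down):

import random

def get_batch(self, batch_size):
    # Sample uniformly at random instead of returning a contiguous range,
    # breaking the correlation between consecutive experiences:
    return random.sample(self.observations, batch_size)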
In naf_model.py, lines 71-79:
if num_actions > 1:
    offset = num_actions
    l_columns = list()
    for zeros, size in enumerate(xrange(num_actions - 1, 0, -1), 1):
        column = tf.pad(l_entries[:, offset: offset + size], ((0, 0), (zeros, 0)))
        l_columns.append(column)
        offset += size
    l_matrix += tf.stack(l_columns, 1)
I believe the number of columns given to tf.stack is incorrect (one too few). I think there needs to be an extra column, e.g. by adding something like:
l_columns.append(tf.zeros_like(l_columns[0]))
Is this correct?
The error I'm getting is:
ValueError: Dimensions must be equal, but are 59 and 58 for 'training_outputs/add' (op: 'Add') with input shapes: [?,59,59], [?,58,59].
from the line
l_matrix += tf.stack(l_columns, 1)
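Putting the reporter's suggestion in context, the loop would become (whether the zero column belongs at the end or the start of the list should be double-checked against the intended triangular layout):

if num_actions > 1:
    offset = num_actions
    l_columns = list()
    for zeros, size in enumerate(xrange(num_actions - 1, 0, -1), 1):
        column = tf.pad(l_entries[:, offset: offset + size], ((0, 0), (zeros, 0)))
        l_columns.append(column)
        offset += size
    # Proposed fix: one extra zero column so tf.stack yields num_actions columns:
    l_columns.append(tf.zeros_like(l_columns[0]))
    l_matrix += tf.stack(l_columns, 1)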
Since it's very easy to forget to update example configs after refactorings, the quickstart test needs to be included in CI.
The runner should probably call finalize on the graph, but if the runner is not used, we should also call finalize internally somewhere.
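A minimal sketch of the internal call (tf.Graph.finalize is standard TensorFlow; where exactly to place it is the open question):

import tensorflow as tf

# Finalizing makes the graph read-only, so any accidental op creation during
# training raises an error instead of silently growing the graph:
tf.get_default_graph().finalize()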
From #26:
'Another thing I noticed in continuous state spaces is that the standard deviation of the Gaussian (exploration) noise is not parameterized. That seems like a bad default for this kind of on-policy method. It's an easy fix since the required code in the Gaussian class is just commented out, but enabling this does not seem possible without low-level adjustments at the moment.'
Hi,
first of all, thanks for the hard work that is going into this project. You are saving me a ton of work.
Second, I encountered some strange behavior when trying to define an agent with multiple continuous actions. All code below was run in a Jupyter notebook with Anaconda and Python 3.5:
# Configuration, adapted from config in readme
config = Configuration(
    batch_size=100,
    states=dict(shape=(4,), type='float'),
    actions=dict(
        opt_a=dict(continuous=True, min_value=0, max_value=2),
        opt_b=dict(continuous=True, min_value=0, max_value=2)
    ),
    network=layered_network_builder([dict(type='dense', size=50), dict(type='dense', size=50)])
)
# Create a TRPO agent
agent = TRPOAgent(config=config)
This code crashes with the trace:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-70-b10cf4edc1d7> in <module>()
1 # Create a VPGA agent
----> 2 agent = TRPOAgent(config=config)
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/agents/batch_agent.py in __init__(self, config)
48 def __init__(self, config):
49 config.default(BatchAgent.default_config)
---> 50 super(BatchAgent, self).__init__(config)
51 self.batch_size = config.batch_size
52 self.batch = None
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/agents/agent.py in __init__(self, config)
141 self.actions_config = config.actions
142
--> 143 self.model = self.__class__.model(config)
144
145 self.episode = 0
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/trpo_model.py in __init__(self, config)
52 def __init__(self, config):
53 config.default(TRPOModel.default_config)
---> 54 super(TRPOModel, self).__init__(config)
55
56 self.override_line_search = config.override_line_search
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/policy_gradient_model.py in __init__(self, config)
81 self.baseline = Baseline.from_config(config=config.baseline)
82
---> 83 super(PolicyGradientModel, self).__init__(config)
84
85 # advantage estimation
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/model.py in __init__(self, config)
118 scope = scope_context.__enter__()
119
--> 120 self.create_tf_operations(config)
121
122 if config.distributed:
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/trpo_model.py in create_tf_operations(self, config)
117
118 gradients = tf.gradients(fixed_kl_divergence, variables)
--> 119 gradient_vector_product = [tf.reduce_sum(g * t) for (g, t) in zip(gradients, tangents)]
120
121 self.flat_variable_helper = FlatVarHelper(variables)
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/trpo_model.py in <listcomp>(.0)
117
118 gradients = tf.gradients(fixed_kl_divergence, variables)
--> 119 gradient_vector_product = [tf.reduce_sum(g * t) for (g, t) in zip(gradients, tangents)]
120
121 self.flat_variable_helper = FlatVarHelper(variables)
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py in r_binary_op_wrapper(y, x)
895 def r_binary_op_wrapper(y, x):
896 with ops.name_scope(None, op_name, [x, y]) as name:
--> 897 x = ops.convert_to_tensor(x, dtype=y.dtype.base_dtype, name="x")
898 return func(x, y, name=name)
899
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, preferred_dtype)
649 name=name,
650 preferred_dtype=preferred_dtype,
--> 651 as_ref=False)
652
653
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)
714
715 if ret is None:
--> 716 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
717
718 if ret is NotImplemented:
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
174 as_ref=False):
175 _ = as_ref
--> 176 return constant(v, dtype=dtype, name=name)
177
178
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name, verify_shape)
163 tensor_value = attr_value_pb2.AttrValue()
164 tensor_value.tensor.CopyFrom(
--> 165 tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
166 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
167 const_tensor = g.create_op(
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape)
358 else:
359 if values is None:
--> 360 raise ValueError("None values not supported.")
361 # if dtype is provided, forces numpy array to be the type
362 # provided if possible.
ValueError: None values not supported.
I tried different agents and encountered another strange behavior:
# Create a VPG agent
agent = VPGAgent(config=config)
state = np.array([1, 2, 3, 4])
agent.act(state)
Crashes with:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-73-565d0bd87882> in <module>()
----> 1 agent.act(state)
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/agents/agent.py in act(self, state, deterministic)
194
195 # model action
--> 196 self.current_action, self.next_internal = self.model.get_action(state=self.current_state, internal=self.current_internal, deterministic=deterministic)
197
198 # exploration
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/model.py in get_action(self, state, internal, deterministic)
219 fetches.update({n: internal_output for n, internal_output in enumerate(self.internal_outputs)})
220
--> 221 feed_dict = {state_input: (state[name],) for name, state_input in self.state.items()}
222 feed_dict.update({internal_input: (internal[n],) for n, internal_input in enumerate(self.internal_inputs)})
223 feed_dict[self.deterministic] = deterministic
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/model.py in <dictcomp>(.0)
219 fetches.update({n: internal_output for n, internal_output in enumerate(self.internal_outputs)})
220
--> 221 feed_dict = {state_input: (state[name],) for name, state_input in self.state.items()}
222 feed_dict.update({internal_input: (internal[n],) for n, internal_input in enumerate(self.internal_inputs)})
223 feed_dict[self.deterministic] = deterministic
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
But when I redefine config, that is, I run
# Configuration, adapted from config in readme
config = Configuration(
    batch_size=100,
    states=dict(shape=(4,), type='float'),
    actions=dict(
        opt_a=dict(continuous=True, min_value=0, max_value=2),
        opt_b=dict(continuous=True, min_value=0, max_value=2)
    ),
    network=layered_network_builder([dict(type='dense', size=50), dict(type='dense', size=50)])
)
again, it does not crash, but it occasionally outputs negative values for actions, although min_value=0:
{'opt_a': 0.28892395, 'opt_b': -0.10657883}
The PPO agent displays the same behavior as the VPG agent.
I have tried this with many slightly different configurations; it seems to be a consistent issue.
Please let me know if you need any more code / info / data to reproduce the issue. Kindly, Jannes
Hi,
Can you share some plans about the roadmap and which algorithms will be added? In particular, are there any plans for a DDPG implementation with recent improvements: https://arxiv.org/abs/1704.03073 and https://arxiv.org/abs/1707.01495 ?
Hi, I was wondering if there's currently a straightforward way to load a saved policy and run it in an environment without training updates, or do I have to write my own runner for this purpose? Thanks.
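A minimal sketch of such an evaluation loop, assuming a load_model counterpart to the agent.save_model call visible in a traceback further down this page (check the actual API):

agent.load_model(save_path)  # assumed counterpart to agent.save_model(path)
state = environment.reset()
terminal = False
while not terminal:
    # Act deterministically and never call observe(), so no updates happen:
    action = agent.act(state, deterministic=True)
    state, reward, terminal = environment.execute(action)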
Hi, maybe I'm missing something, but where do you save the various training metrics (returns, entropy, etc.), and is there a mechanism to save the trained model, or do we have to implement that? Thanks!
After running python examples/quickstart.py
(3000 episodes), the average reward from the last 100 episodes is only 33.38. I would expect it to be close to the maximum, 200, especially since it reached that a couple of times before, e.g. around episode 1469; however, it later deteriorates.
I also tried running it with the provided command:
python examples/openai_gym.py CartPole-v0 -a TRPOAgent -c examples/configs/trpo_cartpole.json -n examples/configs/trpo_cartpole_network.json
However, the results were also unsatisfactory:
[2017-07-24 23:58:58,363] Finished episode 4050 after 61 timesteps
[2017-07-24 23:58:58,363] Episode reward: 61.0
[2017-07-24 23:58:58,363] Average of last 500 rewards: 63.346
[2017-07-24 23:58:58,364] Average of last 100 rewards: 62.33
When trying to run the TRPO agent on BipedalWalker, as follows, I run into:
foo$ PYTHONPATH=. python examples/openai_gym.py BipedalWalker-v2 -D -a TRPOAgent -c examples/configs/trpo_agent.json -n examples/configs/trpo_network.json
....
File "/../tensorforce/tensorforce/environments/openai_gym.py", line 67, in execute
state, reward, terminal, _ = self.gym.step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 99, in step
return self._step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/wrappers/time_limit.py", line 36, in _step
observation, reward, done, info = self.env.step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 99, in step
return self._step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/envs/box2d/bipedal_walker.py", line 372, in _step
self.joints[1].motorSpeed = float(SPEED_KNEE * np.sign(action[1]))
IndexError: list index out of range
Looking at OpenAIGym.actions, it doesn't seem to unravel that environment's Box(4) action space as wanted. Am I just failing to configure the agent as required, or are such action spaces not handled right now?
[egor@host tensorforce]$ python examples/openai_gym.py CartPole-v0 -a TRPOAgent -c examples/configs/trpo_cartpole.json -n examples/configs/trpo_cartpole_network.json -s /home/egor/Software/tensorforce/examples/output
[2017-07-19 00:35:06,206] Making new env: CartPole-v0
2017-07-19 00:35:06.922073: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922107: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922116: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922128: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922135: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
[2017-07-19 00:35:06,977] Starting TRPOAgent for Environment 'OpenAIGym(CartPole-v0)'
[2017-07-19 00:35:08,600] Finished episode 50 after 12 timesteps
[2017-07-19 00:35:08,600] Episode reward: 12.0
[2017-07-19 00:35:08,600] Average of last 500 rewards: 2.332
[2017-07-19 00:35:08,600] Average of last 100 rewards: 11.66
Saving agent after episode 100
Traceback (most recent call last):
File "examples/openai_gym.py", line 121, in <module>
main()
File "examples/openai_gym.py", line 112, in main
runner.run(args.episodes, args.max_timesteps, episode_finished=episode_finished)
File "/home/egor/Software/tensorforce/tensorforce/execution/runner.py", line 158, in run
self.agent.save_model(self.save_path)
File "/home/egor/Software/tensorforce/tensorforce/agents/agent.py", line 238, in save_model
self.model.save_model(path)
File "/home/egor/Software/tensorforce/tensorforce/models/model.py", line 274, in save_model
self.saver.save(self.session, path)
AttributeError: 'NoneType' object has no attribute 'save'
The old example used a deprecated API and has been deleted; a new example is needed here.
Traceback (most recent call last):
File "examples/openai_gym.py", line 121, in <module>
main()
File "examples/openai_gym.py", line 112, in main
runner.run(args.episodes, args.max_timesteps, episode_finished=episode_finished)
File "/home/yellow/work/tf/tensorforce/tensorforce/execution/runner.py", line 144, in run
self.agent.observe(reward=reward, terminal=terminal)
File "/home/yellow/work/tf/tensorforce/tensorforce/agents/dqn_agent.py", line 94, in observe
super(DQNAgent, self).observe(reward=reward, terminal=terminal)
File "/home/yellow/work/tf/tensorforce/tensorforce/agents/memory_agent.py", line 84, in observe
internal=self.current_internal
File "/home/yellow/work/tf/tensorforce/tensorforce/core/memories/prioritized_replay.py", line 55, in add_observation
priority, _ = self.observations.pop(self.positive_priority_index)
IndexError: pop index out of range
https://github.com/reinforceio/tensorforce/blob/master/docs/m2r.py#L513
$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics
./docs/m2r.py:513:43: F821 undefined name 'SafeString'
(self.name, SafeString(path)))
^
./docs/m2r.py:516:43: F821 undefined name 'ErrorString'
(self.name, ErrorString(error)))
^
./docs/m2r.py:523:43: F821 undefined name 'ErrorString'
(self.name, ErrorString(error)))
^
TRPO occasionally fails to produce a robust update, with the Lagrange multiplier being None; need to check whether the gradient computation can produce None.
There seems to be a problem with some gradients being undefined in the case of multiple (continuous) actions for TRPO.
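A common guard for undefined gradients (not necessarily the right fix here) is to substitute zeros before they propagate, around the tf.gradients call shown in the traceback above:

import tensorflow as tf

gradients = tf.gradients(fixed_kl_divergence, variables)
# tf.gradients returns None for variables that do not influence the target;
# replacing None with zeros keeps later products like g * t well-defined:
gradients = [g if g is not None else tf.zeros_like(v)
             for g, v in zip(gradients, variables)]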
If I create a distribution with Gaussian(distribution=(0, 0.1)), the parameters (0, 0.1) are ignored, and instead the result from Gaussian.create_tf_operations is used. At the very least I would expect the parameters that I pass to Gaussian to be used as initial guesses for the parameterization.
In general, the initial variance of the policy cannot be specified right now. In practice that's an important tuning parameter. The easiest way to do this might be to allow users to pass an instance of the distribution as part of the config, rather than the class.
Lastly, the sigmoid rescaling of the policy within Gaussian seems hacky. What if I already provide a custom network that has properly scaled actions? In that case I wouldn't want another sigmoid nonlinearity to be applied. I think this would fit better into the network_builder.
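One way to make the initial variance configurable is a trainable log-std variable initialized from the user's value (a sketch, not the current Gaussian implementation):

import numpy as np
import tensorflow as tf

initial_std = 0.1  # user-specified initial exploration noise
log_std = tf.Variable(initial_value=np.log(initial_std), dtype=tf.float32, name='log_std')
std = tf.exp(log_std)  # stays positive and is learned during training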
When I try to run the simple_q_agent.py script, I get the following error:
File "/Users/aidanrocke/Desktop/open_ai_solutions/tensor_force/examples/simple_dqn.py", line 214, in main
runner.run(max_episodes, max_timesteps, episode_finished=episode_finished)
File "/Users/aidanrocke/tensorforce/tensorforce/execution/runner.py", line 58, in run
action = self.agent.get_action(processed_state, self.episode)
File "/Users/aidanrocke/tensorforce/tensorforce/agents/memory_agent.py", line 94, in get_action
action = self.model.get_action(*args, **kwargs)
AttributeError: 'NoneType' object has no attribute 'get_action'
Some default configurations are in separate Python files, while some are still in the models/agents. This needs to be cleaned up.
Currently, only Gaussian and Categorical are possible. This new feature, however, requires specifying the distribution per action somewhere.
Perhaps first_update could be copied into config['exploration']. It would be used like this:
https://github.com/reinforceio/tensorforce/pull/56/files#diff-3a20a353542fac38371e6c75dccfe10fR31
Similarly for EpsilonDecay:
self.epsilon -= ((self.epsilon - self.epsilon_final) / (self.epsilon_timesteps - self.first_update)) * (timestep - self.first_update)
edit: With a min and/or max to ensure timestep - first_update doesn't throw things off; see the sketch below.
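A sketch of the clamped version (epsilon_initial is an assumed attribute for the starting value):

def current_epsilon(self, timestep):
    # Clamp so nothing decays before first_update and epsilon never
    # overshoots epsilon_final after epsilon_timesteps:
    steps = max(0, min(timestep - self.first_update,
                       self.epsilon_timesteps - self.first_update))
    fraction = float(steps) / (self.epsilon_timesteps - self.first_update)
    return self.epsilon_initial + fraction * (self.epsilon_final - self.epsilon_initial)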
Currently it is possible to define min_value and max_value for continuous actions, but these values are never actually used. Part of the problem is that the so far only continuous distribution, Gaussian, does not naturally bound its possible samples.
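Until bounds are enforced internally, one workaround is clipping on the caller's side (a sketch; min_value and max_value stand for the bounds from the action config):

import numpy as np

action = agent.act(state)
# Clip each continuous action into its configured [min_value, max_value] range:
action = {name: np.clip(value, min_value, max_value) for name, value in action.items()}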
Add an example for DeepMind Lab.
Ideally, we would want to allow specifying float precisions everywhere. Currently, we only use this in a few classes, and inconsistently.
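A central helper is one way to keep precisions consistent (hypothetical utility, not existing code):

import tensorflow as tf

def tf_dtype(precision='float32'):
    # Single place to resolve the float precision used across all models:
    return dict(float16=tf.float16, float32=tf.float32, float64=tf.float64)[precision]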
get_action should optionally return action means and stds
Currently, memories only support adding one observation at a time. This is impractical for importing larger amounts of data.
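A minimal bulk-import sketch built on the existing per-sample call (keyword names assumed to mirror the add_observation call in the prioritized replay traceback above):

def add_observations(memory, experiences):
    # Thin wrapper; a real implementation would batch the underlying inserts:
    for state, action, reward, terminal, internal in experiences:
        memory.add_observation(state=state, action=action, reward=reward,
                               terminal=terminal, internal=internal)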
Move configs to a separate folder and create an overview of configs corresponding to specific environments/papers.