tensorforce / tensorforce
Tensorforce: a TensorFlow library for applied reinforcement learning
License: Apache License 2.0
Using Tensorforce as a library from an application that itself uses the logging module creates a conflict on logging handlers (multiple handlers are added).
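A minimal sketch of the kind of handler guard that avoids this (illustrative, not the current Tensorforce code):

import logging

logger = logging.getLogger('tensorforce')
# Only attach a handler if none is configured yet, so the library does not
# stack additional handlers on top of the application's own setup:
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    logger.addHandler(handler)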
Fresh install. Command from http://tensorforce.readthedocs.io/en/latest/#quick-start:
python examples/openai_gym.py CartPole-v0 -a TRPOAgent -c examples/configs/trpo_agent.json -n examples/configs/trpo_network.json
Gives:
[2017-07-24 22:51:58,560] Making new env: CartPole-v0
Traceback (most recent call last):
File "examples/openai_gym.py", line 121, in <module>
main()
File "examples/openai_gym.py", line 70, in main
agent = agents[args.agent](config=agent_config)
File "/home/tensorforce/tensorforce/agents/batch_agent.py", line 50, in __init__
super(BatchAgent, self).__init__(config)
File "/home/tensorforce/tensorforce/agents/agent.py", line 143, in __init__
self.model = self.__class__.model(config)
File "/home/tensorforce/tensorforce/models/trpo_model.py", line 54, in __init__
super(TRPOModel, self).__init__(config)
File "/home/tensorforce/tensorforce/models/policy_gradient_model.py", line 81, in __init__
self.baseline = Baseline.from_config(config=config.baseline)
File "/home/tensorforce/tensorforce/core/baselines/baseline.py", line 43, in from_config
predefined=tensorforce.core.baselines.baselines
File "/home/tensorforce/tensorforce/util.py", line 123, in get_object
return obj(**full_kwargs)
TypeError: __init__() takes at least 2 arguments (1 given)
obj (from util.py:119) is <class 'tensorforce.core.baselines.mlp.MLPBaseline'>, kwargs is None, and full_kwargs is {}. MLPBaseline's __init__ indeed takes at least 2 arguments.
Other models should be able to be used with the distributed runner where sensible.
On Docker, this just hangs:
Step 7/8 : RUN pip install tensorforce[tf] -e .
---> Running in 55d5d05d7049
Obtaining file:///code/tensorforce
Hi,
Unless the goal is not to support TensorFlow with GPU, I would recommend moving the tensorflow requirement to extras_require. I have seen this pattern in both Sonnet and tensor2tensor.
For example:
from setuptools import setup

# Optional TensorFlow backends; extras names must not contain spaces:
extra_packages = {
    'tensorflow': ['tensorflow>=1.0.1'],
    'tensorflow-gpu': ['tensorflow-gpu>=1.0.1']
}

install_requires = [
    'numpy',
    'six',
    'scipy',
    'pillow',
    'pytest'
]

setup_requires = ['numpy', 'recommonmark', 'mistune']

setup(name='tensorforce',
      version='0.2',
      description='Reinforcement learning for TensorFlow',
      url='http://github.com/reinforceio/tensorforce',
      author='reinforce.io',
      author_email='[email protected]',
      license='Apache 2.0',
      packages=['tensorforce'],
      install_requires=install_requires,
      extras_require=extra_packages,  # the setup() keyword is extras_require, not extra_requires
      setup_requires=setup_requires,
      zip_safe=False)
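With extras in place, users can then pick a backend at install time, e.g. pip install tensorforce[tensorflow] or pip install tensorforce[tensorflow-gpu].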
Regards,
Pedro
P.S. I will spend my weekend understanding Tensorforce. Great work!
Currently, not all iterables seem to work in agent.act(); e.g. a tuple is expected, and an ndarray of the correct shape can cause a TensorFlow freeze without any error message.
Act needs to either:
Currently, a configuration contains additional default and internal values after the initialization of an agent. This should not be the case; instead, the agent could, for instance, create a copy of the configuration before modifying it.
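A minimal sketch of the copy-first approach (illustrative; the Configuration class and its default() call are taken from the tracebacks on this page):

import copy

class Agent(object):
    def __init__(self, config):
        # Work on a private copy so the caller's configuration object is not
        # polluted with defaults and internal values:
        config = copy.deepcopy(config)
        config.default(self.__class__.default_config)
        self.config = config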
It's not the linear decay based on the remaining timesteps that I was expecting.
self.epsilon -= ((self.epsilon - self.epsilon_final) / self.epsilon_timesteps) * timestep
So over 100 steps it takes about 30-40 steps to get "close" to epsilon_final.
There is potential for an optional decay mode.
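For reference, a decay that is linear in the elapsed fraction of epsilon_timesteps might look like this (a sketch of the expected behavior; epsilon_initial is an assumed attribute for the starting value):

def current_epsilon(self, timestep):
    # Linear interpolation from epsilon_initial to epsilon_final,
    # reaching epsilon_final exactly at epsilon_timesteps:
    fraction = min(float(timestep) / self.epsilon_timesteps, 1.0)
    return self.epsilon_initial + fraction * (self.epsilon_final - self.epsilon_initial)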
The Agent API needs to allow passing in a batch of experiences to update from, for use cases where data is collected in a way where feeding it sample by sample to Tensorforce isn't needed or creates too much I/O.
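Something along these lines, perhaps (a purely hypothetical signature sketching the requested API; no such method exists yet):

# Hypothetical batched counterpart to observe():
agent.import_observations(
    states=states,        # N states
    actions=actions,      # N actions
    rewards=rewards,      # N rewards
    terminals=terminals   # N terminal flags
)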
Currently, Replay.get_batch returns the samples as a contiguous range of the original sequence of experiences. I'd like to get batch data in which each sample is picked from memory at random, to remove sampling bias. I would like to add an option to change the sampling strategy in Replay.get_batch.
See #59
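A minimal sketch of the uniform-random option proposed above (assuming self.observations holds the stored experiences, as in the prioritized replay traceback further down):

import random

def get_batch(self, batch_size):
    # Sample uniformly at random instead of returning a contiguous range,
    # breaking the correlation between consecutive experiences:
    return random.sample(self.observations, batch_size)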
In naf_model.py, lines 71-79:
if num_actions > 1:
    offset = num_actions
    l_columns = list()
    for zeros, size in enumerate(xrange(num_actions - 1, 0, -1), 1):
        column = tf.pad(l_entries[:, offset: offset + size], ((0, 0), (zeros, 0)))
        l_columns.append(column)
        offset += size
    l_matrix += tf.stack(l_columns, 1)
I believe the number of columns given to tf.stack is incorrect (one too few). I think there needs to be an extra column, e.g. by adding something like:
l_columns.append(tf.zeros_like(l_columns[0]))
Is this correct?
The error I'm getting is:
ValueError: Dimensions must be equal, but are 59 and 58 for 'training_outputs/add' (op: 'Add') with input shapes: [?,59,59], [?,58,59].
from the line
l_matrix += tf.stack(l_columns, 1)
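Putting the reporter's suggestion in context, the loop would become (whether the zero column belongs at the end or the start of the list should be double-checked against the intended triangular layout):

if num_actions > 1:
    offset = num_actions
    l_columns = list()
    for zeros, size in enumerate(xrange(num_actions - 1, 0, -1), 1):
        column = tf.pad(l_entries[:, offset: offset + size], ((0, 0), (zeros, 0)))
        l_columns.append(column)
        offset += size
    # Proposed fix: one extra zero column so tf.stack yields num_actions columns:
    l_columns.append(tf.zeros_like(l_columns[0]))
    l_matrix += tf.stack(l_columns, 1)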
Since it's very easy to forget to update example configs after refactorings, the quickstart test needs to be included in CI.
The runner should probably call finalize on the graph, but if the runner is not used, we should also call finalize internally somewhere.
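A minimal sketch of the internal call (tf.Graph.finalize is standard TensorFlow; where exactly to place it is the open question):

import tensorflow as tf

# Finalizing makes the graph read-only, so any accidental op creation during
# training raises an error instead of silently growing the graph:
tf.get_default_graph().finalize()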
From #26:
'Another thing I noticed in continuous state spaces is that the standard deviation of the Gaussian (exploration) noise is not parameterized. That seems like a bad default for this kind of on-policy method. It's an easy fix since the required code in the Gaussian class is just commented out, but enabling this does not seem possible without low-level adjustments at the moment.'
Hi,
first of all, thanks for the hard work that is going into this project. You are saving me a ton of work.
Second, I encountered some strange behavior when trying to define an agent with multiple continuous actions. All code below was run in a Jupyter notebook with Anaconda and Python 3.5:
# Configuration, adapted from config in readme
config = Configuration(
    batch_size=100,
    states=dict(shape=(4,), type='float'),
    actions=dict(
        opt_a=dict(continuous=True, min_value=0, max_value=2),
        opt_b=dict(continuous=True, min_value=0, max_value=2)
    ),
    network=layered_network_builder([dict(type='dense', size=50), dict(type='dense', size=50)])
)
# Create a TRPO agent
agent = TRPOAgent(config=config)
This code crashes with the trace:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-70-b10cf4edc1d7> in <module>()
1 # Create a VPGA agent
----> 2 agent = TRPOAgent(config=config)
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/agents/batch_agent.py in __init__(self, config)
48 def __init__(self, config):
49 config.default(BatchAgent.default_config)
---> 50 super(BatchAgent, self).__init__(config)
51 self.batch_size = config.batch_size
52 self.batch = None
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/agents/agent.py in __init__(self, config)
141 self.actions_config = config.actions
142
--> 143 self.model = self.__class__.model(config)
144
145 self.episode = 0
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/trpo_model.py in __init__(self, config)
52 def __init__(self, config):
53 config.default(TRPOModel.default_config)
---> 54 super(TRPOModel, self).__init__(config)
55
56 self.override_line_search = config.override_line_search
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/policy_gradient_model.py in __init__(self, config)
81 self.baseline = Baseline.from_config(config=config.baseline)
82
---> 83 super(PolicyGradientModel, self).__init__(config)
84
85 # advantage estimation
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/model.py in __init__(self, config)
118 scope = scope_context.__enter__()
119
--> 120 self.create_tf_operations(config)
121
122 if config.distributed:
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/trpo_model.py in create_tf_operations(self, config)
117
118 gradients = tf.gradients(fixed_kl_divergence, variables)
--> 119 gradient_vector_product = [tf.reduce_sum(g * t) for (g, t) in zip(gradients, tangents)]
120
121 self.flat_variable_helper = FlatVarHelper(variables)
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/trpo_model.py in <listcomp>(.0)
117
118 gradients = tf.gradients(fixed_kl_divergence, variables)
--> 119 gradient_vector_product = [tf.reduce_sum(g * t) for (g, t) in zip(gradients, tangents)]
120
121 self.flat_variable_helper = FlatVarHelper(variables)
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py in r_binary_op_wrapper(y, x)
895 def r_binary_op_wrapper(y, x):
896 with ops.name_scope(None, op_name, [x, y]) as name:
--> 897 x = ops.convert_to_tensor(x, dtype=y.dtype.base_dtype, name="x")
898 return func(x, y, name=name)
899
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, preferred_dtype)
649 name=name,
650 preferred_dtype=preferred_dtype,
--> 651 as_ref=False)
652
653
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)
714
715 if ret is None:
--> 716 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
717
718 if ret is NotImplemented:
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
174 as_ref=False):
175 _ = as_ref
--> 176 return constant(v, dtype=dtype, name=name)
177
178
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name, verify_shape)
163 tensor_value = attr_value_pb2.AttrValue()
164 tensor_value.tensor.CopyFrom(
--> 165 tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
166 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
167 const_tensor = g.create_op(
/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape)
358 else:
359 if values is None:
--> 360 raise ValueError("None values not supported.")
361 # if dtype is provided, forces numpy array to be the type
362 # provided if possible.
ValueError: None values not supported.
I tried different agents and encountered another strange behavior:
# Create a VPG agent
agent = VPGAgent(config=config)
state = np.array([1, 2, 3, 4])
agent.act(state)
Crashes with:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-73-565d0bd87882> in <module>()
----> 1 agent.act(state)
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/agents/agent.py in act(self, state, deterministic)
194
195 # model action
--> 196 self.current_action, self.next_internal = self.model.get_action(state=self.current_state, internal=self.current_internal, deterministic=deterministic)
197
198 # exploration
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/model.py in get_action(self, state, internal, deterministic)
219 fetches.update({n: internal_output for n, internal_output in enumerate(self.internal_outputs)})
220
--> 221 feed_dict = {state_input: (state[name],) for name, state_input in self.state.items()}
222 feed_dict.update({internal_input: (internal[n],) for n, internal_input in enumerate(self.internal_inputs)})
223 feed_dict[self.deterministic] = deterministic
/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/model.py in <dictcomp>(.0)
219 fetches.update({n: internal_output for n, internal_output in enumerate(self.internal_outputs)})
220
--> 221 feed_dict = {state_input: (state[name],) for name, state_input in self.state.items()}
222 feed_dict.update({internal_input: (internal[n],) for n, internal_input in enumerate(self.internal_inputs)})
223 feed_dict[self.deterministic] = deterministic
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
But when I redefine config, that is, I run
# Configuration, adapted from config in readme
config = Configuration(
    batch_size=100,
    states=dict(shape=(4,), type='float'),
    actions=dict(
        opt_a=dict(continuous=True, min_value=0, max_value=2),
        opt_b=dict(continuous=True, min_value=0, max_value=2)
    ),
    network=layered_network_builder([dict(type='dense', size=50), dict(type='dense', size=50)])
)
again, it does not crash, but it occasionally outputs negative values for actions, although min_value=0:
{'opt_a': 0.28892395, 'opt_b': -0.10657883}
The PPO agent displays the same behavior as the VPG agent.
I have tried this with many slightly different configurations; it seems to be a consistent issue.
Please let me know if you need any more code / info / data to reproduce the issue. Kindly, Jannes
Hi,
Can you share some plans about the roadmap and which algorithms will be added? In particular, are there any plans for a DDPG implementation with recent improvements: https://arxiv.org/abs/1704.03073 and https://arxiv.org/abs/1707.01495 ?
Hi, I was wondering if there's currently a straightforward way to load a saved policy and run it in an environment without training updates, or do I have to write my own runner for this purpose? Thanks.
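A minimal sketch of such an evaluation loop, assuming a load_model counterpart to the agent.save_model call visible in a traceback further down this page (check the actual API):

agent.load_model(save_path)  # assumed counterpart to agent.save_model(path)
state = environment.reset()
terminal = False
while not terminal:
    # Act deterministically and never call observe(), so no updates happen:
    action = agent.act(state, deterministic=True)
    state, reward, terminal = environment.execute(action)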
Hi, maybe I'm missing something, but where do you save the various training metrics (returns, entropy, etc.), and is there a mechanism to save the trained model, or do we have to implement that? Thanks!
After running python examples/quickstart.py
(3000 episodes), the average reward from the last 100 episodes is only 33.38. I would expect it to be close to the maximum, 200, especially since it reached that a couple of times before, e.g. around episode 1469; however, it later deteriorates.
I also tried running it with the provided command:
python examples/openai_gym.py CartPole-v0 -a TRPOAgent -c examples/configs/trpo_cartpole.json -n examples/configs/trpo_cartpole_network.json
However, the results were also unsatisfactory:
[2017-07-24 23:58:58,363] Finished episode 4050 after 61 timesteps
[2017-07-24 23:58:58,363] Episode reward: 61.0
[2017-07-24 23:58:58,363] Average of last 500 rewards: 63.346
[2017-07-24 23:58:58,364] Average of last 100 rewards: 62.33
When trying to run the TRPO agent on BipedalWalker, as follows, I run into:
foo$ PYTHONPATH=. python examples/openai_gym.py BipedalWalker-v2 -D -a TRPOAgent -c examples/configs/trpo_agent.json -n examples/configs/trpo_network.json
....
File "/../tensorforce/tensorforce/environments/openai_gym.py", line 67, in execute
state, reward, terminal, _ = self.gym.step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 99, in step
return self._step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/wrappers/time_limit.py", line 36, in _step
observation, reward, done, info = self.env.step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 99, in step
return self._step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/envs/box2d/bipedal_walker.py", line 372, in _step
self.joints[1].motorSpeed = float(SPEED_KNEE * np.sign(action[1]))
IndexError: list index out of range
Looking at OpenAIGym.actions, it doesn't seem to unravel that environment's Box(4) action space as wanted. Am I just failing to configure the agent as required, or are such action spaces not handled right now?
[egor@host tensorforce]$ python examples/openai_gym.py CartPole-v0 -a TRPOAgent -c examples/configs/trpo_cartpole.json -n examples/configs/trpo_cartpole_network.json -s /home/egor/Software/tensorforce/examples/output
[2017-07-19 00:35:06,206] Making new env: CartPole-v0
2017-07-19 00:35:06.922073: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922107: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922116: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922128: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922135: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
[2017-07-19 00:35:06,977] Starting TRPOAgent for Environment 'OpenAIGym(CartPole-v0)'
[2017-07-19 00:35:08,600] Finished episode 50 after 12 timesteps
[2017-07-19 00:35:08,600] Episode reward: 12.0
[2017-07-19 00:35:08,600] Average of last 500 rewards: 2.332
[2017-07-19 00:35:08,600] Average of last 100 rewards: 11.66
Saving agent after episode 100
Traceback (most recent call last):
File "examples/openai_gym.py", line 121, in <module>
main()
File "examples/openai_gym.py", line 112, in main
runner.run(args.episodes, args.max_timesteps, episode_finished=episode_finished)
File "/home/egor/Software/tensorforce/tensorforce/execution/runner.py", line 158, in run
self.agent.save_model(self.save_path)
File "/home/egor/Software/tensorforce/tensorforce/agents/agent.py", line 238, in save_model
self.model.save_model(path)
File "/home/egor/Software/tensorforce/tensorforce/models/model.py", line 274, in save_model
self.saver.save(self.session, path)
AttributeError: 'NoneType' object has no attribute 'save'
The old example used a deprecated API and has been deleted; a new example is needed here.
Traceback (most recent call last):
File "examples/openai_gym.py", line 121, in <module>
main()
File "examples/openai_gym.py", line 112, in main
runner.run(args.episodes, args.max_timesteps, episode_finished=episode_finished)
File "/home/yellow/work/tf/tensorforce/tensorforce/execution/runner.py", line 144, in run
self.agent.observe(reward=reward, terminal=terminal)
File "/home/yellow/work/tf/tensorforce/tensorforce/agents/dqn_agent.py", line 94, in observe
super(DQNAgent, self).observe(reward=reward, terminal=terminal)
File "/home/yellow/work/tf/tensorforce/tensorforce/agents/memory_agent.py", line 84, in observe
internal=self.current_internal
File "/home/yellow/work/tf/tensorforce/tensorforce/core/memories/prioritized_replay.py", line 55, in add_observation
priority, _ = self.observations.pop(self.positive_priority_index)
IndexError: pop index out of range
https://github.com/reinforceio/tensorforce/blob/master/docs/m2r.py#L513
$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics
./docs/m2r.py:513:43: F821 undefined name 'SafeString'
(self.name, SafeString(path)))
^
./docs/m2r.py:516:43: F821 undefined name 'ErrorString'
(self.name, ErrorString(error)))
^
./docs/m2r.py:523:43: F821 undefined name 'ErrorString'
(self.name, ErrorString(error)))
^
TRPO occasionally fails to produce a robust update, with the Lagrange multiplier being None; need to check whether the gradient computation can produce None.
There seems to be a problem with some gradients being undefined in the case of multiple (continuous) actions for TRPO.
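A common guard for undefined gradients (not necessarily the right fix here) is to substitute zeros before they propagate, around the tf.gradients call shown in the traceback above:

import tensorflow as tf

gradients = tf.gradients(fixed_kl_divergence, variables)
# tf.gradients returns None for variables that do not influence the target;
# replacing None with zeros keeps later products like g * t well-defined:
gradients = [g if g is not None else tf.zeros_like(v)
             for g, v in zip(gradients, variables)]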
If I create a distribution with Gaussian(distribution=(0, 0.1)), the parameters (0, 0.1) are ignored, and instead the result from Gaussian.create_tf_operations is used. At the very least I would expect the parameters that I pass to Gaussian to be used as initial guesses for the parameterization.
In general, the initial variance of the policy cannot be specified right now. In practice that's an important tuning parameter. The easiest way to do this might be to allow users to pass an instance of the distribution as part of the config, rather than the class.
Lastly, the sigmoid rescaling of the policy within Gaussian seems hacky. What if I already provide a custom network that has properly scaled actions? In that case I wouldn't want another sigmoid nonlinearity to be applied. I think this would fit better into the network_builder.
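One way to make the initial variance configurable is a trainable log-std variable initialized from the user's value (a sketch, not the current Gaussian implementation):

import numpy as np
import tensorflow as tf

initial_std = 0.1  # user-specified initial exploration noise
log_std = tf.Variable(initial_value=np.log(initial_std), dtype=tf.float32, name='log_std')
std = tf.exp(log_std)  # stays positive and is learned during training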
When I try to run the simple_q_agent.py script, I get the following error:
File "/Users/aidanrocke/Desktop/open_ai_solutions/tensor_force/examples/simple_dqn.py", line 214, in main
runner.run(max_episodes, max_timesteps, episode_finished=episode_finished)
File "/Users/aidanrocke/tensorforce/tensorforce/execution/runner.py", line 58, in run
action = self.agent.get_action(processed_state, self.episode)
File "/Users/aidanrocke/tensorforce/tensorforce/agents/memory_agent.py", line 94, in get_action
action = self.model.get_action(*args, **kwargs)
AttributeError: 'NoneType' object has no attribute 'get_action'
Some default configurations are in separate Python files, while some are still in the models/agents. This needs to be cleaned up.
Currently, only Gaussian and Categorical are possible. This new feature, however, requires specifying the distribution per action somewhere.
Perhaps first_update could be copied into config['exploration']. It would be used like this:
https://github.com/reinforceio/tensorforce/pull/56/files#diff-3a20a353542fac38371e6c75dccfe10fR31
Similarly for EpsilonDecay:
self.epsilon -= ((self.epsilon - self.epsilon_final) / (self.epsilon_timesteps - self.first_update)) * (timestep - self.first_update)
edit: With a min and/or max to ensure timestep - first_update doesn't throw things off; see the sketch below.
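A sketch of the clamped version (epsilon_initial is an assumed attribute for the starting value):

def current_epsilon(self, timestep):
    # Clamp so nothing decays before first_update and epsilon never
    # overshoots epsilon_final after epsilon_timesteps:
    steps = max(0, min(timestep - self.first_update,
                       self.epsilon_timesteps - self.first_update))
    fraction = float(steps) / (self.epsilon_timesteps - self.first_update)
    return self.epsilon_initial + fraction * (self.epsilon_final - self.epsilon_initial)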
Currently it is possible to define min_value and max_value for continuous actions, but these values are never actually used. Part of the problem is that the so far only continuous distribution, Gaussian, does not naturally bound its possible samples.
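Until bounds are enforced internally, one workaround is clipping on the caller's side (a sketch; min_value and max_value stand for the bounds from the action config):

import numpy as np

action = agent.act(state)
# Clip each continuous action into its configured [min_value, max_value] range:
action = {name: np.clip(value, min_value, max_value) for name, value in action.items()}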
Add an example for DeepMind Lab.
Ideally, we would want to allow specifying float precisions everywhere. Currently, we only use this in a few classes, and inconsistently.
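A central helper is one way to keep precisions consistent (hypothetical utility, not existing code):

import tensorflow as tf

def tf_dtype(precision='float32'):
    # Single place to resolve the float precision used across all models:
    return dict(float16=tf.float16, float32=tf.float32, float64=tf.float64)[precision]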
get_action should optionally return action means and stds
Currently, memories only support adding one observation at a time. This is impractical for importing larger amounts of data.
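A minimal bulk-import sketch built on the existing per-sample call (keyword names assumed to mirror the add_observation call in the prioritized replay traceback above):

def add_observations(memory, experiences):
    # Thin wrapper; a real implementation would batch the underlying inserts:
    for state, action, reward, terminal, internal in experiences:
        memory.add_observation(state=state, action=action, reward=reward,
                               terminal=terminal, internal=internal)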
Move configs to a separate folder and create an overview of configs corresponding to specific environments/papers.