rlgraph / rlgraph
RLgraph: Modular computation graphs for deep reinforcement learning
License: Apache License 2.0
The current signature of define_graph_api is awkward and leads to PEP-8 warnings in all agents because we pass arbitrary other args through it. Splitting the list of sub-components also seems suboptimal; maybe we should pass them as a dict and look them up where needed. Long lines like

preprocessor, merger, memory, splitter, policy, exploration, loss_function, optimizer, value_function, \
    vf_optimizer = sub_components

should be avoided.
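As a minimal sketch (hypothetical method body, not the current implementation), the dict-based alternative could look like this, with each sub-component looked up by name only where it is needed:

class SomeAgent:
    def define_graph_api(self, sub_components):
        # Look up sub-components by name instead of unpacking a long positional tuple.
        policy = sub_components["policy"]
        memory = sub_components["memory"]
        optimizer = sub_components["optimizer"]
        # Remaining sub-components are fetched in the API-methods that actually use them.
        return policy, memory, optimizer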
In PyTorch, we currently manage state for get/set weights via a wrapper object called PyTorchVariable, which accesses layer weights.
However, in define-by-run backends we may also want to use lists and numpy arrays to manage state and get/set it through the executor interface, e.g. in buffers. Performing space inference on raw lists is difficult, so we could consider wrapping the 'list' variables in an object that stores the desired spaces.
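For illustration, a minimal sketch (hypothetical class, assuming numpy-compatible values) of wrapping raw list state together with its intended space, so space inference never has to run on a plain list:

import numpy as np

class SpaceWrappedList:
    def __init__(self, space, initial=None):
        self.space = space            # the desired Space of the stored values
        self.values = initial or []   # raw Python list, managed define-by-run

    def append(self, value):
        self.values.append(value)

    def to_numpy(self):
        return np.asarray(self.values)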
Working with parameters passed as dictionaries is inconvenient: no auto-complete, no explicit documentation, no explicit defaults, failures surface only at a later point, etc. It can also lead to bad practices such as adding undocumented fields instead of extending a class.
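For illustration only, a hypothetical typed alternative to such a spec dict: explicit parameters give auto-complete, documented defaults, and immediate failure on typos:

class OptimizerSpec:
    def __init__(self, optimizer_type="adam", learning_rate=0.001):
        self.optimizer_type = optimizer_type
        self.learning_rate = learning_rate

spec = OptimizerSpec(learning_rate=0.0005)  # a typo'd kwarg fails right here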
I saw that there is the Specifiable class and there are a few classes that extend it, but the usage seems inconsistent:
From agent.py
policy_spec (Optional[dict]): An optional dict for further kwargs passing into the Policy c'tor.
value_function_spec (list): Neural network specification for baseline.
exploration_spec (Optional[dict]): The spec-dict to create the Exploration Component.
execution_spec (Optional[dict,Execution]): The spec-dict specifying execution settings.
optimizer_spec (Optional[dict,Optimizer]): The spec-dict to create the Optimizer for this Agent.
value_function_optimizer_spec (dict): Optimizer config for the value function optimizer. If None, the optimizer
spec for the policy is used (same learning rate and optimizer type).
observe_spec (Optional[dict]): Spec-dict to specify `Agent.observe()` settings.
update_spec (Optional[dict]): Spec-dict to specify `Agent.update()` settings.
summary_spec (Optional[dict]): Spec-dict to specify summary settings.
saver_spec (Optional[dict]): Spec-dict to specify saver settings.
For example, optimizer_spec can be provided as an Optimizer, but value_function_optimizer_spec needs to be a dict -- the code below assumes this. Also, update_spec is a dictionary, and some of its fields are specific to the particular algorithm. Is there a specific reason for the difference between these parameters?
Here's a test to demonstrate this:
def test_policy_for_continuous_action_space(self):
    # state_space (NN is a simple single fc-layer relu network (2 units), random biases, random weights).
    state_space = FloatBox(shape=(4,), add_batch_rank=True)
    # action_space (continuous, bounded in [-1.0, 1.0]).
    action_space = FloatBox(low=-1.0, high=1.0, add_batch_rank=True)
    policy = Policy(network_spec=config_from_path("configs/test_simple_nn.json"), action_space=action_space)
    test = ComponentTest(
        component=policy,
        input_spaces=dict(
            nn_input=state_space,
            actions=action_space,
            logits=FloatBox(shape=(2,), add_batch_rank=True),
            probabilities=FloatBox(add_batch_rank=True)
        ),
        action_space=action_space
    )
    test.read_variable_values(policy.variables)
This test fails with:
self = <rlgraph.components.policies.policy.Policy object at 0x12ebb08d0>
key = '_T0_'
probabilities = <tf.Tensor 'policy/action-adapter-0/Squeeze:0' shape=(?,) dtype=float32>

    @graph_fn(flatten_ops=True, split_ops=True, add_auto_key_as_first_param=True)
    def _graph_fn_get_distribution_entropies(self, key, probabilities):
        """
        Pushes the given `probabilities` through all our distributions' `entropy` API-methods and returns a
        DataOpDict with the keys corresponding to our `action_space`.

        Args:
            probabilities (DataOp): The parameters to define a distribution.

        Returns:
            FlattenedDataOp: A DataOpDict with the different distributions' `entropy` outputs. Keys always correspond to
                structure of `self.action_space`.
        """
>       return self.distributions[key].entropy(probabilities)
E       KeyError: '_T0_'
Problem: We currently automatically test mostly the tf versions of our Components and Agents. PyTorch is not well represented in our test cases, which can lead to uncaught bugs. E.g., our Travis testing container is tf-based and does not even have PyTorch installed.
Solution: Start converting all test cases from the current pure tf-based setup to a more flexible one, where the same test case can be used on both backends. Note that most tests should already work under both backends, but this needs to be - yes - tested.
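A hypothetical sketch of the target setup: one test body, parametrized over the backend, so the identical case runs under both tf and PyTorch:

import unittest

class TestBothBackends(unittest.TestCase):
    def _run_case(self, backend):
        # Build the Component/Agent under `backend` and assert outputs here.
        self.assertIn(backend, ("tf", "pytorch"))

    def test_tf_backend(self):
        self._run_case("tf")

    def test_pytorch_backend(self):
        self._run_case("pytorch")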
Currently: A container op-record (o) in an API-method (e.g. an op-record holding a DataOpTuple) cannot be accessed per item by doing e.g. o[1] for a tuple or o["key-a"] for a dict. Instead, extra components (such as ContainerSplitter, ContainerMerger) need to be tediously added to the parent component and then used in the API-method to do these tasks.
Suggestion: Accessing an op-rec via the []-operator (by index or key) inside an API-method should automatically add the above steps and thus make handling of container op-recs inside an API-method more intuitive.
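A standalone toy sketch of the suggestion (not rlgraph's actual classes): __getitem__ on an op-record-like wrapper records the per-item access, standing in for the automatic insertion of a ContainerSplitter:

class ToyOpRecord:
    def __init__(self, value):
        self.value = value

    def __getitem__(self, key):
        # In the proposal, this would add the splitter step to the graph;
        # in this toy version we simply unwrap the container item.
        return ToyOpRecord(self.value[key])

rec = ToyOpRecord({"key-a": 1, "key-b": 2})
print(rec["key-a"].value)  # -> 1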
Coming from tensorforce (where agent.reset() is called in the episode loop), and from reading the docs in agents/agent.py, it seems agent.reset() is supposed to be called before starting a new episode. However, it currently does not seem to be called in SingleThreadedWorker nor RayWorker before new episodes, although the preprocessor stack does seem to be reset explicitly.
It would be nice if you could clarify the purpose of agent.reset() and when it is supposed to be called. Would appreciate some examples.
def reset(self):
    """
    Must be implemented to define some reset behavior (before starting a new episode).
    This could include resetting the preprocessor and other Components.
    """
    pass  # optional
Refs:
https://github.com/rlgraph/rlgraph/blob/master/rlgraph/agents/agent.py
https://github.com/rlgraph/rlgraph/blob/master/rlgraph/execution/single_threaded_worker.py
Understanding the build process is currently quite difficult because it happens partly in the graph builder, in static and non-static parts of Component, and in various utils.
We should:
The build procedure and the decorators contain many variable pairs named with name/name_, args/args_ etc. This makes the decorators much harder to read than necessary.
We should rename them and clearly identify which is used for what, e.g. inferred_name instead of name_.
Memories allow sampling either episodes or time steps, but the worker only supports time-step-based updates. For variable-length episodes, we need episode-based updating for multi-env updates.
The current visualization utils refer to an outdated component API model and should either be deprecated entirely or updated so we can produce visualizations again (desirable but not high priority).
When doing multi-env policy gradient updates, we have no way of distinguishing:
i) terminal episode fragments
ii) non-terminal episode fragments from different environments
In the single-env case, this is irrelevant because the terminal marker tells us all we need to know. In the multi-env case, we may want to update from multiple non-terminal fragments from different environments. If we then just artificially set them to terminal, bootstrapping in GAE is not correct, as the sketch below shows.
The proposed solution would require an additional marker in the memory to distinguish episodes from different environments.
Overall not high priority, because one can just call update externally.
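For context, a minimal numpy sketch of GAE over one episode fragment, showing where the bias comes from: marking a non-terminal fragment as terminal zeroes out the bootstrap value of its last state:

import numpy as np

def gae_advantages(rewards, values, next_value, terminal, gamma=0.99, lam=0.95):
    # Bootstrap with V(s_T) only if the fragment truly ended the episode.
    # Artificially setting terminal=True drops `next_value` and biases the estimate.
    values = np.append(values, 0.0 if terminal else next_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages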
Following #21, there are some type inference issues in PyTorch. Namely, the record space arrives as FloatBoxes in the buffer when e.g. actions are actually meant to be IntBoxes. This then requires tedious casting.
List a few basic how-tos:
How to:
Example script names should highlight their distributed/single-node nature, and we should add dedicated examples for e.g. distributed IMPALA, distributed Ray, and multi-GPU. Even if these differ only by a few flags, it's easier to have dedicated examples to use.
Policy weights cannot be serialised out of the box any more and are currently wrapped in a RayWeight object. Investigate which object is responsible for the problematic serialisation.
Potentially unpack/unnest/flatten weights before returning them from the agent API.
We use dtype (now renamed to convert_dtype) from util.py in various places with "tf" as the default value for the "to" arg.
Maybe we should make the "to" arg non-optional so it's always clear which representation we are converting to (tf, numpy, pytorch)?
Otherwise, TF code does not use the "to" arg but all other code has to, which makes things inconsistent and potentially confusing to read.
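A hypothetical sketch of the proposed signature (simplified to two dtypes; the real util handles many more), with "to" required at every call site:

def convert_dtype(dtype, to):
    # `to` is non-optional, so the target representation is always explicit.
    if to == "tf":
        import tensorflow as tf
        return {"float": tf.float32, "int": tf.int32}[dtype]
    if to == "np":
        import numpy as np
        return {"float": np.float32, "int": np.int32}[dtype]
    if to == "pytorch":
        import torch
        return {"float": torch.float32, "int": torch.int32}[dtype]
    raise ValueError("Unknown target representation: {}".format(to))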
It's quite difficult at the moment to get an idea of which features are supported beyond the obvious agents - a detailed feature list in the readme would help.
Hi,
While learning rlgraph by running the examples in README.md, we found multiple small typos or API changes that cause errors. It would be great if the example could be updated so that it's easier for people to try it out.
Here is the modified example:
from rlgraph.agents import DQNAgent
from rlgraph.environments import OpenAIGymEnv

environment = OpenAIGymEnv('CartPole-v0')

# Create from .json file or dict, see agent API for all
# possible configuration parameters.
agent = DQNAgent.from_file(
    "configs/dqn_cartpole.json",
    state_space=environment.state_space,
    action_space=environment.action_space
)

# Get an action, take a step, observe reward.
state = environment.reset()
preprocessed_state, action = agent.get_action(
    states=state,
    extra_returns="preprocessed_states"
)

# Execute step in environment.
next_state, reward, terminal, info = environment.step(action)

# Observe result.
agent.observe(
    preprocessed_states=preprocessed_state,
    actions=action,
    internals=[],
    next_states=next_state,
    rewards=reward,
    terminals=terminal
)

# Call update when desired:
loss = agent.update()
https://github.com/pytorch/pytorch/releases/tag/v1.0.0 is out; we need to update various operators and shape utilities, in particular w.r.t. the now-supported broadcasting and 0-dim tensors.
Use existing IMPALA components and expose them as a simple PG agent.
Add runtime support for container splitting/merging, flattening, etc.
Due to problematic performance in gridworld test environments, there is a high probability that dueling networks are not working as intended with container actions. We need to determine how to design dueling architectures for container actions (one dueling set per action?).
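For reference, the standard dueling aggregation, plus one hypothetical option for container actions (one dueling set per action key), sketched in numpy:

import numpy as np

def dueling_q(state_value, advantages):
    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    return state_value + advantages - advantages.mean(axis=-1, keepdims=True)

def container_dueling_q(state_values, advantages):
    # One V(s) output and one advantage head per action key (assumed design).
    return {key: dueling_q(state_values[key], advantages[key]) for key in advantages}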
A key problem in fully unifying internal state management across backends is that _variables() returns internally registered variable references.
When writing in Python, raw ints/floats (e.g. buffer indices) are not refs, so their internally registered values are not updated, and variables() does not return updated values. An example is the ring-buffer class: variable creation is unified, but getting variables() in the tests is problematic.
A simple solution is for components to implement variables() so that it returns all variables making up their internal state. This would allow returning native Python types, TensorFlow variables, and torch parameters alike without further wrapping any ops.
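A minimal sketch of that convention (hypothetical component, plain Python): variables() returns current values directly, so native ints stay correct without a stale reference registry:

class RingBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.index = 0     # raw int: a reference registry would go stale here
        self.records = []

    def variables(self):
        # Return the actual current state; tf.Variables or torch parameters
        # could be returned from this same method in the other backends.
        return {"index": self.index, "records": self.records}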
Watch out! Your documentation says that I can feed a NeuralNetwork object into the agent, but that is not true with the current implementation. Relevant code section:
https://github.com/YARL-project/YARL/blob/bc2e65a5f4430629d0558b23d45edd1029d42178/yarl/agents/agent.py#L97
Add an option for either/both of:
i) an internal decay mechanism via the global step plus a learning_rate_spec as part of the update_spec;
ii) an optional learning_rate parameter in update() to externally manipulate learning rates based on whatever scheme is desired.
Slight preference for ii) because it allows easier experimentation with irregular decay schemes.
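A sketch of what option ii) could look like at the call site; the learning_rate parameter on update() is the assumed addition, and `agent` is an already-built Agent instance:

for step in range(20000):
    lr = 1e-3 * (0.5 ** (step // 5000))    # any decay scheme, however irregular
    loss = agent.update(learning_rate=lr)  # proposed optional override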
In the following test-case:
test_gpu_strategies.py::test_multi_gpu_ppo_agent_learning_test_gridworld_2x2
Variable assignment occasionally fails in the ring buffer, likely because of non-deterministic reads. Investigate the read-write order of all variables.
For useful applied tasks, we need to allow container action spaces. This currently raises an error; we need to determine where it actually causes a problem.
Currently, execute() accepts the method name provided as a string, which prevents linters from detecting typos or changes due to refactoring. I propose to also allow passing the method directly - then linters and auto-complete work.
Current state:
graph_executor.execute("get_policy_weights")
Proposed change:
graph_executor.execute(self.root_component.get_policy_weights)
The Policy Component needs some cleanup as its API-methods are becoming less and less organized.
@janislavjankov has suggested the following:
The comment I had for the agent's component is that it looks cleaner to me if it were extracted and defined as a separate class - no need to attach the methods within define_graph_api - just have a regular class (extending Component) that can be instantiated there.
So we could, for example, have a class in the DQNAgent module that implements the API of DQN as simple Python methods.
Here is a test case to demonstrate this:
def test_call_in_comprehension(self):
    container = Component(scope="container")
    sub_comps = [Dummy1To1(scope="dummy-{}".format(i)) for i in range(3)]
    container.add_components(*sub_comps)

    # Define container's API:
    @rlgraph_api(name="test", component=container)
    def container_test(self_, input_):
        # results = []
        # for i in range(len(sub_comps)):
        #     results.append(sub_comps[i].run(input_))
        results = [x.run(input_) for x in sub_comps]
        return self_._graph_fn_sum(*results)

    @graph_fn(component=container)
    def _graph_fn_sum(self_, *inputs):
        return sum(inputs)

    test = ComponentTest(component=container, input_spaces=dict(input_=float))
    test.test(("test", 1.23), expected_outputs=len(sub_comps) * (1.23 + 1), decimals=2)
The commented-out code above works, while the equivalent list comprehension fails with:
self = <rlgraph.tests.dummy_components.Dummy1To1 object at 0x129f86748>
args = (<rlgraph.utils.op_records.DataOpRecord object at 0x129ee2fd0>,)
kwargs = {}, api_fn_name = 'run'
api_method_rec = <rlgraph.utils.op_records.APIMethodRecord object at 0x129ee2048>
in_op_column = <rlgraph.utils.op_records.DataOpRecordColumnIntoAPIMethod object at 0x129f1b048>
minimum_num_call_params = 1
all_args = [(0, <rlgraph.utils.op_records.DataOpRecord object at 0x129ee2fd0>)]
flex = None, i = 0, key = 0
...
rlgraph.utils.rlgraph_errors.RLGraphError: API-method 'run' must have as 1st parameter (the component) either `root` or `self`. Other names are not allowed!
The Agent API method get_action should return a dict instead of the current 2-tuple (action, preprocessed_state). The dict would have the keys "action" and "preprocessed_state". This is already good practice in many of the Policy, RNN, and other classes' API methods, which may sometimes have more complex return structures.
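What the proposed format would look like at the call site (key names taken from this issue; this is the suggested, not the current, API):

ret = agent.get_action(states=state, extra_returns="preprocessed_states")
action = ret["action"]
preprocessed_state = ret["preprocessed_state"]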
I am using tensorforce, but I see that the development pace is perhaps slowing a bit there. It fits my needs well and works very well though, so I am very happy with it and I recommend people give it a look ;) (note: I am a user, by no means a developer there).
Just out of curiosity, what is the difference between tensorforce and rlgraph from an end-user point of view?
Of course I understand that developers of rlgraph may be a bit biased on this question, but I am asking for your opinion nonetheless ;) (especially @michaelschaarschmidt, who seems to work on both? ;)).
Hi, it would be great if you could provide a working example with LSTM cells. I saw there are tests with LSTMs, but a complete example is missing. Thanks.
For distributed synchronisation, we need to sync separate value networks via get/set weights.
Default-enabling observe buffering just creates unexpected behaviour for users with small batch sizes.
Currently, input arguments to the PyTorch executor are always converted to torch tensors. This is not really desirable for e.g. memory inserts or things that could just be executed in native Python, and there are likely some unneeded conversions hurting performance and also causing type inference problems.
Ideally, we would have an option to tell API methods whether this conversion is needed. The problem is that for TF, everything is auto-converted.
The global step is created in the executor and used for checkpointing. An increment op would need to be created, e.g. in the generic agent, and called together with act.
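A minimal TF 1.x-style sketch (matching the codebase's TF version) of the increment op the generic agent could group with its act op:

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
increment_global_step = tf.assign_add(global_step, 1)  # run together with act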
With tf 2.0 preview builds out, we need to plan how to transition to 2.0.
In particular:
Container actions come as a dict from Agent.get_action(), where each key contains a batch of values. This format needs to be flipped (to a list of dicts) via:

ret = [{key: value[i] for key, value in ret.items()} for i in range(len(ret[next(iter(ret))]))]

(ret is the session-returned dict) ... before sending single actions to an env.
NOTE: The original session output (no flipping) is needed when python-buffering and/or observing via Agent.observe()!
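A quick self-contained demonstration of the flip with a dummy session-returned dict (two environments, two action keys):

ret = {"steer": [0.1, 0.2], "throttle": [1, 0]}
flipped = [{key: value[i] for key, value in ret.items()}
           for i in range(len(ret[next(iter(ret))]))]
print(flipped)  # [{'steer': 0.1, 'throttle': 1}, {'steer': 0.2, 'throttle': 0}]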
So far, unsplitting is done implicitly whenever split_ops=True. This is normally OK, but there are cases where the output of the graph_fn has nothing to do, in terms of nested structure, with the input to the graph_fn. In these cases, it is important to be able to e.g. switch splitting off and unsplitting on.
Example:
A policy has a graph_fn into which we send the nn_input (e.g. a dict state with keys "a" and "b"). This function should then output actions (e.g. non-container, simple actions). So the input (dict state space) does not match the output (flat actions) in terms of nesting structure, and the assumption that the (actions) output should be unsplit the same way the (state space) input was split fails here.
For distributed policy optimization on ray, we need a simple synchronous executor which merges all worker batches and applies them synchronously into one update.
The current implementations grew out of an experimental design around multi-backend support. The get_backend() checks are undesirable and do not make for readable implementations.
This issue is meant to collect design improvements.
Proposal 1: Components will be reorganised into a base component and backend-specific sub-classes.
A package tensorflow_components/pytorch_components will mirror the folder structure of the base components and contain the specific implementations.
Advantages: avoids backend checks in implementations; clearly separates backends from interfaces.
Disadvantages: multiplies the number of components; potentially irritating to see a mirrored folder structure of

components/memories/base_memory
tf_components/memories/tf_memory
pytorch_components/memories/pytorch_memory

versus keeping everything in one memories folder (which would make imports more difficult for the package):

components/
    memories/
        base_memory
        tf_memory
        pytorch_memory
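A minimal sketch of Proposal 1 (hypothetical class names): a backend-agnostic base class with backend-specific subclasses, so no implementation needs a get_backend() check:

class Memory:
    """Backend-agnostic interface."""
    def insert(self, records):
        raise NotImplementedError

class TfMemory(Memory):
    def insert(self, records):
        pass  # tf-variable / scatter-update based implementation lives here

class PyTorchMemory(Memory):
    def insert(self, records):
        pass  # define-by-run list/tensor implementation lives here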
Hi, I read the rlgraph paper and wanted to give it a try, so I set up and ran a few examples, but the first one breaks for me. Repro:
virtualenv -p python3 venv
source venv/bin/activate
pip install rlgraph
pip install rlgraph[ray]
pip install gym[atari]
pip install tensorflow-gpu
pip install psutil
pip install setproctitle
# Start ray on the head machine
ray start --head --redis-port 6379
# Optionally join to this cluster from other machines with ray start --redis-address=...
# Run script
python apex_pong.py
After ~1 minute it breaks with:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,84,84,4] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[node prioritized-replay/memorynext_states/Assign (defined at /media/bjg/storage/code/rlgraph/venv2/lib/python3.6/site-packages/rlgraph/spaces/box_space.py:192) = Assign[T=DT_FLOAT, _grappler_relax_allocator_constraints=true, use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prioritized-replay/memorynext_states, prioritized-replay/memorynext_states/Initializer/Const)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Am I doing anything wrong, or is the default example not working on Linux?
Info about my machine:
OS: Ubuntu 18.04.1 LTS
CPU: AMD ThreadRipper
GPU: GeForce 1080ti
RAM: 32 GB
VRAM: 11 GB
Adding a few learning/performance results to the readme is always a good idea.
For distributed policy optimisation, we can consider (in view of the problems in #13) adding an update method to PG agents which assumes GAE discounts have already been pre-computed for fragments.
There is no reason this should be a private API method; it causes PEP-8 warnings everywhere.