rlgraph / rlgraph
RLgraph: Modular computation graphs for deep reinforcement learning
License: Apache License 2.0
The current signature of define_graph_api is awkward and leads to PEP-8 warnings in all agents because we pass arbitrary other args through it. Splitting the list of sub-components also seems suboptimal; maybe we should pass them as a dict and look them up where needed. Long lines like

preprocessor, merger, memory, splitter, policy, exploration, loss_function, optimizer, value_function, \
    vf_optimizer = sub_components

should be avoided.
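As a minimal sketch (hypothetical method body, not the current implementation), the dict-based alternative could look like this, with each sub-component looked up by name only where it is needed:

class SomeAgent:
    def define_graph_api(self, sub_components):
        # Look up sub-components by name instead of unpacking a long positional tuple.
        policy = sub_components["policy"]
        memory = sub_components["memory"]
        optimizer = sub_components["optimizer"]
        # Remaining sub-components are fetched in the API-methods that actually use them.
        return policy, memory, optimizer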
In PyTorch, we currently manage state for get/set weights via a wrapper object called PyTorchVariable, which accesses layer weights.
However, in define-by-run backends we may also want to use lists and numpy arrays to manage state and get/set it through the executor interface, e.g. in buffers. Performing space inference on raw lists is difficult, so we could consider wrapping the 'list' variables in an object that stores the desired spaces.
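For illustration, a minimal sketch (hypothetical class, assuming numpy-compatible values) of wrapping raw list state together with its intended space, so space inference never has to run on a plain list:

import numpy as np

class SpaceWrappedList:
    def __init__(self, space, initial=None):
        self.space = space            # the desired Space of the stored values
        self.values = initial or []   # raw Python list, managed define-by-run

    def append(self, value):
        self.values.append(value)

    def to_numpy(self):
        return np.asarray(self.values)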
Working with parameters passed as dictionaries is inconvenient: no auto-complete, no explicit documentation, no explicit defaults, failures surface only at a later point, etc. It can also lead to bad practices such as adding undocumented fields instead of extending a class.
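For illustration only, a hypothetical typed alternative to such a spec dict: explicit parameters give auto-complete, documented defaults, and immediate failure on typos:

class OptimizerSpec:
    def __init__(self, optimizer_type="adam", learning_rate=0.001):
        self.optimizer_type = optimizer_type
        self.learning_rate = learning_rate

spec = OptimizerSpec(learning_rate=0.0005)  # a typo'd kwarg fails right here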
I saw that there is the Specifiable class and there are a few classes that extend it, but the usage seems inconsistent:
From agent.py
policy_spec (Optional[dict]): An optional dict for further kwargs passing into the Policy c'tor.
value_function_spec (list): Neural network specification for baseline.
exploration_spec (Optional[dict]): The spec-dict to create the Exploration Component.
execution_spec (Optional[dict,Execution]): The spec-dict specifying execution settings.
optimizer_spec (Optional[dict,Optimizer]): The spec-dict to create the Optimizer for this Agent.
value_function_optimizer_spec (dict): Optimizer config for the value function optimizer. If None, the optimizer
spec for the policy is used (same learning rate and optimizer type).
observe_spec (Optional[dict]): Spec-dict to specify `Agent.observe()` settings.
update_spec (Optional[dict]): Spec-dict to specify `Agent.update()` settings.
summary_spec (Optional[dict]): Spec-dict to specify summary settings.
saver_spec (Optional[dict]): Spec-dict to specify saver settings.
For example, optimizer_spec can be provided as an Optimizer, but value_function_optimizer_spec needs to be a dict -- the code below assumes this. Also, update_spec is a dictionary, and some of its fields are specific to the particular algorithm. Is there a specific reason for the difference between these parameters?
Here's a test to demonstrate this:
def test_policy_for_continuous_action_space(self):
    # state_space (NN is a simple single fc-layer relu network (2 units), random biases, random weights).
    state_space = FloatBox(shape=(4,), add_batch_rank=True)
    # action_space (continuous, bounded in [-1.0, 1.0]).
    action_space = FloatBox(low=-1.0, high=1.0, add_batch_rank=True)
    policy = Policy(network_spec=config_from_path("configs/test_simple_nn.json"), action_space=action_space)
    test = ComponentTest(
        component=policy,
        input_spaces=dict(
            nn_input=state_space,
            actions=action_space,
            logits=FloatBox(shape=(2,), add_batch_rank=True),
            probabilities=FloatBox(add_batch_rank=True)
        ),
        action_space=action_space
    )
    test.read_variable_values(policy.variables)
This test fails with:
self = <rlgraph.components.policies.policy.Policy object at 0x12ebb08d0>
key = '_T0_'
probabilities = <tf.Tensor 'policy/action-adapter-0/Squeeze:0' shape=(?,) dtype=float32>

    @graph_fn(flatten_ops=True, split_ops=True, add_auto_key_as_first_param=True)
    def _graph_fn_get_distribution_entropies(self, key, probabilities):
        """
        Pushes the given `probabilities` through all our distributions' `entropy` API-methods and returns a
        DataOpDict with the keys corresponding to our `action_space`.

        Args:
            probabilities (DataOp): The parameters to define a distribution.

        Returns:
            FlattenedDataOp: A DataOpDict with the different distributions' `entropy` outputs. Keys always correspond to
                structure of `self.action_space`.
        """
>       return self.distributions[key].entropy(probabilities)
E       KeyError: '_T0_'
Problem: We currently automatically test mostly the tf versions of our Components and Agents. PyTorch is not well represented in our test cases, which can lead to uncaught bugs. E.g., our Travis testing container is tf-based and does not even have PyTorch installed.
Solution: Start converting all test cases from the current pure tf-based setup to a more flexible one, where the same test case can be used on both backends. Note that most tests should already work under both backends, but this needs to be - yes - tested.
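A hypothetical sketch of the target setup: one test body, parametrized over the backend, so the identical case runs under both tf and PyTorch:

import unittest

class TestBothBackends(unittest.TestCase):
    def _run_case(self, backend):
        # Build the Component/Agent under `backend` and assert outputs here.
        self.assertIn(backend, ("tf", "pytorch"))

    def test_tf_backend(self):
        self._run_case("tf")

    def test_pytorch_backend(self):
        self._run_case("pytorch")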
Currently: A container op-record (o) in an API-method (e.g. an op-record holding a DataOpTuple) cannot be accessed per item by doing e.g. o[1] for a tuple or o["key-a"] for a dict. Instead, extra components (such as ContainerSplitter, ContainerMerger) need to be tediously added to the parent component and then used in the API-method to do these tasks.
Suggestion: Accessing an op-rec via the []-operator (by index or key) inside an API-method should automatically add the above steps and thus make handling of container op-recs inside an API-method more intuitive.
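A standalone toy sketch of the suggestion (not rlgraph's actual classes): __getitem__ on an op-record-like wrapper records the per-item access, standing in for the automatic insertion of a ContainerSplitter:

class ToyOpRecord:
    def __init__(self, value):
        self.value = value

    def __getitem__(self, key):
        # In the proposal, this would add the splitter step to the graph;
        # in this toy version we simply unwrap the container item.
        return ToyOpRecord(self.value[key])

rec = ToyOpRecord({"key-a": 1, "key-b": 2})
print(rec["key-a"].value)  # -> 1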
Coming from tensorforce (where agent.reset() is called in the episode loop), and from reading the docs in agents/agent.py, it seems agent.reset() is supposed to be called before starting a new episode. However, it currently does not seem to be called in SingleThreadedWorker nor RayWorker before new episodes, although the preprocessor stack does seem to be reset explicitly.
It would be nice if you could clarify the purpose of agent.reset() and when it is supposed to be called. Would appreciate some examples.
def reset(self):
    """
    Must be implemented to define some reset behavior (before starting a new episode).
    This could include resetting the preprocessor and other Components.
    """
    pass  # optional
Refs:
https://github.com/rlgraph/rlgraph/blob/master/rlgraph/agents/agent.py
https://github.com/rlgraph/rlgraph/blob/master/rlgraph/execution/single_threaded_worker.py
Understanding the build process is currently quite difficult because it happens partly in the graph builder, in static and non-static parts of Component, and in various utils.
We should:
The build procedure and the decorators contain many variable pairs named with name/name_, args/args_ etc. This makes the decorators much harder to read than necessary.
We should rename them and clearly identify which is used for what, e.g. inferred_name instead of name_.
Memories allow sampling either episodes or time steps, but the worker only supports time-step-based updates. For variable-length episodes, we need episode-based updating for multi-env updates.
The current visualization utils refer to an outdated component API model and should either be deprecated entirely or updated so we can produce visualizations again (desirable but not high priority).
When doing multi-env policy gradient updates, we have no way of distinguishing:
i) terminal episode fragments
ii) non-terminal episode fragments from different environments
In the single-env case, this is irrelevant because the terminal marker tells us all we need to know. In the multi-env case, we may want to update from multiple non-terminal fragments from different environments. If we then just artificially set them to terminal, bootstrapping in GAE is not correct, as the sketch below shows.
The proposed solution would require an additional marker in the memory to distinguish episodes from different environments.
Overall not high priority, because one can just call update externally.
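For context, a minimal numpy sketch of GAE over one episode fragment, showing where the bias comes from: marking a non-terminal fragment as terminal zeroes out the bootstrap value of its last state:

import numpy as np

def gae_advantages(rewards, values, next_value, terminal, gamma=0.99, lam=0.95):
    # Bootstrap with V(s_T) only if the fragment truly ended the episode.
    # Artificially setting terminal=True drops `next_value` and biases the estimate.
    values = np.append(values, 0.0 if terminal else next_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages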
Following #21, there are some type inference issues in PyTorch. Namely, the record space arrives as FloatBoxes in the buffer when e.g. actions are actually meant to be IntBoxes. This then requires tedious casting.
List a few basic how-tos:
How to:
Example script names should highlight their distributed/single-node nature, and we should add dedicated examples for e.g. distributed IMPALA, distributed Ray, and multi-GPU. Even if these differ only by a few flags, it's easier to have dedicated examples to use.
Policy weights cannot be serialised out of the box any more and are currently wrapped in a RayWeight object. Investigate which object is responsible for the problematic serialisation.
Potentially unpack/unnest/flatten weights before returning them from the agent API.
We use dtype (now renamed to convert_dtype) from util.py in various places with "tf" as the default value for the "to" arg.
Maybe we should make the "to" arg non-optional so it's always clear which representation we are converting to (tf, numpy, pytorch)?
Otherwise, TF code does not use the "to" arg but all other code has to, which makes things inconsistent and potentially confusing to read.
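A hypothetical sketch of the proposed signature (simplified to two dtypes; the real util handles many more), with "to" required at every call site:

def convert_dtype(dtype, to):
    # `to` is non-optional, so the target representation is always explicit.
    if to == "tf":
        import tensorflow as tf
        return {"float": tf.float32, "int": tf.int32}[dtype]
    if to == "np":
        import numpy as np
        return {"float": np.float32, "int": np.int32}[dtype]
    if to == "pytorch":
        import torch
        return {"float": torch.float32, "int": torch.int32}[dtype]
    raise ValueError("Unknown target representation: {}".format(to))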
It's quite difficult at the moment to get an idea of which features are supported beyond the obvious agents - a detailed feature list in the readme would help.
Hi,
While learning rlgraph by running the examples in README.md, we found multiple small typos or API changes that cause errors. It would be great if the example could be updated so that it's easier for people to try it out.
Here is the modified example:
from rlgraph.agents import DQNAgent
from rlgraph.environments import OpenAIGymEnv

environment = OpenAIGymEnv('CartPole-v0')

# Create from .json file or dict, see agent API for all
# possible configuration parameters.
agent = DQNAgent.from_file(
    "configs/dqn_cartpole.json",
    state_space=environment.state_space,
    action_space=environment.action_space
)

# Get an action, take a step, observe reward.
state = environment.reset()
preprocessed_state, action = agent.get_action(
    states=state,
    extra_returns="preprocessed_states"
)

# Execute step in environment.
next_state, reward, terminal, info = environment.step(action)

# Observe result.
agent.observe(
    preprocessed_states=preprocessed_state,
    actions=action,
    internals=[],
    next_states=next_state,
    rewards=reward,
    terminals=terminal
)

# Call update when desired:
loss = agent.update()
https://github.com/pytorch/pytorch/releases/tag/v1.0.0 is out; we need to update various operators and shape utilities, in particular w.r.t. the now-supported broadcasting and 0-dim tensors.
Use existing IMPALA components and expose them as a simple PG agent.
Add runtime support for container splitting/merging, flattening, etc.
Due to problematic performance in gridworld test environments, there is a high probability that dueling networks are not working as intended with container actions. We need to determine how to design dueling architectures for container actions (one dueling set per action?).
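For reference, the standard dueling aggregation, plus one hypothetical option for container actions (one dueling set per action key), sketched in numpy:

import numpy as np

def dueling_q(state_value, advantages):
    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    return state_value + advantages - advantages.mean(axis=-1, keepdims=True)

def container_dueling_q(state_values, advantages):
    # One V(s) output and one advantage head per action key (assumed design).
    return {key: dueling_q(state_values[key], advantages[key]) for key in advantages}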
A key problem in fully unifying internal state management across backends is that _variables() returns internally registered variable references.
When writing in Python, raw ints/floats (e.g. buffer indices) are not refs, so their internally registered values are not updated, and variables() does not return updated values. An example is the ring-buffer class: variable creation is unified, but getting variables() in the tests is problematic.
A simple solution is for components to implement variables() so that it returns all variables making up their internal state. This would allow returning native Python types, TensorFlow variables, and torch parameters alike without further wrapping any ops.
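A minimal sketch of that convention (hypothetical component, plain Python): variables() returns current values directly, so native ints stay correct without a stale reference registry:

class RingBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.index = 0     # raw int: a reference registry would go stale here
        self.records = []

    def variables(self):
        # Return the actual current state; tf.Variables or torch parameters
        # could be returned from this same method in the other backends.
        return {"index": self.index, "records": self.records}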
Watch out! Your documentation says that I can feed a NeuralNetwork object into the agent, but that is not true with the current implementation. Relevant code section:
https://github.com/YARL-project/YARL/blob/bc2e65a5f4430629d0558b23d45edd1029d42178/yarl/agents/agent.py#L97
Add an option for either/both of:
i) an internal decay mechanism via the global step plus a learning_rate_spec as part of the update_spec;
ii) an optional learning_rate parameter in update() to externally manipulate learning rates based on whatever scheme is desired.
Slight preference for ii) because it allows easier experimentation with irregular decay schemes.
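A sketch of what option ii) could look like at the call site; the learning_rate parameter on update() is the assumed addition, and `agent` is an already-built Agent instance:

for step in range(20000):
    lr = 1e-3 * (0.5 ** (step // 5000))    # any decay scheme, however irregular
    loss = agent.update(learning_rate=lr)  # proposed optional override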
In the following test-case:
test_gpu_strategies.py::test_multi_gpu_ppo_agent_learning_test_gridworld_2x2
Variable assignment occasionally fails in the ring buffer, likely because of non-deterministic reads. Investigate the read-write order of all variables.
For useful applied tasks, we need to allow container action spaces. This currently raises an error; we need to determine where it actually causes a problem.
Currently, execute() accepts the method name provided as a string, which prevents linters from detecting typos or changes due to refactoring. I propose to also allow passing the method directly - then linters and auto-complete work.
Current state:
graph_executor.execute("get_policy_weights")
Proposed change:
graph_executor.execute(self.root_component.get_policy_weights)
The Policy Component needs some cleanup as its API-methods are becoming less and less organized.
@janislavjankov has suggested the following:
The comment I had for the agent's component is that it looks cleaner to me if it were extracted and defined as a separate class - no need to attach the methods within define_graph_api - just have a regular class (extending Component) that can be instantiated there.
So we could, for example, have a class in the DQNAgent module that implements the API of DQN as simple Python methods.
Here is a test case to demonstrate this:
def test_call_in_comprehension(self):
    container = Component(scope="container")
    sub_comps = [Dummy1To1(scope="dummy-{}".format(i)) for i in range(3)]
    container.add_components(*sub_comps)

    # Define container's API:
    @rlgraph_api(name="test", component=container)
    def container_test(self_, input_):
        # results = []
        # for i in range(len(sub_comps)):
        #     results.append(sub_comps[i].run(input_))
        results = [x.run(input_) for x in sub_comps]
        return self_._graph_fn_sum(*results)

    @graph_fn(component=container)
    def _graph_fn_sum(self_, *inputs):
        return sum(inputs)

    test = ComponentTest(component=container, input_spaces=dict(input_=float))
    test.test(("test", 1.23), expected_outputs=len(sub_comps) * (1.23 + 1), decimals=2)
The commented-out code above works, while the equivalent list comprehension fails with:
self = <rlgraph.tests.dummy_components.Dummy1To1 object at 0x129f86748>
args = (<rlgraph.utils.op_records.DataOpRecord object at 0x129ee2fd0>,)
kwargs = {}, api_fn_name = 'run'
api_method_rec = <rlgraph.utils.op_records.APIMethodRecord object at 0x129ee2048>
in_op_column = <rlgraph.utils.op_records.DataOpRecordColumnIntoAPIMethod object at 0x129f1b048>
minimum_num_call_params = 1
all_args = [(0, <rlgraph.utils.op_records.DataOpRecord object at 0x129ee2fd0>)]
flex = None, i = 0, key = 0
...
rlgraph.utils.rlgraph_errors.RLGraphError: API-method 'run' must have as 1st parameter (the component) either `root` or `self`. Other names are not allowed!
The Agent API method get_action should return a dict instead of the current 2-tuple (action, preprocessed_state). The dict would have the keys "action" and "preprocessed_state". This is already good practice in many of the Policy, RNN, and other classes' API methods, which may sometimes have more complex return structures.
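What the proposed format would look like at the call site (key names taken from this issue; this is the suggested, not the current, API):

ret = agent.get_action(states=state, extra_returns="preprocessed_states")
action = ret["action"]
preprocessed_state = ret["preprocessed_state"]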
I am using tensorforce, but I see that the development pace is perhaps slowing a bit there. It fits my needs well and works very well though, so I am very happy with it and I recommend people give it a look ;) (note: I am a user, by no means a developer there).
Just out of curiosity, what is the difference between tensorforce and rlgraph from an end-user point of view?
Of course I understand that developers of rlgraph may be a bit biased on this question, but I am asking for your opinion nonetheless ;) (especially @michaelschaarschmidt, who seems to work on both? ;)).
Hi, it would be great if you could provide a working example with LSTM cells. I saw there are tests with LSTMs, but a complete example is missing. Thanks.
For distributed synchronisation, we need to sync separate value networks via get/set weights.
Default-enabling observe buffering just creates unexpected behaviour for users with small batch sizes.
Currently, input arguments to the PyTorch executor are always converted to torch tensors. This is not really desirable for e.g. memory inserts or things that could just be executed in native Python, and there are likely some unneeded conversions hurting performance and also causing type inference problems.
Ideally, we would have an option to tell API methods whether this conversion is needed. The problem is that for TF, everything is auto-converted.
The global step is created in the executor and used for checkpointing. An increment op would need to be created, e.g. in the generic agent, and called together with act.
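A minimal TF 1.x-style sketch (matching the codebase's TF version) of the increment op the generic agent could group with its act op:

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
increment_global_step = tf.assign_add(global_step, 1)  # run together with act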
With tf 2.0 preview builds out, we need to plan how to transition to 2.0.
In particular:
Container actions come as a dict from Agent.get_action(), where each key contains a batch of values. This format needs to be flipped (to a list of dicts) via:

ret = [{key: value[i] for key, value in ret.items()} for i in range(len(ret[next(iter(ret))]))]

(ret is the session-returned dict) ... before sending single actions to an env.
NOTE: The original session output (no flipping) is needed when python-buffering and/or observing via Agent.observe()!
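A quick self-contained demonstration of the flip with a dummy session-returned dict (two environments, two action keys):

ret = {"steer": [0.1, 0.2], "throttle": [1, 0]}
flipped = [{key: value[i] for key, value in ret.items()}
           for i in range(len(ret[next(iter(ret))]))]
print(flipped)  # [{'steer': 0.1, 'throttle': 1}, {'steer': 0.2, 'throttle': 0}]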
So far, unsplitting is done implicitly whenever split_ops=True. This is normally OK, but there are cases where the output of the graph_fn has nothing to do, in terms of nested structure, with the input to the graph_fn. In these cases, it is important to be able to e.g. switch splitting off and unsplitting on.
Example:
A policy has a graph_fn into which we send the nn_input (e.g. a dict state with keys "a" and "b"). This function should then output actions (e.g. non-container, simple actions). So the input (dict state space) does not match the output (flat actions) in terms of nesting structure, and the assumption that the (actions) output should be unsplit the same way the (state space) input was split fails here.
For distributed policy optimization on ray, we need a simple synchronous executor which merges all worker batches and applies them synchronously into one update.
The current implementations grew out of an experimental design around multi-backend support. The get_backend() checks are undesirable and do not make for readable implementations.
This issue is meant to collect design improvements.
Proposal 1: Components will be reorganised into a base component and backend-specific sub-classes.
A package tensorflow_components/pytorch_components will mirror the folder structure of the base components and contain the specific implementations.
Advantages: avoids backend checks in implementations; clearly separates backends from interfaces.
Disadvantages: multiplies the number of components; potentially irritating to see a mirrored folder structure of

components/memories/base_memory
tf_components/memories/tf_memory
pytorch_components/memories/pytorch_memory

versus keeping everything in one memories folder (which would make imports more difficult for the package):

components/
    memories/
        base_memory
        tf_memory
        pytorch_memory
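A minimal sketch of Proposal 1 (hypothetical class names): a backend-agnostic base class with backend-specific subclasses, so no implementation needs a get_backend() check:

class Memory:
    """Backend-agnostic interface."""
    def insert(self, records):
        raise NotImplementedError

class TfMemory(Memory):
    def insert(self, records):
        pass  # tf-variable / scatter-update based implementation lives here

class PyTorchMemory(Memory):
    def insert(self, records):
        pass  # define-by-run list/tensor implementation lives here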
Hi, I read the rlgraph paper and wanted to give it a try, so I set up and ran a few examples, but the first one breaks for me. Repro:
virtualenv -p python3 venv
source venv/bin/activate
pip install rlgraph
pip install rlgraph[ray]
pip install gym[atari]
pip install tensorflow-gpu
pip install psutil
pip install setproctitle
# Start ray on the head machine
ray start --head --redis-port 6379
# Optionally join to this cluster from other machines with ray start --redis-address=...
# Run script
python apex_pong.py
After ~1 minute it breaks with:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,84,84,4] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[node prioritized-replay/memorynext_states/Assign (defined at /media/bjg/storage/code/rlgraph/venv2/lib/python3.6/site-packages/rlgraph/spaces/box_space.py:192) = Assign[T=DT_FLOAT, _grappler_relax_allocator_constraints=true, use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prioritized-replay/memorynext_states, prioritized-replay/memorynext_states/Initializer/Const)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Am I doing anything wrong, or is the default example not working on Linux?
Info about my machine:
OS: Ubuntu 18.04.1 LTS
CPU: AMD ThreadRipper
GPU: GeForce 1080ti
RAM: 32 GB
VRAM: 11 GB
Adding a few learning/performance results to the readme is always a good idea.
For distributed policy optimisation, we can consider (in view of the problems in #13) adding an update method to PG agents which assumes GAE discounts have already been pre-computed for fragments.
There is no reason this should be a private API method; it causes PEP-8 warnings everywhere.