rlgraph / rlgraph

RLgraph: Modular computation graphs for deep reinforcement learning

License: Apache License 2.0

Python 98.96% Dockerfile 0.31% C++ 0.73%
deep-learning deep-reinforcement-learning dqn machine-learning neural-networks ppo pytorch reinforcement-learning tensorflow

rlgraph's People

Contributors

janislavjankov, jon-chuang, krfricke, michaelschaarschmidt, samialabed, sven1977


rlgraph's Issues

[Core] Refactor define_graph_api

The current signature of define_graph_api is awkward and leads to PEP 8 warnings in all agents because we pass arbitrary extra args through. Splitting the list of sub-components also seems suboptimal; maybe we should pass them as a dict and look them up where needed (a dict-based alternative is sketched below). Long lines like

        preprocessor, merger, memory, splitter, policy, exploration, loss_function, optimizer, value_function, \
            vf_optimizer = sub_components

should be avoided.
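
A minimal sketch of the dict-based alternative (illustrative only, not the actual RLgraph signature):

    # Hypothetical: pass sub-components as a dict and look each one up where it
    # is actually needed, instead of unpacking a long positional tuple.
    def define_graph_api(self, sub_components):
        policy = sub_components["policy"]
        memory = sub_components["memory"]
        optimizer = sub_components["optimizer"]
        # ... only fetch what this agent's API actually uses.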

[Core] Improve define-by-run state management.

In PyTorch, we currently manage state for get/set weights via a wrapper object called PyTorchVariable, which accesses layer weights.

However, in define-by-run backends we may also want to use lists and NumPy arrays to manage state and get/set it through the executor interface, e.g. in buffers. Performing space inference on raw lists is difficult, so we could consider wrapping these 'list' variables in an object that also stores the desired spaces (see the sketch below).
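
One possible shape of such a wrapper (a sketch; ListVariable is not an existing RLgraph class):

    class ListVariable(object):
        """Hypothetical wrapper: a raw Python list plus the Space it is meant to
        hold, so space inference does not have to run on the bare list."""

        def __init__(self, space, initial_values=None):
            self.space = space
            self.values = list(initial_values or [])

        def get(self):
            return self.values

        def set(self, values):
            self.values = list(values)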

*_spec parameters as dictionaries are inconvenient

Working with parameters passed as dictionaries is inconvenient: no auto-complete, no explicit documentation, no explicit defaults, failures only surface at a later point, etc. It can also lead to bad practices such as adding undocumented fields instead of extending a class.
I saw that there is a Specifiable class and a few classes that extend it, but the usage seems inconsistent:

From agent.py

            policy_spec (Optional[dict]): An optional dict for further kwargs passing into the Policy c'tor.
            value_function_spec (list): Neural network specification for baseline.

            exploration_spec (Optional[dict]): The spec-dict to create the Exploration Component.
            execution_spec (Optional[dict,Execution]): The spec-dict specifying execution settings.
            optimizer_spec (Optional[dict,Optimizer]): The spec-dict to create the Optimizer for this Agent.

            value_function_optimizer_spec (dict): Optimizer config for the value function optimizer. If None, the
                optimizer spec for the policy is used (same learning rate and optimizer type).

            observe_spec (Optional[dict]): Spec-dict to specify `Agent.observe()` settings.
            update_spec (Optional[dict]): Spec-dict to specify `Agent.update()` settings.
            summary_spec (Optional[dict]): Spec-dict to specify summary settings.
            saver_spec (Optional[dict]): Spec-dict to specify saver settings.

For example, optimizer_spec can be provided as an Optimizer, but value_function_optimizer_spec needs to be a dict -- the code below assumes this. Also, update_spec is a dictionary and some of its fields are specific to the particular algorithm.

Is there a specific reason for the difference between these parameters?
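
For illustration, a typed spec built on the existing Specifiable class could look roughly like this (the field names are invented, not the actual update_spec contents):

    # Hypothetical typed spec: explicit, documented defaults and auto-complete,
    # instead of a raw dict whose mistakes only fail much later.
    # (Specifiable is the class mentioned above; exact import path omitted.)
    class UpdateSpec(Specifiable):
        def __init__(self, update_interval=4, batch_size=32, sync_interval=1000):
            super(UpdateSpec, self).__init__()
            self.update_interval = update_interval
            self.batch_size = batch_size
            self.sync_interval = sync_interval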

Building policy with continuous action space throws error

Here's a test to demonstrate this:

    def test_policy_for_continuous_action_space(self):
        # state_space (NN is a simple single fc-layer relu network (2 units), random biases, random weights).
        state_space = FloatBox(shape=(4,), add_batch_rank=True)

        # action_space (continuous).
        action_space = FloatBox(low=-1.0, high=1.0, add_batch_rank=True)

        policy = Policy(network_spec=config_from_path("configs/test_simple_nn.json"), action_space=action_space)
        test = ComponentTest(
            component=policy,
            input_spaces=dict(
                nn_input=state_space,
                actions=action_space,
                logits=FloatBox(shape=(2, ), add_batch_rank=True),
                probabilities=FloatBox(add_batch_rank=True)
            ),
            action_space=action_space
        )

        test.read_variable_values(policy.variables)

This test fails with:

self = <rlgraph.components.policies.policy.Policy object at 0x12ebb08d0>
key = '_T0_'
probabilities = <tf.Tensor 'policy/action-adapter-0/Squeeze:0' shape=(?,) dtype=float32>

    @graph_fn(flatten_ops=True, split_ops=True, add_auto_key_as_first_param=True)
    def _graph_fn_get_distribution_entropies(self, key, probabilities):
        """
        Pushes the given `probabilities` through all our distributions' `entropy` API-methods and returns a
        DataOpDict with the keys corresponding to our `action_space`.
    
        Args:
            probabilities (DataOp): The parameters to define a distribution.
    
        Returns:
            FlattenedDataOp: A DataOpDict with the different distributions' `entropy` outputs. Keys always correspond to
                structure of `self.action_space`.
        """
>       return self.distributions[key].entropy(probabilities)
E       KeyError: '_T0_'

[Testing] Start making test cases (and travis setup) backend agnostic.

Problem: We currently auto-test mostly the tf versions of our Components and Agents. PyTorch is not well represented in our test cases, which can lead to uncaught bugs. E.g. our Travis testing container is tf-based and does not even have PyTorch installed.

Solution: Start converting all test cases from the current purely tf-based setup to a more flexible one, where the same test case can be run on both backends. Note that most tests should already work under both backends, but this needs to be - yes - tested.

[Core] Add auto-op-rec-slicing, splitting, merging to API-methods.

Currently: A container op-record (o) in an API-method (e.g. an op-record holding a DataOpTuple) cannot be accessed per item by doing e.g. o[1] for a tuple or o["key-a"] for a dict. Instead, extra components (such as ContainerSplitter, ContainerMerger) need to be tediously added to the parent component and then used in the API-method to do these tasks.

Suggestion: Accessing an op-rec via the []-operator (by index or key) inside an API-method should automatically add the above steps and thus make handling of container op-recs inside an API-method more intuitive, as sketched below.
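
To illustrate the proposed ergonomics (a sketch; this indexing does not work today):

    @rlgraph_api
    def my_api_method(self, container_input):
        # Proposed: index the container op-record directly; the build would then
        # insert the required ContainerSplitter/ContainerMerger steps automatically.
        item_a = container_input["key-a"]
        return self._graph_fn_process(item_a)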

Agent reset() not called before starting new episodes in SingleThreadedWorker

Coming from tensorforce (where agent.reset() is called in the episode loop), and from reading the docs in agents/agent.py, it seems agent.reset() is supposed to be called before starting a new episode. However, it currently does not seem to be called in SingleThreadedWorker nor in RayWorker before new episodes, although the preprocessor stack does seem to be reset explicitly.

It would be nice if you could clarify the purpose of agent.reset() and when it is supposed to be called. Would appreciate some examples.

    def reset(self):
        """
        Must be implemented to define some reset behavior (before starting a new episode).
        This could include resetting the preprocessor and other Components.
        """
        pass  # optional

Refs:
https://github.com/rlgraph/rlgraph/blob/master/rlgraph/agents/agent.py
https://github.com/rlgraph/rlgraph/blob/master/rlgraph/execution/single_threaded_worker.py
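
For reference, the call pattern I would have expected (based on tensorforce; this is a sketch, not necessarily what SingleThreadedWorker currently does):

    num_episodes = 10
    for _ in range(num_episodes):
        agent.reset()  # reset preprocessors and any other per-episode state
        state = environment.reset()
        terminal = False
        while not terminal:
            preprocessed_state, action = agent.get_action(
                states=state, extra_returns="preprocessed_states"
            )
            state, reward, terminal, info = environment.step(action)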

[Core] Improve naming/documentation of IR

Understanding the build process is currently quite difficult because it happens partly in the graph builder, partly in static and non-static parts of Component, and partly in various utils.

We should:

  • Make the purpose of each build op fully clear
  • Fully document the structure of the IR generated by the two builds (potentially revive the visualisation project for this)
  • Clarify the use of build ops in graph functions and API methods -> this did not matter much in static build mode, but it is confusing when, in define-by-run mode, build-time ops are used to pass around data.

[Core] Rename vars with trailing _ for readability

The build procedure and the decorators contain many variable pairs named name/name_, args/args_, etc. This makes the decorators much harder to read than necessary.

We should rename them and clearly identify which is used for what, e.g. inferred_name instead of name_.

[Execution] Episode update mode

Memories allow sampling either episodes or time steps, but the worker only supports time-step-based updates. For variable-length episodes, we need episode-based updating for multi-env updates.

[Core] Sequence option to separate non-terminal episodes for multiple environments

When doing multi-env policy-gradient updates, we have no way of distinguishing
i) terminal episode fragments
ii) non-terminal episode fragments from different environments

In the single-env case this is irrelevant because the terminal marker tells us all we need to know. In the multi-env case, we may want to update from multiple non-terminal fragments from different environments. If we then just artificially mark them as terminal, bootstrapping in GAE is not correct.

The proposed solution would require an additional marker in the memory to distinguish episodes from different environments.

Overall not high priority, because one can just call update externally.

[Execution] Investigate Ray serialisation bug.

Policy weights can no longer be serialised out of the box and are currently wrapped in a RayWeight object. Investigate which object is responsible for the problematic serialisation.

Potentially unpack/unnest/flatten weights before returning from agent API.

[Core] Always use "to" in type conversions/ remove unneeded conversions

We use dtype (now renamed to convert_dtype) from util.py in various places, with "tf" as the default value of the "to" arg.

Maybe we should make the "to" arg non-optional so it is always clear which representation we are converting to (tf, numpy, pytorch)? A sketch is given below.

Otherwise, TF code does not use the "to" arg while all other code has to, which is inconsistent and potentially confusing to read.
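
A sketch of what that could look like (the signature is illustrative, not the current util.py definition):

    import numpy as np

    def convert_dtype(dtype, to):
        # The mandatory `to` arg makes every call site state its target representation.
        if to == "np":
            return np.dtype(dtype)
        if to == "tf":
            import tensorflow as tf
            return tf.as_dtype(dtype)
        if to == "pytorch":
            import torch
            return getattr(torch, dtype) if isinstance(dtype, str) else dtype
        raise ValueError("Unknown conversion target '{}'.".format(to))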

[Documentation] readme example usage throws error

Hi,
While learning rlgraph by running the examples in README.md, we found multiple small typos or API changes that cause errors. It would be great if the example could be updated, so that it is easier for people to try it out.

Here is the modified example:

from rlgraph.agents import DQNAgent
from rlgraph.environments import OpenAIGymEnv

environment = OpenAIGymEnv('CartPole-v0')

# Create from .json file or dict, see agent API for all
# possible configuration parameters.
agent = DQNAgent.from_file(
  "configs/dqn_cartpole.json",
  state_space=environment.state_space, 
  action_space=environment.action_space
)

# Get an action, take a step, observe reward.
state = environment.reset()
preprocessed_state, action = agent.get_action(
  states=state,
  extra_returns="preprocessed_states"
)

# Execute step in environment.
next_state, reward, terminal, info = environment.step(action)

# Observe result.
agent.observe(
    preprocessed_states=preprocessed_state,
    actions=action,
    internals=[],
    next_states=next_state,
    rewards=reward,
    terminals=terminal
)

# Call update when desired:
loss = agent.update()

[Algorithm] Investigate dueling container architectures

Due to problematic performance in gridworld test environments, there is a high probability that dueling networks are not working as intended with container actions. We need to determine how to design dueling architectures for container actions (one dueling set per action?).

[Core] Make components describe their own state.

A key problem in fully unifying internal state management across backends is that _variables() returns internally registered variable references.

When writing in pure Python, raw ints/floats (e.g. buffer indices) are not references, so their internally registered values are not updated and variables() does not return up-to-date values. An example is the ring-buffer class -> variable creation is unified, but reading variables() in the tests is problematic.

A simple solution is that components implement variables() themselves to return all the variables making up their internal state. This would allow returning native Python types, TensorFlow variables, and torch parameters without further wrapping any ops (see the sketch below).
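
A minimal sketch in plain Python (names invented) of a component describing its own state:

    class RingBufferState(object):
        """Toy stand-in for a component that reports its own internal state."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.index = 0     # plain Python int, not a backend variable
            self.records = []  # could equally be a tf.Variable or a torch tensor

        def variables(self):
            # Return everything that makes up the internal state, regardless of
            # whether it is a native Python value or a backend variable/parameter.
            return dict(capacity=self.capacity, index=self.index, records=self.records)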

[Core] Add learning rate decays

Add an option to do either/both of:

i) Use an internal decay mechanism via the global step plus a learning_rate_spec as part of the update_spec
ii) Allow an optional learning_rate parameter in update() to externally manipulate learning rates based on whatever scheme is desired

Slight preference for ii) because it allows easier experimentation with irregular decay schemes (sketched below).
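
A sketch of option ii), using a hypothetical learning_rate keyword (not part of the current Agent.update() signature):

    initial_lr = 1e-3
    decay_every = 10000

    for step in range(100000):
        # Any external schedule, including irregular ones, can drive the optimizer.
        lr = initial_lr * (0.5 ** (step // decay_every))
        loss = agent.update(learning_rate=lr)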

[Core] Investigate potential off-by-one error in ring-buffer

In the following test case:

test_gpu_strategies.py::test_multi_gpu_ppo_agent_learning_test_gridworld_2x2

Variable assignment occasionally fails in the ring buffer, likely because of non-deterministic reads. Investigate the read-write order of all variables.

Allow providing the method to call directly in GraphExecutor.execute()

Currently execute() accepts the method name as a string, which prevents linters from detecting typos or changes due to refactoring. I propose allowing the method to be passed directly, so that linters and auto-complete work.
Current state:

graph_executor.execute("get_policy_weights")

Proposed change:

graph_executor.execute(self.root_component.get_policy_weights)

[ Components ] Policy Component needs API-method cleanup and return value cleanup

The Policy Component needs some cleanup as its API-methods are becoming less and less organized.

  • Some API methods are called "...parameters_log_probs". Log probs are only really returned for discrete action spaces, so the suffix "_log_probs" should be removed from the API method's name entirely and the log-probs should only be returned for categorical distributions (for all others, these "log_probs" are currently actually log(mean), log(stddev), ...).
  • API methods that return the actual log-likelihoods for pdf-type continuous distributions will be better named and organized, and the actual log-likelihood returned for a certain action will use the key "log_likelihood" rather than "log_probs".

[Algorithms] Separate component for root api methods.

@janislavjankov has suggested the following:

The comment I had about the agent's component is that it would look cleaner to me if it were extracted and defined as a separate class - no need to attach the methods within define_graph_api; just have a regular class (extending Component) that can be instantiated there.

So, for example, in the DQNAgent module we could have a class that implements the DQN API as simple Python methods, as sketched below.
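
A rough sketch of what that could look like (the method names and sub-component wiring are invented here; only the structure matters):

    class DQNRootComponent(Component):
        """Carries the DQN API as regular Python methods instead of attaching
        them inside define_graph_api()."""

        def __init__(self, memory, loss_function, optimizer, **kwargs):
            super(DQNRootComponent, self).__init__(scope="dqn-root", **kwargs)
            self.memory = memory
            self.loss_function = loss_function
            self.optimizer = optimizer
            self.add_components(memory, loss_function, optimizer)

        @rlgraph_api
        def update_from_memory(self, batch_size):
            records = self.memory.get_records(batch_size)
            loss = self.loss_function.loss(records)
            return self.optimizer.step(loss)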

[Core] List comprehensions with graph calls don't compile within the API

Here is a test case to demonstrate this:

    def test_call_in_comprehension(self):
        container = Component(scope="container")
        sub_comps = [Dummy1To1(scope="dummy-{}".format(i)) for i in range(3)]
        container.add_components(*sub_comps)

        # Define container's API:
        @rlgraph_api(name="test", component=container)
        def container_test(self_, input_):
            # results = []
            # for i in range(len(sub_comps)):
            #     results.append(sub_comps[i].run(input_))
            results = [x.run(input_) for x in sub_comps]
            return self_._graph_fn_sum(*results)

        @graph_fn(component=container)
        def _graph_fn_sum(self_, *inputs):
            return sum(inputs)

        test = ComponentTest(component=container, input_spaces=dict(input_=float))
        test.test(("test", 1.23), expected_outputs=len(sub_comps) * (1.23 + 1), decimals=2)

The commented-out code above works, while the equivalent list comprehension fails with:

self = <rlgraph.tests.dummy_components.Dummy1To1 object at 0x129f86748>
args = (<rlgraph.utils.op_records.DataOpRecord object at 0x129ee2fd0>,)
kwargs = {}, api_fn_name = 'run'
api_method_rec = <rlgraph.utils.op_records.APIMethodRecord object at 0x129ee2048>
in_op_column = <rlgraph.utils.op_records.DataOpRecordColumnIntoAPIMethod object at 0x129f1b048>
minimum_num_call_params = 1
all_args = [(0, <rlgraph.utils.op_records.DataOpRecord object at 0x129ee2fd0>)]
flex = None, i = 0, key = 0

...

rlgraph.utils.rlgraph_errors.RLGraphError: API-method 'run' must have as 1st parameter (the component) either `root` or `self`. Other names are not allowed!

[Algorithms] Return dict (instead of 2-tuple) from API method: get_action.

The Agent API method get_action should return a dict instead of the current 2-tuple (action, preprocessed_state).
The dict would have the keys "action" and "preprocessed_state". This is already good practice in many of the Policy, RNN, and other classes' API methods, which may sometimes have more complex return structures.
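
Call sites would then look like this (a sketch of the proposed shape):

    out = agent.get_action(states=state)
    action = out["action"]
    preprocessed_state = out["preprocessed_state"]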

compared with tensorforce?

I am using tensorforce, but I see that the development pace there may be slowing a bit. It fits my needs well and works very well, so I am very happy with it and I recommend people give it a look ;) (note: I am a user, in no way a developer there).

Just out of curiosity, what is the difference between tensorforce and rlgraph from an end-user point of view?

Of course I understand that developers of rlgraph may be a bit biased on this question, but I am asking for your opinion nonetheless ;) (especially @michaelschaarschmidt, who seems to work on both? ;) ).

[Example] Add LSTM example

Hi, it would be great if you could provide a working example with LSTM cells. I saw there are tests with LSTMs, but a complete example is missing. Thanks!

[Core] Improve PyTorch dataflow/tensor conversions

Currently, input arguments to the PyTorch executor are always converted to torch tensors. This is not really desirable for e.g. memory inserts or things that could just be executed in native Python, and there are likely some unneeded conversions hurting performance and also causing type-inference problems.

Ideally, we would have an option to tell API methods whether this conversion is needed. The problem is that for TF, everything is auto-converted anyway.

[Core] Create global-step increment op

The global step is created in the executor and used for checkpointing. An increment op would need to be created, e.g. in the generic agent, and called together with act (see the sketch below).
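
A minimal TF1-style sketch (matching the graph-mode backend; where exactly the op lives in the agent is still open):

    import tensorflow as tf

    global_step = tf.train.get_or_create_global_step()
    # Run this op together with the act op so the step counter advances per action.
    increment_global_step = tf.assign_add(global_step, 1)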

[Core] Make plan for TensorFlow 2.0 transition

With tf 2.0 preview builds out, we need to plan how to transition to 2.0.

In particular:

  • new consolidated API via keras
  • tf.function/autograph integration at the graph_fn level
  • scope/variable management

[Execution] Add container action support to ray-worker (already done in single-threaded worker).

Container actions come out of Agent.get_action() as a dict where each key contains a batch of values.

This format needs to be flipped to a list of dicts before sending single actions to an env:

    ret = [{key: value[i] for key, value in ret.items()}
           for i in range(len(ret[next(iter(ret))]))]  # ret is the session-returned dict
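
For concreteness, a small runnable example of the flip (the keys "steer"/"throttle" are made up):

    ret = {"steer": [0.1, -0.2], "throttle": [1.0, 0.5]}
    batch_size = len(ret[next(iter(ret))])
    flipped = [{key: value[i] for key, value in ret.items()} for i in range(batch_size)]
    # flipped == [{"steer": 0.1, "throttle": 1.0}, {"steer": -0.2, "throttle": 0.5}]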

NOTE: The original session output (no flipping) is needed when python-buffering and/or observing via Agent.observe()!

[Core] graph_fn decorator needs an explicit `unsplit_ops` option.

So far, unsplitting is done implicitly whenever split_ops=True. This is normally ok, but there are cases where the output of the graph_fn has nothing to do, in terms of nested structure, with the input to the graph_fn. In these cases it is important to be able to e.g. switch splitting off and unsplitting on.

Example:
The policy has a graph_fn into which we send the nn_input (e.g. a dict state with keys "a" and "b"). This function should then output actions (e.g. non-container, simple actions). So the input (dict state space) does not match the output (flat actions) in terms of nesting structure, and the assumption that the unsplit (of the actions) can be done the same way as the split (of the state space) fails here.

[Execution] SynchronousBatchExecutor

For distributed policy optimization on ray, we need a simple synchronous executor which merges all worker batches and applies them synchronously into one update.

[Core] Design discussion: Refactor component/backend organisation

The current implementations grew out of an experimental design around multi-backend support. The get_backend() checks are undesirable and do not make for readable implementations.

This issue is meant to collect design improvements.

Proposal 1: Components will be reorganised into a base component and backend-specific sub-classes.
A package tensorflow_components/pytorch_components will mirror the folder structure of the base components and contain the specific implementations.

Advantages: Avoids backend checks in implementations; clearly separates backends from interfaces.
Disadvantages: Multiplies the number of components; potentially irritating to see a mirrored folder structure of

components/
       memories/base_memory
       tf_components/memories/tf_memory
       pytorch_components/memories/pytorch_memory

versus keeping everything in one folder, memories/ (which would make imports more difficult for the package):

components/
       memories/
            base_memory
            tf_memory
            pytorch_memory

README example not working in Linux

Hi, I read the rlgraph paper and wanted to give it a try, so I set up and ran a few examples, but the first one breaks for me. Repro:

virtualenv -p python3 venv
source venv/bin/activate
pip install rlgraph
pip install rlgraph[ray]
pip install gym[atari]
pip install tensorflow-gpu
pip install psutil
pip install setproctitle

# Start ray on the head machine
ray start --head --redis-port 6379
# Optionally join to this cluster from other machines with ray start --redis-address=...

# Run script
python apex_pong.py

After ~1 minute it breaks with:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,84,84,4] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[node prioritized-replay/memorynext_states/Assign (defined at /media/bjg/storage/code/rlgraph/venv2/lib/python3.6/site-packages/rlgraph/spaces/box_space.py:192)  = Assign[T=DT_FLOAT, _grappler_relax_allocator_constraints=true, use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prioritized-replay/memorynext_states, prioritized-replay/memorynext_states/Initializer/Const)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Am I doing anything wrong, or is the default example not working on Linux?

Info about my machine:
OS: Ubuntu 18.04.1 LTS
CPU: AMD ThreadRipper
GPU: GeForce 1080ti
RAM: 32gb
VRAM: 11gb

