llnl / abmarl
Agent Based Modeling and Reinforcement Learning
License: Other
Every experiment must be reproducible; otherwise the environments are of little worth for research because we cannot compare runs properly. There needs to be a random number generator that gets a seed for each environment and agent so that one can run identical experiments when needed. The particle environment can be used as an example: it seeds the random number generators for the action spaces as well. This is a huge thing in RL.
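A minimal sketch of the idea, assuming gym-style spaces; `seed_everything` is a hypothetical helper, not an existing Abmarl function:

```python
import numpy as np

# Minimal sketch: derive every source of randomness from one master seed
# so that two runs with the same seed produce identical episodes.
def seed_everything(sim, seed):
    rng = np.random.default_rng(seed)
    sim.rng = rng   # environment-level randomness (resets, dynamics)
    for agent in sim.agents.values():
        # gym spaces carry their own RNGs; the particle environment
        # seeds its action spaces the same way
        agent.action_space.seed(int(rng.integers(2**31)))
        agent.observation_space.seed(int(rng.integers(2**31)))
```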
Support communication in grid-mode observations. This will likely mean that the agents need to have a full observation space, even if they are created with partial observability. Then we need to create "fog" for regions that the agent cannot see. When communication occurs, that fog is "lifted" and the agent can observe that region of space.
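A minimal numpy sketch of the fog idea; the fog value and function names are illustrative:

```python
import numpy as np

# Every agent holds a full-grid observation, fogged outside its view;
# fusing a communicating teammate's observation lifts the fog.
FOG = -1

def fogged_obs(full_grid, agent_pos, view_range):
    obs = np.full_like(full_grid, FOG)
    x, y = agent_pos
    lo_x, hi_x = max(0, x - view_range), min(full_grid.shape[0], x + view_range + 1)
    lo_y, hi_y = max(0, y - view_range), min(full_grid.shape[1], y + view_range + 1)
    obs[lo_x:hi_x, lo_y:hi_y] = full_grid[lo_x:hi_x, lo_y:hi_y]
    return obs

def lift_fog(receiver_obs, sender_obs):
    # wherever the receiver is fogged but the sender is not, lift the fog
    return np.where(receiver_obs == FOG, sender_obs, receiver_obs)
```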
Consolidation of all open efforts related to the predator prey environment.
- Grid mode communication (#6): How do the observations change in grid mode with communications?
- Distance observation mode, see edge (#8): Should the agents see the edge of the map in distance observation mode?
- Stochasticity (#51, #36): Stochasticity in the observations and actions, especially action effectiveness.
- Broadcast communication (#65): All agents within some distance can receive the message, with some randomness. In addition, all agents within some distance (which can be larger than the message distance) can see the broadcaster's location.
Highly-componentized design of environments
PredatorPreyEnv currently processes some interactions between the predators and prey. Each agent can move around on the grid and the predators can attack the prey. The prey can also harvest resources. This could be split up into 3 components: a movement handler, an agent-agent-interaction handler, and an agent-environment-interaction handler. This is a significant redesign, so it should be considered very carefully and not as part of the main development push.
Add the ability to specify agent start positions
Agents are currently just started randomly in the grid. For reproducibility, including in the test suite, we should add the ability to specify the agents' starting locations, as in the sketch below.
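A hedged sketch of what that could look like; the `initial_positions` parameter is hypothetical:

```python
import numpy as np

# Agents with a specified start get it; the rest are placed randomly as today.
def place_agents(agents, region, rng, initial_positions=None):
    initial_positions = initial_positions or {}
    for agent_id, agent in agents.items():
        if agent_id in initial_positions:
            agent.position = np.array(initial_positions[agent_id])
        else:
            agent.position = rng.integers(0, region, size=2)
```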
Agent Health
Each agent is given some health. This health degenerates slowly over time as the agent grows hungrier. The health can also decrease if the agent is attacked. If the health reaches zero, then the agent dies. Health can also increase by eating resources: for prey, that means foraging; for predators, that means eating prey.
In the future, if health increases above some maximum limit, the agent can reproduce.
Attack Strength
We already have attack range, which is the number of squares away at which an attack is effective. Now we want attack strength, which indicates how much health the attacked agent will lose when attacked. Prey also have an attack strength, in the sense of how much of the resource they deplete when they eat it.
Entropy
The amount of health the agent loses each time it takes an action.
Revival
The amount of health the agent will receive when it consumes a resource.
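A sketch tying health, entropy, revival, and attack strength together; class and attribute names are placeholders, not settled API:

```python
from dataclasses import dataclass

@dataclass
class HealthAgent:
    health: float = 1.0
    max_health: float = 1.0
    entropy: float = 0.05         # health lost each time the agent acts
    revival: float = 0.2          # health gained from consuming a resource
    attack_strength: float = 0.3  # health the attacked agent loses
    is_alive: bool = True

def step_health(agent, attacked_by=None, ate_resource=False):
    agent.health -= agent.entropy                 # hunger
    if attacked_by is not None:
        agent.health -= attacked_by.attack_strength
    if ate_resource:
        agent.health = min(agent.health + agent.revival, agent.max_health)
    if agent.health <= 0:
        agent.is_alive = False                    # the agent dies
```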
How can we add agents mid-simulation? Some important factors to think about: the "all done" condition and the way the managers loop over the agents. The components allow agents in `self.agents` that are not acting and not observing agents. However, the manager assumes that every agent in this list is an actual agent (acting and observing; trainable).
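A hedged sketch of the distinction the manager would need to make; the filter is illustrative, not the current manager code:

```python
# The manager could loop only over trainable agents (those with both an
# action space and an observation space), so state-only entries in
# self.agents don't break the cycle or the "all done" check.
def trainable_agents(sim):
    return {
        agent_id: agent for agent_id, agent in sim.agents.items()
        if getattr(agent, 'action_space', None) is not None
        and getattr(agent, 'observation_space', None) is not None
    }
```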
In distance observation mode, how will the agent know that it is near the edge of the region?
In Epic #10
Position observations produce the relative position of other agents in the grid, and they also produce the relative position of boundaries. Perhaps it would be better to break the boundaries observation out into its own channel, separate from the agents (and from the resources).
In Epic #12
Agent classes are really just dictionaries, and they're all pretty much the same. We should modify the dataclass decorator to give agents a `configured` function automatically and use this decorator everywhere we define an Agent, saving us from writing so much boilerplate code.
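A minimal sketch of the decorator idea; we assume the `configured` check simply verifies that every field has been set:

```python
from dataclasses import dataclass, fields

# Wrap dataclass so that every Agent class automatically gets a
# `configured` property (decorator and class names are illustrative).
def agent_class(cls):
    cls = dataclass(cls)
    def configured(self):
        return all(getattr(self, f.name) is not None for f in fields(self))
    cls.configured = property(configured)
    return cls

@agent_class
class MovingAgent:
    id: str = None
    move_range: int = None

assert MovingAgent(id='agent0', move_range=1).configured
assert not MovingAgent(id='agent0').configured
```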
In epic #12
Include Continuous Movement Actor Component from Particle Env.
Movement is represented as velocity in x and y. These agents have a max speed. Velocity is damped, like friction, which we can treat as an entropy on the velocity. Velocity is updated by acceleration according to `dv = a * dt`. Position is then updated by `dp = v * dt`. `dt` is a scaling on the velocity and position updates, which we can normalize away.
Actions can be discrete UDLR, which translates to x, y acceleration vectors with magnitude equal to `move_speed`. Actions can also directly be the x, y acceleration vector.
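A minimal sketch of those dynamics with `dt` normalized to 1; the damping factor and attribute names are illustrative:

```python
import numpy as np

UDLR = {
    0: np.array([0.0, 1.0]),    # up
    1: np.array([0.0, -1.0]),   # down
    2: np.array([-1.0, 0.0]),   # left
    3: np.array([1.0, 0.0]),    # right
}

def process_move(agent, action, damping=0.25, discrete=True):
    # Discrete UDLR maps to an acceleration vector with magnitude
    # move_speed; otherwise the action is the x, y acceleration directly.
    accel = UDLR[action] * agent.move_speed if discrete else np.asarray(action)
    agent.velocity = (1.0 - damping) * agent.velocity + accel   # dv = a * dt
    speed = np.linalg.norm(agent.velocity)
    if speed > agent.max_speed:                                 # cap at max speed
        agent.velocity *= agent.max_speed / speed
    agent.position = agent.position + agent.velocity            # dp = v * dt
```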
In Epic #39
Some components are almost exactly the same, with very small differences. For example, take a look at the attacking components: team-based attack is almost the same as non-team-based attack, just with one extra check on the team. Is there a way to use inheritance or some other code design to capture and reduce this duplication? A sketch follows.
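A hedged sketch of the inheritance idea, not the current component code: the base actor owns the shared logic and the team variant overrides a single validity hook.

```python
import numpy as np

class AttackActor:
    def in_range(self, attacker, target):
        # Chebyshev distance on grid positions (numpy arrays assumed)
        return np.abs(attacker.position - target.position).max() \
            <= attacker.attack_range

    def valid_target(self, attacker, target):
        return target.is_alive and self.in_range(attacker, target)

    def process_attack(self, attacker, agents):
        # return the first valid target, or None if the attack misses
        for target in agents.values():
            if target is not attacker and self.valid_target(attacker, target):
                return target
        return None

class TeamAttackActor(AttackActor):
    def valid_target(self, attacker, target):
        # the only difference: one extra check on the team
        return target.team != attacker.team \
            and super().valid_target(attacker, target)
```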
The life and death component was prototyped in the hackathon. Get that feature into its own component and resolve the bug where all the agents take the same actions. Create an example demonstrating reproduction with this feature.
In epic #12
PredatorPreyEnv currently processes some interactions between the predators and prey. Each agent can move around on the grid and the predators can attack the prey. With #53, the prey can also harvest resources. This could be split up into 3 components: a movement handler, an agent-agent-interaction handler, and an agent-environment-interaction handler. This is a significant redesign, so it should be considered very carefully and not as part of the main development push.
The core of the environment is a grid with agents in cells. So each agent has a position.
Features are then layered onto this grid in the form of wrappers:
Two important questions:
In Epic #10
Using wrappers for each feature is poor design. We are using component-based design instead.
Add support for movement and positions in 3 dimensions.
We should convert the Corridor and MultiCorridor example environments to use components.
Agents can collide with each other and with landmarks.
Sometimes, that collision results in "bouncing" the "entities" away from each other. Collisions occur when agents are within some distance of each other. Keep in mind that agents have sizes that affect the collision calculation.
In Epic #39
Following the initial efforts in #12, we can try to improve the design of the components by framing them in terms of the part of the state that they control. For example, suppose an environment's state can be broken down into the following:
Consider that there are "sub-state handlers" that are made up of certain components. For example, we could have:
Then, we can consider that an AgentBasedSimulation is composed of the "sub-state handlers".
In Epic #12
We need to explicitly indicate when agents should be acting agents, observing agents, or just state agents. For some agent classes, we've combined the actions and observations together. For example, a `SpeedAngle` agent is given a speed and angle action space by the actor and an observation space by the observer. A `LifeAgent` is given an observation space by the observer. However, we sometimes don't want agents to have certain actions and/or observations while still having the state. For example, `VelocityAgent` should split out the acceleration capabilities so that we can take advantage of the velocity state without giving an action space (such as for landmarks/dumb agents). We have a good example of this with resources, where we have separated `HarvestAgent` and `ResourceObservingAgent` because we don't want Predators to harvest but we do want them to observe resources. We should do this for all agents.
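A minimal sketch of the split: state-only base classes with acting and observing capabilities mixed in separately. `Mover` and `Landmark` are illustrative names; `VelocityAgent` and `AcceleratingAgent` come from the discussion above.

```python
class VelocityAgent:                 # state only: fine for landmarks/dumb agents
    def __init__(self, max_speed=1.0, **kwargs):
        super().__init__(**kwargs)
        self.max_speed = max_speed

class AcceleratingAgent(VelocityAgent):   # adds the acting capability
    def __init__(self, max_acceleration=0.25, **kwargs):
        super().__init__(**kwargs)
        self.max_acceleration = max_acceleration

class VelocityObservingAgent:        # adds the observing capability
    pass

class Mover(AcceleratingAgent, VelocityObservingAgent):
    """Trainable agent: acts and observes."""

class Landmark(VelocityAgent):
    """Keeps the velocity state without any action space."""
```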
It would be nice to have wrappers that can convert gym and MultiAgent environments to our AgentEnvironment interface so that all the wrappers can be used. We should also have wrappers to go back out to those types.
An important question to answer: an Abmarl simulation is composed of Agents in an Agent Based Simulation managed by a Simulation Manager. In a gym environment, the conversion to an ABS is simple because there is only a single agent; however, the other environment types allow multiple agents, so the mapping is not as clear.
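For the single-agent direction, a minimal sketch might look like the following, assuming the classic gym `reset`/`step` API and Abmarl's getter-style simulation interface; the wrapper and its names are illustrative, not an existing Abmarl class:

```python
class GymToABS:
    def __init__(self, gym_env, agent_id='agent'):
        self.env = gym_env
        self.agent_id = agent_id   # the single gym agent maps to one ABS agent

    def reset(self):
        self._obs = self.env.reset()

    def step(self, action_dict):
        self._obs, self._reward, self._done, self._info = \
            self.env.step(action_dict[self.agent_id])

    def get_obs(self, agent_id, **kwargs):
        return self._obs

    def get_reward(self, agent_id, **kwargs):
        return self._reward

    def get_done(self, agent_id, **kwargs):
        return self._done

    def get_all_done(self, **kwargs):
        return self._done   # single agent: its done is the episode's done

    def get_info(self, agent_id, **kwargs):
        return self._info
```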
TensorFlow cannot work with integer Boxes because TensorFlow's random uniform op cannot do the necessary broadcast with integers.
The current workaround is to use `np.float` offset by 0.5, have the component floor the input/output, and convert the array to int with `ndarray.astype(int)`. A potential fix could be a component wrapper.
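One reading of that workaround as code, a sketch rather than the actual component; we assume the 0.5 offset shifts the integer bounds up so that flooring recovers the original range:

```python
import numpy as np
from gym.spaces import Box

def int_box_as_float(low, high, shape):
    # float Box covering [low + 0.5, high + 0.5]
    return Box(low + 0.5, high + 0.5, shape, np.float32)

def floor_to_int(sample):
    # floor the input/output and convert with ndarray.astype(int)
    return np.floor(sample).astype(int)
```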
The Monte Carlo algorithms only work with the gym environment interface, which is made explicit in #62. We should enable multi-agent Monte Carlo learning. We can approach it in the following way:
This can be broken into Speed and Angle agents. Then we can merge the accelerating part with AcceleratingAgent below and the actors and states can determine how to use that information (velocity updater vs speed-angle updater).
Originally posted by @rusu24edward in #50 (comment)
Build components for continuous movement and build an example environment.
Add the communication component.
In Epic #12
The whole thing is more complex than how I handled it so far... if we do it properly we would have to do all these things in this picture... so 3 steps: get position without collision, rollback to point where the circles collide, and move the rest of the way in the right direction... if we don't do the last step, the traveled distance between two timesteps would be shortened...
There might also be a case where we don't even detect a collision because maybe they move perpendicular and after the timestep it looks as though they just passed through each other... so if we want to make it super correct we would have to calculate rays or something... very annoying... on the other hand we could just ignore the cases where objects pass through each other and stop after undoing the overlap as we are doing so far...
we can also just ignore it and make smaller steps so the overlaps are much smaller. I think this is probably the best solution, either smaller time intervals or lower velocities. I would do smaller time intervals for smoother movement but make the agent decision only after every couple of frames (this is also done in Atari games for example where a decision is only made every 4th frame).
This looks a lot like what we do. They completely ignore overlap and it looks nice in the gif
https://github.com/xnx/collision
This guy describes the continuous collision detection and finding the point where two objects collide.
https://www.toptal.com/game/video-game-physics-part-ii-collision-detection-for-solid-objects
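A minimal sketch of the overlap-undo approach discussed above: move first, then push overlapping circles apart along the line of centers. No rollback to the exact contact time and no ray casting for tunneling; smaller time intervals keep the overlaps small. Attribute names are illustrative.

```python
import numpy as np

def resolve_overlap(a, b):
    delta = b.position - a.position
    dist = np.linalg.norm(delta)
    overlap = a.size + b.size - dist
    if overlap > 0 and dist > 0:
        push = (delta / dist) * (overlap / 2)
        a.position = a.position - push   # push the entities apart symmetrically
        b.position = b.position + push
```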
Add communication channels to the environment. Things to consider sharing:
It currently just prints out a failure:

```
>>> abmarl
Traceback (most recent call last):
  File "/Users/rusu1/.virtual_envs/abmarl/bin/abmarl", line 11, in <module>
    load_entry_point('abmarl', 'console_scripts', 'abmarl')()
  File "/Users/rusu1/abmarl/abmarl/scripts/scripts.py", line 33, in cli
    path_config = os.path.join(os.getcwd(), parameters.configuration)
AttributeError: 'Namespace' object has no attribute 'configuration'

>>> abmarl --help
usage: abmarl [-h] {train,analyze,play} ...

Train, analyze, and play MARL policies.

positional arguments:
  {train,analyze,play}
    train     Train MARL policies
    analyze   Analyze MARL policies
    play      Play MARL policies

optional arguments:
  -h, --help  show this help message and exit

Example usage for training:
    abmarl train my_experiment.py

Example usage for analysis:
    abmarl analyze my_experiment_directory/ my_analysis_script.py

Example usage for playing:
    abmarl play my_experiment_directory/ --some-args
```
Some of the components' instance checks can be loosened a bit, and we should explore the best way to do this.
Box observations must be np.array to work with RLlib. This is pretty straightforward when the array has multiple elements, but it is not intuitive when there is only a single output. Rather than changing all scalar values to 1-element arrays, we can just create a wrapper that does the conversion for us.
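A hedged sketch of such a wrapper, assuming dict observations; this is illustrative, not an existing Abmarl wrapper:

```python
import numpy as np

class ScalarToArrayWrapper:
    def __init__(self, observer):
        self.observer = observer

    def get_obs(self, agent, **kwargs):
        obs = self.observer.get_obs(agent, **kwargs)
        # pass everything through unchanged except scalars, which become
        # 1-element arrays so RLlib's Box handling works
        return {
            key: np.asarray([value]) if np.isscalar(value) else value
            for key, value in obs.items()
        }
```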
Add stochasticity to the observations and actions. Observations can be filtered through Bernoulli distributions whose success probabilities are correlated with distance. For attack actions, we want to correlate effectiveness with distance and with the number of observable prey.
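A minimal sketch of what that filtering could look like; the linear decay is an assumption, since the issue only asks that the probabilities correlate with distance and prey count:

```python
def is_observed(rng, distance, view_range):
    p = max(0.0, 1.0 - distance / (view_range + 1))
    return rng.random() < p          # Bernoulli draw: farther means less likely

def attack_succeeds(rng, distance, attack_range, num_observable_prey):
    p = max(0.0, 1.0 - distance / (attack_range + 1))
    return rng.random() < p / max(1, num_observable_prey)
```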
In epic #10
Make a Read the Docs page with all the documentation and add a highlights page that showcases what people have done using this software.
If the actor does not receive an input, it should handle the null case itself rather than requiring the environment to do it. For example, instead of `actor.process_move(agent, action.get('move', np.zeros(2)))`, we could just do `actor.process_move(agent, action.get('move'))`, and the `process_move` function can have logic for dealing with no move.
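A minimal sketch of that convention; the actor class is illustrative:

```python
import numpy as np

class MoveActor:
    def process_move(self, agent, move):
        # the actor owns the null case, so callers can pass
        # action.get('move') without a default
        if move is None:
            move = np.zeros(2)   # no input means no movement
        agent.position = agent.position + move
```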
Fundamentals are position, team, and life. These attribute agents should just be one class, and we default the parameters as needed.
It is not entirely clear what GridPositionComponent does. It keeps track of agents' positions, yes, but then it also has some capabilities of generating observations of other agents based on that position. As a result, if we want to change how observations are made, then we need a new class, or a subclass. This becomes more ambiguous when we start to think about obstacles in the grid.
Furthermore, we still struggle to define how env features should be observed in conjunction with the agents' positions, such as resources.
Network model. Each node has a set of preferences--likes and dislikes. Each node can also choose to share something, either its own opinion or something that it hears from its neighbors. Nodes can choose to break links and add links (dynamics are simpler than voting model because the node doesn't change teams). Nodes might choose to do so based on what they hear from various agents. For example, if a node always hears something it doesn't like from another node, then it may choose to break that link and select a new one from the network.
The idea here is that we can study the polarization of a population under different reward functions.
Extensions:
In Epic #2
Design of the movement component is a bit ambiguous. On the one hand, it seems like the movement component should take the position and movement and output the desired new location, without constraints. Then, the environment can decide what to do about the movement output (e.g. if the new location is out of bounds, then don't move there). Currently, movement component does this constraint internally, checking for out of bounds movement, which it can do because it has the region. But as we add more constraints, such as landmarks and other agents' locations, the movement component will need to be updated to handle all these constraints. The question is: does it make sense for movement components to handle this internally, or is it better for movement to just process the desired location and let the environment or another component process the constraints? This has parallel considerations with the attack component with the question of processing the agents' health.
In epic #12
Attack Types
- Binary: `Box(0, 1, (2 * attack_range + 1, 2 * attack_range + 1), int)`, a binary mask over the N = (2 * attack_range + 1)^2 cells in the attack window.
- Without replacement: with i going from 1 to K, the action space would be `MultiDiscrete(N+1, N, N-1, ...)`, up to K entries. Each element in the attack vector would be converted to a grid cell or no attack.
- With replacement: there are (N+1)^K choices, where we have N+1 because the agent can choose not to use one of its K attacks. The action space would be `MultiDiscrete(N+1, N+1, N+1, ...)`, up to K entries. Each element of the attack vector would be converted to a grid cell or no attack.
- Attack mapping: the `attack_mapping` parameter specifies which agent types can attack other agent types. We use this in the simulation dynamics already, and I think I can make an attack actor that makes each channel explicit. This would use the binary attack logic under the hood.
Variations
Each of the attack types above can have variations. For example,
We can have many more variations, but this is a good starting point.
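A minimal sketch of the action spaces named above, built with gym spaces; `attack_range` and `K` are illustrative values, and N is the number of cells in the attack window:

```python
from gym.spaces import Box, MultiDiscrete

attack_range, K = 2, 3
N = (2 * attack_range + 1) ** 2

binary_mask = Box(0, 1, (2 * attack_range + 1, 2 * attack_range + 1), int)
without_replacement = MultiDiscrete([N + 1 - i for i in range(K)])  # N+1, N, N-1, ...
with_replacement = MultiDiscrete([N + 1] * K)                       # (N+1)^K choices
```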
Design
Position state is a bit clumsy right now. We have a Grid representation and two Continuous representations, one of which uses velocity and another which uses speed and angle. In addition, the corridor example can reference agents based on their positions by mapping from cell to agent object. NetLogo unifies position between grid and continuous spaces by tracking an agent's continuous position and mapping that agent to a grid cell based on its position. We should do this too. We should also do the mapping that corridor has.
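A minimal sketch of the NetLogo-style unification, assuming the continuous position is the single source of truth and the grid cell (plus a cell-to-agents lookup like corridor's) is derived from it; names are illustrative:

```python
import numpy as np

def grid_cell(position):
    return tuple(np.floor(position).astype(int))

def build_cell_map(agents):
    cell_map = {}
    for agent in agents.values():
        cell_map.setdefault(grid_cell(agent.position), []).append(agent)
    return cell_map
```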
The plume has a concentration source with wind that pushes the concentration in a certain direction. This is set at the beginning of the episode and never modified.
Parameters: strength, diffusion factors in y and z, noise, and upper and lower bounds for the concentrations.
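A hedged sketch only: the issue names the parameters but not the formula. A Gaussian-style profile consistent with those parameters might look like:

```python
import numpy as np

def concentration(y, z, strength, sigma_y, sigma_z, rng, noise=0.0,
                  lower=0.0, upper=1.0):
    # Gaussian spread in y and z (the formula is our assumption)
    c = strength * np.exp(-y**2 / (2 * sigma_y**2) - z**2 / (2 * sigma_z**2))
    c += rng.normal(0.0, noise) if noise > 0 else 0.0
    return float(np.clip(c, lower, upper))   # clamp to the concentration bounds
```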
In Epic #39
Attack strength is specified by `AttackingAgent`, but it is not used by the corresponding component. It is used by the environment, which can modify the agents' health based on the attacking agent's strength. Perhaps we should add some attacking components that have this feature?
I think in total there are 4 cases to be considered.
2 and 4 are variants of the existing behavior in predator prey with the difference that in one of them the step size can be learned as well (it’s like a cheap way to regulate velocity).
1 and 3 however both operate on a more “realistic” basis in that they let agents move and interact with other objects in a continuous space which makes sense for collisions and such things. I think both of 1+2 are common ways of controlling movement and if possible we should have them both.
`AgentObservingAgent` allows the observer to mask the agent's observation through the `agent_view` parameter, resulting in partial observability. This is an awkward design, and we should try to rework it. Perhaps we can use composition through wrappers on the observation component itself?
Bring over the Particle Plume env from the dance repo to this repo. Create all the necessary components and the examples scenarios.
Supported actions and features in particle-plume:
Components breakdown:
Scenarios:
Broadcasting agent with a broadcast range. Agents within this range will receive the message. If the receiving agent is on the same team, its observation will fuse the observation from the broadcasting agent as appropriate. For example, if the broadcasting agent sends position, team, and life information but the receiving agent only observes position and life channels, then the receiving agent will only fuse the channels it supports. If the receiving agent is on another team, then it will only receive the state information of the broadcasting agent, again only in the observation channels that it supports.
Actor: agent can choose to broadcast
State: which agent is broadcasting this step
Observer: Update obs as described above. This will be implemented as a wrapper, and should go after partial observation wrappers.
Does not support "grid" observation style. See #5, #6, #19 for thoughts there.
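A minimal sketch of the fusion rule, assuming dict-of-channels observations with an "unknown" placeholder; the names and placeholder value are assumptions. For a receiver on another team, `incoming` would carry only the broadcaster's own state rather than its full observation.

```python
import numpy as np

UNKNOWN = -1

def fuse_broadcast(receiver_obs, incoming):
    for channel, values in incoming.items():
        if channel in receiver_obs:   # fuse only channels the receiver supports
            mine = receiver_obs[channel]
            receiver_obs[channel] = np.where(mine == UNKNOWN, values, mine)
    return receiver_obs
```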
In Epic #10
Support that some teams can/cannot attack other teams.
It might be nice to give states a "render" function that allows them to turn the state into an image. This is what we were doing before #26, and the design we had hinted at what is necessary for all the renderers to work together to create a single composite image. Currently, the examples just render per use case.
Agents can share observations in DistanceMode using the CommunicationWrapper.
Support agents sharing their observations with other agents. Some thoughts on how to do this:
Based on the above ideas, there are two approaches that make the most sense to me:
+ -------- + ------------------------------- + ------------------ + ------------------------ + -------------------- +
| timestep | agent0 obs | agent0 action | agent1 obs | agent1 action |
+ -------- + ------------------------------- + ------------------ + ------------------------ + -------------------- +
| 0 | augmented_obs[] | request agent1 obs | incoming_request[] | some_action |
| 1 | augmented_obs[] | some_action | incoming_request[agent0] | fulfill/deny request |
| 2 | augmented_obs[agent1 fulfilled] | some_action | incoming_request[] | some_action |
+ -------- + ------------------------------- + ------------------ + ------------------------ + -------------------- +
Receiving messages does directly affect the agent's own observation space.
+ -------- + ------------------------------- + ------------------ + ------------------------------ + -------------------- +
| timestep | agent0 obs | agent0 action | agent1 obs | agent1 action |
+ -------- + ------------------------------- + ------------------ + ------------------------------ + -------------------- +
| 0 | broadcasted_messages[] | broadcast message | broadcasted_messages[] | some_action |
| 1 | broadcasted_messages[] | some_action | broadcasted_messages[agent0] | trust/ignore message |
| 2 | broadcasted_messages[] | some_action | broadcasted_messages[] | some_action |
+ -------- + ------------------------------- + ------------------ + ------------------------------ + -------------------- +
Sending messages does not directly affect the agent's own observation space. The agent's observation space is affected by others sending it messages.
In Epic #4
Dead agents can still be observed, which we don't want since they should be effectively removed from the simulation. We cannot remove them from the agent dict without doing a deepcopy at each reset. The observers right now force the agent to be alive in order to be observed. What if we just set their position to None so that the observers ignore them?
Include landmarks in the environment.
Landmarks have a position and a velocity (and a color for rendering). Some may also have a specific id/type that the agent needs to observe if it needs to go to a specific landmark. They are similar to agents, but they don't decide actions. Agents can collide with landmarks.
Is the landmark's movement/position affected by a collision with the agent? Or is it treated as having "infinite mass" so that only the agents bounce off of it? What happens if two landmarks collide?
In Epic #39
This epic ticket tracks what example simulations we implement in Abmarl. Here is the list, each new game should spawn a new ticket in this epic: