
flexible_arm's Introduction

Safe Imitation Learning of Nonlinear Model Predictive Control for Flexible Robots

Method

(Figure: method overview.)

Nonlinear MPC performance

nmpc_performance.mp4

The proposed method vs NMPC

nmpc_vs_method.mp4

Installation of acados:

Install acados by following the official instructions: https://docs.acados.org/python_interface/index.html

Imitation Library Fork (submodule):

The current (21 August 2023) version of the imitation library does not yet support Gymnasium, so we use our own fork with the necessary modifications.

After cloning this repo:

  • git submodule init
  • git submodule update
  • cd imitation
  • pip install -e .

Hyperparameters of the IL, RL and IRL algorithms

| Hyper-parameter | Value |
| --- | --- |
| COMMON: Learning Rate | 0.0003 |
| COMMON: Number of Expert Demos | 100 |
| COMMON: Number of Training Steps | 2,000,000 |
| PPO: Net. Arch. | pi: [256, 256], vf: [256, 256] |
| PPO: Batch Size | 64 |
| SAC: Net. Arch. | pi: [256, 256], qf: [256, 256] |
| SAC: Batch Size | 256 |
| BC: Net. Arch. | pi: [32, 32], qf: [32, 32] |
| BC: Batch Size | 32 |
| DAgger: Online Episodes | 500 |
| Density: Kernel Type | Gaussian |
| Density: Kernel Bandwidth | 0.5 |
| Density: Net. Arch. | pi: [256, 256], qf: [256, 256] |
| GAIL: Reward Net Arch. | [32, 32] |
| GAIL: Policy Net Arch. | pi: [256, 256], qf: [256, 256] |
| GAIL: Policy Replay Buffer Capacity | 512 |
| GAIL: Batch Size | 128 |
| AIRL: Reward Net Arch. | [32, 32] |
| AIRL: Policy Net Arch. | pi: [256, 256], qf: [256, 256] |
| AIRL: Batch Size | 128 |
| AIRL: Policy Replay Buffer Capacity | 512 |

NMPC parameters

| Parameter | Value |
| --- | --- |
| Hessian approximation | Gauss-Newton |
| SQP type | real-time iterations |
| $\Delta t$, $N$, $n_\mathrm{seg}$ | $5$ ms, $125$, $3$ |
| $Q$ weights $w_{q_a}$, $w_{\dot q_a}$, $w_{q_p}$, $w_{\dot q_p}$ | $0.01,\; 0.1,\; 0.01,\; 10$ |
| $P_N$ | $\mathrm{diag}([1,1,1,0,0,0])\cdot 10^4$ |
| $P$ | $\mathrm{diag}([1,1,1,0,0,0])\cdot 2\cdot 10^3$ |
| $R$ | $\mathrm{diag}([1,10,10])$ |
| $S$, $s$ | $\mathrm{diag}([1,1,1]\cdot 10^6)$, $[1,1,1]^\top\cdot 10^4$ |
| $\delta_\mathrm{ee}$, $\delta_\mathrm{elb}$, $\delta_\mathrm{x}$ | $0.01\,\mathrm{m}$, $0.005\,\mathrm{m}$, $0\cdot 1_{n_x}$ |
| $\overline{\dot{q}_a}=-\underline{\dot{q}_a}$ | $[2.5, 3.5, 3.5]^\top\,\mathrm{s}^{-1}$ |
| $\overline{u}=-\underline{u}$ | $[20,10,10]^\top$ Nm |

Safety Filter parameters

| Parameter | Value |
| --- | --- |
| $\Delta t_\mathrm{SF}$, $N_\mathrm{SF}$, $n_\mathrm{seg}$ | $10$ ms, $25$, $1$ |
| $\bar{R}$ | $\mathrm{diag}([1,1,1])$ |
| $R_\mathrm{SF}$ | $\mathrm{diag}([1,1,1])\cdot 10^{-5}$ |

flexible_arm's People

Contributors

Erfi, shamilmamedov, RudolfReiter


flexible_arm's Issues

Paper structure

Hi, here I would like to discuss the paper structure. Here is what I propose

  1. Introduction

    1. Problem formulation and contributions

      Given a flexible robot with nonlinear dynamics and stiff equations of motion, design a high-bandwidth controller that can reach any point in the robot's workspace from any other point while avoiding an obstacle.

    2. Related work

  2. Safe approximate NMPC

    Here we lay down our hypothesis without going into details of NMPC, IL or SF

  3. Setup

  4. Simulation experiments

    1. NMPC
    2. SF
    3. IL + network
    4. baseline models
    5. task
    6. results
  5. Conclusions

  6. Appendix

Please let me know what you think about it.

Timing the inference process for different algorithms.

Description:
We would like to time the inference of the different algorithms and report them relative to each other.

Implementation Suggestion:
Simple: since the success or failure of the algorithm is not the concern here, we could have a script that sets up an environment, runs each algorithm for N steps (1000?), and reports the average inference time per step.
Over-engineered: create a TimeIt wrapper and wrap each algorithm with it. The wrapper adds a moving average (or just an array) to the policy and records a measurement every time the predict function is called.

I would go with the simple solution.
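
A minimal sketch of the simple approach, assuming a Stable-Baselines3-style `predict(obs)` interface and a Gymnasium env; the helper name `time_policy` is illustrative, not the repository's actual API:

```python
import time
import numpy as np

def time_policy(policy, env, n_steps=1000):
    """Run the policy for n_steps and return the mean wall-clock time per predict() call."""
    obs, _ = env.reset()
    durations = []
    for _ in range(n_steps):
        start = time.perf_counter()
        action, _ = policy.predict(obs)  # SB3-style predict interface (assumption)
        durations.append(time.perf_counter() - start)
        obs, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            obs, _ = env.reset()
    return float(np.mean(durations))

# Usage (illustrative):
# for name, policy in {"ppo": ppo_policy, "bc": bc_policy, "nmpc": mpc_policy}.items():
#     print(name, time_policy(policy, make_env()))
```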

Pipeline for training AIRL for imitation learning (Evaluation IRL)

Description:
GAIL (which learns an imitation policy but not the reward function) has not performed well out of the box. Hence we should try AIRL, which is also trained in an adversarial fashion but additionally aims to recover the reward function. Let's see if AIRL performs better than GAIL.

Tuning of the RL and IRL algorithms

Description:
The RL and IRL algorithms need tuning to perform well (especially the adversarial ones). We need to put in some time to tune them and see whether they perform well enough to be used as baselines.

Acceptance Criteria:
Tune the hyperparameters such that the algorithms at least act reasonably, so they can be used as baselines. Re-run the trained models and check visually how they perform. Record their hyperparameters.

Visualizations / Plots

Description:
To help us decide how to run the evaluations and which data to save during training/evaluation, we should decide which plots we would like to see in the paper.

Training RL with safety filter

For completeness, and to address one of the reviewers' comments, we should also train RL with the safety filter in the loop.

@Erfi: Where in the current setup can I add the safety filter (which is essentially an MPC) after the RL policy? Can you point that out, or prepare the evaluation/data collection routine so that it accepts this module after the RL policy?

The goal sphere and the robot end-effector do not completely align

Description:
The MPC can control the arm and bring it to the goal, but the end-effector is always slightly misaligned with the goal sphere. The misalignment is consistent in direction, as if the end-effector were slightly longer.

Suggestions:
No idea yet, but the goal coordinates set by the env and received by the controller are the same. Perhaps it is a URDF issue?
Is the sphere's coordinate defined at its center?

Acceptance Criteria:
Run the data collection script (tests/test_mpc_data_collection.py) and see the controller bring the arm's end-effector exactly to the goal sphere.

Saving and Loading models

Description:
Training the models is quite time consuming, so it makes sense to save them and load them for evaluation. This is also important because we can then evaluate the same trained model in environments with different start/end points to check generalization and robustness.

Implementation:
Since all models are based on PyTorch, saving and loading should be done with the PyTorch framework.
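
A minimal sketch of plain PyTorch checkpointing; the actual policies may instead use their library-specific save/load helpers, and the paths/function names below are illustrative:

```python
import torch

def save_policy(policy, path="checkpoints/bc_policy.pt"):
    # Persist only the learnable parameters; the architecture is rebuilt in code.
    torch.save(policy.state_dict(), path)

def load_policy(policy, path="checkpoints/bc_policy.pt"):
    policy.load_state_dict(torch.load(path))
    policy.eval()  # switch to evaluation mode before rollout
    return policy
```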

Acceptance Criteria:
In the test scripts, have a flag that selects between training a model and loading one prior to evaluation. Both should work, and the model should be saved after training and before evaluation.

MPC does not work when using the estimator

Description:
The MPC controller does not seem to do the right thing when we are using the estimator in the tests.test_mpc_data_collection.py file. When the estimator is set to None everything seems to work fine.

Acceptance Criteria:
Change the tests.test_mpc_data_collection.py file to use the estimator in the env.
Run the data collection with python -m tests.test_mpc_data_collection and verify that the real-time simulation can control the robot arm using the estimator.

Goal position should be visible in the GUI as a sphere

Description:
Both for the purpose of debugging and for aesthetic reasons, the desired goal position should be visible.

Implementation:
Make a call within reset() to the simulation and create the goal sphere at the position sampled from the observation space.
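
A sketch of what that call could look like, assuming a PyBullet-based renderer with an already-connected client (if the simulator differs, the equivalent visual-marker call applies); the function name is illustrative:

```python
import pybullet as p

def draw_goal_sphere(goal_xyz, radius=0.02, rgba=(1.0, 0.0, 0.0, 0.7)):
    """Create a purely visual (collision-free) sphere at the sampled goal position."""
    visual_id = p.createVisualShape(p.GEOM_SPHERE, radius=radius, rgbaColor=list(rgba))
    return p.createMultiBody(baseMass=0,
                             baseVisualShapeIndex=visual_id,
                             basePosition=list(goal_xyz))
```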

Acceptance Criteria:
A call to env.reset() + env.render() should show the desired goal as a sphere (preferably in a clearly visible color).

Add the wall to the URDF tree

Description:
There is (or should be) a wall in the simulation near the goal point. If the objective is not to hit the wall, it makes sense to draw the wall as part of the URDF, both for debugging and for visualization purposes.

Implementation:
Change the URDF tree to include a wall in the correct place.

Acceptance Criteria:
Running the simulation (e.g. tests/test_mpc_data_collection.py) should show the wall as well as the robot arm in the environment.

Add parametric wall

The parametric wall should be part of the observation, the sampling, the MPC and the safety filter.
The wall parameters are exactly two 3D vectors: a vector from the origin to the wall and the normal vector pointing towards the feasible region.

@shamilmamedov, @Erfi: Please think about where to integrate it.
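
One way to encode the two vectors and evaluate the corresponding halfspace constraint; a sketch with illustrative variable names, not code from this repo:

```python
import numpy as np

def wall_constraint(point, wall_origin, wall_normal):
    """Signed distance of `point` to the wall plane.

    wall_origin: 3D vector from the world origin to a point on the wall.
    wall_normal: normal vector pointing towards the feasible region.
    Returns >= 0 when the point is on the feasible side.
    """
    n = wall_normal / np.linalg.norm(wall_normal)
    return float(n @ (np.asarray(point) - np.asarray(wall_origin)))

# The same expression can serve as an inequality constraint h(x) >= delta in both
# the NMPC and the safety filter, and the six numbers (wall_origin, wall_normal)
# can be appended to the observation vector.
```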

Addressing comments of L4DC reviewers

I would like to discuss several comments and recommendations provided by the reviewers of L4DC. I believe it's crucial that we carefully consider these points as we revise the paper for ICRA. I will briefly outline the relevant comments so that we can discuss how to address them in the ICRA paper.

Reviewer 1

  1. "Moreover, there are no insights offered on how the training is performed, what kind of NN is used, how many layers, what kind of activation functions, etc." Indeed, experimenting with various architectures, optimizers, and hyperparameters to find the best combination for approximating NMPC might be valuable for the community. @Erfi In your pipeline, is it easy to experiments with those variables?
  2. "It would be more interesting if the authors trained the NN with the safe filter." Although it is possible, backpropogating through safety-filter NMPC would be very slow. IMHO, we should leave it for the future.
  3. "Main comparison metric (computation time) is not fair, for NN what is more important is generalizability and success rate". How can we show the generalizability? During training and test shall we sample configurations from different sets? @Erfi How do people in RL show the generalizability?

Reviewer 2

  1. "Just one experiment and just one manipulator." Introducing a new robot setup, such as a flexible cart-pendulum, would involve a substantial time investment. Moreover, implementing NMPC for this new configuration would require even more time. @Erfi @RudolfReiter In your opinion does it worth setting up new environment/robot?
  2. "Lack of comparison of imitating the NMPC policy directly versus imitation of dynamics + simple MPC on the learned dynamics." Potentially, NMPC with approximated dynamics by LSTM might be faster. But generating rich enough data, training LSTM and implementing NMPC for LSTM model will take considerable amount of time.

Looking forward to your opinions on the reviewers' comments.

Goal position communication between MPC and FlexibleArmEnv (set-to-set)

Description:
The goal position can be selected randomly within the FlexibleArmEnv class via a call to reset() if qa_range_end is set to values other than zero inside FlexibleArmEnvOptions. The MPC controller's goal is set using controller.set_reference_point(...) prior to use, so it is unaware of changes to the goal after each call to env.reset(). This means the controller will drive the robot to its old reference point even if the goal has changed. How should we implement the connection between the two?

Possible Implementation:

  • Extend/modify the imitation library's rollout.rollout function to allow passing information between the env and the policy.
  • Call rollout.rollout in a loop and collect only one trajectory per call; before each loop, reset the goal and pass it to the MPC (hack; see the sketch after this list).
  • Extend the observation space of the environment to include the goal. This is highly related to #9.
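
A minimal sketch of the loop-based option, with hypothetical names (`info["goal"]`, `controller.compute_torques`) standing in for whatever the env and the MPC actually expose:

```python
import numpy as np

def collect_demo(env, controller, max_steps=500):
    obs, info = env.reset()                       # reset() samples a new goal
    goal = info["goal"]                           # hypothetical: env reports the goal in `info`
    controller.set_reference_point(goal)          # keep the MPC reference in sync with the env
    observations, actions = [obs], []
    for _ in range(max_steps):
        action = controller.compute_torques(obs)  # hypothetical controller call
        obs, _, terminated, truncated, _ = env.step(action)
        observations.append(obs)
        actions.append(action)
        if terminated or truncated:
            break
    return np.array(observations), np.array(actions)
```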

Acceptance Criteria:
One should be able to run the data collection script (test_mpc_data_collection.py) with a non-zero qa_range_end and see the MPC drive the robot arm to the correct goal position.

Covariance propagation for safety filter

I think it would be quite reasonable to propagate the model uncertainty along the last linearized solution of the safety-filter trajectory w.r.t. the end-effector and elbow positions, in order to define chance constraints.
Otherwise the safety bound would need to be quite conservative.
@shamilmamedov: What do you think?
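
A sketch of discrete-time linear covariance propagation along the linearized trajectory, which could then be used to tighten the end-effector/elbow constraints; the symbols are standard EKF-style quantities, not code from this repo:

```python
import numpy as np

def propagate_covariance(A_list, P0, W):
    """P_{k+1} = A_k P_k A_k^T + W along the linearized safety-filter trajectory."""
    covariances = [P0]
    for A in A_list:
        covariances.append(A @ covariances[-1] @ A.T + W)
    return covariances

# Chance-constraint tightening for a position constraint n^T p >= d:
# backoff_k = z * sqrt(n^T @ J @ P_k @ J.T @ n), where J maps the state covariance
# to the end-effector position and z is the quantile factor of the chance level.
```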

Evaluation Procedure

We need a script or a workflow to take one of the configurations

  • RL + environment
  • MPC + environment
  • safety filter + environment

with options

  • random wall / fixed wall
  • sampling close to wall / sampling randomly in feasible region

and collect the following data

  • computation time of algorithm
  • constraint violation of wall
  • time to reach epsilon region
  • closed loop cost as defined in MPC (Q,R)
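
A minimal sketch of such an evaluation loop; all names (`info` keys, `predict`) are placeholders standing in for whatever the final script uses:

```python
import time
import numpy as np

def evaluate(policy, env, n_episodes=20, eps=0.01):
    stats = {"compute_time": [], "wall_violation": [], "time_to_eps": [], "closed_loop_cost": []}
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, t, reached = False, 0, None
        while not done:
            start = time.perf_counter()
            action, _ = policy.predict(obs)                 # SB3-style interface (assumption)
            stats["compute_time"].append(time.perf_counter() - start)
            obs, _, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            stats["wall_violation"].append(info.get("wall_violation", 0.0))  # hypothetical info keys
            stats["closed_loop_cost"].append(info.get("stage_cost", 0.0))
            if reached is None and info.get("dist_to_goal", np.inf) < eps:
                reached = t
            t += 1
        stats["time_to_eps"].append(reached)
    # Average each metric, ignoring episodes that never reached the epsilon region.
    return {k: np.mean([v for v in vals if v is not None]) for k, vals in stats.items()}
```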

Getting wall info from URDF instead of hardcoding

Description:
Currently we are hard-coding values here and there for the env, the controller and the safety filter. This is quite error prone. Instead, this info should come from a single place: store it in the env and let the other modules take it from there.

Implementation Suggestion:
If it is possible to infer the vectors that describe the wall from the URDF, then we can build a model from the URDF inside the env and infer all the wall info from there. Otherwise, perhaps store them in a file and load them from there (or later from a config file, e.g. hydra).
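
If the wall is a link in the URDF, its visual origin can be read with the standard library's XML parser; a sketch, assuming a link named "wall" (the link name and the visual-origin convention are assumptions):

```python
import numpy as np
import xml.etree.ElementTree as ET

def wall_pose_from_urdf(urdf_path, link_name="wall"):
    """Read the wall link's visual origin (xyz) and orientation (rpy) from the URDF."""
    root = ET.parse(urdf_path).getroot()
    link = root.find(f".//link[@name='{link_name}']")
    origin = link.find("./visual/origin")
    xyz = np.array([float(v) for v in origin.get("xyz", "0 0 0").split()])
    rpy = np.array([float(v) for v in origin.get("rpy", "0 0 0").split()])
    return xyz, rpy  # the env can convert rpy into the wall normal it stores
```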

Extend environment using multiple goals (set-to-set)

Description:
We would like the environment to reset both the start and the goal positions within a reasonable range of the workspace.

Suggested Implementation:
Subclass the current FlexibleRobotArm environment such that its observation space is extended by three dimensions. The added dimensions describe the position of the goal (relative or absolute).
The observation needs to be extended because the RL and IRL algorithms down the pipeline use the observation and action spaces of the env to define the input size of their neural nets.
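
A sketch of the subclassing idea, using Gymnasium conventions; the base class name, constructor signature and `self.goal` attribute are assumptions about the existing env, not its actual API:

```python
import numpy as np
import gymnasium as gym

class GoalConditionedFlexibleArmEnv(FlexibleArmEnv):  # FlexibleArmEnv: the existing env class (assumed interface)
    """Appends the sampled goal position (3D) to every observation."""

    def __init__(self, options):
        super().__init__(options)
        low = np.concatenate([self.observation_space.low, np.full(3, -np.inf)])
        high = np.concatenate([self.observation_space.high, np.full(3, np.inf)])
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float64)

    def reset(self, **kwargs):
        obs, info = super().reset(**kwargs)
        return np.concatenate([obs, self.goal]), info  # self.goal: hypothetical attribute

    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)
        return np.concatenate([obs, self.goal]), reward, terminated, truncated, info
```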

Acceptance Criteria:
env.reset() resets the start position of the robot arm as well as the goal position, and returns an observation containing the goal position in its last three elements.

Limit the observation space for the start and goal position to a reasonable range

Description:
The observation space for the flexible arm environment needs to be set up such that the starting location of the robot and the desired goal position are within a reasonable range of the workspace.

Implementation:
Check what the range of values should be for each dimension of the observation space and change the values accordingly in the code.

Acceptance Criteria:
env.reset() + env.render() in a loop should show the robot start and goal positions within the desired range.

Delete old files

At some point, all old files should be deleted. Perhaps we move them first into an "old_files" folder.

Move to using hydra configuration

Description:
There are a lot of parameters involved in running this code, and it would be great if we could set them all outside of the code using hydra configs.

Usage/Implementation Suggestion:
The Hydra config should resemble the structure of the codebase for ease of use, so the codebase itself might need some refactoring and repackaging. Hydra can instantiate classes either directly from the configs using the _target_ key or in code using instantiate(cfg) for every class that needs instantiating. Although this may seem cleaner for production-level code, it hides the names of the instantiated classes in the config file. I suggest using Hydra only for configuration handling and doing the class instantiation ourselves in the code, using the configs passed through Hydra. In my experience this results in less back and forth between config and code as we continue development.
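
A sketch of the proposed usage: Hydra handles only the config, and classes are instantiated explicitly in code. The config keys, constructor arguments and the `make_policy` factory are illustrative assumptions, not the repository's actual interfaces:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    # Explicit instantiation keeps class names visible in the code rather than in _target_ keys.
    env = FlexibleArmEnv(FlexibleArmEnvOptions(dt=cfg.env.dt,
                                               qa_range_end=cfg.env.qa_range_end))  # assumed options
    policy = make_policy(cfg.algo.name, env, lr=cfg.algo.learning_rate)  # make_policy: hypothetical factory
    policy.train(total_timesteps=cfg.algo.training_steps)                # training call depends on the algorithm

if __name__ == "__main__":
    main()
```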

Acceptance Criteria:
The final scripts for training/evaluation/visualization need to be configurable through hydra configs and hydra configs alone (no splitting configuration between the code and the hydra configs).

KPI: How to do these measurements from the evaluation trajectories?

Description:
There is a kpi.py file that can do the following:

  • path_length
  • execution_time
  • constraint_violation

Constraint violation (the last one) is of particular importance to us, since it reflects the effectiveness of the safety filter.

Implementation:
These functions (or at least the last one) need to be updated so that they work with the trajectories returned by the evaluation method.

The evaluation method should also be given a "return_trajectories" argument so it can return the robot's trajectory during the evaluation procedure (currently it only returns the reward and episode length).

There also needs to be a script that loads every algorithm, evaluates it, runs the KPI measurements, adds the results to a (JSON/YAML) file and creates a plot from the KPI measurements to compare the constraint violations of each algorithm.
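
A sketch of how the constraint-violation KPI could consume trajectories returned by the evaluation method; the array layout is an assumption, not the repository's actual format:

```python
import numpy as np

def constraint_violation(ee_positions, wall_origin, wall_normal):
    """Total and maximum penetration of the end-effector trajectory through the wall plane.

    ee_positions: (T, 3) array of end-effector positions from one evaluation episode.
    """
    n = wall_normal / np.linalg.norm(wall_normal)
    signed_dist = (ee_positions - wall_origin) @ n   # >= 0 means feasible side of the wall
    penetration = np.clip(-signed_dist, 0.0, None)
    return penetration.sum(), penetration.max()
```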

Environment and expert options

Hi guys!

Here, I'd like to discuss the environment and expert NMPC options in line with the statements made in the paper. The main points are:

  • The computation time of the approximate MPC policy is significantly less than that of NMPC.
  • Consequently, the control rate of ANMPC is much higher than that of NMPC, which allows us to handle model-plant mismatch and other disturbances more effectively.

While it's true that NMPC allows you to adjust the sampling time as needed, this doesn't apply to the learned policy, as it is tied to the expert NMPC's sampling time. It wouldn't be reasonable to gather expert data at 250Hz for training the policy and then deploy the policy at a different frequency.

On the other hand, the sampling time of the final controller (NN + safety filter) will be limited by the sampling rate of the safety filter. @RudolfReiter, could you provide insight into how fast we can make the safety filter? We should select tf and N of the expert NMPC so that they align with the tf and N of the safety filter, and finally set dt of the environment accordingly.
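
For reference, the horizon lengths implied by the parameter tables above follow from tf = N * dt; a quick check:

```python
# Horizon lengths implied by the parameter tables above (tf = N * dt):
nmpc_tf = 125 * 0.005   # expert NMPC: 0.625 s
sf_tf   = 25 * 0.010    # safety filter: 0.25 s
# Aligning the two means choosing (dt, N) for the expert so that its tf (and the
# environment's dt) is compatible with the safety filter's tf.
```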
