
High-quality single-file implementations of SOTA Offline and Offline-to-Online RL algorithms: AWAC, BC, CQL, DT, EDAC, IQL, SAC-N, TD3+BC, LB-SAC, SPOT, Cal-QL, ReBRAC

Home Page: https://arxiv.org/abs/2210.07105

License: Apache License 2.0

Languages: Python 99.77%, Dockerfile 0.23%
Topics: d4rl, gym, offline-reinforcement-learning, reinforcement-learning

CORL's Introduction

CORL (Clean Offline Reinforcement Learning)


🧵 CORL is an Offline Reinforcement Learning library that provides high-quality and easy-to-follow single-file implementations of SOTA ORL algorithms. Each implementation is backed by a research-friendly codebase, allowing you to run or tune thousands of experiments. Heavily inspired by cleanrl for online RL; check them out too!

  • 📜 Single-file implementation
  • 📈 Benchmarked Implementation for N algorithms
  • 🖼 Weights and Biases integration

  • ⭐ If you're interested in discrete control, make sure to check out our new library, Katakomba. It provides both discrete control algorithms augmented with recurrence and an offline RL benchmark for the NetHack Learning Environment.

Getting started

git clone https://github.com/tinkoff-ai/CORL.git && cd CORL
pip install -r requirements/requirements_dev.txt

# alternatively, you could use docker
docker build -t <image_name> .
docker run --gpus=all -it --rm --name <container_name> <image_name>
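
To launch a training run once the dependencies are installed (an illustrative example; the exact script and config paths depend on the repository version):

# this invocation mirrors the one used in the issues further down this page
python algorithms/iql.py --config=configs/iql/hopper/medium_expert_v2.yaml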

Algorithms Implemented

Offline and Offline-to-Online

  ✅ Conservative Q-Learning for Offline Reinforcement Learning (CQL)
     Variants: offline/cql.py, finetune/cql.py | Wandb reports: Offline, Offline-to-online
  ✅ Accelerating Online Reinforcement Learning with Offline Datasets (AWAC)
     Variants: offline/awac.py, finetune/awac.py | Wandb reports: Offline, Offline-to-online
  ✅ Offline Reinforcement Learning with Implicit Q-Learning (IQL)
     Variants: offline/iql.py, finetune/iql.py | Wandb reports: Offline, Offline-to-online

Offline-to-Online only

  ✅ Supported Policy Optimization for Offline Reinforcement Learning (SPOT)
     Variants: finetune/spot.py | Wandb report: Offline-to-online
  ✅ Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning (Cal-QL)
     Variants: finetune/cal_ql.py | Wandb report: Offline-to-online

Offline only

  ✅ Behavioral Cloning (BC)
     Variants: offline/any_percent_bc.py | Wandb report: Offline
  ✅ Behavioral Cloning-10% (BC-10%)
     Variants: offline/any_percent_bc.py | Wandb report: Offline
  ✅ A Minimalist Approach to Offline Reinforcement Learning (TD3+BC)
     Variants: offline/td3_bc.py | Wandb report: Offline
  ✅ Decision Transformer: Reinforcement Learning via Sequence Modeling (DT)
     Variants: offline/dt.py | Wandb report: Offline
  ✅ Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (SAC-N)
     Variants: offline/sac_n.py | Wandb report: Offline
  ✅ Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (EDAC)
     Variants: offline/edac.py | Wandb report: Offline
  ✅ Revisiting the Minimalist Approach to Offline Reinforcement Learning (ReBRAC)
     Variants: offline/rebrac.py | Wandb report: Offline
  ✅ Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size (LB-SAC)
     Variants: offline/lb_sac.py | Wandb report: Offline Gym-MuJoCo

D4RL Benchmarks

You can check the links above for learning curves and details. Here, we report the reproduced final and best scores. Note that the two can differ by a significant margin, and papers do not always make explicit which reporting methodology they chose. If you want to re-collect our results in a more structured/nuanced manner, see results.

Offline

Last Scores

Gym-MuJoCo
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
halfcheetah-medium-v2 42.40 ± 0.19 42.46 ± 0.70 48.10 ± 0.18 49.46 ± 0.62 47.04 ± 0.22 48.31 ± 0.22 64.04 ± 0.68 68.20 ± 1.28 67.70 ± 1.04 42.20 ± 0.26
halfcheetah-medium-replay-v2 35.66 ± 2.33 23.59 ± 6.95 44.84 ± 0.59 44.70 ± 0.69 45.04 ± 0.27 44.46 ± 0.22 51.18 ± 0.31 60.70 ± 1.01 62.06 ± 1.10 38.91 ± 0.50
halfcheetah-medium-expert-v2 55.95 ± 7.35 90.10 ± 2.45 90.78 ± 6.04 93.62 ± 0.41 95.63 ± 0.42 94.74 ± 0.52 103.80 ± 2.95 98.96 ± 9.31 104.76 ± 0.64 91.55 ± 0.95
hopper-medium-v2 53.51 ± 1.76 55.48 ± 7.30 60.37 ± 3.49 74.45 ± 9.14 59.08 ± 3.77 67.53 ± 3.78 102.29 ± 0.17 40.82 ± 9.91 101.70 ± 0.28 65.10 ± 1.61
hopper-medium-replay-v2 29.81 ± 2.07 70.42 ± 8.66 64.42 ± 21.52 96.39 ± 5.28 95.11 ± 5.27 97.43 ± 6.39 94.98 ± 6.53 100.33 ± 0.78 99.66 ± 0.81 81.77 ± 6.87
hopper-medium-expert-v2 52.30 ± 4.01 111.16 ± 1.03 101.17 ± 9.07 52.73 ± 37.47 99.26 ± 10.91 107.42 ± 7.80 109.45 ± 2.34 101.31 ± 11.63 105.19 ± 10.08 110.44 ± 0.33
walker2d-medium-v2 63.23 ± 16.24 67.34 ± 5.17 82.71 ± 4.78 66.53 ± 26.04 80.75 ± 3.28 80.91 ± 3.17 85.82 ± 0.77 87.47 ± 0.66 93.36 ± 1.38 67.63 ± 2.54
walker2d-medium-replay-v2 21.80 ± 10.15 54.35 ± 6.34 85.62 ± 4.01 82.20 ± 1.05 73.09 ± 13.22 82.15 ± 3.03 84.25 ± 2.25 78.99 ± 0.50 87.10 ± 2.78 59.86 ± 2.73
walker2d-medium-expert-v2 98.96 ± 15.98 108.70 ± 0.25 110.03 ± 0.36 49.41 ± 38.16 109.56 ± 0.39 111.72 ± 0.86 111.86 ± 0.43 114.93 ± 0.41 114.75 ± 0.74 107.11 ± 0.96
locomotion average 50.40 69.29 76.45 67.72 78.28 81.63 89.74 83.52 92.92 73.84
Maze2d
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
maze2d-umaze-v1 0.36 ± 8.69 12.18 ± 4.29 29.41 ± 12.31 82.67 ± 28.30 -8.90 ± 6.11 42.11 ± 0.58 106.87 ± 22.16 130.59 ± 16.52 95.26 ± 6.39 18.08 ± 25.42
maze2d-medium-v1 0.79 ± 3.25 14.25 ± 2.33 59.45 ± 36.25 52.88 ± 55.12 86.11 ± 9.68 34.85 ± 2.72 105.11 ± 31.67 88.61 ± 18.72 57.04 ± 3.45 31.71 ± 26.33
maze2d-large-v1 2.26 ± 4.39 11.32 ± 5.10 97.10 ± 25.41 209.13 ± 8.19 23.75 ± 36.70 61.72 ± 3.50 78.33 ± 61.77 204.76 ± 1.19 95.60 ± 22.92 35.66 ± 28.20
maze2d average 1.13 12.58 61.99 114.89 33.65 46.23 96.77 141.32 82.64 28.48
Antmaze
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
antmaze-umaze-v2 55.25 ± 4.15 65.75 ± 5.26 70.75 ± 39.18 57.75 ± 10.28 92.75 ± 1.92 77.00 ± 5.52 97.75 ± 1.48 0.00 ± 0.00 0.00 ± 0.00 57.00 ± 9.82
antmaze-umaze-diverse-v2 47.25 ± 4.09 44.00 ± 1.00 44.75 ± 11.61 58.00 ± 7.68 37.25 ± 3.70 54.25 ± 5.54 83.50 ± 7.02 0.00 ± 0.00 0.00 ± 0.00 51.75 ± 0.43
antmaze-medium-play-v2 0.00 ± 0.00 2.00 ± 0.71 0.25 ± 0.43 0.00 ± 0.00 65.75 ± 11.61 65.75 ± 11.71 89.50 ± 3.35 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-medium-diverse-v2 0.75 ± 0.83 5.75 ± 9.39 0.25 ± 0.43 0.00 ± 0.00 67.25 ± 3.56 73.75 ± 5.45 83.50 ± 8.20 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-play-v2 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 20.75 ± 7.26 42.00 ± 4.53 52.25 ± 29.01 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-diverse-v2 0.00 ± 0.00 0.75 ± 0.83 0.00 ± 0.00 0.00 ± 0.00 20.50 ± 13.24 30.25 ± 3.63 64.00 ± 5.43 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze average 17.21 19.71 19.33 19.29 50.71 57.17 78.42 0.00 0.00 18.12
Adroit
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
pen-human-v1 71.03 ± 6.26 26.99 ± 9.60 -3.88 ± 0.21 81.12 ± 13.47 13.71 ± 16.98 78.49 ± 8.21 103.16 ± 8.49 6.86 ± 5.93 5.07 ± 6.16 67.68 ± 5.48
pen-cloned-v1 51.92 ± 15.15 46.67 ± 14.25 5.13 ± 5.28 89.56 ± 15.57 1.04 ± 6.62 83.42 ± 8.19 102.79 ± 7.84 31.35 ± 2.14 12.02 ± 1.75 64.43 ± 1.43
pen-expert-v1 109.65 ± 7.28 114.96 ± 2.96 122.53 ± 21.27 160.37 ± 1.21 -1.41 ± 2.34 128.05 ± 9.21 152.16 ± 6.33 87.11 ± 48.95 -1.55 ± 0.81 116.38 ± 1.27
door-human-v1 2.34 ± 4.00 -0.13 ± 0.07 -0.33 ± 0.01 4.60 ± 1.90 5.53 ± 1.31 3.26 ± 1.83 -0.10 ± 0.01 -0.38 ± 0.00 -0.12 ± 0.13 4.44 ± 0.87
door-cloned-v1 -0.09 ± 0.03 0.29 ± 0.59 -0.34 ± 0.01 0.93 ± 1.66 -0.33 ± 0.01 3.07 ± 1.75 0.06 ± 0.05 -0.33 ± 0.00 2.66 ± 2.31 7.64 ± 3.26
door-expert-v1 105.35 ± 0.09 104.04 ± 1.46 -0.33 ± 0.01 104.85 ± 0.24 -0.32 ± 0.02 106.65 ± 0.25 106.37 ± 0.29 -0.33 ± 0.00 106.29 ± 1.73 104.87 ± 0.39
hammer-human-v1 3.03 ± 3.39 -0.19 ± 0.02 1.02 ± 0.24 3.37 ± 1.93 0.14 ± 0.11 1.79 ± 0.80 0.24 ± 0.24 0.24 ± 0.00 0.28 ± 0.18 1.28 ± 0.15
hammer-cloned-v1 0.55 ± 0.16 0.12 ± 0.08 0.25 ± 0.01 0.21 ± 0.24 0.30 ± 0.01 1.50 ± 0.69 5.00 ± 3.75 0.14 ± 0.09 0.19 ± 0.07 1.82 ± 0.55
hammer-expert-v1 126.78 ± 0.64 121.75 ± 7.67 3.11 ± 0.03 127.06 ± 0.29 0.26 ± 0.01 128.68 ± 0.33 133.62 ± 0.27 25.13 ± 43.25 28.52 ± 49.00 117.45 ± 6.65
relocate-human-v1 0.04 ± 0.03 -0.14 ± 0.08 -0.29 ± 0.01 0.05 ± 0.03 0.06 ± 0.03 0.12 ± 0.04 0.16 ± 0.30 -0.31 ± 0.01 -0.17 ± 0.17 0.05 ± 0.01
relocate-cloned-v1 -0.06 ± 0.01 -0.00 ± 0.02 -0.30 ± 0.01 -0.04 ± 0.04 -0.29 ± 0.01 0.04 ± 0.01 1.66 ± 2.59 -0.01 ± 0.10 0.17 ± 0.35 0.16 ± 0.09
relocate-expert-v1 107.58 ± 1.20 97.90 ± 5.21 -1.73 ± 0.96 108.87 ± 0.85 -0.30 ± 0.02 106.11 ± 4.02 107.52 ± 2.28 -0.36 ± 0.00 71.94 ± 18.37 104.28 ± 0.42
adroit average 48.18 42.69 10.40 56.75 1.53 53.43 59.39 12.43 18.78 49.21

Best Scores

Gym-MuJoCo
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
halfcheetah-medium-v2 43.60 ± 0.14 43.90 ± 0.13 48.93 ± 0.11 50.06 ± 0.50 47.62 ± 0.03 48.84 ± 0.07 65.62 ± 0.46 72.21 ± 0.31 69.72 ± 0.92 42.73 ± 0.10
halfcheetah-medium-replay-v2 40.52 ± 0.19 42.27 ± 0.46 45.84 ± 0.26 46.35 ± 0.29 46.43 ± 0.19 45.35 ± 0.08 52.22 ± 0.31 67.29 ± 0.34 66.55 ± 1.05 40.31 ± 0.28
halfcheetah-medium-expert-v2 79.69 ± 3.10 94.11 ± 0.22 96.59 ± 0.87 96.11 ± 0.37 97.04 ± 0.17 95.38 ± 0.17 108.89 ± 1.20 111.73 ± 0.47 110.62 ± 1.04 93.40 ± 0.21
hopper-medium-v2 69.04 ± 2.90 73.84 ± 0.37 70.44 ± 1.18 97.90 ± 0.56 70.80 ± 1.98 80.46 ± 3.09 103.19 ± 0.16 101.79 ± 0.20 103.26 ± 0.14 69.42 ± 3.64
hopper-medium-replay-v2 68.88 ± 10.33 90.57 ± 2.07 98.12 ± 1.16 100.91 ± 1.50 101.63 ± 0.55 102.69 ± 0.96 102.57 ± 0.45 103.83 ± 0.53 103.28 ± 0.49 88.74 ± 3.02
hopper-medium-expert-v2 90.63 ± 10.98 113.13 ± 0.16 113.22 ± 0.43 103.82 ± 12.81 112.84 ± 0.66 113.18 ± 0.38 113.16 ± 0.43 111.24 ± 0.15 111.80 ± 0.11 111.18 ± 0.21
walker2d-medium-v2 80.64 ± 0.91 82.05 ± 0.93 86.91 ± 0.28 83.37 ± 2.82 84.77 ± 0.20 87.58 ± 0.48 87.79 ± 0.19 90.17 ± 0.54 95.78 ± 1.07 74.70 ± 0.56
walker2d-medium-replay-v2 48.41 ± 7.61 76.09 ± 0.40 91.17 ± 0.72 86.51 ± 1.15 89.39 ± 0.88 89.94 ± 0.93 91.11 ± 0.63 85.18 ± 1.63 89.69 ± 1.39 68.22 ± 1.20
walker2d-medium-expert-v2 109.95 ± 0.62 109.90 ± 0.09 112.21 ± 0.06 108.28 ± 9.45 111.63 ± 0.38 113.06 ± 0.53 112.49 ± 0.18 116.93 ± 0.42 116.52 ± 0.75 108.71 ± 0.34
locomotion average 70.15 80.65 84.83 85.92 84.68 86.28 93.00 95.60 96.36 77.49
Maze2d
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
maze2d-umaze-v1 16.09 ± 0.87 22.49 ± 1.52 99.33 ± 16.16 136.61 ± 11.65 92.05 ± 13.66 50.92 ± 4.23 162.28 ± 1.79 153.12 ± 6.49 149.88 ± 1.97 63.83 ± 17.35
maze2d-medium-v1 19.16 ± 1.24 27.64 ± 1.87 150.93 ± 3.89 131.50 ± 25.38 128.66 ± 5.44 122.69 ± 30.00 150.12 ± 4.48 93.80 ± 14.66 154.41 ± 1.58 68.14 ± 12.25
maze2d-large-v1 20.75 ± 6.66 41.83 ± 3.64 197.64 ± 5.26 227.93 ± 1.90 157.51 ± 7.32 162.25 ± 44.18 197.55 ± 5.82 207.51 ± 0.96 182.52 ± 2.68 50.25 ± 19.34
maze2d average 18.67 30.65 149.30 165.35 126.07 111.95 169.98 151.48 162.27 60.74
Antmaze
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
antmaze-umaze-v2 68.50 ± 2.29 77.50 ± 1.50 98.50 ± 0.87 78.75 ± 6.76 94.75 ± 0.83 84.00 ± 4.06 100.00 ± 0.00 0.00 ± 0.00 42.50 ± 28.61 64.50 ± 2.06
antmaze-umaze-diverse-v2 64.75 ± 4.32 63.50 ± 2.18 71.25 ± 5.76 88.25 ± 2.17 53.75 ± 2.05 79.50 ± 3.35 96.75 ± 2.28 0.00 ± 0.00 0.00 ± 0.00 60.50 ± 2.29
antmaze-medium-play-v2 4.50 ± 1.12 6.25 ± 2.38 3.75 ± 1.30 27.50 ± 9.39 80.50 ± 3.35 78.50 ± 3.84 93.50 ± 2.60 0.00 ± 0.00 0.00 ± 0.00 0.75 ± 0.43
antmaze-medium-diverse-v2 4.75 ± 1.09 16.50 ± 5.59 5.50 ± 1.50 33.25 ± 16.81 71.00 ± 4.53 83.50 ± 1.80 91.75 ± 2.05 0.00 ± 0.00 0.00 ± 0.00 0.50 ± 0.50
antmaze-large-play-v2 0.50 ± 0.50 13.50 ± 9.76 1.25 ± 0.43 1.00 ± 0.71 34.75 ± 5.85 53.50 ± 2.50 68.75 ± 13.90 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-diverse-v2 0.75 ± 0.43 6.25 ± 1.79 0.25 ± 0.43 0.50 ± 0.50 36.25 ± 3.34 53.00 ± 3.00 69.50 ± 7.26 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze average 23.96 30.58 30.08 38.21 61.83 72.00 86.71 0.00 7.08 21.04
Adroit
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
pen-human-v1 99.69 ± 7.45 59.89 ± 8.03 9.95 ± 8.19 121.05 ± 5.47 58.91 ± 1.81 106.15 ± 10.28 127.28 ± 3.22 56.48 ± 7.17 35.84 ± 10.57 77.83 ± 2.30
pen-cloned-v1 99.14 ± 12.27 83.62 ± 11.75 52.66 ± 6.33 129.66 ± 1.27 14.74 ± 2.31 114.05 ± 4.78 128.64 ± 7.15 52.69 ± 5.30 26.90 ± 7.85 71.17 ± 2.70
pen-expert-v1 128.77 ± 5.88 134.36 ± 3.16 142.83 ± 7.72 162.69 ± 0.23 14.86 ± 4.07 140.01 ± 6.36 157.62 ± 0.26 116.43 ± 40.26 36.04 ± 4.60 119.49 ± 2.31
door-human-v1 9.41 ± 4.55 7.00 ± 6.77 -0.11 ± 0.06 19.28 ± 1.46 13.28 ± 2.77 13.52 ± 1.22 0.27 ± 0.43 -0.10 ± 0.06 2.51 ± 2.26 7.36 ± 1.24
door-cloned-v1 3.40 ± 0.95 10.37 ± 4.09 -0.20 ± 0.11 12.61 ± 0.60 -0.08 ± 0.13 9.02 ± 1.47 7.73 ± 6.80 -0.21 ± 0.10 20.36 ± 1.11 11.18 ± 0.96
door-expert-v1 105.84 ± 0.23 105.92 ± 0.24 4.49 ± 7.39 106.77 ± 0.24 59.47 ± 25.04 107.29 ± 0.37 106.78 ± 0.04 0.05 ± 0.02 109.22 ± 0.24 105.49 ± 0.09
hammer-human-v1 12.61 ± 4.87 6.23 ± 4.79 2.38 ± 0.14 22.03 ± 8.13 0.30 ± 0.05 6.86 ± 2.38 1.18 ± 0.15 0.25 ± 0.00 3.49 ± 2.17 1.68 ± 0.11
hammer-cloned-v1 8.90 ± 4.04 8.72 ± 3.28 0.96 ± 0.30 14.67 ± 1.94 0.32 ± 0.03 11.63 ± 1.70 48.16 ± 6.20 12.67 ± 15.02 0.27 ± 0.01 2.74 ± 0.22
hammer-expert-v1 127.89 ± 0.57 128.15 ± 0.66 33.31 ± 47.65 129.66 ± 0.33 0.93 ± 1.12 129.76 ± 0.37 134.74 ± 0.30 91.74 ± 47.77 69.44 ± 47.00 127.39 ± 0.10
relocate-human-v1 0.59 ± 0.27 0.16 ± 0.14 -0.29 ± 0.01 2.09 ± 0.76 1.03 ± 0.20 1.22 ± 0.28 3.70 ± 2.34 -0.18 ± 0.14 0.05 ± 0.02 0.08 ± 0.02
relocate-cloned-v1 0.45 ± 0.31 0.74 ± 0.45 -0.02 ± 0.04 0.94 ± 0.68 -0.07 ± 0.02 1.78 ± 0.70 9.25 ± 2.56 0.10 ± 0.04 4.11 ± 1.39 0.34 ± 0.09
relocate-expert-v1 110.31 ± 0.36 109.77 ± 0.60 0.23 ± 0.27 111.56 ± 0.17 0.03 ± 0.10 110.12 ± 0.82 111.14 ± 0.23 -0.07 ± 0.08 98.32 ± 3.75 106.49 ± 0.30
adroit average 58.92 54.58 20.51 69.42 13.65 62.62 69.71 27.49 33.88 52.60

Offline-to-Online

Scores

Task-Name AWAC CQL IQL SPOT Cal-QL
antmaze-umaze-v2 52.75 ± 8.67 → 98.75 ± 1.09 94.00 ± 1.58 → 99.50 ± 0.87 77.00 ± 0.71 → 96.50 ± 1.12 91.00 ± 2.55 → 99.50 ± 0.50 76.75 ± 7.53 → 99.75 ± 0.43
antmaze-umaze-diverse-v2 56.00 ± 2.74 → 0.00 ± 0.00 9.50 ± 9.91 → 99.00 ± 1.22 59.50 ± 9.55 → 63.75 ± 25.02 36.25 ± 2.17 → 95.00 ± 3.67 32.00 ± 27.79 → 98.50 ± 1.12
antmaze-medium-play-v2 0.00 ± 0.00 → 0.00 ± 0.00 59.00 ± 11.18 → 97.75 ± 1.30 71.75 ± 2.95 → 89.75 ± 1.09 67.25 ± 10.47 → 97.25 ± 1.30 71.75 ± 3.27 → 98.75 ± 1.64
antmaze-medium-diverse-v2 0.00 ± 0.00 → 0.00 ± 0.00 63.50 ± 6.84 → 97.25 ± 1.92 64.25 ± 1.92 → 92.25 ± 2.86 73.75 ± 7.29 → 94.50 ± 1.66 62.00 ± 4.30 → 98.25 ± 1.48
antmaze-large-play-v2 0.00 ± 0.00 → 0.00 ± 0.00 28.75 ± 7.76 → 88.25 ± 2.28 38.50 ± 8.73 → 64.50 ± 17.04 31.50 ± 12.58 → 87.00 ± 3.24 31.75 ± 8.87 → 97.25 ± 1.79
antmaze-large-diverse-v2 0.00 ± 0.00 → 0.00 ± 0.00 35.50 ± 3.64 → 91.75 ± 3.96 26.75 ± 3.77 → 64.25 ± 4.15 17.50 ± 7.26 → 81.00 ± 14.14 44.00 ± 8.69 → 91.50 ± 3.91
antmaze average 18.12 → 16.46 48.38 → 95.58 56.29 → 78.50 52.88 → 92.38 53.04 → 97.33
pen-cloned-v1 88.66 ± 15.10 → 86.82 ± 11.12 -2.76 ± 0.08 → -1.28 ± 2.16 84.19 ± 3.96 → 102.02 ± 20.75 6.19 ± 5.21 → 43.63 ± 20.09 -2.66 ± 0.04 → -2.68 ± 0.12
door-cloned-v1 0.93 ± 1.66 → 0.01 ± 0.00 -0.33 ± 0.01 → -0.33 ± 0.01 1.19 ± 0.93 → 20.34 ± 9.32 -0.21 ± 0.14 → 0.02 ± 0.31 -0.33 ± 0.01 → -0.33 ± 0.01
hammer-cloned-v1 1.80 ± 3.01 → 0.24 ± 0.04 0.56 ± 0.55 → 2.85 ± 4.81 1.35 ± 0.32 → 57.27 ± 28.49 3.97 ± 6.39 → 3.73 ± 4.99 0.25 ± 0.04 → 0.17 ± 0.17
relocate-cloned-v1 -0.04 ± 0.04 → -0.04 ± 0.01 -0.33 ± 0.01 → -0.33 ± 0.01 0.04 ± 0.04 → 0.32 ± 0.38 -0.24 ± 0.01 → -0.15 ± 0.05 -0.31 ± 0.05 → -0.31 ± 0.04
adroit average 22.84 → 21.76 -0.72 → 0.22 21.69 → 44.99 2.43 → 11.81 -0.76 → -0.79

Regrets

Task-Name AWAC CQL IQL SPOT Cal-QL
antmaze-umaze-v2 0.04 ± 0.01 0.02 ± 0.00 0.07 ± 0.00 0.02 ± 0.00 0.01 ± 0.00
antmaze-umaze-diverse-v2 0.88 ± 0.01 0.09 ± 0.01 0.43 ± 0.11 0.22 ± 0.07 0.05 ± 0.01
antmaze-medium-play-v2 1.00 ± 0.00 0.08 ± 0.01 0.09 ± 0.01 0.06 ± 0.00 0.04 ± 0.01
antmaze-medium-diverse-v2 1.00 ± 0.00 0.08 ± 0.00 0.10 ± 0.01 0.05 ± 0.01 0.04 ± 0.01
antmaze-large-play-v2 1.00 ± 0.00 0.21 ± 0.02 0.34 ± 0.05 0.29 ± 0.07 0.13 ± 0.02
antmaze-large-diverse-v2 1.00 ± 0.00 0.21 ± 0.03 0.41 ± 0.03 0.23 ± 0.08 0.13 ± 0.02
antmaze average 0.82 0.11 0.24 0.15 0.07
pen-cloned-v1 0.46 ± 0.02 0.97 ± 0.00 0.37 ± 0.01 0.58 ± 0.02 0.98 ± 0.01
door-cloned-v1 1.00 ± 0.00 1.00 ± 0.00 0.83 ± 0.03 0.99 ± 0.01 1.00 ± 0.00
hammer-cloned-v1 1.00 ± 0.00 1.00 ± 0.00 0.65 ± 0.10 0.98 ± 0.01 1.00 ± 0.00
relocate-cloned-v1 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00
adroit average 0.86 0.99 0.71 0.89 0.99

Citing CORL

If you use CORL in your work, please use the following BibTeX:

@inproceedings{
tarasov2022corl,
  title={{CORL}: Research-oriented Deep Offline Reinforcement Learning Library},
  author={Denis Tarasov and Alexander Nikulin and Dmitry Akimov and Vladislav Kurenkov and Sergey Kolesnikov},
  booktitle={3rd Offline RL Workshop: Offline RL as a ''Launchpad''},
  year={2022},
  url={https://openreview.net/forum?id=SyAS49bBcv}
}

CORL's People

Contributors

adamjelley, cherrypiesexy, dt6a, howuhh, levilovearch, nakamotoo, scitator, suessmann, typoverflow


CORL's Issues

Make checkpoints public

Hi, would it be possible to release the checkpoints for this implementation? Would be very grateful for this.

Questions about D4RL dataset loading for DT

First of all, thank you for the repository. It is really helpful for learning about and developing offline RL algorithms.

While going through the DT code I had two questions.

In the function where the D4RL trajectories are loaded, I think the episode step counting will be off by one for the first episode. The episode_step += 1 should appear before the if condition. With the current implementation, if the first episode has length 5, then episode_step would only be 4 when entering the if condition.

Also, this function throws away the last episode in the dataset if the done/timeout flag for its last sample is False. Is this expected behavior? There are datasets where the last sample has both done and timeout set to False, for example the hopper-random-v2 dataset of the D4RL benchmark. In this dataset, the last episode of 4 samples is discarded and isn't added to traj.

CORL/algorithms/dt.py

Lines 137 to 155 in b62fa28

data_, episode_step = defaultdict(list), 0
for i in trange(dataset["rewards"].shape[0], desc="Processing trajectories"):
    data_["observations"].append(dataset["observations"][i])
    data_["actions"].append(dataset["actions"][i])
    data_["rewards"].append(dataset["rewards"][i])
    if dataset["terminals"][i] or dataset["timeouts"][i]:
        episode_data = {k: np.array(v, dtype=np.float32) for k, v in data_.items()}
        # return-to-go if gamma=1.0, just discounted returns else
        episode_data["returns"] = discounted_cumsum(
            episode_data["rewards"], gamma=gamma
        )
        traj.append(episode_data)
        traj_len.append(episode_step)
        # reset trajectory buffer
        data_, episode_step = defaultdict(list), 0
    episode_step += 1
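
For reference, applying the reordering proposed above to the same snippet would look roughly like this (a sketch of the issue's suggestion, not the repository code):

data_, episode_step = defaultdict(list), 0
for i in trange(dataset["rewards"].shape[0], desc="Processing trajectories"):
    data_["observations"].append(dataset["observations"][i])
    data_["actions"].append(dataset["actions"][i])
    data_["rewards"].append(dataset["rewards"][i])
    # count the current transition before checking for episode end,
    # so the first episode's length is not off by one
    episode_step += 1
    if dataset["terminals"][i] or dataset["timeouts"][i]:
        episode_data = {k: np.array(v, dtype=np.float32) for k, v in data_.items()}
        episode_data["returns"] = discounted_cumsum(episode_data["rewards"], gamma=gamma)
        traj.append(episode_data)
        traj_len.append(episode_step)
        # reset trajectory buffer
        data_, episode_step = defaultdict(list), 0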

The results about td3_bc on Antmaze

Hi

May I ask about the settings for td3_bc on antmaze? I find that the current hyperparameters do not work well and do not obtain results similar to those in the paper.

Best

Running edac.py produces two warnings

WARNING Calling wandb.run.save without any arguments is deprecated.Changes to attributes are automatically persisted.
/home/bins/anaconda3/lib/python3.10/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32

Will these two warnings affect the results?

Finetune algorithms log only train regret

All of the algorithms with offline-to-online finetuning log the training regret (the regret accumulated by the online interactions used for training) under both train/regret and eval/regret. So we effectively report only the train regret, which differs from the Cal-QL work, where the authors report eval regret. Reporting eval regret is arguably strange, because the thing we really want to minimize in practice is the train regret, so this bug is not critical, but it should be kept in mind. I will fix it, but without rerunning all of the algorithms due to compute limitations (maybe we will rerun them later).
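
As a small illustration of the intended reporting (a hypothetical sketch, not the repository code; it assumes regret is tracked as the running mean of (1 - success) over online training episodes):

import numpy as np
import wandb

online_successes = []

def log_train_regret(episode_success, step):
    # accumulate successes from online training interactions only
    online_successes.append(episode_success)
    train_regret = float(np.mean([1.0 - s for s in online_successes]))
    # log under train/regret only, so it cannot be mistaken for eval regret
    wandb.log({"train/regret": train_regret}, step=step)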

Minari Integration with CORL

Here we will track our progress on the Minari integration with CORL. Minari is a standard format for offline RL datasets, with popular reference datasets and related utilities, which we believe will replace D4RL in the future by combining most of the existing benchmarks under a unified interface and storage format. And we want to be prepared! Eventually, this will become CORL v2.

The plan is to add a separate experimental directory with algorithms that use Minari and to retrain all algorithms on the D4RL datasets that are currently (and will in the future be) re-created in Minari.
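
As a rough illustration, the adapted data loading could look something like the sketch below (an assumption-laden sketch, not the actual CORL v2 code: the dataset id is hypothetical and Minari API details vary between releases):

import minari
import numpy as np

# hypothetical dataset id; the names of the re-created D4RL datasets depend on the Minari release
dataset = minari.load_dataset("pen-human-v1")

observations, actions, rewards, terminals = [], [], [], []
for episode in dataset.iterate_episodes():
    observations.append(episode.observations[:-1])  # drop the trailing "next" observation
    actions.append(episode.actions)
    rewards.append(episode.rewards)
    terminals.append(np.logical_or(episode.terminations, episode.truncations))

# flatten into D4RL-style arrays that the existing single-file implementations already consume
d4rl_style_dataset = {
    "observations": np.concatenate(observations),
    "actions": np.concatenate(actions),
    "rewards": np.concatenate(rewards),
    "terminals": np.concatenate(terminals),
}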

Adapted algorithms:

  • BC
  • TD3 + BC
  • AWAC
  • CQL
  • IQL
  • SAC-N
  • EDAC
  • DT

Retrained algorithms (only Adroit for now):

  • BC
  • TD3 + BC
  • AWAC
  • CQL
  • IQL
  • SAC-N
  • EDAC
  • DT

Running any_percent_bc.py raises an OSError

Here is the error message:
/home/bins/anaconda3/lib/python3.10/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
Traceback (most recent call last):
File "/home/bins/桌面/CORL/algorithms/any_percent_bc.py", line 406, in
train()
File "/home/bins/anaconda3/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
response = fn(cfg, *args, **kwargs)
File "/home/bins/桌面/CORL/algorithms/any_percent_bc.py", line 307, in train
dataset = d4rl.qlearning_dataset(env)
File "/home/bins/d4rl/d4rl/init.py", line 87, in qlearning_dataset
dataset = env.get_dataset(**kwargs)
File "/home/bins/d4rl/d4rl/offline_env.py", line 87, in get_dataset
with h5py.File(h5path, 'r') as dataset_file:
File "/home/bins/anaconda3/lib/python3.10/site-packages/h5py/_hl/files.py", line 567, in init
fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
File "/home/bins/anaconda3/lib/python3.10/site-packages/h5py/_hl/files.py", line 231, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 181944320, sblock->base_addr = 0, stored_eof = 474567252)

Importing mujoco_py throwing error

Hi, thanks for creating and sharing this repo!

I installed requirements_dev.txt in a conda env. I run into the following error while building the 'mujoco_py.cymj' extension when importing mujoco_py.

import mujoco_py
running build_ext
building 'mujoco_py.cymj' extension
gcc -pthread -B /home/grads/s/sapanac/.conda/envs/hf_dt/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/grads/s/sapanac/.conda/envs/hf_dt/include -fPIC -O2 -isystem /home/grads/s/sapanac/.conda/envs/hf_dt/include -fPIC -I/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py -I/home/grads/s/sapanac/.mujoco/mujoco210/include -I/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/numpy/core/include -I/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/vendor/egl -I/home/grads/s/sapanac/.conda/envs/hf_dt/include/python3.10 -c /home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c -o /home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/generated/_pyxbld_2.1.2.14_310_linuxgpuextensionbuilder/temp.linux-x86_64-cpython-310/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.o -fopenmp -w
/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c: In function ‘__pyx_f_9mujoco_py_4cymj_9PyMjModel__set’:
/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c:32003:80: error: ‘mjModel {aka struct _mjModel}’ has no member named ‘key_mpos’; did you mean ‘key_qpos’?
__pyx_t_1 = ((PyObject *)__pyx_f_9mujoco_py_4cymj__wrap_mjtNum_2d(__pyx_v_p->key_mpos, __pyx_v_p->nkey, (3 * __pyx_v_p->nmocap))); if (unlikely(!__pyx_t_1)) __PYX_ERR(4, 1494, __pyx_L1_error)
^~~~~~~~
key_qpos
/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c:32018:80: error: ‘mjModel {aka struct _mjModel}’ has no member named ‘key_mquat’; did you mean ‘body_quat’?
__pyx_t_1 = ((PyObject *)__pyx_f_9mujoco_py_4cymj__wrap_mjtNum_2d(__pyx_v_p->key_mquat, __pyx_v_p->nkey, (4 * __pyx_v_p->nmocap))); if (unlikely(!__pyx_t_1)) __PYX_ERR(4, 1495, __pyx_L1_error)

/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c: In function ‘__pyx_pf_9mujoco_py_4cymj_12PyMjvPerturb_7active2_2__set__’:
/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c:70110:22: error: ‘mjvPerturb {aka struct _mjvPerturb}’ has no member named ‘active2’; did you mean ‘active’?
__pyx_v_self->ptr->active2 = __pyx_v_x;
^~~~~~~
active
/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c: In function ‘__pyx_f_9mujoco_py_4cymj_10PyMjvScene__set’:
/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c:76665:82: error: ‘mjvScene {aka struct _mjvScene}’ has no member named ‘framergb’; did you mean ‘camera’?
__pyx_t_1 = ((PyObject *)__pyx_f_9mujoco_py_4cymj__wrap_float_1d((&(__pyx_v_p->framergb[0])), 3)); if (unlikely(!__pyx_t_1)) __PYX_ERR(4, 3743, __pyx_L1_error)
^~~~~~~~
camera
/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c: In function ‘__pyx_pf_9mujoco_py_4cymj_10PyMjvScene_10framewidth___get__’:
/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c:77372:53: error: ‘mjvScene {aka struct _mjvScene}’ has no member named ‘framewidth’
__pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_self->ptr->framewidth); if (unlikely(!__pyx_t_1)) __PYX_ERR(4, 3774, __pyx_L1_error)
^~
/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c: In function ‘__pyx_pf_9mujoco_py_4cymj_10PyMjvScene_10framewidth_2__set__’:
/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/cymj.c:77427:20: error: ‘mjvScene {aka struct _mjvScene}’ has no member named ‘framewidth’
__pyx_v_self->ptr->framewidth = __pyx_v_x;
.
.
.
.
Traceback (most recent call last):
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/unixccomp
iler.py", line 186, in _compile
self.spawn(compiler_so + cc_args + [src, '-o', obj] + extra_postargs)
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/ccompiler
.py", line 987, in spawn
spawn(cmd, dry_run=self.dry_run, **kwargs)
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/spawn.py"
, line 70, in spawn
raise DistutilsExecError(
distutils.errors.DistutilsExecError: command '/usr/bin/gcc' failed with exit code 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/init.py", line 2,
in
from mujoco_py.builder import cymj, ignore_mujoco_warnings, functions, MujocoException
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/builder.py", line 504, in
cymj = load_cython_ext(mujoco_path)
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/builder.py", line 110, in load_cython_ext
cext_so_path = builder.build()
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/builder.py", line 226, in build
built_so_file_path = self._build_impl()
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/builder.py", line 249, in _build_impl
dist.run_commands()
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 973, in run_commands
self.run_command(cmd)
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
cmd_obj.run()
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
self.build_extensions()
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/mujoco_py/builder.py", line 149, in build_extensions
build_ext.build_extensions(self)
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
_build_ext.build_ext.build_extensions(self)
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 466, in build_extensions
self._build_extensions_serial()
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 492, in _build_extensions_serial
self.build_extension(ext)
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 547, in build_extension
objects = self.compiler.compile(
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/ccompiler.py", line 599, in compile
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
File "/home/grads/s/sapanac/.conda/envs/hf_dt/lib/python3.10/site-packages/setuptools/_distutils/unixccompiler.py", line 188, in _compile
raise CompileError(msg)
distutils.errors.CompileError: command '/usr/bin/gcc' failed with exit code 1

Unable to reproduce results on hopper-medium-expert-v2

The README shows a last score of 106.24 ± 6.09 for IQL on hopper-medium-expert-v2, with a corresponding learning-curve plot on the linked dashboard (screenshot not reproduced here).

However, when I try to reproduce a similar plot with the latest code, running python algorithms/iql.py --config=configs/iql/hopper/medium_expert_v2.yaml with 5 different seeds, I get a clearly worse result: the last score is around 80 (screenshot not reproduced here).

Any idea where this discrepancy may be coming from?

Issues on getting antmaze-medium-play-v0 results with iql

Hi there,
Thank you for releasing the CORL benchmark. I cloned the latest repo and used the parameters below to run the antmaze-medium-play-v0 experiment. However, I got near-zero normalized reward for the first 430,000 gradient steps.

I did not change the code except using these parameters:

class TrainConfig:
    # Experiment
    device: str = "cpu"
    env: str = "antmaze-medium-play-v0"  # OpenAI gym environment name
    seed: int = 0  # Sets Gym, PyTorch and Numpy seeds
    eval_freq: int = int(1e4)  # How often (time steps) we evaluate
    n_episodes: int = 100  # How many episodes run during evaluation
    max_timesteps: int = int(1e6)  # Max time steps to run environment
    checkpoints_path: str = "./models/iql"  # Save path
    load_model: str = ""  # Model load file name, "" doesn't load
    # IQL
    buffer_size: int = 10_000_000  # Replay buffer size
    batch_size: int = 256  # Batch size for all networks
    discount: float = 0.99  # Discount factor
    tau: float = 0.005  # Target network update rate
    beta: float = 10.0  # Inverse temperature. Small beta -> BC, big beta -> maximizing Q
    iql_tau: float = 0.9  # Coefficient for asymmetric loss
    iql_deterministic: bool = False  # Use deterministic actor
    normalize: bool = True  # Normalize states
    normalize_reward: bool = False  # Normalize reward
    # Wandb logging
    project: str = "CORL-default"
    group: str = "IQL-D4RL"
    name: str = "IQL"

And the results are as below:

 % python iql.py
objc[33597]: Class GLFWApplicationDelegate is implemented in both /Users/xxx/.mujoco/mujoco210/bin/libglfw.3.dylib (0x11aa13778) and /opt/anaconda3/envs/iql2/lib/python3.10/site-packages/glfw/libglfw.3.dylib (0x11aabc7e8). One of the two will be used. Which one is undefined.
objc[33597]: Class GLFWWindowDelegate is implemented in both /Users/xxx/.mujoco/mujoco210/bin/libglfw.3.dylib (0x11aa13700) and /opt/anaconda3/envs/iql2/lib/python3.10/site-packages/glfw/libglfw.3.dylib (0x11aabc810). One of the two will be used. Which one is undefined.
objc[33597]: Class GLFWContentView is implemented in both /Users/xxx/.mujoco/mujoco210/bin/libglfw.3.dylib (0x11aa137a0) and /opt/anaconda3/envs/iql2/lib/python3.10/site-packages/glfw/libglfw.3.dylib (0x11aabc860). One of the two will be used. Which one is undefined.
objc[33597]: Class GLFWWindow is implemented in both /Users/xxx/.mujoco/mujoco210/bin/libglfw.3.dylib (0x11aa13818) and /opt/anaconda3/envs/iql2/lib/python3.10/site-packages/glfw/libglfw.3.dylib (0x11aabc8d8). One of the two will be used. Which one is undefined.
Warning: Flow failed to import. Set the environment variable D4RL_SUPPRESS_IMPORT_ERROR=1 to suppress this message.
No module named 'flow'
Warning: CARLA failed to import. Set the environment variable D4RL_SUPPRESS_IMPORT_ERROR=1 to suppress this message.
No module named 'carla'
pybullet build time: Oct 16 2022 01:59:14
/opt/anaconda3/envs/iql2/lib/python3.10/site-packages/gym/envs/registration.py:505: UserWarning: WARN: The environment antmaze-medium-play-v0 is out of date. You should consider upgrading to version `v2` with the environment ID `antmaze-medium-play-v2`.
  logger.warn(
/Users/xxx/Documents/project_offlineexploration/D4RL_6330b4e09e36a80f4b706a3885d59d97745c05a9/d4rl/locomotion/ant.py:180: UserWarning: This environment is deprecated. Please use the most recent version of this environment.
  offline_env.OfflineEnv.__init__(self, **kwargs)
Target Goal:  (20.64647417679362, 21.089515421327548)
/opt/anaconda3/envs/iql2/lib/python3.10/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
load datafile: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:03<00:00,  2.14it/s]
Dataset size: 999092
Checkpoints path: ./models/iql
---------------------------------------
Training IQL, Env: antmaze-medium-play-v0, Seed: 0
---------------------------------------
wandb: Currently logged in as: lxu. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.13.4 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.21
wandb: Run data is saved locally in /Users/xxx/Documents/default_repo/CORL/algorithms/wandb/run-20221019_133015-2d1a2d9d-8f35-4295-bac7-e39fa293699c
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run IQL
wandb: ⭐️ View project at https://wandb.ai/xxx/CORL-default
wandb: 🚀 View run at https://wandb.ai/xxx/CORL-default/runs/2d1a2d9d-8f35-4295-bac7-e39fa293699c
wandb: WARNING Calling wandb.run.save without any arguments is deprecated.Changes to attributes are automatically persisted.

(results plot omitted: the normalized reward stays near zero)

Add a switch to turn off wandb logging?

Dear developers,

Thank you and Happy new year.
Would you accept a PR adding an option to enable/disable wandb logging (it could certainly be enabled by default)? The motivation is that wandb logging is not necessary when debugging or digging into an algorithm.

Best regards,
Levi
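
One possible shape for such a switch (a sketch only; the use_wandb flag is hypothetical and not an existing CORL config field):

import uuid
import wandb

def wandb_init(config, use_wandb=True):
    # mode="disabled" turns every subsequent wandb.log call into a no-op,
    # so the rest of the training loop does not need to change
    wandb.init(
        config=config,
        project=config["project"],
        group=config["group"],
        name=config["name"],
        id=str(uuid.uuid4()),
        mode="online" if use_wandb else "disabled",
    )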

Binary Env results?

Just wondering whether there are code/results for the binary environments with sparse rewards, i.e. the ones used in the IQL paper's fine-tuning experiment tables?

evaluation metrics

I am glad to have found this project, but I would like to ask whether there are evaluation metrics for offline reinforcement learning (since in some cases no environment is available), such as OPE, initial value, soft opt score, etc.

Error installing dependencies

When I try to install the dependencies in a brand new Conda environment:

pip install -r requirements/requirements_dev.txt

it errors out with:

ERROR: Could not find a version that satisfies the requirement torch==1.11.0+cu113 (from versions: 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0)
ERROR: No matching distribution found for torch==1.11.0+cu113

I am using Python 3.8.16. Am I using the wrong Python version?

The issue of LB-SAC GPU memory usage

I ran the hopper expert task, which uses 50 critic networks. I checked the GPU memory occupied by the process on an RTX 3090 GPU (using the nvidia-smi command), and it was approximately 3.7 GB. However, the paper reports 5.4 GB, which is a significant difference. I would like to know the reason for this: is there an error in my way of checking GPU memory, or does the GPU model matter? Looking forward to your early reply!

Question: about the layernorm on token input

Hi, I have one tiny question about DT:

CORL/algorithms/dt.py

Lines 343 to 346 in 2a7b88c

# LayerNorm and Dropout (!!!) as in original implementation,
# while minGPT & huggingface uses only embedding dropout
out = self.emb_norm(sequence)
out = self.emb_drop(out)

However, it seems the original implementation did not use layernorm (ref: for atari https://github.com/kzl/decision-transformer/blob/e2d82e68f330c00f763507b3b01d774740bee53f/atari/mingpt/model_atari.py#L260 and for mujoco https://github.com/kzl/decision-transformer/blob/e2d82e68f330c00f763507b3b01d774740bee53f/gym/decision_transformer/models/trajectory_gpt2.py#L687). Am I missing anything ?🤔

Some questions on reproducibility of IQL

Hi,
Thanks for providing CORL to fairly benchmark offline algorithms. I have some questions about the details. In IQL, the authors' implementation normalizes the reward only (see https://github.com/ikostrikov/implicit_q_learning/blob/09d700248117881a75cb21f0adb95c6c8a694cb2/train_offline.py#L35), while CORL normalizes observations for the halfcheetah, hopper, and walker2d tasks and leaves the rewards unchanged. Is it necessary to match CORL's normalization strategy to the one used in the original implementation?
Please correct me if there is any misunderstanding =)
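
For context, the observation normalization being discussed boils down to something like the following (a simplified sketch in the spirit of CORL's normalize flag, not a verbatim copy of the repository code):

import numpy as np

def compute_mean_std(states, eps=1e-3):
    # statistics are computed over the offline dataset's observations
    mean = states.mean(0)
    std = states.std(0) + eps  # eps keeps the division well-behaved
    return mean, std

def normalize_states(states, mean, std):
    return (states - mean) / std

# rewards are left untouched, which is exactly the difference the question points out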

Improving and fixing underperformance for the CQL on AntMaze tasks

Even though we have made sure that almost all algorithms match the performance reported in the original papers, CQL has turned out to be one of the most difficult algorithms to reproduce accurately, and its performance varies greatly from paper to paper.

We have already made several revisions and refactorings of CQL, gradually improving its performance as much as time and resource constraints allowed (see #12 for a public example). Still, we have not yet been able to reproduce the results reported in the IQL paper on the AntMaze datasets.

Therefore, we would welcome any contributors who know CQL better than we do to help make it a reliable baseline. We can help by running heavy multi-seed benchmarks on our resources. We do, however, expect at least single-seed checks to be provided with new PRs.

An error

Looking forward to your reply. When I tried to download the medium-replay data, I got the error:
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (file signature not found)
I can't solve this error. Thanks for your help!
