Comments (9)
Which Python version? In general, TRPO can sometimes exhibit small numerical instabilities.
I just kicked off a run with Python 3.6 and am seeing a steady increase:
[2017-07-24 23:15:25,367] Finished episode 300 after 136 timesteps
[2017-07-24 23:15:25,367] Episode reward: 136.0
[2017-07-24 23:15:25,368] Average of last 500 rewards: 48.662
[2017-07-24 23:15:25,368] Average of last 100 rewards: 138.07
2nd run:
[2017-07-24 23:18:02,651] Episode reward: 200.0
[2017-07-24 23:18:02,651] Average of last 500 rewards: 21.498
[2017-07-24 23:18:02,651] Average of last 100 rewards: 94.35
[2017-07-24 23:18:19,890] Finished episode 200 after 200 timesteps
[2017-07-24 23:18:19,891] Episode reward: 200.0
[2017-07-24 23:18:19,891] Average of last 500 rewards: 41.312
[2017-07-24 23:18:19,892] Average of last 100 rewards: 173.19
from tensorforce.
Python 2.7.12.
I also saw a steady increase in the beginning (but only in the beginning); the logs linked above show that.
Ok, I can have a go at reproducing that tomorrow in a venv. There may, however, always be problematic runs with bad seeds, but it seems fine in 3.6.2. Third of 3 runs:
[2017-07-24 23:21:26,770] Finished episode 250 after 200 timesteps
[2017-07-24 23:21:26,770] Episode reward: 200.0
[2017-07-24 23:21:26,770] Average of last 500 rewards: 34.61
[2017-07-24 23:21:26,770] Average of last 100 rewards: 107.64
[2017-07-24 23:21:40,879] Finished episode 300 after 200 timesteps
but seems fine in 3.6.2
Could you please post a full log from python3 examples/quickstart.py (running till the end, 3000 episodes) with Python 3.6.2?
Have to leave for today, but here are the first 1500 episodes from the previous run before I killed it... In general, I believe you that performance might go down again, because we did not implement KL-divergence annealing, where we stop updating after a while. But CartPole is considered solved if there are 100 episodes with 200 reward, I think:
[2017-07-24 23:23:54,495] Finished episode 1150 after 200 timesteps
[2017-07-24 23:23:54,495] Episode reward: 200.0
[2017-07-24 23:23:54,495] Average of last 500 rewards: 91.75
[2017-07-24 23:23:54,495] Average of last 100 rewards: 174.28
[2017-07-24 23:24:11,088] Finished episode 1200 after 200 timesteps
[2017-07-24 23:24:11,088] Episode reward: 200.0
[2017-07-24 23:24:11,088] Average of last 500 rewards: 105.69
[2017-07-24 23:24:11,088] Average of last 100 rewards: 197.2
[2017-07-24 23:24:27,112] Finished episode 1250 after 200 timesteps
[2017-07-24 23:24:27,113] Episode reward: 200.0
[2017-07-24 23:24:27,113] Average of last 500 rewards: 118.014
[2017-07-24 23:24:27,113] Average of last 100 rewards: 190.1
[2017-07-24 23:24:43,777] Finished episode 1300 after 80 timesteps
[2017-07-24 23:24:43,777] Episode reward: 80.0
[2017-07-24 23:24:43,777] Average of last 500 rewards: 130.174
[2017-07-24 23:24:43,777] Average of last 100 rewards: 183.11
[2017-07-24 23:25:03,404] Finished episode 1350 after 200 timesteps
[2017-07-24 23:25:03,404] Episode reward: 200.0
[2017-07-24 23:25:03,404] Average of last 500 rewards: 143.536
[2017-07-24 23:25:03,404] Average of last 100 rewards: 190.07
[2017-07-24 23:25:22,964] Finished episode 1400 after 200 timesteps
[2017-07-24 23:25:22,964] Episode reward: 200.0
[2017-07-24 23:25:22,964] Average of last 500 rewards: 157.062
[2017-07-24 23:25:22,964] Average of last 100 rewards: 198.56
[2017-07-24 23:25:41,130] Finished episode 1450 after 200 timesteps
[2017-07-24 23:25:41,130] Episode reward: 200.0
[2017-07-24 23:25:41,130] Average of last 500 rewards: 170.024
[2017-07-24 23:25:41,130] Average of last 100 rewards: 200.0
[2017-07-24 23:25:59,222] Finished episode 1500 after 200 timesteps
[2017-07-24 23:25:59,223] Episode reward: 200.0
[2017-07-24 23:25:59,223] Average of last 500 rewards: 181.628
[2017-07-24 23:25:59,223] Average of last 100 rewards: 200.0
[2017-07-24 23:26:16,896] Finished episode 1550 after 200 timesteps
[2017-07-24 23:26:16,896] Episode reward: 200.0
[2017-07-24 23:26:16,896] Average of last 500 rewards: 190.89
[2017-07-24 23:26:16,896] Average of last 100 rewards: 200.0
[2017-07-24 23:26:35,774] Finished episode 1600 after 200 timesteps
[2017-07-24 23:26:35,774] Episode reward: 200.0
[2017-07-24 23:26:35,775] Average of last 500 rewards: 195.774
[2017-07-24 23:26:35,775] Average of last 100 rewards: 200.0
cartpole is considered solved if there is 100 eps with 200 reward I think
Ah, I forgot about that, cheers. I checked on https://gym.openai.com/envs/CartPole-v0 and the bar is even lower:
CartPole-v0 defines "solving" as getting average reward of 195.0 over 100 consecutive trials.
However, it still looks like my run of python examples/quickstart.py will not pass, but the second one (the one with the longer command) would.
I checked the configs and they looked the same...
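The "solving" criterion quoted above (average reward of 195.0 over 100 consecutive trials) can be checked on a reward history with a small helper. This is my own sketch, not part of the gym or tensorforce APIs:

```python
# Sketch: checking the CartPole-v0 "solved" criterion -
# average reward of at least 195.0 over 100 consecutive episodes.
from collections import deque

def is_solved(episode_rewards, window=100, threshold=195.0):
    """Return True if any run of `window` consecutive episodes
    averages at least `threshold` reward."""
    recent = deque(maxlen=window)
    for r in episode_rewards:
        recent.append(r)
        if len(recent) == window and sum(recent) / window >= threshold:
            return True
    return False

# 100 perfect episodes after a shaky start count as solved.
print(is_solved([10.0] * 50 + [200.0] * 100))  # True
print(is_solved([150.0] * 100))                # False
```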
I ran it a second time and now it looks much better.
There might however always be problematic runs with bad seeds
I think it might be this. By bad seeds, do you mean the network weights were drawn "unluckily" (i.e. in such a way that TRPO could later not get 200 for 100 episodes)? Would that mean it was getting stuck in local extrema (while finding weights) that were far from the optimum? That would be sad, as the network is small (2 layers with 32 neurons). If that's the case, then one has to run TRPO multiple times and record "the best run"? That seems very inefficient, especially with more complex environments in mind. Or are there parameter changes that should help?
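The "run it multiple times and record the best run" idea can be sketched as below. Here `train_run` is a hypothetical stand-in for one full training run (e.g. one invocation of examples/quickstart.py with a fixed seed); it is faked with a seeded RNG purely so the sketch is runnable:

```python
# Sketch: run several seeds, report the per-seed results,
# the best seed, and the mean across seeds.
import random

def train_run(seed):
    """Hypothetical stand-in for one full training run; pretends the
    final 100-episode average reward depends only on the seed."""
    rng = random.Random(seed)
    return rng.uniform(100.0, 200.0)  # fake final average reward

def evaluate_seeds(seeds):
    results = {seed: train_run(seed) for seed in seeds}
    best_seed = max(results, key=results.get)
    mean_reward = sum(results.values()) / len(results)
    return results, best_seed, mean_reward

results, best_seed, mean_reward = evaluate_seeds(range(5))
print(f"mean over 5 seeds: {mean_reward:.1f}, best seed: {best_seed}")
```

Reporting the mean (and variance) across seeds, rather than a single lucky run, is the usual way to compare RL algorithms fairly.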
Hi, ok good. So there are 2 issues:
- Yes, there can always be local extrema which the algorithm cannot escape. This is why it makes sense to add a heuristic for the learning rate that increases it when there is no improvement or decreases it when performance gets worse. This is always the case with reinforcement learning.
- Another thing is that this is a stochastic policy, so it samples an action. This means it will always do slightly random things, even at 200 reward. Hence, another option is to move from sampling to deterministic acting once the policy performs well (there is a flag in the act method for this).
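The learning-rate heuristic from the first point can be sketched as follows. This is an assumption of what such a heuristic might look like, not tensorforce's actual implementation:

```python
# Sketch (assumed, not tensorforce code): raise the learning rate
# when reward stagnates, lower it when reward gets worse.
def adapt_learning_rate(lr, prev_avg_reward, avg_reward,
                        tol=1.0, up=1.5, down=0.5,
                        lr_min=1e-5, lr_max=1e-1):
    if avg_reward < prev_avg_reward - tol:      # got worse: step down
        lr *= down
    elif avg_reward < prev_avg_reward + tol:    # stagnating: step up
        lr *= up
    return min(max(lr, lr_min), lr_max)         # keep within bounds

print(adapt_learning_rate(1e-3, 100.0, 80.0))   # halved (got worse)
print(adapt_learning_rate(1e-3, 100.0, 100.5))  # raised (stagnant)
print(adapt_learning_rate(1e-3, 100.0, 150.0))  # unchanged (improving)
```

The thresholds and multipliers here are illustrative; in practice they would be tuned per environment.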
Hope this helps; I would then close the issue - this is just the general nature of RL algorithms, not a bug. And yes to the question: researchers report averages/bests of many runs for exactly this reason, and it is also why reproducing results reliably is difficult.
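The sampling-vs-deterministic distinction from the second point can be sketched generically. This is not the tensorforce API itself (where, per the comment above, a flag on the agent's act method selects the mode, though the exact signature may differ across versions); it just shows the two behaviors for a small softmax policy:

```python
# Generic sketch: sample from the policy while learning,
# take the argmax action once the policy performs well.
import math
import random

def act(action_logits, deterministic=False, rng=random):
    if deterministic:
        # Deterministic: always pick the highest-scoring action.
        return max(range(len(action_logits)),
                   key=action_logits.__getitem__)
    # Stochastic: sample from the softmax distribution over logits.
    mx = max(action_logits)
    exps = [math.exp(l - mx) for l in action_logits]
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(action_logits) - 1

logits = [0.1, 2.0]                      # action 1 is clearly preferred
print(act(logits, deterministic=True))   # always 1
print(act(logits))                       # usually 1, occasionally 0
```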
Thank you Michael, that helps a lot. Keep up the good work.