
decision-transformer's People

Contributors

enosair, kzl, lili-chen, zitterbewegung


decision-transformer's Issues

About parameters in code

Thank you for sharing. This is a great model, but I don't quite understand some parameters in the code. For example, in the block that checks the environment name (env_name == '...'), what are the max_ep_len, env_targets, and scale parameters, and what are their functions?
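For context, the environment-specific block in gym/experiment.py looks roughly like the sketch below (the numbers are illustrative, not necessarily the repository's exact values): max_ep_len caps the episode length at evaluation time, env_targets lists the target returns the evaluation rollouts condition on, and scale divides returns-to-go before they are fed to the model.

import gym

env_name = 'hopper'  # placeholder

# Illustrative sketch of the env_name branch in gym/experiment.py
# (values are placeholders; check the file for the exact numbers).
if env_name == 'hopper':
    env = gym.make('Hopper-v3')
    max_ep_len = 1000            # episodes are cut off at this length during evaluation
    env_targets = [3600, 1800]   # target returns that evaluation rollouts condition on
    scale = 1000.0               # returns-to-go are divided by this before entering the model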

The setting of final token

Great work, and thanks for open-sourcing the code. In the Atari experiments, is there any reason for setting the final token count to "2 * dataset_length * block_size" in the code? In the Appendix, this hyperparameter is set to 2 * 50000 * K. I don't get the point of the factor of 2; I would have expected the final token count to be "dataset_length * block_size". Please correct me if I have missed something. Thanks.
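For reference, in minGPT's trainer (which the Atari code appears to reuse), final_tokens only controls the learning-rate decay schedule, not how much data is seen; a rough sketch of that schedule:

import math

# Rough sketch of minGPT's token-based LR schedule. final_tokens sets where the
# cosine decay bottoms out, so doubling it simply makes the learning rate decay
# more slowly over the same amount of training.
def lr_multiplier(tokens_processed, warmup_tokens, final_tokens):
    if tokens_processed < warmup_tokens:
        return tokens_processed / max(1, warmup_tokens)  # linear warmup
    progress = (tokens_processed - warmup_tokens) / max(1, final_tokens - warmup_tokens)
    return max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))  # cosine decay, floored at 0.1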

Global position embedding and timesteps look wrong in atari

I'm not familiar with position encoding, but if my understanding is correct, global_pos_emb is used for only a single timestep per sample in the Atari code. Is this intended?

Essentially, the current code computes global position encoding in the following way:

import torch

max_time_step = 10
emb_dim = 7
batch_size = 2
timesteps = torch.tensor([1, 3]).view(batch_size, 1, 1)  # the dataset implementation returns the relative index (within an episode) of the first state

global_pos_emb = torch.rand(1, max_time_step, emb_dim)
all_global_pos_emb = torch.repeat_interleave(
    global_pos_emb, batch_size, dim=0
)  # batch_size, traj_length, n_embd 


gathered = torch.gather(
    all_global_pos_emb,
    1,
    torch.repeat_interleave(timesteps, emb_dim, dim=-1),
)
# gathered.shape is (batch_size, 1, emb_dim): a single position embedding per sample

Citation request

Hi,

I sent an email and received no response, so I am trying the issues section as a way to contact the authors.

I would like to request a citation from the "Decision Transformers" paper. I believe our work is very relevant: the novelty presented in the "Decision Transformers" paper is identical to ours, which we introduced nearly two years ago.

It's a blog post and not a paper, but I don't think that matters. The source code has also been public for a long time. Here is the blog post in question: https://ogma.ai/2019/08/acting-without-rewards/

The idea of "RL as a sequence prediction/generation problem" is identical to ours. The use of the Transformer is not, but that is not the novelty being presented so I don't think it matters.

We used slightly different language, as we do not use Transformers but rather a bio-inspired system (that avoids backpropagation). Still, it performs the whole process of predicting a sequence and doing "goal relabeling". We took it a step further and did so hierarchically as well. As in Decision Transformers, we do not use any classic RL algorithm (no dynamic programming); rather, we learn to predict sequences in such a way that they can be "prompted" to generate desired trajectories. We invented it specifically as a way to avoid rewards, but rewards can be used as well. Decision Transformers also do not strictly require rewards, as shown in one of the experiments.

The ideas in "Upside-Down Reinforcement Learning" by Juergen Schmidhuber are also similar. However, our work pre-dates that as well; since we cannot reach Juergen Schmidhuber for a citation, it would be kind if we could at least get one from you.

Thanks

Timesteps Shape

Hi there,

Thanks for the code sharing and I'm trying to go through the paper with the help of the code.

When initializing the dataset, I found that timesteps are selected with [idx:idx+1] on line 64 of run_dt_atari.py, resulting in a shape of (batch_size, 1) rather than the shape given in the comment on line 225 of model_atari.py, (batch_size, block_size, 1).

Could you confirm the intended shape of timesteps and whether there is an error in the timestep selection?

difference between two GPT models used in this repo?

Hi, thanks a lot for releasing the code. I noticed that you use separate GPT implementations for Atari and MuJoCo: one comes from minGPT and the other from Hugging Face (I'm not sure which specific file of the Hugging Face repo — could you tell?).
I'm wondering whether there is any difference between the two GPT models. I would like to run the Gym MuJoCo experiments, but the code in /gym/decision_transformer/models/decision_transformer.py seems intimidating. Is it possible to use the minGPT implementation instead?

Why the padding is different for state, action, reward?

s[-1] = np.concatenate([np.zeros((1, max_len - tlen, state_dim)), s[-1]], axis=1)
s[-1] = (s[-1] - state_mean) / state_std
a[-1] = np.concatenate([np.ones((1, max_len - tlen, act_dim)) * -10., a[-1]], axis=1)
r[-1] = np.concatenate([np.zeros((1, max_len - tlen, 1)), r[-1]], axis=1)
d[-1] = np.concatenate([np.ones((1, max_len - tlen)) * 2, d[-1]], axis=1)
rtg[-1] = np.concatenate([np.zeros((1, max_len - tlen, 1)), rtg[-1]], axis=1) / scale
timesteps[-1] = np.concatenate([np.zeros((1, max_len - tlen)), timesteps[-1]], axis=1)
mask.append(np.concatenate([np.zeros((1, max_len - tlen)), np.ones((1, tlen))], axis=1))

It's easy to understand padding the state with np.zeros(...), but why pad the action with np.ones(...) * -10 and the done flag with np.ones(...) * 2?

about graph experiment

Hi, I'm interested in the graph example presented in the paper and wonder if that code can be also published to play around with. Thanks!

Do both BC and DT fit the training data well?

Hi, thanks for the interesting work!
A question: how well do Behavior Cloning and Decision Transformer fit the training data (especially when there is a mixture of policies, as with replay data or medium + expert)? This doesn't seem to be reported in the paper. Do they fit the data (roughly) equally well?

batch sampling: only last tokens?

Hi! Thank you for the good paper.

According to paper:

We feed the last K timesteps into Decision Transformer ...

Does this apply only to inference, or are only the last K tokens of each episode also sampled when constructing training batches? (The latter seems strange to me, but the sampling code is not completely clear to me.)
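For what it's worth, my reading of gym/experiment.py's get_batch is that training windows start at a random index within each sampled trajectory; only evaluation feeds the most recent K timesteps. A simplified sketch of that sampling, using a dummy trajectory:

import numpy as np

# Simplified view (my reading, not the exact code) of how a training window is
# drawn in get_batch: a random start index per sampled trajectory, then a window
# of at most K steps from there. Only at evaluation time is the context
# restricted to the *last* K timesteps of the rollout.
K = 20
traj_rewards = np.random.randn(1000)  # stand-in for one trajectory's rewards

si = np.random.randint(0, traj_rewards.shape[0] - 1)  # random start index
window = traj_rewards[si:si + K]                       # training context, length <= K
print(si, window.shape)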

Padding tokens represented differently in different parts of the code

Thank you for your great work and for making your code available!

A question regarding padding tokens: they seem to be handled slightly differently in different parts of the code. When loading the data for the experiments, the padding token values appear to be informed by the environment characteristics (e.g. -10 for actions in MuJoCo, 2 for dones, and 0 for the other token types). However, on the model side, for action prediction all padding tokens are zeros. We were unsure of the reason for this difference, but inferred that since the attention mask marks the positions of the padding tokens, it ultimately overrides these slight differences.

We noticed the same pattern in other work, such as GDT, which was built on top of this repository.

Could you please let us know more about your implementation of padding tokens and why they are represented differently? And do their actual values matter when their position is reflected in the attention mask?
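Not an authoritative answer, but the following small illustration (my own sketch, not the repository's code) shows why the exact padding values should not matter: positions with attention_mask == 0 are ignored by the transformer's attention and are also filtered out of the loss, so whatever value sits in a padded slot never influences training.

import torch

act_dim = 3
action_preds = torch.randn(2, 4, act_dim)        # (batch, seq, act_dim)
action_target = torch.randn(2, 4, act_dim)
attention_mask = torch.tensor([[0, 0, 1, 1],
                               [0, 1, 1, 1]])    # 0 marks (left-)padded positions

# Same filtering idea as in the training loop: only unmasked positions contribute.
preds = action_preds.reshape(-1, act_dim)[attention_mask.reshape(-1) > 0]
target = action_target.reshape(-1, act_dim)[attention_mask.reshape(-1) > 0]
loss = torch.mean((preds - target) ** 2)
print(loss)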

Error when loading fixed replay buffer

Hi, Thank you for the code contribution.

I tried the code with this command:
python atari/run_dt_atari.py --seed 123 --epochs 5 --model_type 'reward_conditioned' --num_steps 500000 --num_buffers 50 --game 'Breakout' --batch_size 64

But it turned out like this (0 loaded transitions every time):

[screenshot: output showing 0 loaded transitions]

I think this problem is caused by the fixed replay buffer (frb) not being created properly in fixed_replay_buffer.py. This part:

[screenshot: the relevant code in fixed_replay_buffer.py]

When the circular replay buffer from the dopamine library is declared and loaded, it is not created properly, so the fixed replay buffer always returns None.

Is it a bug or am I doing something wrong in this setting?

[IDEA] Code for dataset generation

I was wondering if it would be possible to release the code that was used to generate the datasets for the gym and atari experiments. That would facilitate the evaluation of decision transformer methods on other environments.

Some problems after reading the paper and code

After reading your paper and open-source code, I have three questions. If it's convenient, I hope you can help me answer them.

  1. It seems that the reward in the dataset is only used to generate return-to-go in get_batch. Although rewards are passed as input during training and evaluation, I don't see them being used by the network. Is the reward's only function to generate return-to-go?
  2. When setting up the environment, a target_return has to be chosen. What is the function of target_return? It seems that even if it is larger than the largest return-to-go in the dataset, the experiment can still succeed. In other words, what impact does target_return have on the network?
  3. During evaluation, each target_return is evaluated for 100 episodes. My understanding was that the evaluation results should keep improving, i.e. the rewards in the evaluation stage should get better and better, but that is not what I observe. What is the reason?

I hope you can resolve my doubts at your convenience. Thank you very much! Good luck!

The results on Mujoco reported in paper might be heavily influenced by env version

Hello there,

Recently, we reproduced some experiments in offline reinforcement learning and found that the Decision Transformer paper cites the CQL results from the original CQL paper. The problem is that DT uses MuJoCo version 2 environments (hopper-v2, walker2d-v2), while the original CQL uses version 0 (hopper-v0, walker2d-v0), and the reward scale differs between these environment versions. We therefore ran DT and CQL in the same environments (hopper-v2, walker2d-v2), and CQL was better than DT on almost all tasks (except hopper-replay). So I wonder:

  1. Did you take the environment version into account in the results?
  2. Referring to #16, the score is normalized by an expert policy from https://github.com/rail-berkeley/d4rl/blob/master/d4rl/infos.py . However, the results I get from the official code are far from the results reported in the paper. Did I miss some key component of the DT code?
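For reference, D4RL's normalized score is a linear rescaling against the random and expert reference scores listed in d4rl/infos.py (0 corresponds to a random policy, 100 to the reference expert). A sketch, assuming d4rl is installed and exposes get_normalized_score on its offline environments:

import gym
import d4rl  # importing d4rl registers the offline environments and reference scores

def normalized_score(env_name, episode_return):
    # (return - random_score) / (expert_score - random_score), scaled to percent
    env = gym.make(env_name)
    return 100.0 * env.get_normalized_score(episode_return)

print(normalized_score('hopper-medium-v2', 1600.0))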

Looking forward to your reply!

Best Wishes

TypeError on trajectory_gpt2.py

When I ran experiment.py, it raised the following error:

File "D:\ReinformentLearning\decision-transformer-master\gym\decision_transformer\models\trajectory_gpt2.py", line 590, in GPT2Model
config_class=_CONFIG_FOR_DOC,
TypeError: add_code_sample_docstrings() got an unexpected keyword argument 'tokenizer_class'

I guess this is caused by a different version of the transformers library, but I cannot solve it.

Thanks

Application to multi-agent environment

Thank you for your great work!

I was wondering if this method can be applied to an environment with multiple agents, something like this. I am just starting out in reinforcement learning, and for now I see no reason why it couldn't be applied, but maybe you have thought about it?

State and Return preds input

The comment on the following line, and on the line after it, says that the return and state predictions are output using both the state and action as inputs, although the expression only seems to use the action information (index 2). Am I missing something, or is there some ambiguity? I know it won't affect learning, since only the action predictions are used.

return_preds = self.predict_return(x[:,2]) # predict next return given state and action
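For anyone else puzzling over this: my reading (a sketch with placeholder modules, not the repository's code verbatim) is that index 2 selects the transformer outputs at the action-token positions, and since the action token attends to the preceding return and state tokens, the prediction made there really is conditioned on both state and action.

import torch
import torch.nn as nn

batch_size, seq_length, hidden_size, state_dim, act_dim = 2, 4, 8, 3, 2
x = torch.randn(batch_size, 3 * seq_length, hidden_size)  # transformer output over (R, s, a, R, s, a, ...)

predict_return = nn.Linear(hidden_size, 1)
predict_state = nn.Linear(hidden_size, state_dim)
predict_action = nn.Sequential(nn.Linear(hidden_size, act_dim), nn.Tanh())

# After this reshape/permute, x[:, 0] holds the outputs at return tokens,
# x[:, 1] at state tokens, and x[:, 2] at action tokens.
x = x.reshape(batch_size, seq_length, 3, hidden_size).permute(0, 2, 1, 3)

return_preds = predict_return(x[:, 2])  # output at the action token, which has seen (R_t, s_t, a_t)
state_preds = predict_state(x[:, 2])    # likewise conditioned on state and action
action_preds = predict_action(x[:, 1])  # output at the state token, i.e. given (R_t, s_t)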

Confusion over shape of returns_to_go in get_batch

Hi, I'm trying to understand the following code in gym/experiment.py/get_batch():

rtg.append(discount_cumsum(traj['rewards'][si:], gamma=1.)[:s[-1].shape[1] + 1].reshape(1, -1, 1))
if rtg[-1].shape[1] <= s[-1].shape[1]:
    rtg[-1] = np.concatenate([rtg[-1], np.zeros((1, 1, 1))], axis=1)
...
tlen = s[-1].shape[1]

( from https://github.com/kzl/decision-transformer/blob/master/gym/experiment.py#:~:text=rtg.append(discount_cumsum,1))%5D%2C%20axis%3D1) )

As far as I can understand it, it's creating a sequence of (tlen + 1) rtg values, then checking whether the sequence length is <= tlen, and padding it with an extra value if so. (I'm struggling to see how this situation could ever arise.)
A few lines later, the padding code is applied, pre-padding with 0s to make sure everything has length max_len, except for rtg, which will now have length max_len + 1.

I don't understand the purpose of this extra value, especially since it seems to get stripped anyway by the SequenceTrainer:

state_preds, action_preds, reward_preds = self.model.forward(
    states, actions, rewards, rtg[:,:-1], timesteps, attention_mask=attention_mask,
)

Am I missing something?
Thanks!

Padding / attention_mask questions

Hi, thanks for making your code available. I'm trying to wrap my head around the padding in your gym implementation.

In this code:

mask.append(np.concatenate([np.zeros((1, max_len - tlen)), np.ones((1, tlen))], axis=1))

you are padding your inputs on the left, and creating an attention_mask so that the model will ignore the padding.

According to this (possibly out-of-date?) comment on the Hugging Face repo, GPT should ideally be padded on the right, and then the causal masking will take care of making sure nothing is conditioned on the padding values, making the attention_mask unnecessary:

GPT-2 is a model with absolute position embeddings (like Bert) so you should always pad on the right to get best performances for this model (will add this information to the doc_string).

As it's a causal model (only attend to the left context), also means that the model will not attend to the padding tokens (which are on the right) for any real token anyway.

So in conclusion, no need to take special care of avoiding attention on padding.

Just don't use the output of the padded tokens for anything as they don't contain any reliable information (which is obvious I hope).

(see huggingface/transformers#808 (comment) )

Can you explain the rationale behind the padding scheme? Or am I just getting the wrong end of the stick?
Cheers!

RuntimeError

Thanks for your exciting work!
I ran the example usage for gym and hit a RuntimeError:

[screenshot: RuntimeError traceback]

Can you help me fix this bug?

Understanding the use of rewards

As I read through the code, I do not understand how you are using rewards to learn an optimal policy. Can you point out where rewards are used in the Decision Transformer during training?

how to get the score of an expert policy and some other details

Hi,

Thanks for sharing this interesting work. I have a few questions about the paper:

  1. Could I ask how you obtained the score of the expert policy used to normalize the scores in the tables? If possible, could you share this information?

  2. When you report results in the paper (e.g., Table 4), which score do you use - the score at the last epoch or the best score during training?

Looking forward to your reply!

Best Wishes

AttributeError: 'GPT2Config' object has no attribute 'n_ctx'

Traceback (most recent call last):
File "D:\Python\Test\test\T\decision-transformer-master\gym\experiment.py", line 312, in <module>
experiment('gym-experiment', variant=vars(args))
File "D:\Python\Test\test\T\decision-transformer-master\gym\experiment.py", line 213, in experiment
model = DecisionTransformer(
File "D:\Python\Test\test\T\decision-transformer-master\gym\decision_transformer\models\decision_transformer.py", line 39, in __init__
self.transformer = GPT2Model(config)
File "D:\Python\Test\test\T\decision-transformer-master\gym\decision_transformer\models\trajectory_gpt2.py", line 522, in __init__
self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])
File "D:\Python\Test\test\T\decision-transformer-master\gym\decision_transformer\models\trajectory_gpt2.py", line 522, in <listcomp>
self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])
File "E:\Python\python3.10.5\lib\site-packages\transformers\configuration_utils.py", line 260, in __getattribute__
return super().__getattribute__(key)
AttributeError: 'GPT2Config' object has no attribute 'n_ctx'

Atari results

Hi,

Thanks for your wonderful work. I cannot reproduce the Atari performance reported in the paper. For example, compared to Table 1, my normalized score for Breakout is 147.738 and for Seaquest is 1.875 (averaged over 3 seeds; I use the same seeds as this script: https://github.com/kzl/decision-transformer/blob/master/atari/run.sh ). I wonder whether you used the same seeds (123, 231, 312) as that script, or did I miss something?

aligning action embeddings to other embeddings at line 237

at https://github.com/kzl/decision-transformer/blob/master/atari/mingpt/model_atari.py, line 237

token_embeddings[:,2::3,:] = action_embeddings[:,-states.shape[1] + int(targets is None):,:]

I am not quite sure about the purpose of checking whether targets is None. It seems to me it handles two cases of inputs:

1.(r_0,s_0,a_0,r_1,s_1,a_1,...,r_k,s_k,a_k) , in that we have all actions for each states for k timesteps, in this case the targets = actions

2.(r_0,s_0,a_0,r_1,s_1,a_1,...r_k,s_k), with the last action a_k to be predicted from s_k, the targets is None in this case (or we could still have the targets (a_0,a_1,...a_(k-1))?

However, it looks to me as though the quoted line of code would arrange the token embeddings in the following way when targets is absent:

(r_0,s_0,a_1,r_1,s_1,a_2,...,r_k,s_k), i.e. there is a misalignment between the states and actions, since the actions are shifted one position to the right. To me it should be written as

token_embeddings[:,2::3,:] = action_embeddings[:, -states.shape[1] : None if targets is not None else -1, :]

Please see if I have misunderstood the code.

More Training Information on Reacher

Hi,

first and foremost thanks a lot for releasing the code and this great work in general!

While you report great success on many games, we were wondering if you could release more information about your results on the Reacher dataset, namely:

  1. Maximum episode return available in the dataset for the Reacher medium-replay dataset (just like in the other plots in figure 4 of the paper)
  2. Your loss with the Decision Transformer for any or all of the Reacher datasets, just like you posted in #2

Finally, we were wondering if you have any results on an "expert"-only dataset for any of the games with the Decision Transformer. Per Rashidinejad et al. (2021) [1]: "imitation learning [...] is suitable for expert datasets and vanilla offline RL [...] often requires uniform coverage datasets" - we think it would be interesting to see whether this notion applies to the Decision Transformer as well.

Note that we are mainly asking this because we are currently in the process of applying the Decision Transformer to a simplified version of Bomberman, which is notably a multi-agent environment. We were also encouraged by your comment on issue #14 that this shouldn't be an issue (in theory). However, we are converging to a loss of about 0.75, which is obviously not great and indeed does not match the performance of the agents ("expert" because they are strong rule-based agents) whose policy was used to create our own trajectory ("expert"-only) dataset. In fact, the model is about 1/10th as good when playing against three agents that operate on the same policy used to collect the data, and it is far from the maximum episode return available in our dataset.

[1] Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. 2021. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. CoRR, abs/2103.12021.

After loading 50 trajectories, the terminal shows `killed`

Hello,
Thanks for your code. After reading readme-atari.md, I set up the environment and downloaded the dataset, then tried to run the following command (there is no --block_size argument, so I removed --block_size 90):

python -m atari.run_dt_atari.py --seed 123 --epochs 5 --model_type 'reward_conditioned' --num_steps 500000 --num_buffers 50 --game 'Breakout' --batch_size 64 --data_dir_prefix /home/shy/decision-transformer/atari/dqn_replay/

Then it shows:

loading from buffer 45 which has 0 already loaded
this buffer has 2196 loaded transitions and there are now 2196 transitions total divided into 1 trajectories
loading from buffer 2 which has 0 already loaded
this buffer has 1234 loaded transitions and there are now 3430 transitions total divided into 2 trajectories
loading from buffer 28 which has 0 already loaded
this buffer has 2413 loaded transitions and there are now 5843 transitions total divided into 3 trajectories
loading from buffer 34 which has 0 already loaded
this buffer has 2718 loaded transitions and there are now 8561 transitions total divided into 4 trajectories
loading from buffer 38 which has 0 already loaded
this buffer has 2326 loaded transitions and there are now 10887 transitions total divided into 5 trajectories
loading from buffer 17 which has 0 already loaded
this buffer has 2425 loaded transitions and there are now 13312 transitions total divided into 6 trajectories
loading from buffer 19 which has 0 already loaded
this buffer has 3063 loaded transitions and there are now 16375 transitions total divided into 7 trajectories
loading from buffer 42 which has 0 already loaded
this buffer has 1190 loaded transitions and there are now 17565 transitions total divided into 8 trajectories
loading from buffer 22 which has 0 already loaded
this buffer has 1002 loaded transitions and there are now 18567 transitions total divided into 9 trajectories
loading from buffer 33 which has 0 already loaded
this buffer has 1473 loaded transitions and there are now 20040 transitions total divided into 10 trajectories
loading from buffer 32 which has 0 already loaded
this buffer has 4009 loaded transitions and there are now 24049 transitions total divided into 11 trajectories
loading from buffer 49 which has 0 already loaded
this buffer has 2006 loaded transitions and there are now 26055 transitions total divided into 12 trajectories
loading from buffer 47 which has 0 already loaded
this buffer has 1935 loaded transitions and there are now 27990 transitions total divided into 13 trajectories
loading from buffer 9 which has 0 already loaded
this buffer has 1750 loaded transitions and there are now 29740 transitions total divided into 14 trajectories
loading from buffer 32 which has 4009 already loaded
this buffer has 7451 loaded transitions and there are now 33182 transitions total divided into 15 trajectories
loading from buffer 46 which has 0 already loaded
this buffer has 2137 loaded transitions and there are now 35319 transitions total divided into 16 trajectories
loading from buffer 32 which has 7451 already loaded
this buffer has 10311 loaded transitions and there are now 38179 transitions total divided into 17 trajectories
loading from buffer 47 which has 1935 already loaded
this buffer has 5165 loaded transitions and there are now 41409 transitions total divided into 18 trajectories
loading from buffer 25 which has 0 already loaded
this buffer has 2124 loaded transitions and there are now 43533 transitions total divided into 19 trajectories
loading from buffer 19 which has 3063 already loaded
this buffer has 5660 loaded transitions and there are now 46130 transitions total divided into 20 trajectories
loading from buffer 14 which has 0 already loaded
this buffer has 1462 loaded transitions and there are now 47592 transitions total divided into 21 trajectories
loading from buffer 36 which has 0 already loaded
this buffer has 1173 loaded transitions and there are now 48765 transitions total divided into 22 trajectories
loading from buffer 32 which has 10311 already loaded
this buffer has 13460 loaded transitions and there are now 51914 transitions total divided into 23 trajectories
loading from buffer 16 which has 0 already loaded
this buffer has 2148 loaded transitions and there are now 54062 transitions total divided into 24 trajectories
loading from buffer 4 which has 0 already loaded
this buffer has 1754 loaded transitions and there are now 55816 transitions total divided into 25 trajectories
loading from buffer 49 which has 2006 already loaded
this buffer has 4612 loaded transitions and there are now 58422 transitions total divided into 26 trajectories
loading from buffer 3 which has 0 already loaded
this buffer has 1200 loaded transitions and there are now 59622 transitions total divided into 27 trajectories
loading from buffer 2 which has 1234 already loaded
this buffer has 2192 loaded transitions and there are now 60580 transitions total divided into 28 trajectories
loading from buffer 20 which has 0 already loaded
this buffer has 1644 loaded transitions and there are now 62224 transitions total divided into 29 trajectories
loading from buffer 39 which has 0 already loaded
this buffer has 1473 loaded transitions and there are now 63697 transitions total divided into 30 trajectories
loading from buffer 2 which has 2192 already loaded
this buffer has 3244 loaded transitions and there are now 64749 transitions total divided into 31 trajectories
loading from buffer 20 which has 1644 already loaded
this buffer has 4785 loaded transitions and there are now 67890 transitions total divided into 32 trajectories
loading from buffer 47 which has 5165 already loaded
this buffer has 7681 loaded transitions and there are now 70406 transitions total divided into 33 trajectories
loading from buffer 48 which has 0 already loaded
this buffer has 2836 loaded transitions and there are now 73242 transitions total divided into 34 trajectories
loading from buffer 7 which has 0 already loaded
this buffer has 2135 loaded transitions and there are now 75377 transitions total divided into 35 trajectories
loading from buffer 41 which has 0 already loaded
this buffer has 933 loaded transitions and there are now 76310 transitions total divided into 36 trajectories
loading from buffer 35 which has 0 already loaded
this buffer has 1973 loaded transitions and there are now 78283 transitions total divided into 37 trajectories
loading from buffer 28 which has 2413 already loaded
this buffer has 4864 loaded transitions and there are now 80734 transitions total divided into 38 trajectories
loading from buffer 38 which has 2326 already loaded
this buffer has 5358 loaded transitions and there are now 83766 transitions total divided into 39 trajectories
loading from buffer 33 which has 1473 already loaded
this buffer has 3457 loaded transitions and there are now 85750 transitions total divided into 40 trajectories
loading from buffer 21 which has 0 already loaded
this buffer has 2198 loaded transitions and there are now 87948 transitions total divided into 41 trajectories
loading from buffer 30 which has 0 already loaded
this buffer has 2916 loaded transitions and there are now 90864 transitions total divided into 42 trajectories
loading from buffer 27 which has 0 already loaded
this buffer has 2128 loaded transitions and there are now 92992 transitions total divided into 43 trajectories
loading from buffer 34 which has 2718 already loaded
this buffer has 4650 loaded transitions and there are now 94924 transitions total divided into 44 trajectories
loading from buffer 33 which has 3457 already loaded
this buffer has 6102 loaded transitions and there are now 97569 transitions total divided into 45 trajectories
loading from buffer 12 which has 0 already loaded
this buffer has 3207 loaded transitions and there are now 100776 transitions total divided into 46 trajectories
loading from buffer 40 which has 0 already loaded
this buffer has 1369 loaded transitions and there are now 102145 transitions total divided into 47 trajectories
loading from buffer 3 which has 1200 already loaded
this buffer has 3316 loaded transitions and there are now 104261 transitions total divided into 48 trajectories
loading from buffer 42 which has 1190 already loaded
this buffer has 2969 loaded transitions and there are now 106040 transitions total divided into 49 trajectories
loading from buffer 5 which has 0 already loaded
this buffer has 2499 loaded transitions and there are now 108539 transitions total divided into 50 trajectories
loading from buffer 0 which has 0 already loaded
killed
(decision-transformer-atari) shy@user:~/decision-transformer$ 

Then it shows killed. I guess this problem is due to excessive memory usage when loading the dataset. How can I fix it? Thanks a lot.

Where are the weights for the trained models?

Hello 👋 ,
I wanted to try the Decision Transformer models, but it seems that the weights for the trained models are not included in the git repo.

Do you plan to publish them in this git repo or somewhere else?

Thanks 😄

Questions about dataset preprocessing

Hi,
I have some questions about the data preprocessing of the medium-replay datasets. In the provided implementation,
https://github.com/kzl/decision-transformer/blob/e2d82e68f330c00f763507b3b01d774740bee53f/gym/data/download_d4rl_datasets.py#L35...L40

whenever final_timestep or done_bool is true, the collected data is added as a trajectory. However, according to D4RL's docs,

Timeouts in this (medium-replay) dataset are not always marked when the agent reaches the max trajectory length, but rather when 1000 timesteps have been sampled for a particular training iteration.

Thus, there exist trajectories that are neither done nor timed out, but rather truncated due to the limit on sampling steps. Such trajectories are typically short, and if we compute returns on them, the return-to-go will deviate from its true value, since no estimated value is given for the last timestep. Will this be an issue for DT?

Please correct me if there is any mis-understanding =)
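For reference, a sketch of how return-to-go is formed per trajectory in the gym code (gamma = 1), which is where the missing tail of a truncated trajectory would bias the values described above:

import numpy as np

# Return-to-go as a reverse cumulative sum of rewards (gamma = 1, as in the gym code).
# If the trajectory was truncated, the rewards after the cut are simply missing, so
# every rtg value in that trajectory is typically an underestimate of the true return.
def discount_cumsum(x, gamma=1.0):
    out = np.zeros_like(x)
    out[-1] = x[-1]
    for t in reversed(range(len(x) - 1)):
        out[t] = x[t] + gamma * out[t + 1]
    return out

print(discount_cumsum(np.array([1.0, 0.0, 2.0, 1.0])))  # [4., 3., 3., 1.]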

Potential bug: Attention mask allows access to future tokens?

Thanks for sharing your code, great work. There is one detail I don't get; maybe you can help:

During batch data generation, you fill the mask with ones wherever valid history trajectory data is available:

mask.append(np.concatenate([np.zeros((1, max_len - tlen)), np.ones((1, tlen))], axis=1))

Isn't it common practice in autoregression to use a triangular matrix? In your training code, you consider all actions where the mask is > 0. Doesn't this allow actions early in the sequence to access subsequent tokens?

action_preds = action_preds.reshape(-1, act_dim)[attention_mask.reshape(-1) > 0]

Thanks!
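For anyone reading along: the causal (lower-triangular) masking lives inside the GPT-2 attention itself, while the attention_mask built in get_batch only marks padded positions. A small illustration of how the two typically combine (my own sketch, not the repository's code):

import torch

seq_len = 5
padding_mask = torch.tensor([[0., 0., 1., 1., 1.]])      # (batch, seq): 1 = real token, 0 = left padding
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # (seq, seq): lower-triangular

# Broadcast to (batch, seq, seq): position i may attend to position j only if
# j <= i (causal) AND j is not a padded slot.
combined = causal_mask.unsqueeze(0) * padding_mask.unsqueeze(1)
print(combined)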

Problem Creating Deterministic Action Selection

I am trying to adapt the DT to model TSP-like problems, which obviously have a discrete action space instead of the continuous one presented. I am working with the gym code. Do you have any suggestions on how to convert it?
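One possible direction (a minimal sketch under my own assumptions, not the repository's code; hidden_size and num_actions are placeholders) is to swap the continuous action head in the gym DecisionTransformer for a categorical head over the discrete actions, train with cross-entropy, and select actions deterministically via argmax:

import torch
import torch.nn as nn

hidden_size, num_actions = 128, 20

# Categorical action head: logits over the discrete action set instead of a
# Linear + Tanh head producing continuous actions.
predict_action = nn.Linear(hidden_size, num_actions)

def action_loss(action_logits, action_targets):
    # action_logits: (batch, seq, num_actions); action_targets: (batch, seq) int64 indices
    return nn.functional.cross_entropy(
        action_logits.reshape(-1, num_actions), action_targets.reshape(-1)
    )

def select_action(action_logits):
    # Deterministic action selection: take the argmax over the logits.
    return torch.argmax(action_logits, dim=-1)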

Just a little correction. A parameter changed in script.

Thanks for your exciting work!!!

In the example usage for Atari, the command works when using --context_length 30 instead of --block_size 90. That is, instead of

python run_dt_atari.py --seed 123 --block_size 90 --epochs 5 --model_type 'reward_conditioned' --num_steps 500000 --num_buffers 50 --game 'Breakout' --batch_size 128 --data_dir_prefix [DIRECTORY_NAME]

use

python run_dt_atari.py --seed 123 --context_length 30 --epochs 5 --model_type 'reward_conditioned' --num_steps 500000 --num_buffers 50 --game 'Breakout' --batch_size 128 --data_dir_prefix [DIRECTORY_NAME]

Possible misalignment in calculating rtg in Atari

Dear authors,
Thanks for your code! I found a possible error in building rtg in Atari.
I think Line 86 should be curr_traj_returns = stepwise_returns[start_index:i] and Line 88 should be rtg_j = curr_traj_returns[j-start_index:i-start_index].
I'm not 100% sure about this.

Best,
Tao

No registered env with id: halfcheetah-medium-v2

python download_d4rl_datasets.py
->
Traceback (most recent call last):
File "download_d4rl_datasets.py", line 12, in <module>
env = gym.make(name)
File "/public/home/chenxn1/anaconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/gym/envs/registration.py", line 145, in make
return registry.make(id, **kwargs)
File "/public/home/chenxn1/anaconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/gym/envs/registration.py", line 89, in make
spec = self.spec(path)
File "/public/home/chenxn1/anaconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/gym/envs/registration.py", line 131, in spec
raise error.UnregisteredEnv('No registered env with id: {}'.format(id))
gym.error.UnregisteredEnv: No registered env with id: halfcheetah-medium-v2
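A common cause of this error is that d4rl is not installed (or not imported) in the environment: offline environment IDs like halfcheetah-medium-v2 are only registered with gym once d4rl has been imported. A minimal check, assuming d4rl is installed:

import gym
import d4rl  # importing d4rl registers the offline env IDs (e.g. halfcheetah-medium-v2) with gym

env = gym.make('halfcheetah-medium-v2')
dataset = env.get_dataset()  # d4rl offline envs expose their dataset this way
print(dataset['observations'].shape)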

Misalignment in supervision when predicting the successor state

Hi, amazing work!
I think there may be a misalignment in the supervision when predicting the state in https://github.com/kzl/decision-transformer/blob/master/gym/decision_transformer/training/trainer.py#L69. I think it should be

loss = self.loss_fn(
    state_preds, action_preds, reward_preds,
    state_target[:, 1:], action_target, reward_target[:, 1:],
)

since DT predicts the next state given the current state and action in https://github.com/kzl/decision-transformer/blob/master/gym/decision_transformer/models/decision_transformer.py#L96.

Thanks.
