yuqinie98 / patchtst

An official implementation of PatchTST: "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers" (ICLR 2023). https://arxiv.org/abs/2211.14730

License: Apache License 2.0

Python 88.94% Shell 11.06%

patchtst's People

Contributors

g0bel1n, koseoyoung, namctin, xkszltl, yuqinie98


patchtst's Issues

Question regarding decoder inputs

Hey guys, I really enjoyed reading the paper, and thanks for publishing the source code. I am working on a multivariate problem with 94 features and one output target, and I have a question about model inference/prediction.

In the predict method, we pass a decoder input that is first initialized with zeros matching batch_y's batch dimension, which makes sense, but it is then concatenated with batch_y.

    # decoder input
    dec_inp = torch.zeros([batch_y.shape[0], self.args.pred_len, batch_y.shape[-1]]).float()
    dec_inp = torch.cat([batch_y[:,:self.args.label_len,:], dec_inp], dim=1).float().to(self.device)

My question is: why are we concatenating batch_y values, given that at inference time in a real deployment we will not have batch_y?

Positional encoding in the code

    u = self.dropout(u + self.W_pos)
    self.W_pos = positional_encoding(pe, learn_pe, q_len, d_model)

In the code above, the positional_encoding function used to build self.W_pos does not appear to be defined anywhere.
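For reference, a minimal sketch of what such a helper could look like, assuming a standard sinusoidal table wrapped in an optionally learnable parameter. The names only mirror the call site; this is not the repo's actual implementation.

    import math
    import torch
    import torch.nn as nn

    def positional_encoding(pe, learn_pe, q_len, d_model):
        # Hypothetical sketch: 'sincos' builds the standard sinusoidal table,
        # anything else falls back to a zero-initialized table.
        if pe == 'sincos':
            pos = torch.arange(q_len, dtype=torch.float32).unsqueeze(1)
            div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                            * (-math.log(10000.0) / d_model))
            W_pos = torch.zeros(q_len, d_model)
            W_pos[:, 0::2] = torch.sin(pos * div)
            W_pos[:, 1::2] = torch.cos(pos * div)
        else:
            W_pos = torch.zeros(q_len, d_model)
        # learn_pe controls whether the table is trained with the model.
        return nn.Parameter(W_pos, requires_grad=learn_pe)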

More or Less features?

Hi Guys, I'm doing a school project and would appreciate some advice. I am doing multivariate forecasting for stocks. I want to predict stock "x" with the help of other stocks "y", "z" etc. Adding more features to the model can improve or hurt the model depending on the quality of the new data. Is there a way to determine which combination of features will deliver the best prediction ability?
Maybe a way to penalize features that worsen the model's accuracy and reward features that improve the accuracy?

Your help is appreciated!

Does part of the performance gain come from residual attention?

I noticed that the Transformer implementation here uses residual attention, which does not appear in some of the other baselines mentioned in the paper. Have you performed additional ablation studies to see the effect of residual attention on forecasting?
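For context, residual attention (as in RealFormer) carries the pre-softmax attention scores from one layer into the next and adds them to the current layer's scores. A minimal sketch of the mechanism, not the repo's exact module:

    import torch
    import torch.nn.functional as F

    def residual_attention(q, k, v, prev_scores=None):
        # q, k, v: [batch, heads, seq_len, d_k]
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        if prev_scores is not None:
            scores = scores + prev_scores      # residual link on the scores themselves
        attn = F.softmax(scores, dim=-1)
        # return the scores so the next layer can add them to its own
        return attn @ v, scores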

Location of datasets

Congratulations on the project, great work. I'm trying to do some test runs of the code. I've downloaded the datasets but I'm not sure where to place them. I keep getting the error: FileNotFoundError: [Errno 2] No such file or directory: '/data/datasets/public/ETDataset/ETT-small/ETTh1.csv'. Thanks in advance.

LogTrans Implementation

Thank you for the great work and the well-organized codebase!
Regarding LogTrans, it seems that the authors didn't release their official code. May I ask which implementation you used, and are you planning to release it?

Retraining of the model on new dataset

How do I retrain the model on a new dataset? For example, if I have trained the model on one stock's data (Apple) and I want to do incremental learning on another stock's data (Microsoft), how should I do it?
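For reference, a hedged sketch of the generic PyTorch pattern for continuing training from a saved checkpoint on new data. This is only illustrative; the repo's scripts do not expose this as a single flag as far as I can tell, and the Linear model below is just a stand-in.

    import torch
    import torch.nn as nn

    # Stand-in for a PatchTST model trained on the first stock and saved earlier.
    model = nn.Linear(96, 24)
    torch.save(model.state_dict(), 'checkpoint_apple.pth')

    # Incremental learning: reload the weights, then keep training with a
    # (usually smaller) learning rate on batches from the new stock's data.
    model.load_state_dict(torch.load('checkpoint_apple.pth'))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    for x, y in [(torch.randn(32, 96), torch.randn(32, 24))]:   # new-data batches
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()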

About the self-supervised code

Thanks for your contribution.
After I run the command:
python patchtst_pretrain.py --dset ettm1 --mask_ratio 0.4
two files are created under PatchTST_self_supervised/saved_models/ettm1/: patchtst_pretrained_cw512_patch12_stride12_epochs-pretrain10_mask0.4_model1.pth (and loss.csv).
But when I then run the command:
python patchtst_finetune.py --dset ettm1 --pretrained_model <model_name>
I get an error like this:
FileNotFoundError: [Errno 2] No such file or directory: '/saved_models/ettm1/masked_patchtst/based_model/patchtst_pretrained_cw512_patch12_stride12_epochs-pretrain10_mask0.4_model1.pth'
What should <model_name> be? How can I run patchtst_finetune.py?
Thank you very much!

Input normalization twice - scaler and revin

While loading the data there is a z-score normalization z = (x - mean) / std of the input data:

    self.scaler = StandardScaler()
    self.scaler.fit(train_data.values)

Before the forward pass there is another z-score normalization of the input as part of the RevIN layer:

    def _normalize(self, x):
        x = x - self.mean
        x = x / self.stdev

Questions:

  1. Can you help me understand the difference between the two normalizations and why both are required by default?
  2. Why does RevInCB get denorm=True in fine-tuning but denorm=False in pretraining?
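For reference, a hedged sketch contrasting the two steps, assuming the usual setup: the StandardScaler uses statistics of the whole training split and is applied once at load time, while RevIN uses statistics of each individual input window, computed inside the forward pass and undone on the output. Names are illustrative, not the repo's exact code.

    import numpy as np
    import torch
    from sklearn.preprocessing import StandardScaler

    train = np.random.randn(1000, 7)                 # [num_train_steps, num_vars]

    # 1) Dataset-level z-score: one mean/std per variable over the training split.
    scaler = StandardScaler().fit(train)
    data = scaler.transform(train)

    # 2) RevIN-style instance normalization: one mean/std per window and variable.
    def revin_normalize(x, eps=1e-5):
        # x: [batch, seq_len, num_vars]
        mean = x.mean(dim=1, keepdim=True)
        stdev = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + eps)
        return (x - mean) / stdev, mean, stdev

    x = torch.tensor(data[:512], dtype=torch.float32).unsqueeze(0)   # [1, 512, 7]
    x_norm, mean, stdev = revin_normalize(x)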

run patchtst_finetune.py error

    args: Namespace(is_finetune=0, is_linear_probe=0, dset_finetune='exchange', context_points=512, target_points=96, batch_size=64, num_workers=0, scaler='standard', features='M', patch_len=12, stride=12, revin=1, n_layers=3, n_heads=16, d_model=128, d_ff=256, dropout=0.2, head_dropout=0.2, n_epochs_finetune=20, lr=0.0001, pretrained_model='./p_model.pth', finetuned_model_id=1, model_type='based_model')
    weight_path= saved_models/exchange/masked_patchtst/based_model/exchange_patchtst_finetuned_cw512_tw96_patch12_stride12_epochs-finetune20_model1
    number of patches: 42
    number of model params 920672
    Traceback (most recent call last):
      File "D:\pyyj\PatchTST-main\PatchTST_self_supervised\patchtst_finetune.py", line 235, in <module>
        out = test_func(weight_path)
      File "D:\pyyj\PatchTST-main\PatchTST_self_supervised\patchtst_finetune.py", line 200, in test_func
        out = learn.test(dls.test, weight_path=weight_path+'.pth', scores=[mse,mae])  # out: a list of [pred, targ, score]
      File "D:\pyyj\PatchTST-main\PatchTST_self_supervised\src\learner.py", line 258, in test
        if weight_path is not None: self.load(weight_path)
      File "D:\pyyj\PatchTST-main\PatchTST_self_supervised\src\learner.py", line 387, in load
        load_model(fname, self.model, self.opt, with_opt, device=device, strict=strict)
      File "D:\pyyj\PatchTST-main\PatchTST_self_supervised\src\learner.py", line 429, in load_model
        state = torch.load(path, map_location=device)
      File "C:\anaconda3\lib\site-packages\torch\serialization.py", line 771, in load
        with _open_file_like(f, 'rb') as opened_file:
      File "C:\anaconda3\lib\site-packages\torch\serialization.py", line 270, in _open_file_like
        return _open_file(name_or_buffer, mode)
      File "C:\anaconda3\lib\site-packages\torch\serialization.py", line 251, in __init__
        super(_open_file, self).__init__(open(name, mode))
    FileNotFoundError: [Errno 2] No such file or directory: 'saved_models/exchange/masked_patchtst/based_model/exchange_patchtst_finetuned_cw512_tw96_patch12_stride12_epochs-finetune20_model1.pth'

Linear Head

Hello, the paper says "Finally a flatten layer with linear head is used to obtain the prediction result". What does "linear head" mean here? Thank you.
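Roughly, the sentence describes flattening the per-patch encoder outputs of each channel into one vector and mapping it to the prediction horizon with a single nn.Linear. A hedged sketch of the idea, with illustrative shapes, not the exact Flatten_Head code:

    import torch
    import torch.nn as nn

    class SimpleFlattenHead(nn.Module):
        # Illustrative sketch: flatten [d_model x patch_num] per channel,
        # then one linear projection ("linear head") to the forecast horizon.
        def __init__(self, d_model, patch_num, target_window):
            super().__init__()
            self.flatten = nn.Flatten(start_dim=-2)
            self.linear = nn.Linear(d_model * patch_num, target_window)

        def forward(self, z):
            # z: [bs x nvars x d_model x patch_num]
            z = self.flatten(z)        # [bs x nvars x d_model*patch_num]
            return self.linear(z)      # [bs x nvars x target_window]

    head = SimpleFlattenHead(d_model=128, patch_num=42, target_window=96)
    out = head(torch.randn(8, 7, 128, 42))   # -> [8, 7, 96]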

Problem with the 'individual' hyperparameter

Dear author,
I see that in your code the hyperparameter individual defaults to 0, but the paper states that channel independence boosts performance. When I set individual to 1, the results actually became worse. What could be the cause of this?

ValueError: __len__() should return >= 0

File "/public/yxz/TimeSeriesForecast/V2+TS-library/PatchTST/PatchTST_supervised/exp/exp_main.py", line 102, in train
vali_data, vali_loader = self._get_data(flag='val')
File "/public/yxz/TimeSeriesForecast/V2+TS-library/PatchTST/PatchTST_supervised/exp/exp_main.py", line 43, in _get_data
data_set, data_loader = data_provider(self.args, flag)
File "/public/yxz/TimeSeriesForecast/V2+TS-library/PatchTST/PatchTST_supervised/data_provider/data_factory.py", line 44, in data_provider
print(flag, len(data_set))
ValueError: len() should return >= 0

Why is the test set empty??? Where is the bug?
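For context, a hedged sketch of the usual length computation in Autoformer-style datasets, which is typically where a negative __len__ comes from: if the split is shorter than seq_len + pred_len, no sliding window fits and the length goes negative.

    # Hedged sketch of the typical __len__ in this family of Dataset classes.
    def dataset_len(split_len, seq_len, pred_len):
        return split_len - seq_len - pred_len + 1

    # A validation/test split of 500 steps cannot hold a single window when
    # seq_len=512, so __len__ is negative and the DataLoader raises this error.
    print(dataset_len(split_len=500, seq_len=512, pred_len=96))   # -107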

Memory required to pretrain on Electricity and Traffic

Hi, could you please share the memory required to pretrain on the Electricity and Traffic datasets? It seems neither dataset fits on a single 32GB V100 due to the number of variates they have. Did you apply distributed training, or were you able to pretrain on a single A40? Thanks!

Question about Pre-Norm

Hello! This is a very interesting project, thank you for writing the paper and open sourcing the code. Quick question, did you try pre-norm transformer blocks at all, and if so what were the results vs. the post-norm described in the paper?
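For context, a minimal sketch of the difference being asked about, illustrative only and not the repo's block:

    import torch
    import torch.nn as nn

    d_model = 128
    norm = nn.LayerNorm(d_model)
    sublayer = nn.Linear(d_model, d_model)   # stand-in for attention / FFN
    x = torch.randn(8, 42, d_model)

    post_norm = norm(x + sublayer(x))        # post-norm: residual add, then normalize
    pre_norm = x + sublayer(norm(x))         # pre-norm: normalize, sublayer, then add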

Multi-GPU

Is there an option to run the training on multiple GPUs (single node)?
I would like to make the training faster via an effectively larger batch size.
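For reference, a hedged sketch of the plain single-node option in PyTorch, wrapping the model in nn.DataParallel so each batch is split across the visible GPUs. This is only illustrative; I have not checked which multi-GPU flags the repo's run scripts expose, and the Linear model is a stand-in.

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 96)                  # stand-in for the PatchTST model
    if torch.cuda.device_count() > 1:
        # Splits every batch across the GPUs of a single node, which
        # effectively multiplies the per-step batch size.
        model = nn.DataParallel(model)
    model = model.to('cuda' if torch.cuda.is_available() else 'cpu')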

Model loading error during learning-rate search

Hello, I currently have a task of running self-supervised pretraining and fine-tuning on different datasets to search for hyperparameters.
When running multiple scripts on multiple GPUs on one server, searching for hyperparameters on different datasets, the following error appears:
[screenshot of the error]
My initial guess is that, because multiple programs search for the learning rate at the same time, they all save the intermediate checkpoint under the same name ./temp/current.pth, so the model that gets loaded back is not the one saved by my own pretraining run. I added the current GPU's ID to the "current" filename (each GPU runs only one script) so that the intermediate files saved by different runs are distinguished, but the error above still appears. How can this be solved? The screenshots below show the modification and the run results after the change:
[screenshots of the modification and the resulting output]

a question about head type of self-supervised PatchTST

Hi, I got a question about the head of self-supervised PatchTST.
As I understand it, self-supervised PatchTST uses a D × P linear layer (self.create_pretrain_head or PretrainHead) for pretraining, then removes this head and attaches a PredictionHead (Flatten_Head) for end-to-end fine-tuning or linear probing. Am I right?
By the way, there are two versions of PatchTST in PatchTST_self_supervised and PatchTST_supervised; which one is the latest version?
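For reference, a hedged sketch of the pretraining head shape described above: a single D x P linear applied patch-wise to reconstruct each masked patch. Illustrative only, not the repo's PretrainHead code.

    import torch
    import torch.nn as nn

    d_model, patch_len = 128, 12
    pretrain_head = nn.Linear(d_model, patch_len)   # D x P, shared across patches

    z = torch.randn(8, 7, 42, d_model)       # [bs, nvars, patch_num, d_model]
    recon = pretrain_head(z)                  # [8, 7, 42, patch_len]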

spatial sequence prediction

In my dataset, each sample lies along a piece of 1D space, which we have divided evenly into n segments. In each of these n segments, 4 known features are given, and I want to predict the other 2 target features over the original 1D space. Will this time-series prediction model help with spatial sequence prediction?

RevInCB and PatchMaskCB

In the current implementation the forward pass first applies normalization and then applies masking:

    cbs = [RevInCB(dls.vars)] if args.revin else []
    cbs += [PatchCB(patch_len=args.patch_len, stride=args.stride)]

Therefore the RevInCB mean and std are calculated on the unmasked input.
I think the RevInCB normalization can leak information about the masked patches and help the algorithm recover patterns that would otherwise stay hidden, if they differ significantly from the unmasked regions.
Is this the intended behavior?

About the application on the video dataset

Hi, thank you so much for such great work.
I noticed that the datasets you used are in .csv format. I would like to know whether this model's self-supervised task is effective for reconstruction or prediction on video datasets?

Channel-Independence

Hello,

Is there a way to allow for channel-mixing under the current implementation?

Thank you.

After the PatchTST encoder, why permute the last two dims?

Hello, you reshape u to [bs * nvars x patch_num x d_model] before the encoder:

    u = torch.reshape(x, (x.shape[0]*x.shape[1], x.shape[2], x.shape[3]))   # u: [bs * nvars x patch_num x d_model]

Why then permute z into [bs x nvars x d_model x patch_num]?

    z = self.encoder(u)                                            # z: [bs * nvars x patch_num x d_model]
    z = torch.reshape(z, (-1, n_vars, z.shape[-2], z.shape[-1]))   # z: [bs x nvars x patch_num x d_model]
    z = z.permute(0, 1, 3, 2)                                      # z: [bs x nvars x d_model x patch_num]

In the next step, z [bs x nvars x d_model x patch_num] is fed into the head module, where it passes through a flatten layer. Can I flatten z as (-1, patch_num, d_model) instead of (-1, d_model, patch_num)?

    z = self.backbone(z)   # z: [bs x nvars x d_model x patch_num]
    z = self.head(z)       # z: [bs x nvars x target_window]

    elif head_type == 'flatten':
        self.head = Flatten_Head(self.individual, self.n_vars, self.head_nf, target_window, head_dropout=head_dropout)

    x = self.flatten(x)
    x = self.linear(x)
    x = self.dropout(x)
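For reference, a small sketch showing that both flattening orders yield the same d_model * patch_num features per channel, so the linear head accepts either; only the ordering of the features (and hence of the learned weights) differs. This only illustrates the shapes and is not a statement about which ordering the authors intended.

    import torch
    import torch.nn as nn

    bs, nvars, patch_num, d_model = 8, 7, 42, 128
    z = torch.randn(bs, nvars, patch_num, d_model)

    flat_a = nn.Flatten(start_dim=-2)(z.permute(0, 1, 3, 2))   # flatten [d_model x patch_num]
    flat_b = nn.Flatten(start_dim=-2)(z)                        # flatten [patch_num x d_model]

    # Both are [bs, nvars, d_model * patch_num]; a head like
    # nn.Linear(d_model * patch_num, target_window) works with either ordering,
    # it just learns a correspondingly permuted weight matrix.
    print(flat_a.shape, flat_b.shape)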

Why use drop_last=True in test (and val) dataloader?

Hi,
First, thanks for the excellent paper and for sharing this repo. Great work!

I want to ask why you set drop_last=True for the test dataloader? By doing that, performance is not reported for all samples in the test dataset (some samples are dropped, which is not what we want). In addition, changes in the batch size lead to reporting performance on a different number of samples.
I've tested the difference by setting drop_last=False on the ILI dataset and the result is worse than the published one, although it is still the best published result I've seen so far.

I saw the same issue in the Autoformer repo and logged an issue (thuml/Autoformer#104). As a result they've now updated the code.

BTW, this is likely to also occur with other papers that seem to use a similar code base to Autoformer's.

About the range of the results

Dear authors, thank you very much for the excellent model you have proposed. When applying PatchTST to my own dataset, I found that the range of the model's output values differs greatly from the value range of the data in the dataset. After reading the code, I found that in the Dataset the scale option defaults to True. My question: how can I scale the output produced from the scaled data back to the original value range? Looking forward to your answer.
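For reference, the StandardScaler fitted inside the Dataset keeps the training-split mean and std, so predictions can be mapped back to the original range with its inverse transform. A hedged sketch assuming access to the fitted scaler (the Dataset classes in this code family generally expose an inverse_transform for this purpose):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    train = np.random.randn(1000, 7) * 10 + 50       # data in its original range
    scaler = StandardScaler().fit(train)

    preds_scaled = np.random.randn(96, 7)            # model output in z-score space
    preds = scaler.inverse_transform(preds_scaled)   # back to the original value range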

About the results

Dear authors, thank you for your great work. When applying PatchTST to my own dataset (similar to the Weather dataset, spanning 2015-2021), I obtained the following results. Do you have any guidance on hyperparameter tuning?
[screenshot of results]

About Attention map visualization

Hello, may I ask how the attention map in Figure 6 of the paper is visualized?

  • Do you use the attention map from the training stage or the test stage for visualization? In either case, each training or testing epoch would generate an attention map, but the attention maps returned during training or testing do not seem to be saved.
  • Or is it visualized using the attention map from the model saved in the trained checkpoint .pth file?

Changing PatchTST's multi-head attention to ProbAttention

I noticed that the paper uses the Transformer encoder and decoder.
If I want to combine PatchTST with Informer and change the multi-head attention to ProbAttention,
which part of the code should I modify?

Data Loader use_time_features flag

Thank you for the excellent code base; it is very flexible and clear.
Comparing the data loaders between the supervised and self-supervised code, it seems:

  1. Self supervised has use_time_features=False by default https://github.com/yuqinie98/PatchTST/blob/main/PatchTST_self_supervised/src/data/pred_dataset.py#L97
  2. Supervised effectively always has use_time_features=True: https://github.com/yuqinie98/PatchTST/blob/main/PatchTST_supervised/data_provider/data_loader.py#L390

Questions:

  1. How important are the time features for model performance?
  2. Why does the behaviour differ between the self-supervised and supervised code?

exchange rate dataset

Thanks for your work! I noticed that there is no experimental result for the exchange rate dataset in the paper. Did you run experiments on the exchange_rate dataset? If so, could you provide the parameter settings that gave the best results on it?

About the fairness of comparisons with other transformer variants.

First of all, thank you for your contribution to this work. I have a question about the fairness of comparing PatchTST42 with other transformer variants, which seem to have a lookback window of 96 while PatchTST42 has a lookback window of 336 as mentioned in the article. Can you explain why this comparison is fair?

Torch.onnx.export error

from tqdm import tqdm
import torch
import pandas as pd
import numpy as np
import math
from torch import nn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

data = np.genfromtxt('c:/data.csv', delimiter=",")
data=scaler.fit_transform(data.reshape(-1,1)).flatten()
# In[ ]:
iw = 96
ow = 15
train=data

from torch.utils.data import DataLoader, Dataset

class windowDataset(Dataset):
    def __init__(self, y, input_window=80, output_window=20, stride=3):
        L = y.shape[0]
        num_samples = (L - input_window - output_window) // stride + 1
        X = np.zeros([input_window, num_samples])
        Y = np.zeros([output_window, num_samples])
        print(X.shape,y.shape)
        for i in np.arange(num_samples):
            start_x = stride*i
            end_x = start_x + input_window
            X[:,i] = y[start_x:end_x]

            start_y = stride*i + input_window
            end_y = start_y + output_window
            Y[:,i] = y[start_y:end_y]

        # size: [num_samples, input_window, 1]
        X = X.reshape(X.shape[1], X.shape[0], 1) 
        Y = Y.reshape(Y.shape[1], Y.shape[0], 1)
        self.x = X
        self.y = Y
        self.len = len(X)

    def __getitem__(self, i):
        return self.x[i], self.y[i]
    def __len__(self):
        return self.len


# In[ ]:
train_dataset = windowDataset(train, input_window=iw, output_window=ow, stride=2)
train_loader = DataLoader(train_dataset, batch_size=512)
# In[ ]:
from PatchTST import PatchTST
class PT_config():
    def __init__(self):
        self.seq_len=iw
        self.pred_len=ow
        self.individual=0
        self.enc_in=1
        self.e_layers=3
        self.n_heads= 16 
        self.d_model= 128 
        self.d_ff= 256 
        self.dropout =0.2
        self.fc_dropout= 0.2
        self.head_dropout= 0
        self.patch_len =16
        self.stride =8
        self.padding_patch='end'
        self.revin=1
        self.affine=0
        self.subtract_last=0
        self.decomposition=1
        self.kernel_size=25

# In[ ]:
# # 3. Train
#device = torch.device("cuda")
device = torch.device("cpu")
lr = 1e-4

class modelParam():
    def __init__(self,label):
        self.model=PatchTST(configs=PT_config()).to(device)
        self.optimizer=torch.optim.Adam(self.model.parameters(), lr=lr)

    def epoch(self,epoch):
        self.epoch=epoch
        return self.epoch

# In[ ]:
criterion = nn.MSELoss() #0.58
# In[ ]:
PT=modelParam('PT')
PTmodel=PT.model.to(device)
optimizer=PT.optimizer
epoch=PT.epoch(2)  # number of training epochs
progress = tqdm(range(epoch))
PTmodel.train()
losses=[]
for i in progress:
    batchloss = 0.0
    for (inputs, outputs) in train_loader:
        optimizer.zero_grad()
        #'''
        result = PTmodel(inputs.float().to(device))
        loss = criterion(result, outputs.float().to(device))
        loss.backward()
        optimizer.step()
        batchloss += loss
    losses.append(batchloss.cpu().item())
    progress.set_description("PT- loss: {:0.6f}".format(batchloss.cpu().item() / len(train_loader)))

torch.save(PTmodel.to('cpu'), 'PTmodel.pth')

# In[ ]:
input_shape = (1, 96, 1)  
input_names = ["input"]  
output_names = ["output"]  


x = torch.randn(input_shape)

 
onnx_model_path = "PTmodel.onnx"  
dynamic_axes = {'input': {0: 'batch_size', 1: 'sequence_length', 2: 'input_dim'},   # dynamic axes of the input tensor
                'output': {0: 'batch_size', 1: 'sequence_length', 2: 'output_dim'}}  # dynamic axes of the output tensor
torch.onnx.export(PTmodel, x, onnx_model_path, input_names=input_names, output_names=output_names, dynamic_axes=dynamic_axes)


data.csv

How to patchify non-uniform data?

Thank you for the great work. I have a small question. Many real-world time series are non-uniformly sampled, but patching seems to require uniform sampling. How should this be handled?

Self Supervised vs Supervised

The paper is very dense and super informative, but I want to make sure I understand it and use the code correctly.
Suppose I have the following multivariate forecasting task (multivariate input predicting a univariate target):

  1. Multiple past metrics (~100+ metrics over ~96 past time stamps) including the past of the target as input feature
  2. Single output forecasting target (1 metric over future ~24 time stamps)

If I read Table 4 correctly, the self-supervised embedding should probably be better than the supervised embedding.
Question:

  1. Is patchtst_pretrain with features='MS' the correct starting point to train your model on my dataset?
  2. Are there any other code parameters you would suggest I consider setting for my first experiments?

Long run time?

I want to congratulate you for the great patch transformer paper.

I want to ask a question:
I have a dataset which i hold as a pandas dataframe.

Given some window size, I want to predict the next time step:

Given X1,...,Xt
Predict Xt+1

This means I want to predict only a single step into the future.

As I understand it, if I want to use your model for this task I will need as many forward passes as there are time steps in the dataset, since you are not using a causal mask in the transformer.

How can this be resolved?

Thanks

multivariate?

Hello! In the paper you state that you have a multivariate method; however, as far as I understand, each variate (or channel) is processed independently and the emission is also an independent point forecast.

Can you kindly clarify which part is multivariate? As far as I understand, the only multivariate aspect is that the input data is a vector of size M at each time point. However, I see this as a negative, since after making your patches you end up with M * (number of patches) vectors, so the compute and memory of the vanilla transformer encoder would be quadratic in M. If you had univariate inputs then at least you would not have the O(M^2) issue...

Thank you for any insight!

Runtime Error

Can anyone assist? I tried running:
sh ./scripts/PatchTST/ettm1.sh
and got the following error:

 RuntimeError: 
          An attempt has been made to start a new process before the
          current process has finished its bootstrapping phase.
  
          This probably means that you are not using fork to start your
          child processes and you have forgotten to use the proper idiom
          in the main module:
  
              if __name__ == '__main__':
                  freeze_support()
                  ...
  
          The "freeze_support()" line can be omitted if the program
          is not going to be frozen to produce an executable.
  [7]  + 18410 suspended  sh ./scripts/PatchTST/ettm1.sh

I'm using an M1 Mac.

Adding Entity ID

If I understand correctly, 'ettm1' and 'ettm2' are the same multivariate task recorded at two different stations.

  1. Have you considered training those two stations jointly?
    They probably share similar patterns, so training them on the same past dates should allow better forecasting.
  2. It would require adding a binary variable (e.g. station_id) to the pre-training so the embedding becomes station-aware. I am not sure how to add it to the data loader, which expects only a date x metric input (float). Do you have an idea where I should introduce such a change in the code?

Thank you

Transformer Encoder

Hi, it seems that only the encoder part of the Transformer is used in the model. However, both Autoformer and FEDformer use an encoder + decoder structure. Is it better to use only the encoder rather than the full encoder + decoder structure for time series forecasting? Could you provide some literature or experimental support?

Comparisons might not be consistent across batch sizes

The prediction and actual sizes here are not the same across batch sizes. This means that the metrics calculated are not exactly comparable across different models trained with different batch sizes.

Therefore, if my understanding is correct, all models might need to be re-evaluated.

I think the culprit is here: https://github.com/cure-lab/LTSF-Linear/blob/main/data_provider/data_factory.py#L20

The drop_last should be False during the test evaluations.
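For reference, a small sketch of the effect with a dummy test set of 1000 samples and batch size 128 (illustrative only):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    test_dataset = TensorDataset(torch.randn(1000, 96, 7))   # stand-in test set

    # drop_last=True silently discards the final partial batch: only 896 of the
    # 1000 samples are scored, and the number changes with the batch size.
    n_dropped = sum(b[0].shape[0] for b in DataLoader(test_dataset, batch_size=128, drop_last=True))
    # drop_last=False evaluates all 1000 samples regardless of batch size.
    n_all = sum(b[0].shape[0] for b in DataLoader(test_dataset, batch_size=128, drop_last=False))
    print(n_dropped, n_all)   # 896 1000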
