Thanks for your contribution. I found some interesting results. Experiment setting

Inconsistent experimental results about pytorch-cutpaste (closed)

runinho commented on August 18, 2024

Inconsistent experimental results

from pytorch-cutpaste.

Comments (23)

Runinho commented on August 18, 2024

An AUC of 3% or 7% is very, very bad. A random guess gives 50% AUC, so this indicates that something is not working.

How are you seeding? I use the random.uniform function from the random package as well as some torch functions for sampling random locations.
You might also want to disable some CUDA features that make the training nondeterministic. See the PyTorch docs.

Youskrpig commented on August 18, 2024

Sorry, what I meant to express is that the ROC AUC fluctuates up and down by between 3% and 7%. I have now run a lot of experiments with ResNet18 + MLP head (trained from scratch), and the results are bad.
For example, for Capsule in the MVTec dataset: training epochs: 256, number of training samples: 219, batch size: 64, so one pass over the dataset needs 4 steps. The paper says: "Note that, unlike conventional definition for an epoch, we define 256 parameter update steps as one epoch." So that is 65536 steps. The other parameters I set are in line with the paper (including learning rate, weight_decay, momentum).
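
The step counts quoted above can be checked with a little arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes the last partial batch is kept (no drop_last).

```python
import math


def steps_per_pass(num_samples: int, batch_size: int) -> int:
    """Parameter updates needed to see every sample once (last partial batch kept)."""
    return math.ceil(num_samples / batch_size)


def total_updates(paper_epochs: int, updates_per_paper_epoch: int = 256) -> int:
    """Total updates when one 'epoch' is a fixed number of updates, as in the quoted sentence."""
    return paper_epochs * updates_per_paper_epoch


passes = steps_per_pass(219, 64)   # steps for one pass over the Capsule training set
updates = total_updates(256)       # 256 'epochs' of 256 updates each
print(passes, updates, updates / passes)
```

Under these assumptions, 65536 updates correspond to roughly 16384 conventional passes over the 219 training images, which gives a sense of how long that schedule is for such a small dataset.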
The training loss curve:

The acc curve:

The lr curve:

The epoch curve:

I'm not sure whether 65536 steps is too many, but judging by the loss curve, it's kind of weird.
In the end the ROC AUC is 81.05 (paper: 87.9 ± 0.7). I also tried evaluating pretrained EfficientNet-B4 and B5 without training (paper, Table 3).

Seed setting:

    import random

    import numpy as np
    import torch

    def setup_seed(seed):
        np.random.seed(seed)
        random.seed(seed)
        torch.manual_seed(seed)  # CPU
        torch.cuda.manual_seed_all(seed)  # all GPUs
        torch.backends.cudnn.deterministic = True  # use deterministic cuDNN algorithms
        torch.backends.cudnn.benchmark = False  # disable the cuDNN autotuner
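
Since the patch locations are sampled with random.uniform, seeding the random module is what makes them repeatable. The following stdlib-only sketch illustrates this. The helper sample_patch_origin and the image and patch sizes are hypothetical and are not taken from the repository.

```python
import random


def sample_patch_origin(img_w, img_h, patch_w, patch_h):
    """Hypothetical helper: pick the top-left corner of a patch that fits inside the image."""
    x = int(random.uniform(0, img_w - patch_w))
    y = int(random.uniform(0, img_h - patch_h))
    return x, y


def sample_run(seed, n=5):
    """Re-seed the global random module, then draw n patch origins."""
    random.seed(seed)
    return [sample_patch_origin(256, 256, 32, 48) for _ in range(n)]


print(sample_run(0) == sample_run(0))  # same seed: same sequence of locations
print(sample_run(0) == sample_run(1))  # different seed: a different sequence
```

This only covers the Python-side sampling. Locations drawn with torch functions depend on the torch seed instead, which is why the function above seeds all three generators.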

Runinho commented on August 18, 2024

I just pushed an update where I set drop_last=True in the dataloader. I find that this makes the training more stable.

I'm very confused by the number of steps Li et al. take to train the model. I find that around 3000 updates are enough for the model to converge.

The results look decent. How do you generate the nice table? I have some hacky code in eval.py that outputs a CSV into the eval folder. You might want to use that ;)

Runinho commented on August 18, 2024

Regarding the batch size: I think a batch size of 32 in my implementation is equivalent to their batch size of 64, because I think they count the number of images they feed into the model. In my implementation I take batch_size images from the dataset, but because of CutPaste I feed 2*batch_size images into the model.

In the paper, Li et al. say "[...] batch size of 64 (or 96 for 3-way)".
Because the 3-way variant feeds 3 times the number of images into the model, I suspect they load 32 images from the dataset. To make the parameter independent of the variant, I decided to define the batch size as the number of images that we load from the dataset.
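
The reading proposed above can be written down as a small calculation. This is a sketch of that interpretation only (the paper does not confirm it), and the function name is made up for illustration.

```python
def images_fed_to_model(dataset_batch_size: int, n_way: int) -> int:
    """Images per update when each loaded image yields n_way views
    (the original plus n_way - 1 CutPaste variants)."""
    return dataset_batch_size * n_way


for name, n_way in [("binary", 2), ("3-way", 3)]:
    print(name, images_fed_to_model(32, n_way))
```

With 32 images loaded per step, this gives 64 images for the binary variant and 96 for the 3-way variant, which matches the two numbers quoted from the paper.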

Runinho commented on August 18, 2024

I also suspect that running the evaluation might influence the training, but sadly I don't know why. This is just a pointer in case you want to investigate it. Maybe check whether we also get the loss spikes when evaluation during training is disabled: --test_epochs=-1

Youskrpig commented on August 18, 2024

drop_last=True means that some samples are discarded. Will that influence the final result? (I will try it.) According to the loss curve, 65k steps seems like too many for such a small dataset (but that is what the paper did).
About the batch size in the paper (64 for two classes, 96 for three classes): you are right that 32 images would be only the original images, but the paper doesn't say clearly whether the batch size means the original images or the images fed into the model.
Actually, I have evaluation turned off during training. I have sent several emails with questions but got no reply at all...

Runinho commented on August 18, 2024

Say we have 219 images and a batch size of 64. This results in 3 batches with 64 images and one batch with 27 images. drop_last would drop the last batch containing 27 images.
But because we shuffle the training set after each pass over the dataset, we still train on all the training samples: the discarded images are different ones each time.
Yes, sadly this is not 100% clear.
Did you send questions to Li et al.? Or did I miss some question?
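
The drop_last argument above can be checked with a small simulation. This is a stdlib-only sketch that imitates a shuffling loader rather than using the actual PyTorch DataLoader. The function names and the fixed seed are chosen here for illustration.

```python
import random


def batches_per_pass(num_samples, batch_size, drop_last):
    """Number of batches in one pass over the dataset."""
    full, remainder = divmod(num_samples, batch_size)
    return full if drop_last or remainder == 0 else full + 1


def dropped_indices(num_samples, batch_size, rng):
    """Indices that land in the incomplete last batch after one shuffle."""
    order = list(range(num_samples))
    rng.shuffle(order)
    remainder = num_samples % batch_size
    return set(order[num_samples - remainder:]) if remainder else set()


rng = random.Random(0)
print(batches_per_pass(219, 64, drop_last=False))  # 3 full batches plus one batch of 27
print(batches_per_pass(219, 64, drop_last=True))   # the 3 full batches only

# Samples that would be dropped in every single one of 50 shuffled passes.
always_dropped = set(range(219))
for _ in range(50):
    always_dropped &= dropped_indices(219, 64, rng)
print(len(always_dropped))
```

Each pass drops 27 samples, but because the order is reshuffled every time, no sample stays in the dropped batch across all passes.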

Another thing: currently I'm not doing augmentations before the CutPaste step. They claim to use color jitter and random translation. I will try to implement that today. Maybe this will improve the performance.

Feel free to create a pull request for the seeding. I think it might be interesting for other users as well :)

Youskrpig commented on August 18, 2024

Yes, I sent emails to all the authors but got no reply. Another thing: I noticed that in https://github.com/google-research/deep_representation_one_class the GMM for evaluation is used as follows. You can check it.
from sklearn.mixture import GaussianMixture as GMM
feats_tr = tf.nn.l2_normalize(feats_tr, axis=1)
feats = tf.nn.l2_normalize(feats, axis=1)
km = GMM(n_components=1, init_params='kmeans', covariance_type='full')
km.fit(train_embed)
scores = -km.score_samples(embeds)
About the MLP head: I use three layers, and the third layer has no BN and no ReLU.

Runinho commented on August 18, 2024

Nice catch. I didn't see that.
I implemented 2 versions of the Gaussian density estimation. One of them is commented out and uses some functions from sklearn, but I think it's not that similar to the code above.
The current implementation is based on the implementation from the paper that proposed using the Mahalanobis distance: https://github.com/ORippler/gaussian-ad-mvtec
I found that both implementations performed similarly.
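
One possible reason why the two variants perform similarly: for a single Gaussian, the negative log-likelihood that a one-component, full-covariance mixture assigns to a feature vector differs from half the squared Mahalanobis distance only by a constant. The symbols below are notation introduced here (mu and Sigma for the fitted mean and covariance, d for the feature dimension), and details such as covariance regularization or feature normalization can still cause small differences in practice.

```latex
-\log p(x) = \tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)
           + \tfrac{1}{2}\log\det\Sigma
           + \tfrac{d}{2}\log(2\pi)
```

Since the last two terms do not depend on x, both scores rank the test samples in the same order when they share the same mean and covariance, and the ROC AUC depends only on that ranking.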

About the MLP head: I use three layers, and the third layer has no BN and no ReLU.

It's really annoying that they do not specify the MLP head. I find that a shallow head performs better, because the representations are then easier to separate for the Gaussian mixture.

Regarding the spikes in the accuracy:

I think they are fine and just a result of the random nature of CutPaste. Unlike in a normal supervised setting, the dataset is not finite. We generate new augmented data, so sometimes the model might not be able to classify all instances correctly. With a batch size of 64 images we feed 128 images to the model. If we have an accuracy of about 0.992, we classify one image wrong: 127/128 = 0.9921875.
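
The size of such a dip follows directly from the number of images per step. A minimal sketch of that arithmetic, with a helper name made up for illustration:

```python
def batch_accuracy(images_fed: int, misclassified: int) -> float:
    """Accuracy over one batch of images fed to the model."""
    return (images_fed - misclassified) / images_fed


images_fed = 2 * 64  # 64 loaded images plus their 64 CutPaste variants
for misclassified in range(3):
    print(misclassified, batch_accuracy(images_fed, misclassified))
```

So with 128 images per step the accuracy can only move in increments of 1/128, and a single misclassified image already shows up as a visible spike in the curve.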

Youskrpig commented on August 18, 2024

Yeah, I agree. I also think the training results that are logged with writer.add_scalar should be counted by epoch for a more intuitive display. Also, in "They claim to use color jitter and random translation", do you know what random translation means?

Runinho commented on August 18, 2024

counted by epoch for a more intuitive display

This is a very easy fix. I haven't done it because I wanted to have "higher" resolution information, since I found that the model converges very fast: around 1000 epochs without additional augmentations, which is about 4 epochs in their sense of an epoch (1000 / 256 ≈ 3.9, given their 256 parameter updates per epoch).

In "They claim to use color jitter and random translation", do you know what random translation means?

No, I don't know. In math a translation is just a shift in one direction. But in my understanding CNNs are translation invariant, meaning they do not care where the objects in the image are, because of the weight sharing and the average pooling in the last layer.

Yesterday I created a new branch called dev_transform where I added a random crop and color jitter to the implementation. The model trains much slower and we get slightly better results. (I'll update this post later with my results.)

Another thing that is not 100% clear to me is whether they also use augmentations when calculating the "good" representations they use to train the GMM.
Some direct quotes related to that:
In Section 2.1: "In practice, data augmentations, such as translation or color jitter, are applied before feeding x into g or CP"
In Section A.1: "We apply random translation and color jitters for data augmentation to enhance invariance of representations"

Runinho commented on August 18, 2024

I added my results to the README. You might want to compare them with yours. Let me know if you see something interesting :)

Youskrpig commented on August 18, 2024

I added my results to the README. You might want to compare them with yours. Let me know if you see something interesting :)

Runinho commented on August 18, 2024

Thanks for sharing. Maybe I'll try to train the model for that number of epochs.
Did they mention something about the network architecture used in the projection head?
Is google-research/deep_representation_one_class the code base they are referring to?

Youskrpig commented on August 18, 2024

Yes, I think you can refer to that codebase. I found some differences in the backbone (there seem to be several additional zero-padding operations) and in the MLP head (tf.keras.layers.Dense(use_bias=False) is set in head[:-1]).

Youskrpig commented on August 18, 2024

Hi, do you think it is necessary to tune the parameters of t-SNE, such as n_components, init, and perplexity? I find that the feature distribution of the same class is not very compact with the default t-SNE parameters.

Runinho commented on August 18, 2024

I don't know a lot about t-SNE. But I think you have to choose parameters to get a nice visualization. I found this website very helpful to understand the parameters.

Youskrpig commented on August 18, 2024

Thanks for your help. It seems like an amazing website. Actually, I'm thinking about feature compactness, which means the features of the training data should be close to the center (instead of all mapping to one point, i.e. model collapse), so that anomalous data can be isolated easily. Some papers do related work: "PANDA: Adapting Pretrained Features for Anomaly Detection and Segmentation" and "Mean-Shifted Contrastive Loss for Anomaly Detection".

Classmate-Huang commented on August 18, 2024

Hello, did you finally reproduce the CutPaste performance reported in the paper?

Youskrpig commented on August 18, 2024

Not yet, have you?

Classmate-Huang commented on August 18, 2024

Are you from China? Could we add each other as contacts on WeChat or QQ?

Youskrpig commented on August 18, 2024

yeah, my email is [email protected].

Classmate-Huang commented on August 18, 2024

I sent you an e-mail, thanks.
