
deepdpm's Introduction

DeepDPM: Deep Clustering With An Unknown Number of Clusters

This repo contains the official implementation of our CVPR 2022 paper:

DeepDPM: Deep Clustering With An Unknown Number of Clusters

Meitar Ronen, Shahaf Finder and Oren Freifeld.

DeepDPM clustering example on 2D data.
On the left: DeepDPM's predicted clusters' assignments, centers and covariances. On the right: Clusters colored by the GT labels, and the net's decision boundary.

Examples of the clusters found by DeepDPM on the ImageNet Dataset:


Table of Contents
  1. Introduction
  2. Installation
  3. Training
  4. Inference
  5. Citation

Introduction

DeepDPM is a nonparametric deep-clustering method which, unlike most deep-clustering methods, does not require knowing the number of clusters, K; rather, it infers K as part of the overall learning. Using a split/merge framework to adapt the number of clusters and a novel loss, our proposed method outperforms existing (both classical and deep) nonparametric methods.

While the few existing deep nonparametric methods lack scalability, we demonstrate the scalability of ours by being the first such method to report its performance on ImageNet.

Installation

The code runs with Python 3.9. Assuming Anaconda, the virtual environment can be installed using:

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
conda install -c conda-forge pytorch-lightning=1.2.10
conda install -c conda-forge umap-learn
conda install -c conda-forge neptune-client
pip install kmeans-pytorch
conda install psutil numpy pandas matplotlib scikit-learn scipy seaborn tqdm joblib

See the requirements.txt file for an overview of the packages in the environment we used to produce our results.

Training

Setup

Datasets and embeddings

When training on raw data (e.g., MNIST, Reuters10k), the MNIST data will be downloaded automatically into the "data" directory. For Reuters10k, the user needs to download the dataset independently (it is available online) into the "data" directory.

Logging

To run the following with logging enabled, edit DeepDPM.py and DeepDPM_alternations.py and insert your Neptune token and project path. Alternatively, run the scripts with the --offline flag to skip logging. Evaluation metrics will be printed at the end of training in both cases.
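For illustration only, a minimal sketch of the kind of Neptune (legacy neptune-client) initialization this corresponds to; the exact variable names used inside DeepDPM.py and DeepDPM_alternations.py may differ, and the workspace/project strings below are placeholders:

import neptune

# Placeholders: replace with your own workspace/project name and API token.
neptune.init(
    project_qualified_name="WORKSPACE_NAME/PROJECT_NAME",
    api_token="YOUR_API_TOKEN",
)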

Training models

We provide two models which can be used for clustering: DeepDPM, which clusters embedded data, and DeepDPM_alternations, which alternates between feature learning using an AE and clustering using DeepDPM.

  1. Key hyperparameters:
  • --gpus specifies the number of GPUs to use. E.g., use "--gpus 0" to use one gpu.
  • --offline runs the model without logging
  • --use_labels_for_eval: run the model with ground truth labels for evaluation (labels are not used in the training process). Do not use this flag if you do not have labels.
  • --dir specifies the directory where the train_data and test_data tensors are expected to be saved
  • --init_k the initial guess for K.
  • --start_computing_params specifies when to start computing the clusters' parameters (the M-step) after initialization. When changing this, make sure the network has had enough time to learn the initialization.
  • --split_merge_every_n_epochs specifies the frequency of splits and merges
  • --hidden_dims specifies the AE's hidden layer dimensions (and thus its depth) for DeepDPM_alternations
  • --latent_dim specifies the dimension of the AE's learned embeddings (the dimension of the features that will be clustered)

Please also note the NIIW hyperparameters and the guidelines on how to choose them as described in the supplementary material.

  2. Training examples:
  • To generate a GIF similar to the one presented above, run: python DeepDPM.py --dataset synthetic --log_emb every_n_epochs --log_emb_every 1

  • To run DeepDPM on pretrained embeddings (including custom ones):

    python DeepDPM.py --dataset <dataset_name> --dir <embeddings path>
    
    • for example, for MNIST run:

      python DeepDPM.py --dataset MNIST --dir "./pretrained_embeddings/umap_embedded_datasets/MNIST"
      
    • For the imbalanced case use the data dir accordingly, e.g. for MNIST:

      python DeepDPM.py --dataset MNIST --dir "./pretrained_embeddings/umap_embedded_datasets/MNIST_IMBALANCED"
      
    • To run on STL10:

    python DeepDPM.py --dataset stl10 --init_k 3 --dir pretrained_embeddings/MOCO/STL10 --NIW_prior_nu 514 --prior_sigma_scale 0.05
    

    (note that for STL10 there is no imbalanced version)

  • DeepDPM with feature extraction pipeline (jointly learning clustering and features):

    • For MNIST run:
    python DeepDPM_alternations.py --latent_dim 10 --dataset mnist --lambda_ 0.005 --lr 0.002 --init_k 3 --train_cluster_net 200 --alternate --init_cluster_net_using_centers --reinit_net_at_alternation --dir <path_to_dataset_location> --pretrain_path ./saved_models/ae_weights/mnist_e2e.zip --number_of_ae_alternations 3 --transform_input_data None --log_metrics_at_train True
    
    • For Reuters10k run:
    python DeepDPM_alternations.py --dataset reuters10k --dir <path_to_dataset_location> --hidden-dims 500 500 2000 --latent_dim 75 --pretrain_path ./saved_models/ae_weights/reuters10k_e2e.zip --NIW_prior_nu 80 --init_k 1 --lambda_ 0.1 --beta 0.5 --alternate --init_cluster_net_using_centers --reinit_net_at_alternation --number_of_ae_alternations 3 --log_metrics_at_train True --dir ./data/
    
    • For ImageNet-50:
    python DeepDPM_alternations.py --latent_dim 10 --lambda_ 0.05 --beta 0.01 --dataset imagenet_50 --init_k 10 --alternate --init_cluster_net_using_centers --reinit_net_at_alternation --dir ./pretrained_embeddings/MOCO/IMAGENET_50/ --NIW_prior_nu 12 --pretrain_path ./saved_models/ae_weights/imagenet_50_e2e.zip --prior_sigma_scale 0.0001 --prior_sigma_choice data_std --number_of_ae_alternations 2
    
    • For ImageNet-50 imbalanced:
    python DeepDPM_alternations.py --latent_dim 10 --lambda_ 0.05 --beta 0.01 --dataset imagenet_50_imb --init_k 10  --alternate --init_cluster_net_using_centers --reinit_net_at_alternation --dir ./pretrained_embeddings/MOCO/IMAGENET_50_IMB/ --NIW_prior_nu 12 --pretrain_path ./saved_models/ae_weights/imagenet_50_imb.zip --prior_sigma_choice data_std --prior_sigma_scale 0.0001 --number_of_ae_alternations 4
    
  3. Training on custom datasets: DeepDPM is designed to cluster data in the feature space. For dimensionality reduction, we suggest using UMAP, an Autoencoder, or off-the-shelf unsupervised feature extractors such as MoCo, SimCLR, SwAV, etc. (see the UMAP sketch below). If the input data is relatively low-dimensional (e.g., <= 128D), it is possible to train on the raw data.
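As a minimal sketch of such a dimensionality-reduction step (using the umap-learn package from the installation section; the feature array, its dimensions, and the target dimension of 10 are placeholders chosen for illustration):

import numpy as np
import umap

# Hypothetical raw features, e.g. flattened images or embeddings from a pretrained extractor.
raw_features = np.random.rand(10_000, 512).astype(np.float32)

# Reduce to a low-dimensional space suitable for clustering (10D is an arbitrary choice here).
reducer = umap.UMAP(n_components=10)
embeddings = reducer.fit_transform(raw_features)  # shape: (10000, 10)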

To load custom data, create a directory that contains two files: train_data.pt and test_data.pt, a tensor for the train and test data respectively; DeepDPM will load them automatically (a minimal saving sketch is shown below). If you have labels you wish to load for evaluation, please use the --use_labels_for_eval flag.
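A minimal sketch of creating these files (the directory name and tensor shapes below are placeholders; only the file names train_data.pt and test_data.pt follow the description above):

import os
import torch

# Hypothetical embeddings: 10,000 train and 2,000 test samples of dimension 64.
train_data = torch.randn(10_000, 64)
test_data = torch.randn(2_000, 64)

out_dir = "./data/my_custom_dataset"  # pass this path via --dir
os.makedirs(out_dir, exist_ok=True)
torch.save(train_data, os.path.join(out_dir, "train_data.pt"))
torch.save(test_data, os.path.join(out_dir, "test_data.pt"))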

Note that the saved models in this repo are per dataset and, in most cases, specific to it. Thus, they are not recommended for use with custom data.

Inference

For loading a pretrained model from a saved checkpoint, and for an inference example, see scripts/DeepDPM_load_from_checkpoint.py. A rough, hypothetical sketch of such a flow is given below.
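This sketch is hypothetical: the checkpoint and codes paths are placeholders, load_from_checkpoint may additionally require the model's hyperparameters (e.g., the input dimension) depending on how the checkpoint was saved, and calling cluster_net to obtain logits is an assumption based on the model code; the repo script is authoritative.

import torch
from src.clustering_models.clusternet_modules.clusternetasmodel import ClusterNetModel

CHECKPOINT_PATH = "./saved_models/<dataset>/<exp_name>/<checkpoint>.ckpt"  # placeholder

model = ClusterNetModel.load_from_checkpoint(CHECKPOINT_PATH)
model.eval()

codes = torch.load("<path_to_saved_codes_tensor>.pt")  # (N, D) embeddings to cluster
with torch.no_grad():
    assignments = model.cluster_net(codes).argmax(dim=-1)
print(assignments)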

Citation

For any questions: [email protected]

Contributions, feature requests, suggestions, etc. are welcome.

If you use this code for your work, please cite the following:

@inproceedings{Ronen:CVPR:2022:DeepDPM,
  title={DeepDPM: Deep Clustering With An Unknown Number of Clusters},
  author={Ronen, Meitar and Finder, Shahaf E. and  Freifeld, Oren},
  booktitle={Conference on Computer Vision and Pattern Recognition},
  year={2022}
}


deepdpm's Issues

ValueError: Expected parameter covariance_matrix

I see where this issue was brought up before, but it is not resolved.

Here is the error:

Traceback (most recent call last):
File "/home/eich467/AHA-ToDo/DeepDPM/DeepDPM_alternations.py", line 236, in
train_clusternet_with_alternations()
File "/home/eich467/AHA-ToDo/DeepDPM/DeepDPM_alternations.py", line 205, in train_clusternet_with_alternations
trainer.fit(model, train_loader, val_loader)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
self.dispatch()
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
self.accelerator.start_training(self)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
self._results = trainer.run_train()
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 633, in run_train
self.train_loop.on_train_epoch_start(epoch)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 203, in on_train_epoch_start
self.trainer.call_hook("on_train_epoch_start")
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1102, in call_hook
output = hook_fx(*args, **kwargs)
File "/home/eich467/AHA-ToDo/DeepDPM/src/AE_ClusterPipeline.py", line 251, in on_train_epoch_start
self._init_clusters()
File "/home/eich467/AHA-ToDo/DeepDPM/src/AE_ClusterPipeline.py", line 121, in _init_clusters
self.clustering.init_cluster(self.train_dataloader(), self.val_dataloader(), logger=self.logger, centers=centers, init_num=self.init_clusternet_num)
File "/home/eich467/AHA-ToDo/DeepDPM/src/clustering_models/clusternet.py", line 57, in init_cluster
self.fit_cluster(train_loader, val_loader, logger, centers)
File "/home/eich467/AHA-ToDo/DeepDPM/src/clustering_models/clusternet.py", line 65, in fit_cluster
self.model.fit(train_loader, val_loader, logger, self.args.train_cluster_net, centers=centers)
File "/home/eich467/AHA-ToDo/DeepDPM/src/clustering_models/clusternet_modules/clusternet_trainer.py", line 42, in fit
cluster_trainer.fit(self.cluster_model, train_loader, val_loader)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
self.dispatch()
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
self.accelerator.start_training(self)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
self._results = trainer.run_train()
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
self.train_loop.run_training_epoch()
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 577, in run_training_epoch
self.trainer.run_evaluation(on_epoch=True)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 726, in run_evaluation
output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 166, in evaluation_step
output = self.trainer.accelerator.validation_step(args)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 177, in validation_step
return self.training_type_plugin.validation_step(*args)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 131, in validation_step
return self.lightning_module.validation_step(*args, **kwargs)
File "/home/eich467/AHA-ToDo/DeepDPM/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 283, in validation_step
cluster_loss = self.training_utils.cluster_loss_function(
File "/home/eich467/AHA-ToDo/DeepDPM/src/clustering_models/clusternet_modules/utils/training_utils.py", line 235, in cluster_loss_function
gmm_k = MultivariateNormal(model_mus[k].double().to(device=self.device), model_covs[k].double().to(device=self.device))
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/torch/distributions/multivariate_normal.py", line 146, in init
super(MultivariateNormal, self).init(batch_shape, event_shape, validate_args=validate_args)
File "/data/eich467/miniconda3/envs/DPM/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in init
raise ValueError(
ValueError: Expected parameter covariance_matrix (Tensor of shape (10, 10)) of distribution MultivariateNormal(loc: torch.Size([10]), covariance_matrix: torch.Size([10, 10])) to satisfy the constraint PositiveDefinite(), but found invalid values:
tensor([[ 0.2024, 0.0064, -0.0058, -0.0042, -0.0496, 0.0119, 0.0353, 0.0382,
-0.0344, -0.0670],
[ 0.0064, 0.3373, 0.0737, 0.0591, -0.0159, 0.0727, -0.1940, 0.1438,
0.2550, -0.2739],
[-0.0058, 0.0737, 0.2339, -0.0314, 0.0313, -0.0084, -0.0907, 0.0229,
0.0959, -0.0558],
[-0.0042, 0.0591, -0.0314, 0.2218, 0.0188, 0.0588, -0.0070, 0.0613,
0.0971, -0.0632],
[-0.0496, -0.0159, 0.0313, 0.0188, 0.6186, 0.0445, -0.0183, -0.2485,
0.1914, 0.3667],
[ 0.0119, 0.0727, -0.0084, 0.0588, 0.0445, 0.2089, -0.0516, -0.0080,
0.1229, -0.0549],
[ 0.0353, -0.1940, -0.0907, -0.0070, -0.0183, -0.0516, 0.4900, -0.0818,
-0.1718, 0.2499],
[ 0.0382, 0.1438, 0.0229, 0.0613, -0.2485, -0.0080, -0.0818, 0.5531,
-0.1021, -0.2932],
[-0.0344, 0.2550, 0.0959, 0.0971, 0.1914, 0.1229, -0.1718, -0.1021,
0.8345, -0.2239],
[-0.0670, -0.2739, -0.0558, -0.0632, 0.3667, -0.0549, 0.2499, -0.2932,
-0.2239, 0.8357]], device='cuda:0', dtype=torch.float64)

Here are the arguments I used to execute the code:

python DeepDPM_alternations.py --latent_dim 10 --dataset mnist --lambda_ 0.005 --lr 0.002 --init_k 3 --train_cluster_net 200 --alternate --init_cluster_net_using_centers --reinit_net_at_alternation --dir /home/eich467/AHA-Archived/DCC/data/mnist/MNIST/raw --pretrain_path ./saved_models/ae_weights/mnist_e2e.zip --number_of_ae_alternations 3 --transform_input_data None --log_metrics_at_train True --gpus 0 --offline --save_checkpoints True

While the above error is fatal, there are two runtime behaviors that are potentially useful in diagnosing the problem:

../lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: training_step returned None if it was on purpose, ignore this warning...

Also, the loss for epoch 0 is nan.

I appreciate your help.
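For readers hitting the same constraint error, here is a small, generic sketch (not the repo's own fix) of how to check whether a covariance estimate is numerically positive definite; a Cholesky factorization succeeds only for symmetric positive-definite matrices:

import torch

def is_positive_definite(cov: torch.Tensor) -> bool:
    # torch.linalg.cholesky raises for matrices that are not symmetric positive definite.
    try:
        torch.linalg.cholesky(cov)
        return True
    except RuntimeError:
        return False

cov = torch.eye(10, dtype=torch.float64)
print(is_positive_definite(cov))         # True
print(is_positive_definite(-1.0 * cov))  # False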

DeepDPM.py python unable to set the hparams

Hi, when attempting to run DeepDPM.py from both VS Code and the command line with parameters, I run into an exception where, for some reason, hparams can't be set to the args value (which happens in the main function of DeepDPM.py).
Please let me know if I'm doing something wrong here and/or if you need more info on the problem.
Thank you kindly

test scripts for DeepDPM.py

I have a pretrained model (ckpt) obtained using DeepDPM.py, but I wonder how I can get the inference results (cluster assignments for some data) using my pretrained ckpt. Could you share the inference scripts for DeepDPM.py? Thanks!

Training stuck at one certain epoch

Hi Meitar,

So I was training on the MNIST dataset using pretrained features, e.g.

python DeepDPM.py --dataset MNIST --dir './pretrained_embeddings/umap_embedded_datasets/MNIST' --gpus 0

but every time, training gets stuck at epoch 44 and will not continue. Log:

Epoch 0: 100%|███████████| 547/547 [00:00<00:00, 661.71it/s, loss=nan, v_num=]Initializing clusters params using Kmeans...
Epoch 44: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 547/547 [00:19<00:00, 27.60it/s, loss=0, v_num=]

Also, why does the loss become nan in the first epoch? I would appreciate any suggestions!

--pretrain_path parameter and how custom data should be shaped

Hello,
Thank you very much for sharing your work.

I am trying to use your code to classify images on a custom dataset.

Some things are still a bit unclear.

I went for the feature+clustering route, but I can't seem to make sense of the pretrain_path parameter.

If I want to use the UMAP extractor, what should I replace it with?

The second question: would the following work as a generic custom dataset class?

class Custom(MyDataset):
    def __init__(self, args):
        super().__init__(args)
        self.transformer = transforms.Compose([
            transforms.ToTensor(),
            transforms.Resize((28, 28)),
            transforms.Normalize((0.1307,), (0.3081,)),
        ])
        self._input_dim = 28 * 28

    def get_train_data(self):
        return datasets.Custom(self.data_dir, train=True, download=False, transform=self.transformer)

    def get_test_data(self):
        return datasets.Custom(self.data_dir, train=False, transform=self.transformer)

And in which format should the training data be?

Thank you very much in advance!

Are DeepDPM dependencies compatible with recent GPUs?

Hello,

We are trying to run the test code of DeepDPM on RTX 3090 and RTX A5000.

We created a singularity VM with the dependencies listed in
https://github.com/BGU-CS-VIL/DeepDPM/blob/main/requirements.txt
Then, from this VM, ran the test code:

Singularity> python3 DeepDPM.py --gpus 1 --offline --dataset synthetic --log_emb every_n_epochs --log_emb_every 1
Sequential()
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

  | Name              | Type              | Params
--------------------------------------------------------
0 | cluster_net       | MLP_Classifier    | 201  
1 | subclustering_net | Subclustering_net | 252  
--------------------------------------------------------
453       Trainable params
0         Non-trainable params
453       Total params
0.002     Total estimated model params size (MB)
Epoch 0:   0%|                                                                                                                                                                              | 0/158 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: training_step returned None if it was on purpose, ignore this warning...
  warnings.warn(*args, **kwargs)
Epoch 0:  50%|█████████████████████████████████████████████████████████████████████████                                                                         | 79/158 [00:00<00:00, 240.26it/s, loss=nan, v_num=]Initializing clusters params using Kmeans...
Epoch 0:  51%|██████████████████████████████████████████████████████████████████████████▍                                                                        | 80/158 [00:02<00:02, 33.51it/s, loss=nan, v_num=]
Traceback (most recent call last):                                                                                                                                                           | 0/79 [00:00<?, ?it/s]
  File "DeepDPM.py", line 448, in <module>
    train_cluster_net()
  File "DeepDPM.py", line 428, in train_cluster_net
    trainer.fit(model, train_loader, val_loader)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 577, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 726, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 166, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 177, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 131, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 268, in validation_step
    logits = self.cluster_net(codes)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent/src/clustering_models/clusternet_modules/models/Classifiers.py", line 59, in forward
    x = F.relu(self.class_fc1(x))
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Next, we tried to run DeepDPM from a VM we routinely use with RTX 3090 and RTX A5000 GPUs, which relies on more recent library versions. Not too surprisingly, this failed:

python3 DeepDPM.py --gpus 0 --offline --dataset synthetic --log_emb every_n_epochs --log_emb_every 1
2022-08-12 13:26:04.514401: W tensorflow/stream_executor/platform/default/[dso_loader.cc:64](http://dso_loader.cc:64/)] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /.singularity.d/libs
2022-08-12 13:26:04.514479: I tensorflow/stream_executor/cuda/[cudart_stub.cc:29](http://cudart_stub.cc:29/)] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "DeepDPM.py", line 19, in <module>
    from src.clustering_models.clusternet_modules.clusternetasmodel import ClusterNetModel
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 17, in <module>
    from src.clustering_models.clusternet_modules.utils.plotting_utils import PlotUtils
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent/src/clustering_models/clusternet_modules/utils/plotting_utils.py", line 8, in <module>
    import umap
  File "/usr/local/lib/python3.8/dist-packages/umap/__init__.py", line 2, in <module>
    from .umap_ import UMAP
  File "/usr/local/lib/python3.8/dist-packages/umap/umap_.py", line 28, in <module>
    import numba
  File "/usr/local/lib/python3.8/dist-packages/numba/__init__.py", line 200, in <module>
    _ensure_critical_deps()
  File "/usr/local/lib/python3.8/dist-packages/numba/__init__.py", line 140, in _ensure_critical_deps
    raise ImportError("Numba needs NumPy 1.21 or less")
ImportError: Numba needs NumPy 1.21 or less

Any advice to fix this issue?

All the best,

Vincent

The speed of training DeepDPM based on pretrained embeddings is very slow

When learning clustering with DeepDPM.py on my data's embeddings, one epoch takes about 1 hour; my feature dimension is 143 and the number of data points is 10k. In training, I follow your training settings in DeepDPM.py, except for "latent_dim" and "prior_nu".

My training command is as follows:
python DeepDPM.py --dataset renti \
  --dir ./pretrained_embeddings/barlowtwins \
  --latent_dim 147 \
  --prior_nu 150 \
  --prior_sigma_scale 0.05 \
  --offline \
  --gpus 2,3,5,6 \
  --offline \
  --exp_name 'renti_pca0.95_officeparam_real_0424' \
  --max_epochs 1000

Could you help me figure out the cause of this abnormally slow training?
Thanks~

working with custom input data

Hi,
Suppose that the input data consists of 10000 images, and each image is of size MxN.
For the end-to-end feature extraction and clustering scenario, what should be the size of the train.pt tensor?

Also, how can we provide labels for custom data? Let's say the possible labels are "positive", "negative", and "unknown".

Thanks
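To make the label question concrete, here is a hedged sketch of turning string labels into an integer tensor for evaluation (the file name train_labels.pt and the expectation of integer class indices are assumptions; labels are only used when --use_labels_for_eval is passed):

import torch

label_names = ["positive", "negative", "unknown"]
label_to_idx = {name: i for i, name in enumerate(label_names)}

raw_labels = ["positive", "unknown", "negative", "positive"]  # placeholder data
labels = torch.tensor([label_to_idx[l] for l in raw_labels], dtype=torch.long)
torch.save(labels, "train_labels.pt")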

Cannot reproduce results of Reuters10K following the instruction.

Hi there,

I followed the instructions in the README:
"python DeepDPM_alternations.py --dataset reuters10k --dir <path_to_dataset_location> --hidden-dims 500 500 2000 --latent_dim 75 --pretrain_path ./saved_models/ae_weights/reuters10k_e2e.zip --NIW_prior_nu 80 --init_k 1 --lambda_ 0.1 --beta 0.5 --alternate --init_cluster_net_using_centers --reinit_net_at_alternation --number_of_ae_alternations 3 --log_metrics_at_train True"

However, I got the result:
NMI: 0.47693, ARI: 0.35033, acc: 0.6435, final K: 2

Could you suggest any possible reason? Or could you upload your trained model?

Thanks,

Splitting does not occur often enough

Hello thank you very much for your code.

I am trying to use it to cluster text. I am using SBert (all-MiniLM-L6-v2) embeddings of dimension 384 (reduced to dimension 20 with UMAP). I am trying to cluster the Stack Overflow dataset, which contains 20 clusters with 1000 data points per class (cf. the scatter plot, where I reduce the embeddings to dim=2 with UMAP). My problem is that, without changing your model's parameters (except for prior_nu, which needs to be at least codes_dim + 1, so I set it to 22, and latent_dim, which I changed to 20), the model will not split the clusters enough and gets stuck at 1 or 2 clusters. I tried changing "start_sub_clustering", "start_splitting", "start_merging", "split_every_n_epochs", "merge_every_n_epochs", and "split_merge_every_n_epochs" to smaller values, but it does not help.

Also, because there are a lot of hyperparameters, I am not sure where to start tuning them for the case of text. Do you have any ideas or hints for applying your model to such types of embeddings?

Screenshot 2022-05-19 at 09 21 27

About the use of labels

Thank you very much for open-sourcing the code of your model 🎉
Going through the DeepDPM code, I see that several methods rely on ground truth labels for data points.
Would it be possible to apply the DeepDPM model to a dataset in a fully unsupervised way (no labels available)?
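As a hedged illustration of the fully unsupervised setting (this mirrors what a later issue in this list describes, namely an all-zeros dummy label array created by a custom dataset class; the shapes below are placeholders, and --use_labels_for_eval should simply not be passed):

import torch

train_data = torch.randn(10_000, 64)                            # placeholder embeddings
dummy_labels = torch.zeros(len(train_data), dtype=torch.long)   # no real labels available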

Size exceeded during training

gmm_k = MultivariateNormal(model_mus[k].double().to(device=self.device), model_covs[k].double().to(device=self.device))
IndexError: index 4 is out of bounds for dimension 0 with size 4

Hi, I would like to know why this index-out-of-bounds problem occurs when the algorithm changes the number of clusters K during training. How can I solve this problem? Thank you very much.

Size mismatch error when running the inference script on MNIST

After training the model on the MNIST dataset, I used the following command to do inference

python -m scripts.DeepDPM_load_from_checkpoint.py --dataset MNIST --dir "./pretrained_embeddings/umap_embedded_datasets/MNIST"

I specified the DIMENSION OF THE DATA to be 28 * 28 and added CHECKPOINT_PATH in the scripts\DeepDPM_load_from_checkpoint.py, but I still get the following error:

Traceback (most recent call last):
  File "/home/hafsa/DeepDPM/DeepDPM_load_from_checkpoint.py", line 30, in <module>
    model = ClusterNetModel.load_from_checkpoint(
  File "/home/hafsa/.conda/envs/deepdpm/lib/python3.10/site-packages/pytorch_lightning/core/saving.py", line 157, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/home/hafsa/.conda/envs/deepdpm/lib/python3.10/site-packages/pytorch_lightning/core/saving.py", line 205, in _load_model_state
    model.load_state_dict(checkpoint['state_dict'], strict=strict)
  File "/home/hafsa/.conda/envs/deepdpm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ClusterNetModel:
  size mismatch for cluster_net.class_fc1.weight: copying a param with shape torch.Size([50, 10]) from checkpoint, the shape in current model is torch.Size([50, 784]).
  size mismatch for subclustering_net.class_fc1.weight: copying a param with shape torch.Size([50, 10]) from checkpoint, the shape in current model is torch.Size([50, 784]).

Any idea on how to solve this issue?

Thank you

Is there any way to resume training?

Hi Meitar,

Thanks for your research.
I'm so interested in your paper.

Your code is working well, but I'm wondering about how to resume training.
It seems that the 'resume_from_checkpoint' option did not work in your code when I tried modifying it.

Please let me know how to resume training.

Thank you in advance.

get_test_data() slicing - inconsistent numbers of samples

When I run the code with python3 DeepDPM.py --dataset MNIST_N2D --gpus 0 I get an error message:

...

File "/clusterstorage/gkobsik/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
    
File "/clusterstorage/gkobsik/DeepDPM/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 375, in training_epoch_end
    init_nmi = normalized_mutual_info_score(gt, init_labels)
    
File "/clusterstorage/gkobsik/.local/lib/python3.9/site-packages/sklearn/metrics/cluster/_supervised.py", line 1020, in normalized_mutual_info_score
    labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
    
File "/clusterstorage/gkobsik/.local/lib/python3.9/site-packages/sklearn/metrics/cluster/_supervised.py", line 72, in check_clusterings
    check_consistent_length(labels_true, labels_pred)
    
File "/clusterstorage/gkobsik/.local/lib/python3.9/site-packages/sklearn/utils/validation.py", line 332, in check_consistent_length
    raise ValueError(

ValueError: Found input variables with inconsistent numbers of samples: [70000, 69995]

I could narrow the error message down to the dataset and the following lines of code:

def get_train_data(self):
    train_codes = torch.Tensor(torch.load(os.path.join(self.dataset_loc, "train_codes.pt")))
    labels = torch.load(os.path.join(self.dataset_loc, "train_labels.pt"))
    train_labels = torch.Tensor(labels).cpu().float()

def get_test_data(self):
    if "N2D" in self.args.dataset:
        # Training and evaluating the entire dataset.
        # Take only a few examples just to not break code.
        data = self.get_train_data()
        val_codes = data.tensors[0][5:]
        val_labels = data.tensors[1][5:]
Removing the slicing operation in get_test_data() - Line 54 & 55 resolves the problem, but I just started exploring the repo and cannot estimate the side effects of this change.
It would be very helpful to get the reasoning for discarding the first 5 data samples in the validation / test set.

Model weights for inference code

For loading a pretrained model from a saved checkpoint, and for an inference example, see: scripts\DeepDPM_load_from_checkpoint.py.

Are there example weights provided in the repository which can be used for inference?

I noticed the AE encoder weights don't contain the layer:

cp_state['state_dict']['cluster_net.class_fc2.weight'].shape[0] 

Can the authors of this project provide sample model weights that can be used for checking the capability of this clustering algorithm?

Training on custom dataset

Thank you for your super contributions,
What would be the best way to use your code to train on, and predict the cluster assignments of, a list of embeddings representing my data points, i.e., something like .fit(X) or .fit_predict(X) from sklearn?

NEPTUNE_API_TOKEN InvalidApiKey

Your API token is invalid.

Learn how to get it in this docs page:
https://docs-legacy.neptune.ai/security-and-privacy/api-tokens/how-to-find-and-set-neptune-api-token.html

There are two options to add it:
- specify it in your code
- set an environment variable in your operating system.

CODE
Pass the token to neptune.init() via api_token argument:
neptune.init(project_qualified_name='WORKSPACE_NAME/PROJECT_NAME', api_token='YOUR_API_TOKEN')

ENVIRONMENT VARIABLE (Recommended option)
or export or set an environment variable depending on your operating system:

Linux/Unix
In your terminal run:
    export NEPTUNE_API_TOKEN=YOUR_API_TOKEN

Windows
In your CMD run:
    set NEPTUNE_API_TOKEN=YOUR_API_TOKEN

and skip the api_token argument of neptune.init():
neptune.init(project_qualified_name='WORKSPACE_NAME/PROJECT_NAME')

Why did the first epoch run so long

[running kmeans]: 4it [00:03, 1.42it/s, center_shift=0.000000, iteration=5, tol=0.0
[running kmeans]: 5it [00:07, 1.55s/it, center_shift=0.000000, iteration=5, tol=0.000100]ting: 0%| | 0/69 [00:00<?, ?it/s]
Epoch 1: 50%|█████████ | 69/138 [00:10<00:10, 6.70it/s, loss=1.96, v_num=]Evaluating...
NMI : 0.017017041797800267, ARI: 0.007834735394701565, ACC: 0.1387, current K: 3
Epoch 1: 99%|████████████████▉| 137/138 [07:30<00:03, 3.29s/it, loss=1.96, v_num=Evaluating...100%|███████████████████████████████████| 69/69 [00:08<00:00, 7.05it/s]
Epoch 1: 100%|█████████████████| 138/138 [07:42<00:00, 3.35s/it, loss=1.96, v_num=]
Evaluating...
Epoch 2: 50%|█████████ | 69/138 [00:11<00:11, 5.87it/s, loss=1.81, v_num=]Evaluating...
NMI : 0.27587234909394287, ARI: 0.10852441686412587, ACC: 0.20503, current K: 2
Epoch 2: 99%|████████████████▉| 137/138 [00:20<00:00, 6.63it/s, loss=1.81, v_num=Evaluating...100%|███████████████████████████████████| 69/69 [00:08<00:00, 7.09it/s]
Evaluating...
Epoch 3: 50%|█████████ | 69/138 [00:11<00:11, 5.82it/s, loss=1.64, v_num=]Evaluating...
NMI : 0.3924139816776356, ARI: 0.17518072194919748, ACC: 0.21256, current K: 2
Epoch 3: 100%|█████████████████| 138/138 [00:20<00:00, 6.62it/s, loss=1.64, v_num=Evaluating...100%|███████████████████████████████████| 69/69 [00:08<00:00, 6.42it/s]

size mismatch between gt and init_labels gave ValueError: Found input variables with inconsistent numbers of samples

Thank you so much for sharing~ We're very interested in your work and tried it on our data (shown in the attached screenshot).

The input data size was checked before training (the label is an array of zeros created by the code in the custom dataset class), and it seems that a size mismatch between gt and init_labels then occurred at the line init_nmi = normalized_mutual_info_score(gt, init_labels) during training, which gave ValueError: Found input variables with inconsistent numbers of samples.

Could you help us fix this issue, please?

thank you

sincerely

Zhao

Singularity> python3 DeepDPM.py --gpus 0,1,2 --offline --batch-size 512 --max_epoch 20 --dir ./pretrained_embeddings/raw/XCIT_NANO_12_P16_224/
train_codes.size() torch.Size([74770, 128])
train_labels.size() torch.Size([74770])
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
  new_rank_zero_deprecation(
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: The `pytorch_lightning.loggers.base.DummyLogger` is deprecated in v1.7 and will be removed in v1.9. Please use `pytorch_lightning.loggers.logger.DummyLogger` instead.
  return new_rank_zero_deprecation(*args, **kwargs)
Sequential()
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
/usr/local/lib/python3.8/dist-packages/torch/utils/hooks.py:59: UserWarning: backward hook <function Subclustering_net.__init__.<locals>.<lambda> at 0x7f09d872fb80> on tensor will not be serialized.  If this is expected, you can decorate the function with @torch.utils.hooks.unserializable_hook to suppress this warning
  warnings.warn("backward hook {} on tensor will not be "
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:616: UserWarning: Checkpoint directory /mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/checkpoints exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2]
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py:381: RuntimeWarning: Found unsupported keys in the optimizer configuration: {'scheduler'}
  rank_zero_warn(

  | Name              | Type              | Params
--------------------------------------------------------
0 | cluster_net       | MLP_Classifier    | 6.5 K 
1 | subclustering_net | Subclustering_net | 6.6 K 
--------------------------------------------------------
13.1 K    Trainable params
0         Non-trainable params
13.1 K    Total params
0.052     Total estimated model params size (MB)
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:203: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
  rank_zero_warn(
Epoch 0:   0%|                                                                                                                                                                           | 0/66 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py:135: UserWarning: `training_step` returned `None`. If this was on purpose, ignore this warning...
  self.warning_cache.warn("`training_step` returned `None`. If this was on purpose, ignore this warning...")
Epoch 0:  74%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                       | 49/66 [00:07<00:02,  6.32it/s, loss=nanInitializing clusters params using Kmeans...                                                                                                                                              | 0/17 [00:00<?, ?it/s]
Initializing clusters params using Kmeans...
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 66/66 [00:15<00:00,  4.22it/s, loss=nan]/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('cluster_net_train/val/avg_val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 66/66 [00:15<00:00,  4.22it/s, loss=nanInitializing clusters params using Kmeans...                                                                                                                                                                     
gt.size() torch.Size([24924])
init_labels.size() torch.Size([8410])
gt.size() torch.Size([24924])
init_labels.size() torch.Size([8410])
gt.size() torch.Size([24924])
init_labels.size() torch.Size([8410])
Traceback (most recent call last):
  File "DeepDPM.py", line 456, in <module>
    train_cluster_net()
  File "DeepDPM.py", line 436, in train_cluster_net
    trainer.fit(model, train_loader, val_loader)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
    mp.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 129, in _wrapping_function
    results = function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py", line 286, in on_advance_end
    epoch_end_outputs = self.trainer._call_lightning_module_hook("training_epoch_end", epoch_end_outputs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1552, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 375, in training_epoch_end
    init_nmi = normalized_mutual_info_score(gt, init_labels)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/metrics/cluster/_supervised.py", line 1028, in normalized_mutual_info_score
    labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/metrics/cluster/_supervised.py", line 71, in check_clusterings
    check_consistent_length(labels_true, labels_pred)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 387, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [24924, 8410]

Ask for advice about DeepDPM code: KeyError: 'hyper_parameters'

Thank you for your great work. I followed the hints for training and generated
cp_path = "../saved_models/MNIST/default_exp/epoch=9-step=3699.ckpt", and I have a little problem in DeepDPM_load_from_checkpoint.py: as shown in the picture, it raises KeyError: 'hyper_parameters'. I would appreciate it if you could take the time to answer.
(screenshot: 2023-07-18_15-07-55)

get similar gif in custom datasets

Hi, if I want to get a similar GIF on custom datasets, such as CIFAR-10, what should I do to get it?
I see the GIF requires the input data to be 2D, so should I use PCA to get a low-dimensional embedding?
Thanks!!

Accuracy when K is unknown

When K is unknown, if the final K is not exactly the number of classes, how is the ACC defined? From the code and paper, it seems that some predicted clusters are ignored (if K is larger than the number of classes). Is my understanding correct? For example, if the number of classes is 10 and the final K is 15, will 5 predicted clusters be ignored when computing the accuracy? And is it always true that those extra 5 clusters contain very few data points?
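For context, here is a generic sketch of the standard clustering-accuracy computation via Hungarian matching (the common convention, not necessarily the repo's exact implementation): each predicted cluster is matched to at most one ground-truth class, so when K exceeds the number of classes the unmatched clusters simply contribute no correct samples.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Build the contingency matrix between predicted clusters and true classes.
    d = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((d, d), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1
    # Hungarian matching maximizes the number of correctly matched samples.
    row_ind, col_ind = linear_sum_assignment(-w)
    return w[row_ind, col_ind].sum() / y_true.size

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([3, 3, 0, 0, 4, 1])        # more predicted clusters than true classes
print(clustering_accuracy(y_true, y_pred))   # 5/6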

inference example

Running the inference gives this output. Can we give a set of images as input and have it return which cluster each image belongs to?
python scripts/DeepDPM_load_from_checkpoint.py
Sequential()
[tensor([3, 2, 2, 0, 2, 1, 1, 1, 3, 2, 0, 0, 3, 0, 1, 2, 3, 5, 0, 3, 3, 2, 1, 2,
5, 3, 2, 2, 5, 2, 1, 3, 2, 2, 4, 0, 3, 2, 4, 2, 1, 0, 0, 3, 3, 4, 1, 1,
2, 5, 2, 2, 0, 0, 4, 0, 3, 2, 5, 0, 0, 2, 1, 2, 4, 3, 2, 0, 0, 3, 5, 2,
2, 1, 5, 2, 2, 2, 3, 2, 2, 0, 0, 3, 0, 1, 0, 0, 4, 2, 3, 1, 2, 3, 2, 1,
2, 3, 0, 2, 0, 2, 0, 5, 0, 2, 1, 5, 2, 3, 2, 5, 0, 2, 4, 0, 1, 3, 0, 1,
3, 1, 0, 2, 5, 2, 0, 2]),

What should the output of inference be?

Doubt about dataloader: data and labels?

Thank you for your nice contribution!
I want to ask about a few problems...
In issue #6, you said "the DeepDPM is an unsupervised method, labels are not used in any training stage (we use them in our code only for evaluation, but they are not used in training)",
but in DeepDPM.py, the dataloaders all return data and labels? I don't understand; I need your help. Thanks!
Also, in the path "./pretrained_embeddings/MOCO/..", what do the train_codes.pt and labels.pt files contain?

In /MOCO/imagenet_50, getting worse results

Thank you for your great contributions!
I tried to run DeepDPM.py on the ./pretrained_embeddings/MOCO/IMAGENET_50 dataset. Has this dataset already been embedded into 128-dimensional vectors? (I saw the train_codes.pt shape is [64274, 128].)
I tried to cluster this dataset with DeepDPM.py, but the clustering result is worse: after 300 epochs, acc=0.37597.
So, why do you suggest running DeepDPM_alternations for the IMAGENET_50 dataset? Thank you very much!!

`IndexError` from `plotting_utils`

When training with the command

python3 DeepDPM.py --dataset=synthetic --latent_dim=2 --init_k=15 --log_emb=every_n_epochs --log_emb_every=20 --offline --gpus=0

I consistently get the following error:

Evaluating...                                                                                                                                               
NMI : 0.9436628028368107, ARI: 0.9035767324812444, ACC: 0.9316, current K: 18
Epoch 60:  50%|████████████████████████████████████████████                                            | 79/158 [00:06<00:06, 11.52it/s, loss=0.252, v_num=]
Traceback (most recent call last):
  File "DeepDPM.py", line 441, in <module>
    train_cluster_net()
  File "DeepDPM.py", line 426, in train_cluster_net
    trainer.fit(model, train_loader, val_loader)
  File "/home/riccardo/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/riccardo/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/riccardo/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/riccardo/venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/riccardo/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/home/riccardo/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 560, in run_training_epoch
    self.trainer.logger_connector.log_train_epoch_end_metrics(
  File "/home/riccardo/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 415, in log_train_epoch_end_metrics
    self.training_epoch_end(model, epoch_output, num_optimizers)
  File "/home/riccardo/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 466, in training_epoch_end
    epoch_output = model.training_epoch_end(epoch_output)
  File "/home/riccardo/DeepDPM/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 551, in training_epoch_end
    self.plot_utils.plot_cluster_and_decision_boundaries(samples=self.codes, labels=self.train_resp.argmax(-1), gt_labels=self.train_gt, net_centers=self.mus, net_covs=self.covs, n_epoch=self.current_epoch, cluster_net=self)
  File "/home/riccardo/DeepDPM/src/clustering_models/clusternet_modules/utils/plotting_utils.py", line 336, in plot_cluster_and_decision_boundaries
    self.plot_clusters(
  File "/home/riccardo/DeepDPM/src/clustering_models/clusternet_modules/utils/plotting_utils.py", line 263, in plot_clusters
    samples[:, 0], samples[:, 1], c=self.colors[labels, :], s=40, alpha=0.5, zorder=1
IndexError: index 16 is out of bounds for dimension 0 with size 15

I'm not sure whether this could be connected with #25.

"Data" directory

I have raw data fairly similar to MNIST and I would like to train your model on it.
I'm quite new to ML and I don't really know how to pass the name of my dataset as an argument, or where to actually put my dataset. I've read the closed custom-dataset issues but I still don't grasp how to do it.

Also, how can I save models and have checkpoints? I couldn't figure it out.
Thank you in advance to anyone who can help me.

Inference produces an error in DeepDPM_load_from_checkpoint.py

(Screenshot 2023-02-14 15-30-39)

Thank you for your great work; I am very interested in it. I followed the hints for training and generated a .pth.tar file, and I have a little problem in DeepDPM_load_from_checkpoint.py, as shown in the picture. I would appreciate it if you could take the time to answer.

when running gpu-kmeans, having center_shift=nan

Hi, thank you for sharing your great work!
I'm trying to train the clustering on my own dataset. I followed the ImageNet experiment setup: first, resize images to 224 * 224 RGB; second, use an ImageNet-pretrained ResNet-50 to extract 2048-dim embeddings. Then comes the first question:

  1. The last feature before the fc layer of ResNet is 2048-dim, so why is the ImageNet embedded data provided in this project 128-dim?

With the extracted features being 2048-dim (the latent dim is still 10), I can't use the provided './saved_models/ae_weights/imagenet_50_e2e.zip' as pretrained weights, so I tried skipping that step and directly training the feature_extractor; after a certain number of epochs I stopped, froze the weights, treated them as pretrained weights, and finally restarted from the beginning.
At first everything seemed to go well: K increased from 10 to 16 as expected while the loss decreased, but suddenly I got:

  2. center_shift=nan, and it gets stuck in running gpu-kmeans, as center_shift is always nan

Sometimes this happens after several epochs, sometimes it happens at the very beginning, like:

python DeepDPM/DeepDPM_alternations.py --features_dim 2048 --latent_dim 10 --lambda_ 0.05 --beta 0.01 --dataset imagenet_50 --init_k 10 --alternate --init_cluster_net_using_centers --reinit_net_at_alternation --dir ./pretrained_embeddings/MOCO/ --prior_nu 12 --pretrain_path ./saved_models/imagenet_50/alt_1_10_checkpoint.pth --prior_sigma_scale 0.0001 --prior_sigma_choice data_std --number_of_ae_alternations 200 --batch-size 2048 --gpus 0 --save_checkpoints True --train_cluster_net 50 --lr 0.0003
NeptuneLogger will work in online mode
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type             | Params
-------------------------------------------------------
0 | feature_extractor | FeatureExtractor | 4.6 M 
1 | criterion         | MSELoss          | 0     
-------------------------------------------------------
4.6 M     Trainable params
0         Non-trainable params
4.6 M     Total params
18.428    Total estimated model params size (MB)

Epoch 0:   0%|          | 0/16 [00:00<?, ?it/s] ========== Alternation 0: Running DeepDPM clustering ==========

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Sequential()
https://app.neptune.ai/siyao.liu1121/DeepDpm/e/DEEP-57

  | Name              | Type               | Params
---------------------------------------------------------
0 | feature_extractor | AE_ClusterPipeline | 4.6 M 
1 | cluster_net       | MLP_Classifier     | 1.1 K 
2 | subclustering_net | Subclustering_net  | 15.5 K
---------------------------------------------------------
4.6 M     Trainable params
0         Non-trainable params
4.6 M     Total params
18.495    Total estimated model params size (MB)
Epoch 0:   0%|          | 0/16 [00:00<?, ?it/s] python3.7/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: training_step returned None if it was on purpose, ignore this warning...
  warnings.warn(*args, **kwargs)
Epoch 0:  94%|#########3| 15/16 [00:00<00:00, 15.53it/s, loss=nan, v_num=P-57]Initializing clusters params using Kmeans...
running k-means on cuda:0..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=nan, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 73.55it/s, center_shift=nan, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 100.25it/s, center_shift=nan, iteration=3, tol=0.000100]
[running kmeans]: 3it [00:00, 112.12it/s, center_shift=nan, iteration=4, tol=0.000100]
[running kmeans]: 4it [00:00, 120.83it/s, center_shift=nan, iteration=5, tol=0.000100]
[running kmeans]: 5it [00:00, 128.53it/s, center_shift=nan, iteration=6, tol=0.000100]

Or:

Validating:   0%|          | 0/1 [00:00<?, ?it/s]
Validating: 100%|██████████| 1/1 [00:00<00:00,  4.75it/s]Evaluating...
NMI : 0.04476338455463688, ARI: -0.0004239651100343126, ACC: 0.04, current K: 5
Epoch 71: 100%|██████████| 16/16 [00:01<00:00,  8.66it/s, loss=0.0877, v_num=]
Epoch 72:  94%|█████████▍| 15/16 [00:00<00:00, 19.42it/s, loss=0.0822, v_num=]running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.407188, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 133.01it/s, center_shift=5.686129, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 180.04it/s, center_shift=0.002052, iteration=3, tol=0.000100]
[running kmeans]: 4it [00:00, 267.79it/s, center_shift=0.000000, iteration=4, tol=0.000100]
running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.287646, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 106.30it/s, center_shift=5.682350, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 22.30it/s, center_shift=0.002052, iteration=3, tol=0.000100] 
[running kmeans]: 4it [00:00, 42.84it/s, center_shift=0.000000, iteration=4, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.037330, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 371.54it/s, center_shift=0.014153, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 507.82it/s, center_shift=0.006376, iteration=3, tol=0.000100]
[running kmeans]: 3it [00:00, 604.34it/s, center_shift=0.002210, iteration=4, tol=0.000100]
[running kmeans]: 4it [00:00, 668.44it/s, center_shift=0.000808, iteration=5, tol=0.000100]
[running kmeans]: 5it [00:00, 708.52it/s, center_shift=0.000307, iteration=6, tol=0.000100]
[running kmeans]: 6it [00:00, 739.34it/s, center_shift=0.000133, iteration=7, tol=0.000100]
[running kmeans]: 8it [00:00, 855.91it/s, center_shift=0.000031, iteration=8, tol=0.000100]
running k-means on cuda..
running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.844016, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 10.65it/s, center_shift=3.517675, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 20.67it/s, center_shift=0.000373, iteration=3, tol=0.000100]
[running kmeans]: 4it [00:00, 40.05it/s, center_shift=0.000000, iteration=4, tol=0.000100]
running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.015098, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 420.57it/s, center_shift=0.014386, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 559.88it/s, center_shift=0.010443, iteration=3, tol=0.000100]
[running kmeans]: 3it [00:00, 630.19it/s, center_shift=0.009837, iteration=4, tol=0.000100]
[running kmeans]: 4it [00:00, 672.76it/s, center_shift=0.011026, iteration=5, tol=0.000100]
[running kmeans]: 5it [00:00, 701.67it/s, center_shift=0.005228, iteration=6, tol=0.000100]
[running kmeans]: 6it [00:00, 722.89it/s, center_shift=0.002056, iteration=7, tol=0.000100]
[running kmeans]: 7it [00:00, 738.34it/s, center_shift=0.000462, iteration=8, tol=0.000100]
[running kmeans]: 8it [00:00, 750.81it/s, center_shift=0.000193, iteration=9, tol=0.000100]
[running kmeans]: 10it [00:00, 831.46it/s, center_shift=0.000072, iteration=10, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.871758, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 137.39it/s, center_shift=3.142004, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 187.02it/s, center_shift=0.000295, iteration=3, tol=0.000100]
[running kmeans]: 4it [00:00, 277.74it/s, center_shift=0.000000, iteration=4, tol=0.000100]
running k-means on cuda..
running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.207659, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 138.94it/s, center_shift=5.758148, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 188.52it/s, center_shift=0.002321, iteration=3, tol=0.000100]
[running kmeans]: 4it [00:00, 279.06it/s, center_shift=0.000000, iteration=4, tol=0.000100]

Validating: 0it [00:00, ?it/s]
Validating:   0%|          | 0/1 [00:00<?, ?it/s]
Validating: 100%|██████████| 1/1 [00:00<00:00,  4.87it/s]Evaluating...
NMI : 0.04476338455463688, ARI: -0.0004239651100343126, ACC: 0.04, current K: 5
Epoch 72: 100%|██████████| 16/16 [00:02<00:00,  7.98it/s, loss=0.0822, v_num=]
Epoch 73:  94%|█████████▍| 15/16 [00:00<00:00, 19.42it/s, loss=0.0684, v_num=]running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.288587, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 136.99it/s, center_shift=1.223311, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 184.52it/s, center_shift=1.936977, iteration=3, tol=0.000100]
[running kmeans]: 3it [00:00, 210.28it/s, center_shift=0.000130, iteration=4, tol=0.000100]
[running kmeans]: 5it [00:00, 279.96it/s, center_shift=0.000000, iteration=5, tol=0.000100]
running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.696886, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 10.71it/s, center_shift=3.553470, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 20.72it/s, center_shift=0.000413, iteration=3, tol=0.000100]
[running kmeans]: 4it [00:00, 40.12it/s, center_shift=0.000000, iteration=4, tol=0.000100]
running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.038775, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 376.51it/s, center_shift=0.003019, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 497.19it/s, center_shift=0.001510, iteration=3, tol=0.000100]
[running kmeans]: 3it [00:00, 553.97it/s, center_shift=0.000504, iteration=4, tol=0.000100]
[running kmeans]: 4it [00:00, 588.20it/s, center_shift=0.000152, iteration=5, tol=0.000100]
[running kmeans]: 6it [00:00, 716.10it/s, center_shift=0.000024, iteration=6, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=1.120770, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 135.03it/s, center_shift=2.885201, iteration=2, tol=0.000100]
running k-means on cuda..
[running kmeans]: 2it [00:00, 183.26it/s, center_shift=0.000277, iteration=3, tol=0.000100]
[running kmeans]: 4it [00:00, 199.86it/s, center_shift=0.000000, iteration=4, tol=0.000100]
running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.035341, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 401.14it/s, center_shift=0.014382, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 531.19it/s, center_shift=0.011126, iteration=3, tol=0.000100]
[running kmeans]: 3it [00:00, 621.29it/s, center_shift=0.006465, iteration=4, tol=0.000100]
[running kmeans]: 4it [00:00, 690.76it/s, center_shift=0.007334, iteration=5, tol=0.000100]
[running kmeans]: 5it [00:00, 741.07it/s, center_shift=0.008686, iteration=6, tol=0.000100]
[running kmeans]: 6it [00:00, 777.97it/s, center_shift=0.007972, iteration=7, tol=0.000100]
[running kmeans]: 7it [00:00, 808.11it/s, center_shift=0.007869, iteration=8, tol=0.000100]
[running kmeans]: 8it [00:00, 832.14it/s, center_shift=0.003304, iteration=9, tol=0.000100]
[running kmeans]: 9it [00:00, 852.27it/s, center_shift=0.002945, iteration=10, tol=0.000100]
[running kmeans]: 10it [00:00, 869.45it/s, center_shift=0.001074, iteration=11, tol=0.000100]
[running kmeans]: 11it [00:00, 883.37it/s, center_shift=0.000622, iteration=12, tol=0.000100]
[running kmeans]: 12it [00:00, 894.24it/s, center_shift=0.000343, iteration=13, tol=0.000100]
[running kmeans]: 14it [00:00, 941.03it/s, center_shift=0.000067, iteration=14, tol=0.000100]
running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.552967, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 10.76it/s, center_shift=4.388100, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 20.84it/s, center_shift=0.000721, iteration=3, tol=0.000100]
[running kmeans]: 4it [00:00, 40.40it/s, center_shift=0.000000, iteration=4, tol=0.000100]
running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.193326, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 10.88it/s, center_shift=4.997762, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 20.99it/s, center_shift=0.040120, iteration=3, tol=0.000100]
[running kmeans]: 4it [00:00, 40.54it/s, center_shift=0.000001, iteration=4, tol=0.000100]

Validating: 0it [00:00, ?it/s]
Validating:   0%|          | 0/1 [00:00<?, ?it/s]
Validating: 100%|██████████| 1/1 [00:00<00:00,  4.99it/s]Evaluating...
NMI : 0.04454646281585243, ARI: -0.000494162559496229, ACC: 0.04, current K: 5
Epoch 73: 100%|██████████| 16/16 [00:02<00:00,  7.73it/s, loss=0.0684, v_num=]
Epoch 74:  94%|█████████▍| 15/16 [00:00<00:00, 19.53it/s, loss=0.0678, v_num=]running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.968823, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 10.74it/s, center_shift=3.056932, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 20.77it/s, center_shift=0.000382, iteration=3, tol=0.000100]
[running kmeans]: 4it [00:00, 40.22it/s, center_shift=0.000000, iteration=4, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s]running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s, center_shift=2.590032, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 10.75it/s, center_shift=1.237566, iteration=2, tol=0.000100]
[running kmeans]: 3it [00:00, 31.16it/s, center_shift=0.000057, iteration=3, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=0.002574, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 369.05it/s, center_shift=0.000372, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 488.90it/s, center_shift=0.000153, iteration=3, tol=0.000100]
[running kmeans]: 4it [00:00, 706.29it/s, center_shift=0.000030, iteration=4, tol=0.000100]
running k-means on cuda..
running k-means on cuda..

[running kmeans]: 0it [00:00, ?it/s]
[running kmeans]: 0it [00:00, ?it/s, center_shift=nan, iteration=1, tol=0.000100]
[running kmeans]: 1it [00:00, 10.77it/s, center_shift=nan, iteration=2, tol=0.000100]
[running kmeans]: 2it [00:00, 20.87it/s, center_shift=nan, iteration=3, tol=0.000100]

Any help?
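A quick first check that may help localize the NaNs is to verify that the 2048-dim features themselves are finite before they ever reach the GPU k-means. The sketch below is only a sanity check, not part of the repo; the file name and directory layout are assumed from the --dir/--dataset arguments in the command above, so adjust the path to wherever your train_data.pt actually lives.

import torch

# Assumed location, following the --dir / --dataset layout of the command above.
codes = torch.load("./pretrained_embeddings/MOCO/imagenet_50/train_data.pt").float()

print("shape:", tuple(codes.shape))
print("any NaN:", torch.isnan(codes).any().item())
print("any Inf:", torch.isinf(codes).any().item())
print("min / max / std:", codes.min().item(), codes.max().item(), codes.std().item())

If the features are already non-finite, the problem is upstream in the frozen feature_extractor; if they are finite, the NaNs are being introduced later (for example by the AE alternation or the loss).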

new size-match errors after commenting out lines linked to labels

    > Hi @Vz09, As a quick fix, since you do not have any labels anyway, you could comment out the following lines from the code: "init_nmi = normalized_mutual_info_score(gt, init_labels)" and "init_ari = adjusted_rand_score(gt, init_labels)" (lines 373-4 at src/clustering_models/clusternet_modules/clusternetasmodel.py); we will upload an official fix in the future.

That said, please make sure (using debugging) that the number of samples you are training on (self.codes) is indeed equal to the number of samples in your dataset; if it is not, there might be a problem with the dimension configuration, and that will need more attention.
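Rather than deleting the two metric lines, one option is to guard them behind the labels flag. This is only a sketch of a drop-in replacement for the quoted lines in clusternetasmodel.py; gt and init_labels come from the surrounding code, and the exact way --use_labels_for_eval is exposed inside the class may differ, so treat the condition as a placeholder.

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Only compute label-based metrics when ground-truth labels actually exist
# (placeholder condition; adapt to however the flag is exposed in the class).
if use_labels_for_eval:
    init_nmi = normalized_mutual_info_score(gt, init_labels)
    init_ari = adjusted_rand_score(gt, init_labels)
else:
    init_nmi, init_ari = float("nan"), float("nan")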

Thank you for your reply~
Do the print statements I added to the CustomDataset class (below) check that the number of samples in my training run equals the number of samples in the dataset? If so, they are equal (74770).

class CustomDataset(MyDataset):
    def __init__(self, args):
        super().__init__(args)
        self.transformer = transforms.Compose([transforms.ToTensor()])
        self._data_dim = 0
    
    def get_train_data(self):
        train_codes = torch.Tensor(torch.load(os.path.join(self.data_dir, "train_data.pt")))
        if self.args.transform_input_data:
            train_codes = transform_embeddings(self.args.transform_input_data, train_codes)
        if self.args.use_labels_for_eval:
            train_labels = torch.load(os.path.join(self.data_dir, "train_labels.pt"))
        else:
            train_labels = torch.zeros((train_codes.size()[0]))
        self._data_dim = train_codes.size()[1]
        print("train_codes.size()", train_codes.size())
        print("train_labels.size()", train_labels.size())
        train_set = TensorDatasetWrapper(train_codes, train_labels)
        del train_codes
        del train_labels
        return train_set

I tried commenting out the two lines you mentioned; that indeed allowed training to continue to epoch 25, but then another error occurred...

train_codes.size() torch.Size([74770, 128])
train_labels.size() torch.Size([74770])
Sequential()
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
  new_rank_zero_deprecation(
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: The `pytorch_lightning.loggers.base.DummyLogger` is deprecated in v1.7 and will be removed in v1.9. Please use `pytorch_lightning.loggers.logger.DummyLogger` instead.
  return new_rank_zero_deprecation(*args, **kwargs)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
/usr/local/lib/python3.8/dist-packages/torch/utils/hooks.py:59: UserWarning: backward hook <function Subclustering_net.__init__.<locals>.<lambda> at 0x7f6643376af0> on tensor will not be serialized.  If this is expected, you can decorate the function with @torch.utils.hooks.unserializable_hook to suppress this warning
  warnings.warn("backward hook {} on tensor will not be "
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:616: UserWarning: Checkpoint directory /mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/checkpoints exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py:381: RuntimeWarning: Found unsupported keys in the optimizer configuration: {'scheduler'}
  rank_zero_warn(

  | Name              | Type              | Params
--------------------------------------------------------
0 | cluster_net       | MLP_Classifier    | 6.5 K 
1 | subclustering_net | Subclustering_net | 6.6 K 
--------------------------------------------------------
13.1 K    Trainable params
0         Non-trainable params
13.1 K    Total params
0.052     Total estimated model params size (MB)
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:203: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
  rank_zero_warn(

Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/99 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/99 [00:00<?, ?it/s] 
Epoch 0:   1%|          | 1/99 [00:19<31:47, 19.46s/it]
Epoch 0:   1%|          | 1/99 [00:19<31:47, 19.46s/it, loss=nan]
Epoch 0:   2%|▏         | 2/99 [00:19<15:44,  9.74s/it, loss=nan]
Epoch 0:   2%|▏         | 2/99 [00:19<15:44,  9.74s/it, loss=nan]
.... I skipped the intermediate log
Epoch 25:  74%|███████▎  | 73/99 [00:22<00:07,  3.31it/s, loss=0]
Epoch 25:  74%|███████▎  | 73/99 [00:22<00:07,  3.31it/s, loss=0]
Epoch 25:  75%|███████▍  | 74/99 [00:22<00:07,  3.35it/s, loss=0]
Epoch 25:  75%|███████▍  | 74/99 [00:22<00:07,  3.35it/s, loss=0]

Validation: 0it [00:00, ?it/s]

Validation:   0%|          | 0/25 [00:00<?, ?it/s]

Validation DataLoader 0:   0%|          | 0/25 [00:00<?, ?it/s]

Validation DataLoader 0:   4%|▍         | 1/25 [00:00<00:00, 173.97it/s]
Epoch 25:  76%|███████▌  | 75/99 [00:44<00:14,  1.70it/s, loss=0]

Validation DataLoader 0:  24%|██▍       | 6/25 [00:00<00:00, 76.02it/s]
Epoch 25:  81%|████████  | 80/99 [00:44<00:10,  1.81it/s, loss=0]codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1

Validation DataLoader 0:  72%|███████▏  | 18/25 [00:00<00:00, 72.70it/s]
Epoch 25:  93%|█████████▎| 92/99 [00:44<00:03,  2.07it/s, loss=0]codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1

Validation DataLoader 0:  76%|███████▌  | 19/25 [00:00<00:00, 71.92it/s]
Epoch 25:  94%|█████████▍| 93/99 [00:44<00:02,  2.09it/s, loss=0]codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1

Validation DataLoader 0: 100%|██████████| 25/25 [00:00<00:00, 83.41it/s]
Epoch 25: 100%|██████████| 99/99 [00:44<00:00,  2.22it/s, loss=0]

codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1
codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1
codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1
codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1
codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1
Traceback (most recent call last):
  File "DeepDPM.py", line 456, in <module>
    train_cluster_net()
  File "DeepDPM.py", line 436, in train_cluster_net
    trainer.fit(model, train_loader, val_loader)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
    mp.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 129, in _wrapping_function
    results = function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py", line 286, in on_advance_end
    epoch_end_outputs = self.trainer._call_lightning_module_hook("training_epoch_end", epoch_end_outputs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1552, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 448, in training_epoch_end
    ) = self.training_utils.comp_cluster_params(
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/training_utils.py", line 146, in comp_cluster_params
    mus = compute_mus(
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/clustering_utils/clustering_operations.py", line 278, in compute_mus
    mus = compute_mus_soft_assignment(codes, logits, K)
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/clustering_utils/clustering_operations.py", line 227, in compute_mus_soft_assignment
    [
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/clustering_utils/clustering_operations.py", line 228, in <listcomp>
    (logits[:, k].reshape(-1, 1) * codes).sum(axis=0) / denominator[k]
RuntimeError: The size of tensor a (9347) must match the size of tensor b (3154) at non-singleton dimension 0

Then I tried commenting out the following lines of clusternetasmodel.py; training continued until epoch 44 and then another error occurred again....

if not freeze_mus:
    (
        self.pi,
        self.mus,
        self.covs,
    ) = self.training_utils.comp_cluster_params(
        self.train_resp,
        self.codes.view(-1, self.codes_dim),
        self.pi,
        self.K,
        self.prior,
    )
Epoch 44:  74%|███████▎  | 73/99 [00:25<00:09,  2.86it/s, loss=0]
Epoch 44:  74%|███████▎  | 73/99 [00:25<00:09,  2.86it/s, loss=0]
Epoch 44:  75%|███████▍  | 74/99 [00:25<00:08,  2.90it/s, loss=0]
Epoch 44:  75%|███████▍  | 74/99 [00:25<00:08,  2.90it/s, loss=0]

Validation: 0it [00:00, ?it/s]

Validation:   0%|          | 0/25 [00:00<?, ?it/s]

Validation DataLoader 0:   0%|          | 0/25 [00:00<?, ?it/s]

Validation DataLoader 0:   4%|▍         | 1/25 [00:00<00:00, 149.64it/s]
Epoch 44:  76%|███████▌  | 75/99 [00:48<00:15,  1.56it/s, loss=0]

Validation DataLoader 0:  52%|█████▏    | 13/25 [00:01<00:01,  6.99it/s]
Epoch 44:  88%|████████▊ | 87/99 [00:49<00:06,  1.74it/s, loss=0]

Validation DataLoader 0: 100%|██████████| 25/25 [00:02<00:00, 12.13it/s]
Epoch 44: 100%|██████████| 99/99 [00:50<00:00,  1.98it/s, loss=0]

Traceback (most recent call last):
  File "DeepDPM.py", line 456, in <module>
    train_cluster_net()
  File "DeepDPM.py", line 436, in train_cluster_net
    trainer.fit(model, train_loader, val_loader)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
    mp.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 129, in _wrapping_function
    results = function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py", line 286, in on_advance_end
    epoch_end_outputs = self.trainer._call_lightning_module_hook("training_epoch_end", epoch_end_outputs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1552, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 462, in training_epoch_end
    ) = self.training_utils.init_subcluster_params(
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/training_utils.py", line 190, in init_subcluster_params
    mus, covs, pis = init_mus_and_covs_sub(
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/clustering_utils/clustering_operations.py", line 134, in init_mus_and_covs_sub
    codes_k = codes[indices_k]
IndexError: The shape of the mask [9347] at index 0 does not match the shape of the indexed tensor [3154, 128] at index 0

I don't know whether commenting out more code affects the clustering results or not....
Could you explain more about the problem with the dimension configuration, please? How should I check or change it?

thank you

sincerely

Zhao

Originally posted by @Vz09 in #34 (comment)
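One thing that may be worth ruling out before digging deeper: the traceback above comes from an 8-process DDP run (ddp_spawn), and the mismatch (logits of length 9347 vs. 3154 codes) shows up inside a per-process epoch-end hook. Reproducing with a single GPU, as in the example commands elsewhere on this page, and confirming that the saved tensors match the command-line dimensions could help narrow down whether the sharding or the dimension configuration is at fault. A small check along those lines (the paths are hypothetical and the 128 is taken from the logs above):

import torch

# Same directory that --dir points at (hypothetical path).
train_codes = torch.load("train_data.pt")
train_labels = torch.load("train_labels.pt")

print("train_data:", tuple(train_codes.shape))     # expected (74770, 128) here
print("train_labels:", tuple(train_labels.shape))  # expected (74770,)

latent_dim = 128  # the value passed via --latent_dim
assert train_codes.shape[0] == train_labels.shape[0], "sample counts differ"
assert train_codes.shape[1] == latent_dim, "--latent_dim does not match the embedding width"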

TensorOptions TypeError at 'embbeded_datasets.py', line 35

I am new to keras, so it's probably my error, but I cannot find a solution in any forum, so I am asking here.
The error is:
TypeError: expected TensorOptions(dtype=float, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)) (got TensorOptions(dtype=unsigned char, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
When I run: python DeepDPM.py --dataset MNIST_N2D
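For what it's worth, the error says the loaded tensor is uint8 ("unsigned char") while a float tensor was expected, which typically happens when raw image bytes are saved directly. Casting the tensor to float before it is used is a plausible fix; a minimal sketch under that assumption (the path is hypothetical, use whatever file embbeded_datasets.py is reading):

import torch

data = torch.load("data/MNIST_N2D/train_data.pt")  # hypothetical path
print(data.dtype)                                   # e.g. torch.uint8

data = data.float()   # cast to float32; for raw images also consider scaling with data / 255.0
torch.save(data, "data/MNIST_N2D/train_data.pt")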

AttributeError: 'ClusterNet' object has no attribute 'model'

I tried to pre-train the model on the MNIST dataset, but I got AttributeError: 'ClusterNet' object has no attribute 'model'. It seems that the ClusterNet model fails to call init_cluster() to initialize self.model = ClusterNetTrainer(). Below is the traceback of the error. Could you help me fix this?

Epoch 0: 0%| | 0/548 [00:00<?, ?it/s]
========== Start pretraining ==========
Epoch 0: 70%|██▊ | 384/548 [09:47<04:10, 1.53s/it, loss=2.49e+04, v_num=] Epoch 1: 28%|███ | 153/548 [01:59<05:08, 1.28it/s, loss=2.23e+04, v_num=] Epoch 2: 65%|████▌ | 354/548 [09:07<04:59, 1.55s/it, loss=1.95e+04, v_num=]
Epoch 2: 100%|████████████| 548/548 [12:13<00:00, 1.34s/it, loss=1.88e+04, v_num=]
Traceback (most recent call last):
  File "/home/huili/Projects/DeepDPM_Original/DeepDPM_alternations.py", line 236, in <module>
    train_clusternet_with_alternations()
  File "/home/huili/Projects/DeepDPM_Original/DeepDPM_alternations.py", line 208, in train_clusternet_with_alternations
    DeepDPM = model.clustering.model.cluster_model
AttributeError: 'ClusterNet' object has no attribute 'model'

what is embbeded_datasets?

When I run DeepDPM_load_from_checkpoint.py, it reports the error "No module named 'src.embbeded_datasets'".
I am looking forward to your reply, thanks~

Training using a custom dataset

Hi, do you have any scripts we can use to generate the pt file from our images? I tried to make it but I cannot export it successfully
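There is no general-purpose script quoted on this page, but the CustomDataset snippet above simply loads train_data.pt (an N x D float tensor) and train_labels.pt from the --dir folder, so one way to produce those files is to run any pretrained encoder over the images and save the stacked features. A rough sketch, using a torchvision ResNet-50 purely as a stand-in for the MoCo encoder used for the official embeddings (paths, batch size, and the choice of encoder are all assumptions):

import torch
import torchvision
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in encoder; the official embeddings come from MoCo, this is only for illustration.
backbone = torchvision.models.resnet50(pretrained=True)  # on newer torchvision use weights=...
backbone.fc = torch.nn.Identity()                         # expose the 2048-dim features
backbone.eval().to(device)

tf = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
dataset = torchvision.datasets.ImageFolder("my_images/train", transform=tf)  # hypothetical path
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)

feats, labels = [], []
with torch.no_grad():
    for x, y in loader:
        feats.append(backbone(x.to(device)).cpu())
        labels.append(y)

torch.save(torch.cat(feats), "my_custom_dir/train_data.pt")      # what --dir should point to
torch.save(torch.cat(labels), "my_custom_dir/train_labels.pt")   # only needed for --use_labels_for_eval

The labels file is only needed if you pass --use_labels_for_eval; otherwise the CustomDataset snippet above just fills it with zeros.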

The result of accuracy and k is strange when I change the batch size

My question is that the results trained on STL-10 are too low with a different batch size.
The only difference in my environment is that the Python version is 3.8.
The accuracy is 0.793 and K is 10 when I use the default command (as shown in the first and second quotes below).
But when I only change the batch size (e.g., to 512), the result is very strange (as shown in the third and fourth quotes), especially K = 3. So my question is: what happens when I change the batch size?

The default command: python DeepDPM.py --dataset stl10 --init_k 3 --dir pretrained_embeddings/MOCO/STL10 --NIW_prior_nu 514 --prior_sigma_scale 0.05 --seed 0 --use_labels_for_eval --save_checkpoints True --exp_name 'test'

The result of the default command: NMI: 0.74268, ARI: 0.6775, acc: 0.793, final K: 10

The command with the changed batch size: python DeepDPM.py --dataset stl10 --init_k 3 --dir pretrained_embeddings/MOCO/STL10 --NIW_prior_nu 514 --prior_sigma_scale 0.05 --seed 0 --use_labels_for_eval --batch-size 512 --save_checkpoints True --exp_name 'test1'

The result of the command with the changed batch size: NMI: 0.52621, ARI: 0.29056, acc: 0.2952, final K: 3

ValueError: array must not contain infs or NaNs in DeepDPM_alternations.py

Hi, author.

I encountered the following error when running the DeepDPM_alternations.py script on MNIST.

The command I used is the following:

python DeepDPM_alternations.py --latent_dim 10 --dataset mnist --lambda_ 0.005 --lr 0.002 --init_k 3 --train_cluster_net 200 --alternate --init_cluster_net_using_centers --reinit_net_at_alternation --dir data --pretrain_path ./saved_models/ae_weights/mnist_e2e.zip --number_of_ae_alternations 3 --transform None --log_metrics_at_train True --gpus 1 --offline

The error I got is the following.

Epoch 38:  86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                         | 469/548 [00:29<00:04, 15.86it/s, loss=2.68, v_num=]running k-means on cuda..                                                                                                                                                                                                                         
[running kmeans]: 1it [00:00, 17.12it/s, center_shift=0.000000, iteration=1, tol=0.000100]
Evaluating...ns]: 0it [00:00, ?it/s, center_shift=0.000000, iteration=1, tol=0.000100]
NMI : 0.556786727054141, ARI: 0.3196873776276074, ACC: 0.3082, current K: 3
Epoch 38: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎| 546/548 [00:34<00:00, 15.77it/s, loss=2.68, v_num=]Evaluating...96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋       | 76/79 [00:02<00:00, 30.46it/s]
Evaluating...
Epoch 39:  86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                         | 469/548 [00:36<00:06, 12.94it/s, loss=2.67, v_num=]
Epoch 0:   0%|                                                                                                                                                                                                            | 0/548 [20:34<?, ?it/s]
Traceback (most recent call last):
  File "/data2/users/luomai/DeepDPM/DeepDPM_alternations.py", line 201, in <module>
    trainer.fit(model, train_loader, val_loader)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 633, in run_train
    self.train_loop.on_train_epoch_start(epoch)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 203, in on_train_epoch_start
    self.trainer.call_hook("on_train_epoch_start")
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1102, in call_hook
    output = hook_fx(*args, **kwargs)
  File "/data2/users/luomai/DeepDPM/src/AE_ClusterPipeline.py", line 250, in on_train_epoch_start
    self._init_clusters()
  File "/data2/users/luomai/DeepDPM/src/AE_ClusterPipeline.py", line 121, in _init_clusters
    self.clustering.init_cluster(self.train_dataloader(), self.val_dataloader(), logger=self.logger, centers=centers, init_num=self.init_clusternet_num)
  File "/data2/users/luomai/DeepDPM/src/clustering_models/clusternet.py", line 57, in init_cluster
    self.fit_cluster(train_loader, val_loader, logger, centers)
  File "/data2/users/luomai/DeepDPM/src/clustering_models/clusternet.py", line 65, in fit_cluster
    self.model.fit(train_loader, val_loader, logger, self.args.train_cluster_net, centers=centers)
  File "/data2/users/luomai/DeepDPM/src/clustering_models/clusternet_modules/clusternet_trainer.py", line 42, in fit
    cluster_trainer.fit(self.cluster_model, train_loader, val_loader)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 560, in run_training_epoch
    self.trainer.logger_connector.log_train_epoch_end_metrics(
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 415, in log_train_epoch_end_metrics
    self.training_epoch_end(model, epoch_output, num_optimizers)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 466, in training_epoch_end
    epoch_output = model.training_epoch_end(epoch_output)
  File "/data2/users/luomai/DeepDPM/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 478, in training_epoch_end
    ) = self.training_utils.comp_subcluster_params(
  File "/data2/users/luomai/DeepDPM/src/clustering_models/clusternet_modules/utils/training_utils.py", line 179, in comp_subcluster_params
    mus_sub, covs_sub, pi_sub = compute_mus_covs_pis_subclusters(
  File "/data2/users/luomai/DeepDPM/src/clustering_models/clusternet_modules/utils/clustering_utils/clustering_operations.py", line 311, in compute_mus_covs_pis_subclusters
    mus_sub, covs_sub, pi_sub_ = init_mus_and_covs_sub(codes=codes, k=k, n_sub=n_sub, logits=logits, logits_sub=logits_sub, how_to_init_mu_sub="kmeans_1d", prior=prior, use_priors=use_priors, device=codes.device)
  File "/data2/users/luomai/DeepDPM/src/clustering_models/clusternet_modules/utils/clustering_utils/clustering_operations.py", line 137, in init_mus_and_covs_sub
    pca_codes = pca.fit_transform(codes_k.detach().cpu())
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/sklearn/decomposition/_pca.py", line 407, in fit_transform
    U, S, Vt = self._fit(X)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/sklearn/decomposition/_pca.py", line 459, in _fit
    return self._fit_truncated(X, n_components, self._fit_svd_solver)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/sklearn/decomposition/_pca.py", line 580, in _fit_truncated
    U, S, Vt = randomized_svd(
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/sklearn/utils/extmath.py", line 407, in randomized_svd
    Uhat, s, Vt = linalg.svd(B, full_matrices=False)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/scipy/linalg/decomp_svd.py", line 108, in svd
    a1 = _asarray_validated(a, check_finite=check_finite)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/scipy/_lib/_util.py", line 293, in _asarray_validated
    a = toarray(a)
  File "/data2/users/luomai/anaconda3/envs/DeepDPM/lib/python3.9/site-packages/numpy/lib/function_base.py", line 488, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

`IndexError` when training on a custom dataset

I cloned the repository and trained a DeepDPM on a custom dataset (with pre-computed embeddings of dim 128).

I ran the DeepDPM.py file with init_k=150. At epoch 56 (after splitting clusters at epoch 55), I get the following error:

line 237, in cluster_loss_function
    gmm_k = MultivariateNormal(model_mus[k].double().to(device=self.device), model_covs[k].double().to(device=self.device))
IndexError: index 150 is out of bounds for dimension 0 with size 150

After splitting, the model creates more clusters (156) than model_mus and model_covs can handle (150).
@meitarronen Please help, how can I deal with this issue?

RuntimeError: All elements must be greater than (p-1)/2

Dear authors, I tried to use your implementation for research purposes and ran into an error during training.
I want to cluster vectors of SIFT descriptors (image keypoints). For this I packed the data in the same format as your examples (you can download it from the following link: https://drive.google.com/file/d/17TqIZLtlmSI2jiPl-haLE-_gpVVWoXhp/view?usp=sharing). As labels, I used the results of the k-means algorithm from scikit-learn.
I tried to train the model both without and with an initial cluster count (--init_k=100) and got the following error:

Evaluating...
Epoch 55:  77%|███████████████████████████████████████▏           | 3907/5079 [01:40<00:30, 38.90it/s, loss=0.238, v_num=]
Traceback (most recent call last):                                                                                        
  File "DeepDPM.py", line 457, in <module>
    train_cluster_net()
  File "DeepDPM.py", line 442, in train_cluster_net
    trainer.fit(model, train_loader, val_loader)
  File "/home/alexey/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/alexey/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/alexey/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/alexey/venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/alexey/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/home/alexey/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 560, in run_training_epoch
    self.trainer.logger_connector.log_train_epoch_end_metrics(
  File "/home/alexey/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 415, in log_train_epoch_end_metrics
    self.training_epoch_end(model, epoch_output, num_optimizers)
  File "/home/alexey/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 466, in training_epoch_end
    epoch_output = model.training_epoch_end(epoch_output)
  File "/home/alexey/programming/upwork/duplicates_filter/visual_matching_model/DeepDPM/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 492, in training_epoch_end
    split_decisions = split_step(
  File "/home/alexey/programming/upwork/duplicates_filter/visual_matching_model/DeepDPM/src/clustering_models/clusternet_modules/utils/clustering_utils/split_merge_operations.py", line 157, in split_step
    split_rule(
  File "/home/alexey/programming/upwork/duplicates_filter/visual_matching_model/DeepDPM/src/clustering_models/clusternet_modules/utils/clustering_utils/split_merge_operations.py", line 106, in split_rule
    log_ll_k = prior.log_marginal_likelihood(codes_k, mus[k])
  File "/home/alexey/programming/upwork/duplicates_filter/visual_matching_model/DeepDPM/src/clustering_models/clusternet_modules/utils/clustering_utils/priors.py", line 66, in log_marginal_likelihood
    return self.mus_covs_prior.log_marginal_likelihood(codes_k, mu_k)
  File "/home/alexey/programming/upwork/duplicates_filter/visual_matching_model/DeepDPM/src/clustering_models/clusternet_modules/utils/clustering_utils/priors.py", line 165, in log_marginal_likelihood
    - mvlgamma(torch.tensor(self.niw_nu) / 2.0, D)
RuntimeError: All elements must be greater than (p-1)/2

Can you suggest what the problem is?

I used the following command to run training:
python3 DeepDPM.py --dataset smallsift --latent_dim 128 --offline --gpus 0 --save_checkpoints=True --dir pretrained_embeddings/MOCO/

Best regards, Alexey
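For context on the error itself: the failing call is mvlgamma(torch.tensor(self.niw_nu) / 2.0, D), and torch.mvlgamma(x, p) raises exactly this RuntimeError whenever an element of x is not greater than (p - 1) / 2. So with 128-dimensional codes, niw_nu / 2 must exceed 63.5, i.e. the NIW prior's nu has to be larger than 127. A small nu (for instance the --prior_nu 12 used for 10-dimensional latents earlier on this page) would trigger it, and the STL-10 command above uses --NIW_prior_nu 514, which fits a nu of roughly the feature dimension plus 2. A quick check of the constraint (the nu values are just examples):

import torch

D = 128  # dimension of the clustered codes (--latent_dim here)
for nu in (12.0, float(D + 2)):
    try:
        torch.mvlgamma(torch.tensor(nu) / 2.0, D)
        print(f"nu={nu}: ok  (nu/2 = {nu / 2} > (D - 1)/2 = {(D - 1) / 2})")
    except RuntimeError as err:
        print(f"nu={nu}: {err}")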

Invalid Covariance Matrix Error in DeepDPM_alternations.py

Hello,

Thank you for the amazing work and for publishing the code.

I encountered the following error when running the DeepDPM_alternations.py script on MNIST.

The command I used is the following (copied from the example, only with a different --dir and --offline added):

python DeepDPM_alternations.py --latent_dim 10 --dataset mnist --lambda_ 0.005 --lr 0.002 --init_k 3 --train_cluster_net 200 --alternate --init_cluster_net_using_centers --reinit_net_at_alternation --dir ./dataset/ --pretrain_path ./saved_models/ae_weights/mnist_e2e.zip --number_of_ae_alternations 3 --transform None --log_metrics_at_train True --gpus 1 --epoch 1 --offline

The error I got is the following. I tried several times, and the error always happened around iteration 30 - 33.

Evaluating...
Epoch 30: 100%|█████████| 548/548 [00:17<00:00, 30.63it/s, loss=0.00362, v_num=]
Epoch 31:  86%|███████▋ | 469/548 [00:14<00:02, 31.33it/s, loss=0.00345, v_num=]Evaluating...
NMI : 0.49259220412522065, ARI: 0.28473904520954585, ACC: 0.30912, current K: 3
Validating: 0it [00:00, ?it/s]
Epoch 31:  86%|███████▋ | 470/548 [00:15<00:02, 30.05it/s, loss=0.00345, v_num=]
Epoch 0:   0%|                                          | 0/548 [07:41<?, ?it/s]
Traceback (most recent call last):
  File "/home/shichang/paper_reproduce/DeepDPM/DeepDPM_alternations.py", line 202, in <module>
    trainer.fit(model, train_loader, val_loader)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 633, in run_train
    self.train_loop.on_train_epoch_start(epoch)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 203, in on_train_epoch_start
    self.trainer.call_hook("on_train_epoch_start")
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1102, in call_hook
    output = hook_fx(*args, **kwargs)
  File "/home/shichang/paper_reproduce/DeepDPM/src/AE_ClusterPipeline.py", line 250, in on_train_epoch_start
    self._init_clusters()
  File "/home/shichang/paper_reproduce/DeepDPM/src/AE_ClusterPipeline.py", line 121, in _init_clusters
    self.clustering.init_cluster(self.train_dataloader(), self.val_dataloader(), logger=self.logger, centers=centers, init_num=self.init_clusternet_num)
  File "/home/shichang/paper_reproduce/DeepDPM/src/clustering_models/clusternet.py", line 57, in init_cluster
    self.fit_cluster(train_loader, val_loader, logger, centers)
  File "/home/shichang/paper_reproduce/DeepDPM/src/clustering_models/clusternet.py", line 65, in fit_cluster
    self.model.fit(train_loader, val_loader, logger, self.args.train_cluster_net, centers=centers)
  File "/home/shichang/paper_reproduce/DeepDPM/src/clustering_models/clusternet_modules/clusternet_trainer.py", line 42, in fit
    cluster_trainer.fit(self.cluster_model, train_loader, val_loader)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 577, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 726, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 166, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 177, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 131, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/home/shichang/paper_reproduce/DeepDPM/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 275, in validation_step
    cluster_loss = self.training_utils.cluster_loss_function(
  File "/home/shichang/paper_reproduce/DeepDPM/src/clustering_models/clusternet_modules/utils/training_utils.py", line 235, in cluster_loss_function
    gmm_k = MultivariateNormal(model_mus[k].double().to(device=self.device), model_covs[k].double().to(device=self.device))
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/home/shichang/anaconda3/envs/deepdpm/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter covariance_matrix (Tensor of shape (10, 10)) of distribution MultivariateNormal(loc: torch.Size([10]), covariance_matrix: torch.Size([10, 10])) to satisfy the constraint PositiveDefinite(), but found invalid values:
tensor([[ 0.5951,  0.0092,  0.2854,  0.0728, -0.2996,  0.1251, -0.1172, -0.0078,
          0.0765,  0.1235],
        [ 0.0092,  0.5513,  0.0643, -0.1219,  0.1693,  0.1630, -0.1415, -0.0218,
         -0.1041,  0.0764],
        [ 0.2854,  0.0643,  0.6676, -0.2186, -0.2203,  0.0510, -0.0280, -0.1126,
          0.1910,  0.1136],
        [ 0.0728, -0.1219, -0.2186,  0.8729, -0.1988, -0.1151, -0.0771,  0.1936,
         -0.1680,  0.1279],
        [-0.2996,  0.1693, -0.2203, -0.1988,  0.7979, -0.0535,  0.0689, -0.0186,
         -0.1713, -0.0783],
        [ 0.1251,  0.1630,  0.0510, -0.1151, -0.0535,  0.4169, -0.1412,  0.0098,
          0.0774,  0.1063],
        [-0.1172, -0.1415, -0.0280, -0.0771,  0.0689, -0.1412,  0.5972, -0.0558,
          0.0094, -0.0770],
        [-0.0078, -0.0218, -0.1126,  0.1936, -0.0186,  0.0098, -0.0558,  0.4313,
          0.0730,  0.0873],
        [ 0.0765, -0.1041,  0.1910, -0.1680, -0.1713,  0.0774,  0.0094,  0.0730,
          0.6265,  0.0646],
        [ 0.1235,  0.0764,  0.1136,  0.1279, -0.0783,  0.1063, -0.0770,  0.0873,
          0.0646,  0.4388]], device='cuda:1', dtype=torch.float64)
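
For readers who hit the same PositiveDefinite error: a per-cluster covariance estimate can become numerically non-positive-definite, for example when a cluster ends up with very few or nearly collinear points after a split. Below is a minimal, hypothetical workaround sketch (not part of the official code) that symmetrizes the matrix and retries the MultivariateNormal construction with a growing diagonal jitter:

```python
import torch
from torch.distributions import MultivariateNormal

def safe_mvn(mu, cov, max_tries=5, base_jitter=1e-6):
    """Hypothetical helper: build a MultivariateNormal, retrying with an
    increasingly large diagonal jitter if the covariance is not PD."""
    cov = 0.5 * (cov + cov.transpose(-1, -2))  # enforce exact symmetry
    eye = torch.eye(cov.shape[-1], dtype=cov.dtype, device=cov.device)
    for i in range(max_tries):
        try:
            return MultivariateNormal(mu, covariance_matrix=cov + base_jitter * (10 ** i) * eye)
        except (ValueError, RuntimeError):
            continue
    raise ValueError("covariance could not be regularized to be positive definite")
```

Such a helper would be called in place of the direct MultivariateNormal construction in cluster_loss_function; it only suppresses the numerical symptom, so if it fires often the underlying cause is usually a degenerate or nearly empty cluster.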

Custom Dataset

Hi, may I know what the steps are if I want to train using my own dataset?
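
While waiting for an official answer, a minimal sketch of preparing embedded data, assuming the model expects train_data.pt / test_data.pt tensors in the directory passed via --dir (the exact filenames are an assumption based on the README and other issues and should be checked against the data loading code):

```python
import os
import torch

out_dir = "data/my_dataset"
os.makedirs(out_dir, exist_ok=True)

# Hypothetical stand-ins for your own embeddings: (N, D) float tensors.
train_features = torch.randn(1000, 128)
test_features = torch.randn(200, 128)

# Assumed filenames; verify against the repository's data loading code.
torch.save(train_features, os.path.join(out_dir, "train_data.pt"))
torch.save(test_features, os.path.join(out_dir, "test_data.pt"))
```

DeepDPM.py would then be pointed at this directory via --dir data/my_dataset.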

Questions about AttributeError: Missing attribute "user_labels_for_eval"

Hello, thank you for your excellent work! I have just started running your program by following the README, but I have encountered some problems.

First, I ran 'python DeepDPM.py --dataset synthetic --log_emb every_n_epochs --log_emb_every 1' as instructed, but got the error AttributeError: Missing attribute "user_labels_for_eval". I don't know how to set user_labels_for_eval. (I remember the documentation said this was not needed for training, so I don't understand why the error is raised.)

Secondly, you mentioned that the dataset does not need to be downloaded manually when training on raw data, and that it will be downloaded automatically to "data" ('When training on raw data (e.g., on MNIST, Reuters10k) the data for MNIST will be automatically downloaded to the "data" directory. For reuters10k, the user needs to download the dataset independently (available online) into the "data" directory.'). Does this mean training downloads the original dataset directly from the web? I ask because I did not see a "data" folder (my understanding is that the program should automatically download the dataset and create the corresponding folder).

Thank you for your answer.
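
A generic, hypothetical workaround sketch for the missing-attribute part of this issue (not the official fix): give the attribute a default before the trainer touches it. The name below follows the README's --use_labels_for_eval flag; the issue title's "user_labels_for_eval" is likely the same option, but verify against the argparse definitions in DeepDPM.py.

```python
from argparse import Namespace

args = Namespace(dataset="synthetic")  # stand-in for the script's parsed arguments

# Hypothetical defensive default: if the attribute was never defined by the
# argument parser, fall back to evaluating without ground-truth labels.
if not hasattr(args, "use_labels_for_eval"):
    args.use_labels_for_eval = False
```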

about your result file

Hello, I am very interested in your work! I would like to ask about the seed settings for the USPS and FASHION datasets, as I want to build on your work. Would it be possible to share the result files?
In the paper you state that the same experimental settings were used for both the balanced and imbalanced versions of the three datasets, but we could not reproduce your results. We tuned the split_merge_every_n_epochs and prior_sigma_scale parameters, but the numbers were still worse, and the results on the imbalanced version of USPS could not be reproduced at all. Could you please give some suggestions on the parameter settings?
Looking forward to your reply. Thank you!
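
For readers attempting similar reproductions, a minimal seeding sketch (a generic recipe, not the authors' protocol) using PyTorch Lightning's seed_everything, which the repository already depends on:

```python
import pytorch_lightning as pl
import torch

SEED = 0  # hypothetical value; the seeds used for the paper's tables are not stated here

pl.seed_everything(SEED)                   # seeds python, numpy and torch RNGs
torch.backends.cudnn.deterministic = True  # trade speed for reproducibility
torch.backends.cudnn.benchmark = False
```

Even with fixed seeds, nondeterministic CUDA kernels and differing package or driver versions can shift results slightly, so matching reported numbers exactly may also require matching the environment.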

Training gets stuck in a loop [running kmeans]

I have 18k training samples with an embedding dimension of 4. After about 150 epochs and a few kmeans splitting operations, training gets stuck in a loop during another kmeans operation, with center_shift=nan.
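
Not an official fix, but a center_shift of nan during K-means usually points to non-finite values in the features or to a cluster that lost all its points. A small, hypothetical diagnostic sketch using the kmeans-pytorch package the repository installs:

```python
import torch
from kmeans_pytorch import kmeans

# Stand-in for the real embeddings; replace with e.g. torch.load(<your train_data file>).
X = torch.randn(18000, 4)

# 1) nan/inf anywhere in the features will propagate into center_shift.
assert torch.isfinite(X).all(), "embeddings contain nan/inf"

# 2) With only 4 dimensions, duplicated points can leave a split cluster with
#    no support; counting unique points is a cheap sanity check.
print(X.shape[0], "points,", torch.unique(X, dim=0).shape[0], "unique")

# 3) Run the problematic split in isolation to see whether kmeans alone converges.
labels, centers = kmeans(X=X, num_clusters=2, distance="euclidean",
                         device=torch.device("cpu"))
```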

kmeans

How to generate .zip file as shown in the path of "\saved_models\ae_weights\" for a custom dataset?

Thank you for sharing your great work! I have tried the newest version of the package; it works for clustering the benchmark datasets, but I still have difficulties with my custom dataset. After preparing a custom dataset, the first step seems to be learning the weights of an autoencoder on that dataset and saving the corresponding .zip file under "\saved_models\ae_weights". To generate train_data.pt, we need to run "make_embbedings.py" or "make_umap_embeddings.py". But I found that "make_embbedings.py" has a parameter "--pretrain_path" and make_umap_embeddings.py has a parameter "--ae_pretrain_path", which means the autoencoder weights need to be prepared in advance. Could you please share the corresponding script, or explain how to do this in more detail? Looking forward to your guidance.
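
In the absence of an official pretraining script here, one possible route is to train your own autoencoder and save its weights with torch.save, which writes a zip-format archive by default in recent PyTorch versions, and then point --pretrain_path / --ae_pretrain_path at that file. A rough sketch with a hypothetical architecture; the layer sizes must match the repository's --hidden_dims / --latent_dim settings for the weights to be loadable, and the expected state-dict keys should be checked against its AE definition:

```python
import os
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class AE(nn.Module):
    """Hypothetical autoencoder; match dims to the repo's --hidden_dims / --latent_dim."""
    def __init__(self, input_dim=784, hidden_dim=512, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

X = torch.randn(1000, 784)  # stand-in for the flattened custom dataset
loader = DataLoader(TensorDataset(X), batch_size=128, shuffle=True)

model = AE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(10):
    for (batch,) in loader:
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(batch), batch)  # reconstruction loss
        loss.backward()
        opt.step()

os.makedirs("saved_models/ae_weights", exist_ok=True)
# torch.save produces a zip-format archive; the filename and expected key names
# are assumptions and should be verified against the repository's loading code.
torch.save(model.state_dict(), "saved_models/ae_weights/my_custom_ae.zip")
```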
