lightning-universe / lightning-bolts

Toolbox of models, callbacks, and datasets for AI/ML researchers.

Home Page: https://lightning-bolts.readthedocs.io

License: Apache License 2.0

Python 99.93% Makefile 0.07%
pytorch ai machine-learning image-processing natural-language-processing supervised-learning gan

lightning-bolts's Introduction

Deep Learning components for extending PyTorch Lightning


Installation · Latest Docs · Stable Docs · About · Community · Website · License



Getting Started

Pip / Conda

pip install lightning-bolts
Other installations

Install bleeding-edge (no guarantees)

pip install https://github.com/Lightning-Universe/lightning-bolts/archive/refs/heads/master.zip

To install all optional dependencies

pip install "lightning-bolts[extra]"

What is Bolts?

The Bolts package provides a variety of components to extend PyTorch Lightning, such as callbacks and datasets, for applied research and production.

Example 1: Accelerate Lightning Training with the Torch ORT Callback

Torch ORT converts your model into an optimized ONNX graph, speeding up training & inference when using NVIDIA or AMD GPUs. See the documentation for more details.

from pytorch_lightning import LightningModule, Trainer
import torchvision.models as models
from pl_bolts.callbacks import ORTCallback


class VisionModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.model = models.vgg19_bn(pretrained=True)

    ...


model = VisionModel()
trainer = Trainer(gpus=1, callbacks=ORTCallback())
trainer.fit(model)

Example 2: Introduce Sparsity with the SparseMLCallback to Accelerate Inference

We can introduce sparsity during fine-tuning with SparseML, which ultimately allows us to leverage the DeepSparse engine to see performance improvements at inference time.

from pytorch_lightning import LightningModule, Trainer
import torchvision.models as models
from pl_bolts.callbacks import SparseMLCallback


class VisionModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.model = models.vgg19_bn(pretrained=True)

    ...


model = VisionModel()
trainer = Trainer(gpus=1, callbacks=SparseMLCallback(recipe_path="recipe.yaml"))
trainer.fit(model)

Are specific research implementations supported?

We'd like to encourage users to contribute general components that will help a broad range of problems; however, components that help specific domains will also be welcomed!

For example, a callback to help train SSL models would be a great contribution; however, the next greatest SSL model from your latest paper would be a good contribution to Lightning Flash.

Use Lightning Flash to train, predict and serve state-of-the-art models for applied research. We suggest looking at our VISSL Flash integration for SSL-based tasks.

Contribute!

Bolts is supported by the PyTorch Lightning team and the PyTorch Lightning community!

Join our Slack and/or read our CONTRIBUTING guidelines to get help becoming a contributor!


License

Please observe the Apache 2.0 license that is listed in this repository. In addition, the Lightning framework is Patent Pending.

lightning-bolts's People

Contributors

akihironitta, ananyahjha93, aniketmaurya, annikabrundyn, baruchg, blahblahhhj, borda, briankosw, chris-clem, clementpoiret, deepsource-autofix[bot], dependabot[bot], djbyrne, garryod, hassiahk, hecoding, lijm1358, matsumotosan, nateraw, oke-aditya, otaj, pre-commit-ci[bot], rohitgr7, senarvi, shivammehta25, sidhantls, teddykoker, williamfalcon, zhengyu-yang, zlapp


lightning-bolts's Issues

SimCLR Imagenet Results Reproduction

Hi,

Firstly thank you for this very cool repository.
I would like to use your implementation of SimCLR to reproduce the ImageNet results.

I saw the following command in the implementation:
# imagenet
python simclr_module.py \
    --gpus 8 \
    --dataset imagenet2012 \
    --data_dir /path/to/imagenet/ \
    --meta_dir /path/to/folder/with/meta.bin/ \
    --batch_size 32

Does this command match the paper's results? How long does the self-supervised training take? Are there any stats that could be added for reference (hardware used, training times, etc.)?

Thanks :)

Incorrect model link for CPCV2-resnet18?

🐛 Bug

Following the docs https://pytorch-lightning-bolts.readthedocs.io/en/latest/self_supervised_models.html I was loading a pretrained ResNet-18 for CPCV2, but it seems that it returns a ResNet-101. Maybe the URL to the trained model is incorrect?


There also seems to be an unrelated bug when it loads the model in file https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/32fb560a429532dfb40a5935ca7674990dae1f66/pl_bolts/models/self_supervised/cpc/cpc_module.py
Line 160 should be elif encoder not in available_weights?

Add logger for Azure Machine Learning

(Migrated over from Add logger for Azure Machine Learning #3128.)

🚀 Feature

We already have PyTorch Lightning loggers for Comet, MLFlow, and Neptune. It would be great if we also had one for Azure Machine Learning (AML).

Motivation

A data scientist hoping to use PyTorch Lightning in AML currently has to build their own "logger adapter" to get logs from their training runs to show up in the AML "metrics" UI.

It would be great if a user could just "drop in" an AML logger and get those kinds of metrics for free.

Pitch

The AML logging API is very similar to PyTorch Lightning's, with the small caveat that AML uses the terms "experiment" and "run" slightly differently than PyTorch Lightning does. I've even coded up a preliminary implementation with unit tests here: dkmiller/pytorch-lightning-bolts.
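
A minimal sketch of the idea (not the implementation from the linked fork), assuming the pre-2.0 LightningLoggerBase API and azureml.core's Run.get_context()/run.log():

import pytorch_lightning as pl
from pytorch_lightning.loggers import LightningLoggerBase
from pytorch_lightning.utilities import rank_zero_only


class AzureMLLogger(LightningLoggerBase):
    """Forward Lightning metrics to the active Azure ML run."""

    def __init__(self):
        super().__init__()
        from azureml.core.run import Run

        self._run = Run.get_context()  # the active AML run (or an offline stub)

    @property
    def experiment(self):
        return self._run

    @rank_zero_only
    def log_hyperparams(self, params):
        params = params if isinstance(params, dict) else vars(params)
        for key, value in params.items():
            self._run.log(key, value)

    @rank_zero_only
    def log_metrics(self, metrics, step=None):
        for name, value in metrics.items():
            self._run.log(name, value)

    @property
    def name(self):
        return "azureml"

    @property
    def version(self):
        return self._run.id


trainer = pl.Trainer(logger=AzureMLLogger())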

Alternatives

The only alternative I can think of is for individual users to build their own AML :: PyTorch Lightning loggers.

Additional context

Full disclosure: I'm a data scientist in the Azure Machine Learning team.

Fastprogress callback

🚀 Feature

Copied from Lightning-AI/pytorch-lightning#1500 by @tcwalther

Use https://github.com/fastai/fastprogress instead of tqdm for displaying progress bars.

Motivation

tqdm doesn't work well with Jupyter Lab. This creates issues with pytorch-lightning's progress bars, where each validation bar creates a newline in the cell output: #1399. This is a well-known issue in jupyter-widgets: jupyter-widgets/ipywidgets#1845

The Fast.ai people developed fastprogress as a replacement for exactly this reason.

Pitch

Option to replace tqdm with fastprogress.

Alternatives

The alternative is to wait for jupyter-widgets to fix the issue on their side. Given that this issue has been around since November 2017, I'm not too hopeful that it will happen, though. It looks like this would require major design changes in ipywidgets and/or jupyter-lab.

Additional context

Motivated by #1399.

Add auto_scale_batch_size for DataModule

🚀 Feature

Could you please add the auto_scale_batch_size functionality to the DataModule class? Currently, when the data part is separated from the model, it is impossible to auto-scale the batch size.

The following error is returned:

MisconfigurationException: Field batch_size not found in both `model` and `model.hparams`

Motivation

This functionality is very helpful when training models, and I guess porting what was already implemented for LightningModule to DataModule would be fairly simple.

How it should work

Exactly as in the LightningModule class: I should only need to add a batch_size field to the class, and the framework should take care of everything else. Example below:

import pytorch_lightning as pl
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms


class FashionMNISTDataModule(pl.LightningDataModule):
    batch_size = 1  # only this should need to be initialised

    def __init__(self, root='./data/FashionMNIST'):
        super().__init__()
        self.root = root

    def setup(self, stage=None):
        self.trainset = torchvision.datasets.FashionMNIST(
            root=self.root,
            train=True,
            download=True,
            transform=transforms.Compose([transforms.ToTensor()]))

        self.valset = torchvision.datasets.FashionMNIST(
            root=self.root,
            train=False,
            download=True,
            transform=transforms.Compose([transforms.ToTensor()]))

    def train_dataloader(self):
        return DataLoader(self.trainset, batch_size=self.batch_size, num_workers=4, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.valset, batch_size=self.batch_size, num_workers=4)

    def test_dataloader(self):
        return DataLoader(self.valset, batch_size=self.batch_size)


dm = FashionMNISTDataModule()

trainer = pl.Trainer(max_epochs=10,
                     gpus=1,
                     auto_scale_batch_size='power')

trainer.fit(model, dm)  # model is any LightningModule

visdom logger

🚀 Feature

Comet, MLFlow, Neptune, TensorBoard, test-tube, WandB, and CSV are the supported loggers as of 08/31/2020. It would be great if Visdom was supported too.

Motivation

Just a matter of preference. One nice feature of Visdom is that it allows users to organize views. Visdom also caches state.

Pitch

An API that looks like the other loggers, but uses Visdom instead.

Alternatives

The most reasonable way to implement this would be to add a Visdom logger here:
https://github.com/PyTorchLightning/pytorch-lightning/tree/master/pytorch_lightning/loggers

Additional context

n/a

SimCLR doesn't seem to utilize GPU for forward/backward passes

🐛 Bug

When running the pl_bolts SimCLR implementation on CIFAR-10 and my own dataset, I see very low GPU utilization, and only at the very beginning of the validation sanity check. After that, nvidia-smi reports that the job holds on to a bit of GPU memory but never shows GPU utilization above 0%.

There was also a missing import of ArgumentParser here https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/977f2ed92694d9de1d05b4c53efbf149192505b4/pl_bolts/models/self_supervised/simclr/simclr_module.py#L247

To Reproduce

Run the following:

python -m pl_bolts.models.self_supervised.simclr.simclr_module --dataset cifar10 --max_epochs=1 --profiler=True

Watch GPU utilization via nvidia-smi. I see a small spike at the start of validation sanity check and then 0%.

Code sample

See above

Expected behavior

Higher GPU Utilization

Environment

pytorch-lightning 0.7.6
pytorch-lightning-bolts 0.1.0.dev8
torch 1.5.0
torchvision 0.6.0a0+82fd1c8

  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): AWS Deep Learning AMI w/ pytorch_latest_py36 conda pre-built environment
  • Build command you used (if compiling from source): Pre-built in AMI
  • Python version: 3.6.10
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: Tesla K80, driver 440.33.01
  • Any other relevant information:

[BUG] SimCLR NT Xent loss does not take into account batches from other DDP processes

🐛 Bug

In the NT Xent loss, out_1 and out_2 are not gathered over the whole DDP process group. This is a big issue as the loss is only classifying the correct pair over local_batch_size*2 possible pairs instead of over world_size*local_batch_size*2 possible pairs.

Code Sample

See "Gather hidden1/hidden2 across replicas and create local labels." comment in original implementation:

def add_contrastive_loss(hidden,
                         hidden_norm=True,
                         temperature=1.0,
                         tpu_context=None,
                         weights=1.0):
  """Compute loss for model.
  Args:
    hidden: hidden vector (`Tensor`) of shape (2 * bsz, dim).
    hidden_norm: whether or not to use normalization on the hidden vector.
    temperature: a `floating` number for temperature scaling.
    tpu_context: context information for tpu.
    weights: a weighting number or vector.
  Returns:
    A loss scalar.
    The logits for contrastive prediction task.
    The labels for contrastive prediction task.
  """
  # Get (normalized) hidden1 and hidden2.
  if hidden_norm:
    hidden = tf.math.l2_normalize(hidden, -1)
  hidden1, hidden2 = tf.split(hidden, 2, 0)
  batch_size = tf.shape(hidden1)[0]

  # Gather hidden1/hidden2 across replicas and create local labels.
  if tpu_context is not None:
    hidden1_large = tpu_cross_replica_concat(hidden1, tpu_context)
    hidden2_large = tpu_cross_replica_concat(hidden2, tpu_context)
    enlarged_batch_size = tf.shape(hidden1_large)[0]
    # TODO(iamtingchen): more elegant way to convert u32 to s32 for replica_id.
    replica_id = tf.cast(tf.cast(xla.replica_id(), tf.uint32), tf.int32)
    labels_idx = tf.range(batch_size) + replica_id * batch_size
    labels = tf.one_hot(labels_idx, enlarged_batch_size * 2)
    masks = tf.one_hot(labels_idx, enlarged_batch_size)
  else:
    hidden1_large = hidden1
    hidden2_large = hidden2
    labels = tf.one_hot(tf.range(batch_size), batch_size * 2)
    masks = tf.one_hot(tf.range(batch_size), batch_size)

  logits_aa = tf.matmul(hidden1, hidden1_large, transpose_b=True) / temperature
  logits_aa = logits_aa - masks * LARGE_NUM
  logits_bb = tf.matmul(hidden2, hidden2_large, transpose_b=True) / temperature
  logits_bb = logits_bb - masks * LARGE_NUM
  logits_ab = tf.matmul(hidden1, hidden2_large, transpose_b=True) / temperature
  logits_ba = tf.matmul(hidden2, hidden1_large, transpose_b=True) / temperature

  loss_a = tf.losses.softmax_cross_entropy(
      labels, tf.concat([logits_ab, logits_aa], 1), weights=weights)
  loss_b = tf.losses.softmax_cross_entropy(
      labels, tf.concat([logits_ba, logits_bb], 1), weights=weights)
  loss = loss_a + loss_b

  return loss, logits_ab, labels
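
For reference, a minimal PyTorch sketch of the kind of fix (assuming torch.distributed is initialized under DDP; not the exact code used in Bolts): gather out_1/out_2 across ranks with an autograd-aware all-gather before computing the NT-Xent logits.

import torch
import torch.distributed as dist


class GatherLayer(torch.autograd.Function):
    """All-gather with gradients: concatenates tensors from every DDP rank in the
    forward pass, and hands each rank its own slice of the gradient in backward."""

    @staticmethod
    def forward(ctx, tensor):
        gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, tensor)
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        local_bs = grad.shape[0] // dist.get_world_size()
        rank = dist.get_rank()
        return grad[rank * local_bs:(rank + 1) * local_bs]


def gather_embeddings(out_1, out_2):
    """Return world_size * local_batch_size embeddings so NT-Xent sees all negatives."""
    if dist.is_available() and dist.is_initialized():
        out_1 = GatherLayer.apply(out_1)
        out_2 = GatherLayer.apply(out_2)
    return out_1, out_2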

Minimal Requirement Tests Hanging

🐛 Bug

Tests run using the minimal requirements hang and are eventually canceled by CI. When run with the latest requirements, tests complete in ~5 minutes.

The following behavior has been observed:

  • resnet128 gets stuck when run with all the tests, but not when you run pytest test_resnet.py
  • self_supervised hangs both as a group test and single test
  • gans fails when run with minimal requirements.txt
  • test_dev_dataset hangs when run as a group or as a single test

To Reproduce

run unit tests with minimal requirements

Optimise RL Code

🚀 Feature

There seem to be a fair few inefficiencies in the RL model code.

In both the VPG and DQN code, the network is computed twice, once to generate the trajectory and then once again in the loss function.

https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/master/pl_bolts/datamodules/experience_source.py#L165

https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/master/pl_bolts/models/rl/vanilla_policy_gradient_model.py#L146
https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/master/pl_bolts/losses/rl.py#L35

Because of the way PyTorch stores the computational graph, it is sufficient to run the network once when generating the trajectory, store the output, and compute the loss on that output at each training step. The current approach needlessly doubles the computational cost.
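
A rough sketch of the single-pass approach for VPG (hypothetical helper names, old gym 4-tuple step API assumed): keep the log-probabilities produced during the rollout in the graph and reuse them in the loss.

import torch
from torch.distributions import Categorical


def rollout(policy, env, n_steps):
    """Run the policy once per environment step and cache log-probs for the loss."""
    log_probs, rewards = [], []
    state = env.reset()
    for _ in range(n_steps):
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))  # stays in the autograd graph
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
        if done:
            state = env.reset()
    return log_probs, rewards


def vpg_loss(log_probs, returns):
    # no second forward pass over the states: reuse the cached log-probs
    return -(torch.stack(log_probs) * torch.as_tensor(returns, dtype=torch.float32)).mean()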

Furthermore, in both the VPG and DQN code, multiple environments are allowed, but no parallelisation is applied across them. This takes away a significant proportion of the advantage of using multiple environments in the first place:
https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/master/pl_bolts/models/rl/vanilla_policy_gradient_model.py#L202

batch_size for MNISTDataModule

🚀 Feature

Add the batch_size parameter to the MNISTDataModule and BinaryMNISTDataModule.

Motivation

When using the MNISTDataModule there is no way to set the batch size if it is used directly within PyTorch Lightning (i.e. as argument to Trainer.fit or as _datamodule field inside a LightningModule).

Pitch

I would like to be able to set the batch size when initializing an MNISTDataModule as I can with many other DataModules right now (like CIFAR10DataModule or ImagenetDataModule).

Alternatives

An alternative way to set the batch size would be not to feed the DataModule to the trainer directly (or use it as the _datamodule field), but to call the train_dataloader, val_dataloader and test_dataloader methods separately. However, I think that would defeat one of the points of using a DataModule.

Additional context

A possible implementation could look like the one shown in Lightning's docs; I would be happy to open a PR and work on this.
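
For illustration, the desired usage would mirror the DataModules that already take a batch size at construction (the MNIST line is the proposed API, not something that works today):

from pl_bolts.datamodules import CIFAR10DataModule, MNISTDataModule

# CIFAR10DataModule already accepts batch_size in its constructor ...
cifar10_dm = CIFAR10DataModule(data_dir="./data", batch_size=64)

# ... the request is for MNISTDataModule / BinaryMNISTDataModule to do the same:
# mnist_dm = MNISTDataModule(data_dir="./data", batch_size=64)  # proposed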

Multi Env Runs for RL models

🚀 Feature

Carry out rollouts of environment steps across multiple environments instead of just one.

Motivation

This will allow the value based agents to gather more IID experiences and increase the batch size, thus improving training time.

Pitch

Each training step carries out a step in each environment in the environment pool. These experiences are added to a single buffer from which the DataLoader pulls the batches used in training, as sketched below.
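
A rough sketch of that rollout loop (hypothetical agent and buffer, old gym 4-tuple step API assumed):

import gym

envs = [gym.make("CartPole-v1") for _ in range(4)]
states = [env.reset() for env in envs]
buffer = []  # single shared experience buffer the DataLoader would sample from


def step_all_envs(agent):
    """Advance every environment in the pool by one step per training step."""
    for i, env in enumerate(envs):
        action = agent(states[i])
        next_state, reward, done, _ = env.step(action)
        buffer.append((states[i], action, reward, done, next_state))
        states[i] = env.reset() if done else next_state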

GPUs are not utilized in any self_supervised models without the --gpus flag

🐛 Bug

Running something like CUDA_VISIBLE_DEVICES=0 python simclr_module.py allocates memory on the GPU but doesn't put any batches or forward/backward passes on it. Reported specifically for SimCLR in #35.

This issue also percolates to the current tests in tests/models/test_self_supervised.py, where tests pass without ever running the model on GPUs. A test fix is expected in #122.

To Reproduce

Steps to reproduce the behavior:

  1. python simclr_module.py

'simclr_module.py' can be replaced by amdim/cpc etc.


Expected behavior

The default params should put the model on the GPU, and if the --gpus flag is not set, the Lightning trainer should not report GPU used as True.

Might well be a lightning bug and not a bolts bug.

Polyaxon Logger

🚀 Feature

I would like to have a logger for the Polyaxon platform, similar to the existing loggers for MLflow, WandB, TensorBoard and others.

Motivation

Polyaxon is a fast-growing, large-scale, open-source ML platform that I've been using for a couple of months; it is similar to WandB but with a larger scope.

Pitch

A new logger that would let the user use Polyaxon as an experiment management system with ease.

Additional context

To learn more about Polyaxon, you can visit the GitHub repo or the website.

Use Pytorch Lightning's LightningDataModule

🚀 Feature

Refactor to reflect the built-in LightningDataModule that has been implemented directly in Lightning, by removing the duplicated code here and replacing it with references to pl.LightningDataModule.

Motivation

PyTorch Lightning 0.9.0 implements the LightningDataModule directly in Lightning. We have no need for the Bolts implementation once that version is released, so we should update the code here accordingly.

Pitch

I suggest we merge a fix for this as soon as the stable version of 0.9.0 has been released. We could do it earlier, but something bothers me about having a prerelease in requirements.txt 😅

Alternatives

Additional context

The implementation is nearly the same and should cause no conflicts.

  • The code can be found here
  • The docs can be found here

Question regarding reproducibility of train, validation splits

Are the train/validation splits of the dataset guaranteed to be reproducible? From what I see, if I'm not missing something, the randomness depends on random_split, which uses PyTorch's default_generator. If a new PyTorch version changes this, is it possible that the MNISTDataModule, for example, produces different splits under two different torch versions?
Thanks, and sorry if this is obvious!

Add Cityscapes DataModule

🚀 Feature

A new DataModule class for the Cityscapes dataset.

Motivation

Pitch

Alternatives

Additional context

As discussed here

Base Networks Module

🚀 Feature

Have a series of base networks that are configurable in order to act as a base or backbone for other bolts models

Motivation

Many models use a common CNN/RNN/MLP backbone which is then added to or built on top of. I think it would be really useful to have a common module containing base implementations of common models that are configurable and usable across a variety of models. This will make bolts implementations more modular and make it easier to extend models for our users.

Pitch

Have a networks module in Bolts. This would allow users to request a basic network structure for base networks like a CNN or MLP.
The network object would take in a dict or config file with network info (i.e. layers, nodes, input, output), and the appropriate network would be instantiated and returned; a rough sketch follows.
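
A minimal sketch of what such a factory could look like for an MLP (the config keys are illustrative, not a settled schema):

import torch.nn as nn


def build_mlp(config):
    """Build a simple MLP backbone from a config dict."""
    sizes = [config["input_dim"], *config["hidden_dims"], config["output_dim"]]
    layers = []
    for in_dim, out_dim in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing activation


backbone = build_mlp({"input_dim": 784, "hidden_dims": [256, 128], "output_dim": 10})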

Alternatives

Each specific model creates model code as needed. This may end up having several redundant networks in the repo.

Linear classification performance on CIFAR10/STL10 DataModules drops from Lightning 0.8.5/Bolts 0.1.0 to Lightning 0.9.0/Bolts 0.1.1

🐛 Bug

Linear classification performance on CIFAR10/STL10 Bolts DataModules drops from Lightning 0.8.5/Bolts 0.1.0 to Lightning 0.9.0/Bolts 0.1.1.

I know you do careful performance-matching testing for the different Lightning versions, so I thought I'd post this here too thinking it could be from something in the DataModules changing rather than Lightning. However, since switching Bolts versions gave errors without switching Lightning versions, I can't be certain about that. It could also be some change in defaults.

To Reproduce

Steps to reproduce the behavior:

  1. Train Moco on (unlabeled data from) STL10 DataModule with the Bolts implementation using resnet18 (I can provide a pretrained model of this part if it would be helpful)
  2. Load trained model, freeze weights, set .eval()
  3. Add a linear classifier on top of the penultimate layer representation, train with (labeled data from) STL10 DataModule with traditional self-supervised performance evaluation. (Train for 100 epochs with batch size of 256 and initial learning rate of 0.12; decay learning rate by a factor of 0.2 at epochs 60,80.)
  4. Observe substantial drop in performance

On STL10, I am getting a performance drop of 16% top-1 validation accuracy (88->72), and a 0.5 increase in cross-entropy loss (0.3 to 0.8). On CIFAR10 I get a smaller, but consistent drop of 2% top-1 validation accuracy. Both of these are using the exact same code, just with switching conda envs. (The reason for changing both the Lightning version and the Bolts version is that it seems required for compatibility between the two.)

Code sample

I can provide pretrained models on CIFAR10 and STL10 if that would be helpful, along with my full hyperparameter configs, but not the code at this time (sorry!). It's just a linear classifier though.

Expected behavior

Expected: same performance across versions.

Environment

  • PyTorch Version (e.g., 1.0): 1.5.1 with the former/1.6.0 for the latter (compatibility)
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): conda
  • Build command you used (if compiling from source):
  • Python version: 3.8.3
  • CUDA/cuDNN version:
  • GPU models and configuration: 2080Ti
  • Any other relevant information:

Additional context

[Representation Learning] MoCo licensing issues

📚 Documentation

MoCo is licensed under "Attribution-NonCommercial 4.0 International". There should be a mention somewhere that the modified version in pytorch-lightning-bolts is also non-commercial.

Validation epochs for RL models

🚀 Feature

Currently there is only train and test functionality. The models should also have a validation run per pseudo-epoch.

Motivation

The metrics are currently based on the moving avg training reward, but that doesn't show the true performance of the agent, especially in the case of the e-greedy agents.

Pitch

Every 1000 training epochs (which is essentially a step with the iterable dataset) a series of validation episodes should be run

Additional context

Previously, I found some complications when trying to get the validation runs working with the iterable dataset; however, it seems that recent updates in PL should resolve these issues.

Using gpus gives error for BasicGAN

🐛 Bug

Demo code below:

import argparse
from pytorch_lightning import Trainer
from pl_bolts.models.gans import BasicGAN

trainer = Trainer()
model = BasicGAN()

parser = argparse.ArgumentParser(description='demo')
parser = trainer.add_argparse_args(parser)
args = parser.parse_args()

trainer = Trainer.from_argparse_args(args)
trainer.fit(model)

Running with GPUs from the CLI, e.g. python demo.py --gpus 1, gives the following error:

  File "/home/tushar/anaconda3/envs/aer/lib/python3.6/site-packages/pl_bolts/models/gans/basic/basic_gan_pl_module.py", line 105, in discriminator_step
    d_loss = self.discriminator_loss(x)
  File "/home/tushar/anaconda3/envs/aer/lib/python3.6/site-packages/pl_bolts/models/gans/basic/basic_gan_pl_module.py", line 97, in discriminator_loss
    self.discriminator(self.generated_imgs.detach()), fake)
  File "/home/tushar/anaconda3/envs/aer/lib/python3.6/site-packages/pl_bolts/models/gans/basic/basic_gan_pl_module.py", line 57, in adversarial_loss
    return F.binary_cross_entropy(y_hat, y)
  File "/home/tushar/anaconda3/envs/aer/lib/python3.6/site-packages/torch/nn/functional.py", line 2077, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #2 'target' in call to _thnn_binary_cross_entropy_forward

It looks like the labels (y) are not being sent to the GPU (never moved to the model's device).
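
A sketch of the usual remedy (illustrative helper, not the exact Bolts code): build the real/fake label tensors on the same device and dtype as the discriminator output, e.g. with ones_like/zeros_like or type_as.

import torch
import torch.nn.functional as F


def adversarial_loss(discriminator, imgs, target_is_real):
    """Illustrative fix: create the target labels on the same device as the predictions."""
    y_hat = discriminator(imgs)
    y = torch.ones_like(y_hat) if target_is_real else torch.zeros_like(y_hat)
    return F.binary_cross_entropy(y_hat, y)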

Environment

* CUDA:
        - GPU:
                - GeForce RTX 2080 Ti
        - available:         True
        - version:           10.1
* Packages:
        - numpy:             1.18.1
        - pyTorch_debug:     False
        - pyTorch_version:   1.4.0
        - pytorch-lightning: 0.7.5
        - tensorboard:       2.2.0
        - tqdm:              4.45.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                -
        - processor:         x86_64
        - python:            3.6.7
        - version:           #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019

Add RL tests

@djbyrne
For each model, please add the following test:

import pytorch_lightning as pl
from pl_bolts.models.gans import BasicGAN
# reset_seed comes from the shared test utilities


# replace with your model name
def test_gan(tmpdir):
    reset_seed()

    model = BasicGAN(data_dir=tmpdir)
    trainer = pl.Trainer(fast_dev_run=True, default_root_dir=tmpdir)
    trainer.fit(model)
    trainer.test(model)

Add image vision datasets as Base lightningModules

Add the torchvision datasets, including the standard transforms and splits, as modules:

import os

import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST


class MNISTModule(pl.LightningModule):

    def prepare_data(self):
        self.mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
        self.mnist_test = MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor())

    def train_dataloader(self):
        loader = DataLoader(self.mnist_train, batch_size=self.hparams.batch_size, num_workers=self.hparams.num_workers)
        return loader

    def val_dataloader(self):
        loader = DataLoader(self.mnist_test, batch_size=self.hparams.batch_size, num_workers=self.hparams.num_workers)
        return loader

    def test_dataloader(self):
        loader = DataLoader(self.mnist_test, batch_size=self.hparams.batch_size, num_workers=self.hparams.num_workers)
        return loader


# proposed usage:
from pytorch_lightning.bolts.dataset_modules import MNISTModule

class MyModel(MNISTModule, pl.LightningModule):
    # all the dataloaders have already been implemented with MNIST
    ...

Using dp in SimCLR with --online_ft

🐛 Bug

The SimCLR module crashes when running with the dp backend and the --online_ft flag, with an assertion error during the validation sanity check.


To Reproduce

Steps to reproduce the behavior:

  1. Run python simclr_module.py --gpus 2 --distributed_backend dp --online_ft

Expected behavior

SimCLR should run on dp distributed_backend.

add seed to random_split

@nateraw

All the random_split calls need to be changed to use the seed:

before

        dataset_train, _ = random_split(dataset, [100, 50])

now

        dataset_train, _ = random_split(dataset, [100, 50], generator=torch.Generator().manual_seed(self.seed))

Setting seed for SimCLR transforms

Hi @williamFalcon,

Currently, the SimCLR implementation uses a set of transformations that are stochastic. Is there any way to set the seed so that subsequent experiments are reproducible?

I was using overfit_batches=1 and batch_size=2 to just make sure SimCLR overfits correctly. The batch has the same 2 images loaded, but on each epoch, the transformations applied are different.
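
One option (an assumption on my part, and only sufficient for single-process loading with num_workers=0) is to seed all RNGs up front; multi-worker dataloading would additionally need a worker seeding strategy.

from pytorch_lightning import seed_everything

seed_everything(42)  # seeds the Python, NumPy and torch RNGs used by the transforms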

VAE loss for non-binary images

🐛 Bug

I think the VAE ELBO loss, when training on ImageNet or CIFAR-10, should use F.mse_loss instead of F.binary_cross_entropy. The tensors for the target x right now contain values outside of [0, 1], which might not behave sensibly with BCE. For non-binary images, a Gaussian likelihood would imply an MSE reconstruction loss.

To Reproduce

See here:
https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/cf3830787e1352ce71d673347fb5648c90497b48/pl_bolts/models/autoencoders/basic_vae/basic_vae_module.py#L160
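
A small sketch of the suggested change (variable names are illustrative): swap the Bernoulli reconstruction term for a Gaussian one when the inputs are not binary.

import torch.nn.functional as F


def reconstruction_loss(x_hat, x, binary_inputs: bool):
    """Illustrative: BCE matches a Bernoulli decoder, MSE a Gaussian decoder."""
    if binary_inputs:
        return F.binary_cross_entropy(x_hat, x, reduction="mean")
    return F.mse_loss(x_hat, x, reduction="mean")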

Add A2C, PPO and other modern standard RL Algorithms

🚀 Feature

The RL section of Bolts currently only includes variants of DQN and VPG and lacks some of the more modern RL algorithms. Adding PPO, A2C, curiosity-driven exploration, etc. might be prudent:

  • PPO
  • A2C
  • curiosity exploration
  • ...
