microsoft / infinibatch
Efficient, check-pointed data loading for deep learning with massive data sets.
License: MIT License
Is it safe to do distributed training with InfinitePermutationSourceIterator with shuffle=True? On a cursory glance at the class's source, there isn't any handling based on instance_rank to properly distribute the shards across instances, and this could lead to the same chunks being read by different data loading processes in a distributed setting. A clarification on how to properly use this in a distributed setting would be great!
edit: fix typo
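For reference, the per-rank construction under discussion looks like this (argument names as in the class's constructor signature; the chunk paths, seed, and world size are made-up examples, and construction alone does not touch the files):

```python
from infinibatch.iterators import InfinitePermutationSourceIterator

chunk_paths = [f"data/chunk_{i:04d}.gz" for i in range(64)]  # hypothetical shard list

source = InfinitePermutationSourceIterator(
    chunk_paths,
    seed=1234,         # same seed on every rank so all ranks agree on the permutation
    shuffle=True,
    num_instances=4,   # total number of training instances (world size)
    instance_rank=0,   # this instance's rank, 0 <= instance_rank < num_instances
)
```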
The author list in setup.py is presently a placeholder. Let's update it with the correct list of authors. Let's also include a pip install command that installs directly from the Git repo, and notes on how to use this as a submodule.
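For the install command, the standard pip VCS syntax pointed at this repo should do:

```
pip install git+https://github.com/microsoft/infinibatch
```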
Unit tests are currently run twice for pull requests. We should fix the triggers of the GitHub Action so they run only once per PR.
To integrate fairseq and infinibatch, I need to get the size of the entire dataset. I used the chunked_dataset_iterator to read the text dataset. As a workaround, right now I'm reading the length from a custom config file with hardcoded values.
Is there a way to get the length of the entire dataset while keeping the same functionality as the current iterator?
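For now, a one-off pass over the chunk files can recover the count without changing the iterator. A rough sketch, assuming the chunks are gzipped text files with one example per line (adjust the reader otherwise; the chunk location is made up):

```python
import glob
import gzip

def count_dataset_items(chunk_glob: str) -> int:
    # Count items across all chunk files; run once up front and cache
    # the result, since this reads every chunk end to end.
    total = 0
    for path in sorted(glob.glob(chunk_glob)):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            total += sum(1 for _ in f)
    return total

num_items = count_dataset_items("data/train/*.gz")  # hypothetical chunk location
```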
Hi, I have been experimenting with this awesome library. I made a blog post on it: https://saiprasanna.in/posts/efficient-dynamic-batching-of-large-datasets-with-infinibatch/. Making dynamic batches based on tokens per batch rather than a fixed batch size has huge advantages in terms of reducing the total number of batches. I have a few questions regarding convergence in such a dynamic batching setting. I would be grateful if you could help me out with these.
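For readers landing here: the token-budget batching discussed in the post can be expressed with BucketedReadaheadBatchIterator's callable batch_size. As I understand the API, the callable receives the longest item of each prospective batch; the token budget and toy data below are made up:

```python
from infinibatch.iterators import ChunkedSourceIterator, BucketedReadaheadBatchIterator

MAX_TOKENS_PER_BATCH = 4096  # assumed budget, not from the post

# Toy "sentences" of varying length standing in for real data.
source = ChunkedSourceIterator([["tok"] * n for n in range(1, 200)])

batches = BucketedReadaheadBatchIterator(
    source,
    read_ahead=1000,             # items to read ahead and sort into length buckets
    key=lambda item: len(item),  # sort key: sentence length
    # Called with the longest item of a prospective batch, so batches
    # shrink (in item count) as the sentences in them grow.
    batch_size=lambda longest: max(1, MAX_TOKENS_PER_BATCH // len(longest)),
    seed=1,
)
print(len(next(iter(batches))))  # number of items in the first batch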
We might want to implement an Iterator that inherits from PyTorch's IterableDataset to have a direct interface to PyTorch's data loader functionality.
Here is some prototype code that we had earlier in this direction.
```python
import torch
from typing import Iterable, Union

from infinibatch.datasets import chunked_dataset_iterator
from infinibatch.iterators import CheckpointableIterator


class IterableCheckpointedDataset(torch.utils.data.IterableDataset):
    """
    Wraps a CheckpointableIterator into a PyTorch IterableDataset, which is recognized by its type by
    PyTorch's DataLoader class.
    """
    def __init__(self, source: CheckpointableIterator):
        super().__init__()
        self._source = source

    def __iter__(self):  # this is called in the forked clone
        worker_info = torch.utils.data.get_worker_info()
        assert worker_info is None or worker_info.num_workers == 1  # not supported since we can't get at the checkpoint for each worker
        return iter(self._source)


class IterableChunkedDataset(torch.utils.data.IterableDataset):
    def __init__(self, paths: Union[str, Iterable[str]], shuffle: bool = True, buffer_size: int = 2**20,
                 transform=None, seed: int = None, world_size: int = 1, rank: int = 0,
                 num_workers_per_rank: int = 1):
        super().__init__()
        self.rank = rank
        self.num_workers_per_rank = num_workers_per_rank
        # instance_rank is set assuming that num_workers_per_rank = 1 and adapted dynamically in __iter__
        self.dataset = chunked_dataset_iterator(paths, shuffle=shuffle, buffer_size=buffer_size,
                                                transform=transform, seed=seed,
                                                num_instances=world_size * num_workers_per_rank,
                                                instance_rank=rank)

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:  # single-process data loading
            self.dataset._instance_rank = self.rank
        else:
            assert worker_info.num_workers == self.num_workers_per_rank
            self.dataset._instance_rank = self.rank * self.num_workers_per_rank + worker_info.id
        return iter(self.dataset)
```
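A minimal usage sketch under the prototype's assumptions (the chunk path pattern is made up; batch_size=None makes the DataLoader pass items through unbatched):

```python
from torch.utils.data import DataLoader

dataset = IterableChunkedDataset("data/train/*.gz", world_size=1, rank=0)
loader = DataLoader(dataset, batch_size=None, num_workers=1)  # one worker per rank, as asserted above
item = next(iter(loader))
```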
The project looks great, but manipulating the PYTHONPATH or appending paths to sys.path at runtime is a bit unconventional. I think populating the setup.py could make this a bit easier to install. WDYT?
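A minimal sketch of what that setup.py could contain (all metadata values below are placeholders, not the project's actual ones):

```python
from setuptools import setup, find_packages

setup(
    name="infinibatch",
    version="0.1.0",  # placeholder
    description="Efficient, check-pointed data loading for deep learning with massive data sets",
    license="MIT",
    packages=find_packages(include=["infinibatch", "infinibatch.*"]),
    python_requires=">=3.6",  # assumed minimum
)
```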
```python
from infinibatch.iterators import ChunkedSourceIterator, ParallelMapIterator, PrefetchIterator

def transform(x):
    return x ** 2

if __name__ == "__main__":  # guard needed: the iterators below spawn worker processes
    source_iter = ChunkedSourceIterator(list(range(1000)))
    mid_iter = ParallelMapIterator(source_iterator=source_iter,
                                   num_processes=4,
                                   num_items_per_process=1000,
                                   transform=transform)
    res_iter = PrefetchIterator(mid_iter,
                                buffer_size=100,
                                buffer_in_main_process=True,
                                log_empty_buffer_warning=True)
    print(next(iter(res_iter)))
```
E.g. illustrate how to work with small data sets (if so desired), and multi-dataset scenarios with weighted sampling across corpora.
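For the weighted-sampling case, a minimal sketch of the idea in plain Python (this is not an existing infinibatch API; the corpora and weights are made up):

```python
import random
from itertools import cycle

def weighted_multiplex(iterators, weights, seed=42):
    # Endlessly draw the next item from one of several infinite
    # iterators, picking the source corpus with the given probabilities.
    rng = random.Random(seed)
    while True:
        chosen = rng.choices(iterators, weights=weights, k=1)[0]
        yield next(chosen)

corpus_a = cycle(["a1", "a2", "a3"])  # stand-ins for per-corpus pipelines
corpus_b = cycle(["b1", "b2"])
mixed = weighted_multiplex([corpus_a, corpus_b], weights=[0.8, 0.2])
print([next(mixed) for _ in range(10)])
```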
Hello,
are you planning to support streaming datasets stored on cloud storage buckets?
Many thanks,
Alessandro
If a data loader process hits an exception, training will get stuck.
It looks like we have code to ensure that if the main training process ends, all the data loader processes are reaped.
Maybe the other direction is also needed: if a data loader throws an exception and terminates for any reason, either restart it or terminate the parent training process.
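A hypothetical watchdog along those lines (not existing infinibatch code; assumes the workers are multiprocessing.Process objects and that this runs in a daemon thread inside the trainer):

```python
import os
import signal
import time

def watch_workers(workers, poll_seconds=1.0):
    # Poll the data-loader worker processes; if any died with a
    # non-zero exit code, terminate the parent training process
    # instead of letting it hang waiting for data that never arrives.
    while True:
        if any(w.exitcode not in (None, 0) for w in workers):
            os.kill(os.getpid(), signal.SIGTERM)
            return
        if all(w.exitcode is not None for w in workers):
            return  # all workers exited cleanly
        time.sleep(poll_seconds)
```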