
infinibatch's People

Contributors

erip · gmyr · microsoft-github-operations[bot] · microsoftopensource · yushims


infinibatch's Issues

InfinitePermutationSourceIterator + distributed training + shuffle

Is it safe to do distributed training with InfinitePermutationSourceIterator with shuffle=True? On a cursory glance at the class's source, there is no handling based on instance_rank to properly distribute the shards across instances, so the same chunks could be read by different data-loading processes in a distributed setting. A clarification on how to use this correctly in a distributed setting would be great!

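The concern about duplicated chunks can be illustrated without infinibatch: if every rank shuffles and reads the full chunk list independently, ranks will see overlapping data, whereas a deterministic shuffle followed by a strided, rank-based split keeps shards disjoint. A minimal sketch of that idea (the helper below is hypothetical, not an infinibatch API):

```python
import random

def shard_for_rank(chunks, instance_rank, num_instances, seed=42):
    """Deterministically shuffle the chunk list, then give each rank
    a disjoint, strided slice so no chunk is read twice."""
    rng = random.Random(seed)     # same seed on every rank
    shuffled = list(chunks)
    rng.shuffle(shuffled)         # identical order on every rank
    return shuffled[instance_rank::num_instances]

chunks = [f"chunk_{i:03d}.gz" for i in range(10)]
shards = [shard_for_rank(chunks, r, 4) for r in range(4)]

# Shards are pairwise disjoint and together cover every chunk.
assert sum(len(s) for s in shards) == len(chunks)
assert set().union(*map(set, shards)) == set(chunks)
```

The key point is that the shuffle must be seeded identically on every rank before the strided split, otherwise the slices are taken from different orderings and overlap.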

Edit authors list in setup.py

The authors list in setup.py is currently a placeholder. Let's update it with the correct list of authors.

Let's also add a pip install command that installs directly from the Git repo, and notes on how to use this project as a submodule.
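For reference, installing directly from the Git repository (assuming the canonical microsoft/infinibatch repo on GitHub) would look like this:

```shell
# Install infinibatch straight from the Git repository
pip install git+https://github.com/microsoft/infinibatch.git

# Or vendor it as a git submodule instead
git submodule add https://github.com/microsoft/infinibatch.git third_party/infinibatch
```

Note this requires setup.py (or equivalent packaging metadata) to be populated, which is what this issue asks for.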

Can we get the length of entire dataset from `chunked_dataset_iterator`?

To integrate fairseq and infinibatch, I need the size of the entire dataset. I use chunked_dataset_iterator to read the text dataset. As a workaround, I'm currently reading the length from a custom config file with hard-coded values.
Is there a way to get the length of the entire dataset while keeping the current iterator's functionality?
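Until such an API exists, one workaround is a one-time counting pass over the chunk files. This is a sketch assuming newline-delimited, gzip-compressed chunks (the layout chunked_dataset_iterator reads); the function name is hypothetical:

```python
import glob
import gzip
import os

def count_dataset_items(chunk_dir):
    """One-time pass over all .gz chunk files, counting one item per line.
    Cache the result; re-run only when the chunk files change."""
    total = 0
    for path in sorted(glob.glob(os.path.join(chunk_dir, "*.gz"))):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            total += sum(1 for _ in f)
    return total
```

Caching the count in a sidecar file next to the chunks avoids re-reading everything at each training start.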

A few questions on real-world usage of infinibatch while training models

Hi, I have been experimenting with this awesome library and wrote a blog post about it: https://saiprasanna.in/posts/efficient-dynamic-batching-of-large-datasets-with-infinibatch/. Making dynamic batches based on tokens per batch rather than a fixed batch size greatly reduces the total number of batches. I have a few questions about convergence in such a dynamic batching setting, and would be grateful for your help.

  • Is maximizing the tokens per batch, with no limit on batch size other than GPU memory, OK for convergence? Will having a different batch size at each step (but a constant number of tokens per batch) affect convergence?
    The tokens within a batch instance are correlated, so would long batches with few instances be "noisy" in terms of the update they provide? Should we re-scale losses or otherwise address this, or impose a cut-off on the maximum instances per batch (via the lambda we provide)?
  • When doing distributed data parallel training in torch with data loading from infinibatch, each GPU might see a different batch size (though the tokens per batch might be the same). Should we take the number of instances per batch into account when syncing gradients?
  • Are there any rules of thumb for hyperparameters when doing dynamic batching?
  • Would this be advisable in a transfer-learning setting?
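For context, the token-budget batching discussed above (which infinibatch implements via BucketedReadaheadBatchIterator) can be sketched in plain Python: sort items by length, then cut batches so the padded size — batch size × longest item — stays within the budget. This is a simplified stand-in, not infinibatch's code:

```python
def token_budget_batches(lengths, max_tokens):
    """Group items (given by their length) into batches where the padded
    size, len(batch) * max(length in batch), stays <= max_tokens.
    Sorting by descending length first keeps padding waste low."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches, cur = [], []
    for i in order:
        # cur is sorted by descending length, so cur[0] is the longest item
        longest = lengths[cur[0]] if cur else lengths[i]
        if cur and (len(cur) + 1) * longest > max_tokens:
            batches.append(cur)
            cur = []
        cur.append(i)
    if cur:
        batches.append(cur)
    return batches

lengths = [5, 9, 3, 7, 2, 8, 4]
batches = token_budget_batches(lengths, max_tokens=20)
# Every batch respects the padded-token budget, and every item appears once.
assert all(len(b) * max(lengths[i] for i in b) <= 20 for b in batches)
assert sorted(i for b in batches for i in b) == list(range(len(lengths)))
```

Note how the number of instances per batch varies (long items get small batches, short items get large ones), which is exactly what raises the gradient-syncing question above.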

Implement wrapper iterator that inherits from PyTorch's IterableDataset

We might want to implement an iterator that inherits from PyTorch's IterableDataset to provide a direct interface to PyTorch's data loader functionality.

Here is some earlier prototype code in this direction.

from typing import Iterable, Optional, Union

import torch

from infinibatch.datasets import chunked_dataset_iterator
from infinibatch.iterators import CheckpointableIterator


class IterableCheckpointedDataset(torch.utils.data.IterableDataset):
    """
    Wraps a CheckpointableIterator into a PyTorch IterableDataset, which is recognized by its type by
    PyTorch's DataLoader class.
    """
    def __init__(self, source: CheckpointableIterator):
        super().__init__()
        self._source = source

    def __iter__(self):  # this is called in the forked clone
        worker_info = torch.utils.data.get_worker_info()
        assert worker_info is None or worker_info.num_workers == 1  # not supported since we can't get at the checkpoint for each worker
        return iter(self._source)


class IterableChunkedDataset(torch.utils.data.IterableDataset):
    def __init__(self, paths: Union[str, Iterable[str]], shuffle: bool=True, buffer_size: int=2**20, transform=None, seed: Optional[int]=None, world_size: int=1, rank: int=0, num_workers_per_rank: int=1):
        super().__init__()
        self.rank = rank
        self.num_workers_per_rank = num_workers_per_rank
        # instance_rank is set assuming that num_workers_per_rank = 1 and adapted dynamically in __iter__
        self.dataset = chunked_dataset_iterator(paths, shuffle=shuffle, buffer_size=buffer_size, transform=transform, seed=seed, num_instances=world_size*num_workers_per_rank, instance_rank=rank)

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:  # single-process data loading
            self.dataset._instance_rank = self.rank
        else:
            assert worker_info.num_workers == self.num_workers_per_rank
            self.dataset._instance_rank = self.rank * self.num_workers_per_rank + worker_info.id
        return iter(self.dataset)

Why is setup.py empty?

The project looks great, but manipulating PYTHONPATH or appending paths to sys.path at runtime is a bit unconventional. 😄 I think populating setup.py would make this a bit easier to install. WDYT?

Iterator gets stuck when using PrefetchIterator after ParallelMapIterator; it hangs indefinitely in ParallelMapIterator's pool.map call. Any ideas?

from infinibatch.iterators import ChunkedSourceIterator, ParallelMapIterator, PrefetchIterator

def transform(x):
    return x ** 2

source_iter = ChunkedSourceIterator(list(range(1000)))

mid_iter = ParallelMapIterator(source_iterator=source_iter,
                               num_processes=4,
                               num_items_per_process=1000,
                               transform=transform)

res_iter = PrefetchIterator(mid_iter,
                            buffer_size=100,
                            buffer_in_main_process=True,
                            log_empty_buffer_warning=True)

print(next(iter(res_iter)))

Support for cloud streaming

Hello,

Are you planning to support streaming datasets stored in cloud storage buckets?

Many thanks,
Alessandro

Training gets stuck if a data loader worker process has an exception

If a data loader process raises an exception, training will get stuck.

It looks like we have code to ensure that when the main training process ends, all the data loader processes are reaped.
Maybe the other direction is also needed: if a data loader throws an exception and terminates for any reason, either restart it or terminate the parent training process.
