microsoft / infinibatch
Efficient, check-pointed data loading for deep learning with massive data sets.
License: MIT License
Is it safe to do distributed training with InfinitePermutationSourceIterator with shuffle=True? On a cursory glance at the class's source, there isn't any handling based on instance_rank to properly distribute the shards across instances, and this could lead to the same chunks being read by different data loading processes in a distributed setting. A clarification on how to properly use this in a distributed setting would be great!
edit: fix typo
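For reference, the per-rank construction under discussion looks like this (argument names as in the class's constructor signature; the chunk paths, seed, and world size are made-up examples, and construction alone does not touch the files):

```python
from infinibatch.iterators import InfinitePermutationSourceIterator

chunk_paths = [f"data/chunk_{i:04d}.gz" for i in range(64)]  # hypothetical shard list

source = InfinitePermutationSourceIterator(
    chunk_paths,
    seed=1234,         # same seed on every rank so all ranks agree on the permutation
    shuffle=True,
    num_instances=4,   # total number of training instances (world size)
    instance_rank=0,   # this instance's rank, 0 <= instance_rank < num_instances
)
```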
The author list in setup.py is presently a placeholder. Let's update it with the correct list of authors. Let's also include a pip install command that installs directly from the Git repo, and notes on how to use this as a submodule.
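For the install command, the standard pip VCS syntax pointed at this repo should do:

```
pip install git+https://github.com/microsoft/infinibatch
```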
Unit tests are currently run twice for pull requests. We should fix the triggers of the GitHub Action so they run only once per PR.
To integrate fairseq and infinibatch, I need to get the size of the entire dataset. I used the chunked_dataset_iterator to read the text dataset. As a workaround, right now I'm reading the length from a custom config file with hardcoded values.
Is there a way to get the length of the entire dataset while keeping the same functionality as the current iterator?
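For now, a one-off pass over the chunk files can recover the count without changing the iterator. A rough sketch, assuming the chunks are gzipped text files with one example per line (adjust the reader otherwise; the chunk location is made up):

```python
import glob
import gzip

def count_dataset_items(chunk_glob: str) -> int:
    # Count items across all chunk files; run once up front and cache
    # the result, since this reads every chunk end to end.
    total = 0
    for path in sorted(glob.glob(chunk_glob)):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            total += sum(1 for _ in f)
    return total

num_items = count_dataset_items("data/train/*.gz")  # hypothetical chunk location
```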
Hi, I have been experimenting with this awesome library. I made a blog post on it: https://saiprasanna.in/posts/efficient-dynamic-batching-of-large-datasets-with-infinibatch/. Making dynamic batches based on tokens per batch rather than a fixed batch size has huge advantages in terms of reducing the total number of batches. I have a few questions regarding convergence in such a dynamic batching setting. I would be grateful if you could help me out with these.
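For readers landing here: the token-budget batching discussed in the post can be expressed with BucketedReadaheadBatchIterator's callable batch_size. As I understand the API, the callable receives the longest item of each prospective batch; the token budget and toy data below are made up:

```python
from infinibatch.iterators import ChunkedSourceIterator, BucketedReadaheadBatchIterator

MAX_TOKENS_PER_BATCH = 4096  # assumed budget, not from the post

# Toy "sentences" of varying length standing in for real data.
source = ChunkedSourceIterator([["tok"] * n for n in range(1, 200)])

batches = BucketedReadaheadBatchIterator(
    source,
    read_ahead=1000,             # items to read ahead and sort into length buckets
    key=lambda item: len(item),  # sort key: sentence length
    # Called with the longest item of a prospective batch, so batches
    # shrink (in item count) as the sentences in them grow.
    batch_size=lambda longest: max(1, MAX_TOKENS_PER_BATCH // len(longest)),
    seed=1,
)
print(len(next(iter(batches))))  # number of items in the first batch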
We might want to implement an Iterator that inherits from PyTorch's IterableDataset to have a direct interface to PyTorch's data loader functionality.
Here is some prototype code that we had earlier in this direction.
```python
import torch
from typing import Iterable, Union

from infinibatch.datasets import chunked_dataset_iterator
from infinibatch.iterators import CheckpointableIterator


class IterableCheckpointedDataset(torch.utils.data.IterableDataset):
    """
    Wraps a CheckpointableIterator into a PyTorch IterableDataset, which is recognized by its type by
    PyTorch's DataLoader class.
    """
    def __init__(self, source: CheckpointableIterator):
        super().__init__()
        self._source = source

    def __iter__(self):  # this is called in the forked clone
        worker_info = torch.utils.data.get_worker_info()
        assert worker_info is None or worker_info.num_workers == 1  # not supported since we can't get at the checkpoint for each worker
        return iter(self._source)


class IterableChunkedDataset(torch.utils.data.IterableDataset):
    def __init__(self, paths: Union[str, Iterable[str]], shuffle: bool = True, buffer_size: int = 2**20,
                 transform=None, seed: int = None, world_size: int = 1, rank: int = 0,
                 num_workers_per_rank: int = 1):
        super().__init__()
        self.rank = rank
        self.num_workers_per_rank = num_workers_per_rank
        # instance_rank is set assuming that num_workers_per_rank = 1 and adapted dynamically in __iter__
        self.dataset = chunked_dataset_iterator(paths, shuffle=shuffle, buffer_size=buffer_size,
                                                transform=transform, seed=seed,
                                                num_instances=world_size * num_workers_per_rank,
                                                instance_rank=rank)

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:  # single-process data loading
            self.dataset._instance_rank = self.rank
        else:
            assert worker_info.num_workers == self.num_workers_per_rank
            self.dataset._instance_rank = self.rank * self.num_workers_per_rank + worker_info.id
        return iter(self.dataset)
```
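A minimal usage sketch under the prototype's assumptions (the chunk path pattern is made up; batch_size=None makes the DataLoader pass items through unbatched):

```python
from torch.utils.data import DataLoader

dataset = IterableChunkedDataset("data/train/*.gz", world_size=1, rank=0)
loader = DataLoader(dataset, batch_size=None, num_workers=1)  # one worker per rank, as asserted above
item = next(iter(loader))
```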
The project looks great, but manipulating the PYTHONPATH or appending paths to sys.path at runtime is a bit unconventional. I think populating the setup.py could make this a bit easier to install. WDYT?
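A minimal sketch of what that setup.py could contain (all metadata values below are placeholders, not the project's actual ones):

```python
from setuptools import setup, find_packages

setup(
    name="infinibatch",
    version="0.1.0",  # placeholder
    description="Efficient, check-pointed data loading for deep learning with massive data sets",
    license="MIT",
    packages=find_packages(include=["infinibatch", "infinibatch.*"]),
    python_requires=">=3.6",  # assumed minimum
)
```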
```python
from infinibatch.iterators import ChunkedSourceIterator, ParallelMapIterator, PrefetchIterator

def transform(x):
    return x ** 2

if __name__ == "__main__":  # guard needed: the iterators below spawn worker processes
    source_iter = ChunkedSourceIterator(list(range(1000)))
    mid_iter = ParallelMapIterator(source_iterator=source_iter,
                                   num_processes=4,
                                   num_items_per_process=1000,
                                   transform=transform)
    res_iter = PrefetchIterator(mid_iter,
                                buffer_size=100,
                                buffer_in_main_process=True,
                                log_empty_buffer_warning=True)
    print(next(iter(res_iter)))
```
E.g. illustrate how to work with small data sets (if so desired), and multi-dataset scenarios with weighted sampling across corpora.
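For the weighted-sampling case, a minimal sketch of the idea in plain Python (this is not an existing infinibatch API; the corpora and weights are made up):

```python
import random
from itertools import cycle

def weighted_multiplex(iterators, weights, seed=42):
    # Endlessly draw the next item from one of several infinite
    # iterators, picking the source corpus with the given probabilities.
    rng = random.Random(seed)
    while True:
        chosen = rng.choices(iterators, weights=weights, k=1)[0]
        yield next(chosen)

corpus_a = cycle(["a1", "a2", "a3"])  # stand-ins for per-corpus pipelines
corpus_b = cycle(["b1", "b2"])
mixed = weighted_multiplex([corpus_a, corpus_b], weights=[0.8, 0.2])
print([next(mixed) for _ in range(10)])
```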
Hello,
are you planning to support streaming datasets stored on cloud storage buckets?
Many thanks,
Alessandro
If a data loader process hits an exception, training will get stuck.
It looks like we have code to ensure that if the main training process ends, all the data loader processes are reaped.
Maybe the other direction is also needed: if a data loader throws an exception and terminates for any reason, either restart it or terminate the parent training process.
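A hypothetical watchdog along those lines (not existing infinibatch code; assumes the workers are multiprocessing.Process objects and that this runs in a daemon thread inside the trainer):

```python
import os
import signal
import time

def watch_workers(workers, poll_seconds=1.0):
    # Poll the data-loader worker processes; if any died with a
    # non-zero exit code, terminate the parent training process
    # instead of letting it hang waiting for data that never arrives.
    while True:
        if any(w.exitcode not in (None, 0) for w in workers):
            os.kill(os.getpid(), signal.SIGTERM)
            return
        if all(w.exitcode is not None for w in workers):
            return  # all workers exited cleanly
        time.sleep(poll_seconds)
```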