Comments (10)
Absolutely! One strategy you can use is to partition the model into per-rank shards and serialize each shard into its own respective `.tensors` file. Here's an example of what checkpointing with tensorizer could look like for distributed training, simplified from the coreweave/megatron fork:
```python
model = unwrap_model(model)
for i in range(len(model)):
    mpu.set_virtual_pipeline_model_parallel_rank(i)
    shard_serializer = TensorSerializer(
        f"{checkpoint_name}_shard{i}.tensors"
    )
    shard_serializer.write_module(model[i])
    shard_serializer.close()
```
And the corresponding deserialization, again simplified from the linked code:
```python
for i in range(len(model)):
    mpu.set_virtual_pipeline_model_parallel_rank(i)
    shard_deserializer = TensorDeserializer(
        f"{checkpoint_name}_shard{i}.tensors",
        device=torch.cuda.current_device(),
        plaid_mode=True
    )
    shard_deserializer.load_into_module(model[i])
    shard_deserializer.close()
```
Alternatively, if you already have a checkpoint that you want to load across multiple ranks, you can provide a filter function to indicate which tensors should or should not be streamed into any given rank. EDIT: I just learned this is actually a pretty inefficient approach. If you have sufficient RAM, the lowest-latency approach is to deserialize the entire checkpoint onto your CPU, then assign tensors to devices as you normally would from there.
```python
# EDIT: Actually, don't do this. Leaving it here for demonstrative purposes.
for i in range(len(model)):
    module_names = modules_to_load_into_rank[i]

    # Should evaluate to True when the provided `key`
    # matches the name of a tensor we want on the device
    def filter_func(key: str) -> bool:
        return key in module_names

    mpu.set_virtual_pipeline_model_parallel_rank(i)
    shard_deserializer = TensorDeserializer(
        f"{checkpoint_name}_shard{i}.tensors",
        device=torch.cuda.current_device(),
        plaid_mode=True
    )
    shard_deserializer.load_into_module(model[i], filter_func)
    shard_deserializer.close()
```
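For contrast, the recommended path from the EDIT above (deserialize the whole checkpoint onto CPU once, then hand each rank only the tensors it needs) could be sketched like this. This is a hypothetical illustration: `partition_state_dict` is an invented helper, and the per-rank name sets mirror `modules_to_load_into_rank` from the snippet above.

```python
def partition_state_dict(state_dict, modules_to_load_into_rank):
    # Split a flat {name: tensor} mapping into one sub-dict per rank,
    # keeping only the tensors each rank actually needs.
    return [
        {name: t for name, t in state_dict.items() if name in module_names}
        for module_names in modules_to_load_into_rank
    ]

# Usage sketch (hypothetical): deserialize the entire checkpoint with
# device="cpu", build per-rank shards with partition_state_dict, then
# move each shard's tensors to its device with tensor.to(device).
```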
`.tensors` files pack tensors into contiguous memory on disk. The chunk for a given tensor starts with a header containing metadata, followed by the tensor's data. Main implementation details can be found here. The data format doesn't do sharding or distribution on its own, but as demonstrated above it is compatible with user-specified sharding. Alternatively, Coreweave storage supports both CephFS and Vast distributed file systems. We are currently undergoing benchmarking exercises whose results we will hopefully share soon. Based on our preliminary observations, reading and writing shards distributed across multiple `.tensors` files is likely to be faster than one big file when using one of those distributed file systems.
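As a toy illustration of the header-then-data chunk layout described above (this is deliberately not tensorizer's actual on-disk format; the field widths and ordering here are invented for demonstration):

```python
import struct

def pack_tensor(name: bytes, dtype: bytes, shape: tuple, raw: bytes) -> bytes:
    # Toy chunk: [name_len][name][dtype_len][dtype][ndim][dims...][data_len][data]
    header = struct.pack("<H", len(name)) + name
    header += struct.pack("<H", len(dtype)) + dtype
    header += struct.pack("<B", len(shape)) + struct.pack(f"<{len(shape)}Q", *shape)
    header += struct.pack("<Q", len(raw))
    return header + raw

def unpack_tensor(buf: bytes, offset: int = 0):
    # Walk the chunk: read each header field, then the contiguous data.
    (name_len,) = struct.unpack_from("<H", buf, offset); offset += 2
    name = buf[offset:offset + name_len]; offset += name_len
    (dtype_len,) = struct.unpack_from("<H", buf, offset); offset += 2
    dtype = buf[offset:offset + dtype_len]; offset += dtype_len
    (ndim,) = struct.unpack_from("<B", buf, offset); offset += 1
    shape = struct.unpack_from(f"<{ndim}Q", buf, offset); offset += 8 * ndim
    (data_len,) = struct.unpack_from("<Q", buf, offset); offset += 8
    data = buf[offset:offset + data_len]; offset += data_len
    return name, dtype, shape, data, offset
```

Because each chunk is self-describing and contiguous, chunks can be concatenated into one file and read back sequentially, which is the property that makes streaming deserialization straightforward.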
Did that answer your questions?
from tensorizer.
Thanks @dmarx for replying quickly. I'll try the suggestion of partitioning the model into per-rank shards for checkpointing. Two questions:

- Regarding serialization and deserialization with multiple GPUs: If I have more GPUs during serialization and write the shards, but fewer GPUs during deserialization, would it impact the deserialization process? Does tensorizer handle this scenario efficiently, or are there any considerations I should be aware of?
- Observing slow deserialization speeds: I tried serializing and deserializing the exact same example using tensorizer with the model EleutherAI/gpt-j-6B. Despite our fast S3 download speeds of 10 GB/s and the 12 GB `.tensors` file, the deserialization process was slow, with a transfer rate of 477 MB/s and a 26-second loading time to GPU (these numbers are from `total_bytes_str` and `get_mem_usage()` when executing the example). Given that network speed isn't a bottleneck, could there be other processes causing this delay, such as intermediate disk writes or aggregation happening on the GPU side to stitch model weights?
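As a quick sanity check on those figures (assuming decimal units, i.e. 1 GB = 10⁹ bytes, which appears to be how the reported numbers were computed):

```python
# A ~12.2 GB checkpoint loaded in ~26.35 s implies roughly 463 MB/s,
# consistent with the reported transfer rate in the high-400s.
size_bytes = 12.2e9
seconds = 26.35
rate_mb_s = size_bytes / seconds / 1e6
print(round(rate_mb_s, 1))  # ≈ 463.0
```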
Could you maybe share some code to clarify how you're deserializing? Just to reiterate: I provided three chunks of code above, and one of them (the last one) is a demonstration of what not to do.
Also, could you elaborate a bit more on your environment? For example, are you downloading from S3 within AWS? If you are pulling this data over a WAN connection (which is probably the case if you are doing all of this within AWS), that is potentially a contributor to the poor performance.
Thank you @dmarx for your prompt response.
For deserialization, I'm employing the same code structure as provided in the deserialize example. Initially, I write the `.tensors` file for the 6B model, and subsequently I use boto3 with `TensorDeserializer` to stream it back onto the GPU.
Environment Details:
Our environment is internal, not AWS-based. However, during experimentation, I observed that while using the MinIO command-line utility `mc` to download the `.tensors` file to an NVMe drive (not a spinning disk), I achieved speeds of at least 3 GB/s, indicating that our network can indeed support higher speeds.
Despite the network's capability to achieve high speeds, `TensorDeserializer` only achieves transfer speeds in the range of 500 MB/s during deserialization. Given the confirmed high download speeds, I'm curious about the factors limiting the deserialization speeds with tensorizer.
I hope this provides the necessary clarity on the deserialization process and environment setup. Any insights for speeding up or recommendations regarding the observed deserialization speeds would be greatly appreciated.
What version of tensorizer are you using? Try updating to the latest version, 2.8.0. One of the most important performance modes in tensorizer is `plaid_mode`, which pipelines the loading and processing of tensors. This has historically been a feature that was disabled by default, and will now be turned on by default whenever a CUDA device is detected.
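The pipelining idea can be illustrated with a toy producer/consumer sketch. To be clear, this is not tensorizer's actual implementation; `read_chunk` and `process_chunk` are hypothetical stand-ins for I/O and tensor setup work that can overlap.

```python
import queue
import threading

def pipelined_load(read_chunk, process_chunk, num_chunks, depth=2):
    # Toy pipeline: a background reader thread fetches chunks ahead
    # while the caller processes them, overlapping I/O with compute.
    q = queue.Queue(maxsize=depth)

    def reader():
        for i in range(num_chunks):
            q.put(read_chunk(i))
        q.put(None)  # sentinel: no more chunks

    threading.Thread(target=reader, daemon=True).start()
    results = []
    while (chunk := q.get()) is not None:
        results.append(process_chunk(chunk))
    return results
```

With a bounded queue, the reader stays at most `depth` chunks ahead, so memory stays capped while the slower of the two stages sets the overall throughput instead of their sum.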
For some reason, when I try to download via `pip install tensorizer`, it defaults to 2.7.2. But I tried both with and without `plaid_mode=True` in my experiments, and the model was streamed to the GPU in almost the same time in both cases.
This is really unusual. I'm thankful you've brought your case to our attention and for your patience and cooperation as we try to figure out what could be going on here.
To update the library, try `pip install --upgrade tensorizer` or `pip install tensorizer==2.8.0`.
What kind of performance do you experience using tensorizer to deserialize models from local paths?
Also, could you possibly share a little about the hardware you're using here? GPU, CPU, number of cores, RAM, etc.
Yeah, sure. Thanks so much for taking a look into this.
I executed the helper utils from tensorizer:
```python
total_bytes_str = convert_bytes(deserializer.total_tensor_bytes)
duration = end - start
per_second = convert_bytes(deserializer.total_tensor_bytes / duration)
after_mem = get_mem_usage()
deserializer.close()
print(f"Deserialized {total_bytes_str} in {duration:0.2f}s, {per_second}/s")
print(f"Memory usage before: {before_mem}")
print(f"Memory usage after: {after_mem}")
```
and got the following output:

```
Deserialized 12.2 GB in 26.35s, 463.8 MB/s
Memory usage before: CPU: (maxrss: 14,341MiB F: 486,836MiB) GPU: (U: 309MiB F: 32,191MiB T: 32,501MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 26,839MiB F: 486,739MiB) GPU: (U: 23,749MiB F: 8,751MiB T: 32,501MiB) TORCH: (R: 23,348MiB/32,014MiB, A: 11,661MiB/23,315MiB)
```
I have a 12-core CPU, and for the GPU I am using a V100, but I didn't see much difference when using an A100 either. In both cases the speed was ~500 MB/s. I am having some issues internally installing the latest version, 2.8.0; I'll ask around and see what the issue is. But I was setting `plaid_mode=True` with 2.7.2.
Local mode: yeah, local mode was very fast in my experiment; from disk it reached speeds of up to 3 GB/s.