
dmarx commented on July 30, 2024
  1. Absolutely! One strategy you can use is to partition the model into per-rank shards and serialize each shard into its own .tensors file. Here's an example of what checkpointing with tensorizer could look like for distributed training, simplified from the coreweave/megatron fork:
# `unwrap_model`, `mpu`, and `checkpoint_name` come from the Megatron training context
from tensorizer import TensorSerializer

model = unwrap_model(model)
for i in range(len(model)):
    mpu.set_virtual_pipeline_model_parallel_rank(i)
    shard_serializer = TensorSerializer(
        f"{checkpoint_name}_shard{i}.tensors"
    )
    shard_serializer.write_module(model[i])
    shard_serializer.close()

And the corresponding deserialization, again simplified from the linked code:

import torch
from tensorizer import TensorDeserializer

for i in range(len(model)):
    mpu.set_virtual_pipeline_model_parallel_rank(i)
    shard_deserializer = TensorDeserializer(
        f"{checkpoint_name}_shard{i}.tensors",
        device=torch.cuda.current_device(),
        plaid_mode=True
    )
    shard_deserializer.load_into_module(model[i])
    shard_deserializer.close()

Alternatively, if you already have a checkpoint that you want to load across multiple ranks, you can provide a filter function indicating which tensors should or should not be streamed into any given rank. EDIT: I just learned this is actually a pretty inefficient approach. If you have sufficient RAM, the lowest-latency approach is to deserialize the entire checkpoint onto your CPU and then assign tensors to devices as you normally would from there (a minimal sketch of that follows the code below).

# EDIT: Actually, don't do this. Leaving it here for demonstrative purposes.

for i in range(len(model)):
    module_names = modules_to_load_into_rank[i]

    # Should evaluate to True when the provided `key`
    # matches the name of a tensor we want on the device
    def filter_func(key: str) -> bool:
        return key in module_names

    mpu.set_virtual_pipeline_model_parallel_rank(i)
    shard_deserializer = TensorDeserializer(
        f"{checkpoint_name}_shard{i}.tensors",
        device=torch.cuda.current_device(),
        plaid_mode=True
    )
    shard_deserializer.load_into_module(model[i], filter_func)
    shard_deserializer.close()
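
For contrast, here is a minimal sketch of the recommended CPU-first path described in the EDIT above. It assumes a single torch.nn.Module named `model` and a hypothetical unsharded checkpoint file; it is not the exact code from the linked fork.

import torch
from tensorizer import TensorDeserializer

# Deserialize the entire checkpoint into CPU RAM once...
cpu_deserializer = TensorDeserializer(
    f"{checkpoint_name}.tensors",  # hypothetical unsharded checkpoint
    device="cpu"
)
cpu_deserializer.load_into_module(model)
cpu_deserializer.close()

# ...then place parameters on their target devices as you normally would,
# e.g. by moving this rank's portion of the model onto its local GPU.
model.to(torch.cuda.current_device())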
  2. .tensors files pack tensors into contiguous memory on disk. The chunk for a given tensor starts with a header containing metadata and is followed by the tensor's data; the main implementation details can be found here. The data format doesn't do sharding or distribution on its own, but, as demonstrated above, it is compatible with user-specified sharding. Alternatively, CoreWeave storage supports both the CephFS and VAST distributed file systems. We are currently running benchmarking exercises whose results we hope to share soon. Based on our preliminary observations, reading and writing shards distributed across multiple .tensors files is likely to be faster than using one big file on one of those distributed file systems.
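
To illustrate that per-tensor header/metadata layout concretely, the sketch below lists the tensors in a serialized file without materializing their data. The shard filename is a placeholder, and lazy_load is assumed to be available in your tensorizer version.

from tensorizer import TensorDeserializer

# Open lazily: headers and metadata are read up front, but tensor data is
# not loaded until an entry is actually accessed.
inspector = TensorDeserializer(
    "checkpoint_shard0.tensors",  # placeholder shard file
    device="cpu",
    lazy_load=True
)
for name in inspector.keys():  # the deserializer behaves like a read-only dict
    print(name)
print(f"Total tensor bytes: {inspector.total_tensor_bytes}")
inspector.close()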

Did that answer your questions?


SujoyDutta commented on July 30, 2024

Thanks @dmarx for replying quickly. I'll try the suggestion of partitioning the model into per-rank shards for checkpointing. Two questions:

  1. Regarding Serialization and Deserialization with Multiple GPUs: If I have more GPUs during serialization and write the shards, but fewer GPUs during deserialization, would it impact the deserialization process? Does Tensorizer handle this scenario efficiently, or are there any considerations I should be aware of?

  2. Observing Slow Deserialization Speeds: I tried serializing and deserializing the exact same example using Tensorizer with the model EleutherAI/gpt-j-6B. Despite our fast S3 download speeds of 10 GB/s and the 12 GB .tensors file, the deserialization process was slow, with a transfer rate of 477 MB/s and a 26-second loading time to GPU (these numbers are from total_bytes_str and get_mem_usage() when executing the example). Given that network speed isn't the bottleneck, could there be other processes causing this delay, such as intermediate disk writes or aggregation happening on the GPU side to stitch model weights together?


dmarx commented on July 30, 2024

Could you maybe share some code to clarify how you're deserializing? Just to reiterate: of the code I provided above, the filter-function example is a demonstration of what not to do.

Also, could you elaborate a bit more on your environment? E.g., are you downloading from S3 from within AWS? If you are pulling this data over a WAN connection (which is probably the case unless you are doing all of this within AWS), that is potentially a contributor to the poor performance.


SujoyDutta commented on July 30, 2024

Thank you @dmarx for your prompt response.

For deserialization, I'm using the same code structure as in the deserialize example. I first write the .tensors file for the 6B model, and then use TensorDeserializer (with our S3 credentials via boto3) to stream it back onto the GPU.
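
For context, the deserialize example being referenced has roughly the following shape; the S3 URI below is a placeholder, and the model skeleton is built without initializing weights before streaming.

import torch
from transformers import AutoConfig, AutoModelForCausalLM
from tensorizer import TensorDeserializer
from tensorizer.utils import no_init_or_tensor

model_uri = "s3://my-bucket/gpt-j-6B/model.tensors"  # placeholder URI

config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
with no_init_or_tensor():
    # Build the model skeleton without allocating or initializing weights
    model = AutoModelForCausalLM.from_config(config)

deserializer = TensorDeserializer(
    model_uri,
    device=torch.cuda.current_device(),
    plaid_mode=True
)
deserializer.load_into_module(model)  # streams tensors straight onto the GPU
deserializer.close()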

Environment Details:

Our environment is internal, not AWS-based. However, during experimentation I observed that when using the MinIO command-line utility mc to download the .tensors file to an NVMe drive (not spinning disk), I achieved speeds of at least 3 GB/s, indicating that our network can indeed support higher speeds.

Despite the network's capability to achieve high speeds, the TensorDeserializer only achieves transfer speeds in the range of 500 MB/s during deserialization. Given the confirmed high download speeds, I'm curious about the factors limiting the deserialization speeds with Tensorizer.

I hope this provides the necessary clarity on the deserialization process and environment setup. Any insights for speeding up or recommendations regarding the observed deserialization speeds would be greatly appreciated.


dmarx commented on July 30, 2024

What version of tensorizer are you using? Try updating to the latest version, 2.8.0. One of the most important performance modes in tensorizer is plaid_mode, which pipelines the loading and processing of tensors. Historically this feature was disabled by default; it is now turned on by default whenever a CUDA device is detected.
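
For example, on a 2.7.x install the flag can be passed explicitly; a minimal sketch with a placeholder path (on 2.8.0 and later, the explicit flag should be unnecessary when a CUDA device is targeted):

import torch
from tensorizer import TensorDeserializer

deserializer = TensorDeserializer(
    "model.tensors",  # placeholder local path or S3 URI
    device=torch.cuda.current_device(),
    plaid_mode=True  # explicit opt-in on older releases
)
deserializer.load_into_module(model)  # `model` assumed to be constructed already
deserializer.close()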


SujoyDutta commented on July 30, 2024

For some reason, when I try to install via pip install tensorizer it defaults to 2.7.2.
But I tried both with and without plaid_mode=True in my experiments, and the model was streamed to the GPU in almost the same time in both cases.


dmarx commented on July 30, 2024

This is really unusual. I'm thankful you've brought your case to our attention and for your patience and cooperation as we try to figure out what could be going on here.

To update the library, try pip install --upgrade tensorizer or pip install tensorizer==2.8.0.

What kind of performance do you experience using tensorizer to deserialize models from local paths?


dmarx commented on July 30, 2024

Also, could you possibly share a little about the hardware you're using here? GPU, CPU, # cores, RAM, etc.


SujoyDutta commented on July 30, 2024

Yeah, sure. Thanks so much for taking a look into this.

I executed the helper utils from tensorizer:

# Context from the example (not shown here): convert_bytes and get_mem_usage
# come from tensorizer.utils, and before_mem/start/end are captured around the
# load_into_module() call.
total_bytes_str = convert_bytes(deserializer.total_tensor_bytes)
duration = end - start
per_second = convert_bytes(deserializer.total_tensor_bytes / duration)
after_mem = get_mem_usage()
deserializer.close()
print(f"Deserialized {total_bytes_str} in {end - start:0.2f}s, {per_second}/s")
print(f"Memory usage before: {before_mem}")
print(f"Memory usage after: {after_mem}")

and got the following output:

Deserialized 12.2 GB in 26.35s, 463.8 MB/s
Memory usage before: CPU: (maxrss: 14,341MiB F: 486,836MiB) GPU: (U: 309MiB F: 32,191MiB T: 32,501MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 26,839MiB F: 486,739MiB) GPU: (U: 23,749MiB F: 8,751MiB T: 32,501MiB) TORCH: (R: 23,348MiB/32,014MiB, A: 11,661MiB/23,315MiB)

I have a 12-core CPU, and for the GPU I am using a V100, but I didn't see much difference when using an A100 either; in both cases the speed was ~500 MB/s. I am having some issues internally installing the latest version, 2.8.0. I'll ask around and see what the issue is, but I was setting plaid_mode=True with 2.7.2.

Local mode
Yes, local mode was very fast in my experiment; reading from disk it reached speeds of up to 3 GB/s.

