
tensorizer's Introduction

tensorizer

Module, Model, and Tensor Serialization/Deserialization

TLDR

Extremely fast model loads from HTTP/HTTPS, Redis, and S3 endpoints. GPT-J (20GB) loads at wire-speed (~5GB/s) on a 40GbE network, and is only bottlenecked by the Linux kernel TCP stack.

Rationale

CoreWeave and our customers use KNative to deploy models as serverless functions. How long a model takes to load is a major factor in the latency of KNative scale-up. tensorizer is a tool to serialize models and their associated tensors into a single file that can be loaded quickly and efficiently off an HTTP/HTTPS or S3 endpoint.

By not embedding the model in the container image, we can reduce both the container image size and the time it takes to load the model. This is especially important for large models such as EleutherAI/gpt-neox-20B, which weighs in at ~40GB.

This decoupling of the model from the container image also allows us to update the model without having to rebuild the container image. This allows us to quickly iterate on the model and deploy new versions without having to wait for the container image to build or for the container image cache to be populated.

tensorizer has S3 support, so we can store the serialized model in S3 object storage, and perform streaming loads from S3. This allows us to stream the model directly from S3 into the container without having to download the model to the container's local filesystem. This also pertains to HTTP/HTTPS endpoints, as S3 is just an HTTP/HTTPS endpoint.

tensorizer also has support for loading models from a local filesystem, so you can use it to serialize models locally and load them locally. This is extremely fast, as the same principles that make it fast for HTTP/HTTPS and S3 endpoints also apply to local filesystems.

tensorizer has preliminary support for Redis, but it is not recommended for model deployment due to the lack of distributed caching. It is intended for sharing state between inference pods, or for loading data on a per-request basis from a Redis cache.

Speed

tensorizer's deserialization speed is primarily network-bound.

The following graph presents data collected from the scripts and Kubernetes manifests in examples/benchmark_buffer_size, comparing the various deserialization modes available in tensorizer release 2.5.0, along with the raw network speed and the speed of torch.load().

Figure: a letter-value plot comparing 7 deserialization modes at a granularity of 0.125 GiB/s. Median speeds for local files: torch.load() between 1.875 and 2.000 GiB/s; tensorizer file 2.250 GiB/s; tensorizer file with plaid_mode about 4.625 GiB/s; tensorizer file with lazy_load between 1.750 and 1.875 GiB/s. Median speeds for HTTP streaming: tensorizer http between 0.875 and 1.000 GiB/s; tensorizer http with plaid_mode between 1.000 and 1.125 GiB/s; tensorizer http with lazy_load between 0.875 and 1.000 GiB/s. The raw network speed has a median between 1.250 and 1.375 GiB/s.

Installation

From PyPI

tensorizer can be installed from PyPI with pip:

python -m pip install tensorizer

From Source

You can also install tensorizer from source using pip.

To clone the repository and install tensorizer in editable mode, run:

git clone https://github.com/coreweave/tensorizer
cd tensorizer
python -m pip install -e .

Or, run the following for pip to install tensorizer directly from GitHub:

python -m pip install git+https://github.com/coreweave/tensorizer

Basic Usage

Serialization is done with the TensorSerializer class. It takes a path_uri argument that can be a local filesystem path, an HTTP/HTTPS endpoint, or an S3 endpoint.

write_module is the main method of the TensorSerializer class. It takes a torch.nn.Module and serializes the tensors to the path_uri endpoint.

The below example serializes the EleutherAI/gpt-j-6B model to an S3 endpoint. It assumes that you have already configured your S3 credentials in ~/.s3cfg.

NOTE: Loading and serializing gpt-j-6B will take a lot of CPU RAM, up to ~20GB. Additionally, when loading gpt-j-6B into a GPU, you will need about ~16GB of VRAM. If you don't have that much RAM or VRAM, you can use the smaller gpt-neo-125M model instead.

NOTE2: The below examples require the transformers and accelerate libraries. You can install them with pip:

python -m pip install transformers accelerate

serialize.py

import torch
from tensorizer import TensorSerializer
from transformers import AutoModelForCausalLM

model_ref = "EleutherAI/gpt-j-6B"
# For less intensive requirements, swap above with the line below:
# model_ref = "EleutherAI/gpt-neo-125M"
model_name = model_ref.split("/")[-1]
# Change this to your S3 bucket.
s3_bucket = "bucket"
s3_uri = f"s3://{s3_bucket}/{model_name}.tensors"

model = AutoModelForCausalLM.from_pretrained(
    model_ref,
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

serializer = TensorSerializer(s3_uri)
serializer.write_module(model)
serializer.close()

Conversely, deserialization is done with the TensorDeserializer class. It takes a path_uri argument that can be a local filesystem path, an HTTP/HTTPS endpoint, or an S3 endpoint.

load_into_module is the main method of the TensorDeserializer class. It takes a torch.nn.Module and loads the tensors from the path_uri endpoint into the torch.nn.Module.

The below example loads the EleutherAI/gpt-j-6B model from an S3 endpoint.

deserialize-simple.py

import time
import torch
from tensorizer import TensorDeserializer
from tensorizer.utils import no_init_or_tensor, convert_bytes, get_mem_usage

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

model_ref = "EleutherAI/gpt-j-6B"
# To run this at home, swap this with the line below for a smaller example:
# model_ref = "EleutherAI/gpt-neo-125M"
model_name = model_ref.split("/")[-1]
# Change this to your S3 bucket.
s3_bucket = "bucket"
s3_uri = f"s3://{s3_bucket}/{model_name}.tensors"

config = AutoConfig.from_pretrained(model_ref)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# This ensures that the pretrained model weights are not initialized,
# and non-persistent buffers (generated at runtime) are on the correct device.
with torch.device(device), no_init_or_tensor():
    model = AutoModelForCausalLM.from_config(config)

print(f"Deserializing to {device}:")
before_mem = get_mem_usage()

# Load the tensors from S3 into the model.
start = time.perf_counter()
deserializer = TensorDeserializer(s3_uri, device=device)
deserializer.load_into_module(model)
end = time.perf_counter()

after_mem = get_mem_usage()

# Brag about how fast we are.
total_bytes_str = convert_bytes(deserializer.total_tensor_bytes)
duration = end - start
per_second = convert_bytes(deserializer.total_tensor_bytes / duration)
deserializer.close()
print(f"Deserialized {total_bytes_str} in {end - start:0.2f}s, {per_second}/s")
print(f"Memory usage before: {before_mem}")
print(f"Memory usage after: {after_mem}")

# Tokenize and generate
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_ref)
eos = tokenizer.eos_token_id
input_ids = tokenizer.encode(
    "¡Hola! Encantado de conocerte. hoy voy a", return_tensors="pt"
).to(device)

with torch.no_grad():
    output = model.generate(
        input_ids, max_new_tokens=50, do_sample=True, pad_token_id=eos
    )

print(f"Output: {tokenizer.decode(output[0], skip_special_tokens=True)}")

It should produce output similar to the following, with GPT-J-6B:

Deserialized model in 6.25 seconds
Test Output: ¡Hola! Encantado de conocerte. hoy voy a comentar por primera
vez una teoría de trineo, que quizá te parezca
algo desconocido, ya que en este mundo han
llegado a dominar tantos

More practical examples for the usage of tensorizer can be found in examples/hf_serialization.py, where df_main() serializes models from HuggingFace Diffusers and hf_main() serializes HuggingFace Transformers models.

Tensor Weight Encryption

tensorizer supports fast tensor weight encryption and decryption during serialization and deserialization, respectively.

Be aware that metadata (tensor names, dtypes, shapes, etc.) is not encrypted; only the weights themselves are.

Note

Refer to docs/encryption.md for details, instructions, and warnings on using tensorizer encryption correctly and safely.

To use tensorizer encryption, a recent version of libsodium must be installed. Install libsodium with apt-get install libsodium23 on Ubuntu or Debian, or follow the instructions in libsodium's documentation for other platforms.

Quick Encryption Example

The following outline demonstrates how to encrypt and decrypt a tensorized model with a randomly-generated encryption key:

from tensorizer import (
    EncryptionParams, DecryptionParams, TensorDeserializer, TensorSerializer
)

# Serialize and encrypt a model:

encryption_params = EncryptionParams.random()

serializer = TensorSerializer("model.tensors", encryption=encryption_params)
serializer.write_module(...)  # or write_state_dict(), etc.
serializer.close()

# Save the randomly-generated encryption key somewhere
with open("tensor.key", "wb") as key_file:
    key_file.write(encryption_params.key)


# Then decrypt it again:

# Load the randomly-generated key from where it was saved
with open("tensor.key", "rb") as key_file:
    key: bytes = key_file.read()
 
decryption_params = DecryptionParams.from_key(key)

deserializer = TensorDeserializer("model.tensors", encryption=decryption_params)
deserializer.load_into_module(...)
deserializer.close()

For more detail, refer to docs/encryption.md. A complete example is also available as examples/encryption.py. The EncryptionParams and DecryptionParams class docstrings additionally contain some usage information for quick reference from an IDE.

An example command line tool to add or remove encryption from existing serialized models is also available as examples/encryption.py.

Benchmarks

You can run your own benchmarks on CoreWeave or your own Kubernetes cluster by using the benchmark.yaml file in the examples/benchmark_buffer_size directory. Please see the README.

Available Pre-Tensorized Models on the CoreWeave Cloud

The following models are available on the CoreWeave Cloud for free and can be used with the TensorDeserializer class. The S3 support defaults to the accel-object.ord1.coreweave.com endpoint, with tensorized as the bucket to use.

We name the keys in the S3 bucket after the HuggingFace model identifier, and append the /fp16 suffix for the half-precision version.

For example, the S3 URI for the EleutherAI/gpt-j-6B model is: s3://tensorized/EleutherAI/gpt-j-6B/fp16/model.tensors
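As an illustration of this naming convention, the hypothetical helper below builds the URI for any of the language models listed in the table that follows (the function name is ours, not part of tensorizer):

def tensorized_s3_uri(hf_model_id: str, fp16: bool = False) -> str:
    # Keys follow the HuggingFace model identifier, with /fp16 appended
    # for the half-precision variant.
    suffix = "/fp16" if fp16 else ""
    return f"s3://tensorized/{hf_model_id}{suffix}/model.tensors"

# tensorized_s3_uri("EleutherAI/gpt-j-6B", fp16=True)
# -> "s3://tensorized/EleutherAI/gpt-j-6B/fp16/model.tensors"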

The below table shows the available models and their S3 URIs.

Large Language Models

Model Precision S3 URI
EleutherAI/gpt-neo-125M fp32 s3://tensorized/EleutherAI/gpt-neo-125M/model.tensors
EleutherAI/gpt-neo-125M fp16 s3://tensorized/EleutherAI/gpt-neo-125M/fp16/model.tensors
EleutherAI/gpt-neo-1.3B fp32 s3://tensorized/EleutherAI/gpt-neo-1.3B/model.tensors
EleutherAI/gpt-neo-1.3B fp16 s3://tensorized/EleutherAI/gpt-neo-1.3B/fp16/model.tensors
EleutherAI/gpt-neo-2.7B fp32 s3://tensorized/EleutherAI/gpt-neo-2.7B/model.tensors
EleutherAI/gpt-neo-2.7B fp16 s3://tensorized/EleutherAI/gpt-neo-2.7B/fp16/model.tensors
EleutherAI/gpt-j-6B fp32 s3://tensorized/EleutherAI/gpt-j-6B/model.tensors
EleutherAI/gpt-j-6B fp16 s3://tensorized/EleutherAI/gpt-j-6B/fp16/model.tensors
EleutherAI/gpt-neox-20b fp32 s3://tensorized/EleutherAI/gpt-neox-20b/model.tensors
EleutherAI/gpt-neox-20b fp16 s3://tensorized/EleutherAI/gpt-neox-20b/fp16/model.tensors
EleutherAI/pythia-70m fp32 s3://tensorized/EleutherAI/pythia-70m/model.tensors
EleutherAI/pythia-70m fp16 s3://tensorized/EleutherAI/pythia-70m/fp16/model.tensors
EleutherAI/pythia-1.4b fp32 s3://tensorized/EleutherAI/pythia-1.4b/model.tensors
EleutherAI/pythia-1.4b fp16 s3://tensorized/EleutherAI/pythia-1.4b/fp16/model.tensors
EleutherAI/pythia-2.8b fp32 s3://tensorized/EleutherAI/pythia-2.8b/model.tensors
EleutherAI/pythia-2.8b fp16 s3://tensorized/EleutherAI/pythia-2.8b/fp16/model.tensors
EleutherAI/pythia-6.9b fp32 s3://tensorized/EleutherAI/pythia-6.9b/model.tensors
EleutherAI/pythia-6.9b fp16 s3://tensorized/EleutherAI/pythia-6.9b/fp16/model.tensors
EleutherAI/pythia-12b fp32 s3://tensorized/EleutherAI/pythia-12b/model.tensors
EleutherAI/pythia-12b fp16 s3://tensorized/EleutherAI/pythia-12b/fp16/model.tensors
EleutherAI/pythia-70m-deduped fp32 s3://tensorized/EleutherAI/pythia-70m-deduped/model.tensors
EleutherAI/pythia-70m-deduped fp16 s3://tensorized/EleutherAI/pythia-70m-deduped/fp16/model.tensors
EleutherAI/pythia-1.4b-deduped fp32 s3://tensorized/EleutherAI/pythia-1.4b-deduped/model.tensors
EleutherAI/pythia-1.4b-deduped fp16 s3://tensorized/EleutherAI/pythia-1.4b-deduped/fp16/model.tensors
EleutherAI/pythia-2.8b-deduped fp32 s3://tensorized/EleutherAI/pythia-2.8b-deduped/model.tensors
EleutherAI/pythia-2.8b-deduped fp16 s3://tensorized/EleutherAI/pythia-2.8b-deduped/fp16/model.tensors
EleutherAI/pythia-6.9b-deduped fp32 s3://tensorized/EleutherAI/pythia-6.9b-deduped/model.tensors
EleutherAI/pythia-6.9b-deduped fp16 s3://tensorized/EleutherAI/pythia-6.9b-deduped/fp16/model.tensors
EleutherAI/pythia-12b-deduped fp32 s3://tensorized/EleutherAI/pythia-12b-deduped/model.tensors
EleutherAI/pythia-12b-deduped fp16 s3://tensorized/EleutherAI/pythia-12b-deduped/fp16/model.tensors
KoboldAI/fairseq-dense-125M fp32 s3://tensorized/KoboldAI/fairseq-dense-125M/model.tensors
KoboldAI/fairseq-dense-125M fp16 s3://tensorized/KoboldAI/fairseq-dense-125M/fp16/model.tensors
KoboldAI/fairseq-dense-355M fp32 s3://tensorized/KoboldAI/fairseq-dense-355M/model.tensors
KoboldAI/fairseq-dense-355M fp16 s3://tensorized/KoboldAI/fairseq-dense-355M/fp16/model.tensors
KoboldAI/fairseq-dense-2.7B fp32 s3://tensorized/KoboldAI/fairseq-dense-2.7B/model.tensors
KoboldAI/fairseq-dense-2.7B fp16 s3://tensorized/KoboldAI/fairseq-dense-2.7B/fp16/model.tensors
KoboldAI/fairseq-dense-6.7B fp32 s3://tensorized/KoboldAI/fairseq-dense-6.7B/model.tensors
KoboldAI/fairseq-dense-6.7B fp16 s3://tensorized/KoboldAI/fairseq-dense-6.7B/fp16/model.tensors
KoboldAI/fairseq-dense-13B fp32 s3://tensorized/KoboldAI/fairseq-dense-13B/model.tensors
KoboldAI/fairseq-dense-13B fp16 s3://tensorized/KoboldAI/fairseq-dense-13B/fp16/model.tensors
Salesforce/codegen-350M-mono fp32 s3://tensorized/Salesforce/codegen-350M-mono/model.tensors
Salesforce/codegen-350M-mono fp16 s3://tensorized/Salesforce/codegen-350M-mono/fp16/model.tensors
Salesforce/codegen-350M-multi fp32 s3://tensorized/Salesforce/codegen-350M-multi/model.tensors
Salesforce/codegen-350M-multi fp16 s3://tensorized/Salesforce/codegen-350M-multi/fp16/model.tensors
Salesforce/codegen-2B-multi fp32 s3://tensorized/Salesforce/codegen-2B-multi/model.tensors
Salesforce/codegen-2B-multi fp16 s3://tensorized/Salesforce/codegen-2B-multi/fp16/model.tensors
Salesforce/codegen-6B-mono fp32 s3://tensorized/Salesforce/codegen-6B-mono/model.tensors
Salesforce/codegen-6B-mono fp16 s3://tensorized/Salesforce/codegen-6B-mono/fp16/model.tensors
Salesforce/codegen-6B-multi fp32 s3://tensorized/Salesforce/codegen-6B-multi/model.tensors
Salesforce/codegen-6B-multi fp16 s3://tensorized/Salesforce/codegen-6B-multi/fp16/model.tensors
Salesforce/codegen-16B-mono fp32 s3://tensorized/Salesforce/codegen-16B-mono/model.tensors
Salesforce/codegen-16B-mono fp16 s3://tensorized/Salesforce/codegen-16B-mono/fp16/model.tensors
Salesforce/codegen-16B-multi fp32 s3://tensorized/Salesforce/codegen-16B-multi/model.tensors
Salesforce/codegen-16B-multi fp16 s3://tensorized/Salesforce/codegen-16B-multi/fp16/model.tensors

Generative Diffusion Models

Model Component Precision S3 URI
RunwayML/stable-diffusion-v1-5 VAE fp32 s3://tensorized/runwayml/stable-diffusion-v1-5/vae.tensors
RunwayML/stable-diffusion-v1-5 UNet fp32 s3://tensorized/runwayml/stable-diffusion-v1-5/unet.tensors
RunwayML/stable-diffusion-v1-5 TextEnc fp32 s3://tensorized/runwayml/stable-diffusion-v1-5/text_encoder.tensors
RunwayML/stable-diffusion-v1-5 VAE fp16 s3://tensorized/runwayml/stable-diffusion-v1-5/fp16/vae.tensors
RunwayML/stable-diffusion-v1-5 UNet fp16 s3://tensorized/runwayml/stable-diffusion-v1-5/fp16/unet.tensors
RunwayML/stable-diffusion-v1-5 TextEnc fp16 s3://tensorized/runwayml/stable-diffusion-v1-5/fp16/text_encoder.tensors
StabilityAI/stable-diffusion-2-1 VAE fp32 s3://tensorized/stabilityai/stable-diffusion-2-1/vae.tensors
StabilityAI/stable-diffusion-2-1 UNet fp32 s3://tensorized/stabilityai/stable-diffusion-2-1/unet.tensors
StabilityAI/stable-diffusion-2-1 TextEnc fp32 s3://tensorized/stabilityai/stable-diffusion-2-1/text_encoder.tensors
StabilityAI/stable-diffusion-2-1 VAE fp16 s3://tensorized/stabilityai/stable-diffusion-2-1/fp16/vae.tensors
StabilityAI/stable-diffusion-2-1 UNet fp16 s3://tensorized/stabilityai/stable-diffusion-2-1/fp16/unet.tensors
StabilityAI/stable-diffusion-2-1 TextEnc fp16 s3://tensorized/stabilityai/stable-diffusion-2-1/fp16/text_encoder.tensors
StabilityAI/stable-diffusion-xl-base-1.0 VAE fp32 s3://tensorized/stabilityai/stable-diffusion-xl-base-1.0/vae.tensors
StabilityAI/stable-diffusion-xl-base-1.0 UNet fp32 s3://tensorized/stabilityai/stable-diffusion-xl-base-1.0/unet.tensors
StabilityAI/stable-diffusion-xl-base-1.0 TextEnc fp32 s3://tensorized/stabilityai/stable-diffusion-xl-base-1.0/text_encoder.tensors
StabilityAI/stable-diffusion-xl-base-1.0 TextEnc2 fp32 s3://tensorized/stabilityai/stable-diffusion-xl-base-1.0/text_encoder_2.tensors
StabilityAI/stable-diffusion-xl-base-1.0 VAE fp16 s3://tensorized/stabilityai/stable-diffusion-xl-base-1.0/fp16/vae.tensors
StabilityAI/stable-diffusion-xl-base-1.0 UNet fp16 s3://tensorized/stabilityai/stable-diffusion-xl-base-1.0/fp16/unet.tensors
StabilityAI/stable-diffusion-xl-base-1.0 TextEnc fp16 s3://tensorized/stabilityai/stable-diffusion-xl-base-1.0/fp16/text_encoder.tensors
StabilityAI/stable-diffusion-xl-base-1.0 TextEnc2 fp16 s3://tensorized/stabilityai/stable-diffusion-xl-base-1.0/fp16/text_encoder_2.tensors

S3 Usage Notes

tensorizer uses the boto3 library to interact with S3. The easiest way to use tensorizer with S3 is to configure your S3 credentials in ~/.s3cfg.

If you don't want to use ~/.s3cfg, or wish to use a .s3cfg config file saved at a nonstandard location (e.g. under /var/run), you can instead specify your S3 credentials explicitly through the tensorizer.stream_io.open_stream() function, and pass the resulting stream into the TensorSerializer or TensorDeserializer constructor.

The stream_io.open_stream() function takes a path_uri argument, which can be an s3:// URI, and accepts the following keyword arguments:

  • s3_access_key_id: S3 access key ID
  • s3_secret_access_key: S3 secret access key
  • s3_endpoint: S3 endpoint

Or,

  • s3_config_path: Alternative filesystem path to a .s3cfg config file

For example:

from tensorizer import TensorSerializer, TensorDeserializer
from tensorizer.stream_io import open_stream

TensorSerializer(
    open_stream(s3_uri,
                "wb",
                s3_access_key_id=ACCESS_KEY,
                s3_secret_access_key=SECRET_KEY,
                s3_endpoint="object.ord1.coreweave.com"))

and...

TensorDeserializer(
    open_stream(s3_uri,
                "rb",
                s3_access_key_id=ACCESS_KEY,
                s3_secret_access_key=SECRET_KEY,
                s3_endpoint="object.ord1.coreweave.com"))

NOTE: For faster object downloads in the CoreWeave Cloud, you can use the accel-object.ord1.coreweave.com endpoint. This endpoint is optimized for object downloads, and will be faster than the object.ord1.coreweave.com endpoint once the object is cached.

NOTE2: The cache above does not get invalidated when the object is updated in S3. If you update an object in S3, you will need to wait for the cache to expire before you can download the updated version; the cache expires 24 hours after the last download.

For this reason, it is recommended to use a unique S3 key for each version of a model if you use the accel-object.ord1.coreweave.com endpoint.

Additional Features

tensorizer has a few additional features that make it more useful than just a serialization/deserialization tool.

Concurrent Reads

The TensorDeserializer class has a num_readers argument that controls how many threads are allowed to read concurrently from the source file. This can greatly improve performance, since in many cases the network or the file is the bottleneck. A few caveats to running with num_readers > 1:

  • The specified file must be able to be reopened, so that the TensorDeserializer can open more streams against the source.
    • Local files, paths, and HTTP(S) and S3 URIs / open streams are all able to be reopened
    • Special files like pipes and sockets, or synthetic file-like objects such as BytesIO are not currently able to be reopened
  • For HTTP(S) and S3 streams and URIs, the host must support the Range header. Each reader will read a stream from a different Range offset in the source.

The default is num_readers=1, which has no special requirements.
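A minimal sketch of enabling concurrent reads, building on the Basic Usage deserialization example (the URI and reader count here are just placeholders):

from tensorizer import TensorDeserializer

# "model" is assumed to be an uninitialized torch.nn.Module built from the
# matching config, as in the Basic Usage example above.
deserializer = TensorDeserializer(
    "s3://tensorized/EleutherAI/gpt-j-6B/fp16/model.tensors",
    device="cuda",
    num_readers=8,  # up to 8 concurrent streams against the source
)
deserializer.load_into_module(model)
deserializer.close()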

state_dict Support

The TensorDeserializer object can be used as-is as a state_dict for torch.nn.Module.load_state_dict. This is useful for loading the tensors into a torch.nn.Module that is already initialized, or for inspection.

Keep in mind that load_state_dict is not a fast operation, and will likely be much slower than load_into_module.

The state_dict can also be used to initialize a HuggingFace Transformers AutoModel. But HuggingFace Transformers performs three or more copies of the data, so memory use will explode.
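A minimal sketch of this usage, assuming model is an already-initialized torch.nn.Module whose architecture matches the serialized tensors:

from tensorizer import TensorDeserializer

deserializer = TensorDeserializer("model.tensors", device="cpu")

# The deserializer can be inspected like a mapping of tensor names to tensors...
print(list(deserializer.keys())[:5])

# ...or handed directly to load_state_dict (slower than load_into_module).
model.load_state_dict(deserializer)
deserializer.close()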

bfloat16 Support

Tensorizer supports models using the bfloat16 data type. However, tensorizer uses numpy to save the tensors as binary and numpy doesn't support bfloat16. This means that special conversions need to be applied.

To be saved, the torch tensor is reinterpreted as int16 before being converted to numpy, a bit-for-bit view that doesn't change any of the underlying data. When serialized, the original bfloat16 datatype string is also saved so that the data is cast back to bfloat16 during the deserialization process.

The complex32 datatype is supported in a similar way, by casting to int32. The quantized datatypes (qint8, qint32, etc.) are not currently supported by tensorizer as they would require supplemental quantization parameters to be deserialized correctly.

NOTE: The exact choice of intermediate types as int16 and int32 is considered an implementation detail, and is subject to change, so they should not be relied upon.

NOTE2: This does not interfere with storing actual int datatypes used in tensors in tensorized files.
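The snippet below only illustrates the same bit-for-bit round-trip principle in plain PyTorch; it is not tensorizer's internal code:

import torch

t = torch.tensor([1.5, -2.0], dtype=torch.bfloat16)

# Reinterpret the bfloat16 storage as int16 (no data change), then hand it to numpy.
as_numpy = t.view(torch.int16).numpy()

# On the way back, the recorded dtype is used to view the data as bfloat16 again.
restored = torch.from_numpy(as_numpy).view(torch.bfloat16)

assert torch.equal(t, restored)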

Numpy Support

Tensorizer can be used with numpy directly to read and write numpy.ndarrays.

The serializer's write_tensor function accepts both torch.Tensors and numpy.ndarrays.

The deserializer has a separate function read_numpy_arrays that will return the data as numpy.ndarrays.

As explained above in bfloat16 support, tensorizer uses special conversions to write "opaque" datatypes, those not supported by numpy. Therefore, special considerations need to be taken when loading such data as numpy.ndarrays.

By default, the TensorDeserializer.read_numpy_arrays function sets its allow_raw_data parameter to False. This means that if a file contains opaque datatypes, a ValueError will be raised during deserialization.

If you want to return the raw data regardless, set allow_raw_data to True. Otherwise, the file may be read with TensorDeserializer.read_tensors instead, which yields torch.Tensor objects of the correct datatype.

A fifth and sixth value are also yielded by the read_numpy_arrays generator. The fifth is a bool that indicates whether the returned array has an opaque datatype and requires special handling (only legal when allow_raw_data=True). The sixth is a string describing the true, non-numpy datatype that the raw data should be interpreted as in such cases. For all other datatypes that require no special handling, these are returned as False and None, respectively. The exact numpy datatypes used by the returned opaque numpy.ndarray objects are not guaranteed and should not be relied upon.
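A minimal sketch of consuming this generator follows. The unpacking assumes each yielded item ends with the array, the opaque-datatype flag, and the true-datatype string described above, and that lazy_load avoids materializing torch tensors up front; check the docstring for the exact tuple layout:

from tensorizer import TensorDeserializer

deserializer = TensorDeserializer("model.tensors", lazy_load=True)
for entry in deserializer.read_numpy_arrays(allow_raw_data=True):
    *_, arr, is_opaque, true_dtype = entry  # assumed ordering; see the docstring
    if is_opaque:
        # arr holds raw bytes of a non-numpy dtype (e.g. bfloat16); true_dtype names it.
        print(arr.shape, "opaque:", true_dtype)
    else:
        print(arr.shape, arr.dtype)
deserializer.close()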

Plaid mode

Older versions of Tensorizer had an argument called plaid_mode that reused buffers when copying to CUDA devices. This now happens automatically. plaid_mode and plaid_mode_buffers are left as arguments for backwards compatibility but are deprecated and have no effect.

Running Tests

tensorizer uses unittest for testing. The tests have their own set of dependencies, which can be installed with pip install -r tests/requirements.txt.

Some tests require a GPU, and will be skipped if no GPU is available. To run the tests, run the following in the root of the repository:

python -m pip install -e .
python -m pip install -r tests/requirements.txt
python -m unittest discover tests/ --verbose

Serialization in a subprocess

You may want to perform serialization in a separate process so that your main process can continue executing without getting bogged down by GIL contention. See docs/subprocess-serialization.md for more details.
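As one possible approach (a minimal sketch using the standard multiprocessing module, not necessarily the technique described in docs/subprocess-serialization.md), the model can be loaded and serialized entirely inside a child process:

import multiprocessing as mp

def _serialize(model_ref: str, dest: str) -> None:
    import torch
    from transformers import AutoModelForCausalLM
    from tensorizer import TensorSerializer

    model = AutoModelForCausalLM.from_pretrained(model_ref, torch_dtype=torch.float16)
    serializer = TensorSerializer(dest)
    serializer.write_module(model)
    serializer.close()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    p = ctx.Process(
        target=_serialize,
        args=("EleutherAI/gpt-neo-125M", "gpt-neo-125M.tensors"),
    )
    p.start()
    # The parent process is free to do other work here while the child serializes.
    p.join()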

tensorizer's People

Contributors

arsenetar, bchess, dmarx, eta0, harubaru, lizzzcai, navarrepratt, robbat2, rtalaricw, sangstar, snyk-bot, spmkone, wbrown


tensorizer's Issues

Deserialisation issue: KeyError: "attribute 'bias' already exists"

I am trying to use tensorizer to serialize/deserialize the following model on HF: TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ, however I am getting an error that I am unsure how to resolve.

The model serializes correctly but on deserialization I get the error: KeyError: "attribute 'bias' already exists"

Code to reproduce:

pip install tensorizer accelerate transformers auto-gptq optimum

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, AutoConfig
from tensorizer import TensorDeserializer, TensorSerializer
from tensorizer.utils import no_init_or_tensor
import time
import sys

model_name_or_path = "TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=True,
                                             revision="main")


def serialise_model(model, save_path):
    try:
        serializer = TensorSerializer(save_path)
        start = time.time()
        serializer.write_module(model)
        end = time.time()
        print((f"Serialising model took {end - start} seconds"),  file=sys.stderr)
        serializer.close()
        return True
    except Exception as e:
        print("Serialisation failed with error: ", e,  file=sys.stderr)
        return False

serialise_model(model, "./test.tensors")

def deserialise_saved_model(model_path, model_id, plaid=True):
    config = AutoConfig.from_pretrained(model_id)

    print(("Initialising empty model"),  file=sys.stderr)
    start = time.time()
    with no_init_or_tensor():
        model = AutoModelForCausalLM.from_config(config)
    end_init = time.time() - start

    deserializer = TensorDeserializer(model_path, plaid_mode=True)

    print(("Loading model"),  file=sys.stderr)
    start = time.time()
    deserializer.load_into_module(model)
    end = time.time()
    deserializer.close()

    print(f"Initialising empty model took {end_init} seconds",  file=sys.stderr)
    print((f"\nDeserialising model took {end - start} seconds\n"),  file=sys.stderr)

    return model
    
model = deserialise_saved_model("./test.tensors", "TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ")

Error Trace:

Initialising empty model
Loading model
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 model = deserialise_saved_model("./test.tensors", "TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ")

Cell In[4], line 28, in deserialise_saved_model(model_path, model_id, plaid)
     26 print(("Loading model"),  file=sys.stderr)
     27 start = time.time()
---> 28 deserializer.load_into_module(model)
     29 end = time.time()
     30 deserializer.close()

File /usr/local/lib/python3.10/dist-packages/tensorizer/serialization.py:1855, in TensorDeserializer.load_into_module(self, m, filter_func, verify_hash)
   1853     module.register_parameter(attr, tensor)
   1854 elif entry.type is TensorType.BUFFER:
-> 1855     module.register_buffer(attr, tensor)
   1856 elif entry.type is TensorType.STATE_DICT:
   1857     raise NotImplementedError(
   1858         "This was serialized using"
   1859         " TensorSerializer.write_state_dict(), so it cannot be"
   (...)
   1862         " state_dict mapping instead."
   1863     )

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:538, in Module.register_buffer(self, name, tensor, persistent)
    536     raise KeyError("buffer name can't be empty string \"\"")
    537 elif hasattr(self, name) and name not in self._buffers:
--> 538     raise KeyError(f"attribute '{name}' already exists")
    539 elif tensor is not None and not isinstance(tensor, torch.Tensor):
    540     raise TypeError(f"cannot assign '{torch.typename(tensor)}' object to buffer '{name}' "
    541                     "(torch Tensor or None required)"
    542                     )

KeyError: "attribute 'bias' already exists"

Release 2.9.0?

Thanks for the library, it's great!

The addition of n_readers #87 massively improves the performance, and the README.md refers to the setting. Is there a plan to release 2.9.0 with this feature?

Add encryption and hash verification to tensorizer

Having the ability to have tensors encrypted at rest, in cache, and while streamed would help mitigate some customer concerns about using object storage. tensorizer would accept a key as part of the serialization and deserialization class instantiation.

Some requirements:

  • fast decryption, preferably at wire speed
  • no library specific implementation of ciphers that are not generally available
  • verification of tensor content matching signature or hash

Deliverable:

  • functionality
  • documentation
  • benchmarks of impact

Tensorizer Serializer Image

Converting models using tensorizer can be an intensive and expensive operation. So we should make it into a container that can be invoked as a job. This should support LLMs and diffusion models.

Support self-signed certificates in `CURLStreamFile`

The current construction of the curl command in CURLStreamFile does not allow for self-signed https downloads from s3 buckets.

Why

Buckets created on local NAS devices (like QNAP/QuObjects) or local workflows built around minio containers use self-signed certificates for their HTTPS endpoints. Using tensorizer natively in these environments is pretty useful for dev iterations and workflows before actually deploying.

Solution

A possible solution is that curl allows bypassing the check altogether with the -k flag (reference). Of course, it may also be possible to provide the cert to the curl call as an alternative. However, for these on-prem workflows it may be OK to go with the insecure curl call.

Possible Implementation

Propagate an allow_insecure boolean flag through open_stream to CURLStreamFile, which adds the -k flag to the curl command.

def open_stream(
    ...,
    allow_insecure: bool = False,
):
    ...


class CURLStreamFile:
    def __init__(
        self,
        *,
        allow_insecure: bool = False,
    ):
        # Individual flag or a self._additional_curl_flags list,
        # which could also be used in _reproduce_and_capture_error()
        kflag = "-k" if allow_insecure else ""

        cmd = [
            curl_path,
            ...,
            kflag,
            uri,
        ]

I can put up a PR for this, if ok.

Unintuitive behaviour of `verify_hash` parameter to `load_into_module`

Issues with load_into_module's verify_hash Parameter

The current behaviour of the verify_hash parameter is unintuitive when specified in TensorDeserializer.load_into_module but not the TensorDeserializer constructor when lazy_load=False, e.g. when used like this:

deserializer = TensorDeserializer(file_obj)
deserializer.load_into_module(m, verify_hash=True)

The end result is that tensor hashes are not verified at all, which is probably not what people expect, and could lead to bugs.

Why is it like that?

Changing verify_hash after TensorDeserializer construction only affects newly-deserialized tensors. If lazy_load=False, then load_into_module does not deserialize any tensors—it uses preloaded cached copies, and their hashes are never checked.

With that in mind, the correct way to write the above code snippet is:

deserializer = TensorDeserializer(file_obj, verify_hash=True)
deserializer.load_into_module(m)

But the massive difference between the two is not obvious.

What should happen?

When a tensor is pulled from the cache and the value of verify_hash has changed to True, the method should verify the already-loaded tensor's hash (and cache the result). This would make either of the above code snippets perform equivalent verification.

Potential implementation

TensorDeserializer._verify_hashes should cache True results in self._metadata[name]["verified_hash"].

When accessing a cached tensor through TensorDeserializer.__getitem__ or TensorDeserializer.get, check the current state of self._verify_hash. If True, check self._metadata[name]["verified_hash"]. If None/False/not present, call self._verify_hashes before returning the tensor. This will check hashes only for tensors that were previously loaded, but never yet verified.

Is there support for downloading via s5?

First of all, thanks for the awesome project!
I was just curious if there was an option or a plan to support downloading the model files via s5 which allows us to obtain even higher network throughput?

Fail to deserialize model: Expected all tensors to be on the same device

I am trying out tensorizer and I can serialize the model using hf_serialization.py. However, when I try to deserialize the model using deserialize.py from a local path, I get the error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!.

The model I am testing is "EleutherAI/gpt-neo-125M"; the same error occurs for "EleutherAI/gpt-j-6B".

error:

root@nginx:/workspace/tensorizer/examples# python3 deserialize_local.py
Deserialized 327.9 MB in 0.12s, 2.6 GB/s
Memory usage before: CPU: (maxrss: 985MiB F: 22,761MiB) GPU: (U: 256MiB F: 22,467MiB T: 22,723MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 1,141MiB F: 22,757MiB) GPU: (U: 606MiB F: 22,117MiB T: 22,723MiB) TORCH: (R: 330MiB/330MiB, A: 319MiB/319MiB)
Traceback (most recent call last):
  File "/workspace/tensorizer/examples/deserialize_local.py", line 67, in <module>
    output = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1525, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2622, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 975, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 843, in forward
    outputs = block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 570, in forward
    attn_outputs = self.attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 522, in forward
    return self.attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 283, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 237, in _attn
    attn_weights = torch.where(causal_mask, attn_weights, mask_value)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

tensorizer: 2.8.1
transformers: 4.37.2
torch: 2.1.2
accelerate: 0.26.1
machine type: g5.2xlarge

nvidia-smi:

root@nginx:/workspace/tensorizer/examples# nvidia-smi
Mon Feb 26 03:01:07 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   29C    P8              16W / 300W |      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I also try out "mistralai/Mistral-7B-Instruct-v0.1" which I got a different error: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Let me know if anything needed, thanks.

ValueError: Tensor index in the file is empty

I'm trying to serialize the 'runwayml/stable-diffusion-inpainting' model using the hf_serialize.py script in examples/. First, it seems to happen way too fast - in a matter of seconds. How long should we expect serialization to take for a 5GB model? Getting the error below.

$ poetry run scripts/hf_tensorize.py 'runwayml/stable-diffusion-inpainting' tensorized2 --model-type diffusers --validate --force
/home/x/.cache/pypoetry/virtualenvs/x-SHelyAR8-py3.10/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/x/.cache/pypoetry/virtualenvs/x-SHelyAR8-py3.10/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
2024-02-06 13:48:10,240 INFO hf_tensorize.py(113412) - MODEL PATH: runwayml/stable-diffusion-inpainting
2024-02-06 13:48:10,240 INFO hf_tensorize.py(113412) - OUTPUT PREFIX: tensorized2
unet/diffusion_pytorch_model.safetensors not found
Keyword arguments {'use_auth_token': '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'} are not expected by StableDiffusionPipeline and will be ignored.
Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: 

pip install accelerate

.
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████| 7/7 [00:10<00:00,  1.57s/it]
2024-02-06 13:48:21,749 INFO hf_tensorize.py(113412) - Serializing model
2024-02-06 13:48:21,749 INFO hf_tensorize.py(113412) - GPU: NVIDIA GeForce RTX 3070 Ti Laptop GPU
2024-02-06 13:48:21,898 INFO hf_tensorize.py(113412) - PYTHON USED RAM: CPU: (maxrss: 9,485MiB F: 375MiB) GPU: (U: 167MiB F: 7,806MiB T: 7,973MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
2024-02-06 13:48:21,899 INFO hf_tensorize.py(113412) - Writing config to tensorized2/encoder-config.json
2024-02-06 13:48:21,908 INFO hf_tensorize.py(113412) - Writing tensors to tensorized2/encoder.tensors
2024-02-06 13:48:21,914 INFO hf_tensorize.py(113412) - Writing config to tensorized2/vae-config.json
2024-02-06 13:48:21,914 INFO hf_tensorize.py(113412) - Writing tensors to tensorized2/vae.tensors
2024-02-06 13:48:21,923 INFO hf_tensorize.py(113412) - Writing config to tensorized2/unet-config.json
2024-02-06 13:48:21,923 INFO hf_tensorize.py(113412) - Writing tensors to tensorized2/unet.tensors
2024-02-06 13:48:22,062 INFO hf_tensorize.py(113412) - Writing tensorized2/tokenizer.zip
2024-02-06 13:48:22,182 INFO hf_tensorize.py(113412) - Writing tensorized2/scheduler.zip
2024-02-06 13:48:23,235 INFO hf_tensorize.py(113412) - Validating serialization
2024-02-06 13:48:23,235 INFO hf_tensorize.py(113412) - Loading tensorized2/vae-config.json
2024-02-06 13:48:23,258 INFO hf_tensorize.py(113412) - Loading tensorized2/vae.tensors, CPU: (maxrss: 9,485MiB F: 6,741MiB) GPU: (U: 5,481MiB F: 2,492MiB T: 7,973MiB) TORCH: (R: 5,314MiB/5,314MiB, A: 5,257MiB/5,257MiB)
Traceback (most recent call last):
  File "/home/x/workspace/x/scripts/hf_tensorize.py", line 579, in <module>
    main()
  File "/home/x/workspace/x/scripts/hf_tensorize.py", line 571, in main
    df_main(args)
  File "/home/x/workspace/x/scripts/hf_tensorize.py", line 414, in df_main
    vae = load_model(
  File "/home/x/workspace/x/scripts/hf_tensorize.py", line 307, in load_model
    with _read_stream(tensors_uri) as tensor_stream, TensorDeserializer(
  File "/home/x/.cache/pypoetry/virtualenvs/x-SHelyAR8-py3.10/lib/python3.10/site-packages/tensorizer/serialization.py", line 1549, in __init__
    raise ValueError("Tensor index in the file is empty")
ValueError: Tensor index in the file is empty

pyproject.toml

[tool.poetry.dependencies]
python = ">=3.10,<3.12"
distro-info = "1.0"
scikit-build = "^0.16.7"
pillow = "^10.2.0"
ninja = "^1.11.1"
diffusers = "^0.15.0"
triton = "^2.2.0"
transformers = "^4.37.2"
pynsq = "^0.9.1"
compel = "^2.0.2"
scipy = "^1.12.0"
torch = "^2.2.0"
xformers = "^0.0.24"
accelerate = "^0.26.1"
deepspeed = "^0.13.1"
tensorizer = "^2.7.2"

Eager vs. Lazy Loading in Plaid Mode

Plaid Mode Eagerness

TensorDeserializer's plaid_mode parameter was supposed to imply lazy_load, but it didn't until 2e129b0. This has strange implications, because this mode was not expected to work without lazy loading enabled.

Lazy vs. Eager Loading

Tensorizer supports two deserialization modes, eager and lazy, determined by the lazy_load parameter to the TensorDeserializer() constructor.

Eager Loading Mode

In eager mode, all tensors are loaded from the disk or network to their destination device up-front, during the TensorDeserializer() constructor call, and all data accesses after the deserializer's instantiation reach in and take a cached tensor from the internal TensorDeserializer._cache instance variable.

Lazy Loading Mode

In lazy loading mode, no tensors are loaded during the constructor, and each is instead loaded on-demand when you attempt to access deserializer[<key>], or iterate through a deserializer's entries (either in a loop or implicitly through deserializer.load_into_module()).

Normal lazy loading mode retains all loaded tensors in the deserializer's cache, but it has been assumed that plaid_mode could not do this while maintaining correctness.

Plaid Mode

Whereas the standard loading mode pre-allocates enough memory to hold all of the tensors expected to be loaded simultaneously in RAM, the plaid_mode optimization shares a single CPU memory region for all tensors being loaded, only as large as the single largest tensor, and overwrites each previous tensor's data upon loading the next tensor. plaid_mode is only legal for loading GPU tensors.

It had been our understanding that it was not valid for multiple tensors loaded this way to exist simultaneously, and that it would lead to corruption of the internal cache if attempted. The rationale for this is that when plaid_mode=True, entries that are loaded later will overwrite the tensor data in the CPU buffer associated with still-in-use tensors, and thus the contents of older cache entries may be invalidated. plaid_mode (originally called oneshot) was then intended as an optional flag to enable sharing a buffer for such loads, with the restriction that tensors could only be streamed—i.e., read and used before continuing, and that you could not go backwards again to access older tensors.

This is challenged by the existence of this bug, as the actual behaviour of specifying plaid_mode=True along with lazy_load=False has been to load all tensors up-front during the constructor, first from disk/network into the shared CPU buffer, then onto the GPU, whereupon each is stored in the cache and made available for later cached accesses. All of these GPU tensors appear to work fine, none conflicting with the others' existence.

The reason for this logically seems to be that each finished GPU tensor is no longer associated with the shared CPU buffer once they've been offloaded to the GPU (which happens immediately after loading, before continuing to the next tensor), so they can't actually interfere with each other, and everything is fine.

This behaviour makes sense, but we aren't sure whether or not to trust it, because the aforementioned behaviour of tensors corrupting one another was reportedly observed at some point during development, and we don't have a solid reason why it should be working fine like this now, it just is.

What went wrong with all the code ensuring deletion of older tensors?

The intended caching situation for a lazy-loading plaid_mode was as follows, keeping one item in the cache at a time:

deserializer = TensorDeserializer(..., plaid_mode=True)
# No tensors are loaded yet
_ = deserializer[<first key>]
# Now the <first key> tensor is loaded and temporarily cached
_ = deserializer[<first key>]
# The <first key> tensor is simply pulled from the cache
_ = deserializer[<second key>]
# The <first key> tensor is cleared from the cache, and then the <second key> tensor is loaded and then cached
_ = deserializer[<first key>]
# This raises an error, because the <first key> tensor has already been marked as unavailable in the cache

(Note that there is a distinction between keys that have never been loaded and ones that have already been loaded and then evicted.)

However, a bug caused the cache to be directly pre-populated even in plaid_mode during the constructor unless lazy_load=True was specified, skipping over the __getitem__ logic for cache deletions, so it instead did this:

deserializer = TensorDeserializer(..., plaid_mode=True)
# All tensors are loaded
_ = deserializer[<first key>]
# The <first key> tensor is pulled from the cache
_ = deserializer[<first key>]
# The <first key> tensor is pulled from the cache (again)
_ = deserializer[<second key>]
# The <first key> tensor is cleared from the cache, and <second key> is pulled from the cache
_ = deserializer[<first key>]
# This raises an error, because the <first key> tensor has already been marked as unavailable in the cache

Starting with everything cached, and then slowly clearing them away. This meant that the behaviour of "not being able to go back" was enforced correctly, yet only artificially, because they weren't actually being streamed.

What Happens Now?

Since the release of Tensorizer v1.0.0 debuting plaid_mode, we have neither encountered reports of it corrupting loaded tensors, nor have any of the developers ever seen it occur when running the test suite. This seems to suggest that plaid_mode does not need a lazy_load restriction (though it does still need its GPU-only restriction). However, suspicion remains as to why it didn't work that way before, and seemingly does work now, and whether running that way is truly guaranteed to be correct.

For now, the bug is fixed in 2e129b0 and plaid_mode implies lazy_load as it was originally intended to do. This may be updated to allow disabling lazy_load later, if we have confidence that it works.

Tensorizer Support for Large Models (70B+) that don't fit into a single GPU

I'm currently evaluating Tensorizer for handling large models, specifically models with 70B or more parameters that cannot fit into a single GPU.

I have a few questions and concerns regarding Tensorizer's support for such large models and its handling of .tensors files.

  1. Large Model Support: Can Tensorizer effectively handle large models with parameters in the range of 70B and above with multiple distributed GPUs? I haven't seen any benchmarks for streaming large models that cannot fit into a single A100 GPU. Specifically, can Tensorizer scale effectively to accommodate large models across distributed environments, such as distributed training setups with multiple GPUs or nodes?

  2. .tensors File Format: Could you provide some insights into the .tensors file format used by Tensorizer? How does Tensorizer organize and store tensors within .tensors files, especially for large models? Does it employ any sharding or distribution mechanisms for managing large tensors?

I appreciate any insights or documentation you can provide regarding these questions. Understanding Tensorizer's capabilities and limitations will help us make informed decisions for our use case.

support setting S3 region and signature version in stream_io

I am trying to use TensorDeserializer to deserialize the model from an AWS S3 bucket but am getting a 400 Bad Request error. However, using TensorSerializer to serialize the model to the S3 bucket has no issue.

After some testing, I found that changing the signature_version to s3v4 and setting a proper region_name makes things work. From the AWS docs, the v2 signature version is deprecated (getting Access Denied), and region is used to override the default value, which points to us-east-1.

Jax support

I've recently noticed a lot of complaints (current and historic) within the jax community related to model loading, and this got me thinking about potential opportunities for tensorizer. I think part of the headache in the jax community is specifically associated with the pytree data structure. With the recent addition of support for nested structures in tensorizer, I'm wondering if maybe this tool is positioned to address some of the needs of the jax community.

A complication worth noting here is framework diversity. At present, tensorizer is specifically geared towards loading data into pytorch modules. Targeting this (reasonably) high level object has thus far essentially addressed the entire pytorch ecosystem without complication (e.g. it doesn't matter if someone is using huggingface libraries or pytorch-lightning). The jax ecosystem is still fairly diverse and unstable. It's unclear to me if the jax ecosystem would be satisfied by a general works-for-anything-jax-y solution (e.g. loading into a pytree or some other analog of the current loading-tensors-into-nn.Modules tensorizer paradigm) vs. framework specific complexities associated with jax/flax/haiku/whatever.

This also might be the kind of thing that it would make more sense to not be concerned with until we have a customer asking for it. Which just to be clear: I'm not. This is a general suggestion for a feature that could potentially grow the customer base.

In the event that jax is actually already supported by the current feature set, we should add some demonstrations to the docs.

`mmap` issues with loading models from PVC as of `tensorizer` `v2.4`

Customers and internal users have observed issues with loading tensorizer files from PVCs on CoreWeave. This works on local filesystems, however.

    deserializer = TensorDeserializer(f"{checkpoint_name}.tensors")
  File "/usr/local/lib/python3.10/dist-packages/tensorizer/serialization.py", line 783, in __init__
    self._buffer = mmap.mmap(-1, self.total_tensor_bytes, **mmap_args)
OSError: [Errno 22] Invalid argument
