
ai-worker's People

Contributors

ad-astra-video, eliteprox, mikezupper, rickstaa, strykar, yondonfu


ai-worker's Issues

Batch size too large will crash pipeline with 500 errors

Describe the bug

If a batch size that is too large for the GPU is requested, the container starts throwing 500 errors. I believe the container itself does not crash in this case, but it is a bad experience for the user.

Reproduction steps

  1. Send a text-to-image request with a batch size of 10 to the ByteDance/SDXL-Lightning model.
  2. The ai-runner container starts throwing 500 errors as the retries continue, on an RTX 4090 or a GPU with less VRAM.

Expected behaviour

O/T should be able to specify a max batch size, with a default of 1. Any batch size above the configured max is run sequentially in the ai-runner to produce the requested batch of images.

For ByteDance/SDXL-Lightning, processing a batch of images in one request takes roughly the same time as processing the same number of images sequentially (1 image takes ~700ms, 3 images take ~2.1s when batched together). It's not exact, but it's close enough that the user experience would be very similar whether the batch is generated in one call or image by image. I don't expect this to be the case for all models, though. Testing and experience would drive how this works model to model, but a max batch size of 1 should be a safe start.

For some requests it may make sense to start returning images, or making them available for download by the B, as they are processed. For fast models like ByteDance/SDXL-Lightning this is probably not a concern.

It could also be argued that, if tickets are sized to the pixels requested, the batch request should be split between multiple Os to get the batch done faster.
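A minimal sketch of how the runner could enforce this, splitting a requested batch into sequential sub-batches; MAX_BATCH_SIZE and generate_batch are illustrative names, not existing ai-runner code:

import os

# Assumed env-driven cap (default 1); not an existing ai-runner setting.
MAX_BATCH_SIZE = int(os.getenv("MAX_BATCH_SIZE", "1"))

def generate_batch(pipeline, prompt, requested, **kwargs):
    # Run the requested batch in chunks of at most MAX_BATCH_SIZE images,
    # so an oversized request degrades to sequential generation instead of OOM.
    images = []
    remaining = requested
    while remaining > 0:
        chunk = min(remaining, MAX_BATCH_SIZE)
        result = pipeline(prompt, num_images_per_prompt=chunk, **kwargs)
        images.extend(result.images)
        remaining -= chunk
    return images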

Severity

None

Screenshots / Live demo link

No response

OS

Linux

Running on

Docker

AI-worker version

latest (alpha testnet)

Additional context

No response

Impl basic benchmarking tool that can be built from this repo

  • There should be a standard set of metrics output by the tool.
  • Should be able to select a specific pipeline:model combination.

Since we're initially starting with sequential inference requests, the tool won't support batching yet, but it can support it later on to show the impact of batch size on the metrics.
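A sketch of the CLI surface such a tool could expose; the flag names follow the bench.py invocations shown in the research reports further down this page, but the metrics listed at the end are only a suggestion:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Benchmark a pipeline:model combination"
    )
    parser.add_argument("--pipeline", required=True, help="e.g. text-to-image")
    parser.add_argument("--model_id", required=True, help="e.g. ByteDance/SDXL-Lightning")
    parser.add_argument("--runs", type=int, default=10, help="number of timed runs")
    parser.add_argument("--batch_size", type=int, default=1)
    return parser.parse_args()

# Suggested standard metrics: pipeline load time, max GPU memory
# allocated/reserved, avg inference time, avg inference time per output.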

Enhancing VRAM Usage and Inference Speed with Diffusers Optimizations

We're exploring various optimizations available in the Diffusers library to reduce VRAM usage and improve inference speed. @Titan-Node is currently benchmarking these optimizations across his GPU pool and the Livepeer network, using his ai-benchmark wrapper, to evaluate their effectiveness. Preliminary results are documented in this community spreadsheet.

Objective

The goal is to identify and implement the most impactful optimizations for improving the performance of AI models, focusing on inference speed and efficient VRAM usage while also keeping an eye on the quality of the results.

Current Optimizations

The following optimizations are already integrated into our codebase:

  • Half Precision: Half-precision weights are used to speed up inference and reduce memory consumption, implemented in the ai-worker/image_to_video pipeline (see the sketch below).
  • SFAST (xformers & Triton): Adopted from stable-fast; currently speeds up inference and may reduce memory usage in the future. See the implementation in the ai-worker/sfast pipeline.
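For reference, half precision in Diffusers is typically just a dtype argument at load time; a minimal sketch (the model ID is illustrative):

import torch
from diffusers import StableDiffusionXLPipeline

# Loading fp16 weights roughly halves VRAM usage and speeds up inference.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")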

Future Explorations

Links and Resources

ValueError: mean must have 1 elements if it is an iterable, got 3 for certain images with svd-xt and svd-xt-film

https://platform.stability.ai/svd_kitten_init.png

https://platform.stability.ai/svd_alien_init.png

The above images need to be converted to RGB i.e. PIL.Image.open(image_path).convert("RGB") or else they will cause the following error with the svd-xt or svd-xt-film container.

Traceback (most recent call last):
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/cog/server/worker.py", line 218, in _predict
result = predict(**payload)
^^^^^^^^^^^^^^^^^^
File "/src/predict.py", line 23, in predict
self.pipeline(
File "/src/pipelines/svd_film.py", line 29, in __call__
frames = self.svd_xt_pipeline(
^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py", line 424, in __call__
image_embeddings = self._encode_image(image, device, num_videos_per_prompt, self.do_classifier_free_guidance)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py", line 129, in _encode_image
image = self.feature_extractor(
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/transformers/image_processing_utils.py", line 549, in __call__
return self.preprocess(images, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/transformers/models/clip/image_processing_clip.py", line 316, in preprocess
images = [
^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/transformers/models/clip/image_processing_clip.py", line 317, in <listcomp>
self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/transformers/image_processing_utils.py", line 619, in normalize
return normalize(
^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/transformers/image_transforms.py", line 386, in normalize
raise ValueError(f"mean must have {num_channels} elements if it is an iterable, got {len(mean)}")
ValueError: mean must have 1 elements if it is an iterable, got 3
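A minimal sketch of the workaround described above; forcing RGB gives the CLIP feature extractor the three channels it expects:

from PIL import Image

def load_rgb(image_path: str) -> Image.Image:
    # Greyscale or palette PNGs trip up the feature extractor's normalize
    # step; converting to RGB before passing the image to the pipeline avoids it.
    return Image.open(image_path).convert("RGB")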

Figure out queue strat for worker

Cog no longer has a queue API so the containers themselves cannot queue requests.

Options:

  • Switch from Cog to something that does come with a queue
  • Implement a queue in the worker code itself that pulls requests off the queue to send to containers
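A rough sketch of the second option, assuming an asyncio-based worker; request_queue and send_to_container are hypothetical names:

import asyncio

request_queue: asyncio.Queue = asyncio.Queue()

async def queue_worker(send_to_container):
    # Pull requests off the queue one at a time and forward them to the
    # runner container, so the container itself never has to queue.
    while True:
        request, future = await request_queue.get()
        try:
            future.set_result(await send_to_container(request))
        except Exception as exc:
            future.set_exception(exc)
        finally:
            request_queue.task_done()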

Add support for stable-fast acceleration of diffusion pipelines

https://github.com/chengzeyi/stable-fast

This speeds up inference, with the tradeoff that the first request will be slower because the model is compiled dynamically. We should be able to toggle stable-fast on/off via an environment variable that can be set by a user of the runner.

See previous impl on old-cog branch.

Worth checking if this is still the best option for accelerating diffusion pipelines (i.e. is it still faster than TensorRT)?
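A hedged sketch of the toggle, keyed off the STABLEFAST environment variable used elsewhere in this repo; the import path follows the stable-fast README and may differ across versions:

import os

SFAST_ENABLED = os.getenv("STABLEFAST", "").strip().lower() == "true"

def maybe_compile(pipe):
    if not SFAST_ENABLED:
        return pipe
    # Imported lazily so the dependency is only required when enabled.
    from sfast.compilers.diffusion_pipeline_compiler import (
        CompilationConfig,
        compile,
    )

    config = CompilationConfig.Default()
    config.enable_xformers = True
    config.enable_triton = True
    config.enable_cuda_graph = True
    return compile(pipe, config)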

[go-livepeer] Nil pointer with ImageToVideo handler on O

I0202 15:58:04.980337       1 ai_http.go:129] clientIP=REDACTED Received ImageToVideo request imageSize=622849 model_id=stabilityai/stable-video-diffusion-img2vid-xt
2024/02/02 15:58:05 http2: panic serving REDACTED: runtime error: invalid memory address or nil pointer dereference
goroutine 1843 [running]:
net/http.(*http2serverConn).runHandler.func1()
        /usr/local/go/src/net/http/h2_bundle.go:6186 +0x13b
panic({0x3b6d40?, 0x31478d0?})
        /usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/livepeer/go-livepeer/core.(*LivepeerNode).imageToVideo(0xc0005029c0, {0xb04710, 0xc00063ab40}, {0x0, 0x0, {0xc0006ba360, {0x0, 0x0, 0x0}, {0x0, ...}}, ...})
        /src/core/orchestrator.go:906 +0xb7
github.com/livepeer/go-livepeer/core.(*orchestrator).ImageToVideo(0x4e?, {0xb04710?, 0xc00063ab40?}, {0x0, 0x0, {0xc0006ba360, {0x0, 0x0, 0x0}, {0x0, ...}}, ...})
        /src/core/orchestrator.go:117 +0x58
github.com/livepeer/go-livepeer/server.startAIServer.(*lphttp).ImageToVideo.func4({0xaffb90, 0xc0007ae008}, 0xb?)
        /src/server/ai_http.go:132 +0x2ad
net/http.HandlerFunc.ServeHTTP(0xc00090a100?, {0xaffb90?, 0xc0007ae008?}, 0xc000f12a00?)
        /usr/local/go/src/net/http/server.go:2136 +0x29
github.com/oapi-codegen/nethttp-middleware.OapiRequestValidatorWithOptions.func1.1({0xaffb90, 0xc0007ae008}, 0x1?)
        /go/pkg/mod/github.com/oapi-codegen/[email protected]/oapi_validate.go:64 +0xed
net/http.HandlerFunc.ServeHTTP(0x18fe59e?, {0xaffb90?, 0xc0007ae008?}, 0x1ac84df?)
        /usr/local/go/src/net/http/server.go:2136 +0x29
net/http.(*ServeMux).ServeHTTP(0xc000268e10?, {0xaffb90, 0xc0007ae008}, 0xc00090a100)
        /usr/local/go/src/net/http/server.go:2514 +0x142
github.com/livepeer/go-livepeer/server.(*lphttp).ServeHTTP(0xc0000e6ea0, {0xaffb90, 0xc0007ae008}, 0xc00090a100)
        /src/server/rpc.go:176 +0xa5
net/http.serverHandler.ServeHTTP({0xc00097be88?}, {0xaffb90?, 0xc0007ae008?}, 0xc000574800?)
        /usr/local/go/src/net/http/server.go:2938 +0x8e
net/http.initALPNRequest.ServeHTTP({{0xb04710?, 0xc000f88db0?}, 0xc00039ae00?, {0xc0007661e0?}}, {0xaffb90, 0xc0007ae008}, 0xc00090a100)
        /usr/local/go/src/net/http/server.go:3546 +0x231
net/http.(*http2serverConn).runHandler(0x34dd9e0?, 0x0?, 0x0?, 0xc000126120?)
        /usr/local/go/src/net/http/h2_bundle.go:6193 +0xbb
created by net/http.(*http2serverConn).scheduleHandler in goroutine 1148
        /usr/local/go/src/net/http/h2_bundle.go:6128 +0x21d

Add route + pipeline for ESRGAN upscaling

The design should support the following:

  • A video is decoded by go-livepeer
  • The frames of the video are passed to this route
  • The route returns upscaled frames
  • The upscaled frames can be encoded by go-livepeer
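A very rough FastAPI-style sketch of what such a route could look like; the route path, payload shape, and the upscale_frame helper are all hypothetical:

import base64
from fastapi import FastAPI, UploadFile
from fastapi.responses import JSONResponse

app = FastAPI()

def upscale_frame(frame_bytes: bytes) -> bytes:
    # Placeholder for the ESRGAN call; the real pipeline would decode the
    # PNG, run the upscaler, and re-encode the result.
    raise NotImplementedError

@app.post("/upscale")
async def upscale(image: UploadFile):
    # go-livepeer decodes the video, posts each frame here, and encodes
    # the upscaled frames returned in the response.
    frame_bytes = await image.read()
    upscaled = upscale_frame(frame_bytes)
    return JSONResponse({"image": base64.b64encode(upscaled).decode()})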

Black bar at the bottom of downloaded images generated with the ByteDance model

Describe the bug

A black bar is visible at the bottom of downloaded images generated with the ByteDance model. The same issue is also present when using the RealVis model, but since it's not an "officially" supported model I have only included images generated with the ByteDance model.

If you are viewing this page in dark mode you can't see the issue. Either download the images or switch to day mode.

ByteCat
JasonTest
BytePorsche

Reproduction steps

  1. Go to any page that uses Livepeer's AI video to generate its images, e.g. https://letsgenerate.ai , https://inference.stronk.rocks , https://lpt-aivideo-playground-2fbb4d44077b.herokuapp.com
  2. Generate an image using the ByteDance model
  3. Download the image
  4. View locally

This can also be reproduced using a curl command to send and download an image from a Livepeer gateway.

Expected behaviour

The image should not have the above-mentioned black bar at the bottom.

Severity

None

Screenshots / Live demo link

No response

OS

Windows

Running on

Docker

AI-worker version

No response

Additional context

No response

Add support for NVENC encoding in VideoWriter

The VideoWriter uses torchaudio.io.StreamWriter under the hood which should be able to use ffmpeg to do hardware encoding if NVENC is available.

Note: The perf gains in terms of speed are probably negligible for svd-xt-film right now since the videos are so short, which makes this low priority. But if output videos get longer, then the perf gains could be worthwhile.
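A sketch of what NVENC output could look like with StreamWriter, assuming the bundled FFmpeg was built with NVENC support; the helper name and defaults are illustrative:

import torch
from torchaudio.io import StreamWriter

def write_video_nvenc(frames: torch.Tensor, path: str, fps: int = 25) -> None:
    # frames: uint8 tensor of shape (N, C, H, W), already on the GPU.
    writer = StreamWriter(dst=path)
    writer.add_video_stream(
        frame_rate=fps,
        height=frames.shape[2],
        width=frames.shape[3],
        encoder="h264_nvenc",   # hardware encoder; needs an NVENC-enabled FFmpeg
        encoder_format="yuv444p",
        hw_accel="cuda:0",      # frames are handed over as CUDA tensors
    )
    with writer.open():
        writer.write_video_chunk(0, frames)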

StableFast optimization research report

As highlighted in this article, we can leverage StableFast to enhance model inference speeds.

Remarks

StableFast employs dynamic pre-tracing, which means the initial request for a specific batch or image size will take considerably longer. This extended startup time is due to the need for pre-tracing when the model loads. Consequently, for pipelines that utilize both batch and image size parameters, this method may not be ideal due to the significant increase in startup time (see details here).

Benchmarks

Let's perform some benchmarking tests using the bench.py script and the following parameters:

docker run --gpus 0 -v /home/ricks/.lpData/models:/models livepeer/ai-runner:latest python bench.py --pipeline <PIPELINE> --model_id <MODEL_ID> --runs 10 --batch_size 1

and for StableFast:

docker run --gpus 0 -e STABLEFAST="true" -v /home/ricks/.lpData/models:/models livepeer/ai-runner:latest python bench.py --pipeline <PIPELINE> --model_id <MODEL_ID> --runs 10 --batch_size 1

These tests were performed on an NVIDIA 3090 with the default num_inference_steps and guidance_scale (i.e. 50 and 7.5) and a 1024x1024 image size.

Image-to-video

StableFast reduces the average inference time by roughly 24% (from 94.07s to 71.74s).

I used batch_size=1 because my GPU could not handle a higher batch size.

stabilityai/stable-video-diffusion-img2vid-xt

Original

pipeline load time: 1.535s
pipeline load max GPU memory allocated: 4.231GiB
pipeline load max GPU memory reserved: 4.441GiB
avg inference time: 94.074s
avg inference time per output: 94.074s
avg inference max GPU memory allocated: 12.000GiB
avg inference max GPU memory reserved: 17.195GiB

StableFast

pipeline load time: 176.696s
pipeline load max GPU memory allocated: 11.999GiB
pipeline load max GPU memory reserved: 16.328GiB
avg warmup inference time: 69.700s
avg warmup inference time per output: 69.700s
avg warmup inference max GPU memory allocated: 11.999GiB
avg warmup inference max GPU memory reserved: 16.328GiB
avg inference time: 71.735s
avg inference time per output: 71.735s
avg inference max GPU memory allocated: 11.999GiB
avg inference max GPU memory reserved: 16.328GiB

stabilityai/stable-video-diffusion-img2vid-xt-1-1

StableFast reduces the average inference time by roughly 27% (from 94.28s to 68.76s).

Original

pipeline load time: 3.669s
pipeline load max GPU memory allocated: 4.230GiB
pipeline load max GPU memory reserved: 4.402GiB
avg inference time: 94.283s
avg inference time per output: 94.283s
avg inference max GPU memory allocated: 12.002GiB
avg inference max GPU memory reserved: 17.176GiB

StableFast

pipeline load time: 178.541s
pipeline load max GPU memory allocated: 11.998GiB
pipeline load max GPU memory reserved: 16.344GiB
avg warmup inference time: 69.146s
avg warmup inference time per output: 69.146s
avg warmup inference max GPU memory allocated: 11.998GiB
avg warmup inference max GPU memory reserved: 16.344GiB
avg inference time: 68.760s
avg inference time per output: 68.760s
avg inference max GPU memory allocated: 11.998GiB
avg inference max GPU memory reserved: 16.344GiB

Torch Compile optimization research report

As highlighted in this article, we can leverage Torch.Compile to enhance model inference speeds.

Remarks

torch.compile needs to compile the model on its first use, which can take several minutes and is therefore not feasible for our implementation. Additionally, it can easily throw errors if not set up correctly. Let's hold off on further investigation for now.
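For context, the usual Diffusers recipe compiles only the UNet; a minimal sketch using the model benchmarked below (the first call still triggers the slow compilation described above):

import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "SG161222/RealVisXL_V4.0_Lightning", torch_dtype=torch.float16
).to("cuda")

# The UNet dominates inference time, so it is the usual compile target.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)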

Benchmarks

Let's perform some benchmarking tests using the bench.py script and the following parameters:

docker run --gpus 0 -v /home/ricks/.lpData/models:/models livepeer/ai-runner:latest python bench.py --pipeline <PIPELINE> --model_id <MODEL_ID> --runs 10 --batch_size 1

and for torch.compile:

docker run --gpus 0 -e TORCH_COMPILE="true" -v /home/ricks/.lpData/models:/models livepeer/ai-runner:latest python bench.py --pipeline <PIPELINE> --model_id <MODEL_ID> --runs 10 --batch_size 1

These tests were performed on an NVIDIA 3090 with the default num_inference_steps and guidance_scale (i.e. 50 and 7.5) and a 1024x1024 image size.

SG161222/RealVisXL_V4.0_Lightning

Compiling the first request with this model took 12 minutes and on the second request I got an error:

File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/torch/_dynamo/variables/base.py", line 306, in call_function
    unimplemented(f"call_function {self} {args} {kwargs}")
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/torch/_dynamo/exc.py", line 172, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: call_function UserDefinedClassVariable() [] {'sample': TensorVariable()}

Whisper Pipeline has difficulties with some audio files

We have observed that while most audio files are processed correctly, some MP3 files cause issues. We are actively working on improving the pipeline to address these problems. This issue serves as a place for users to submit any problematic audio files they encounter. Your contributions will help us enhance the pipeline's robustness and ensure a smoother experience for everyone. Please upload your problematic audio files below. Thank you in advance for your help 🙏🏻!

SDXL-Lightning inference steps ignored

Premise: this is a model which does not accept the guidance_scale param and loads a specific set of model weights according to the number of inference steps you want to run (1, 2, 4 or 8 steps).

When apps request the ByteDance/SDXL-Lightning model, the following code makes it default to 2 steps:

if SDXL_LIGHTNING_MODEL_ID in model_id:
    base = "stabilityai/stable-diffusion-xl-base-1.0"

    # ByteDance/SDXL-Lightning-2step
    if "2step" in model_id:
        unet_id = "sdxl_lightning_2step_unet"
    # ByteDance/SDXL-Lightning-4step
    elif "4step" in model_id:
        unet_id = "sdxl_lightning_4step_unet"
    # ByteDance/SDXL-Lightning-8step
    elif "8step" in model_id:
        unet_id = "sdxl_lightning_8step_unet"
    else:
        # Default to 2step
        unet_id = "sdxl_lightning_2step_unet"

And then when running inference, it would override num_inference_steps to 2:

elif SDXL_LIGHTNING_MODEL_ID in self.model_id:
    # SDXL-Lightning models should have guidance_scale = 0 and use
    # the correct number of inference steps for the unet checkpoint loaded
    kwargs["guidance_scale"] = 0.0
    if "2step" in self.model_id:
        kwargs["num_inference_steps"] = 2
    elif "4step" in self.model_id:
        kwargs["num_inference_steps"] = 4
    elif "8step" in self.model_id:
        kwargs["num_inference_steps"] = 8
    else:
        # Default to 2step
        kwargs["num_inference_steps"] = 2

Apparently apps need to append 4step or 8step to the model ID if they want a different number of inference steps. This can be very confusing to app developers, who likely just request ByteDance/SDXL-Lightning with a specific num_inference_steps, which then quietly gets overwritten during inference.

This would also explain why people have reported this model to have bad output, as running this model at 8 steps provides a vastly different output than at 2 steps.

Proposed solutions could be to switch the UNet/LoRAs during inference based on the requested step count, or to make the documentation very clear about how this specific model behaves. Luckily, with models like RealVisXL_V4.0_Lightning you're not tied to a specific number of inference steps.
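A rough sketch of the first proposed solution, picking the UNet checkpoint from the requested step count instead of from the model ID suffix; the checkpoint names come from the snippet above, the selection logic is hypothetical:

def select_lightning_unet(num_inference_steps: int) -> str:
    # Map the requested step count to the closest available checkpoint
    # rather than silently overriding num_inference_steps.
    available = {
        2: "sdxl_lightning_2step_unet",
        4: "sdxl_lightning_4step_unet",
        8: "sdxl_lightning_8step_unet",
    }
    steps = min(available, key=lambda s: abs(s - num_inference_steps))
    return available[steps]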

DeepCache optimization research report

As mentioned in this article, we can utilize DeepCache to accelerate model inference. However, this may compromise image quality. In the future, we might need to implement a quality weighting option that dApps can adjust to filter out orchestrators with DeepCache enabled.

Remarks

DeepCache can only accelerate multi-step models, so it is not applicable to SDXL Turbo when only a single inference step is used.
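For reference, enabling DeepCache on a loaded Diffusers pipeline is a thin wrapper; a sketch following the DeepCache README (the ai-runner integration behind the DEEPCACHE flag may differ):

from DeepCache import DeepCacheSDHelper

def enable_deepcache(pipe, cache_interval: int = 3, cache_branch_id: int = 0):
    # The helper caches high-level UNet features and reuses them across
    # denoising steps, which is why single-step models see no benefit.
    helper = DeepCacheSDHelper(pipe=pipe)
    helper.set_params(cache_interval=cache_interval, cache_branch_id=cache_branch_id)
    helper.enable()
    return helper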

Benchmarks

Let's perform some benchmarking tests using the bench.py script and the following parameters:

docker run --gpus 0 -v /home/ricks/.lpData/models:/models livepeer/ai-runner:latest python bench.py --pipeline <PIPELINE> --model_id <MODEL_ID> --runs 10 --batch_size 3

and for Deepcache:

docker run --gpus 0 -e DEEPCACHE="true" -v /home/ricks/.lpData/models:/models livepeer/ai-runner:latest python bench.py --pipeline <PIPELINE> --model_id <MODEL_ID> --runs 10 --batch_size 3

These tests were performed on an NVIDIA 3090 with the default num_inference_steps and guidance_scale (i.e. 50 and 7.5) and a 1024x1024 image size.

Image-to-video

DeepCache will roughly provide a 53% speedup.

I used batch_size=1 because my GPU could not handle a higher batch size.

stabilityai/stable-video-diffusion-img2vid-xt

Original

pipeline load time: 1.419s
pipeline load max GPU memory allocated: 4.228GiB
pipeline load max GPU memory reserved: 4.422GiB
avg inference time: 96.771s
avg inference time per output: 96.771s
avg inference max GPU memory allocated: 12.000GiB
avg inference max GPU memory reserved: 17.200GiB

DeepCache

pipeline load time: 2.150s
pipeline load max GPU memory allocated: 4.228GiB
pipeline load max GPU memory reserved: 4.422GiB
avg inference time: 45.229s
avg inference time per output: 45.229s
avg inference max GPU memory allocated: 16.824GiB
avg inference max GPU memory reserved: 21.832GiB

stabilityai/stable-video-diffusion-img2vid-xt-1-1

DeepCache will roughly provide a 47% speedup.

Original

pipeline load time: 1.389s
pipeline load max GPU memory allocated: 4.230GiB
pipeline load max GPU memory reserved: 4.402GiB
avg inference time: 98.439s
avg inference time per output: 98.439s
avg inference max GPU memory allocated: 12.002GiB
avg inference max GPU memory reserved: 17.177GiB

DeepCache

pipeline load time: 1.434s
pipeline load max GPU memory allocated: 4.228GiB
pipeline load max GPU memory reserved: 4.422GiB
avg inference time: 46.407s
avg inference time per output: 46.407s
avg inference max GPU memory allocated: 16.826GiB
avg inference max GPU memory reserved: 21.832GiB

DeepCache + Sfast

pipeline load time: 174.285s
pipeline load max GPU memory allocated: 11.998GiB
pipeline load max GPU memory reserved: 16.338GiB
avg warmup inference time: 67.834s
avg warmup inference time per output: 67.834s
avg warmup inference max GPU memory allocated: 11.998GiB
avg warmup inference max GPU memory reserved: 16.338GiB
avg inference time: 70.435s
avg inference time per output: 70.435s
avg inference max GPU memory allocated: 11.998GiB
avg inference max GPU memory reserved: 16.338GiB

Image-to-image

ByteDance/SDXL-Lightning

DeepCache will roughly provide a 17% speedup when not batching and 6% with a batch size of 3. Speedup is low because we use two-step inference by default (see remarks and this code).

Original

Batch 1:

pipeline load time: 15.640s
pipeline load max GPU memory allocated: 9.729GiB
pipeline load max GPU memory reserved: 10.408GiB
avg inference time: 0.825s
avg inference time per output: 0.825s
avg inference max GPU memory allocated: 10.480GiB
avg inference max GPU memory reserved: 17.084GiB

Batch 3:

pipeline load time: 14.977s
pipeline load max GPU memory allocated: 9.729GiB
pipeline load max GPU memory reserved: 10.408GiB
avg inference time: 2.550s
avg inference time per output: 0.850s
avg inference max GPU memory allocated: 17.985GiB
avg inference max GPU memory reserved: 22.033GiB

DeepCache

Batch 1:

pipeline load time: 15.289s
pipeline load max GPU memory allocated: 9.729GiB
pipeline load max GPU memory reserved: 10.408GiB
avg inference time: 0.678s
avg inference time per output: 0.678s
avg inference max GPU memory allocated: 10.634GiB
avg inference max GPU memory reserved: 17.084GiB

Batch 3:

pipeline load time: 15.279s
pipeline load max GPU memory allocated: 9.729GiB
pipeline load max GPU memory reserved: 10.408GiB
avg inference time: 2.395s
avg inference time per output: 0.798s
avg inference max GPU memory allocated: 18.446GiB
avg inference max GPU memory reserved: 22.033GiB

stabilityai/sd-turbo

DeepCache will NOT provide a speedup, most likely because we use two-step inference by default (see remarks and this code).

Original

Batch 1:

pipeline load time: 1.281s
pipeline load max GPU memory allocated: 2.420GiB
pipeline load max GPU memory reserved: 2.484GiB
avg inference time: 0.657s
avg inference time per output: 0.657s
avg inference max GPU memory allocated: 4.805GiB
avg inference max GPU memory reserved: 5.775GiB

Batch 3:

pipeline load time: 1.188s
pipeline load max GPU memory allocated: 2.421GiB
pipeline load max GPU memory reserved: 2.490GiB
avg inference time: 2.001s
avg inference time per output: 0.667s
avg inference max GPU memory allocated: 8.432GiB
avg inference max GPU memory reserved: 11.535GiB

DeepCache

Batch 1:

pipeline load time: 1.189s
pipeline load max GPU memory allocated: 2.419GiB
pipeline load max GPU memory reserved: 2.465GiB
avg inference time: 0.654s
avg inference time per output: 0.654s
avg inference max GPU memory allocated: 5.017GiB
avg inference max GPU memory reserved: 5.777GiB

Batch 3:

pipeline load time: 1.184s
pipeline load max GPU memory allocated: 2.420GiB
pipeline load max GPU memory reserved: 2.480GiB
avg inference time: 1.990s
avg inference time per output: 0.663s
avg inference max GPU memory allocated: 9.062GiB
avg inference max GPU memory reserved: 11.547GiB

stabilityai/sdxl-turbo

DeepCache will NOT provide a speedup, most likely because we use two-step inference by default (see remarks and this code).

Original

Batch 1:

pipeline load time: 2.177s
pipeline load max GPU memory allocated: 6.569GiB
pipeline load max GPU memory reserved: 6.797GiB
avg inference time: 1.009s
avg inference time per output: 1.009s
avg inference max GPU memory allocated: 10.467GiB
avg inference max GPU memory reserved: 14.543GiB

Batch 3:

pipeline load time: 2.079s
pipeline load max GPU memory allocated: 6.567GiB
pipeline load max GPU memory reserved: 6.838GiB
avg inference time: 3.093s
avg inference time per output: 1.031s
avg inference max GPU memory allocated: 17.968GiB
avg inference max GPU memory reserved: 22.150GiB

DeepCache

Batch 1:

pipeline load time: 2.089s
pipeline load max GPU memory allocated: 6.572GiB
pipeline load max GPU memory reserved: 6.779GiB
avg inference time: 1.011s
avg inference time per output: 1.011s
avg inference max GPU memory allocated: 10.636GiB
avg inference max GPU memory reserved: 14.537GiB

Batch 3:

pipeline load time: 2.218s
pipeline load max GPU memory allocated: 6.571GiB
pipeline load max GPU memory reserved: 6.812GiB
avg inference time: 3.120s
avg inference time per output: 1.040s
avg inference max GPU memory allocated: 18.447GiB
avg inference max GPU memory reserved: 22.105GiB

Text-to-image

ByteDance/SDXL-Lightning

DeepCache will roughly provide a 17% speedup.

Original

Batch 1:

pipeline load time: 15.762s
pipeline load max GPU memory allocated: 9.729GiB
pipeline load max GPU memory reserved: 10.408GiB
avg inference time: 0.831s
avg inference time per output: 0.831s
avg inference max GPU memory allocated: 10.478GiB
avg inference max GPU memory reserved: 15.078GiB

Batch 3:

Couldn't test because I ran out of memory on my system.

DeepCache

Batch 1:

pipeline load time: 14.979s
pipeline load max GPU memory allocated: 9.729GiB
pipeline load max GPU memory reserved: 10.408GiB
avg inference time: 0.686s
avg inference time per output: 0.686s
avg inference max GPU memory allocated: 10.642GiB
avg inference max GPU memory reserved: 15.078GiB

Batch 3:

Couldn't test because I ran out of memory on my system.

stabilityai/sd-turbo

DeepCache will roughly provide a 50% speedup when not batching and 52% with a batch size of 3. It will lead to a 30% speedup when num_inference_steps=6.

Original

Batch 1:

pipeline load time: 1.184s
pipeline load max GPU memory allocated: 2.420GiB
pipeline load max GPU memory reserved: 2.473GiB
avg inference time: 1.568s
avg inference time per output: 1.568s
avg inference max GPU memory allocated: 3.024GiB
avg inference max GPU memory reserved: 3.617GiB

Batch 3:

pipeline load time: 1.137s
pipeline load max GPU memory allocated: 2.421GiB
pipeline load max GPU memory reserved: 2.488GiB
avg inference time: 3.490s
avg inference time per output: 1.163s
avg inference max GPU memory allocated: 3.930GiB
avg inference max GPU memory reserved: 5.551GiB

DeepCache

Batch 1:

pipeline load time: 1.192s
pipeline load max GPU memory allocated: 2.419GiB
pipeline load max GPU memory reserved: 2.465GiB
avg inference time: 0.784s
avg inference time per output: 0.784s
avg inference max GPU memory allocated: 3.081GiB
avg inference max GPU memory reserved: 3.617GiB

Batch 3:

pipeline load time: 1.173s
pipeline load max GPU memory allocated: 2.420GiB
pipeline load max GPU memory reserved: 2.484GiB
avg inference time: 1.646s
avg inference time per output: 0.549s
avg inference max GPU memory allocated: 4.087GiB
avg inference max GPU memory reserved: 5.596GiB

Original (6 step)

pipeline load time: 2.072s
pipeline load max GPU memory allocated: 3.554GiB
pipeline load max GPU memory reserved: 3.611GiB
avg inference time: 0.338s
avg inference time per output: 0.338s
avg inference max GPU memory allocated: 4.159GiB
avg inference max GPU memory reserved: 4.756GiB

DeepCache (6 step)

pipeline load time: 2.009s
pipeline load max GPU memory allocated: 3.553GiB
pipeline load max GPU memory reserved: 3.613GiB
avg inference time: 0.234s
avg inference time per output: 0.234s
avg inference max GPU memory allocated: 4.216GiB
avg inference max GPU memory reserved: 4.763GiB

stabilityai/sdxl-turbo

DeepCache will roughly provide a 55% speedup when not batching and 56% with a batch size of 3. It will lead to a 36% speedup when num_inference_steps=6.

Original

Batch 1:

pipeline load time: 2.181s
pipeline load max GPU memory allocated: 6.574GiB
pipeline load max GPU memory reserved: 6.781GiB
avg inference time: 2.845s
avg inference time per output: 2.845s
avg inference max GPU memory allocated: 7.657GiB
avg inference max GPU memory reserved: 8.726GiB

Batch 3:

pipeline load time: 2.139s
pipeline load max GPU memory allocated: 6.575GiB
pipeline load max GPU memory reserved: 6.857GiB
avg inference time: 5.829s
avg inference time per output: 1.943s
avg inference max GPU memory allocated: 9.534GiB
avg inference max GPU memory reserved: 11.381GiB

DeepCache

Batch 1:

pipeline load time: 2.135s
pipeline load max GPU memory allocated: 6.568GiB
pipeline load max GPU memory reserved: 6.834GiB
avg inference time: 1.278s
avg inference time per output: 1.278s
avg inference max GPU memory allocated: 7.696GiB
avg inference max GPU memory reserved: 8.807GiB

Batch 3:

pipeline load time: 2.151s
pipeline load max GPU memory allocated: 6.571GiB
pipeline load max GPU memory reserved: 6.812GiB
avg inference time: 2.515s
avg inference time per output: 0.838s
avg inference max GPU memory allocated: 9.653GiB
avg inference max GPU memory reserved: 11.442GiB

Original (6 step)

pipeline load time: 3.002s
pipeline load max GPU memory allocated: 7.701GiB
pipeline load max GPU memory reserved: 7.908GiB
avg inference time: 0.546s
avg inference time per output: 0.546s
avg inference max GPU memory allocated: 8.787GiB
avg inference max GPU memory reserved: 9.854GiB

DeepCache (6 step)

pipeline load time: 8.395s
pipeline load max GPU memory allocated: 7.701GiB
pipeline load max GPU memory reserved: 7.926GiB
avg inference time: 0.358s
avg inference time per output: 0.358s
avg inference max GPU memory allocated: 8.830GiB
avg inference max GPU memory reserved: 9.877GiB

SG161222/RealVisXL_V4.0

DeepCache will roughly provide a 61% speedup when not batching and 47% with a batch size of 3. It will lead to a 48% speedup when num_inference_steps=6.

Original

Batch 1:

pipeline load time: 5.682s
pipeline load max GPU memory allocated: 6.569GiB
pipeline load max GPU memory reserved: 6.797GiB
avg inference time: 13.657s
avg inference time per output: 13.657s
avg inference max GPU memory allocated: 10.467GiB
avg inference max GPU memory reserved: 13.818GiB

Batch 3:

pipeline load time: 2.728s
pipeline load max GPU memory allocated: 6.567GiB
pipeline load max GPU memory reserved: 6.838GiB
avg inference time: 41.419s
avg inference time per output: 13.806s
avg inference max GPU memory allocated: 17.971GiB
avg inference max GPU memory reserved: 21.812GiB

DeepCache

Batch 1:

pipeline load time: 2.165s
pipeline load max GPU memory allocated: 6.574GiB
pipeline load max GPU memory reserved: 6.783GiB
avg inference time: 5.245s
avg inference time per output: 5.245s
avg inference max GPU memory allocated: 10.796GiB
avg inference max GPU memory reserved: 14.121GiB

Batch 3:

pipeline load time: 4.857s
pipeline load max GPU memory allocated: 6.575GiB
pipeline load max GPU memory reserved: 6.812GiB
avg inference time: 19.382s
avg inference time per output: 6.461s
avg inference max GPU memory allocated: 18.922GiB
avg inference max GPU memory reserved: 21.747GiB

Original (6 step)

pipeline load time: 6.241s
pipeline load max GPU memory allocated: 7.702GiB
pipeline load max GPU memory reserved: 7.959GiB
avg inference time: 2.086s
avg inference time per output: 2.086s
avg inference max GPU memory allocated: 11.600GiB
avg inference max GPU memory reserved: 14.957GiB

DeepCache (6 step)

pipeline load time: 3.266s
pipeline load max GPU memory allocated: 7.708GiB
pipeline load max GPU memory reserved: 7.928GiB
avg inference time: 1.089s
avg inference time per output: 1.089s
avg inference max GPU memory allocated: 11.931GiB
avg inference max GPU memory reserved: 15.404GiB

SG161222/RealVisXL_V4.0_Lightning

DeepCache will roughly provide a 61% speedup when not batching and 47% with a batch size of 3. It will lead to a 45% speedup when num_inference_steps=6.

Strangely, the Lightning model is not faster than the regular model. Maybe this is because we use a different num_inference_steps and guidance_scale than the ones it was trained for.

Original

Batch 1:

pipeline load time: 2.444s
pipeline load max GPU memory allocated: 6.572GiB
pipeline load max GPU memory reserved: 6.783GiB
avg inference time: 13.803s
avg inference time per output: 13.803s
avg inference max GPU memory allocated: 10.470GiB
avg inference max GPU memory reserved: 13.781GiB

Batch 3:

pipeline load time: 2.855s
pipeline load max GPU memory allocated: 6.572GiB
pipeline load max GPU memory reserved: 6.783GiB
avg inference time: 39.807s
avg inference time per output: 13.269s
avg inference max GPU memory allocated: 17.975GiB
avg inference max GPU memory reserved: 21.757GiB

DeepCache

Batch 1:

pipeline load time: 4.297s
pipeline load max GPU memory allocated: 6.574GiB
pipeline load max GPU memory reserved: 6.781GiB
avg inference time: 5.315s
avg inference time per output: 5.315s
avg inference max GPU memory allocated: 10.792GiB
avg inference max GPU memory reserved: 14.177GiB

Batch 3:

pipeline load time: 9.491s
pipeline load max GPU memory allocated: 6.571GiB
pipeline load max GPU memory reserved: 6.812GiB
avg inference time: 18.570s
avg inference time per output: 6.190s
avg inference max GPU memory allocated: 18.918GiB
avg inference max GPU memory reserved: 20.279GiB

Original (6 step)

pipeline load time: 3.236s
pipeline load max GPU memory allocated: 7.702GiB
pipeline load max GPU memory reserved: 7.959GiB
avg inference time: 2.089s
avg inference time per output: 2.089s
avg inference max GPU memory allocated: 11.600GiB
avg inference max GPU memory reserved: 14.957GiB

DeepCache (6 step)

pipeline load time: 5.489s
pipeline load max GPU memory allocated: 7.702GiB
pipeline load max GPU memory reserved: 7.922GiB
avg inference time: 1.137s
avg inference time per output: 1.137s
avg inference max GPU memory allocated: 11.926GiB
avg inference max GPU memory reserved: 15.217GiB

Conclusion

DeepCache seems to result in a significant inference-time speedup for most models. Strangely enough, for some Turbo and Lightning models there is no speedup.

A test Bounty

Is your feature request related to a problem? Please describe.

Tip

Bounty: 10 LPT
Issued by: @rickstaa (AI SPE)

Add frame interpolation support

Let's use FILM for now since that is what the Stability API uses.

There is a PyTorch implementation at https://github.com/dajes/frame-interpolation-pytorch.

I think it makes sense to start by creating a svd-xt-film container that uses Diffusers to run svd-xt and then interpolates additional frames using FILM. We can consider adding a standalone container for frame interpolation separately.

Adding a standalone container right now seems questionable: how often are you just frame-interpolating a video? My guess is that this is typically going to be used as a technique to create better outputs with svd, and it would be inefficient to create a video with svd and then decode it again just to run frame interpolation. The output of svd from Diffusers is already a set of frames anyway, so this is already the format we want to feed into FILM, rather than encoding the frames into a video and then decoding them back into frames to interpolate.

DeepCache quality report

An issue to research the output quality of the https://github.com/horseee/DeepCache optimization algorithm with the different pipelines.

Image-to-Video

For this test we use the following images:

cool-cat_small

For this test we use the following parameters.

curl -X POST http://0.0.0.0:8935/image-to-video     -F "model_id=stabilityai/stable-video-diffusion-img2vid-xt-1-1"     -F "width=1024"     -F "height=1024"     -F "motion_bucket_id=50"     -F "fps=25"     -F "noise_aug_strength=0.05"     -F "seed=0"    -F "image=@/home/ricks/Downloads/<IMAGE>.png"

Stable Fast

Deep Cache

cat_seed_0_deepcache.mp4
0263c80e.mp4

Add support for writing PNG files to shared volumes

At the moment, the runner app routes return images as base64 encoded data URLs. This is convenient if the response to a user is meant to be a base64 encoded image, but it is inefficient, particularly when many images are transmitted, which is the case for image-to-video since it returns multiple frames of a video as PNG files. The inefficiency is amplified in the image-to-video case because go-livepeer has to decode the data URLs and write the PNG files to disk in order to use the ffmpeg image2 demuxer so that it can encode the PNG files into a video.

There should be an option for the runner app routes to write images to an output directory on a shared volume that go-livepeer can then read from. This can be the default behavior for the image-to-video route. The default behavior for the routes that return single images can still be to return base64 encoded data URLs, but there can be an option to change this. go-livepeer's APIs would allow callers to request an HTTP URL served by the node's object store, and the runner API would allow callers to request a file:// URL served by the shared storage for the runner app.
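A rough sketch of what the opt-in could look like on the runner side; the directory layout and parameter names are hypothetical:

import os
import uuid

def save_frames_to_shared_volume(frames, output_dir: str = "/output") -> list[str]:
    # Write each PIL frame as a PNG under the shared volume and return
    # file:// URLs instead of base64 data URLs.
    request_dir = os.path.join(output_dir, uuid.uuid4().hex)
    os.makedirs(request_dir, exist_ok=True)
    urls = []
    for i, frame in enumerate(frames):
        path = os.path.join(request_dir, f"{i:05d}.png")
        frame.save(path)
        urls.append(f"file://{path}")
    return urls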

Memory NodeStorage may keep only a very limited history of results

Describe the bug

My understanding is that drivers.NodeStorage is set in starter.go on line 1254 to a MemoryDriver with a cache length of 12. I think the current implementation, which uses [capability]_[model] in the AI selector as the session name, will be limited to 12 results before the cache starts to clear. This could result in ErrNotFound errors when more than 12 requests are completed before the 1st request can be downloaded.

Reproduction steps

Send 13 requests to the ByteDance/SDXL-Lightning model and try to download the first image. Expect to get ErrNotFound.

Expected behaviour

A session pool per request, combined with limiting inference requests to 12 per request, would solve this.

Severity

None

Screenshots / Live demo link

No response

OS

None

Running on

None

AI-worker version

No response

Additional context

No response

Error getting `tokenizer_config.json` from `sd-turbo`

System: Windows 11
Using the first huggingface-cli command to download the sd-turbo models:

huggingface-cli download stabilityai/sd-turbo --include "*.fp16.safetensors" "*.json" "*.txt" --exclude ".onnx" ".onnx_data" --cache-dir models

I got the following OS error. It downloaded the tokenizer_config.json file, but it is 0 KB and I cannot open it on my system, which is really odd.

AppData\Local\Programs\Python\Python311\Lib\site-packages\huggingface_hub\file_download.py", line 916, in _create_symlink
    os.symlink(src_rel_or_abs, abs_dst)

OSError: [WinError 1314] A required privilege is not held by the client: '..\\..\\..\\blobs\\bd2abe19377557ff5771584921f9b65fa041fef0' -> 'C:\\local\\models\\models\\models--stabilityai--sd-turbo\\snapshots\\1681ed09e0cff58eeb41e878a49893228b78b94c\\tokenizer\\tokenizer_config.json'

Anyway, the benchmark still works; not sure if this file is required or not.
