pabannier / bark.cpp Goto Github PK

View Code? Open in Web Editor NEW

559.0 559.0 37.0 47.88 MB

Suno AI's Bark model in C/C++ for fast text-to-speech

License: MIT License

Python 14.22% C++ 83.95% CMake 1.12% Dockerfile 0.71%

inference machine-learning text-to-speech

bark.cpp's Introduction

About me

I work as a Data Scientist at AI biotech Owkin.
Previously, I interned at INRIA Parietal on solving neuroscience (M/EEG) inverse problems.
I graduated from Ecole Polytechnique and HEC Paris with a double major in data science and management.

In 2022, I co-created skglm, a fast sklearn-compatible solver for sparse generalized linear models. More recently, I've become interested in fast inference for large language models. I have implemented Bark.cpp, a port of SunoAI's Bark model in C/C++, as well as specialized models like BioGPT.cpp.

Cool open-source projects I contributed to

MNE-Python, a toolkit for exploring neurophysiological data in Python
Linfa, the leading crate for machine learning and data analysis in Rust
Benchopt, a benchmarking suite for optimization algorithms

Other projects I worked on

Encodec.cpp, Meta's neural codec model ported in C++
SparseGLM, a fast coordinate descent solver in Rust
Nanograd, a lightweight deep learning framework built around Numpy arrays
NarrateMate.ai, a Next.JS web app to practice language comprehension listening to YouTube videos

bark.cpp's People

Contributors

Stargazers

Watchers

bark.cpp's Issues

What's the output length?

I think I remember reading bark generates 30s of audio at a time. Is that also true for bark.cpp?

I've tried letting it read some article and it crashed. Is that a length limitation or something else?

Also: Is there example code to make it read back a whole news article, a dialogue or anything useful?

Update on model development

Please make a simple model for this test program, which can be used immediately. I'm not very good at python, sorry to bother you

Vocos

Be sure to consider Vocos which is a far superior vocoder and is like for like compatible and is a drop in replacement. There are a few repos that have Vocos with Bark but nothing fully integrated last I checked.

https://github.com/charactr-platform/vocos

https://github.com/rsxdalv/tts-generation-webui

OSX metal?

DId OSX metal support ever get implemented?

missing requirements.txt

The readme tells you to:

python3 -m pip install -r bark/requirements.txt

but there is no such file. It looks like it got accidentally removed.
https://github.com/PABannier/bark.cpp/blob/3c4411d33e7093fb6a5586463d40e2c196aa146a/bark/requirements.txt

ENH Convert weights of the 4 models into one large binary file

Currently, the convert script generates 4 binary files corresponding to the weights and metadata of the 4 major components of bark. For simplicity purposes, we should make one big large file.

Can we have compiled exe for usage please?

Hello,
Thank you for this!

Can u please compile ur project like other cpp ai such as stable diffusion cpp? so we only can download the exe from release and start to use it, i'm a simple user and don't know much about compiling and coding.

And could u please give an ETA when audiocraft will be supported in the future?

kind regards

Not enough space in the context's memory pool

Following your instructions, I get the following:

$ ./build/bin/main -m ./ggml_weights/ -p "this is an audio"
bark_load_model_from_file: loading model from './ggml_weights/'
bark_load_model_from_file: reading bark text model
gpt_model_load: n_in_vocab  = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 304 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1701.69 MB
bark_load_model_from_file: reading bark vocab

bark_load_model_from_file: reading bark coarse model
gpt_model_load: n_in_vocab  = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 304 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1250.69 MB

bark_load_model_from_file: reading bark fine model
gpt_model_load: n_in_vocab  = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 7
gpt_model_load: n_wtes      = 8
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 304 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1218.26 MB

bark_load_model_from_file: reading bark codec model
encodec_model_load: model size    =   44.32 MB

bark_load_model_from_file: total model size  =  4170.64 MB

bark_tokenize_input: prompt: 'this is an audio'
bark_tokenize_input: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595 
bark_forward_text_encoder: ...........................................................................................................

bark_print_statistics: mem per token =     4.81 MB
bark_print_statistics:   sample time =    23.58 ms / 109 tokens
bark_print_statistics:  predict time =  9675.77 ms / 87.96 ms per token
bark_print_statistics:    total time =  9702.40 ms

bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................

bark_print_statistics: mem per token =     8.53 MB
bark_print_statistics:   sample time =     6.76 ms / 324 tokens
bark_print_statistics:  predict time = 50832.34 ms / 156.41 ms per token
bark_print_statistics:    total time = 50843.50 ms

ggml_new_object: not enough space in the context's memory pool (needed 4115076720, available 4112941056)
Segmentation fault (core dumped)

Is this related to my machine memory?

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            39Gi       6.3Gi       8.2Gi       1.1Gi        24Gi        28Gi
Swap:           19Gi       0.0Ki        19Gi

Voice cloning with bark

FEAT Add 8-bit and 4-bit quantization script

BUG Tokenizer

For some inputs like "john", the tokenizer adds indefinitely "##".

Support for piper models

It would be helpful to add support for piper models into bark.cpp

there is already a c++ library for piper but it is difficult to compile and does not work well cross platform. Piper is currently running on the onnx runtime.

https://github.com/rhasspy/piper

Support iOS and Android?

Hi,
Is it possible to support iOS and Android? General guidelines on how you'd approach that.

Thanks,
Hussain

Support multiple voices

Unable to build

Hi, when I try to build both on colab as well as locally I get this error :

/content/bark.cpp
/content/bark.cpp/build
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at CMakeLists.txt:24 (add_subdirectory):
The source directory

/content/bark.cpp/ggml

does not contain a CMakeLists.txt file.

-- Configuring incomplete, errors occurred!
gmake: Makefile: No such file or directory
gmake: *** No rule to make target 'Makefile'. Stop.

Whats up with this ?

Working example on Google Colab?

Can anyone show a working example on Google Colab where a concrete audio file is generated? In my attempts, execution strangely breaks after these lines.

bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................

bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 8.16 ms
bark_forward_coarse_encoder: predict time = 95368.38 ms / 294.35 ms per token
bark_forward_coarse_encoder: total time = 95518.55 ms

Here is the link to my attempt on Google Colab:
https://colab.research.google.com/drive/1JVtJ6CDwxtKfFmEd8J4FGY2lzdL0d0jT?usp=sharing

support GPU or not ?

I have checked the project description that :
The main goal of bark.cpp is to synthesize audio from a textual input with the Bark model in efficiently using only CPU.

Could I know the it support GPU or not ? I suppose that using GPU should be more faster that using CPU

Support CoreML models

Support 8-bit quantization

Some broken things for first timers

First of all, thanks for taking up the challenge and democratising this wunderful model.

encodec_24khz-d7cc33bc.th doesn't download for me

Downloading: "https:/dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /Users/tatsch/.cache/torch/hub/checkpoints/encodec_24khz-d7cc33bc.th
Traceback (most recent call last):
  File "/Users/tatsch/workspace/bark.cpp/download_weights.py", line 41, in <module>
    state_dict = torch.hub.load_state_dict_from_url(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/hub.py", line 746, in load_state_dict_from_url
    download_url_to_file(url, cached_file, hash_prefix, progress=progress)
  File "/opt/homebrew/lib/python3.11/site-packages/torch/hub.py", line 611, in download_url_to_file
    u = urlopen(req)
        ^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 516, in open
    req = meth(req)
          ^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1272, in do_request_
    raise URLError('no host given')
urllib.error.URLError: <urlopen error no host given>

curl -o models/encodec_24khz-d7cc33bc.th https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th

vocab.txt also isnt there for me in models, maybe related to the aforementioned issue

curl -o models/vocab.txt https://huggingface.co/suno/bark/blob/main/vocab.txt

but I guess its the wrong one because when I run it

bark_model_load: reading bark vocab
bark_vocab_load: wrong voculary size (305 != 119547)
bark_model_load: invalid model file './ggml_weights//ggml_vocab.bin' (bad text)
main: failed to load model from './ggml_weights/'

also the call in the readme should be

./main -m ./ggml_weights/ -p "this is an audio"
instead of

./main -m ./models/ggml_weights/ -p "this is an audio"
for the default folder structure.

9k bounty if you finish porting this

Yo pierre can you finish the last part already !

there's a 9k bounty with your name on it if you commit that last commit https://replit.com/bounties/@reddestbull/improve-inference-by

bark_forward_fine_encoder tried to allocate 30GB of memory during forward pass.

As explained before in one of the issues, during a forward pass, the bark_forward_fine_encoder tried to allocate 30GB of memory.

The console log looks something like this:

./main -m ./ggml_weights -p "this is an audio" 
bark_model_load: loading model from './ggml_weights'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab  = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab  = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab  = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 7
gpt_model_load: n_wtes      = 8
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size    =   44.32 MB

bark_model_load: total model size  =  4170.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595 
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token =     4.80 MB
bark_forward_text_encoder:   sample time =    13.86 ms
bark_forward_text_encoder:  predict time =  6651.94 ms / 18.22 ms per token
bark_forward_text_encoder:    total time =  6737.75 ms

bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................

bark_forward_coarse_encoder: mem per token =     8.51 MB
bark_forward_coarse_encoder:   sample time =     3.54 ms
bark_forward_coarse_encoder:  predict time = 31155.62 ms / 96.16 ms per token
bark_forward_coarse_encoder:    total time = 31228.26 ms

fine_gpt_eval: failed to allocate 31987885670 bytes
bark_forward_fine_encoder: ggml_aligned_malloc: insufficient memory (attempted to allocate 30506.03 MB)
GGML_ASSERT: ggml.c:4408: ctx->mem_buffer != NULL
zsh: killed     ./main -m ./ggml_weights -p "this is an audio"

So far I am unable to track the cause. But will keep trying.

Support Metal accelerators for MacOS

List of errors in build and seg fault on inference

I started off from update_submodule branch

/oos/bark.cpp/bark/bark.cpp:2048:43: error: use of undeclared identifier 'encodec_verbosity_level'; did you mean 'bark_verbosity_level'?
        encodec_model_path, n_gpu_layers, encodec_verbosity_level::LOW);
                                          ^~~~~~~~~~~~~~~~~~~~~~~
                                          bark_verbosity_level
/oos/bark.cpp/bark/./bark.h:23:12: note: 'bark_verbosity_level' declared here
enum class bark_verbosity_level {
           ^
/oos/bark.cpp/bark/bark.cpp:2047:37: error: no matching function for call to 'encodec_load_model'
    struct encodec_context * ectx = encodec_load_model(
                                    ^~~~~~~~~~~~~~~~~~
/oos/bark.cpp/encodec.cpp/./encodec.h:193:26: note: candidate function not viable: requires 2 arguments, but 3 were provided
struct encodec_context * encodec_load_model(
                         ^
/oos/bark.cpp/bark/bark.cpp:2060:5: error: use of undeclared identifier 'encodec_set_sample_rate'
    encodec_set_sample_rate(ectx, sample_rate);
    ^
3 errors generated.
make[2]: *** [CMakeFiles/bark.dir/bark.cpp.o] Error 1
make[1]: *** [CMakeFiles/bark.dir/all] Error 2
make: *** [all] Error 2

I added fixes here:
#132
PABannier/encodec.cpp#34

But even then, the example ./bark/build/examples/main/main -m ./ggml_weights/ -p "this is an audio" will produce noise, and any string other than "this is an audio" will cause a segmentation fault. This is on M macOS.
Cmake version:
cmake version 3.28.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).

Mixed F16 / F32 precision

FEAT Add support to FP16

Currently bark.cpp only supports FP32. For faster inference, we should make FP16 available.

Quantize doesn't seem to work for codec model

The text, coarse, and fine models are converted successfully, but the codec model always results in a 0 byte output. After a quick look, it seems the header in the codec model may be slightly different than the other models and it can't read the correct ftype from the file because the offsets are wrong.

Additionally, running the models as f32 or f16 produces very similar output for the same prompt/seed. Running the text, coarse, and fine models quantized at q8_0 produces an entirely different output for the same prompt/seed.

memsize should be int64_t rather than int32_t

memsize in gpt_model and bark_model is currently int32_t, and it overflows when adding the individual model sizes together. Changing them to int64_t should resolve this.

How to accurately set up prompt？

Hi! I've tried different prompts, but the results are very strange. See the following examples:

Precision: fp32. Prompt: "one two three four five six seven eight nine ten." The output is 9 seconds long, but it
only takes the first 3s to read out "eight nine ten", and the other 6s almost contain nothing.
Precision: q4. Prompt: "one two three four five six seven eight nine ten." The output is a 12-second-long murmur
Precision: q4. Prompt: "one two three four five six." The output only reads out "two three four five six".
There are also some issues that occur when using different random seeds or prompts like "[MAN] one two three four five six" and "[happy piano music, playing for ten seconds]".
Are there any solutions or suggestions for setting the prompts accurately (especially for playing music)? Thx!

[NEW MODEL] AudioCraft

Meta just release AudioCraft consisting of three main pieces: MusicGen, AudioGen and EnCoDec. We should be able to implement them once bark.cpp is done.

https://ai.meta.com/blog/audiocraft-musicgen-audiogen-encodec-generative-ai-audio/

unable to clone repository

git clone --recursive https://github.com/PABannier/bark.cpp.git
Cloning into 'bark.cpp'...
remote: Enumerating objects: 700, done.
remote: Counting objects: 100% (360/360), done.
remote: Compressing objects: 100% (145/145), done.
remote: Total 700 (delta 292), reused 224 (delta 214), pack-reused 340
Receiving objects: 100% (700/700), 47.85 MiB | 10.92 MiB/s, done.
Resolving deltas: 100% (390/390), done.
Submodule 'encodec.cpp' (https://github.com/PABannier/encodec.cpp) registered for path 'encodec.cpp'
Cloning into '/mnt/ubuntu/home/jape/ai/bark.cpp/encodec.cpp'...
remote: Enumerating objects: 275, done.
remote: Counting objects: 100% (122/122), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 275 (delta 84), reused 68 (delta 52), pack-reused 153
Receiving objects: 100% (275/275), 3.93 MiB | 9.86 MiB/s, done.
Resolving deltas: 100% (155/155), done.
fatal: remote error: upload-pack: not our ref e50cd96d28c89f6c1343c291042b14bab6f3b83b
fatal: Fetched in submodule path 'encodec.cpp', but it did not contain e50cd96d28c89f6c1343c291042b14bab6f3b83b. Direct fetching of that commit failed.

Generate CoreML weights

Performance Estimate Benchmarks

Commendable Effort! Could do some form of performance benchmarking? Latency, Memory Usage etc. maybe on Colab with multiple different configurations. Maybe if Batch processing is enabled also an estimate on the largest batch size to use etc.

FEAT Support Unicode characters for tokenization

CI/MNT: Write unit tests

We should have CI running unit tests.

Seeding the Mersenne twister and for a family of inputs, check we have the same tokenization as bark for:

Text encoder
Coarse encoder
Fine encoder

Additionally, we should test the Bert tokenizer.

Support GGUF file org

Integrate with Vits.cpp to use Vits / Piper models

It might be helpful for your project to integrate the code from Vits.cpp into yours

https://github.com/maxilevi/vits.cpp

This would allow you to run vits / piper models

License of this repository

Hi!
You are building a great project. I plan to use it in my next open source. However, I need to know the license of the project before starting with it.
Could you add a LICENSE file so that we can know what we can do with this project?
Thank you very much!

Support Vocos

submodule encodec.cpp is dropped

Fetched in submodule path '../encodec.cpp', but it did not contain e50cd96d28c89f6c1343c291042b14bab6f3b83b. Direct fetching of that commit failed.

ENH: Create a Bark context

Currently, we need to free memory we need to call:

ggml_free(model.coarse_model.ctx);
ggml_free(model.fine_model.ctx);
ggml_free(model.text_model.ctx);
ggml_free(model.codec_model.ctx);

This is cumbersome and could be replaced easily by a bark_free function, similar to llama_free.

invalid number of codes - SIGSEGV (Address boundary error)

encodec_build_graph: invalid number of codes
fish: Job 1, './bark/build/examples/main/main…' terminated by signal SIGSEGV (Address boundary error)

./bark/build/examples/main/main -m ./ggml_weights_q4/ -p "this is an audio"

Support history prompts / custom voice

MacBook

Why do you want to make it only work on a MacBook?

If it works on a MacBook couldn't it work on other computers too?

Ggml works on any computer, couldn't this do the same?

simple but obvious, the cmake is missing the main :)
the vocab.bin ships with the repo, so why require it for the conversion (i commented it out)
running main yields in a allocation error, trying to allocate 47GiB 🤣

$ ./main -m models/bark_v0/
bark_model_load: loading model from 'models/bark_v0/'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab  = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab  = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab  = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 7
gpt_model_load: n_wtes      = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size    =   44.32 MB

bark_model_load: total model size  =  4170.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token =     4.80 MB
bark_forward_text_encoder:   sample time =    17.30 ms
bark_forward_text_encoder:  predict time =  6746.21 ms / 18.48 ms per token
bark_forward_text_encoder:    total time =  6825.61 ms

bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................

bark_forward_coarse_encoder: mem per token =     8.51 MB
bark_forward_coarse_encoder:   sample time =     4.79 ms
bark_forward_coarse_encoder:  predict time = 30730.57 ms / 94.85 ms per token
bark_forward_coarse_encoder:    total time = 30784.73 ms

fine_gpt_eval: failed to allocate 50200313856 bytes
bark_forward_fine_encoder: ggml_aligned_malloc: insufficient memory (attempted to allocate 47874.75 MB)
GGML_ASSERT: ggml.c:4408: ctx->mem_buffer != NULL
Aborted (core dumped)

Unable to build.

Git clone submodules recursive fails, but I was able to manually fix that.

╰─(base) ⠠⠵ git submodule update --init --recursive                                                                                                                                        on main|✚1
fatal: remote error: upload-pack: not our ref e50cd96d28c89f6c1343c291042b14bab6f3b83b
fatal: Fetched in submodule path 'encodec.cpp', but it did not contain e50cd96d28c89f6c1343c291042b14bab6f3b83b. Direct fetching of that commit failed.

But then when I do cmake --build . --config Release I get:

╰─(base) ⠠⠵ cmake --build . --config Release                                                                                                                                               on main|✚1
[  4%] Building C object encodec.cpp/ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[  8%] Building C object encodec.cpp/ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 12%] Building C object encodec.cpp/ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
[ 16%] Linking C shared library libggml.so
[ 16%] Built target ggml
[ 20%] Building CXX object encodec.cpp/CMakeFiles/encodec.dir/encodec.cpp.o
/home/arthur/dev/ai/bark.cpp/encodec.cpp/encodec.cpp:319:22: warning: multi-character character constant [-Wmultichar]
  319 |         if (magic != ENCODEC_FILE_MAGIC) {
      |                      ^~~~~~~~~~~~~~~~~~
/home/arthur/dev/ai/bark.cpp/encodec.cpp/encodec.cpp: In function ‘void print_tensor(ggml_tensor*)’:
/home/arthur/dev/ai/bark.cpp/encodec.cpp/encodec.cpp:79:27: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 2 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
   79 |         printf("shape=[%lld, %lld, %lld, %lld]\n", a->ne[0], a->ne[1], a->ne[2], a->ne[3]);
      |                        ~~~^                        ~~~~~~~~
      |                           |                               |
      |                           long long int                   int64_t {aka long int}
      |                        %ld
/home/arthur/dev/ai/bark.cpp/encodec.cpp/encodec.cpp:79:33: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
   79 |         printf("shape=[%lld, %lld, %lld, %lld]\n", a->ne[0], a->ne[1], a->ne[2], a->ne[3]);
      |                              ~~~^                            ~~~~~~~~
      |                                 |                                   |
      |                                 long long int                       int64_t {aka long int}
      |                              %ld
/home/arthur/dev/ai/bark.cpp/encodec.cpp/encodec.cpp:79:39: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
   79 |         printf("shape=[%lld, %lld, %lld, %lld]\n", a->ne[0], a->ne[1], a->ne[2], a->ne[3]);
      |                                    ~~~^                                ~~~~~~~~
      |                                       |                                       |
      |                                       long long int                           int64_t {aka long int}
      |                                    %ld
/home/arthur/dev/ai/bark.cpp/encodec.cpp/encodec.cpp:79:45: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 5 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
   79 |         printf("shape=[%lld, %lld, %lld, %lld]\n", a->ne[0], a->ne[1], a->ne[2], a->ne[3]);
      |                                          ~~~^                                    ~~~~~~~~
      |                                             |                                           |
      |                                             long long int                               int64_t {aka long int}
      |                                          %ld
/home/arthur/dev/ai/bark.cpp/encodec.cpp/encodec.cpp: In function ‘bool encodec_load_model_weights(const std::string&, encodec_model&, int)’:
/home/arthur/dev/ai/bark.cpp/encodec.cpp/encodec.cpp:714:89: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 5 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
  714 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld, %lld], expected [%d, %d, %d]\n",
      |                                                                                      ~~~^
      |                                                                                         |
      |                                                                                         long long int
      |                                                                                      %ld
  715 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], tensor->ne[2], ne[0], ne[1], ne[2]);
      |                                                ~~~~~~~~~~~~~                             
      |                                                            |
      |                                                            int64_t {aka long int}
/home/arthur/dev/ai/bark.cpp/encodec.cpp/encodec.cpp:714:95: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 6 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
  714 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld, %lld], expected [%d, %d, %d]\n",
      |                                                                                            ~~~^
      |                                                                                               |
      |                                                                                               long long int
      |                                                                                            %ld
  715 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], tensor->ne[2], ne[0], ne[1], ne[2]);
      |                                                               ~~~~~~~~~~~~~                    
      |                                                                           |
      |                                                                           int64_t {aka long int}
/home/arthur/dev/ai/bark.cpp/encodec.cpp/encodec.cpp:714:101: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 7 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
  714 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld, %lld], expected [%d, %d, %d]\n",
      |                                                                                                  ~~~^
      |                                                                                                     |
      |                                                                                                     long long int
      |                                                                                                  %ld
  715 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], tensor->ne[2], ne[0], ne[1], ne[2]);
      |                                                                              ~~~~~~~~~~~~~           
      |                                                                                          |
      |                                                                                          int64_t {aka long int}
[ 25%] Linking CXX static library libencodec.a
[ 25%] Built target encodec
[ 29%] Building CXX object CMakeFiles/bark.dir/bark.cpp.o
/home/arthur/dev/ai/bark.cpp/bark/bark.cpp: In function ‘void print_tensor(ggml_tensor*)’:
/home/arthur/dev/ai/bark.cpp/bark/bark.cpp:74:27: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 2 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
   74 |         printf("shape=[%lld, %lld, %lld, %lld]\n", a->ne[0], a->ne[1], a->ne[2], a->ne[3]);
      |                        ~~~^                        ~~~~~~~~
      |                           |                               |
      |                           long long int                   int64_t {aka long int}
      |                        %ld
/home/arthur/dev/ai/bark.cpp/bark/bark.cpp:74:33: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
   74 |         printf("shape=[%lld, %lld, %lld, %lld]\n", a->ne[0], a->ne[1], a->ne[2], a->ne[3]);
      |                              ~~~^                            ~~~~~~~~
      |                                 |                                   |
      |                                 long long int                       int64_t {aka long int}
      |                              %ld
/home/arthur/dev/ai/bark.cpp/bark/bark.cpp:74:39: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
   74 |         printf("shape=[%lld, %lld, %lld, %lld]\n", a->ne[0], a->ne[1], a->ne[2], a->ne[3]);
      |                                    ~~~^                                ~~~~~~~~
      |                                       |                                       |
      |                                       long long int                           int64_t {aka long int}
      |                                    %ld
/home/arthur/dev/ai/bark.cpp/bark/bark.cpp:74:45: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 5 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
   74 |         printf("shape=[%lld, %lld, %lld, %lld]\n", a->ne[0], a->ne[1], a->ne[2], a->ne[3]);
      |                                          ~~~^                                    ~~~~~~~~
      |                                             |                                           |
      |                                             long long int                               int64_t {aka long int}
      |                                          %ld
/home/arthur/dev/ai/bark.cpp/bark/bark.cpp: In function ‘void bark_print_statistics(gpt_model*)’:
/home/arthur/dev/ai/bark.cpp/bark/bark.cpp:123:47: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
  123 |     printf("%s:   sample time = %8.2f ms / %lld tokens\n", __func__, model->t_sample_us/1000.0f, model->n_sample);
      |                                            ~~~^                                                  ~~~~~~~~~~~~~~~
      |                                               |                                                         |
      |                                               long long int                                             int64_t {aka long int}
      |                                            %ld
/home/arthur/dev/ai/bark.cpp/bark/bark.cpp: In function ‘bool bark_generate_audio(bark_context*, std::string&, std::string&, int, bark_verbosity_level)’:
/home/arthur/dev/ai/bark.cpp/bark/bark.cpp:2048:43: error: ‘encodec_verbosity_level’ has not been declared
 2048 |         encodec_model_path, n_gpu_layers, encodec_verbosity_level::LOW);
      |                                           ^~~~~~~~~~~~~~~~~~~~~~~
/home/arthur/dev/ai/bark.cpp/bark/bark.cpp:2060:5: error: ‘encodec_set_sample_rate’ was not declared in this scope
 2060 |     encodec_set_sample_rate(ectx, sample_rate);
      |     ^~~~~~~~~~~~~~~~~~~~~~~
gmake[2]: *** [CMakeFiles/bark.dir/build.make:76: CMakeFiles/bark.dir/bark.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:256: CMakeFiles/bark.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2
╭─arthur at aquarelle in ~/dev/ai/bark.cpp/bark/build on main✘✘✘ 24-03-06 - 5:17:37
╰─(base) ⠠⠵ cmake --build . --config Release

any ideas?
thanks!

Support for AudioLDM2

We seem to have a working implementation of AudioLDM2

I understand you have already mentioned you will implement Vocos and AudioCraft. But it seems to me that AudioLDM produces better outputs.

Please have a look?
:)