
bellperson's Introduction

bellperson

This is a fork of the great bellman library.

bellman is a crate for building zk-SNARK circuits. It provides circuit traits and primitive structures, as well as basic gadget implementations such as booleans and number abstractions.
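
For orientation, here is a minimal sketch of what a circuit built on these traits can look like. It proves knowledge of a private x with x * x = y for a public input y. It is written against a recent bellperson API in which Circuit is generic over a PrimeField scalar; exact trait bounds and module paths differ between versions, so treat it as an illustration rather than a drop-in example.

    use bellperson::{Circuit, ConstraintSystem, SynthesisError};
    use ff::PrimeField;

    // Proves knowledge of a private x such that x * x equals the public input y.
    struct SquareCircuit<Scalar: PrimeField> {
        x: Option<Scalar>,
    }

    impl<Scalar: PrimeField> Circuit<Scalar> for SquareCircuit<Scalar> {
        fn synthesize<CS: ConstraintSystem<Scalar>>(
            self,
            cs: &mut CS,
        ) -> Result<(), SynthesisError> {
            // Allocate the private witness x.
            let x = cs.alloc(|| "x", || self.x.ok_or(SynthesisError::AssignmentMissing))?;
            // Allocate the public input y and assign it the value x * x.
            let y = cs.alloc_input(
                || "y",
                || {
                    let x = self.x.ok_or(SynthesisError::AssignmentMissing)?;
                    Ok(x * x)
                },
            )?;
            // Enforce the single R1CS constraint x * x = y.
            cs.enforce(|| "x * x = y", |lc| lc + x, |lc| lc + x, |lc| lc + y);
            Ok(())
        }
    }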

Backend

There is currently one backend available for the implementation of BLS12-381:

  • blstrs - optimized with hand-tuned assembly, using blst

GPU

This fork contains GPU parallel acceleration of the FFT and multiexponentiation algorithms in the groth16 prover codebase, under the compilation features cuda and opencl.

Requirements

  • NVIDIA GPU (Turing or newer), or
  • AMD GPU graphics driver (OpenCL)

(For AMD devices we recommend ROCm.)

Environment variables

The GPU extension supports some environment variables that may be set externally to this library.

  • BELLMAN_NO_GPU

    Will disable the GPU feature from the library and force usage of the CPU.

    // Example
    env::set_var("BELLMAN_NO_GPU", "1");
  • BELLMAN_VERIFIER

    Chooses the device on which the batched verifier will run. Can be cpu, gpu or auto.

    // Example
    env::set_var("BELLMAN_VERIFIER", "gpu");
  • RUST_GPU_TOOLS_CUSTOM_GPU

    Will allow adding a GPU not in the tested list. This requires looking up the name of the GPU device and its number of cores, given in the format "name:cores" (multiple devices separated by commas).

    // Example
    env::set_var("RUST_GPU_TOOLS_CUSTOM_GPU", "GeForce RTX 2080 Ti:4352, GeForce GTX 1060:1280");
  • BELLMAN_CPU_UTILIZATION

    Can be set in the interval [0, 1] to designate the proportion of the multiexponentiation calculation that is moved to the CPU in parallel with the GPU, to keep all hardware occupied.

    // Example
    env::set_var("BELLMAN_CPU_UTILIZATION", "0.5");
  • RAYON_NUM_THREADS

    Restricts the number of threads used in the library to roughly that number (best effort). In the past this was done using BELLMAN_NUM_CPUS, which is now deprecated. The default is the number of logical cores reported on the machine.

    // Example
    env::set_var("RAYON_NUM_THREADS", "6");
  • EC_GPU_NUM_THREADS

    Restricts the number of threads used by the FFT and multiexponentiation calculations. In the past this setting was shared with RAYON_NUM_THREADS; now they are separate settings that can be controlled independently. The default is the number of logical cores reported on the machine.

    // Example
    env::set_var("EC_GPU_NUM_THREADS", "6");
  • BELLMAN_GPU_FRAMEWORK

    Bellman can be compiled with both OpenCL and CUDA support. When both are available, BELLMAN_GPU_FRAMEWORK can be used to select a specific one, either cuda or opencl.

    // Example
    env::set_var("BELLMAN_GPU_FRAMEWORK", "opencl");
  • BELLMAN_CUDA_NVCC_ARGS

    By default the CUDA kernel is compiled for several architectures, which may take a long time. BELLMAN_CUDA_NVCC_ARGS can be used to override those arguments. The input and output file will still be automatically set.

    // Example for compiling the kernel for only the Turing architecture
    env::set_var("BELLMAN_CUDA_NVCC_ARGS", "--fatbin --gpu-architecture=sm_75 --generate-code=arch=compute_75,code=sm_75");
  • BELLPERSON_GPUS_PER_LOCK

    Restricts the number of devices used by the FFT and multiexponentiation calculations.

    • If it is not set, a single lock will be created and each calculation uses all devices.
    • If BELLPERSON_GPUS_PER_LOCK = 0, no lock will be created, each calculation uses all devices, and each device can run multiple calculations. WARNING: this option can break things easily. Each kernel expects to run without anything else running on the GPU at the same time. If two kernels run at the same time, they might interfere with each other and lead to crashes or wrong results.
    • If BELLPERSON_GPUS_PER_LOCK > 0, a lock is created for each device, and each calculation uses up to BELLPERSON_GPUS_PER_LOCK devices (capped at the number of available devices).
    // Example
    env::set_var("BELLPERSON_GPUS_PER_LOCK", "0");
    env::set_var("BELLPERSON_GPUS_PER_LOCK", "1");

Supported / Tested Cards

Depending on the size of the proof being passed to the GPU for work, certain cards will not be able to allocate enough memory for either the FFT or the Multiexp kernel. Below is a list of devices that work for small sets. In the future we will add the cutoff point at which a given card will not be able to allocate enough memory to utilize the GPU.

Device Name                 Cores    Comments
--------------------------  -------  ----------------
Quadro RTX 6000             4608
TITAN RTX                   4608
Tesla V100                  5120
Tesla P100                  3584
Tesla T4                    2560
Quadro M5000                2048
GeForce RTX 3090            10496
GeForce RTX 3080            8704
GeForce RTX 3070            5888
GeForce RTX 2080 Ti         4352
GeForce RTX 2080 SUPER      3072
GeForce RTX 2080            2944
GeForce RTX 2070 SUPER      2560
GeForce GTX 1080 Ti         3584
GeForce GTX 1080            2560
GeForce RTX 2060            1920
GeForce GTX 1660 Ti         1536
GeForce GTX 1060            1280
GeForce GTX 1650 SUPER      1280
GeForce GTX 1650            896
gfx1010                     2560     AMD RX 5700 XT
gfx906                      7400     AMD RADEON VII

Running Tests

RUSTFLAGS="-C target-cpu=native" cargo test --release --all

To run using CUDA and OpenCL, you can use:

RUSTFLAGS="-C target-cpu=native" cargo test --release --all --features cuda,opencl

To run the multiexp_consistency test you can use:

RUST_LOG=info cargo test --features cuda,opencl -- --exact multiexp::gpu_multiexp_consistency --nocapture

Considerations

Bellperson uses rust-gpu-tools as its CUDA/OpenCL backend; therefore you may see a directory named ~/.rust-gpu-tools in your home folder, which contains the compiled binaries of the OpenCL kernels used in this repository.

Experimental

The instance aggregation provided by groth16::aggregate::prove::aggregate_proofs_and_instances() has not yet been audited, so it should be used with caution. It is not recommended to use instance aggregation in production until it has been audited.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

bellperson's People

Contributors

beck-8, bmerge, cryptonemo, daviddias, defuse, dgbo, dignifiedquire, drpetervannostrand, ebfull, huitseeker, jasondavies, jleni, keyvank, kikakkz, kobigurk, kubuxu, lemmih, nginnever, nikkolasg, porcuquine, robquistnl, samuelburnham, shawnrader, simonatsn, stebalien, str4d, stuberman, vmx, xib1uvxi, zhiqiangxu

bellperson's Issues

[arm64 server] GPU failed, Error: Ocl Error: OPENCL ERROR

Describe the bug
[arm64 server] GPU FFT failed! Falling back to CPU... Error: Ocl Error: OPENCL ERROR

[arm64 server] To Reproduce
Steps to reproduce the behavior:

Run './bench sealing --sector-size=512M'
See error
################################ OPENCL ERROR ###############################

Error executing function: clCreateProgramWithBinary

Status error code: CL_INVALID_VALUE (-30)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateProgramWithBinary.html#errors

#############################################################################

out of range for slice of length

2021-03-05T05:46:31.296 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 55005916
2021-03-05T05:46:31.296 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 44286599
2021-03-05T05:46:31.296 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 128795429
2021-03-05T05:46:31.297 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 120850541
2021-03-05T05:46:31.297 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 35470004
2021-03-05T05:46:31.298 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 96664777
2021-03-05T05:46:31.299 DEBUG merkletree::merkle > leafs 134217728, branches 8, total size 153391689, total row_count 10, cache_size 299593, rows_to_discard 2, partial_row_count 4, cached_leafs 262144, segment_width 512, segment range 63143936-63144448 for 63144211
2021-03-05T05:46:31.299 DEBUG merkletree::merkle > leafs 134217728, branches 8, total size 153391689, total row_count 10, cache_size 299593, rows_to_discard 2, partial_row_count 4, cached_leafs 262144, segment_width 512, segment range 88801792-88802304 for 88801901
2021-03-05T05:46:31.299 DEBUG merkletree::merkle > leafs 134217728, branches 8, total size 153391689, total row_count 10, cache_size 299593, rows_to_discard 2, partial_row_count 4, cached_leafs 262144, segment_width 512, segment range 108494848-108495360 for 108495240
2021-03-05T05:46:31.299 DEBUG merkletree::merkle > leafs 134217728, branches 8, total size 153391689, total row_count 10, cache_size 299593, rows_to_discard 2, partial_row_count 4, cached_leafs 262144, segment_width 512, segment range 22749696-22750208 for 22749959
2021-03-05T05:46:31.299 DEBUG merkletree::merkle > leafs 134217728, branches 8, total size 153391689, total row_count 10, cache_size 299593, rows_to_discard 2, partial_row_count 4, cached_leafs 262144, segment_width 512, segment range 14698496-14699008 for 14698963
2021-03-05T05:46:31.299 DEBUG merkletree::merkle > leafs 134217728, branches 8, total size 153391689, total row_count 10, cache_size 299593, rows_to_discard 2, partial_row_count 4, cached_leafs 262144, segment_width 512, segment range 98592768-98593280 for 98593246
2021-03-05T05:46:31.299 DEBUG merkletree::merkle > leafs 134217728, branches 8, total size 153391689, total row_count 10, cache_size 299593, rows_to_discard 2, partial_row_count 4, cached_leafs 262144, segment_width 512, segment range 7117824-7118336 for 7118271
2021-03-05T05:46:31.299 DEBUG merkletree::merkle > leafs 134217728, branches 8, total size 153391689, total row_count 10, cache_size 299593, rows_to_discard 2, partial_row_count 4, cached_leafs 262144, segment_width 512, segment range 101633024-101633536 for 101633070
2021-03-05T05:46:31.299 DEBUG merkletree::merkle > leafs 134217728, branches 8, total size 153391689, total row_count 10, cache_size 299593, rows_to_discard 2, partial_row_count 4, cached_leafs 262144, segment_width 512, segment range 5264896-5265408 for 5265230
2021-03-05T05:46:31.299 DEBUG merkletree::merkle > leafs 134217728, branches 8, total size 153391689, total row_count 10, cache_size 299593, rows_to_discard 2, partial_row_count 4, cached_leafs 262144, segment_width 512, segment range 121800192-121800704 for 121800669
2021-03-05T05:46:31.301 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 14698963
2021-03-05T05:46:31.301 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 22749959
2021-03-05T05:46:31.301 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 5265230
2021-03-05T05:46:31.301 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 7118271
2021-03-05T05:46:31.301 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 88801901
2021-03-05T05:46:31.301 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 101633070
2021-03-05T05:46:31.301 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 98593246
2021-03-05T05:46:31.301 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 63144211
2021-03-05T05:46:31.301 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 121800669
2021-03-05T05:46:31.301 DEBUG merkletree::merkle > generated partial_tree of row_count 4 and len 585 with 8 branches for proof at 108495240
2021-03-05T05:46:31.325 INFO storage_proofs_core::compound_proof > vanilla_proofs:finish
2021-03-05T05:46:31.731 INFO storage_proofs_core::compound_proof > snark_proof:start
2021-03-05T05:46:31.856 INFO bellperson::groth16::prover > Bellperson 0.12.5 is being used!
2021-03-05T05:46:56.860 INFO bellperson::groth16::prover > starting proof timer
2021-03-05T05:46:59.156 DEBUG bellperson::gpu::locks > Acquiring priority lock at "/data/tmpdir/bellman.priority.lock" ...
2021-03-05T05:46:59.156 DEBUG bellperson::gpu::locks > Priority lock acquired!
2021-03-05T05:46:59.372 INFO bellperson::gpu::locks > GPU is available for FFT!
2021-03-05T05:46:59.372 DEBUG bellperson::gpu::locks > Acquiring GPU lock at "/data/tmpdir/bellman.gpu.lock" ...
2021-03-05T05:46:59.372 DEBUG bellperson::gpu::locks > GPU lock acquired!
2021-03-05T05:46:59.469 INFO bellperson::gpu::fft > FFT: 1 working device(s) selected.
2021-03-05T05:46:59.470 INFO bellperson::gpu::fft > FFT: Device 0: GeForce RTX 2080 Ti
2021-03-05T05:46:59.470 INFO bellperson::domain > GPU FFT kernel instantiated!
2021-03-05T05:47:24.467 DEBUG bellperson::gpu::locks > GPU lock released!
thread '' panicked at 'range end index 6043200150033888668 out of range for slice of length 60780374232', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.12.5/src/groth16/mapped_params.rs:141:16
stack backtrace:
0: 0x2576ca0 - std::backtrace_rs::backtrace::libunwind::trace::he85dfb3ae4206056
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/../../backtrace/src/backtrace/libunwind.rs:96
1: 0x2576ca0 - std::backtrace_rs::backtrace::trace_unsynchronized::h1ad28094d7b00c21
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/../../backtrace/src/backtrace/mod.rs:66
2: 0x2576ca0 - std::sys_common::backtrace::_print_fmt::h901b54610713cd21
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/sys_common/backtrace.rs:79
3: 0x2576ca0 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hb0ad78ee1571f7e0
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/sys_common/backtrace.rs:58
4: 0x25e545c - core::fmt::write::h1857a60b204f1b6a
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/core/src/fmt/mod.rs:1080
5: 0x2568952 - std::io::Write::write_fmt::hf7b7d7b243f84a36
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/io/mod.rs:1516
6: 0x257ba7d - std::sys_common::backtrace::_print::hd093978a5287b8ff
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/sys_common/backtrace.rs:61
7: 0x257ba7d - std::sys_common::backtrace::print::h20f46787581d56d7
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/sys_common/backtrace.rs:48
8: 0x257ba7d - std::panicking::default_hook::{{closure}}::h486cbb4b82ffc357
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/panicking.rs:208
9: 0x257b728 - std::panicking::default_hook::h4190c9e3edd4d591
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/panicking.rs:227
10: 0x257c2a1 - std::panicking::rust_panic_with_hook::h72e78719cdda225c
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/panicking.rs:577
11: 0x257be49 - std::panicking::begin_panic_handler::{{closure}}::h8bd07dbd34150a96
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/panicking.rs:484
12: 0x257712c - std::sys_common::backtrace::__rust_end_short_backtrace::hdb6b3066ad29028a
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/sys_common/backtrace.rs:153
13: 0x257be09 - rust_begin_unwind
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/std/src/panicking.rs:483
14: 0x25e1781 - core::panicking::panic_fmt::hb15d6f55e8472f62
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/core/src/panicking.rs:85
15: 0x25e7612 - core::slice::index::slice_end_index_len_fail::hd6713db859210b4a
at /rustc/beb5ae474d2835962ebdf7416bd1c9ad864fe101/library/core/src/slice/index.rs:41
16: 0x1f8c204 - bellperson::groth16::mapped_params::read_g1::h004f43cae7169a0b
17: 0x1e8fa9f - <core::iter::adapters::ResultShunt<I,E> as core::iter::traits::iterator::Iterator>::next::h6c8c6ce98d19654b
18: 0x2767520 - <alloc::vec::Vec as alloc::vec::SpecFromIter<T,I>>::from_iter::h8a6371b1a6f7802d
19: 0x1f99b5e - <&bellperson::groth16::mapped_params::MappedParameters as bellperson::groth16::params::ParameterSource>::get_h::h949f5f6ca34ae937
20: 0x1e826a4 - <core::iter::adapters::ResultShunt<I,E> as core::iter::traits::iterator::Iterator>::next::h3634294668c630ae
21: 0x2769f43 - <alloc::vec::Vec as alloc::vec::SpecFromIter<T,I>>::from_iter::ha59406e4b13af32a
22: 0x1ec321d - core::iter::adapters::process_results::he7aadfffdfa06262
23: 0x203d7d3 - bellperson::groth16::prover::create_random_proof_batch_priority::h276c650f8150c79e
24: 0x29246c4 - storage_proofs_core::compound_proof::CompoundProof::circuit_proofs::h77cfb9f28969c7e1
25: 0x2926e9f - storage_proofs_core::compound_proof::CompoundProof::prove::ha227d03058c967dd
26: 0x2111259 - filecoin_proofs::api::post::generate_window_post::h3ce375a36a3720e3
27: 0x1eecb91 - filecoin_proofs_api::post::generate_window_post_inner::h2412de5044ab1125
28: 0x1eebd70 - filecoin_proofs_api::post::generate_window_post::h98bd97b00231eae4
29: 0x1c9bdf1 - <std::panic::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once::hec3921fd7d7125a0
30: 0x260c9de - ffi_toolkit::catch_panic_response::he354153ed6d2e058
31: 0x1ddf6f0 - fil_generate_window_post
32: 0x1b50ae6 - _cgo_a8fa62747d41_Cfunc_fil_generate_window_post
at /tmp/go-build/cgo-gcc-prolog:596
33: 0x5a7370 - runtime.asmcgocall
at /usr/local/go/src/runtime/asm_amd64.s:656
2021-03-05T05:47:25.427 DEBUG bellperson::gpu::locks > Priority lock released!
2021-03-05T05:47:25.616+0800 INFO storageminer storage/wdpost_run.go:595 computing window post {"batch": 0, "elapsed": 55.337996454}
2021-03-05T05:47:25.616+0800 ERROR storageminer storage/wdpost_run.go:104 runPost failed: running window post failed:
github.com/filecoin-project/lotus/storage.(*WindowPoStScheduler).runPost
/data/lotus_sources/lotus/lotus/storage/wdpost_run.go:630

  • Rust panic: no unwind information
    github.com/filecoin-project/filecoin-ffi.GenerateWindowPoSt
    /data/lotus_sources/lotus/lotus/extern/filecoin-ffi/proofs.go:587
    github.com/filecoin-project/lotus/extern/sector-storage/ffiwrapper.(*Sealer).GenerateWindowPoSt
    /data/lotus_sources/lotus/lotus/extern/sector-storage/ffiwrapper/verifier_cgo.go:45
    github.com/filecoin-project/lotus/storage.(*WindowPoStScheduler).runPost
    /data/lotus_sources/lotus/lotus/storage/wdpost_run.go:592
    github.com/filecoin-project/lotus/storage.(*WindowPoStScheduler).runGeneratePoST
    /data/lotus_sources/lotus/lotus/storage/wdpost_run.go:102
    github.com/filecoin-project/lotus/storage.(*WindowPoStScheduler).startGeneratePoST.func1
    /data/lotus_sources/lotus/lotus/storage/wdpost_run.go:86
    runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1374
    2021-03-05T05:47:25.616+0800 ERROR storageminer storage/wdpost_run.go:48 Got err running window post failed:
    github.com/filecoin-project/lotus/storage.(*WindowPoStScheduler).runPost
    /data/lotus_sources/lotus/lotus/storage/wdpost_run.go:630
  • Rust panic: no unwind information
    github.com/filecoin-project/filecoin-ffi.GenerateWindowPoSt
    /data/lotus_sources/lotus/lotus/extern/filecoin-ffi/proofs.go:587
    github.com/filecoin-project/lotus/extern/sector-storage/ffiwrapper.(*Sealer).GenerateWindowPoSt
    /data/lotus_sources/lotus/lotus/extern/sector-storage/ffiwrapper/verifier_cgo.go:45
    github.com/filecoin-project/lotus/storage.(*WindowPoStScheduler).runPost
    /data/lotus_sources/lotus/lotus/storage/wdpost_run.go:592
    github.com/filecoin-project/lotus/storage.(*WindowPoStScheduler).runGeneratePoST
    /data/lotus_sources/lotus/lotus/storage/wdpost_run.go:102
    github.com/filecoin-project/lotus/storage.(*WindowPoStScheduler).startGeneratePoST.func1
    /data/lotus_sources/lotus/lotus/storage/wdpost_run.go:86
    runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1374 - TODO handle errors

Stuck acquiring priority lock

Currently experiencing issues around acquiring priority locks in the lotus project. This issue is to track it from the bellperson side.

(Lotus issue for reference filecoin-project/lotus#5446)

I don't know if this is actually an issue in the bellperson library or an integration issue in lotus, but we are seeing things get stuck around acquiring the priority lock. This has occurred both while running window post and winning post.

The issue seems to be much more common when running in docker, where I can reproduce it with consistent results. It occurs during the very first winning post of the network. I have observed a winning post completing once, but it locks up on the second.

I haven't been able to get the issue to occur when running the lotus bench tool in docker though.

This issue occurs regardless of the value of BELLMAN_NO_GPU.

Jan 26 20:13:46 preminer-0.interop.fildev.network lotus-miner[8878]: {"level":"info","ts":"2021-01-26T20:13:46.928+0000","logger":"storage_proofs_core::compound_proof","caller":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-core-5.4.0/src/compound_proof.rs:86","msg":"vanilla_proofs:finish"}
Jan 26 20:13:46 preminer-0.interop.fildev.network lotus-miner[8878]: {"level":"info","ts":"2021-01-26T20:13:46.971+0000","logger":"storage_proofs_core::compound_proof","caller":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-core-5.4.0/src/compound_proof.rs:92","msg":"snark_proof:start"}
Jan 26 20:13:46 preminer-0.interop.fildev.network lotus-miner[8878]: {"level":"info","ts":"2021-01-26T20:13:46.972+0000","logger":"bellperson::groth16::prover","caller":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.12.1/src/groth16/prover.rs:274","msg":"Bellperson 0.12.1 is being used!"}
Jan 26 20:13:47 preminer-0.interop.fildev.network lotus-miner[8878]: {"level":"info","ts":"2021-01-26T20:13:47.663+0000","logger":"bellperson::groth16::prover","caller":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.12.1/src/groth16/prover.rs:309","msg":"starting proof timer"}
Jan 26 20:13:47 preminer-0.interop.fildev.network lotus-miner[8878]: {"level":"debug","ts":"2021-01-26T20:13:47.663+0000","logger":"bellperson::gpu::locks","caller":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.12.1/src/gpu/locks.rs:40","msg":"Acquiring priority lock..."}

Issue has been seen in bellman 0.12.1, 0.12.3

Possibly related:
#126

multiexp: Cannot initialize kernel for device

I think I've also seen this error on CI:

[2021-03-15T17:13:42Z ERROR bellperson::gpu::multiexp] Cannot initialize kernel for device 'Tesla V100-SXM2-16GB'! Error: OpenCL Error: Ocl Error: 
    
    ###################### OPENCL PROGRAM BUILD DEBUG OUTPUT ######################
    
    <kernel>:160:40: warning: excess elements in array initializer
      if(!Fr_gte(a, b)) res = Fr_add_(res, Fr_P);
                                           ^~~~
    <kernel>:87:23: note: expanded from macro 'Fr_P'
    #define Fr_P ((Fr){ { 64513 } })
                          ^~~~~
    <kernel>:167:18: warning: excess elements in array initializer
      if(Fr_gte(res, Fr_P)) res = Fr_sub_(res, Fr_P);
                     ^~~~
    <kernel>:87:23: note: expanded from macro 'Fr_P'
    #define Fr_P ((Fr){ { 64513 } })
                          ^~~~~
    <kernel>:167:44: warning: excess elements in array initializer
      if(Fr_gte(res, Fr_P)) res = Fr_sub_(res, Fr_P);
                                               ^~~~
    <kernel>:87:23: note: expanded from macro 'Fr_P'
    #define Fr_P ((Fr){ { 64513 } })
                          ^~~~~
    <kernel>:189:26: warning: excess elements in array initializer
        Fr_mac_with_carry(m, Fr_P.val[0], t[0], &carry);
                             ^~~~
    <kernel>:87:23: note: expanded from macro 'Fr_P'
    #define Fr_P ((Fr){ { 64513 } })
                          ^~~~~
    <kernel>:191:39: warning: excess elements in array initializer
          t[j - 1] = Fr_mac_with_carry(m, Fr_P.val[j], t[j], &carry);
                                          ^~~~
    <kernel>:87:23: note: expanded from macro 'Fr_P'
    #define Fr_P ((Fr){ { 64513 } })
                          ^~~~~
    <kernel>:200:21: warning: excess elements in array initializer
      if(Fr_gte(result, Fr_P)) result = Fr_sub_(result, Fr_P);
                        ^~~~
    <kernel>:87:23: note: expanded from macro 'Fr_P'
    #define Fr_P ((Fr){ { 64513 } })
                          ^~~~~
    <kernel>:200:53: warning: excess elements in array initializer
      if(Fr_gte(result, Fr_P)) result = Fr_sub_(result, Fr_P);
                                                        ^~~~
    <kernel>:87:23: note: expanded from macro 'Fr_P'
    #define Fr_P ((Fr){ { 64513 } })
                          ^~~~~
    <kernel>:193:5: warning: array index -1 is before the beginning of the array
        t[Fr_LIMBS - 1] = Fr_add_with_carry(t[Fr_LIMBS], &carry);
        ^ ~~~~~~~~~~~~
    <kernel>:179:3: note: array 't' declared here
      Fr_limb t[Fr_LIMBS + 2] = {0};
      ^
    <kernel>:83:17: note: expanded from macro 'Fr_limb'
    #define Fr_limb ulong
                    ^
    <kernel>:217:16: warning: excess elements in array initializer
      if(Fr_gte(a, Fr_P)) a = Fr_sub_(a, Fr_P);
                   ^~~~
    <kernel>:87:23: note: expanded from macro 'Fr_P'
    #define Fr_P ((Fr){ { 64513 } })
                          ^~~~~
    <kernel>:217:38: warning: excess elements in array initializer
      if(Fr_gte(a, Fr_P)) a = Fr_sub_(a, Fr_P);
                                         ^~~~
    <kernel>:87:23: note: expanded from macro 'Fr_P'
    #define Fr_P ((Fr){ { 64513 } })
                          ^~~~~
    <kernel>:516:40: warning: excess elements in array initializer
      if(!Fq_gte(a, b)) res = Fq_add_(res, Fq_P);
                                           ^~~~
    <kernel>:443:23: note: expanded from macro 'Fq_P'
    #define Fq_P ((Fq){ { 64513 } })
                          ^~~~~
    <kernel>:523:18: warning: excess elements in array initializer
      if(Fq_gte(res, Fq_P)) res = Fq_sub_(res, Fq_P);
                     ^~~~
    <kernel>:443:23: note: expanded from macro 'Fq_P'
    #define Fq_P ((Fq){ { 64513 } })
                          ^~~~~
    <kernel>:523:44: warning: excess elements in array initializer
      if(Fq_gte(res, Fq_P)) res = Fq_sub_(res, Fq_P);
                                               ^~~~
    <kernel>:443:23: note: expanded from macro 'Fq_P'
    #define Fq_P ((Fq){ { 64513 } })
                          ^~~~~
    <kernel>:545:26: warning: excess elements in array initializer
        Fq_mac_with_carry(m, Fq_P.val[0], t[0], &carry);
                             ^~~~
    <kernel>:443:23: note: expanded from macro 'Fq_P'
    #define Fq_P ((Fq){ { 64513 } })
                          ^~~~~
    <kernel>:547:39: warning: excess elements in array initializer
          t[j - 1] = Fq_mac_with_carry(m, Fq_P.val[j], t[j], &carry);
                                          ^~~~
    <kernel>:443:23: note: expanded from macro 'Fq_P'
    #define Fq_P ((Fq){ { 64513 } })
                          ^~~~~
    <kernel>:556:21: warning: excess elements in array initializer
      if(Fq_gte(result, Fq_P)) result = Fq_sub_(result, Fq_P);
                        ^~~~
    <kernel>:443:23: note: expanded from macro 'Fq_P'
    #define Fq_P ((Fq){ { 64513 } })
                          ^~~~~
    <kernel>:556:53: warning: excess elements in array initializer
      if(Fq_gte(result, Fq_P)) result = Fq_sub_(result, Fq_P);
                                                        ^~~~
    <kernel>:443:23: note: expanded from macro 'Fq_P'
    #define Fq_P ((Fq){ { 64513 } })
                          ^~~~~
    <kernel>:549:5: warning: array index -1 is before the beginning of the array
        t[Fq_LIMBS - 1] = Fq_add_with_carry(t[Fq_LIMBS], &carry);
        ^ ~~~~~~~~~~~~
    <kernel>:535:3: note: array 't' declared here
      Fq_limb t[Fq_LIMBS + 2] = {0};
      ^
    <kernel>:439:17: note: expanded from macro 'Fq_limb'
    #define Fq_limb ulong
                    ^
    <kernel>:573:16: warning: excess elements in array initializer
      if(Fq_gte(a, Fq_P)) a = Fq_sub_(a, Fq_P);
                   ^~~~
    <kernel>:443:23: note: expanded from macro 'Fq_P'
    #define Fq_P ((Fq){ { 64513 } })
                          ^~~~~
    <kernel>:573:38: warning: excess elements in array initializer
      if(Fq_gte(a, Fq_P)) a = Fq_sub_(a, Fq_P);
                                         ^~~~
    <kernel>:443:23: note: expanded from macro 'Fq_P'
    #define Fq_P ((Fq){ { 64513 } })
                          ^~~~~
    Error: Empty parameter types are not supported
    
    ###############################################################################

I can reproduce this with running:

RUST_LOG=debug cargo test --features gpu --lib -- groth16::tests::test_xordemo --show-output

In order to see the log messages, the logger needs to be initialized. Do that by adding let _ = env_logger::try_init(); to the test_xordemo().

This error doesn't seem to do any harm; the tests still pass. But I thought it was worth mentioning to see if the code does anything strange.

Support AMD GPUs

Currently AMD GPUs are not supported, this issue is for tracking what needs to happen to enable them.

Belpeople insist on grabbing GPU lock with no GPU hardware and GPU-off flags

I am running a lotus miner on a GPU-less machine with:
BELLMAN_NO_GPU=1 lotus-miner run --enable-gpu-proving=false

Regardless, I still get /tmp/bellman.gpu.lock being exlock-ed (discovered due to incorrect permissions on said file leading to sealing failures).

Bellperson should not do GPU-related things when told not to.

FFT error

128GB mem+128GB swap+2080Ti,V27
[2020-06-22T15:23:20.959][WARN][bellperson::domain:572] Cannot instantiate GPU FFT kernel! Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clCreateContext

Status error code: CL_OUT_OF_HOST_MEMORY (-6)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateContext.html#errors

#############################################################################

Cannot instantiate GPU FFT kernel! Error: GPUError: No working GPUs found!

Describe the bug
Cannot instantiate GPU FFT kernel! Error: GPUError: No working GPUs found!

To Reproduce
Steps to reproduce the behavior:

nvidia-smi -L
GPU 0: GeForce RTX 3080 (UUID: GPU-xxx)

clinfo -l
Platform #0: Clover
Platform #1: NVIDIA CUDA
 `-- Device #0: GeForce RTX 3080

export BELLMAN_CUSTOM_GPU="GeForce RTX 3080:8704" # UUID is also tried
BELLMAN_NO_GPU=0
BELLMAN_VERIFIER=auto
BELLMAN_CPU_UTILIZATION=0.875

# remove /tmp/bellman*.lock if any

lotus-bench sealing --sector-size=2KiB

Screenshots

2021-01-20T09:09:58.679 WARN bellperson::domain > Cannot instantiate GPU FFT kernel! Error: GPUError: No working GPUs found!
2021-01-20T09:10:00.339 INFO bellperson::gpu::locks > GPU is available for FFT!
2021-01-20T09:10:00.339 WARN bellperson::domain > Cannot instantiate GPU FFT kernel! Error: GPUError: No working GPUs found!
2021-01-20T09:10:01.865 INFO bellperson::gpu::locks > GPU is available for FFT!
2021-01-20T09:10:01.865 WARN bellperson::domain > Cannot instantiate GPU FFT kernel! Error: GPUError: No working GPUs found!
2021-01-20T09:10:03.307 INFO bellperson::gpu::locks > GPU is available for FFT!
2021-01-20T09:10:03.307 WARN bellperson::domain > Cannot instantiate GPU FFT kernel! Error: GPUError: No working GPUs found!
2021-01-20T09:10:04.806 INFO bellperson::gpu::locks > GPU is available for FFT!
2021-01-20T09:10:04.806 WARN bellperson::domain > Cannot instantiate GPU FFT kernel! Error: GPUError: No working GPUs found!
2021-01-20T09:10:06.512 INFO bellperson::gpu::locks > GPU is available for FFT!
2021-01-20T09:10:06.513 WARN bellperson::domain > Cannot instantiate GPU FFT kernel! Error: GPUError: No working GPUs found!

Bellperson version:
0.9.2

test_groth16_aggregation fails when run with 2 threads

Description

When running RAYON_NUM_THREADS=2 cargo test --release --tests test_groth16_aggregation -- --exact, it fails with:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: "SendError(..)"', /tmp/bellman/src/groth16/aggregate/accumulator.rs:152:41

Acceptance criteria

Test should successfully pass.

Risks + pitfalls

No risk with fixing it, but there is a more general risk that there are more parallelism bugs as we use Rayon in combination with sending messages.

Where to begin

Run the test.

Benchmark of GPU and CPU

Hi, I am running this library to explore how fast the GPU implementation is. Unfortunately, I found the GPU is 10 times slower than the CPU and I want to know why.
My CPU: Intel(R) Xeon(R) Gold 6145 CPU @ 2.00GHz, 80 cores, memory is 377G
My GPU: I have eight GTX 1080 cards on my server.

I compiled the code twice (once with the gpu feature and once without) and ran the resulting mimc binary; I get the following results:

Run on GPU (enable gpu feature):
Creating parameters...
Creating proofs...
test test_mimc ... test test_mimc has been running for over 60 seconds
Average proving time: 4.691798915s
Average verifying time: 0.164126917s
Batch verification of 50 proofs: 0.074657728s (0.00149316052s/proof)

Run on CPU:
Creating parameters...
Creating proofs...
Average proving time: 0.540332641s
Average verifying time: 0.194030763s
Batch verification of 50 proofs: 0.184567745s (0.00369135856s/proof)

Is that reasonable? If it is, I think the reason is that my CPU has too many cores? Or is it totally unreasonable...

OpenCL Error

System information

CPU : Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, 4 core 8 thread
GPU0: GeForce GTX 1080 Ti (UUID: GPU-c573168d-304b-8512-a6ed-20e8e1e4d132)
GPU1: GeForce GTX 1080 Ti (UUID: GPU-12fe064d-455a-5369-3a1e-2f1dde43ce7e)
Memory : 16GB

Reproduce issue

git clone https://github.com/filecoin-project/rust-fil-proofs.git
cd rust-fil-proofs/fil-proofs-tooling;

cargo build --release --bin benchy;

# Success
RUST_LOG=debug ./target/release/benchy election-post --size=1
# Error
RUST_LOG=debug ./target/release/benchy election-post --size=262144

Error Function:

Fn Path : crate::bellman::gpu::FFTKernel::create(...)
CodeLine: https://github.com/filecoin-project/bellman/blob/master/src/gpu/fft.rs#L32

Error detail:

################################ OPENCL ERROR ############################### 

Error executing function: clCreateContext  

Status error code: CL_OUT_OF_HOST_MEMORY (-6)  

Please visit the following url for more information: 

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateContext.html#errors  

############################################################################# 

[Feature request] An option to use single GPU for Multi-Exp

If a miner starts a winning post while it is calculating SNARK for windowed post, the winning post tends to fail due to the delays caused by the conflict. This issue gets more prominent as the miner's storage power increases.

By default, a multi-exp task takes up all available GPUs.
This makes it impossible to avoid the conflict described above because no matter how many GPUs are available, all of them get used up by a single task.

I suggest providing an option to force a multi-exp task to use only one GPU as it seems to be the simplest workaround.
Ex) MULTI_EXP_SINGLE_GPU=true

For reference:
https://filecoinproject.slack.com/archives/CEGB67XJ8/p1602589807190700

GPU name with colon not able to set with BELLMAN_CUSTOM_GPU

2022-01-11T22:24:02.140 INFO bellperson::gpu::fft > FFT: 3 working device(s) selected.
2022-01-11T22:24:02.140 INFO bellperson::gpu::fft > FFT: Device 0: gfx1011:xnack-
2022-01-11T22:24:02.140 INFO bellperson::gpu::fft > FFT: Device 1: gfx906:sramecc-:xnack-
2022-01-11T22:24:02.140 INFO bellperson::gpu::fft > FFT: Device 2: gfx906:sramecc-:xnack-
2022-01-11T22:24:02.140 INFO bellperson::domain > GPU FFT kernel instantiated!
2022-01-11T22:24:02.209 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2022-01-11T22:24:02.209 WARN bellperson::gpu::utils > Number of CUDA cores for your device (gfx1011:xnack-) is unknown! Best performance is only achieved when the number of CUDA cores is known! You can find the instructions on how to support custom GPUs here: https://lotu.sh/en+hardware-mining
2022-01-11T22:24:02.209 INFO bellperson::gpu::program > Using kernel on OpenCL.
2022-01-11T22:24:02.219 WARN bellperson::gpu::utils > Number of CUDA cores for your device (gfx906:sramecc-:xnack-) is unknown! Best performance is only achieved when the number of CUDA cores is known! You can find the instructions on how to support custom GPUs here: https://lotu.sh/en+hardware-mining
2022-01-11T22:24:02.219 INFO bellperson::gpu::program > Using kernel on OpenCL.
2022-01-11T22:24:02.228 WARN bellperson::gpu::utils > Number of CUDA cores for your device (gfx906:sramecc-:xnack-) is unknown! Best performance is only achieved when the number of CUDA cores is known! You can find the instructions on how to support custom GPUs here: https://lotu.sh/en+hardware-mining
2022-01-11T22:24:02.228 INFO bellperson::gpu::program > Using kernel on OpenCL.

"your device (gfx906:sramecc-:xnack-) is unknown!"

Setting gfx906:sramecc-:xnack- is not possible because the name already contains colons, for obvious reasons.
export BELLMAN_CUSTOM_GPU="gfx906:sramecc-:xnack-:3840"

- Rust panic: Invalid BELLMAN_CUSTOM_GPU!

Solution: add the ability to escape characters, or put in a check for an integer after the last colon.
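
A hypothetical sketch of that fix in plain Rust (not bellperson's actual parser): split on the last colon only, so device names that themselves contain colons still parse.

    // Split on the *last* colon so names like "gfx906:sramecc-:xnack-" survive.
    fn parse_custom_gpu(entry: &str) -> Option<(&str, usize)> {
        let (name, cores) = entry.rsplit_once(':')?;
        let cores = cores.trim().parse().ok()?;
        Some((name.trim(), cores))
    }

    fn main() {
        assert_eq!(
            parse_custom_gpu("gfx906:sramecc-:xnack-:3840"),
            Some(("gfx906:sramecc-:xnack-", 3840))
        );
        assert_eq!(
            parse_custom_gpu("GeForce RTX 2080 Ti:4352"),
            Some(("GeForce RTX 2080 Ti", 4352))
        );
    }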

Add coverage reports

To get a better idea of which code paths are currently tested, add code coverage reports to the CI.

Breaking change to RAYON_NUM_THREADS env var

Currently we have an environment variable called RAYON_NUM_THREADS which limits the number of threads used. bellperson uses roughly two times as many threads. The reason for the "two times" is that we have a Rayon thread pool and a separate thread pool for the CPU version of FFT and Multiexp.

Those FFT and Multiexp implementations are refactored into the ec-gpu library. To control the number of threads there, an environment variable called EC_GPU_NUM_THREADS is used.

Is it a problem if in future releases of bellperson the RAYON_NUM_THREADS env var would only control the Rayon thread pool and if you want to limit the threads used for CPU FFT and Multiexp, you'd additionally have to set the EC_GPU_NUM_THREADS variable?

It would be quite hard not to make this a breaking change. The idea would be to read the RAYON_NUM_THREADS variable, divide it by two, and set EC_GPU_NUM_THREADS to that number. Sadly this is not trivial, as the maximum number of threads used in ec-gpu is set only once (via OnceCell); hence the EC_GPU_NUM_THREADS variable is evaluated only once, when it is first accessed. But we don't know exactly when it will be accessed for the first time, so we would need to set the EC_GPU_NUM_THREADS env var in all possible code paths. That is pretty error prone.
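
Until this is settled, a minimal caller-side sketch of the workaround (assuming you control the binary): set both variables explicitly before any thread pool is initialized, instead of relying on one being derived from the other. The value 8 here is arbitrary.

    use std::env;

    fn main() {
        // Set both limits up front, before Rayon or ec-gpu build their pools.
        env::set_var("RAYON_NUM_THREADS", "8");
        env::set_var("EC_GPU_NUM_THREADS", "8");
        // ... initialize the prover / rest of the application afterwards ...
    }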

bellperson 0.18 fails to build with nvcc

Here is the function that fails to compile

KERNEL void radix_fft(GLOBAL Fr* x, // Source buffer
                      GLOBAL Fr* y, // Destination buffer
                      GLOBAL Fr* pq, // Precalculated twiddle factors
                      GLOBAL Fr* omegas, // [omega, omega^2, omega^4, ...]
                      LOCAL Fr* u_arg, // Local buffer to store intermediary values
                      uint n, // Number of elements
                      uint lgp, // Log2 of `p` (Read more in the link above)
                      uint deg, // 1=>radix2, 2=>radix4, 3=>radix8, ...
                      uint max_deg) // Maximum degree supported, according to `pq` and `omegas`
{

The 5th argument is marked as LOCAL, and in the case of CUDA, LOCAL is defined as __shared__; the nvcc compiler then reports an error that this argument cannot be __shared__.

Is there any workaround that can be done to it so I can compile it with CUDA?

Extend support for GPUs

Current list of requested GPUs

  • Tesla P100
  • Tesla V100
  • 1650 super
  • 1650
  • 1050ti

In addition, we should add an environment flag BELLMAN_SUPPORT_GPUS which allows specifying additional cards without having to recompile.

> BELLMAN_SUPPORT_GPUS="<card name>:<cuda cores>,<card name>:<cuda cores>"

It takes about 40-50s to finish winningPoST on RTX3090

Is it a compatibility issue with the GPU?

2021-04-06T08:26:35.232 INFO bellperson::gpu::fft > FFT: 1 working device(s) selected.
2021-04-06T08:26:35.232 INFO bellperson::gpu::fft > FFT: Device 0: GeForce RTX 3090
2021-04-06T08:26:35.232 INFO bellperson::domain > GPU FFT kernel instantiated!
2021-04-06T08:26:35.480 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2021-04-06T08:26:36.587+0800 INFO miner miner/miner.go:442 Time delta between now and our mining base: 6s (nulls: 0)
2021-04-06T08:26:57.181 INFO bellperson::gpu::multiexp > Multiexp: 1 working device(s) selected. (CPU utilization: 0)
2021-04-06T08:26:57.181 INFO bellperson::gpu::multiexp > Multiexp: Device 0: GeForce RTX 3090 (Chunk-size: 18061702)
2021-04-06T08:26:57.181 INFO bellperson::multiexp > GPU Multiexp kernel instantiated!
2021-04-06T08:26:58.224 INFO bellperson::groth16::prover > prover time: 44.201947641s
2021-04-06T08:26:58.227 INFO storage_proofs_core::compound_proof > snark_proof:finish
2021-04-06T08:26:58.227 INFO filecoin_proofs::api::post > generate_winning_post:finish
2021-04-06T08:26:58.229 INFO filcrypto::proofs::api > generate_winning_post: finish
2021-04-06T08:26:58.229+0800 INFO storageminer storage/miner.go:263 GenerateWinningPoSt took 46.249162605s
2021-04-06T08:26:58.229+0800 INFO miner miner/warmup.go:75 winning PoSt warmup successful {"took": 46.250996108}

Test Failures on GTX3070

When I run the tests with the command $ RUSTFLAGS="-C target-cpu=native" cargo test --release --all --features gpu

I get several failed tests, specifically:

test domain::tests::gpu_fft_consistency ... FAILED
and
test groth16::proof::test_with_bls12_381::serialization ... FAILED

Test stdout follows:

    Finished release [optimized] target(s) in 0.10s
     Running target/release/deps/bellperson-6774ccb95b1358ea

running 73 tests
test gadgets::boolean::test::test_allocated_bit ... ok
test gadgets::boolean::test::test_boolean_negation ... ok
test gadgets::boolean::test::test_and_not ... ok
test gadgets::boolean::test::test_and ... ok
test gadgets::lookup::test::test_synth ... ok
test gadgets::boolean::test::test_xor ... ok
test gadgets::num::test::test_allocated_num ... ok
test gadgets::num::test::test_num_conditional_reversal ... ok
[2021-09-21T04:48:55Z WARN  bellperson::multicore] BELLMAN_NUM_CPUS is deprecated, please switch to RAYON_NUM_THREADS
[2021-09-21T04:48:55Z WARN  bellperson::multicore] BELLMAN_NUM_CPUS is deprecated, please switch to RAYON_NUM_THREADS
test gadgets::num::test::test_num_squaring ... ok
test gadgets::num::test::test_num_multiplication ... ok
test gadgets::boolean::test::test_boolean_and ... ok
test gadgets::boolean::test::test_nor ... ok
test gadgets::boolean::test::test_alloc_conditionally ... ok
test gadgets::num::test::test_num_scale ... ok
test gadgets::num::test::test_num_nonzero ... ok
test gadgets::test::test_cs ... ok
test gadgets::boolean::test::test_boolean_xor ... ok
test gadgets::boolean::test::test_enforce_equal ... ok
test gadgets::uint32::test::test_uint32_rotr ... ok
test gadgets::boolean::test::test_boolean_sha256_ch ... ok
test gadgets::uint32::test::test_uint32_from_bits_be ... ok
test gadgets::uint32::test::test_uint32_from_bits ... ok
test gadgets::boolean::test::test_u64_into_boolean_vec_le ... ok
test gadgets::lookup::test::test_lookup3_xy_with_conditional_negation ... ok
test gadgets::boolean::test::test_boolean_sha256_maj ... ok
test gadgets::lookup::test::test_lookup3_xy ... ok
test groth16::proof::test_with_bls12_381::test_size ... ok
test gadgets::uint32::test::test_uint32_shr ... ok
test multicore::tests::test_read_num_cpus ... ok
test util_cs::test_cs::tests::test_compute_path ... ok
test groth16::aggregate::transcript::test::test_transcript ... ok
test util_cs::test_cs::tests::test_cs ... ok
test multicore::tests::test_log2_floor ... ok
test multiexp::tests::test_extend_density_regular ... ok
test tests::test_add_simplify ... ok
test multiexp::tests::test_extend_density_input ... ok
test gadgets::boolean::test::test_field_into_allocated_bits_le ... ok
test groth16::aggregate::proof::tests::test_proof_check ... ok
test groth16::prover::tests::test_proving_assignment_extend ... ok
test gadgets::uint32::test::test_uint32_addmany_constants ... ok
test domain::parallel_fft_consistency ... ok
test groth16::aggregate::proof::tests::test_proof_io ... ok
test gadgets::num::test::test_into_bits_strict ... ok
[2021-09-21T04:48:55Z INFO  bellperson::groth16::prover] Bellperson 0.16.3 is being used!
[2021-09-21T04:48:55Z INFO  bellperson::groth16::prover] starting proof timer
[2021-09-21T04:48:55Z INFO  bellperson::gpu::locks] GPU is available for FFT!
[2021-09-21T04:48:55Z DEBUG bellperson::gpu::locks] Acquiring GPU lock at "/tmp/bellman.gpu.lock" ...
[2021-09-21T04:48:55Z DEBUG bellperson::gpu::locks] GPU lock acquired!
test groth16::aggregate::accumulator::test::test_pairing_randomize ... ok
test groth16::aggregate::srs::test::test_srs_invalid_length ... ok
test domain::fft_composition ... ok
test groth16::aggregate::commit::tests::test_commit_single ... ok
test groth16::aggregate::commit::tests::test_commit_pair ... ok
test gadgets::blake2s::test::test_blake2s_constant_constraints ... ok
test gadgets::blake2s::test::test_blank_hash ... ok
[2021-09-21T04:48:55Z DEBUG rust_gpu_tools::opencl::utils] loaded devices: [Device { vendor: Nvidia, name: "NVIDIA GeForce RTX 3070", memory: 8367439872, pci_id: PciId(16640), uuid: Some(5dbeddfe-c81d-fc88-bdf7-b90e59dba3f4), device: Device { id: 139792729927488 } }]
[2021-09-21T04:48:55Z INFO  bellperson::gpu::utils] Device: Device { vendor: Nvidia, name: "NVIDIA GeForce RTX 3070", memory: 8367439872, pci_id: PciId(16640), uuid: Some(5dbeddfe-c81d-fc88-bdf7-b90e59dba3f4), device: Device { id: 139792729927488 } }
[2021-09-21T04:48:55Z INFO  bellperson::gpu::utils] Device: Device { vendor: Nvidia, name: "NVIDIA GeForce RTX 3070", memory: 8367439872, pci_id: PciId(16640), uuid: Some(5dbeddfe-c81d-fc88-bdf7-b90e59dba3f4), device: Device { id: 139792729927488 } }
[2021-09-21T04:48:55Z DEBUG bellperson::gpu::locks] Acquiring GPU lock at "/tmp/bellman.gpu.lock" ...
[2021-09-21T04:48:55Z INFO  bellperson::gpu::utils] Device: Device { vendor: Nvidia, name: "NVIDIA GeForce RTX 3070", memory: 8367439872, pci_id: PciId(16640), uuid: Some(5dbeddfe-c81d-fc88-bdf7-b90e59dba3f4), device: Device { id: 139792729927488 } }
test gpu::utils::test_list_devices ... ok
test gadgets::sha256::test::test_blank_hash ... ok
[2021-09-21T04:48:55Z INFO  bellperson::gpu::locks] GPU is available for Multiexp!
[2021-09-21T04:48:55Z DEBUG bellperson::gpu::locks] Acquiring GPU lock at "/tmp/bellman.gpu.lock" ...
test gadgets::uint32::test::test_uint32_sha256_maj ... ok
test gadgets::uint32::test::test_uint32_sha256_ch ... ok
test gadgets::uint32::test::test_uint32_addmany ... ok
test gadgets::uint32::test::test_uint32_xor ... ok
test gadgets::blake2s::test::test_blake2s_constraints ... ok
test gadgets::blake2s::test::test_blake2s_precomp_constraints ... ok
test gadgets::sha256::test::test_full_block ... ok
[2021-09-21T04:48:55Z INFO  bellperson::gpu::fft] FFT: 1 working device(s) selected.
[2021-09-21T04:48:55Z INFO  bellperson::gpu::fft] FFT: Device 0: NVIDIA GeForce RTX 3070
[2021-09-21T04:48:55Z INFO  bellperson::domain] GPU FFT kernel instantiated!
[2021-09-21T04:48:55Z DEBUG bellperson::gpu::locks] GPU lock released!
[2021-09-21T04:48:55Z DEBUG bellperson::gpu::locks] GPU lock acquired!
[2021-09-21T04:48:55Z INFO  bellperson::gpu::locks] GPU is available for Multiexp!
[2021-09-21T04:48:55Z DEBUG bellperson::gpu::locks] Acquiring GPU lock at "/tmp/bellman.gpu.lock" ...
[2021-09-21T04:48:56Z INFO  bellperson::gpu::fft] FFT: 1 working device(s) selected.
[2021-09-21T04:48:56Z INFO  bellperson::gpu::fft] FFT: Device 0: NVIDIA GeForce RTX 3070
[2021-09-21T04:48:56Z DEBUG bellperson::gpu::locks] GPU lock acquired!
[2021-09-21T04:48:56Z DEBUG bellperson::gpu::locks] GPU lock released!
test domain::tests::gpu_fft_consistency ... FAILED
[2021-09-21T04:48:56Z WARN  bellperson::gpu::utils] Number of CUDA cores for your device (NVIDIA GeForce RTX 3070) is unknown! Best performance is only achieved when the number of CUDA cores is known! You can find the instructions on how to support custom GPUs here: https://lotu.sh/en+hardware-mining
[2021-09-21T04:48:56Z INFO  bellperson::gpu::multiexp] Multiexp: 1 working device(s) selected. (CPU utilization: 0)
[2021-09-21T04:48:56Z INFO  bellperson::gpu::multiexp] Multiexp: Device 0: NVIDIA GeForce RTX 3070 (Chunk-size: 4405294)
[2021-09-21T04:48:56Z INFO  bellperson::multiexp] GPU Multiexp kernel instantiated!
test groth16::multiscalar::tests::test_multiscalar_par ... ok
test groth16::multiscalar::tests::test_multiscalar_single ... ok
test gadgets::blake2s::test::test_blake2s_256_vars ... ok
[2021-09-21T04:48:57Z DEBUG bellperson::gpu::locks] GPU lock released!
[2021-09-21T04:48:57Z DEBUG bellperson::gpu::locks] GPU lock acquired!
[2021-09-21T04:48:57Z WARN  bellperson::gpu::utils] Number of CUDA cores for your device (NVIDIA GeForce RTX 3070) is unknown! Best performance is only achieved when the number of CUDA cores is known! You can find the instructions on how to support custom GPUs here: https://lotu.sh/en+hardware-mining
test multiexp::gpu_multiexp_consistency ... ok
[2021-09-21T04:48:57Z INFO  bellperson::gpu::multiexp] Multiexp: 1 working device(s) selected. (CPU utilization: 0)
[2021-09-21T04:48:57Z INFO  bellperson::gpu::multiexp] Multiexp: Device 0: NVIDIA GeForce RTX 3070 (Chunk-size: 4405294)
[2021-09-21T04:48:57Z INFO  bellperson::multiexp] GPU Multiexp kernel instantiated!
[2021-09-21T04:48:57Z DEBUG bellperson::gpu::locks] GPU lock released!
[2021-09-21T04:48:57Z INFO  bellperson::groth16::prover] prover time: 1.984064297s
test groth16::proof::test_with_bls12_381::serialization ... FAILED
test gadgets::blake2s::test::test_blake2s_700_vars ... ok
test gadgets::multipack::test_multipacking ... ok
test multiexp::test_with_bls12 ... ok
test gadgets::blake2s::test::test_blake2s_test_vectors ... ok
test gadgets::num::test::test_into_bits ... ok
test domain::polynomial_arith ... ok
test gadgets::blake2s::test::test_blake2s ... ok
test gadgets::sha256::test::test_against_vectors ... ok

failures:

---- domain::tests::gpu_fft_consistency stdout ----
Testing FFT for 2 elements...
GPU took 0ms.
CPU (64 cores) took 0ms.
Speedup: xNaN
thread 'domain::tests::gpu_fft_consistency' panicked at 'assertion failed: v1.coeffs == v2.coeffs', src/domain.rs:618:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- groth16::proof::test_with_bls12_381::serialization stdout ----
thread 'groth16::proof::test_with_bls12_381::serialization' panicked at 'assertion failed: verify_proof(&pvk, &proof, &[c]).unwrap()', src/groth16/proof.rs:326:13


failures:
    domain::tests::gpu_fft_consistency
    groth16::proof::test_with_bls12_381::serialization

test result: FAILED. 71 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 13.31s

error: test failed, to rerun pass '-p bellperson --lib'

Make coverage reports viewable

Currently coverage reports are not uploaded to Codecov. We should look into alternatives. LLVM creates nice-looking static HTML reports; we should be able to either publish them as a build artifact on CircleCI (I'm not sure if you can view them there directly) or put them on IPFS.

An alternative could be other code coverage services like https://coveralls.io/.

do windowpost err

CPU: AMD 7402
MEM: 512G
GPU: 3080 * 2
Power: ≈10P
This error occurs occasionally when doing windowpost, about once a day.
I don't know what happened.

GPU FFT failed! Error: Ocl Error on GeForce RTX 3080

original issue by @nickboot

Describe the bug
GeForce RTX 3080, GPU FFT failed! Falling back to CPU

To Reproduce
Steps to reproduce the behavior:

  1. nvidia-smi -L
    GPU 0: GeForce RTX 3080
  2. export BELLMAN_CUSTOM_GPU="GeForce RTX 3080:8704"
  3. lotus-bench sealing --storage-dir=/benchtmp --sector-size=32GiB --num-sectors=2 --parallel=2

Screenshots
2020-09-22T03:20:30.079 INFO filecoin_proofs::api::seal > seal_commit_phase2:start
2020-09-22T03:20:30.079 INFO filecoin_proofs::caches > trying parameters memory cache for: STACKED[34359738368]
2020-09-22T03:20:30.079 INFO filecoin_proofs::caches > no params in memory cache for STACKED[34359738368]
2020-09-22T03:20:32.249 INFO filecoin_proofs::api::seal > got groth params (34359738368) while sealing
2020-09-22T03:20:32.249 INFO filecoin_proofs::api::seal > snark_proof:start
2020-09-22T03:20:32.259 INFO bellperson::groth16::prover > Bellperson 0.9.2 is being used!
2020-09-22T03:22:32.470 TRACE storage_proofs_porep::stacked::vanilla::proof > processing config 1/8 with column nodes 217728
2020-09-22T03:22:46.683 TRACE storage_proofs_porep::stacked::vanilla::proof > base data len 134217728, tree data len 19173961
2020-09-22T03:22:46.683 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 1/8 of length 153391689
2020-09-22T03:22:46.683 TRACE storage_proofs_porep::stacked::vanilla::proof > flattening tree_c base data of 134217728 nodes using batch size 262144
2020-09-22T03:22:48.782 TRACE storage_proofs_porep::stacked::vanilla::proof > done flattening tree_c base data
2020-09-22T03:22:48.782 TRACE storage_proofs_porep::stacked::vanilla::proof > flattening tree_c tree data of 19173961 nodes using batch size 262144 and base offset 134217728
2020-09-22T03:22:49.113 TRACE storage_proofs_porep::stacked::vanilla::proof > done flattening tree_c tree data
2020-09-22T03:22:49.113 TRACE storage_proofs_porep::stacked::vanilla::proof > writing tree_c store data
2020-09-22T03:22:49.652 TRACE storage_proofs_porep::stacked::vanilla::proof > done writing tree_c store data
2020-09-22T03:26:37.792 INFO bellperson::gpu::locks > GPU is available for FFT!
2020-09-22T03:26:37.798 DEBUG bellperson::gpu::locks > Acquiring GPU lock...
2020-09-22T03:26:37.798 DEBUG bellperson::gpu::locks > GPU lock acquired!
2020-09-22T03:26:41.586 INFO bellperson::gpu::fft > FFT: 1 working device(s) selected.
2020-09-22T03:26:41.586 INFO bellperson::gpu::fft > FFT: Device 0: GeForce RTX 3080
2020-09-22T03:26:41.586 INFO bellperson::domain > GPU FFT kernel instantiated!
2020-09-22T03:27:05.456 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueNDRangeKernel("radix_fft")

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors

#############################################################################

2020-09-22T03:27:53.186 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueNDRangeKernel("radix_fft")

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors

#############################################################################
Version (run lotus version):
lotus-bench version 0.7.1

Dedicated GPU for WinningPoSt

If a system has more than one GPU, there could be a setting to dedicate one GPU to running WinningPoSts only. If no such proof is running, that GPU stays idle; only the other GPUs are used to run other proofs.

This idea is based on discussions with @cryptonemo and @DrPeterVanNostrand and was triggered by #246.
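
As a purely hypothetical illustration of such a setting (the env var name and the device handling below are invented for this sketch and are not an existing bellperson feature), one device could be reserved by index for WinningPoSt while the rest stay available for other proofs:

```rust
use std::env;

// Hypothetical sketch only: BELLMAN_WINNING_POST_GPU is an invented variable
// holding the index of the device reserved for WinningPoSt. All remaining
// devices are returned for use by other proofs.
fn split_devices(all: Vec<String>) -> (Option<String>, Vec<String>) {
    let reserved_idx = env::var("BELLMAN_WINNING_POST_GPU")
        .ok()
        .and_then(|v| v.parse::<usize>().ok());

    match reserved_idx {
        Some(i) if i < all.len() => {
            let mut others = all;
            let reserved = others.remove(i);
            (Some(reserved), others)
        }
        _ => (None, all),
    }
}

fn main() {
    let devices = vec![
        "GeForce RTX 2080 Ti".to_string(),
        "GeForce GTX 1080 Ti".to_string(),
    ];
    let (winning_post, others) = split_devices(devices);
    println!("WinningPoSt device: {:?}", winning_post);
    println!("devices for other proofs: {:?}", others);
}
```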

I have a question about MappedParameters

git log: 33e5d2a

src/groth16/prover.rs line 380

    // snip
    // line 380
    let h_s = a_s
        .into_iter()
        .map(|a| {
            let h = multiexp(
                &worker,
                params.get_h(a.len())?,            //   I found that a.len() is not really used here
                FullDensity,
                a,
                &mut multiexp_kern,
            );
            Ok(h)
        })
        .collect::<Result<Vec<_>, SynthesisError>>()?;

src/groth16/mapped_params.rs line 59

    fn get_h(&self, _num_h: usize) -> Result<Self::G1Builder, SynthesisError> {
        let builder = self
            .h
            .iter()
            .cloned()
            .map(|h| read_g1::<E>(&self.params, h, self.checked))
            .collect::<Result<_, _>>()?;

        Ok((Arc::new(builder), 0))
    }

Does this mean that the logic of get_h could be optimized, so that a new builder doesn't have to be created on every call?
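
For illustration only (the types and names below are simplified stand-ins, not bellperson's actual API), one way to avoid rebuilding the source on every call would be to cache it behind an Arc and hand out cheap clones:

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical sketch of the optimization hinted at above: build the data once
// and clone the Arc on every later call instead of re-reading all points.
struct Params {
    h_raw: Vec<u64>,                    // stand-in for the mmapped G1 data
    h_builder: OnceLock<Arc<Vec<u64>>>, // lazily built, shared across calls
}

impl Params {
    fn get_h(&self) -> Arc<Vec<u64>> {
        Arc::clone(self.h_builder.get_or_init(|| {
            // The expensive part (deserializing every point) runs only once.
            Arc::new(self.h_raw.iter().map(|p| p * 2).collect())
        }))
    }
}

fn main() {
    let params = Params {
        h_raw: vec![1, 2, 3],
        h_builder: OnceLock::new(),
    };
    let first = params.get_h();
    let second = params.get_h();
    // The second call reuses the cached builder rather than rebuilding it.
    assert!(Arc::ptr_eq(&first, &second));
}
```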

radix_fft failure on RTX2070 Super

When running the lotus benchmarks, as well as in the Committing section, the log gets filled with these errors:

2020-07-02T11:17:08.900 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error: 

################################ OPENCL ERROR ############################### 

Error executing function: clEnqueueNDRangeKernel("radix_fft")  

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)  

Please visit the following url for more information: 

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors  

############################################################################# 

Performance also seems to be really bad (22 minutes on a 2080 Ti, 2 hours 10 minutes on a 2070 Super).

v0.9.3 build error

~/Workspace/RustProject/bellman$ cargo build
Compiling bellperson v0.9.3 (/mnt/e/Workspace/RustProject/bellman)
error[E0309]: the parameter type F may not live long enough
--> src/multicore.rs:68:25
|
57 | pub fn scope<'a, F, R>(&self, elements: usize, f: F) -> R
| - help: consider adding an explicit lifetime bound F: 'a...
...
68 | THREAD_POOL.scope(|scope| f(scope, chunk_size))
| ^^^^^
|
note: ...so that the type [closure@src/multicore.rs:68:31: 68:59 f:F, chunk_size:&usize] will meet its required lifetime bounds
--> src/multicore.rs:68:25
|
68 | THREAD_POOL.scope(|scope| f(scope, chunk_size))
| ^^^^^

error: aborting due to previous error

For more information about this error, try rustc --explain E0309.
error: could not compile bellperson.

To learn more, run the command again with --verbose.
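
The help message points at the usual fix for E0309: adding an explicit lifetime bound on the generic parameter. A minimal, self-contained sketch of that pattern (deliberately unrelated to bellperson's actual code):

```rust
// A generic parameter stored behind a named lifetime 'a needs an explicit
// `+ 'a` bound; removing it from the where clause below reproduces the same
// class of error ("the parameter type `F` may not live long enough").
fn boxed<'a, F>(f: F) -> Box<dyn Fn() -> u32 + 'a>
where
    F: Fn() -> u32 + 'a, // the explicit lifetime bound rustc asks for
{
    Box::new(f)
}

fn main() {
    let x = 7u32;
    let add_one = move || x + 1;
    let f = boxed(add_one);
    println!("{}", f());
}
```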

Add CUDA support

It should be possible to run all the bellperson code on either the CPU, on OpenCL, or on CUDA. CUDA support is currently missing. Once implemented, it should be possible to enable OpenCL and/or CUDA at both compile time and run time.

This is a tracking issue for all the work that is needed.

  • #208
  • #209
  • There is an environment variable called BELLMAN_GPU_FRAMEWORK to select whether OpenCL or CUDA should be used at runtime
  • There are two feature flags, opencl and cuda, to compile with either or both (see the sketch after this list).
  • (nice to have) Have a single source file for both OpenCL and CUDA: #210
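
As a rough sketch of how the two mechanisms could fit together (the function names and module layout are invented here; only the feature names and the env var come from the list above):

```rust
// Compile-time selection via Cargo features, narrowed at runtime by
// BELLMAN_GPU_FRAMEWORK when both backends were compiled in.
#[cfg(feature = "cuda")]
fn default_framework() -> &'static str {
    "cuda"
}

#[cfg(all(feature = "opencl", not(feature = "cuda")))]
fn default_framework() -> &'static str {
    "opencl"
}

#[cfg(not(any(feature = "cuda", feature = "opencl")))]
fn default_framework() -> &'static str {
    "cpu"
}

fn main() {
    let framework = std::env::var("BELLMAN_GPU_FRAMEWORK")
        .unwrap_or_else(|_| default_framework().to_string());
    println!("selected GPU framework: {framework}");
}
```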

BELLMAN 0.10.0

error[E0432]: unresolved import bellperson::GPU_NVIDIA_DEVICES
--> src/util/api.rs:6:5
|
6 | use bellperson::GPU_NVIDIA_DEVICES;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ no GPU_NVIDIA_DEVICES in the root

error: aborting due to previous error

For more information about this error, try rustc --explain E0432.
error: could not compile filcrypto.

To learn more, run the command again with --verbose.

  • rm -f /tmp/tmp.PI50JUvXrk
    make[1]: *** [Makefile:11: .install-filcrypto] Error 101
    make[1]: Leaving directory '/lotus/extern/filecoin-ffi'
    make: *** [Makefile:37: build/.filecoin-install] Error 2

Refactor aggregation failure handling

This issue is triggered by #197.

@nikkolasg and I spent a lot of time debugging this and finding out whether there is a deeper root issue or not. It took so long because the code isn't really ideal. The problem is the case where an aggregation turns out to be invalid: the idea is an "early termination", triggered through a call to invalidate(). That call sets the valid variable to false, which then exits the aggregation thread.

I want to present both of our views on what the underlying issue is:

  • @nikkolasg prefers following the Go principle that the thread that spawns a child should also be responsible for terminating it. That is kind of the case here, but it's so hidden that it is almost invisible.
  • @vmx thinks (having learnt message passing in Erlang) that the problem is the valid variable, which is effectively global state that is manipulated somewhere hidden, with large consequences. I would hope the code can be changed so that the aggregation thread, as well as the sending threads, are terminated through message passing (see the sketch below).

This means that this issue doesn't really propose a proper solution; it is rather a placeholder in the hope that someone finds the time to look deeper into this in the future.
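
To make the message-passing idea concrete, here is a minimal sketch (deliberately unrelated to the real aggregation code) of terminating a worker through an explicit message instead of a shared flag:

```rust
use std::sync::mpsc;
use std::thread;

// Instead of a shared `valid` flag flipped from somewhere hidden, the spawning
// thread sends an explicit Cancel message and the worker terminates on receipt.
enum Msg {
    Work(u64),
    Cancel,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Msg>();

    let worker = thread::spawn(move || {
        for msg in rx {
            match msg {
                Msg::Work(n) => println!("aggregating chunk {n}"),
                Msg::Cancel => {
                    println!("aggregation turned out invalid, terminating early");
                    break;
                }
            }
        }
    });

    tx.send(Msg::Work(1)).unwrap();
    tx.send(Msg::Cancel).unwrap(); // early termination without global state
    worker.join().unwrap();
}
```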

Restrict gpu features to machines with a supported GPU

This is more of a proposal.

Currently, if you compile bellperson with the gpu feature enabled, there is a fallback to using the CPU instead. Though even today that doesn't always work. For example, if I run cargo test --no-default-features --features blst,gpu locally (I don't have a supported GPU) I get

failures:

---- domain::tests::gpu_fft_consistency stdout ----
thread 'domain::tests::gpu_fft_consistency' panicked at 'Cannot initialize kernel!: Simple("No working GPUs found!")', src/domain.rs:597:54
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    domain::tests::gpu_fft_consistency

Hence I propose that running this library with the gpu feature enabled is simply not supported if you don't have a supported GPU. This should reduce the maintenance work.

To make this more explicit, I propose adding an early error (e.g. when trying to access the GPU for the first time) that clearly states that you need a supported GPU.
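
A minimal sketch of what such an early error might look like (the function name and wording below are invented for illustration, not an existing API):

```rust
// Fail loudly on the first GPU access instead of silently falling back to the CPU.
fn require_supported_gpu(devices: &[String]) -> Result<(), String> {
    if devices.is_empty() {
        return Err(
            "bellperson was compiled with the `gpu` feature, but no supported GPU \
             was found; install a supported GPU or rebuild without the feature"
                .to_string(),
        );
    }
    Ok(())
}

fn main() {
    // In the real library the device list would come from the GPU framework;
    // an empty list simulates a machine without a supported GPU.
    let devices: Vec<String> = Vec::new();
    if let Err(err) = require_supported_gpu(&devices) {
        eprintln!("{err}");
    }
}
```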

Review Worker code again

As part of my work of moving FFT and Multiexp into the ec-gpu project, I was reviewing the Worker code again.

The current Worker::scoped() code seems to do what the comment says, though it seems to needlessly use a message-passing primitive.

It calls THREAD_POOL.scoped(), where THREAD_POOL is a yastl pool, hence it's a call to Pool::scoped(), which according to its docs already waits for the execution to finish before it returns. This means the additional message passing shouldn't be needed.

So we either change the call into the pool to Pool::spawn(), which wouldn't wait for the execution to finish, or we remove the message passing. As we want it to block, I suggest we keep relying on Pool::scoped() and simply drop the extra channel (see the sketch below).
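
To illustrate the point (using std's scoped threads rather than yastl, so this is only a sketch of the behaviour, not the Worker code): a scoped call already blocks until every spawned task has finished, so a channel used purely to signal completion adds nothing:

```rust
use std::thread;

fn main() {
    let data = vec![1u64, 2, 3, 4];
    let mut results = vec![0u64; data.len()];

    thread::scope(|s| {
        for (src, dst) in data.iter().zip(results.iter_mut()) {
            s.spawn(move || {
                *dst = src * src;
            });
        }
        // No channel needed: `scope` returns only after all spawned threads join.
    });

    println!("{results:?}");
}
```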

Ability to run on Nvidia and AMD cards at the same time

Background

While participating in the Spacerace I got the idea to use the GPUs I have for ETH mining to accelerate FIL commitments. I started experimenting with this repo using a 2080 Ti and an RX 480 at the same time. Unfortunately, it turned out that they only work separately. Finally, I saw a small note in README_AMD saying that there was a failed attempt to do so, but without any further information, and it seems that no one ever tried to fix it. Hence, let this issue track the feature.

Test run results

BELLMAN_CUSTOM_GPU='GeForce RTX 2080 Ti:4352, gfx803:2304'
BELLMAN_CPU_UTILIZATION=0
INFO  bellperson::gpu::locks GPU is available for Multiexp!
INFO  bellperson::gpu::utils Adding "GeForce RTX 2080 Ti" to GPU list with 4352 CUDA cores.
INFO  bellperson::gpu::utils Adding "gfx803" to GPU list with 2304 CUDA cores.
INFO  bellperson::gpu::multiexp Multiexp: 1 working device(s) selected. (CPU utilization: 0)
INFO  bellperson::gpu::multiexp Multiexp: Device 0: gfx803 (Chunk-size: 3964764)
INFO  bellperson::multiexp GPU Multiexp kernel instantiated!

As you can see from the logs above, Bellman successfully detected both GPUs but selected only one device😞

BELLMAN_NO_GPU=0 makes problems

If BELLMAN_NO_GPU=0 is set, only PC2 uses the GPU; C2 and WindowPoSt use the CPU instead.
You must remove the BELLMAN_NO_GPU env var entirely to be able to use the GPU for C2/WindowPoSt.
Tested using an RTX 3090 GPU.
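
This behaviour would be consistent with the variable being checked only for presence rather than for its value; a minimal sketch of that distinction (not the library's actual code):

```rust
// If only presence is tested, BELLMAN_NO_GPU=0 still disables the GPU,
// which matches the behaviour described above.
fn disabled_by_presence(var: Option<&str>) -> bool {
    var.is_some()
}

fn disabled_by_value(var: Option<&str>) -> bool {
    matches!(var, Some("1") | Some("true") | Some("yes"))
}

fn main() {
    let var = Some("0"); // i.e. BELLMAN_NO_GPU=0
    println!("presence check disables the GPU: {}", disabled_by_presence(var)); // true
    println!("value check disables the GPU:    {}", disabled_by_value(var)); // false
}
```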

Error executing function: clBuildProgram (CL_NV_INVALID_MEM_ACCESS)

There are warnings printed when running with the gpu feature enabled on CI. It seems the GPU is still used for Groth16. The errors come from the domain module. Here's an example run: https://app.circleci.com/pipelines/github/filecoin-project/bellperson/802/workflows/717fea32-85d2-40d8-bde8-4abc58107759/jobs/3452/parallel-runs/0/steps/0-108

[2020-12-16T13:01:03Z WARN  bellperson::domain] Cannot instantiate GPU FFT kernel! Error: OpenCL Error: Ocl Error: 
    
    ################################ OPENCL ERROR ############################### 
    
    Error executing function: clBuildProgram  
    
    Status error code: CL_NV_INVALID_MEM_ACCESS (-9999)  
    
    Please visit the following url for more information: 
    
    https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clBuildProgram.html#errors  
    
    ############################################################################# 

Test Failures on GTX2080Ti

# RUSTFLAGS="-C target-cpu=native" cargo test --release --all --features cuda,opencl
   Compiling rust-gpu-tools v0.5.0
   Compiling bellperson v0.17.0 (/opt/lotus/gitlab/office_v1.13/bellperson)
error: failed to run custom build command for `bellperson v0.17.0 (/opt/lotus/gitlab/office_v1.13/bellperson)`

Caused by:
  process didn't exit successfully: `/opt/lotus/gitlab/office_v1.13/bellperson/target/release/build/bellperson-ff553e61c49eb35e/build-script-build` (exit code: 101)
  --- stderr
  nvcc fatal   : Value 'sm_86' is not defined for option 'gpu-architecture'
  thread 'main' panicked at 'nvcc failed. See the kernel source at /opt/lotus/gitlab/office_v1.13/bellperson/target/release/build/bellperson-1487cf4f8f43ca05/out/aff002a9c65e25912927bcd50f01a0316db06390f6cf242ea369ba6ab96b5a06.cu', build.rs:68:13
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
error: build failed

# export BELLMAN_CUDA_NVCC_ARGS="--fatbin --gpu-architecture=sm_75 --generate-code=arch=compute_75,code=sm_75"
# RUSTFLAGS="-C target-cpu=native" cargo test --release --all --features cuda,opencl
Compiling bellperson v0.17.0 (/opt/lotus/gitlab/office_v1.13/bellperson)
error: failed to run custom build command for `bellperson v0.17.0 (/opt/lotus/gitlab/office_v1.13/bellperson)`

Caused by:
 process didn't exit successfully: `/opt/lotus/gitlab/office_v1.13/bellperson/target/release/build/bellperson-ff553e61c49eb35e/build-script-build` (exit code: 101)
 --- stderr
 /opt/lotus/gitlab/office_v1.13/bellperson/target/release/build/bellperson-1487cf4f8f43ca05/out/62592a35546d3e7b215cb0b2a0748258c7037e28172d43d8dd7d1c61284137da.cu(1404): error: attribute "__shared__" does not apply here

 1 error detected in the compilation of "/mnt/md0/tmp/tmpxft_00003c39_00000000-6_62592a35546d3e7b215cb0b2a0748258c7037e28172d43d8dd7d1c61284137da.cpp1.ii".
 thread 'main' panicked at 'nvcc failed. See the kernel source at /opt/lotus/gitlab/office_v1.13/bellperson/target/release/build/bellperson-1487cf4f8f43ca05/out/62592a35546d3e7b215cb0b2a0748258c7037e28172d43d8dd7d1c61284137da.cu', build.rs:68:13
 stack backtrace:
    0: rust_begin_unwind
              at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
    1: std::panicking::begin_panic_fmt
              at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
    2: build_script_build::main
    3: core::ops::function::FnOnce::call_once
 note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
# nvidia-smi
Fri Oct  8 19:52:14 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.67       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   40C    P0    57W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:21:00.0 Off |                  N/A |
| 24%   35C    P0    38W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Benchmark FFT and multiexp

1. Is there any official benchmark code for FFT on GPU?

There is a one-line command for benchmarking multiexp on the GPU:

RUST_LOG=info cargo test --features gpu -- --exact multiexp::gpu_multiexp_consistency --nocapture

It shows that the GPU is around 120x faster than the CPU version when tested on a 3090 GPU.

Where can I find a similar one-line command for benchmarking FFT on the GPU?

2. What is the maximum number of constraints supported?

As mentioned in the README, there is a certain upper bound on proof size (i.e., the number of constraints).

Depending on the size of the proof being passed to the gpu for work, certain cards will not be able to allocate enough memory to either the FFT or Multiexp kernel.

While this number certainly depends on the GPU card, is there any rough number (e.g., $2^{20}$ constraints) for a 10GB GPU? (A rough estimate is sketched after these questions.)

3. Multi-GPU for FFT and multiexp

Currently, could multiple GPUs be used to support a larger size when running out of memory?
It seems the current implementation uses each GPU independently for many small-size FFTs and multiexps, according to fft.cl and multiexp.cl.

Thanks!
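
Regarding the rough number asked about in question 2, here is a back-of-the-envelope sketch based purely on my own assumptions (32-byte BLS12-381 scalars and roughly two FFT buffers resident on the device), so treat the result as an order-of-magnitude guess rather than a measured limit:

```rust
// Rough estimate only: divide the card's memory by the assumed per-element cost.
fn main() {
    let bytes_per_element: u64 = 32; // BLS12-381 Fr
    let buffers: u64 = 2; // assumed input + scratch buffer
    let gpu_bytes: u64 = 10 * 1024 * 1024 * 1024; // a 10GB card

    let max_elements = gpu_bytes / (bytes_per_element * buffers);
    let log2 = 63 - max_elements.leading_zeros();
    println!("rough upper bound: about 2^{log2} elements"); // prints 2^27
}
```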

System disk usage high when running groth16 solver

I'm benchmarking a lot to see what the fastest setup to run a miner would be, but unfortunately this part of the benchmark uses a disk I don't want it to use, and I'm not sure how to control it.

This part of the benchmark:

2020-07-23T14:23:14.735 INFO filecoin_proofs::api::seal > got groth params (34359738368) while sealing
2020-07-23T14:23:14.735 INFO filecoin_proofs::api::seal > snark_proof:start
2020-07-23T14:23:14.753 INFO bellperson::groth16::prover > Bellperson 0.9.2 is being used!

RAM slowly climbs to 128GB with 50% of all cores at 100%.

Once it reaches 128GB, core usage drops everywhere, and I can see a ton of writes on my OS disk (sdb).

Then after a few minutes it moves on to:

2020-07-23T14:38:34.473 INFO bellperson::gpu::locks > GPU is available for FFT!
2020-07-23T14:38:40.023 INFO bellperson::gpu::fft > FFT: 1 working device(s) selected.
2020-07-23T14:38:40.025 INFO bellperson::gpu::fft > FFT: Device 0: GeForce GTX 1080 Ti
2020-07-23T14:38:40.025 INFO bellperson::domain > GPU FFT kernel instantiated!

and I see a ton of reads on my OS disk (sdb):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,1%    0,0%    0,8%    3,9%    0,0%   95,2%

      tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd Device
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k loop0
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k loop1
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k loop2
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k loop3
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k loop4
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k loop5
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k loop6
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k loop7
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k loop8
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k loop9
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k nvme0n1
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k sda
  5375,00        21,0M         0,0k         0,0k      21,0M       0,0k       0,0k sdb
     0,00         0,0k         0,0k         0,0k       0,0k       0,0k       0,0k sdc

The command I used to run the benchmark was:

env RUST_LOG=info FIL_PROOFS_USE_GPU_COLUMN_BUILDER=1 FIL_PROOFS_USE_GPU_TREE_BUILDER=1 FIL_PROOFS_MAXIMIZE_CACHING=1 FIL_PROOFS_PARENT_CACHE=/mnt/nvme_2tb/filecoin-parents/ FIL_PROOFS_PARAMETER_CACHE=/mnt/nvme_2tb/filecoin-proof-parameters TMPDIR=/mnt/nvme_2tb ./lotus-bench sealing --storage-dir /mnt/nvme_2tb/bench --sector-size 32GiB --save-commit2-input ~/commit2_nvme_64_1080.json 2>&1 | tee bench.log

nvme0n1 is mounted on /mnt/nvme_2tb, and the different parts of the benchmark that use that disk are reflected in iostat.

So, my questions are:

  1. What is happening on the OS disk?
  2. How can I move it to another disk?

CL_INVALID_PROGRAM_EXECUTABLE

When I used a GeForce GTX 1660 SUPER to run sealing, the log printed this error:

2020-07-30T15:31:11.818 INFO filecoin_proofs::api::seal > snark_proof:start
2020-07-30T15:31:11.819 INFO bellperson::groth16::prover > Bellperson 0.9.2 is being used!
2020-07-30T15:31:18.113 INFO bellperson::gpu::locks > GPU is available for FFT!
2020-07-30T15:31:18.750 INFO bellperson::gpu::fft > FFT: 1 working device(s) selected.
2020-07-30T15:31:18.751 INFO bellperson::gpu::fft > FFT: Device 0: GeForce GTX 1660 SUPER
2020-07-30T15:31:18.751 INFO bellperson::domain > GPU FFT kernel instantiated!
2020-07-30T15:31:18.792 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clCreateKernel("radix_fft")

Status error code: CL_INVALID_PROGRAM_EXECUTABLE (-45)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateKernel.html#errors

#############################################################################

Tag master as new version so filecoin-ffi can use AMD cards

I've just tried to build lotus myself using the updated bellman repository, but this is the first time I'm touching Rust, as you can see:

  Compiling filcrypto v0.7.4 (/home/filecoin/lotus/extern/filecoin-ffi/rust)
error[E0599]: no method named `name` found for enum `std::result::Result<fil_ocl::standard::platform::Platform, bellperson::gpu::error::GPUError>` in the current scope
  --> src/util/api.rs:41:54
   |
41 |         log::info!("Platform selected: {}", platform.name());
   |                                                      ^^^^ method not found in `std::result::Result<fil_ocl::standard::platform::Platform, bellperson::gpu::error::GPUError>`

error[E0308]: mismatched types
  --> src/util/api.rs:43:51
   |
43 |         let devicelist: Vec<Device> = get_devices(&platform).unwrap_or_default();
   |                                                   ^^^^^^^^^ expected struct `fil_ocl::standard::platform::Platform`, found enum `std::result::Result`
   |
   = note: expected reference `&fil_ocl::standard::platform::Platform`
              found reference `&std::result::Result<fil_ocl::standard::platform::Platform, bellperson::gpu::error::GPUError>`

So, if it's not too much of a problem, it would be nice if the master branch got a new tag, so the new version can be used in filecoin-ffi and we can start testing AMD on Lotus :)

Given I have zero experience with Rust (well, apart from this disappointing try), I have no idea how hard this is.

I decided to just file an issue; in the meantime I will be learning a bit of Rust to see if I can solve this problem myself :)

Cheers!

Test C2 v0.17.0 error

Hello, I tested C2 and hit some errors.

build with:

env RUSTFLAGS="-C target-cpu=native -g" FFI_BUILD_FROM_SOURCE=1 FFI_USE_CUDA=1 make clean all lotus-bench lotus-shed lotus-seed >make.log 2>&1 &

run with:

./lotus-bench prove ~/c2.json >./c2.log 2>&1 &

env

cuda_11.2.r11.2/compiler.29373293_0
lotus-bench version 1.13.0-rc2+mainnet
bellperson 0.17.0

(screenshot attached to the original issue)
