spacemeshos / post-rs
Rust implementation of POST proving
License: MIT License
Currently, the K2 PoW (on the POST proving and verification paths) uses scrypt-jane. A better, PoW-dedicated alternative was found: RandomX.
We should change the hash function for the K2 PoW to RandomX.
spacemesh-log-7f8f332c(2).txt.zip
Exception 0xc0000005 0x0 0x0 0x7ff93bb53e42
As in the title, we need the profiler tool available in the main branch of the repo, preferably even as a release artifact binary that anyone can download.
Quoting a conversation with @iddo333 from Slack:
@iddo333:
k3pow won't be needed, but k3 itself will be needed when we switch to distributed verification. So the code that solves the PoW for k3pow isn't needed (it shouldn't be compiled; you can comment it out or delete it so that it lives only in the git history; if we find a new reason to use k3pow we can revive this code, but that's unlikely). The code that takes the random-looking k3pow seed and verifies the subset (by using blake3 to create the random-looking sequence) will be used in the future for distributed verification, by replacing the k3pow seed with a different seed.
Is it OK to remove k3pow from the POST proof and k3_pow_difficulty from the configuration of the POST prover then? I will leave the code that picks random indices for distributed verification.
Yes, that's OK for now (the code will remain in the git history, so it's no big deal; it's probably enough to mention the k3pow removal in the git commit message, but I guess you can also add a git tag).
I have a set of POS data that is valid when I verify it with the postcli tool. But when I use go-spacemesh, I get this error in the log:
2023-11-27T10:00:48.630-0800 INFO e2a89.post proving: generated proof {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "post"}
2023-11-27T10:00:48.632-0800 INFO e2a89.atxBuilder created the initial post {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "atxBuilder"}
2023-11-27T10:00:48.632-0800 INFO e2a89.atxBuilder verifying the initial post {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "atxBuilder", "post": {"nonce": 284, "indices":"8e22b74a006b4f2c275874d3e604766f65c2f022ec2f1d666faf46844f58ddba584abe73177d17584e03ea9dc87f502f1865900f7edf3962b9aa085c0a8426b4cbe131f3b1518d2fdd395d4e6b9d877d25531c7178622aa74f5bc8e47505576dbfcedb0036dafe627d7243bdf3ed73a836bdf48e35c045ea59ef451245a41fa764d98ca151407fefb8d16bfa32a881c5446d26be0baae1c7d86195831a47842e21667bfc1824ad57c9f8a501"}, "metadata": {"Challenge": "0000000000000000000000000000000000000000000000000000000000000000", "LabelsPerUnit": 4294967296}, "name": "atxBuilder"}
2023-11-27T10:00:48.649-0800 ERROR e2a89.nipostValidator Proof is invalid: invalid proof of work{"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "nipostValidator","module":"post::post_impl", "file": "ffi/src/post_impl.rs", "line": 203}
2023-11-27T10:00:48.649-0800 FATAL e2a89.atxBuilder initial POST proof is invalid. Probably the initialized POST data is corrupted. Please verify the data with postcli and regenerate the corrupted files. {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "atxBuilder", "errmsg": "verify PoST: invalid proof", "name": "atxBuilder"}
See spacemeshos/pm#290 for the general design.
Currently, there is no way to interrupt proof generation other than killing the application.
It should be possible to interrupt proof generation and continue later from the point it stopped.
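One possible shape for this is cooperative cancellation with a resumable checkpoint. The sketch below assumes that proving walks the data in numbered chunks; `Checkpoint`, `Outcome`, and `scan_chunks` are hypothetical names for illustration and are not part of the post-rs API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical checkpoint: the index of the next chunk that still has to be scanned.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Checkpoint {
    pub next_chunk: u64,
}

/// Outcome of one proving attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// All chunks were processed.
    Finished,
    /// The stop flag was raised; resume later from this checkpoint.
    Interrupted(Checkpoint),
}

/// Scans chunks `start.next_chunk..total_chunks`, checking `stop` before each chunk.
/// `process_chunk` stands in for the real per-chunk proving work (an assumption).
pub fn scan_chunks(
    start: Checkpoint,
    total_chunks: u64,
    stop: &AtomicBool,
    mut process_chunk: impl FnMut(u64),
) -> Outcome {
    for chunk in start.next_chunk..total_chunks {
        if stop.load(Ordering::Relaxed) {
            // Nothing of `chunk` has been processed yet, so it is the resume point.
            return Outcome::Interrupted(Checkpoint { next_chunk: chunk });
        }
        process_chunk(chunk);
    }
    Outcome::Finished
}
```

The caller would persist the returned `Checkpoint` (together with any partial results found so far) and pass it back as `start` when proving is resumed.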
For structured, context-aware logging, migrate to https://github.com/tokio-rs/tracing.
We need to validate AMD GPUs on Linux with OpenCL. The problem is that installing amdgpu is troublesome: it does not work out of the box for the 6600XT, even when fully installed and correctly configured.
Error executing function: clEnqueueNDRangeKernel("scrypt")
Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)
Reported initially via Discord: https://discord.com/channels/623195163510046732/1179746300057624608/1179746300057624608
cmdline:
time ./profiler --k2-pow-difficulty 891576961504 --data-file profiler_data.bin --data-size 1 --duration 10 -t 1
^C
real 5m3.591s
user 76m50.815s
sys 0m5.756s
No data was created on the disk either, despite running for 5 minutes (instead of the 10 seconds I asked for). The screenshot shows the CPU usage.
When I removed --k2-pow-difficulty 891576961504 from the command line, it started to behave OK.
root@testnet-04-miner-1-0:~/data/post/profiler_d# time ./profiler --data-file profiler_data.bin --data-size 1 --duration 10 -t 1
{
"time_s": 10.6147993,
"speed_gib_s": 0.9420809303478777
}
real 0m10.644s
user 0m9.243s
sys 0m1.402s
Checked on M2:
nj ~/Downloads $ ./profiler
{
"time_s": 13.723401458,
"speed_gib_s": 0.2186046957222229
}
It clearly does not use hardware AES acceleration.
The same machine, compiled manually:
nj ~/workspace/post-rs main* $ RUSTFLAGS="--cfg=aes_armv8" cargo run --release -p profiler
Finished release [optimized] target(s) in 0.08s
Running `target/release/profiler`
{
"time_s": 10.456489666,
"speed_gib_s": 1.8170533904681048
}
#98 introduced a way to blacklist OCL platforms and devices via environment variables (POST_OCL_PLATFORMS_BLACKLIST and POST_OCL_DEVICES_BLACKLIST). It turns out to be too difficult to configure for less technical users.
The proposal is to allow configuring the blacklist via the config instead.
It should be possible to provide a k2 proof of work to generate the POST with, in order to avoid recomputing it.
generate_proof() should accept an optional k2pow parameter (an array with one entry per nonce group?).
There is a crash in OpenCL.dll when querying the properties of an OpenCL platform. It happens on some systems (for example, when calling scrypt_ocl::get_providers_count()).
It was narrowed down to Intel's integrated graphics (both Intel HD and Intel Iris). Disabling the driver (leaving Nvidia enabled) resolves the issue.
An issue in the ocl library tracking the same: cogciprocate/ocl#219.
Please see the logs.
The report in Sentry about that exact issue: https://spacemesh.sentry.io/issues/4234116670/?project=6324919&query=is%3Aunresolved&referrer=issue-stream&stream_index=0
A POST proof generated on M1 Mac is invalid.
Related thread on Discord: https://discord.com/channels/623195163510046732/1145451846249500753
Logs: spacemesh-log-7c8cef2b.zip
The POST proof generation code thinks that the files are 4096B in size:
invalid POS file, expected size: 2147483648 vs actual size: 4096 {"node_id": "358c764132cff490de0266420624b561392bf14acada9112a9fc1d7e350adb6d", "module": "nipostValidator", "module": "post::reader", "file": "src/reader.rs", "line": 102}
2023/05/15 10:52:30 DEBUG initialization: file #2 current position: 39845888, remaining: 27262976
2023/05/15 10:52:32 DEBUG initialization: file #2 current position: 40894464, remaining: 26214400
2023/05/15 10:52:34 DEBUG initialization: file #2 current position: 41943040, remaining: 25165824
2023/05/15 10:52:36 DEBUG initialization: file #2 current position: 42991616, remaining: 24117248
2023/05/15 10:52:39 DEBUG initialization: file #2 current position: 44040192, remaining: 23068672
2023/05/15 10:52:41 DEBUG initialization: file #2 current position: 45088768, remaining: 22020096
2023/05/15 10:52:43 DEBUG initialization: file #2 current position: 46137344, remaining: 20971520
2023/05/15 10:52:45 DEBUG initialization: file #2 current position: 47185920, remaining: 19922944
2023/05/15 10:52:47 DEBUG initialization: file #2 current position: 48234496, remaining: 18874368
2023/05/15 10:52:50 DEBUG initialization: file #2 current position: 49283072, remaining: 17825792
2023/05/15 10:52:52 DEBUG initialization: file #2 current position: 50331648, remaining: 16777216
2023/05/15 10:52:54 DEBUG initialization: file #2 current position: 51380224, remaining: 15728640
2023/05/15 10:52:56 DEBUG initialization: file #2 current position: 52428800, remaining: 14680064
2023/05/15 10:52:58 DEBUG initialization: file #2 current position: 53477376, remaining: 13631488
2023/05/15 10:53:01 DEBUG initialization: file #2 current position: 54525952, remaining: 12582912
2023/05/15 10:53:03 DEBUG initialization: file #2 current position: 55574528, remaining: 11534336
2023/05/15 10:53:05 DEBUG initialization: file #2 current position: 56623104, remaining: 10485760
2023/05/15 10:53:07 DEBUG initialization: file #2 current position: 57671680, remaining: 9437184
2023/05/15 10:53:09 DEBUG initialization: file #2 current position: 58720256, remaining: 8388608
2023/05/15 10:53:12 DEBUG initialization: file #2 current position: 59768832, remaining: 7340032
2023/05/15 10:53:14 DEBUG initialization: file #2 current position: 60817408, remaining: 6291456
2023/05/15 10:53:16 DEBUG initialization: file #2 current position: 61865984, remaining: 5242880
2023/05/15 10:53:18 DEBUG initialization: file #2 current position: 62914560, remaining: 4194304
2023/05/15 10:53:21 DEBUG initialization: file #2 current position: 63963136, remaining: 3145728
Found new smallest nonce: Some(VrfNonce { index: 199214068, label: [0, 0, 0, 1, 10, 251, 93, 246, 161, 28, 201, 140, 231, 38, 180, 178, 27, 93, 64, 199, 96, 105, 160, 107, 8, 151, 154, 192, 67, 170, 150, 173] })
2023/05/15 10:53:23 INFO initialization: file #2, found nonce: 199214068, value: 000000010afb5df6a11cc98ce726b4b2
2023/05/15 10:53:23 INFO initialization: file #2, found new best nonce
2023/05/15 10:53:23 DEBUG initialization: file #2 current position: 65011712, remaining: 2097152
2023/05/15 10:53:25 DEBUG initialization: file #2 current position: 66060288, remaining: 1048576
2023/05/15 10:53:27 INFO initialization: file #2 completed; number of labels written: 67108864
2023/05/15 10:53:27 INFO initialization: starting to write file #3; target number of labels: 67108864, start position: 201326592
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12128, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 388096 bytes
Allocating buffer for lookup: 6358564864 bytes
2023/05/15 10:53:27 DEBUG initialization: file #3 current position: 0, remaining: 67108864
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OclError(OclCore(Api(
################################ OPENCL ERROR ###############################
Error executing function: clEnqueueNDRangeKernel("scrypt")
Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)
Please visit the following url for more information:
https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors
#############################################################################
)))', ffi/src/initialization.rs:146:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
SIGABRT: abort
PC=0x7fd345a2fa7c m=9 sigcode=18446744073709551610
signal arrived during cgo execution
goroutine 1 [syscall]:
runtime.cgocall(0x523d16, 0xc0001495f8)
/root/go/src/runtime/cgocall.go:158 +0x5c fp=0xc0001495d0 sp=0xc000149598 pc=0x40595c
github.com/spacemeshos/post/internal/postrs._Cfunc_initialize(0x7fd13cdcf8a0, 0xc000000, 0xc0fffff, 0x7fd13d0020d0, 0xc0002820b0)
_cgo_gotypes.go:307 +0x4c fp=0xc0001495f8 sp=0xc0001495d0 pc=0x4f562c
github.com/spacemeshos/post/internal/postrs.cScryptPositions.func2(0x7fd13cdcf8a0, 0x1?, 0x7fd13d0020d0?, 0x657be0?, 0x0?)
/root/post/internal/postrs/initializer.go:109 +0x7f fp=0xc000149658 sp=0xc0001495f8 pc=0x4f6bdf
github.com/spacemeshos/post/internal/postrs.cScryptPositions(0xc00013a000?, 0xc000149700?, 0xc000000, 0xc0fffff)
/root/post/internal/postrs/initializer.go:109 +0xf1 fp=0xc000149708 sp=0xc000149658 pc=0x4f6831
github.com/spacemeshos/post/internal/postrs.(*Scrypt).Positions(0xc000280030, 0xc000000, 0xc0fffff)
/root/post/internal/postrs/api.go:159 +0x7e fp=0xc000149760 sp=0xc000149708 pc=0x4f60fe
github.com/spacemeshos/post/oracle.(*WorkOracle).Positions(0x4000000?, 0x55dbd9?, 0x20?)
/root/post/oracle/oracle.go:164 +0x33 fp=0xc0001497b8 sp=0xc000149760 pc=0x506513
github.com/spacemeshos/post/initialization.(*Initializer).initFile(0xc000176000, {0x588b18, 0xc00016c340}, 0xc000175a30?, 0x100000, 0xc000000, 0x4000000, {0xc00001a1a0, 0x20, 0x20})
/root/post/initialization/initialization.go:479 +0xab0 fp=0xc0001499d8 sp=0xc0001497b8 pc=0x51f450
github.com/spacemeshos/post/initialization.(*Initializer).Initialize(0xc000176000, {0x588b18, 0xc00016c340})
/root/post/initialization/initialization.go:266 +0x58a fp=0xc000149cb8 sp=0xc0001499d8 pc=0x51d9ea
main.main()
/root/post/cmd/postcli/main.go:133 +0x3e5 fp=0xc000149f80 sp=0xc000149cb8 pc=0x522c85
runtime.main()
/root/go/src/runtime/proc.go:250 +0x212 fp=0xc000149fe0 sp=0xc000149f80 pc=0x439052
runtime.goexit()
/root/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc000149fe8 sp=0xc000149fe0 pc=0x465e41
goroutine 2 [force gc (idle), 6 minutes]:
I've seen a few reports of this. It happened to me when I initialized using a Mac M2. Another user just reported to me:
I still fail to smesh on my M2 MacBook after syncing. If I run postcli with the command:
./postcli -verify -datadir ~/post/7c8cef2b -fraction 0.1
It complains:
POS data is invalid: invalid label in file 0 at offset 532096 {"module": "post::initialization", "file": "ffi/src/initialization.rs", "line": 238}
Here are a couple of reports on Discord, there may be others:
Possibly related: #123
Currently, the library prints to stdout. It could be improved by using the log logging facade. Then the C library could expose an API to configure logging (either to log directly to stdout/stderr or to provide a callback that passes logs up the stack to the user of the post-rs lib).
The goal of the profiler is to help pick the right combination of threads and parallel nonces to get optimal POST proving speed (going over the data on disk and finding labels) given the user's real disk read speed.
The profiler creates a file on a disk filled with garbage and then goes over it doing its work a few times in a loop. The problem is that the file gets cached in RAM in the first iteration so all subsequent ones read from RAM instead of disk. This defeats the purpose because the benchmark is not limited by disk read speed anymore.
Error
ChildProcess.(posProfiler.ts)
Error: No such file or directory (os error 2)
Location:
profiler/src/util/macos.rs:4:16
Potential unaligned read
Details | |
---|---|
Status | unsound |
Package | atty |
Version | 0.2.14 |
URL | softprops/atty#50 |
Date | 2021-07-04 |
On Windows, atty dereferences a potentially unaligned pointer.
In practice however, the pointer won't be unaligned unless a custom global allocator is used.
In particular, the System allocator on Windows uses HeapAlloc, which guarantees a large enough alignment.
A Pull Request with a fix has been provided over a year ago but the maintainer seems to be unreachable.
Last release of atty was almost 3 years ago.
The below list has not been vetted in any way and may or may not contain alternatives;
See advisory page for additional details.
I noticed that when I run the profiler on a dedicated disk that's in deep sleep, the first read is extra slow, but a subsequent read is faster:
> profiler -t 0 -n 144 --data-file /mnt/smesher-02/post/7c8cef2b/postdata_0.bin
{
"time_s": 12.901740377,
"speed_gib_s": 0.07750892288785359
}
> profiler -t 0 -n 144 --data-file /mnt/smesher-02/post/7c8cef2b/postdata_0.bin
{
"time_s": 11.053322122,
"speed_gib_s": 0.18094107616924474
}
> profiler -t 0 -n 144 --data-file /mnt/smesher-02/post/7c8cef2b/postdata_1.bin
{
"time_s": 10.225100351,
"speed_gib_s": 0.1955971023604089
}
(Changing nonces and threads here has no impact.)
Profiler should be smart enough to handle this case. It's probably as easy as having it perform one tiny "dummy read" before the timer starts, to make sure the disk is awake.
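A minimal sketch of the "dummy read" idea, using only the standard library. `wake_disk` and `timed_read` are hypothetical helpers for illustration; they are not functions of the current profiler, and the 4 KiB probe size is an arbitrary assumption.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;
use std::time::{Duration, Instant};

/// Reads a few bytes so that a sleeping disk spins up before timing starts.
/// Hypothetical helper, not part of the profiler today.
pub fn wake_disk(file: &mut File) -> io::Result<()> {
    let mut probe = [0u8; 4096];
    // The data is discarded; only the side effect of touching the disk matters.
    let _ = file.read(&mut probe)?;
    // Rewind so the timed pass starts from the beginning of the file.
    file.seek(SeekFrom::Start(0))?;
    Ok(())
}

/// Reads the whole file and returns (bytes read, elapsed time).
/// The warm-up read happens before the timer starts.
pub fn timed_read(path: &Path) -> io::Result<(u64, Duration)> {
    let mut file = File::open(path)?;
    wake_disk(&mut file)?;

    let started = Instant::now();
    let mut buf = vec![0u8; 1024 * 1024];
    let mut total = 0u64;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        total += n as u64;
    }
    Ok((total, started.elapsed()))
}
```

The probed bytes may be served from the page cache during the timed pass, but at 4 KiB that should be negligible next to the size of a POS file.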
Allocating buffer for lookup: 6358564864 bytes
2023/05/12 09:37:09 DEBUG initialization: file #10 current position: 59768832, remaining: 7340032
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OclError(OclCore(Api(
################################ OPENCL ERROR ###############################
Error executing function: clCreateContext
Status error code: CL_NV_INVALID_MEM_ACCESS (-9999)
Please visit the following url for more information:
https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateContext.html#errors
#############################################################################
)))', ffi/src/initialization.rs:145:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
SIGABRT: abort
All was working OK and then this crashed.
Could be partially caused by spacemeshos/post#138.
Currently, there are separate instances of RandomX cache and dataset on proving and verifying path. In some cases, it could make sense to share them. For example, when a verifier is configured in FAST mode (2GB), then it could share its dataset with the prover.
Doing it in the other direction (to share a dataset with the verifier) makes less sense as a prover is created once every epoch and it exists only for a short time.
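The sharing direction described above could be sketched with a reference-counted handle. Everything here is hypothetical: `Dataset` is a stand-in for the RandomX dataset, and `Verifier` and `Prover` are simplified placeholders rather than the actual post-rs types.

```rust
use std::sync::Arc;

/// Stand-in for the large RandomX dataset (about 2 GB in fast mode).
pub struct Dataset {
    pub bytes: Vec<u8>,
}

/// Long-lived verifier that owns a handle to the dataset.
pub struct Verifier {
    dataset: Arc<Dataset>,
}

/// Short-lived prover, created once per epoch.
pub struct Prover {
    dataset: Arc<Dataset>,
}

impl Verifier {
    pub fn new(dataset: Arc<Dataset>) -> Self {
        Self { dataset }
    }

    /// Hands out another handle to the same dataset; no data is copied.
    pub fn shared_dataset(&self) -> Arc<Dataset> {
        Arc::clone(&self.dataset)
    }
}

impl Prover {
    /// Builds a prover that reuses the verifier's dataset instead of allocating its own.
    pub fn with_dataset_from(verifier: &Verifier) -> Self {
        Self { dataset: verifier.shared_dataset() }
    }

    pub fn dataset_len(&self) -> usize {
        self.dataset.bytes.len()
    }
}
```

When the prover is dropped at the end of proving, only its handle goes away; the dataset stays alive for as long as the verifier holds it.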
I think we should add a basic-unit benchmark (64 nonces, 256 GiB) to the profiler tool.
It's currently missing, which could result in someone initializing too much data.
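As a rough illustration of what such a benchmark could report, the sketch below turns a measured `speed_gib_s` value (as printed by the profiler) into an estimated time for one pass over a 256 GiB unit. The function and constant names are hypothetical, and the estimate assumes the measured speed holds for the whole pass.

```rust
/// Size of the basic unit from the proposal, in GiB.
pub const UNIT_SIZE_GIB: f64 = 256.0;

/// Estimates how many seconds one pass over `data_gib` GiB takes at `speed_gib_s`.
/// Returns `None` for a non-positive or non-finite speed.
pub fn estimate_pass_seconds(data_gib: f64, speed_gib_s: f64) -> Option<f64> {
    if !speed_gib_s.is_finite() || speed_gib_s <= 0.0 {
        return None;
    }
    Some(data_gib / speed_gib_s)
}
```

For example, at a speed of roughly 0.94 GiB/s, one pass over a single 256 GiB unit would take about 272 seconds.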
[email protected]:~/post-rs$ cargo run --release -p initializer -- -l 12032000 --node-id "hBGTHs44tav7YR87sRVafuzZwObCZnK1Z/exYpxwqSQ=" --commitment-atx-id "ZuxocVjIYWfv7A/K1Lmm8+mNsHzAZaWVpbl5+KINx+I=" -m 1000000000
Compiling initializer v0.1.8 (/root/post-rs/initializer)
Finished release [optimized] target(s) in 1.88s
Running `target/release/initializer -l 12032000 --node-id hBGTHs44tav7YR87sRVafuzZwObCZnK1Z/exYpxwqSQ= --commitment-atx-id ZuxocVjIYWfv7A/K1Lmm8+mNsHzAZaWVpbl5+KINx+I= -m 1000000000`
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 4090
device memory: 24217 MB, max_mem_alloc_size: 6054 MB, max_compute_units: 128, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12096, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 387072 bytes
Allocating buffer for lookup: 6341787648 bytes
Error: initializing: Fail in OpenCL:
################################ OPENCL ERROR ###############################
Error executing function: clEnqueueNDRangeKernel("scrypt")
Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)
Please visit the following url for more information:
https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors
#############################################################################
Location:
initializer/src/main.rs:199:22
Yesterday we noticed that after the upgrade there are problems on Apple Silicon with verify_proof.
We have checked and disabling JIT for RandomX makes the problem go away.
Running on Rosetta also helps.
ERROR 636ff.post Proof is invalid: MSB value for index: 117717007989 doesn't satisfy difficulty: 215 > 0 (label: [145, 107, 182, 205, 188, 156, 238, 115, 148, 20, 39, 181, 44, 4, 119, 80]) {"node_id": "", "module": "post", "module": "post::post_impl", "file": "ffi\src\post_impl.rs", "line": 242}
initial POST proof is invalid. Probably the initialized POST data is corrupted. Please verify the data with postcli and regenerate the corrupted files
I found this code:
(https://github.com/spacemeshos/post-rs/blob/main/ffi/src/post_impl.rs#L239)
and I would like it to tell me which postdata file contains the error.
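Until the error message includes the file, the file can be derived from the label index by hand. The sketch below assumes that each stored label takes 16 bytes and that every file holds the same number of labels, determined by the configured maximum file size; `locate_label` and `LabelLocation` are hypothetical names, not part of post-rs.

```rust
/// Size of one stored label in bytes (assumption: 16 bytes per label in the POS data).
pub const LABEL_SIZE: u64 = 16;

/// Location of a label inside the POS data directory.
#[derive(Debug, PartialEq, Eq)]
pub struct LabelLocation {
    /// Index of the postdata file that holds the label.
    pub file_index: u64,
    /// Byte offset of the label inside that file.
    pub byte_offset: u64,
}

/// Maps a global label index to the file that holds it.
/// Assumes every file holds `max_file_size / LABEL_SIZE` labels.
pub fn locate_label(label_index: u64, max_file_size: u64) -> Option<LabelLocation> {
    let labels_per_file = max_file_size / LABEL_SIZE;
    if labels_per_file == 0 {
        return None;
    }
    Some(LabelLocation {
        file_index: label_index / labels_per_file,
        byte_offset: (label_index % labels_per_file) * LABEL_SIZE,
    })
}
```

With 2 GiB files (2147483648 bytes, used here only as an example value), index 117717007989 from the error above maps to file index 877 at byte offset 128968528.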
We have a report in Discord that go-spacemesh unexpectedly quits.
Go-spacemesh logs look fine and do not contain any errors.
However, the app logs indicate that the node process exits with the code 3221225477 (0xC0000005, the Windows access violation status).
Here is a part of log files:
Go-spacemesh:
...
2023-12-27T22:28:26.699Z INFO 800e3.post initialization: file already initialized {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "post", "fileIndex": 671, "currentNumLabels": 134217728, "targetNumLabels": 134217728, "startPosition": 90060095488}
2023-12-27T22:28:26.699Z INFO 800e3.post initialization: completed, found nonce {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "post", "nonce": 81997299014}
2023-12-27T22:28:26.699Z INFO 800e3.post post setup completed {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "post", "node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "commitment_atx": "9399f34ca0", "data_dir": "E:\\HDDNODE4", "num_units": "21", "labels_per_unit": "4294967296", "provider": "{824636000664}", "name": "post"}
2023-12-27T22:28:26.705Z INFO 800e3.atxBuilder loaded the initial post from disk {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "atxBuilder"}
2023-12-27T22:28:26.705Z INFO 800e3.atxBuilder verifying the initial post {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "atxBuilder", "post": {"nonce": 179, "indices": "9f32e0326039241d242889fa5e051c0a1fd2601fa8321fdea40f70842a139ca6003bcf8617af6ef504e3df3c9863586785070f2ed956f7a12531444074d6d6d888c894391bd159cceb2308b58ace844bff34a1c416cb849666a915dd62e301a371f021dd020f3a8efceae9dc329e3f4898e666e82609e110f5c5c2dca24eac98c234d313c6a698fa23a095dab51623fb3c247a6813d340abed5e9761bfbd88b54338f695cd2ee80ae9681701"}, "metadata": {"Challenge": "0000000000000000000000000000000000000000000000000000000000000000", "LabelsPerUnit": 4294967296}, "name": "atxBuilder"}
2023-12-27T22:28:26.815Z INFO 800e3.atxBuilder atx challenge is ready {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "atxBuilder", "sessionId": "0d2a3100-fd10-4ad5-a1f4-ca91d8129d18", "current_epoch": "11", "publish_epoch": "11", "target_epoch": "12", "name": "atxBuilder"}
2023-12-27T22:28:26.815Z INFO 800e3.nipostBuilder building nipost {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostBuilder", "sessionId": "0d2a3100-fd10-4ad5-a1f4-ca91d8129d18", "poet round start": "2023-12-11T08:00:00.000Z", "poet round end": "2023-12-24T20:00:00.000Z", "publish epoch": "11", "publish epoch end": "2023-12-29T08:00:00.000Z", "target epoch": "12", "name": "nipostBuilder"}
2023-12-27T22:28:26.952Z INFO 800e3.nipostBuilder starting post execution {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostBuilder", "challenge": "VgDHwLYegT5coz4ptGFD9ORuJeEVBEoVIwiC0CsXcjo=", "name": "nipostBuilder"}
2023-12-27T22:28:26.952Z INFO 800e3.nipostValidator scaling post verifier {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator", "current": 9, "new": 1}
2023-12-27T22:28:26.952Z INFO 800e3.nipostValidator.worker-8 stopped {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z INFO 800e3.nipostValidator.worker-1 stopped {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z INFO 800e3.nipostValidator.worker-5 stopped {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z INFO 800e3.nipostValidator.worker-7 stopped {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z INFO 800e3.nipostValidator.worker-4 stopped {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z INFO 800e3.nipostValidator.worker-2 stopped {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z INFO 800e3.nipostValidator.worker-3 stopped {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z INFO 800e3.nipostValidator.worker-6 stopped {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.960Z INFO 800e3.nipostValidator generating proof with PoW flags: RandomXFlag(FLAG_HARD_AES | FLAG_FULL_MEM | FLAG_JIT | FLAG_ARGON2_SSSE3 | FLAG_ARGON2_AVX2) and params: ProvingParams { difficulty: 5317578556, pow_difficulty: [0, 0, 170, 111, 105, 238, 214, 162, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195] } {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator", "module": "post::prove", "file": "src\\prove.rs", "line": 276}
2023-12-27T22:28:27.331Z INFO grpc finished streaming call with code OK {"module": "grpc", "grpc.start_time": "2023-12-27T22:19:58Z", "system": "grpc", "span.kind": "server", "grpc.service": "spacemesh.v1.SmesherService", "grpc.method": "PostSetupStatusStream", "peer.address": "127.0.0.1:49946", "grpc.code": "OK", "grpc.time_ms": 509008.72}
2023-12-27T22:28:27.642Z INFO grpc finished streaming call with code OK {"module": "grpc", "grpc.start_time": "2023-12-27T22:19:57Z", "system": "grpc", "span.kind": "server", "grpc.service": "spacemesh.v1.SmesherService", "grpc.method": "PostSetupStatusStream", "peer.address": "127.0.0.1:49946", "grpc.code": "OK", "grpc.time_ms": 510008.3}
2023-12-27T22:28:40.483Z INFO 800e3.blockHandler new block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "05686864f8", "layer_id": 47462, "name": "blockHandler"}
2023-12-27T22:28:40.514Z INFO 800e3.executor executed block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47461, "block": "0b73fbbbac", "state_hash": "0xb9e5cb60045e4c592357b51cc8611805fcea1658ad18b770fba1102619eb8833", "duration": "59.4948ms", "count": 3, "rewards": 42, "name": "executor"}
2023-12-27T22:28:40.633Z INFO 800e3.blockHandler new block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "86b9eb89c9", "layer_id": 47464, "name": "blockHandler"}
2023-12-27T22:28:40.887Z INFO 800e3.executor executed block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47462, "block": "05686864f8", "state_hash": "0xaf193fd9dba8fba8d5e14848f6e8348b2bdc0d27d4cba5ebb4ab36e2e68e9d6c", "duration": "43.1586ms", "count": 2, "rewards": 47, "name": "executor"}
2023-12-27T22:28:40.935Z INFO 800e3.executor executed block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47463, "block": "afcde47b00", "state_hash": "0xd2496d0bc4f538c98ff6d9c83de2ff7fde108e940524c8d716880ae719e44fca", "duration": "41.1809ms", "count": 1, "rewards": 48, "name": "executor"}
2023-12-27T22:28:40.975Z INFO 800e3.executor executed block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47464, "block": "86b9eb89c9", "state_hash": "0x697b54dda4cf332e696e8b00511172628514427c9053252f569467d301e847e4", "duration": "39.1292ms", "count": 0, "rewards": 57, "name": "executor"}
2023-12-27T22:28:44.688Z INFO grpc started streaming call {"module": "grpc", "grpc.start_time": "2023-12-27T22:28:44Z", "system": "grpc", "span.kind": "server", "grpc.service": "spacemesh.v1.AdminService", "grpc.method": "EventsStream", "peer.address": "127.0.0.1:49946"}
2023-12-27T22:28:45.856Z INFO 800e3.blockHandler new block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "5d7e197d88", "layer_id": 47465, "name": "blockHandler"}
2023-12-27T22:28:46.318Z INFO 800e3.blockHandler new block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "218cdf8128", "layer_id": 47466, "name": "blockHandler"}
2023-12-27T22:28:46.509Z INFO 800e3.blockHandler new block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "0ca98cc9c3", "layer_id": 47467, "name": "blockHandler"}
2023-12-27T22:28:46.520Z INFO 800e3.executor executed block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47465, "block": "5d7e197d88", "state_hash": "0xaddf8c52ca05242676aaa426f342139b00d6d16a8c5fe83bce63eaee28a774dd", "duration": "51.3029ms", "count": 1, "rewards": 42, "name": "executor"}
2023-12-27T22:28:47.238Z INFO 800e3.executor executed block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47466, "block": "218cdf8128", "state_hash": "0x1ebe77e764256def431b07f15e5f072f7788b51218653c32e2ee6fb8b7975490", "duration": "45.5812ms", "count": 3, "rewards": 43, "name": "executor"}
2023-12-27T22:28:47.285Z INFO 800e3.executor executed block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47467, "block": "0ca98cc9c3", "state_hash": "0xae8d7935e393aafbce6596cfc5ac88f0eddf54c2174591ed1eb5028b16ddfcab", "duration": "47.7282ms", "count": 1, "rewards": 48, "name": "executor"}
2023-12-27T22:28:47.514Z INFO 800e3.blockHandler new block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "bd553bf3d4", "layer_id": 47468, "name": "blockHandler"}
2023-12-27T22:28:47.927Z WARN 800e3.sync mesh failed to process layer from sync {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "sync", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "layer_id": 47470, "errmsg": "get block: get block bd553bf3d4: database: not found", "name": "sync"}
2023-12-27T22:28:47.986Z INFO 800e3.blockHandler new block {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "36ab2e082e", "layer_id": 47469, "name": "blockHandler"}
2023-12-27T22:29:02.364Z INFO 800e3.nipostValidator calculating proof of work for nonces 0..288 {"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator", "module": "post::prove", "file": "src\\prove.rs", "line": 131}
App-logs:
...
[2023-12-27 22:28:27.634] [info] SmesherService, in grpc PostDataCreationProgressStream, output: {"state":5,"numLabelsWritten":90194313216}
[2023-12-27 22:28:27.636] [info] SmesherService, in PostSetupStatusStream, output: {"postSetupState":5,"numLabelsWritten":90194313216,"opts":{"dataDir":"E:\\HDDNODE4","numUnits":21,"maxFileSize":2147483648,"providerId":4294967295}}
[2023-12-27 22:28:27.636] [info] SmesherService, in PostSetupStatusStream, output: "Status complete -> closing stream"
[2023-12-27 22:28:33.775] [info] NodeManager, in sendNodeStatus, output: {"connectedPeers":25,"isSynced":false,"syncedLayer":47465,"topLayer":47981,"verifiedLayer":47460}
[2023-12-27 22:28:38.098] [info] NodeManager, in sendNodeStatus, output: {"connectedPeers":26,"isSynced":false,"syncedLayer":47465,"topLayer":47981,"verifiedLayer":47460}
[2023-12-27 22:28:45.280] [info] NodeManager, in sendNodeStatus, output: {"connectedPeers":26,"isSynced":false,"syncedLayer":47465,"topLayer":47981,"verifiedLayer":47464}
[2023-12-27 22:28:49.045] [info] NodeManager, in sendNodeStatus, output: {"connectedPeers":26,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467}
[2023-12-27 22:29:02.323] [info] NodeManager, in sendNodeStatus, output: {"connectedPeers":25,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467}
[2023-12-27 22:29:13.199] [info] NodeManager, in sendNodeStatus, output: {"connectedPeers":24,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467}
[2023-12-27 22:29:14.619] [info] NodeManager, in sendNodeStatus, output: {"connectedPeers":23,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467}
[2023-12-27 22:29:34.373] [error] syncSmesherInfo, in subscribeNodeEvents, error: "Error: 14 UNAVAILABLE: read ECONNRESET"
[2023-12-27 22:29:34.821] [error] NodeManager, in Node Process close, error: 3221225477
...
My guess is that prove.rs tries to read an invalid address, which crashes the app (exit code 3221225477 is 0xC0000005, a Windows access violation).
You can download the full logs from the report in Discord or ask the user to help with the investigation (e.g. by running Smapp with the --pprof-server flag, which is passed on to go-spacemesh, and then fetching the required profiles).
Given the recent problems with Vulkan and CUDA, we need an alternative implementation. To lower the maintenance cost, let's use OpenCL.
The code has to be in this repo, and then we might need to integrate it with post.
We have a parallel effort to fix gpu-post.
Currently, the code transfers 32 bytes of each label from GPU to the host. 32B are needed to search for the POPS VRF nonce because all 32B are compared with difficulty. Later on, only the most significant (big-endian) 16B are preserved in the POS data.
It's possible to optimize the bandwidth and send only the 16B of each label. The missing lower 16B could be generated on the CPU on the rare occasion (1/2^128) when the most significant bytes are not enough to decide.
Given a 16B label and a 32B difficulty, there are 3 possible cases to consider:
label > difficulty[0:16] -> NOT a nonce
label == difficulty[0:16] -> POSSIBLY a nonce - generate lower 16 bytes of label
label < difficulty[0:16] -> VALID nonce
The code would need to generate a label on the CPU only if the 16 bytes of the label are equal to the most significant 16B of the difficulty.
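The three cases can be sketched in Rust as below. This is a minimal illustration, not code from the repository: the names `NonceCheck` and `check_label_prefix` are hypothetical, and it assumes both the label and the difficulty are stored big-endian, so a byte-wise comparison matches the numeric one.

```rust
use std::cmp::Ordering;

/// Outcome of comparing only the upper 16 bytes of a label with the difficulty.
/// (Hypothetical type, introduced for illustration.)
#[derive(Debug, PartialEq, Eq)]
enum NonceCheck {
    /// label > difficulty[0:16]: cannot be a nonce.
    NotNonce,
    /// label == difficulty[0:16]: the lower 16 bytes must be generated on the CPU.
    NeedsFullLabel,
    /// label < difficulty[0:16]: a valid nonce regardless of the lower bytes.
    ValidNonce,
}

/// Classifies a 16-byte label prefix against a 32-byte difficulty.
/// Assumes big-endian byte order, so lexicographic slice comparison
/// is equivalent to comparing the values as integers.
fn check_label_prefix(label_hi: &[u8; 16], difficulty: &[u8; 32]) -> NonceCheck {
    match label_hi[..].cmp(&difficulty[..16]) {
        Ordering::Greater => NonceCheck::NotNonce,
        Ordering::Equal => NonceCheck::NeedsFullLabel,
        Ordering::Less => NonceCheck::ValidNonce,
    }
}
```

Only the `NeedsFullLabel` branch would trigger the CPU fallback, which is the rare case described above.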
Add a new CLI parameter, --max-retries, to the PoST service that defines how many times the service should try to connect to the node before giving up and shutting down.
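A retry limit of this kind could look roughly like the sketch below. The helper `connect_with_retries` is hypothetical, and it assumes `max_retries` counts retries after the first attempt; the issue does not say whether the first attempt is included.

```rust
/// Calls `connect` until it succeeds or the retry budget is exhausted.
/// Assumption: one initial attempt plus up to `max_retries` retries.
/// A real service would likely wait between attempts; that is omitted here.
fn connect_with_retries<T, E>(
    max_retries: u32,
    mut connect: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut retries_left = max_retries;
    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            // Out of retries: give up and report the last error to the caller.
            Err(err) if retries_left == 0 => return Err(err),
            Err(_) => retries_left -= 1,
        }
    }
}
```

The caller can then shut the service down when the function returns an error.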
Implement the GRPC post service as described in spacemeshos/pm#260
The CPU load during initialization is constantly at 100% for each postcli process. This has been reported by many users.
This issue is due to the specific Nvidia implementation of the OpenCL synchronization. The enqueued buffer read operation is constantly probing the status of the running OpenCL kernel, which puts the CPU under high load.
A workaround would be to put the CPU thread to sleep for a defined duration right after enqueuing the kernel, and only then enqueue a buffer read. The sleep duration can be obtained by averaging the execution time over a number of kernel executions and then subtracting a safety margin from it (e.g. the sleep duration is 90% of the kernel execution time).
Ideally, the kernel execution time should be measured periodically by disabling the sleep for a couple of kernel executions every few seconds. This ensures that if the kernel execution time decreases below the sleep duration (e.g. because the user increased the power limit of the GPU), the decrease is properly detected.
Tests on an RTX 3080 Ti show that a sleep duration of 25ms reduces the CPU load to 10% while maintaining the initialization speed. A further increase of the sleep duration to 30ms reduces the CPU load to 2.5% but has a negative impact on the initialization speed.
sleep (ms) | CPU load (%) | init. speed (MiB/s) |
---|---|---|
0 | 100 | 3.30 |
25 | 10 | 3.30 |
30 | 2.5 | 2.75 |
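The averaging step of the proposed workaround could be sketched as follows. The `SleepEstimator` type and its parameters are hypothetical; the sketch only computes the sleep duration from recorded kernel times and does not touch OpenCL.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Keeps a sliding window of measured kernel execution times and derives
/// a sleep duration from their average. (Hypothetical helper.)
struct SleepEstimator {
    samples: VecDeque<Duration>,
    max_samples: usize,
    /// Fraction of the average kernel time to sleep, e.g. 0.9 for 90%.
    safety_factor: f64,
}

impl SleepEstimator {
    fn new(max_samples: usize, safety_factor: f64) -> Self {
        Self {
            samples: VecDeque::new(),
            // Keep at least one sample so the window is never empty by design.
            max_samples: max_samples.max(1),
            safety_factor,
        }
    }

    /// Records one kernel execution time measured with the sleep disabled.
    fn record(&mut self, kernel_time: Duration) {
        if self.samples.len() == self.max_samples {
            self.samples.pop_front();
        }
        self.samples.push_back(kernel_time);
    }

    /// Returns the duration to sleep after enqueuing the kernel.
    /// With no measurements yet, it returns zero (no sleep).
    fn sleep_duration(&self) -> Duration {
        if self.samples.is_empty() {
            return Duration::ZERO;
        }
        let total: Duration = self.samples.iter().sum();
        let average = total / self.samples.len() as u32;
        average.mul_f64(self.safety_factor)
    }
}
```

To follow the periodic re-measurement idea, the caller would skip the sleep for a few kernel executions every few seconds and feed those timings to `record`, so that a faster kernel shortens the sleep.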
We need to expose global_work_size so we don't leave performance on the table when integrating with post.
Exposing it will also allow making small but well-fitting batches, which improves the overall UX.
I've observed on multiple systems and operating systems that profiler speed results never exceed approximately 3 GiB/s, regardless of the number of nonces and the available hardware.
Edit: I have observed the same limitation on production nodes during the cycle gap.
Running multiple instances of profiler allows for cumulative read speeds that scale to the system CPU and IO limits, but still no more than ~3GiB/s per instance.
Tests run with --data-size=32
Currently, the K2 proof of work uses all available cores. It should respect the threads parameter passed to generate_proof().
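One way to bound the parallelism is sketched below with the standard library only. It is not the repository's implementation: `solve_pow` is a hypothetical stand-in for the real per-nonce computation, and `solve_all` is an illustrative helper that never spawns more than `threads` workers.

```rust
use std::ops::Range;
use std::thread;

/// Hypothetical stand-in for the real per-nonce PoW computation.
fn solve_pow(nonce: u32) -> u64 {
    u64::from(nonce).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Computes the PoW for every nonce using at most `threads` worker threads.
/// Results are returned in nonce order.
fn solve_all(nonces: Range<u32>, threads: usize) -> Vec<u64> {
    let nonces: Vec<u32> = nonces.collect();
    let workers = threads.max(1);
    // Ceiling division, so the number of chunks never exceeds `workers`.
    let chunk_size = ((nonces.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = nonces
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&n| solve_pow(n)).collect::<Vec<u64>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

With `threads` set to 1 the work runs on a single worker, which matches what a user limiting CPU usage would expect.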
To support distributed verification, each verifier must verify a different subset of k3 indices. To achieve that, the randomness seed used to select k3 indices must include the public key of the verifying node.
To create a malfeasance proof, we need to know which index in the proof is invalid. The result from the verification method should include this information.