spacemeshos / post-rs

Rust implementation of POST proving

License: MIT License

Rust 90.77% C 8.66% Dockerfile 0.58%

post-rs's Issues

Use randomX for K2 PoW

Currently, K2 PoW (on the POST proving and verification path) uses scrypt-jane. A better alternative designed specifically for PoW has been found: RandomX.

We should change the hash function for K2 PoW to RandomX.

Exception 0xc0000005 0x0 0x0 0x7ff93bb53e42

spacemesh-log-7f8f332c(2).txt.zip

Exception 0xc0000005 0x0 0x0 0x7ff93bb53e42

Make profiler available in recent main

As in the title, we need the profiler tool available in the main branch of the repo, preferably even as a release artifact binary that anyone can download.

remove K3 pow

Quoting a conversation with @iddo333 from Slack:
@iddo333:

k3pow won't be needed, but k3 itself will be needed when we switch to distributed verification. So the code that solves the PoW for k3pow isn't needed (it shouldn't be compiled; you can comment it out or delete it so it's only in the git history. If we find a new reason to use k3pow we can revive this code, but it's unlikely). The code that takes the random-looking k3pow seed and verifies the subset (by using blake3 to create the random-looking sequence) will be used in the future for distributed verification, by replacing the k3pow seed with a different seed.

@poszu:

Is it OK to remove k3pow from the POST proof and k3_pow_difficulty from the configuration of POST prover then? I will leave the code to pick random indices for distr. verification.

@iddo333:

Yes that's ok for now (the code will remain in the git history so it's no big deal, it's probably enough just in the git commit message mention k3pow removal, but I guess you can also do git tag).

I have a set of POS data that are valid when I use the postcli tool to verify. But when I use go-spacemesh, I get this error in the log.
2023-11-27T10:00:48.630-0800 INFO e2a89.post proving: generated proof {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "post"}
2023-11-27T10:00:48.632-0800 INFO e2a89.atxBuilder created the initial post {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "atxBuilder"}
2023-11-27T10:00:48.632-0800 INFO e2a89.atxBuilder verifying the initial post {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "atxBuilder", "post": {"nonce": 284, "indices":"8e22b74a006b4f2c275874d3e604766f65c2f022ec2f1d666faf46844f58ddba584abe73177d17584e03ea9dc87f502f1865900f7edf3962b9aa085c0a8426b4cbe131f3b1518d2fdd395d4e6b9d877d25531c7178622aa74f5bc8e47505576dbfcedb0036dafe627d7243bdf3ed73a836bdf48e35c045ea59ef451245a41fa764d98ca151407fefb8d16bfa32a881c5446d26be0baae1c7d86195831a47842e21667bfc1824ad57c9f8a501"}, "metadata": {"Challenge": "0000000000000000000000000000000000000000000000000000000000000000", "LabelsPerUnit": 4294967296}, "name": "atxBuilder"}
2023-11-27T10:00:48.649-0800 ERROR e2a89.nipostValidator Proof is invalid: invalid proof of work{"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "nipostValidator","module":"post::post_impl", "file": "ffi/src/post_impl.rs", "line": 203}
2023-11-27T10:00:48.649-0800 FATAL e2a89.atxBuilder initial POST proof is invalid. Probably the initialized POST data is corrupted. Please verify the data with postcli and regenerate the corrupted files. {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "atxBuilder", "errmsg": "verify PoST: invalid proof", "name": "atxBuilder"}

implement certifier web service

See spacemeshos/pm#290 for the general design.

Acceptance criteria

a simple web service with an HTTP API that certifies that a given node ID has a valid POST proof,
easily deployable with Docker.

Interruptible proof generation

Status quo

Currently, there is no way to interrupt proof generation other than killing the application.

Desired behavior

It should be possible to interrupt proof generation and continue later from the point it stopped.
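One possible shape for this, as a minimal sketch only: the proving loop checks a shared stop flag between chunks and, when interrupted, returns a checkpoint from which a later call can resume. The names `Checkpoint`, `Outcome` and `scan_chunks` are hypothetical and are not part of the post-rs API; the `work` closure stands in for the real per-chunk proving step.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical resumable state: the index of the next chunk to process.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Checkpoint {
    pub next_chunk: usize,
}

/// Outcome of one call to `scan_chunks`.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// All chunks were processed.
    Finished,
    /// The stop flag was raised; resume later from this checkpoint.
    Interrupted(Checkpoint),
}

/// Processes `chunks` starting at `from`, checking `stop` before each chunk.
/// `work` is a stand-in for the real per-chunk proving step (an assumption).
pub fn scan_chunks<F: FnMut(usize, &[u8])>(
    chunks: &[Vec<u8>],
    from: Checkpoint,
    stop: &AtomicBool,
    mut work: F,
) -> Outcome {
    for (i, chunk) in chunks.iter().enumerate().skip(from.next_chunk) {
        if stop.load(Ordering::Relaxed) {
            // Chunk `i` has not been processed yet, so it is where we resume.
            return Outcome::Interrupted(Checkpoint { next_chunk: i });
        }
        work(i, chunk.as_slice());
    }
    Outcome::Finished
}
```

A real implementation would presumably also have to persist whatever partial results were collected before the interruption; the sketch only tracks the position.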

Migrate to tracing in the service

For structured, context-aware logging, migrate to https://github.com/tokio-rs/tracing.

Test new initialization on AMD on Linux

We need to validate AMD GPUs on Linux with OpenCL. The problem is that installing amdgpu is troublesome: it does not work out of the box for the 6600XT, even when fully installed and correctly configured.

POS initialization failure on Windows 10 with NVIDIA 3060

Error executing function: clEnqueueNDRangeKernel("scrypt")
Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

reported initially via discord: https://discord.com/channels/623195163510046732/1179746300057624608/1179746300057624608

Profiler does not seem to work correctly on Linux amd64

cmdline:

time ./profiler --k2-pow-difficulty 891576961504 --data-file profiler_data.bin --data-size 1 --duration 10 -t 1
^C

real	5m3.591s
user	76m50.815s
sys	0m5.756s

No data was created on disk even though the profiler ran for 5 minutes (instead of the 10 seconds requested). The screenshot shows the CPU usage.

When I removed --k2-pow-difficulty 891576961504 from the cmdline then it started to behave ok.

root@testnet-04-miner-1-0:~/data/post/profiler_d# time ./profiler  --data-file profiler_data.bin --data-size 1 --duration 10 -t 1
{
  "time_s": 10.6147993,
  "speed_gib_s": 0.9420809303478777
}

real	0m10.644s
user	0m9.243s
sys	0m1.402s

profiler release on M1/M2 does NOT use hw AES acceleration

Checked on M2:

nj ~/Downloads $ ./profiler
{
  "time_s": 13.723401458,
  "speed_gib_s": 0.2186046957222229
}

It clearly does not use hardware AES acceleration.

The same machine, compiled manually:

nj ~/workspace/post-rs main* $ RUSTFLAGS="--cfg=aes_armv8" cargo run --release -p profiler
    Finished release [optimized] target(s) in 0.08s
     Running `target/release/profiler`
{
  "time_s": 10.456489666,
  "speed_gib_s": 1.8170533904681048
}

Improve blacklisting initialization providers

#98 introduced a way to blacklist OCL platforms and devices via environment variables (POST_OCL_PLATFORMS_BLACKLIST and POST_OCL_DEVICES_BLACKLIST). It turns out to be too difficult to configure for less technical users.

The proposal is to allow configuring the blacklist via the config instead.
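As a rough illustration of what a config-driven blacklist could look like, the sketch below filters a list of providers against name fragments taken from a config section. The `ProviderBlacklist` and `Provider` types, their field names and the case-insensitive substring matching rule are all assumptions made for this example; they are not the existing post-rs config format.

```rust
/// Hypothetical config section; field names are illustrative only.
#[derive(Debug, Default, Clone)]
pub struct ProviderBlacklist {
    pub platforms: Vec<String>,
    pub devices: Vec<String>,
}

/// Minimal description of an OpenCL provider for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Provider {
    pub platform: String,
    pub device: String,
}

impl ProviderBlacklist {
    /// True if the platform or device name contains a blacklisted entry
    /// (case-insensitive substring match; empty entries are ignored).
    pub fn blocks(&self, p: &Provider) -> bool {
        let hit = |name: &str, list: &[String]| {
            let name = name.to_lowercase();
            list.iter()
                .any(|e| !e.is_empty() && name.contains(e.to_lowercase().as_str()))
        };
        hit(&p.platform, &self.platforms) || hit(&p.device, &self.devices)
    }

    /// Keeps only the providers that are not blacklisted.
    pub fn filter(&self, providers: Vec<Provider>) -> Vec<Provider> {
        providers.into_iter().filter(|p| !self.blocks(p)).collect()
    }
}
```

Matching on name fragments rather than numeric IDs is a design guess aimed at less technical users, who are more likely to know "Intel" than a platform index.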

Allow using externally created K2Pow to generate a POST proof

It should be possible to provide a k2 proof of work to generate the POST with in order to avoid recomputing it.

Requirements:

generate_proof() should accept an optional k2pow parameter (an array per each nonce group?)
if k2pow is provided for given nonce group - use it
otherwise - compute it
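The requirements above can be sketched as a small lookup helper: use the supplied value for a nonce group when there is one, otherwise fall back to computing it. This is only an illustration; `k2pow_for_group`, the `u64` representation of the PoW value and the `compute` closure are assumptions and do not reflect the actual `generate_proof()` signature.

```rust
/// Returns the K2 PoW value for `nonce_group`: the externally supplied one if
/// present, otherwise the result of `compute` (a stand-in for the real PoW search).
pub fn k2pow_for_group<F: FnOnce(usize) -> u64>(
    provided: Option<&[Option<u64>]>,
    nonce_group: usize,
    compute: F,
) -> u64 {
    provided
        // Look up the slot for this group; a missing slot or `None` means "not provided".
        .and_then(|per_group| per_group.get(nonce_group).copied().flatten())
        // Fall back to computing the PoW only when nothing was supplied.
        .unwrap_or_else(|| compute(nonce_group))
}
```

Using one optional entry per nonce group lets a caller supply only some of the values and leave the rest to be computed.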

Exception when querying providers on Windows integrated graphics

There is a crash in OpenCL.dll when querying properties of an openCL platform that happens on some systems (for example by calling scrypt_ocl::get_providers_count()).

It was narrowed down to Intel's integrated graphics (both Intel HD and Intel Iris). Disabling the driver (leaving Nvidia enabled) resolves the issue.

An issue in the ocl library tracking the same: cogciprocate/ocl#219.

Please see the logs.

spacemesh-log-7f8f332c.txt

Sentry report for this exact issue: https://spacemesh.sentry.io/issues/4234116670/?project=6324919&query=is%3Aunresolved&referrer=issue-stream&stream_index=0

Postcli shows POS Data as Valid but still produces invalid proofs

A POST proof generated on M1 Mac is invalid.

Logs: spacemesh-log-7c8cef2b.zip

The POST proof generation code thinks that the files are 4096B in size:

invalid POS file, expected size: 2147483648 vs actual size: 4096	{"node_id": "358c764132cff490de0266420624b561392bf14acada9112a9fc1d7e350adb6d", "module": "nipostValidator", "module": "post::reader", "file": "src/reader.rs", "line": 102}

Crash from 3090 CL_MEM_OBJECT_ALLOCATION_FAILURE

2023/05/15 10:52:30     DEBUG   initialization: file #2 current position: 39845888, remaining: 27262976
2023/05/15 10:52:32     DEBUG   initialization: file #2 current position: 40894464, remaining: 26214400
2023/05/15 10:52:34     DEBUG   initialization: file #2 current position: 41943040, remaining: 25165824
2023/05/15 10:52:36     DEBUG   initialization: file #2 current position: 42991616, remaining: 24117248
2023/05/15 10:52:39     DEBUG   initialization: file #2 current position: 44040192, remaining: 23068672
2023/05/15 10:52:41     DEBUG   initialization: file #2 current position: 45088768, remaining: 22020096
2023/05/15 10:52:43     DEBUG   initialization: file #2 current position: 46137344, remaining: 20971520
2023/05/15 10:52:45     DEBUG   initialization: file #2 current position: 47185920, remaining: 19922944
2023/05/15 10:52:47     DEBUG   initialization: file #2 current position: 48234496, remaining: 18874368
2023/05/15 10:52:50     DEBUG   initialization: file #2 current position: 49283072, remaining: 17825792
2023/05/15 10:52:52     DEBUG   initialization: file #2 current position: 50331648, remaining: 16777216
2023/05/15 10:52:54     DEBUG   initialization: file #2 current position: 51380224, remaining: 15728640
2023/05/15 10:52:56     DEBUG   initialization: file #2 current position: 52428800, remaining: 14680064
2023/05/15 10:52:58     DEBUG   initialization: file #2 current position: 53477376, remaining: 13631488
2023/05/15 10:53:01     DEBUG   initialization: file #2 current position: 54525952, remaining: 12582912
2023/05/15 10:53:03     DEBUG   initialization: file #2 current position: 55574528, remaining: 11534336
2023/05/15 10:53:05     DEBUG   initialization: file #2 current position: 56623104, remaining: 10485760
2023/05/15 10:53:07     DEBUG   initialization: file #2 current position: 57671680, remaining: 9437184
2023/05/15 10:53:09     DEBUG   initialization: file #2 current position: 58720256, remaining: 8388608
2023/05/15 10:53:12     DEBUG   initialization: file #2 current position: 59768832, remaining: 7340032
2023/05/15 10:53:14     DEBUG   initialization: file #2 current position: 60817408, remaining: 6291456
2023/05/15 10:53:16     DEBUG   initialization: file #2 current position: 61865984, remaining: 5242880
2023/05/15 10:53:18     DEBUG   initialization: file #2 current position: 62914560, remaining: 4194304
2023/05/15 10:53:21     DEBUG   initialization: file #2 current position: 63963136, remaining: 3145728
Found new smallest nonce: Some(VrfNonce { index: 199214068, label: [0, 0, 0, 1, 10, 251, 93, 246, 161, 28, 201, 140, 231, 38, 180, 178, 27, 93, 64, 199, 96, 105, 160, 107, 8, 151, 154, 192, 67, 170, 150, 173] })
2023/05/15 10:53:23     INFO    initialization: file #2, found nonce: 199214068, value: 000000010afb5df6a11cc98ce726b4b2
2023/05/15 10:53:23     INFO    initialization: file #2, found new best nonce
2023/05/15 10:53:23     DEBUG   initialization: file #2 current position: 65011712, remaining: 2097152
2023/05/15 10:53:25     DEBUG   initialization: file #2 current position: 66060288, remaining: 1048576
2023/05/15 10:53:27     INFO    initialization: file #2 completed; number of labels written: 67108864
2023/05/15 10:53:27     INFO    initialization: starting to write file #3; target number of labels: 67108864, start position: 201326592
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12128, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 388096 bytes
Allocating buffer for lookup: 6358564864 bytes
2023/05/15 10:53:27     DEBUG   initialization: file #3 current position: 0, remaining: 67108864
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OclError(OclCore(Api(

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueNDRangeKernel("scrypt")

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors

#############################################################################
)))', ffi/src/initialization.rs:146:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
SIGABRT: abort
PC=0x7fd345a2fa7c m=9 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1 [syscall]:
runtime.cgocall(0x523d16, 0xc0001495f8)
        /root/go/src/runtime/cgocall.go:158 +0x5c fp=0xc0001495d0 sp=0xc000149598 pc=0x40595c
github.com/spacemeshos/post/internal/postrs._Cfunc_initialize(0x7fd13cdcf8a0, 0xc000000, 0xc0fffff, 0x7fd13d0020d0, 0xc0002820b0)
        _cgo_gotypes.go:307 +0x4c fp=0xc0001495f8 sp=0xc0001495d0 pc=0x4f562c
github.com/spacemeshos/post/internal/postrs.cScryptPositions.func2(0x7fd13cdcf8a0, 0x1?, 0x7fd13d0020d0?, 0x657be0?, 0x0?)
        /root/post/internal/postrs/initializer.go:109 +0x7f fp=0xc000149658 sp=0xc0001495f8 pc=0x4f6bdf
github.com/spacemeshos/post/internal/postrs.cScryptPositions(0xc00013a000?, 0xc000149700?, 0xc000000, 0xc0fffff)
        /root/post/internal/postrs/initializer.go:109 +0xf1 fp=0xc000149708 sp=0xc000149658 pc=0x4f6831
github.com/spacemeshos/post/internal/postrs.(*Scrypt).Positions(0xc000280030, 0xc000000, 0xc0fffff)
        /root/post/internal/postrs/api.go:159 +0x7e fp=0xc000149760 sp=0xc000149708 pc=0x4f60fe
github.com/spacemeshos/post/oracle.(*WorkOracle).Positions(0x4000000?, 0x55dbd9?, 0x20?)
        /root/post/oracle/oracle.go:164 +0x33 fp=0xc0001497b8 sp=0xc000149760 pc=0x506513
github.com/spacemeshos/post/initialization.(*Initializer).initFile(0xc000176000, {0x588b18, 0xc00016c340}, 0xc000175a30?, 0x100000, 0xc000000, 0x4000000, {0xc00001a1a0, 0x20, 0x20})
        /root/post/initialization/initialization.go:479 +0xab0 fp=0xc0001499d8 sp=0xc0001497b8 pc=0x51f450
github.com/spacemeshos/post/initialization.(*Initializer).Initialize(0xc000176000, {0x588b18, 0xc00016c340})
        /root/post/initialization/initialization.go:266 +0x58a fp=0xc000149cb8 sp=0xc0001499d8 pc=0x51d9ea
main.main()
        /root/post/cmd/postcli/main.go:133 +0x3e5 fp=0xc000149f80 sp=0xc000149cb8 pc=0x522c85
runtime.main()
        /root/go/src/runtime/proc.go:250 +0x212 fp=0xc000149fe0 sp=0xc000149f80 pc=0x439052
runtime.goexit()
        /root/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc000149fe8 sp=0xc000149fe0 pc=0x465e41

goroutine 2 [force gc (idle), 6 minutes]:

"POS data is invalid" error on Apple Silicon

I've seen a few reports of this. It happened to me when I initialized using a Mac M2. Another user just reported to me:

I still fail to smesh using my M2 MacBook after syncing. If I use postcli with this command:
./postcli -verify -datadir ~/post/7c8cef2b -fraction 0.1
It complains:
POS data is invalid: invalid label in file 0 at offset 532096 {"module": "post::initialization", "file": "ffi/src/initialization.rs", "line": 238}

Here are a couple of reports on Discord, there may be others:

Possibly related: #123

Possibility to specify CPU affinity for proving threads

Raised initially on Discord by Samovar

Improve logging

Currently, the library prints to stdout. This could be improved by using the log logging facade. The C library could then expose an API to configure logging (either to log directly to stdout/stderr or to provide a callback that passes logs up the stack to the user of the post-rs lib).

profiler gives wrong results because of file caching

The goal of the profiler is to help pick the right combination of threads and parallel nonces to get optimal POST proving speed (going over the data on disk and finding labels) given the user's real disk read speed.

The profiler creates a file on a disk filled with garbage and then goes over it doing its work a few times in a loop. The problem is that the file gets cached in RAM in the first iteration so all subsequent ones read from RAM instead of disk. This defeats the purpose because the benchmark is not limited by disk read speed anymore.

Error: No such file or directory (os error 2)

Error

ChildProcess.(posProfiler.ts)

Error: No such file or directory (os error 2)

Location:
profiler/src/util/macos.rs:4:16

Sentry report

implement basic operator API

RUSTSEC-2021-0145: Potential unaligned read

Potential unaligned read

Details
Status	unsound
Package	`atty`
Version	`0.2.14`
URL	softprops/atty#50
Date	2021-07-04

On windows, atty dereferences a potentially unaligned pointer.

In practice however, the pointer won't be unaligned unless a custom global allocator is used.

In particular, the System allocator on windows uses HeapAlloc, which guarantees a large enough alignment.

atty is Unmaintained

A Pull Request with a fix has been provided over a year ago but the maintainer seems to be unreachable.

Last release of atty was almost 3 years ago.

Possible Alternative(s)

The below list has not been vetted in any way and may or may not contain alternatives;

std::io::IsTerminal - Stable since Rust 1.70.0
is-terminal - Standalone crate supporting Rust older than 1.70.0

See advisory page for additional details.
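For reference, a minimal sketch of the standard-library alternative mentioned in the advisory. `std::io::IsTerminal` is implemented for the standard streams and for `File`; the helper names `stdout_is_tty` and `is_tty` are made up for this example.

```rust
use std::io::{self, IsTerminal};

/// True when stdout is attached to a terminal (the typical use of `atty`).
pub fn stdout_is_tty() -> bool {
    io::stdout().is_terminal()
}

/// Generic variant for any stream or file that implements `IsTerminal`.
pub fn is_tty<T: IsTerminal>(stream: &T) -> bool {
    stream.is_terminal()
}
```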

Don't start profiler timer until disk is spun up

I noticed that when I run the profiler on a dedicated disk that's in deep sleep, the first read is extra slow, but a subsequent read is faster:

> profiler -t 0 -n 144 --data-file /mnt/smesher-02/post/7c8cef2b/postdata_0.bin
{
  "time_s": 12.901740377,
  "speed_gib_s": 0.07750892288785359
}
> profiler -t 0 -n 144 --data-file /mnt/smesher-02/post/7c8cef2b/postdata_0.bin
{
  "time_s": 11.053322122,
  "speed_gib_s": 0.18094107616924474
}
> profiler -t 0 -n 144 --data-file /mnt/smesher-02/post/7c8cef2b/postdata_1.bin
{
  "time_s": 10.225100351,
  "speed_gib_s": 0.1955971023604089
}

(Changing nonces and threads here has no impact.)

Profiler should be smart enough to handle this case. It's probably as easy as having it perform one tiny "dummy read" before the timer starts, to make sure the disk is awake.
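The "dummy read" idea could look roughly like the sketch below: a small untimed read is issued first, and the timer starts only once the device has answered. The function name `timed_read_with_warmup` and the 512-byte probe size are assumptions for illustration, not the profiler's actual code.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;
use std::time::{Duration, Instant};

/// Reads a few bytes first so a sleeping disk can spin up, then times a full
/// sequential read. Returns (bytes read in the timed phase, elapsed time).
pub fn timed_read_with_warmup(path: &Path, buf_size: usize) -> io::Result<(u64, Duration)> {
    let mut file = File::open(path)?;

    // Warm-up: a tiny untimed read whose latency absorbs any spin-up delay.
    let mut probe = [0u8; 512];
    let _ = file.read(&mut probe)?;
    file.seek(SeekFrom::Start(0))?;

    // Timed phase: starts only after the device has answered once.
    let mut buf = vec![0u8; buf_size.max(1)];
    let mut total = 0u64;
    let start = Instant::now();
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        total += n as u64;
    }
    Ok((total, start.elapsed()))
}
```

One caveat: if the first bytes of the file are already in the OS page cache, the probe may be served from RAM and never reach the disk, so a real fix might need to read from an uncached offset.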

Fails on 3090 after 11G initialized - CL_NV_INVALID_MEM_ACCESS

Allocating buffer for lookup: 6358564864 bytes
2023/05/12 09:37:09     DEBUG   initialization: file #10 current position: 59768832, remaining: 7340032
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OclError(OclCore(Api(

################################ OPENCL ERROR ###############################

Error executing function: clCreateContext

Status error code: CL_NV_INVALID_MEM_ACCESS (-9999)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateContext.html#errors

#############################################################################
)))', ffi/src/initialization.rs:145:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
SIGABRT: abort

Everything was working fine and then this crashed.

This could be partially caused by spacemeshos/post#138

Consider sharing randomx cache and dataset between prover and verifier

Currently, there are separate instances of RandomX cache and dataset on proving and verifying path. In some cases, it could make sense to share them. For example, when a verifier is configured in FAST mode (2GB), then it could share its dataset with the prover.

Doing it in the other direction (to share a dataset with the verifier) makes less sense as a prover is created once every epoch and it exists only for a short time.
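A minimal sketch of the sharing direction described above, assuming the dataset is read-only once built: the long-lived verifier owns the dataset behind an `Arc`, and a short-lived prover takes a cheap reference-counted handle instead of building its own copy. `Dataset`, `Verifier` and `Prover` here are hypothetical stand-ins, not the RandomX bindings or the post-rs types.

```rust
use std::sync::Arc;

/// Stand-in for an expensive, read-only RandomX dataset (hypothetical type).
pub struct Dataset {
    pub bytes: Vec<u8>,
}

/// Long-lived verifier that owns a handle to the dataset.
pub struct Verifier {
    dataset: Arc<Dataset>,
}

/// Short-lived prover that reuses the verifier's dataset.
pub struct Prover {
    dataset: Arc<Dataset>,
}

impl Verifier {
    pub fn new(dataset: Dataset) -> Self {
        Self { dataset: Arc::new(dataset) }
    }

    /// Creates a prover sharing this verifier's dataset; no data is copied.
    pub fn spawn_prover(&self) -> Prover {
        Prover { dataset: Arc::clone(&self.dataset) }
    }

    pub fn dataset(&self) -> &Arc<Dataset> {
        &self.dataset
    }
}

impl Prover {
    pub fn dataset(&self) -> &Arc<Dataset> {
        &self.dataset
    }
}
```

When the prover is dropped at the end of its short lifetime, only its handle goes away; the dataset stays alive for the verifier.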

Add k2pow calculation to the profiler

I think we should add a benchmark for a basic unit (64 nonces, 256 GiB) to the profiler tool.

It is currently missing, which could result in someone initializing too much.

Fails on 4090

[email protected]:~/post-rs$ cargo run --release -p initializer -- -l 12032000 --node-id "hBGTHs44tav7YR87sRVafuzZwObCZnK1Z/exYpxwqSQ=" --commitment-atx-id "ZuxocVjIYWfv7A/K1Lmm8+mNsHzAZaWVpbl5+KINx+I=" -m 1000000000
   Compiling initializer v0.1.8 (/root/post-rs/initializer)
    Finished release [optimized] target(s) in 1.88s
     Running `target/release/initializer -l 12032000 --node-id hBGTHs44tav7YR87sRVafuzZwObCZnK1Z/exYpxwqSQ= --commitment-atx-id ZuxocVjIYWfv7A/K1Lmm8+mNsHzAZaWVpbl5+KINx+I= -m 1000000000`
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 4090
device memory: 24217 MB, max_mem_alloc_size: 6054 MB, max_compute_units: 128, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12096, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 387072 bytes
Allocating buffer for lookup: 6341787648 bytes
Error: initializing: Fail in OpenCL:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueNDRangeKernel("scrypt")

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors

#############################################################################


Location:
    initializer/src/main.rs:199:22

Crashing on Apple Silicon - verify_proof causes segfault

Yesterday we noticed that after the upgrade there are problems on Apple Silicon with verify_proof.

We have checked and disabling JIT for RandomX makes the problem go away.

Running on Rosetta also helps.

help for verify error data

ERROR 636ff.post Proof is invalid: MSB value for index: 117717007989 doesn't satisfy difficulty: 215 > 0 (label: [145, 107, 182, 205, 188, 156, 238, 115, 148, 20, 39, 181, 44, 4, 119, 80]) {"node_id": "", "module": "post", "module": "post::post_impl", "file": "ffi\src\post_impl.rs", "line": 242}

initial POST proof is invalid. Probably the initialized POST data is corrupted. Please verify the data with postcli and regenerate the corrupted files

I found this code:
(https://github.com/spacemeshos/post-rs/blob/main/ffi/src/post_impl.rs#L239)

and I would like it to tell me which postdata file contains the error.

Memory access violation (3221225477) crashes the Node while calculating PoW

We have a report in Discord that go-spacemesh unexpectedly quits.

Go-spacemesh logs look fine and do not contain any errors.
However, the app logs indicate that the node process exits with the code 3221225477.

Here is a part of log files:

Go-spacemesh:

...
2023-12-27T22:28:26.699Z	INFO	800e3.post	initialization: file already initialized	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "post", "fileIndex": 671, "currentNumLabels": 134217728, "targetNumLabels": 134217728, "startPosition": 90060095488}
2023-12-27T22:28:26.699Z	INFO	800e3.post	initialization: completed, found nonce	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "post", "nonce": 81997299014}
2023-12-27T22:28:26.699Z	INFO	800e3.post	post setup completed	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "post", "node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "commitment_atx": "9399f34ca0", "data_dir": "E:\\HDDNODE4", "num_units": "21", "labels_per_unit": "4294967296", "provider": "{824636000664}", "name": "post"}
2023-12-27T22:28:26.705Z	INFO	800e3.atxBuilder	loaded the initial post from disk	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "atxBuilder"}
2023-12-27T22:28:26.705Z	INFO	800e3.atxBuilder	verifying the initial post	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "atxBuilder", "post": {"nonce": 179, "indices": "9f32e0326039241d242889fa5e051c0a1fd2601fa8321fdea40f70842a139ca6003bcf8617af6ef504e3df3c9863586785070f2ed956f7a12531444074d6d6d888c894391bd159cceb2308b58ace844bff34a1c416cb849666a915dd62e301a371f021dd020f3a8efceae9dc329e3f4898e666e82609e110f5c5c2dca24eac98c234d313c6a698fa23a095dab51623fb3c247a6813d340abed5e9761bfbd88b54338f695cd2ee80ae9681701"}, "metadata": {"Challenge": "0000000000000000000000000000000000000000000000000000000000000000", "LabelsPerUnit": 4294967296}, "name": "atxBuilder"}
2023-12-27T22:28:26.815Z	INFO	800e3.atxBuilder	atx challenge is ready	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "atxBuilder", "sessionId": "0d2a3100-fd10-4ad5-a1f4-ca91d8129d18", "current_epoch": "11", "publish_epoch": "11", "target_epoch": "12", "name": "atxBuilder"}
2023-12-27T22:28:26.815Z	INFO	800e3.nipostBuilder	building nipost	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostBuilder", "sessionId": "0d2a3100-fd10-4ad5-a1f4-ca91d8129d18", "poet round start": "2023-12-11T08:00:00.000Z", "poet round end": "2023-12-24T20:00:00.000Z", "publish epoch": "11", "publish epoch end": "2023-12-29T08:00:00.000Z", "target epoch": "12", "name": "nipostBuilder"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostBuilder	starting post execution	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostBuilder", "challenge": "VgDHwLYegT5coz4ptGFD9ORuJeEVBEoVIwiC0CsXcjo=", "name": "nipostBuilder"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator	scaling post verifier	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator", "current": 9, "new": 1}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-8	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-1	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-5	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-7	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-4	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-2	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-3	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-6	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.960Z	INFO	800e3.nipostValidator	generating proof with PoW flags: RandomXFlag(FLAG_HARD_AES | FLAG_FULL_MEM | FLAG_JIT | FLAG_ARGON2_SSSE3 | FLAG_ARGON2_AVX2) and params: ProvingParams { difficulty: 5317578556, pow_difficulty: [0, 0, 170, 111, 105, 238, 214, 162, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195] }	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator", "module": "post::prove", "file": "src\\prove.rs", "line": 276}
2023-12-27T22:28:27.331Z	INFO	grpc	finished streaming call with code OK	{"module": "grpc", "grpc.start_time": "2023-12-27T22:19:58Z", "system": "grpc", "span.kind": "server", "grpc.service": "spacemesh.v1.SmesherService", "grpc.method": "PostSetupStatusStream", "peer.address": "127.0.0.1:49946", "grpc.code": "OK", "grpc.time_ms": 509008.72}
2023-12-27T22:28:27.642Z	INFO	grpc	finished streaming call with code OK	{"module": "grpc", "grpc.start_time": "2023-12-27T22:19:57Z", "system": "grpc", "span.kind": "server", "grpc.service": "spacemesh.v1.SmesherService", "grpc.method": "PostSetupStatusStream", "peer.address": "127.0.0.1:49946", "grpc.code": "OK", "grpc.time_ms": 510008.3}
2023-12-27T22:28:40.483Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "05686864f8", "layer_id": 47462, "name": "blockHandler"}
2023-12-27T22:28:40.514Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47461, "block": "0b73fbbbac", "state_hash": "0xb9e5cb60045e4c592357b51cc8611805fcea1658ad18b770fba1102619eb8833", "duration": "59.4948ms", "count": 3, "rewards": 42, "name": "executor"}
2023-12-27T22:28:40.633Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "86b9eb89c9", "layer_id": 47464, "name": "blockHandler"}
2023-12-27T22:28:40.887Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47462, "block": "05686864f8", "state_hash": "0xaf193fd9dba8fba8d5e14848f6e8348b2bdc0d27d4cba5ebb4ab36e2e68e9d6c", "duration": "43.1586ms", "count": 2, "rewards": 47, "name": "executor"}
2023-12-27T22:28:40.935Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47463, "block": "afcde47b00", "state_hash": "0xd2496d0bc4f538c98ff6d9c83de2ff7fde108e940524c8d716880ae719e44fca", "duration": "41.1809ms", "count": 1, "rewards": 48, "name": "executor"}
2023-12-27T22:28:40.975Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47464, "block": "86b9eb89c9", "state_hash": "0x697b54dda4cf332e696e8b00511172628514427c9053252f569467d301e847e4", "duration": "39.1292ms", "count": 0, "rewards": 57, "name": "executor"}
2023-12-27T22:28:44.688Z	INFO	grpc	started streaming call	{"module": "grpc", "grpc.start_time": "2023-12-27T22:28:44Z", "system": "grpc", "span.kind": "server", "grpc.service": "spacemesh.v1.AdminService", "grpc.method": "EventsStream", "peer.address": "127.0.0.1:49946"}
2023-12-27T22:28:45.856Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "5d7e197d88", "layer_id": 47465, "name": "blockHandler"}
2023-12-27T22:28:46.318Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "218cdf8128", "layer_id": 47466, "name": "blockHandler"}
2023-12-27T22:28:46.509Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "0ca98cc9c3", "layer_id": 47467, "name": "blockHandler"}
2023-12-27T22:28:46.520Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47465, "block": "5d7e197d88", "state_hash": "0xaddf8c52ca05242676aaa426f342139b00d6d16a8c5fe83bce63eaee28a774dd", "duration": "51.3029ms", "count": 1, "rewards": 42, "name": "executor"}
2023-12-27T22:28:47.238Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47466, "block": "218cdf8128", "state_hash": "0x1ebe77e764256def431b07f15e5f072f7788b51218653c32e2ee6fb8b7975490", "duration": "45.5812ms", "count": 3, "rewards": 43, "name": "executor"}
2023-12-27T22:28:47.285Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47467, "block": "0ca98cc9c3", "state_hash": "0xae8d7935e393aafbce6596cfc5ac88f0eddf54c2174591ed1eb5028b16ddfcab", "duration": "47.7282ms", "count": 1, "rewards": 48, "name": "executor"}
2023-12-27T22:28:47.514Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "bd553bf3d4", "layer_id": 47468, "name": "blockHandler"}
2023-12-27T22:28:47.927Z	WARN	800e3.sync	mesh failed to process layer from sync	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "sync", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "layer_id": 47470, "errmsg": "get block: get block bd553bf3d4: database: not found", "name": "sync"}
2023-12-27T22:28:47.986Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "36ab2e082e", "layer_id": 47469, "name": "blockHandler"}
2023-12-27T22:29:02.364Z	INFO	800e3.nipostValidator	calculating proof of work for nonces 0..288	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator", "module": "post::prove", "file": "src\\prove.rs", "line": 131}

App-logs:

...
[2023-12-27 22:28:27.634] [info]  SmesherService, in grpc PostDataCreationProgressStream, output: {"state":5,"numLabelsWritten":90194313216} 
[2023-12-27 22:28:27.636] [info]  SmesherService, in PostSetupStatusStream, output: {"postSetupState":5,"numLabelsWritten":90194313216,"opts":{"dataDir":"E:\\HDDNODE4","numUnits":21,"maxFileSize":2147483648,"providerId":4294967295}} 
[2023-12-27 22:28:27.636] [info]  SmesherService, in PostSetupStatusStream, output: "Status complete -> closing stream" 
[2023-12-27 22:28:33.775] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":25,"isSynced":false,"syncedLayer":47465,"topLayer":47981,"verifiedLayer":47460} 
[2023-12-27 22:28:38.098] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":26,"isSynced":false,"syncedLayer":47465,"topLayer":47981,"verifiedLayer":47460} 
[2023-12-27 22:28:45.280] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":26,"isSynced":false,"syncedLayer":47465,"topLayer":47981,"verifiedLayer":47464} 
[2023-12-27 22:28:49.045] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":26,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467} 
[2023-12-27 22:29:02.323] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":25,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467} 
[2023-12-27 22:29:13.199] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":24,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467} 
[2023-12-27 22:29:14.619] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":23,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467} 
[2023-12-27 22:29:34.373] [error] syncSmesherInfo, in subscribeNodeEvents, error: "Error: 14 UNAVAILABLE: read ECONNRESET" 
[2023-12-27 22:29:34.821] [error] NodeManager, in Node Process close, error: 3221225477 
...

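My guess is that prove.rs reads from an invalid address, which crashes the app. The exit code 3221225477 in the app log is 0xC0000005 (access violation on Windows), the same code as the exception in the issue title.

You can download the full logs from the report in Discord, or ask the user to help with the investigation (e.g. by running Smapp with the --pprof-server flag so that it is passed to go-spacemesh, and then fetching the required profiles).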

OpenCL-based initialization

Given the recent problems with Vulkan and CUDA, we need an alternative implementation. To lower the maintenance cost, let's use OpenCL.

The code has to live in this repo, and then we might need to integrate it with post.

There is a parallel effort to fix gpu-post.

Initialization: GPU->host bandwidth could be halved

Currently, the code transfers 32 bytes of each label from GPU to the host. 32B are needed to search for the POPS VRF nonce because all 32B are compared with difficulty. Later on, only the most significant (big-endian) 16B are preserved in the POS data.

It's possible to optimize the bandwidth and send only the 16B of each label. The missing lower 16B could be generated on the CPU on the rare occasion (1/2^128) when the most significant bytes are not enough to decide.

Given a 16B label and 32B difficulty, there are 3 possible cases to consider:

label > difficulty[0:16] -> NOT a nonce
label == difficulty[0:16] -> POSSIBLY a nonce - generate lower 16 bytes of label
label < difficulty[0:16] -> VALID nonce

The code would need to generate a label on the CPU only if the 16 bytes of the label are equal to the most significant 16B of the difficulty.
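The three cases above can be sketched as a single comparison. This is a minimal illustration, not the repository's code: the names `NonceCheck` and `classify_label` are hypothetical, and it assumes both the label and the difficulty are stored big-endian, so that byte-wise lexicographic order equals numeric order.

```rust
use std::cmp::Ordering;

/// Outcome of comparing the 16 most significant label bytes with the difficulty.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq)]
enum NonceCheck {
    /// label > difficulty[0:16]: cannot be a nonce.
    NotNonce,
    /// label == difficulty[0:16]: undecided, the lower 16 bytes are needed.
    NeedsFullLabel,
    /// label < difficulty[0:16]: valid nonce.
    Valid,
}

/// Classify a truncated label against the full 32-byte difficulty.
fn classify_label(label_hi: &[u8; 16], difficulty: &[u8; 32]) -> NonceCheck {
    // Slices of equal length compare lexicographically, which matches
    // big-endian integer comparison.
    match label_hi[..].cmp(&difficulty[..16]) {
        Ordering::Greater => NonceCheck::NotNonce,
        Ordering::Equal => NonceCheck::NeedsFullLabel,
        Ordering::Less => NonceCheck::Valid,
    }
}
```

Only the `NeedsFullLabel` case would require regenerating the label on the CPU.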

Configurable retries in post-service

Add a new CLI parameter, --max-retries, to the PoST service. It defines how many times the service should try to connect to the node before giving up and shutting down.
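A rough sketch of the retry loop such a parameter could drive is shown below. The function `connect_with_retries`, the `delay` parameter and the closure-based `connect` are assumptions made for illustration. The sketch also interprets the value as the number of retries after the first attempt, which the issue does not specify.

```rust
use std::{thread, time::Duration};

/// Call `connect` until it succeeds or `max_retries` retries are used up.
/// Hypothetical helper: the first attempt is not counted as a retry.
fn connect_with_retries<T, E>(
    max_retries: u32,
    delay: Duration,
    mut connect: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut retries_used = 0;
    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            // Out of retries: give up and surface the last error.
            Err(err) if retries_used >= max_retries => return Err(err),
            Err(_) => {
                retries_used += 1;
                thread::sleep(delay);
            }
        }
    }
}
```

On the final `Err`, the service would log the failure and shut down.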

Post-Service support for generating proofs

Implement the gRPC post service as described in spacemeshos/pm#260

Acceptance criteria

- a binary that, when executed, tries to connect to the node at the given address
- CLI to configure:
  - POST proving
  - POST network parameters (with mainnet defaults)
  - mTLS
  - host address
- auto re-connecting
- mTLS
- support for the generate proof command
  - also verify the proof immediately
- release the binary as a CI artifact and include it in releases
- reasonable test coverage

Reduce CPU load during GPU initialization

The CPU load during initialization is constantly at 100% for each postcli process. This has been reported by many users.

This issue is due to the specific Nvidia implementation of the OpenCL synchronization. The enqueued buffer read operation is constantly probing the status of the running OpenCL kernel, which puts the CPU under high load.

A workaround would be to put the CPU thread to sleep for a defined duration right after enqueuing the kernel, and only then enqueue the buffer read. The sleep duration can be obtained by averaging the execution time over a number of kernel executions and then applying a safety factor (e.g. sleeping for 90% of the kernel execution time).
Ideally, the kernel execution time should be re-measured periodically by disabling the sleep for a couple of kernel executions every few seconds. This ensures that if the kernel execution time drops below the sleep duration (e.g. because the user increased the power limit of the GPU), the decrease is properly detected.

Tests on an RTX 3080 Ti show that a sleep duration of 25 ms reduces the CPU load to 10% while maintaining the initialization speed. Increasing the sleep duration further to 30 ms reduces the CPU load to 2.5%, but it has a negative impact on the initialization speed.

sleep (ms)	CPU load (%)	init. speed (MiB/s)
0	100	3.30
25	10	3.30
30	2.5	2.75
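The averaging step of the proposed workaround could look roughly like this. `SleepEstimator` and its methods are hypothetical names for this sketch. It assumes the caller measures kernel execution times with the sleep disabled and feeds them in, and that the safety factor is given as an integer percentage.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Keeps a sliding window of measured kernel times and derives a sleep duration.
struct SleepEstimator {
    samples: VecDeque<Duration>,
    window: usize,
    /// Share of the average kernel time to sleep for, e.g. 90.
    safety_percent: u32,
}

impl SleepEstimator {
    fn new(window: usize, safety_percent: u32) -> Self {
        Self { samples: VecDeque::new(), window: window.max(1), safety_percent }
    }

    /// Record one kernel execution time measured without sleeping.
    fn record(&mut self, kernel_time: Duration) {
        if self.samples.len() == self.window {
            // Drop the oldest sample so stale timings age out.
            self.samples.pop_front();
        }
        self.samples.push_back(kernel_time);
    }

    /// Sleep duration to use before enqueuing the buffer read.
    fn sleep_duration(&self) -> Duration {
        if self.samples.is_empty() {
            return Duration::ZERO; // nothing measured yet: do not sleep
        }
        let total: Duration = self.samples.iter().sum();
        let average = total / self.samples.len() as u32;
        average * self.safety_percent / 100
    }
}
```

The periodic re-measurement described above would call `record` every few seconds with fresh timings, so a faster kernel shortens the sleep once the older samples leave the window.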

Expose global_work_size via api

We need to expose global_work_size so that we don't lose performance when integrating with post.

Exposing it will also allow callers to create small but well-fitting batches, which improves the overall UX.

Tracking read performance issues

I've observed on multiple systems and operating systems that profiler speed results never exceed approx. 3 GiB/s, regardless of the number of nonces and the available hardware.

Edit: I have observed the same limitation on production nodes during the cycle gap.

Running multiple instances of the profiler allows for cumulative read speeds that scale to the system's CPU and IO limits, but still no more than ~3 GiB/s per instance.

Tests run with --data-size=32

Respect threads limitation in K2 PoW

Currently, the K2 proof of work uses all available cores. It should respect the threads parameter passed to generate_proof().
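One way to bound the parallelism is to spawn exactly `threads` workers and split the search space between them. The sketch below uses only the standard library and is not the repository's code: `search_pow` is a hypothetical name, and `meets_difficulty` is a toy stand-in for the real K2 PoW hash.

```rust
use std::thread;

/// Toy stand-in for the real PoW check (assumption, not the actual hash).
fn meets_difficulty(candidate: u64, difficulty: u64) -> bool {
    candidate.wrapping_mul(0x9E37_79B9_7F4A_7C15) < difficulty
}

/// Search `0..search_space` for the smallest candidate that meets `difficulty`,
/// using at most `threads` worker threads.
fn search_pow(threads: usize, search_space: u64, difficulty: u64) -> Option<u64> {
    let workers = threads.max(1);
    thread::scope(|s| {
        // Collect first so that all workers are running before any join.
        let handles: Vec<_> = (0..workers as u64)
            .map(|w| {
                s.spawn(move || {
                    // Worker `w` checks w, w + workers, w + 2 * workers, ...
                    (w..search_space)
                        .step_by(workers)
                        .find(|&c| meets_difficulty(c, difficulty))
                })
            })
            .collect();
        handles
            .into_iter()
            .filter_map(|h| h.join().expect("worker panicked"))
            .min()
    })
}
```

Because each worker returns the smallest hit in its own stride and the results are reduced with `min`, the answer does not depend on the number of threads.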

Seed k3 randomness with verifier ID and return index of invalid label

To support distributed verification, each verifier must verify a different subset of k3 indices. To achieve that, the randomness seed used to select k3 indices must include the public key of the verifying node.

To create a malfeasance proof, we need to know which index in the proof is invalid. The result from the verification method should include this information.
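Both requirements can be illustrated with a small sketch. All names here are hypothetical, and the standard library's `DefaultHasher` is used only as a stand-in for the blake3-based sequence mentioned in the k3pow discussion. The selection mixes the verifier's ID into the seed, and the verification result carries the offending index.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick `k3` distinct indices out of `0..k2`, seeded by the proof seed and
/// the verifier's ID (partial Fisher-Yates shuffle; stand-in hash).
fn select_k3_indices(seed: &[u8], verifier_id: &[u8], k2: usize, k3: usize) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..k2).collect();
    let take = k3.min(k2);
    for i in 0..take {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        verifier_id.hash(&mut hasher);
        i.hash(&mut hasher);
        // Choose a position in i..k2 and move it to the front part.
        let j = i + (hasher.finish() % (k2 - i) as u64) as usize;
        indices.swap(i, j);
    }
    indices.truncate(take);
    indices
}

/// Check the selected indices; on failure return the first invalid index
/// so it can be used in a malfeasance proof.
fn verify_subset(is_valid: impl Fn(usize) -> bool, subset: &[usize]) -> Result<(), usize> {
    match subset.iter().copied().find(|&i| !is_valid(i)) {
        Some(invalid_index) => Err(invalid_index),
        None => Ok(()),
    }
}
```

With a cryptographic hash in place of the stand-in, verifiers with different public keys would be expected to check different subsets.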
