post-rs's People

Contributors

andreivcodes, brusherru, dependabot[bot], dshulyak, fasmat, lrettig, pigmej, poszu, zhiqiangxu

post-rs's Issues

Use randomX for K2 PoW

Currently, K2 PoW (on the POST proving and verification path) uses scrypt-jane. A better alternative, designed specifically for PoW, was found: RandomX.

We should change the hash function for K2 PoW to RandomX.

Make profiler available in recent main

As in the title, we need the profiler tool available in the main branch of the repo, preferably even as a release artifact binary that anyone can download.

remove K3 pow

Quoting a conversation with @iddo333 from Slack:
@iddo333:

k3pow won't be needed, but k3 itself will be needed when we switch to distributed verification. So the code that solves the PoW for k3pow isn't needed (it shouldn't be compiled; you can comment it out or delete it so that it lives only in the git history; if we find a new reason to use k3pow we can revive this code, but that's unlikely). The code that takes the random-looking k3pow seed and verifies the subset (by using blake3 to create the random-looking sequence) will be used in the future for distributed verification, by replacing the k3pow seed with a different seed.

@poszu:

Is it OK to remove k3pow from the POST proof and k3_pow_difficulty from the configuration of the POST prover then? I will leave the code that picks random indices for distributed verification.

@iddo333:

Yes, that's OK for now (the code will remain in the git history so it's no big deal; it's probably enough to just mention the k3pow removal in the git commit message, but I guess you can also add a git tag).

POS Data Verification Bug

I have a set of POS data that are valid when I use the postcli tool to verify. But when I use go-spacemesh, I get this error in the log.
2023-11-27T10:00:48.630-0800 INFO e2a89.post proving: generated proof {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "post"}
2023-11-27T10:00:48.632-0800 INFO e2a89.atxBuilder created the initial post {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "atxBuilder"}
2023-11-27T10:00:48.632-0800 INFO e2a89.atxBuilder verifying the initial post {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "atxBuilder", "post": {"nonce": 284, "indices":"8e22b74a006b4f2c275874d3e604766f65c2f022ec2f1d666faf46844f58ddba584abe73177d17584e03ea9dc87f502f1865900f7edf3962b9aa085c0a8426b4cbe131f3b1518d2fdd395d4e6b9d877d25531c7178622aa74f5bc8e47505576dbfcedb0036dafe627d7243bdf3ed73a836bdf48e35c045ea59ef451245a41fa764d98ca151407fefb8d16bfa32a881c5446d26be0baae1c7d86195831a47842e21667bfc1824ad57c9f8a501"}, "metadata": {"Challenge": "0000000000000000000000000000000000000000000000000000000000000000", "LabelsPerUnit": 4294967296}, "name": "atxBuilder"}
2023-11-27T10:00:48.649-0800 ERROR e2a89.nipostValidator Proof is invalid: invalid proof of work{"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "nipostValidator","module":"post::post_impl", "file": "ffi/src/post_impl.rs", "line": 203}
2023-11-27T10:00:48.649-0800 FATAL e2a89.atxBuilder initial POST proof is invalid. Probably the initialized POST data is corrupted. Please verify the data with postcli and regenerate the corrupted files. {"node_id":"e2a897d9be8190ef5fc79871237786939235de0ef2c497142f56f2def2f4912f", "module": "atxBuilder", "errmsg": "verify PoST: invalid proof", "name": "atxBuilder"}

Interruptible proof generation

Status quo

Currently, there is no way to interrupt proof generation other than killing the application.

Desired behavior

It should be possible to interrupt proof generation and continue later from the point it stopped.
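One way to picture the desired behavior is a scanning loop that checks a shared stop flag between chunks of work and reports where it stopped. The sketch below is only an illustration under that assumption: scan_chunks, ScanOutcome and the notion of a "chunk" are hypothetical names, not part of post-rs, and how the resume position would be persisted is left open.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Outcome of one run of the (hypothetical) scanning loop.
#[derive(Debug, PartialEq)]
pub enum ScanOutcome {
    /// Every chunk in the requested range was processed.
    Finished,
    /// A stop was requested; a later run can resume from `next_chunk`.
    Interrupted { next_chunk: u64 },
}

/// Processes chunks `start_chunk..total_chunks`, checking `stop` before each chunk.
pub fn scan_chunks<F: FnMut(u64)>(
    start_chunk: u64,
    total_chunks: u64,
    stop: &AtomicBool,
    mut process: F,
) -> ScanOutcome {
    for chunk in start_chunk..total_chunks {
        // Cooperative cancellation: bail out at a chunk boundary only.
        if stop.load(Ordering::Relaxed) {
            return ScanOutcome::Interrupted { next_chunk: chunk };
        }
        process(chunk);
    }
    ScanOutcome::Finished
}
```

A caller would set the flag from another thread (or a signal handler), store the returned next_chunk, and pass it as start_chunk on the next run.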

Test new initialization on AMD on Linux

We need to validate AMD GPUs on Linux with OpenCL. The problem is that installing amdgpu is troublesome: it does not work out of the box for the 6600XT, even after it is fully installed and correctly configured.

Profiler does not seem to work correctly on Linux amd64

cmdline:

time ./profiler --k2-pow-difficulty 891576961504 --data-file profiler_data.bin --data-size 1 --duration 10 -t 1
^C

real	5m3.591s
user	76m50.815s
sys	0m5.756s

No data was created on disk, even though the profiler ran for 5 minutes (instead of the 10 seconds I asked for) before I interrupted it.

The screenshot shows the CPU usage:
image (3)

When I removed --k2-pow-difficulty 891576961504 from the command line, it behaved correctly.

root@testnet-04-miner-1-0:~/data/post/profiler_d# time ./profiler  --data-file profiler_data.bin --data-size 1 --duration 10 -t 1
{
  "time_s": 10.6147993,
  "speed_gib_s": 0.9420809303478777
}

real	0m10.644s
user	0m9.243s
sys	0m1.402s

profiler release on M1/M2 does NOT use hw AES acceleration

Checked on M2:

nj ~/Downloads $ ./profiler
{
  "time_s": 13.723401458,
  "speed_gib_s": 0.2186046957222229
}

It clearly does not use hw aes acceleration.

Same machine, compiled manually:

nj ~/workspace/post-rs main* $ RUSTFLAGS="--cfg=aes_armv8" cargo run --release -p profiler
    Finished release [optimized] target(s) in 0.08s
     Running `target/release/profiler`
{
  "time_s": 10.456489666,
  "speed_gib_s": 1.8170533904681048
}
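When comparing the two builds, it can help to first confirm that the CPU itself reports AES support, so that "the hardware lacks AES" can be ruled out. The sketch below uses the standard library's runtime feature-detection macros. The function name hw_aes_available is hypothetical, and this is not how post-rs or its AES dependency selects a backend; it is only a diagnostic aid.

```rust
/// Reports whether the running CPU advertises AES instructions (aarch64).
#[cfg(target_arch = "aarch64")]
pub fn hw_aes_available() -> bool {
    std::arch::is_aarch64_feature_detected!("aes")
}

/// Reports whether the running CPU advertises AES-NI (x86 / x86_64).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn hw_aes_available() -> bool {
    std::arch::is_x86_feature_detected!("aes")
}

/// Fallback for other architectures: no detection is attempted.
#[cfg(not(any(target_arch = "aarch64", target_arch = "x86", target_arch = "x86_64")))]
pub fn hw_aes_available() -> bool {
    false
}
```

If this returns true while the profiler speed stays at the slow figure, the binary was most likely built without the accelerated code path.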

Improve blacklisting initialization providers

#98 introduced a way to blacklist OCL platforms and devices via environment variables (POST_OCL_PLATFORMS_BLACKLIST and POST_OCL_DEVICES_BLACKLIST). It turns out to be too difficult to configure for less technical users.

The proposal is to allow configuring the blacklist via the config instead.
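As a rough illustration of what a config-driven blacklist could look like, the sketch below filters a provider list against two lists of names. Everything here is an assumption made for the example: the OclBlacklist and Provider types, their field names, and exact-string matching are hypothetical and do not describe the existing environment-variable behavior.

```rust
/// Hypothetical config section; field names are illustrative only.
#[derive(Debug, Default, Clone)]
pub struct OclBlacklist {
    pub platforms: Vec<String>,
    pub devices: Vec<String>,
}

/// Hypothetical description of one OpenCL provider.
#[derive(Debug, Clone, PartialEq)]
pub struct Provider {
    pub platform: String,
    pub device: String,
}

/// Returns the providers whose platform and device are both absent from the blacklist.
/// Matching is by exact name (an assumption of this sketch).
pub fn allowed_providers(all: &[Provider], blacklist: &OclBlacklist) -> Vec<Provider> {
    all.iter()
        .filter(|p| {
            !blacklist.platforms.iter().any(|b| b == &p.platform)
                && !blacklist.devices.iter().any(|b| b == &p.device)
        })
        .cloned()
        .collect()
}
```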

Allow using externally created K2Pow to generate a POST proof

It should be possible to provide a k2 proof of work to generate the POST with in order to avoid recomputing it.

Requirements:

  • generate_proof() should accept an optional k2pow parameter (an array with one entry per nonce group?)
  • if a k2pow is provided for a given nonce group, use it
  • otherwise, compute it
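The requirements above can be sketched as a small helper that resolves one PoW value per nonce group, falling back to computation only where nothing was supplied. This is a sketch under assumptions: resolve_k2pows is a hypothetical name, the PoW is modeled as a plain u64, and the compute closure stands in for the real K2 PoW search.

```rust
/// Returns the PoW value to use for each nonce group.
///
/// `provided[i]` is the externally supplied PoW for group `i`, if any.
/// `compute` is called only for groups without a supplied value.
pub fn resolve_k2pows<F: FnMut(usize) -> u64>(
    provided: &[Option<u64>],
    mut compute: F,
) -> Vec<u64> {
    provided
        .iter()
        .enumerate()
        .map(|(group, pow)| match pow {
            Some(p) => *p,          // use the supplied PoW as-is
            None => compute(group), // fall back to computing it
        })
        .collect()
}
```

A real implementation would presumably still verify a supplied PoW before trusting it; that step is omitted here.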

Exception when querying providers on Windows integrated graphics

There is a crash in OpenCL.dll when querying properties of an OpenCL platform that happens on some systems (for example by calling scrypt_ocl::get_providers_count()).

It was narrowed down to Intel's integrated graphics (both Intel HD and Intel Iris). Disabling the driver (leaving Nvidia enabled) resolves the issue.

An issue in the ocl library tracking the same: cogciprocate/ocl#219.

Please see the logs.

spacemesh-log-7f8f332c.txt

The Sentry report for this exact issue: https://spacemesh.sentry.io/issues/4234116670/?project=6324919&query=is%3Aunresolved&referrer=issue-stream&stream_index=0

Postcli shows POS Data as Valid but still produces invalid proofs

A POST proof generated on M1 Mac is invalid.

Related thread on Discord: https://discord.com/channels/623195163510046732/1145451846249500753

Logs: spacemesh-log-7c8cef2b.zip

The POST proof generation code thinks that the files are 4096B in size:

invalid POS file, expected size: 2147483648 vs actual size: 4096	{"node_id": "358c764132cff490de0266420624b561392bf14acada9112a9fc1d7e350adb6d", "module": "nipostValidator", "module": "post::reader", "file": "src/reader.rs", "line": 102}

Crash from 3090 CL_MEM_OBJECT_ALLOCATION_FAILURE

2023/05/15 10:52:30     DEBUG   initialization: file #2 current position: 39845888, remaining: 27262976
2023/05/15 10:52:32     DEBUG   initialization: file #2 current position: 40894464, remaining: 26214400
2023/05/15 10:52:34     DEBUG   initialization: file #2 current position: 41943040, remaining: 25165824
2023/05/15 10:52:36     DEBUG   initialization: file #2 current position: 42991616, remaining: 24117248
2023/05/15 10:52:39     DEBUG   initialization: file #2 current position: 44040192, remaining: 23068672
2023/05/15 10:52:41     DEBUG   initialization: file #2 current position: 45088768, remaining: 22020096
2023/05/15 10:52:43     DEBUG   initialization: file #2 current position: 46137344, remaining: 20971520
2023/05/15 10:52:45     DEBUG   initialization: file #2 current position: 47185920, remaining: 19922944
2023/05/15 10:52:47     DEBUG   initialization: file #2 current position: 48234496, remaining: 18874368
2023/05/15 10:52:50     DEBUG   initialization: file #2 current position: 49283072, remaining: 17825792
2023/05/15 10:52:52     DEBUG   initialization: file #2 current position: 50331648, remaining: 16777216
2023/05/15 10:52:54     DEBUG   initialization: file #2 current position: 51380224, remaining: 15728640
2023/05/15 10:52:56     DEBUG   initialization: file #2 current position: 52428800, remaining: 14680064
2023/05/15 10:52:58     DEBUG   initialization: file #2 current position: 53477376, remaining: 13631488
2023/05/15 10:53:01     DEBUG   initialization: file #2 current position: 54525952, remaining: 12582912
2023/05/15 10:53:03     DEBUG   initialization: file #2 current position: 55574528, remaining: 11534336
2023/05/15 10:53:05     DEBUG   initialization: file #2 current position: 56623104, remaining: 10485760
2023/05/15 10:53:07     DEBUG   initialization: file #2 current position: 57671680, remaining: 9437184
2023/05/15 10:53:09     DEBUG   initialization: file #2 current position: 58720256, remaining: 8388608
2023/05/15 10:53:12     DEBUG   initialization: file #2 current position: 59768832, remaining: 7340032
2023/05/15 10:53:14     DEBUG   initialization: file #2 current position: 60817408, remaining: 6291456
2023/05/15 10:53:16     DEBUG   initialization: file #2 current position: 61865984, remaining: 5242880
2023/05/15 10:53:18     DEBUG   initialization: file #2 current position: 62914560, remaining: 4194304
2023/05/15 10:53:21     DEBUG   initialization: file #2 current position: 63963136, remaining: 3145728
Found new smallest nonce: Some(VrfNonce { index: 199214068, label: [0, 0, 0, 1, 10, 251, 93, 246, 161, 28, 201, 140, 231, 38, 180, 178, 27, 93, 64, 199, 96, 105, 160, 107, 8, 151, 154, 192, 67, 170, 150, 173] })
2023/05/15 10:53:23     INFO    initialization: file #2, found nonce: 199214068, value: 000000010afb5df6a11cc98ce726b4b2
2023/05/15 10:53:23     INFO    initialization: file #2, found new best nonce
2023/05/15 10:53:23     DEBUG   initialization: file #2 current position: 65011712, remaining: 2097152
2023/05/15 10:53:25     DEBUG   initialization: file #2 current position: 66060288, remaining: 1048576
2023/05/15 10:53:27     INFO    initialization: file #2 completed; number of labels written: 67108864
2023/05/15 10:53:27     INFO    initialization: starting to write file #3; target number of labels: 67108864, start position: 201326592
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12128, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 388096 bytes
Allocating buffer for lookup: 6358564864 bytes
2023/05/15 10:53:27     DEBUG   initialization: file #3 current position: 0, remaining: 67108864
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OclError(OclCore(Api(

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueNDRangeKernel("scrypt")

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors

#############################################################################
)))', ffi/src/initialization.rs:146:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
SIGABRT: abort
PC=0x7fd345a2fa7c m=9 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1 [syscall]:
runtime.cgocall(0x523d16, 0xc0001495f8)
        /root/go/src/runtime/cgocall.go:158 +0x5c fp=0xc0001495d0 sp=0xc000149598 pc=0x40595c
github.com/spacemeshos/post/internal/postrs._Cfunc_initialize(0x7fd13cdcf8a0, 0xc000000, 0xc0fffff, 0x7fd13d0020d0, 0xc0002820b0)
        _cgo_gotypes.go:307 +0x4c fp=0xc0001495f8 sp=0xc0001495d0 pc=0x4f562c
github.com/spacemeshos/post/internal/postrs.cScryptPositions.func2(0x7fd13cdcf8a0, 0x1?, 0x7fd13d0020d0?, 0x657be0?, 0x0?)
        /root/post/internal/postrs/initializer.go:109 +0x7f fp=0xc000149658 sp=0xc0001495f8 pc=0x4f6bdf
github.com/spacemeshos/post/internal/postrs.cScryptPositions(0xc00013a000?, 0xc000149700?, 0xc000000, 0xc0fffff)
        /root/post/internal/postrs/initializer.go:109 +0xf1 fp=0xc000149708 sp=0xc000149658 pc=0x4f6831
github.com/spacemeshos/post/internal/postrs.(*Scrypt).Positions(0xc000280030, 0xc000000, 0xc0fffff)
        /root/post/internal/postrs/api.go:159 +0x7e fp=0xc000149760 sp=0xc000149708 pc=0x4f60fe
github.com/spacemeshos/post/oracle.(*WorkOracle).Positions(0x4000000?, 0x55dbd9?, 0x20?)
        /root/post/oracle/oracle.go:164 +0x33 fp=0xc0001497b8 sp=0xc000149760 pc=0x506513
github.com/spacemeshos/post/initialization.(*Initializer).initFile(0xc000176000, {0x588b18, 0xc00016c340}, 0xc000175a30?, 0x100000, 0xc000000, 0x4000000, {0xc00001a1a0, 0x20, 0x20})
        /root/post/initialization/initialization.go:479 +0xab0 fp=0xc0001499d8 sp=0xc0001497b8 pc=0x51f450
github.com/spacemeshos/post/initialization.(*Initializer).Initialize(0xc000176000, {0x588b18, 0xc00016c340})
        /root/post/initialization/initialization.go:266 +0x58a fp=0xc000149cb8 sp=0xc0001499d8 pc=0x51d9ea
main.main()
        /root/post/cmd/postcli/main.go:133 +0x3e5 fp=0xc000149f80 sp=0xc000149cb8 pc=0x522c85
runtime.main()
        /root/go/src/runtime/proc.go:250 +0x212 fp=0xc000149fe0 sp=0xc000149f80 pc=0x439052
runtime.goexit()
        /root/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc000149fe8 sp=0xc000149fe0 pc=0x465e41

goroutine 2 [force gc (idle), 6 minutes]:

"POS data is invalid" error on Apple Silicon

I've seen a few reports of this. It happened to me when I initialized using a Mac M2. Another user just reported to me:

I still fail to smesh using my Mac M2 book after syncing. If I use postcli with the command:
./postcli -verify -datadir ~/post/7c8cef2b -fraction 0.1
It complains:
POS data is invalid: invalid label in file 0 at offset 532096 {"module": "post::initialization", "file": "ffi/src/initialization.rs", "line": 238}

Here are a couple of reports on Discord, there may be others:

Possibly related: #123

Improve logging

Currently, the library prints to stdout. It could be improved by using the log logging facade. Then the C library could expose an API to configure logging (either to log directly to stdout/stderr or to provide a callback that passes logs up the stack to the user of the post-rs lib).
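The callback idea could look roughly like the sketch below: the host registers a C-compatible function pointer, and the library routes messages to it, falling back to stderr when none is set. All names (LogCallback, set_logging_callback, emit) and the numeric level are hypothetical. The sketch uses only the standard library rather than the log crate, and omits the export attributes a real C API would need.

```rust
use std::ffi::{c_char, CString};
use std::sync::Mutex;

/// Hypothetical callback type: a numeric level and a NUL-terminated message.
pub type LogCallback = extern "C" fn(level: u8, message: *const c_char);

/// Currently registered callback, if any.
static LOG_CALLBACK: Mutex<Option<LogCallback>> = Mutex::new(None);

/// Registers (or clears, with `None`) the callback that receives log messages.
pub extern "C" fn set_logging_callback(callback: Option<LogCallback>) {
    *LOG_CALLBACK.lock().unwrap() = callback;
}

/// Routes a message to the callback, or to stderr when no callback is set.
pub fn emit(level: u8, message: &str) {
    // Copy the pointer out so the lock is not held while the callback runs.
    let callback = *LOG_CALLBACK.lock().unwrap();
    match callback {
        Some(cb) => {
            // Messages containing an interior NUL cannot be passed to C; skip them.
            if let Ok(c_msg) = CString::new(message) {
                cb(level, c_msg.as_ptr());
            }
        }
        None => eprintln!("[{level}] {message}"),
    }
}
```

With the log facade, emit would be replaced by a log::Log implementation that forwards records in the same way.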

profiler gives wrong results because of file caching

The goal of the profiler is to help pick the right combination of threads and parallel nonces to get optimal POST proving speed (going over the data on disk and finding labels) given the user's real disk read speed.

The profiler creates a file on a disk filled with garbage and then goes over it doing its work a few times in a loop. The problem is that the file gets cached in RAM in the first iteration so all subsequent ones read from RAM instead of disk. This defeats the purpose because the benchmark is not limited by disk read speed anymore.

RUSTSEC-2021-0145: Potential unaligned read

Potential unaligned read

Details
Status: unsound
Package: atty
Version: 0.2.14
URL: softprops/atty#50
Date: 2021-07-04

On Windows, atty dereferences a potentially unaligned pointer.

In practice however, the pointer won't be unaligned unless a custom global allocator is used.

In particular, the System allocator on Windows uses HeapAlloc, which guarantees a large enough alignment.

atty is Unmaintained

A Pull Request with a fix has been provided over a year ago but the maintainer seems to be unreachable.

Last release of atty was almost 3 years ago.

Possible Alternative(s)

The below list has not been vetted in any way and may or may not contain alternatives;

See advisory page for additional details.
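One replacement that needs no extra dependency is the standard library's std::io::IsTerminal trait (stable since Rust 1.70). The sketch below shows the equivalent of atty's stdout check; the wrapper name stdout_is_terminal is hypothetical, and whether post-rs depends on atty directly or only transitively is not stated in this issue.

```rust
use std::io::{self, IsTerminal};

/// Returns true when standard output is attached to a terminal,
/// which is what `atty::is(atty::Stream::Stdout)` was typically used for.
pub fn stdout_is_terminal() -> bool {
    io::stdout().is_terminal()
}
```

If atty is only pulled in through another crate, upgrading that crate is the fix rather than changing any local code.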

Don't start profiler timer until disk is spun up

I noticed that when I run the profiler on a dedicated disk that's in deep sleep, the first read is extra slow, but a subsequent read is faster:

> profiler -t 0 -n 144 --data-file /mnt/smesher-02/post/7c8cef2b/postdata_0.bin
{
  "time_s": 12.901740377,
  "speed_gib_s": 0.07750892288785359
}
> profiler -t 0 -n 144 --data-file /mnt/smesher-02/post/7c8cef2b/postdata_0.bin
{
  "time_s": 11.053322122,
  "speed_gib_s": 0.18094107616924474
}
> profiler -t 0 -n 144 --data-file /mnt/smesher-02/post/7c8cef2b/postdata_1.bin
{
  "time_s": 10.225100351,
  "speed_gib_s": 0.1955971023604089
}

(Changing nonces and threads here has no impact.)

Profiler should be smart enough to handle this case. It's probably as easy as having it perform one tiny "dummy read" before the timer starts, to make sure the disk is awake.
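The "dummy read" idea can be sketched as follows: issue one small untimed read to give the disk a chance to wake up, then start the clock. This is a minimal sketch, not the profiler's code. The function name timed_read_with_warmup and the 512-byte probe size are assumptions, and a probe served from the OS page cache would not touch the disk at all, so a real fix might need to bypass the cache.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;
use std::time::{Duration, Instant};

/// Reads the whole file and returns (bytes read, elapsed time),
/// excluding a small warm-up read from the measurement.
pub fn timed_read_with_warmup(path: &Path) -> io::Result<(u64, Duration)> {
    let mut file = File::open(path)?;

    // Warm-up: a tiny read that is not timed, meant to spin up a sleeping disk.
    let mut probe = [0u8; 512];
    let _ = file.read(&mut probe)?;
    file.seek(SeekFrom::Start(0))?;

    // The timer starts only after the warm-up read has returned.
    let start = Instant::now();
    let mut buf = vec![0u8; 1 << 20];
    let mut total: u64 = 0;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        total += n as u64;
    }
    Ok((total, start.elapsed()))
}
```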

Fails on 3090 after 11G initialized - CL_NV_INVALID_MEM_ACCESS

Allocating buffer for lookup: 6358564864 bytes
2023/05/12 09:37:09     DEBUG   initialization: file #10 current position: 59768832, remaining: 7340032
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OclError(OclCore(Api(

################################ OPENCL ERROR ###############################

Error executing function: clCreateContext

Status error code: CL_NV_INVALID_MEM_ACCESS (-9999)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateContext.html#errors

#############################################################################
)))', ffi/src/initialization.rs:145:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
SIGABRT: abort

Everything was working fine, and then this crashed.

It could be partially caused by spacemeshos/post#138

Consider sharing randomx cache and dataset between prover and verifier

Currently, there are separate instances of RandomX cache and dataset on proving and verifying path. In some cases, it could make sense to share them. For example, when a verifier is configured in FAST mode (2GB), then it could share its dataset with the prover.

Doing it in the other direction (to share a dataset with the verifier) makes less sense as a prover is created once every epoch and it exists only for a short time.
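The sharing described above maps naturally onto reference counting: the long-lived verifier owns the dataset and hands a shared handle to the short-lived prover. The sketch below illustrates only that ownership pattern. Dataset, Verifier and Prover are hypothetical stand-ins rather than the post-rs or RandomX types, and it assumes both sides use the same RandomX key, since a dataset is derived from a key.

```rust
use std::sync::Arc;

/// Stand-in for a RandomX dataset (large and expensive to build).
pub struct Dataset {
    pub key: Vec<u8>,
}

impl Dataset {
    /// Placeholder for the expensive dataset initialization.
    pub fn build(key: &[u8]) -> Self {
        Dataset { key: key.to_vec() }
    }
}

/// Long-lived verifier that owns the dataset in FAST mode.
pub struct Verifier {
    dataset: Arc<Dataset>,
}

impl Verifier {
    pub fn new_fast(key: &[u8]) -> Self {
        Verifier { dataset: Arc::new(Dataset::build(key)) }
    }

    /// Hands out another handle to the same dataset; no copy is made.
    pub fn shared_dataset(&self) -> Arc<Dataset> {
        Arc::clone(&self.dataset)
    }

    /// Number of live handles to the dataset (for illustration only).
    pub fn dataset_handles(&self) -> usize {
        Arc::strong_count(&self.dataset)
    }
}

/// Short-lived prover that borrows the verifier's dataset instead of building its own.
pub struct Prover {
    dataset: Arc<Dataset>,
}

impl Prover {
    pub fn with_shared_dataset(dataset: Arc<Dataset>) -> Self {
        Prover { dataset }
    }

    pub fn dataset_key(&self) -> &[u8] {
        &self.dataset.key
    }
}
```

When the prover is dropped at the end of proving, only its handle goes away and the verifier keeps the dataset alive.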

Add k2pow calculation to the profiler

I think we should add a benchmark for a basic unit (64 nonces, 256 GiB) to the profiler tool.

It is currently missing, which could lead someone to initialize too much data.

Fails on 4090

~/post-rs$ cargo run --release -p initializer -- -l 12032000 --node-id "hBGTHs44tav7YR87sRVafuzZwObCZnK1Z/exYpxwqSQ=" --commitment-atx-id "ZuxocVjIYWfv7A/K1Lmm8+mNsHzAZaWVpbl5+KINx+I=" -m 1000000000
   Compiling initializer v0.1.8 (/root/post-rs/initializer)
    Finished release [optimized] target(s) in 1.88s
     Running `target/release/initializer -l 12032000 --node-id hBGTHs44tav7YR87sRVafuzZwObCZnK1Z/exYpxwqSQ= --commitment-atx-id ZuxocVjIYWfv7A/K1Lmm8+mNsHzAZaWVpbl5+KINx+I= -m 1000000000`
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 4090
device memory: 24217 MB, max_mem_alloc_size: 6054 MB, max_compute_units: 128, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12096, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 387072 bytes
Allocating buffer for lookup: 6341787648 bytes
Error: initializing: Fail in OpenCL:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueNDRangeKernel("scrypt")

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors

#############################################################################


Location:
    initializer/src/main.rs:199:22

help for verify error data

ERROR 636ff.post Proof is invalid: MSB value for index: 117717007989 doesn't satisfy difficulty: 215 > 0 (label: [145, 107, 182, 205, 188, 156, 238, 115, 148, 20, 39, 181, 44, 4, 119, 80]) {"node_id": "", "module": "post", "module": "post::post_impl", "file": "ffi\src\post_impl.rs", "line": 242}

initial POST proof is invalid. Probably the initialized POST data is corrupted. Please verify the data with postcli and regenerate the corrupted files

I found this code:
(https://github.com/spacemeshos/post-rs/blob/main/ffi/src/post_impl.rs#L239)

and I would like it to tell me which postdata file contains the error.
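Until the error message includes the file, the file can be estimated from the label index by simple division. The sketch below assumes 16-byte labels (the label printed in the error has 16 bytes), labels stored contiguously in index order, and the same maximum size for every postdata file. The names locate_label and LabelLocation are hypothetical.

```rust
/// Size of one label in bytes (assumption based on the 16-byte label in the error).
pub const LABEL_SIZE: u64 = 16;

/// Where a label lives on disk.
#[derive(Debug, PartialEq)]
pub struct LabelLocation {
    /// N in postdata_N.bin.
    pub file_index: u64,
    /// Byte offset of the label inside that file.
    pub byte_offset: u64,
}

/// Maps a global label index to a file and byte offset, given the max file size in bytes.
pub fn locate_label(index: u64, max_file_size: u64) -> LabelLocation {
    let labels_per_file = max_file_size / LABEL_SIZE;
    assert!(labels_per_file > 0, "max_file_size must hold at least one label");
    LabelLocation {
        file_index: index / labels_per_file,
        byte_offset: (index % labels_per_file) * LABEL_SIZE,
    }
}
```

For example, if the files were created with a maximum size of 2147483648 bytes, index 117717007989 would fall in postdata_877.bin; with a different file size the result changes accordingly.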

Memory access violation (3221225477) crashes the Node while calculating PoW

We have a report on Discord that go-spacemesh quits unexpectedly.

The go-spacemesh logs look fine and do not contain any errors.
However, the app logs indicate that the node process exits with code 3221225477 (0xC0000005).

Here are excerpts from the log files:

Go-spacemesh:

...
2023-12-27T22:28:26.699Z	INFO	800e3.post	initialization: file already initialized	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "post", "fileIndex": 671, "currentNumLabels": 134217728, "targetNumLabels": 134217728, "startPosition": 90060095488}
2023-12-27T22:28:26.699Z	INFO	800e3.post	initialization: completed, found nonce	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "post", "nonce": 81997299014}
2023-12-27T22:28:26.699Z	INFO	800e3.post	post setup completed	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "post", "node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "commitment_atx": "9399f34ca0", "data_dir": "E:\\HDDNODE4", "num_units": "21", "labels_per_unit": "4294967296", "provider": "{824636000664}", "name": "post"}
2023-12-27T22:28:26.705Z	INFO	800e3.atxBuilder	loaded the initial post from disk	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "atxBuilder"}
2023-12-27T22:28:26.705Z	INFO	800e3.atxBuilder	verifying the initial post	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "atxBuilder", "post": {"nonce": 179, "indices": "9f32e0326039241d242889fa5e051c0a1fd2601fa8321fdea40f70842a139ca6003bcf8617af6ef504e3df3c9863586785070f2ed956f7a12531444074d6d6d888c894391bd159cceb2308b58ace844bff34a1c416cb849666a915dd62e301a371f021dd020f3a8efceae9dc329e3f4898e666e82609e110f5c5c2dca24eac98c234d313c6a698fa23a095dab51623fb3c247a6813d340abed5e9761bfbd88b54338f695cd2ee80ae9681701"}, "metadata": {"Challenge": "0000000000000000000000000000000000000000000000000000000000000000", "LabelsPerUnit": 4294967296}, "name": "atxBuilder"}
2023-12-27T22:28:26.815Z	INFO	800e3.atxBuilder	atx challenge is ready	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "atxBuilder", "sessionId": "0d2a3100-fd10-4ad5-a1f4-ca91d8129d18", "current_epoch": "11", "publish_epoch": "11", "target_epoch": "12", "name": "atxBuilder"}
2023-12-27T22:28:26.815Z	INFO	800e3.nipostBuilder	building nipost	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostBuilder", "sessionId": "0d2a3100-fd10-4ad5-a1f4-ca91d8129d18", "poet round start": "2023-12-11T08:00:00.000Z", "poet round end": "2023-12-24T20:00:00.000Z", "publish epoch": "11", "publish epoch end": "2023-12-29T08:00:00.000Z", "target epoch": "12", "name": "nipostBuilder"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostBuilder	starting post execution	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostBuilder", "challenge": "VgDHwLYegT5coz4ptGFD9ORuJeEVBEoVIwiC0CsXcjo=", "name": "nipostBuilder"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator	scaling post verifier	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator", "current": 9, "new": 1}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-8	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-1	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-5	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-7	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-4	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-2	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-3	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.952Z	INFO	800e3.nipostValidator.worker-6	stopped	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator"}
2023-12-27T22:28:26.960Z	INFO	800e3.nipostValidator	generating proof with PoW flags: RandomXFlag(FLAG_HARD_AES | FLAG_FULL_MEM | FLAG_JIT | FLAG_ARGON2_SSSE3 | FLAG_ARGON2_AVX2) and params: ProvingParams { difficulty: 5317578556, pow_difficulty: [0, 0, 170, 111, 105, 238, 214, 162, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195, 12, 48, 195] }	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator", "module": "post::prove", "file": "src\\prove.rs", "line": 276}
2023-12-27T22:28:27.331Z	INFO	grpc	finished streaming call with code OK	{"module": "grpc", "grpc.start_time": "2023-12-27T22:19:58Z", "system": "grpc", "span.kind": "server", "grpc.service": "spacemesh.v1.SmesherService", "grpc.method": "PostSetupStatusStream", "peer.address": "127.0.0.1:49946", "grpc.code": "OK", "grpc.time_ms": 509008.72}
2023-12-27T22:28:27.642Z	INFO	grpc	finished streaming call with code OK	{"module": "grpc", "grpc.start_time": "2023-12-27T22:19:57Z", "system": "grpc", "span.kind": "server", "grpc.service": "spacemesh.v1.SmesherService", "grpc.method": "PostSetupStatusStream", "peer.address": "127.0.0.1:49946", "grpc.code": "OK", "grpc.time_ms": 510008.3}
2023-12-27T22:28:40.483Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "05686864f8", "layer_id": 47462, "name": "blockHandler"}
2023-12-27T22:28:40.514Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47461, "block": "0b73fbbbac", "state_hash": "0xb9e5cb60045e4c592357b51cc8611805fcea1658ad18b770fba1102619eb8833", "duration": "59.4948ms", "count": 3, "rewards": 42, "name": "executor"}
2023-12-27T22:28:40.633Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "86b9eb89c9", "layer_id": 47464, "name": "blockHandler"}
2023-12-27T22:28:40.887Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47462, "block": "05686864f8", "state_hash": "0xaf193fd9dba8fba8d5e14848f6e8348b2bdc0d27d4cba5ebb4ab36e2e68e9d6c", "duration": "43.1586ms", "count": 2, "rewards": 47, "name": "executor"}
2023-12-27T22:28:40.935Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47463, "block": "afcde47b00", "state_hash": "0xd2496d0bc4f538c98ff6d9c83de2ff7fde108e940524c8d716880ae719e44fca", "duration": "41.1809ms", "count": 1, "rewards": 48, "name": "executor"}
2023-12-27T22:28:40.975Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47464, "block": "86b9eb89c9", "state_hash": "0x697b54dda4cf332e696e8b00511172628514427c9053252f569467d301e847e4", "duration": "39.1292ms", "count": 0, "rewards": 57, "name": "executor"}
2023-12-27T22:28:44.688Z	INFO	grpc	started streaming call	{"module": "grpc", "grpc.start_time": "2023-12-27T22:28:44Z", "system": "grpc", "span.kind": "server", "grpc.service": "spacemesh.v1.AdminService", "grpc.method": "EventsStream", "peer.address": "127.0.0.1:49946"}
2023-12-27T22:28:45.856Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "5d7e197d88", "layer_id": 47465, "name": "blockHandler"}
2023-12-27T22:28:46.318Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "218cdf8128", "layer_id": 47466, "name": "blockHandler"}
2023-12-27T22:28:46.509Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "0ca98cc9c3", "layer_id": 47467, "name": "blockHandler"}
2023-12-27T22:28:46.520Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47465, "block": "5d7e197d88", "state_hash": "0xaddf8c52ca05242676aaa426f342139b00d6d16a8c5fe83bce63eaee28a774dd", "duration": "51.3029ms", "count": 1, "rewards": 42, "name": "executor"}
2023-12-27T22:28:47.238Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47466, "block": "218cdf8128", "state_hash": "0x1ebe77e764256def431b07f15e5f072f7788b51218653c32e2ee6fb8b7975490", "duration": "45.5812ms", "count": 3, "rewards": 43, "name": "executor"}
2023-12-27T22:28:47.285Z	INFO	800e3.executor	executed block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "executor", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "lid": 47467, "block": "0ca98cc9c3", "state_hash": "0xae8d7935e393aafbce6596cfc5ac88f0eddf54c2174591ed1eb5028b16ddfcab", "duration": "47.7282ms", "count": 1, "rewards": 48, "name": "executor"}
2023-12-27T22:28:47.514Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "bd553bf3d4", "layer_id": 47468, "name": "blockHandler"}
2023-12-27T22:28:47.927Z	WARN	800e3.sync	mesh failed to process layer from sync	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "sync", "sessionId": "155456e0-8e97-4ecb-b0de-9799d75db3fc", "layer_id": 47470, "errmsg": "get block: get block bd553bf3d4: database: not found", "name": "sync"}
2023-12-27T22:28:47.986Z	INFO	800e3.blockHandler	new block	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "blockHandler", "block_id": "36ab2e082e", "layer_id": 47469, "name": "blockHandler"}
2023-12-27T22:29:02.364Z	INFO	800e3.nipostValidator	calculating proof of work for nonces 0..288	{"node_id": "800e38f530ceb835110a88f477a2804b5008a22f02b935fdd8a03bb8240c0b4d", "module": "nipostValidator", "module": "post::prove", "file": "src\\prove.rs", "line": 131}

App-logs:

...
[2023-12-27 22:28:27.634] [info]  SmesherService, in grpc PostDataCreationProgressStream, output: {"state":5,"numLabelsWritten":90194313216} 
[2023-12-27 22:28:27.636] [info]  SmesherService, in PostSetupStatusStream, output: {"postSetupState":5,"numLabelsWritten":90194313216,"opts":{"dataDir":"E:\\HDDNODE4","numUnits":21,"maxFileSize":2147483648,"providerId":4294967295}} 
[2023-12-27 22:28:27.636] [info]  SmesherService, in PostSetupStatusStream, output: "Status complete -> closing stream" 
[2023-12-27 22:28:33.775] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":25,"isSynced":false,"syncedLayer":47465,"topLayer":47981,"verifiedLayer":47460} 
[2023-12-27 22:28:38.098] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":26,"isSynced":false,"syncedLayer":47465,"topLayer":47981,"verifiedLayer":47460} 
[2023-12-27 22:28:45.280] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":26,"isSynced":false,"syncedLayer":47465,"topLayer":47981,"verifiedLayer":47464} 
[2023-12-27 22:28:49.045] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":26,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467} 
[2023-12-27 22:29:02.323] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":25,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467} 
[2023-12-27 22:29:13.199] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":24,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467} 
[2023-12-27 22:29:14.619] [info]  NodeManager, in sendNodeStatus, output: {"connectedPeers":23,"isSynced":false,"syncedLayer":47469,"topLayer":47981,"verifiedLayer":47467} 
[2023-12-27 22:29:34.373] [error] syncSmesherInfo, in subscribeNodeEvents, error: "Error: 14 UNAVAILABLE: read ECONNRESET" 
[2023-12-27 22:29:34.821] [error] NodeManager, in Node Process close, error: 3221225477 
...

I suspect that prove.rs tries to read from an invalid address, which crashes the app. The exit code 3221225477 is 0xC0000005, the Windows access violation status.

You can download the full logs from the report in Discord or ask the user to help with the investigation (e.g. by running Smapp with the --pprof-server flag, which is passed on to go-spacemesh, and then fetching the required profiles).

OpenCL-based initialization

Given the recent problems with Vulkan and CUDA, we need an alternative implementation. To lower the maintenance cost, let's use OpenCL.

The code has to be in this repo, and then we might need to integrate it with post.

There is a parallel effort to fix gpu-post.

Initialization: GPU->host bandwidth could be halved

Currently, the code transfers 32 bytes of each label from GPU to the host. 32B are needed to search for the POPS VRF nonce because all 32B are compared with difficulty. Later on, only the most significant (big-endian) 16B are preserved in the POS data.

It's possible to optimize the bandwidth and send only the 16B of each label. The missing lower 16B could be generated on the CPU on the rare occasion (1/2^128) when the most significant bytes are not enough to decide.

Given a 16B label and 32B difficulty, there are 3 possible cases to consider:

  • label > difficulty[0:16] -> NOT a nonce
  • label == difficulty[0:16] -> POSSIBLY a nonce - generate lower 16 bytes of label
  • label < difficulty[0:16] -> VALID nonce

The code would need to generate a label on the CPU only if the 16 bytes of the label are equal to the most significant 16B of the difficulty.
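The three cases above can be sketched as follows. This is only an illustration, not the repository's code: the names `NonceCheck`, `classify_label` and `is_nonce_full` are hypothetical, and treating a full label equal to the difficulty as not a nonce is an assumption.

```rust
use std::cmp::Ordering;

/// Result of comparing only the upper 16 bytes of a label with the difficulty.
/// (Hypothetical type, introduced for this sketch.)
#[derive(Debug, PartialEq, Eq)]
pub enum NonceCheck {
    /// label > difficulty[0:16]
    NotNonce,
    /// label < difficulty[0:16]
    Valid,
    /// label == difficulty[0:16]: the lower 16 bytes must be generated on the CPU
    NeedsFullLabel,
}

/// Classify a label from its most significant 16 bytes only.
pub fn classify_label(label_hi: &[u8; 16], difficulty: &[u8; 32]) -> NonceCheck {
    // For big-endian values, byte-wise lexicographic order equals numeric order.
    match label_hi[..].cmp(&difficulty[..16]) {
        Ordering::Greater => NonceCheck::NotNonce,
        Ordering::Less => NonceCheck::Valid,
        Ordering::Equal => NonceCheck::NeedsFullLabel,
    }
}

/// Fallback for the rare `NeedsFullLabel` case, once the full 32-byte label
/// has been regenerated on the CPU. Assumes a strict "less than" comparison.
pub fn is_nonce_full(label: &[u8; 32], difficulty: &[u8; 32]) -> bool {
    label[..] < difficulty[..]
}
```

With this split, only labels classified as `NeedsFullLabel` would ever require the lower 16 bytes.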

Configurable retries in post-service

Add a new CLI parameter to the PoST service, --max-retries, that defines how many times the service should try to connect to the node before giving up and shutting down.
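A rough sketch of the retry behaviour, under stated assumptions: `connect_with_retries` is a hypothetical helper, the connection attempt is abstracted as a closure, the delay between attempts is fixed, and `max_retries` is taken to mean the number of retries after the first failed attempt (so up to `max_retries + 1` attempts in total). The issue itself does not fix any of these details.

```rust
use std::thread;
use std::time::Duration;

/// Call `connect` until it succeeds or the retry budget is used up.
/// Returns the last error when giving up, so the caller can shut down.
pub fn connect_with_retries<T, E>(
    max_retries: u32,
    delay: Duration,
    mut connect: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut retries_used = 0;
    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            // Budget exhausted: give up and report the last error.
            Err(err) if retries_used >= max_retries => return Err(err),
            // Otherwise wait a little and try again.
            Err(_) => {
                retries_used += 1;
                thread::sleep(delay);
            }
        }
    }
}
```

The value of the CLI flag would be passed as `max_retries`; a backoff strategy could replace the fixed delay.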

Post-Service support for generating proofs

Implement the GRPC post service as described in spacemeshos/pm#260

Acceptance criteria

  • a binary that, when executed, tries to connect to the node at the given address
  • CLI to configure:
    • POST proving
    • POST network parameters (with mainnet defaults)
    • mTLS
    • host address
  • auto re-connecting
  • mTLS
  • support for the generate proof command
    • also verify the proof immediately
  • release the binary as the CI artifact and include in releases
  • reasonable test coverage

Reduce CPU load during GPU initialization

The CPU load during initialization is constantly at 100% for each postcli process. This has been reported by many users.

This issue is due to the specific Nvidia implementation of the OpenCL synchronization. The enqueued buffer read operation is constantly probing the status of the running OpenCL kernel, which puts the CPU under high load.

A workaround would be to put the CPU thread to sleep for a defined duration right after enqueuing the kernel, and only then enqueue a buffer read. The sleep duration can be obtained by averaging the execution time over a number of kernel executions and then taking a fraction of it as a safety margin (e.g. the sleep duration is 90% of the kernel execution time).
Ideally, the kernel execution time should be measured periodically by disabling the sleep for a couple of kernel executions every few seconds. This ensures that if the kernel execution time drops below the sleep duration (e.g. because the user increased the power limit of the GPU), the decrease is properly detected.

Tests on an RTX 3080 Ti show that a sleep duration of 25ms reduces the CPU load to 10% while maintaining the initialization speed. A further increase of the sleep duration to 30ms reduces the CPU load to 2.5% but has a negative impact on the initialization speed.

sleep (ms)   CPU load (%)   init. speed (MiB/s)
0            100            3.30
25           10             3.30
30           2.5            2.75
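The sleep-duration estimate described above could look roughly like this. It is a minimal sketch, not the postcli implementation: `SleepEstimator` and its methods are hypothetical names, the kernel timings are assumed to be measured elsewhere with the sleep disabled, and the percentage is a parameter (90 in the example from the text).

```rust
use std::time::Duration;

/// Derives a sleep duration from kernel execution times that were measured
/// with the sleep disabled. (Hypothetical helper for illustration.)
pub struct SleepEstimator {
    samples: Vec<Duration>,
    percent: u32,
}

impl SleepEstimator {
    /// `percent` is the share of the average kernel time to sleep, e.g. 90.
    pub fn new(percent: u32) -> Self {
        Self { samples: Vec::new(), percent }
    }

    /// Start a new calibration round (to be called every few seconds).
    pub fn start_calibration(&mut self) {
        self.samples.clear();
    }

    /// Record one kernel execution time measured without sleeping.
    pub fn record(&mut self, kernel_time: Duration) {
        self.samples.push(kernel_time);
    }

    /// Average of the recorded samples scaled by `percent`;
    /// zero (no sleep) until at least one sample exists.
    pub fn sleep_duration(&self) -> Duration {
        if self.samples.is_empty() {
            return Duration::ZERO;
        }
        let total: Duration = self.samples.iter().sum();
        let average = total / self.samples.len() as u32;
        average * self.percent / 100
    }
}
```

The initialization loop would sleep for `sleep_duration()` after enqueuing the kernel and only then enqueue the buffer read. Clearing the samples periodically lets a faster kernel (for example after a power limit change) shorten the sleep again.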

Expose global_work_size via api

We need to expose global_work_size so we don't leave perf on the table when integrating with post.

Exposing it will also allow callers to choose small but well-fitting batches, which improves the overall UX.

Tracking read performance issues

I've observed on multiple systems and operating systems that profiler speed results never exceed approx. 3 GiB/s, regardless of the number of nonces and the available hardware.

Edit: I have observed the same limitation on production nodes during the cycle gap.

Running multiple instances of the profiler allows for cumulative read speeds that scale to the system's CPU and IO limits, but still no more than ~3 GiB/s per instance.

Tests run with --data-size=32

Seed k3 randomness with verifier ID and return index of invalid label

To support distributed verification, each verifier must verify a different subset of k3 indices. To achieve that, the randomness seed used to select k3 indices must include the public key of the verifying node.

To create a malfeasance proof, we need to know which index in the proof is invalid. The result from the verification method should include this information.
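Both requirements can be illustrated with a small sketch. Everything here is an assumption made for illustration: the names `select_k3_indices`, `verify_subset` and `VerifyResult` are hypothetical, and the standard library's `DefaultHasher` is only a stand-in for the real hash (the discussion in the "remove K3 pow" issue mentions blake3). It is not cryptographically secure and the modulo step has a small bias.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Verification outcome that reports which proof index failed.
#[derive(Debug, PartialEq, Eq)]
pub enum VerifyResult {
    Valid,
    InvalidIndex(usize),
}

/// One pseudo-random word derived from the seed, the verifier's ID and a counter.
/// NOTE: DefaultHasher is a placeholder, not the hash a real verifier would use.
fn stream_word(seed: &[u8], verifier_id: &[u8], counter: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(seed);
    hasher.write(verifier_id);
    hasher.write_u64(counter);
    hasher.finish()
}

/// Pick `k3` distinct positions out of `num_proof_indices` with a partial
/// Fisher-Yates shuffle, so different verifier IDs get different subsets.
pub fn select_k3_indices(
    seed: &[u8],
    verifier_id: &[u8],
    num_proof_indices: usize,
    k3: usize,
) -> Vec<usize> {
    let mut positions: Vec<usize> = (0..num_proof_indices).collect();
    let take = k3.min(num_proof_indices);
    for i in 0..take {
        let word = stream_word(seed, verifier_id, i as u64) as usize;
        let j = i + word % (num_proof_indices - i);
        positions.swap(i, j);
    }
    positions.truncate(take);
    positions
}

/// Check the selected positions and return the first invalid one, if any.
pub fn verify_subset(
    positions: &[usize],
    is_label_valid: impl Fn(usize) -> bool,
) -> VerifyResult {
    match positions.iter().copied().find(|&pos| !is_label_valid(pos)) {
        Some(bad) => VerifyResult::InvalidIndex(bad),
        None => VerifyResult::Valid,
    }
}
```

Returning `InvalidIndex` instead of a plain error gives the caller the position it needs to build a malfeasance proof.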
