strangelove-ventures / horcrux
License: Apache License 2.0
A threshold Tendermint signer
Dev (Osmosis) co-authored a handy article that references a 'rust implementation of multiparty Ed25519 signature scheme' repository that supports trustless DKG:
https://medium.com/blockchain-at-berkeley/alternative-signatures-schemes-14a563d9d562
https://github.com/ZenGo-X/multi-party-eddsa
Signing seems to be a blocker for DAO validators. With Interchain Accounts, Juno DAOs can create validator infrastructure on Akash. But how DAOs will securely supply cryptographic keys to the infrastructure for remote signing is not apparent.
I believe the raft-leader console is misrepresenting which nodes have signed. This is a 3-node config. I have validated that each of the private_share_*.json files is on the correct node, and that each of those files is named share.json.
Config on node 3:
home-dir: /home/rhino/.horcrux
chain-id: uni-2
cosigner:
threshold: 2
shares: 3
p2p-listen: tcp://0.0.0.0:2222
raft-listen: 10.254.100.212:2223
peers:
- share-id: 1
p2p-addr: tcp://10.254.100.210:2222
raft-addr: 10.254.100.210:2223
- share-id: 2
p2p-addr: tcp://10.254.100.211:2222
raft-addr: 10.254.100.211:2223
rpc-timeout: 1500ms
chain-nodes:
- priv-val-addr: tcp://10.254.106.191:1234
- priv-val-addr: tcp://10.254.106.192:1234
- priv-val-addr: tcp://10.254.106.193:1234
When that node is the leader, the console shows:
D[2022-02-21|09:13:18.778] I am the raft leader. Managing the sign process for this block module=validator
D[2022-02-21|09:13:18.815] Have threshold peers module=validator
D[2022-02-21|09:13:18.815] Number of eph parts for peer module=validator peer=1 count=1
D[2022-02-21|09:13:18.815] Number of eph parts for peer module=validator peer=3 count=1
D[2022-02-21|09:13:18.831] Received signature from 1 module=validator
D[2022-02-21|09:13:18.834] Received signature from 3 module=validator
The same thing happens for node 2, showing that 1 & 2 have signed.
The swarm seems to be signing fine, so is this just cosmetic? The docs in migrating.md are a bit confusing about the naming of the share.json files, so I could have an incorrect configuration.
This function does not appear to be working properly.
root@ip-10-1-1-10:~# ps -ax |pgrep horcrux
root@ip-10-1-1-10:~# horcrux state set 273596
Error: cannot modify state while horcrux is running
In order to wire in the prometheus metrics #93, we need a debug server so that we can have a single listen address for multiple concerns. The prometheus handler can be included in the mux at the /metrics endpoint.
Here is the implementation in the relayer for reference:
https://github.com/cosmos/relayer/blob/main/internal/relaydebug/debugserver.go
Suggestion by @mark-rushakoff :
It will probably be a copy and paste today, but if we do this on a third utility, then maybe it will be worth extracting a small module to capture all of it
It can be desirable to have the priv key share at a different path than share.json, e.g. on a different mounted filesystem.
Just noticed that go.mod and go test all specify that Go 1.17 is/should be used.
However, the latest release is built with 1.16?
Is this something that should be fixed? Shouldn't the tests and the release build run on the same version?
Requested Feature: The ability to change the active leader.
It would be ideal to be able to issue a command to either rotate to a new leader or temporarily set the leader. This would be used during maintenance operations to maintain uptime.
With horcrux set up correctly, the validator's proposal block always produces the following log message and the block is never committed, which results in a new proposer being selected and slows down the chain.
INF commit is for a block we do not know about; set ProposalBlock=nil commit=C52CB815E75718494DA41B0D49006B7E90597BE39B8A3FF17215596E148F0FC2 commit_round=1 height=2900785 module=consensus proposal=
The following is the log from the signer. We can see that another precommit is voted at round=1.
Nov 15 05:06:16 desmos-m-signer-3-ln-sg horcrux[153734]: I[2021-11-15|05:06:16.569] Signed proposal module=validator node={validator_address} height=2900785 round=0 type=SIGNED_MSG_TYPE_PROPOSAL
Nov 15 05:06:17 desmos-m-signer-3-ln-sg horcrux[153734]: I[2021-11-15|05:06:17.850] Signed vote module=validator node={validator_address} height=2900785 round=0 type=SIGNED_MSG_TYPE_PREVOTE
Nov 15 05:06:19 desmos-m-signer-3-ln-sg horcrux[153734]: I[2021-11-15|05:06:19.131] Signed vote module=validator node={validator_address} height=2900785 round=0 type=SIGNED_MSG_TYPE_PRECOMMIT
Nov 15 05:06:20 desmos-m-signer-3-ln-sg horcrux[153734]: I[2021-11-15|05:06:20.433] Signed vote module=validator node={validator_address} height=2900785 round=1 type=SIGNED_MSG_TYPE_PRECOMMIT
Some users will primarily use horcrux through docker. For those users we should publish an image. Using github actions to push to the github image repository is prob the best way to go here.
Add commands to the CLI like so:
horcrux
  config
    init
    nodes
      add
      remove
    peers
      add
      remove
  state
    pv
      show
      reset (--height)
    share
      show
      reset (--height)
Is there any way to know which sentry is being utilized for signing, so we can use the other sentries for other purposes?
It would be great to have a metric on the raft leader that shows signer_sentry <signer_url_from_chain-nodes> or signer_sentry <chain-nodes_number>. The other (non-leader) signers should show signer_sentry NA or something similar.
Any thoughts?
The code in horcrux/cmd/horcrux/cmd/state.go (lines 33 to 36 in 0ba5fda) assumes that the only error ever returned will be os.IsNotExist, but a huge variety of errors can be returned, especially in tools like this that can be deployed in containers.
Check for other errors when err is not nil and handle them appropriately, like this:
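switch _, err := os.Stat(config.HomeDir); {
case err == nil:
    // The home directory exists; nothing to do.
case os.IsNotExist(err):
    return fmt.Errorf("%s does not exist, initialize config with horcrux config init and try again", config.HomeDir)
default:
    // Any other failure (permissions, I/O, ...) is wrapped and surfaced to the caller.
    return fmt.Errorf("unhandled error checking %s: %w", config.HomeDir, err)
}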
Abstract the local cosigner sign methods into a common interface so that various HSMs can be implemented using the interface
The signer start command is currently entangled and way too large. It includes two separate pieces of functionality, the single signer and the cosigner: https://github.com/strangelove-ventures/horcrux/blob/main/cmd/horcrux/cmd/cosigner.go#L33
This function should be split up into a number of smaller functions and most of the logic should be moved into the signer/ directory.
Update the readme to explain how horcrux works and give brief intro to the CLI tooling
Currently logging in the signer is quiet and not as informative as it could be. We should utilize the tendermint logging library and use its different log.Info, log.Warn, log.Error, and log.Debug levels. Debug should be very verbose and allow for diagnostic looks into the logs, Info should contain much of what is currently being logged, and Warn and Error should be used for errors (Warn for things that can be safely ignored, Error for anything requiring more attention).
We need to simulate taking down individual nodes in the signer cluster and ensure that HA is working as intended. Perhaps a test that takes down each signer for a period of time and verifies continued operation.
I noticed that horcrux is not supported by Zenith.
https://meka.tech/zenith#signing
@danbryan will be chatting with them on October 13th and will ask for details about why it's not supported, and whether there is anything strangelove can do to help.
Currently the config flag points to the config file and the positions of the other files in the ~/.horcrux directory are implied from that. Ideally the flag says --home and the position of the config is implied.
The following are individual tasks that need to be done to get this in proper order for other users to use it:
signer/ directory #12
--config flag to --home flag #14
Q. In your los-alamos setup, when configuring horcrux, you've specified multiple sentries, "tcp://192.168.5.4:31234,tcp://192.168.5.3:31234,tcp://192.168.5.5:31234" (one to many: one signer with multiple sentries), but the migration docs use "tcp://10.168.0.1:1234" (one to one: one signer with one sentry). May I know how this would affect signing or fault tolerance?
$ horcrux config init cosmoshub-4 "tcp://192.168.5.4:31234,tcp://192.168.5.3:31234,tcp://192.168.5.5:31234" -c -p "tcp://signer-1:2222|1,tcp://signer-3:2222|3" -l "tcp://signer-2:2222" -k "/root/share/share.json" -t 2 --timeout 1500ms
$ horcrux config init {my_chain_id} "tcp://10.168.0.1:1234" -c -p "tcp://10.168.1.2:2222|2,tcp://10.168.1.3:2222|3" -l "tcp://10.168.1.1:2222" -t 2 --timeout 1500ms
https://prometheus.io/docs/guides/go-application/
The following are ideas for metrics to be tracked:
(simd q slashing signing-info)
Is there a secure procedure or set of instructions for migrating an already working cluster from 2.0.0-rc3 to 2.0.0? (Doing it without downtime would also be great.)
Thanks
We have a horcrux setup with 3 signers up and running, but the proposal block cannot be committed (#30).
Now that the Raft feature is implemented, I added "raft-listen" and "raft-addr" to config.yaml, but the horcrux service is unable to sign votes and logs the error below:
Feb 07 01:52:08 infra-m-signer-3-ln-sg horcrux[885633]: D[2022-02-07|01:52:08.569] I am not the raft leader. Proxying request to the leader module=validator
Feb 07 01:52:08 infra-m-signer-3-ln-sg horcrux[885633]: E[2022-02-07|01:52:08.569] Failed to sign vote module=validator address=tcp://192.46.230.248:1234 error="unable to find leader cosigner from address 66.228.50.77:10023" vote_type=SIGNED_MSG_TYPE_PRECOMMIT height=4114554 round=0 validator=160D942A71109538206E0A897DEBDBF8DDB06BD4
I have a question about https://github.com/strangelove-ventures/lens. Could you please reach out to me at
[email protected]
Automated releases and versioning for horcrux
I think this project is fantastic and am a huge fan of everything happening here in the way of improvements and usability.
Another improvement I think would be awesome is to make the timeout for the cosigner rpc server configurable (currently hardcoded to 1 second). Depending on the latency between the cosigners a value of 1200-1500ms drastically reduces the number of missed signatures for cosigners located in different regions/DCs.
horcrux/signer/cosigner_rpc_server.go
Line 127 in 547bf1b
We are going to implement a raft cluster for leader election. Only the leader will be able to submit signatures to sentries and all nodes will still maintain a high watermark HRS file outside consensus for protection against node take over (i.e. in the case where an attacker gains access to a single node in the cluster the attacker would only be able to cause a service disruption not a double sign event).
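To illustrate the high watermark idea in the paragraph above, here is a rough in-memory sketch. The HRS and Watermark types, their field names, and the strict ordering rule are assumptions for illustration only; horcrux's actual state handling, including persisting the watermark to a file, is not shown.

```go
package main

import "fmt"

// HRS is a height/round/step triple. Field names are illustrative.
type HRS struct {
	Height int64
	Round  int64
	Step   int8
}

// After reports whether h is strictly later than other, comparing
// height first, then round, then step.
func (h HRS) After(other HRS) bool {
	if h.Height != other.Height {
		return h.Height > other.Height
	}
	if h.Round != other.Round {
		return h.Round > other.Round
	}
	return h.Step > other.Step
}

// Watermark (hypothetical type) tracks the highest HRS signed so far.
type Watermark struct {
	last HRS
}

// Advance records next and returns nil only when next is strictly
// beyond the current watermark; otherwise it refuses the request.
func (w *Watermark) Advance(next HRS) error {
	if !next.After(w.last) {
		return fmt.Errorf("refusing to sign %+v: not beyond high watermark %+v", next, w.last)
	}
	w.last = next
	return nil
}

func main() {
	var w Watermark
	fmt.Println(w.Advance(HRS{Height: 10, Round: 0, Step: 2})) // accepted
	fmt.Println(w.Advance(HRS{Height: 10, Round: 0, Step: 2})) // rejected: same HRS again
	fmt.Println(w.Advance(HRS{Height: 10, Round: 1, Step: 3})) // accepted: later round
}
```

Because each node would check its own watermark before contributing, a single compromised node could stall signing but could not obtain a second signature for an HRS the others have already passed.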
We plan to use https://github.com/strangelove-ventures/raft-sample as a basis for the implementation i.e. using hashicorp/raft and a database backend (badgerdb?). Data to be stored in consensus:
Add test case and docs for validating multiple networks from one signer cluster
For chains which have Ethereum bridge relayers, like peggo, where do we set that up? Things like peggo need to be set up on a validator, but in the horcrux setup the sentries are just full nodes. So where should we run relayers like peggo?
See the comment at horcrux/.github/workflows/release.yml, line 23 in 8cc8fdd.
Also see gorelease.yaml for additional pointers here.
The above is with respect to getting CI to produce the containers automatically. This version will need to be pulled from the go mod file when running the tests. Either the runtime package or the mod package from the stdlib should get us there.
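One way the runtime route could look, assuming the version in question is that of a module dependency recorded in go.mod: runtime/debug.ReadBuildInfo exposes the dependency versions embedded in a binary built with module support. The helper name moduleVersion and the module path passed to it are illustrative. Note that golang.org/x/mod lives outside the standard library, so this sketch sticks to runtime/debug.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// moduleVersion (hypothetical helper) looks up the version of a
// dependency from the build information embedded in the running binary.
// It returns false when build info is unavailable or the module is not
// among the dependencies.
func moduleVersion(path string) (string, bool) {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return "", false
	}
	for _, dep := range info.Deps {
		if dep.Path != path {
			continue
		}
		// Honor a replace directive if one is in effect.
		if dep.Replace != nil {
			return dep.Replace.Version, true
		}
		return dep.Version, true
	}
	return "", false
}

func main() {
	// The module path here is only an example of a dependency to query.
	v, ok := moduleVersion("github.com/tendermint/tendermint")
	fmt.Println(v, ok)
}
```

The lookup only sees modules that are actually linked into the binary, so a test would need to import the dependency whose version it wants to read.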
I've tried what was suggested in the metrics document, adding debug-listen-address: 0.0.0.0:6001 to the config, but this does not seem to enable metrics. Can you please help resolve this?
Thanks!
The help text for this command should mention that the user can pass a bech32 prefix to get their valcons address.
$ horcrux cosigner address --help
Get public key hex address and valcons address
Usage:
horcrux cosigner address [flags]
Flags:
-h, --help help for address
Global Flags:
--home string Directory for config and data (default is $HOME/.horcrux)
It should look something like this:
$ horcrux cosigner address --help
Get public key hex address and valcons address
Usage:
horcrux cosigner address [optional_bech32_prefix] [flags]
Flags:
-h, --help help for address
Global Flags:
--home string Directory for config and data (default is $HOME/.horcrux)
Pick a linter and add it. Look at the SDK, Tendermint, and ibc-go as examples.
After the Raft leader crashed, the cluster did not elect a new leader.
The issue happened after the signer's disk became full.
Sep 13 13:09:42 signer-1 horcrux[51881]: panic: write /srv/injective/.horcrux/state/write-file-atomic-238428084346281919: no space left on device
Sep 13 13:09:42 signer-1 horcrux[51881]: goroutine 14099922 [running]:
Sep 13 13:09:42 signer-1 horcrux[51881]: github.com/strangelove-ventures/horcrux/signer.(*SignState).save(0xc000034b80)
Sep 13 13:09:42 signer-1 horcrux[51881]: /home/runner/work/horcrux/horcrux/signer/sign_state.go:158 +0x119
Sep 13 13:09:42 signer-1 horcrux[51881]: github.com/strangelove-ventures/horcrux/signer.(*SignState).Save.func1(0xc000034b80)
Sep 13 13:09:42 signer-1 horcrux[51881]: /home/runner/work/horcrux/horcrux/signer/sign_state.go:137 +0x2b
Sep 13 13:09:42 signer-1 horcrux[51881]: created by github.com/strangelove-ventures/horcrux/signer.(*SignState).Save
Sep 13 13:09:42 signer-1 horcrux[51881]: /home/runner/work/horcrux/horcrux/signer/sign_state.go:136 +0x30c
Removing the PID file and freeing some disk space allowed the service to restart.
Test the following topologies:
Test the following failure modes for the above topologies:
I don't have proper logs of this, but can attempt to do a better job if it happens again. Here's a screenshot:
After I purposefully kill a node (non-leader), and bring that node back in, occasionally it will be a flood of these creating/reaping snapshots across all nodes. I don't believe it's signing during this flood, at least from quick perusal of the logs. If I remove a node (any of the 3 in this case), it stops. When I bring a node back, this same snapshot flood continues until a node is killed.
The only way I've been able to make this stop is to force quit all nodes, and bring them up simultaneously.
If it's helpful: debian 11, 4 cpu, 8gb ram each. These are all in the same rack of hardware, so 0 latency between nodes. If I can help troubleshoot this please let me know.
Hi.
I'm doing a from-scratch install and trying to follow the instructions.
wget https://github.com/strangelove-ventures/horcrux/releases/download/v2.0.0-rc1/horcrux_2.0.0-rc1_linux_amd64.tar.gz
tar xfz horcrux_2.0.0-rc1_linux_amd64.tar.gz
mkdir bin
mv horcrux bin
mkdir /home/horcrux/.horcrux
horcrux config init ${CHAIN_ID} "tcp://${v_terra_hel_2}:1234" -c -p "tcp://${v_terra_hel_1}:2222|1,tcp://${v_terra_fsn_1}:2222|3" -l "tcp://${v_terra_hel_2}:2222" -t 2 --timeout 1500ms
produces the following
panic: open /home/horcrux/.horcrux/state/tcp:/write-file-atomic-07061320272504022285: no such file or directory
goroutine 1 [running]:
github.com/strangelove-ventures/horcrux/signer.(*SignState).save(0xc0000c6600)
/home/runner/work/horcrux/horcrux/signer/sign_state.go:158 +0x119
github.com/strangelove-ventures/horcrux/signer.LoadOrCreateSignState(0xc0000fe5a0, 0x4f, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/work/horcrux/horcrux/signer/sign_state.go:282 +0x1a7
github.com/strangelove-ventures/horcrux/cmd/horcrux/cmd.initCmd.func1(0xc0000fa280, 0xc0000f8c80, 0x1, 0xa, 0x0, 0x0)
/home/runner/work/horcrux/horcrux/cmd/horcrux/cmd/config.go:144 +0x53d
github.com/spf13/cobra.(*Command).execute(0xc0000fa280, 0xc0000f8b40, 0xa, 0xa, 0xc0000fa280, 0xc0000f8b40)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:856 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0x16e6160, 0x4012f0, 0x0, 0x0)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:902
github.com/strangelove-ventures/horcrux/cmd/horcrux/cmd.Execute()
/home/runner/work/horcrux/horcrux/cmd/horcrux/cmd/root.go:28 +0x2d
main.main()
/home/runner/work/horcrux/horcrux/cmd/horcrux/main.go:21 +0x25
I'm unsure if it matters, but this is with terrad / rc-1
For Mekatek signing on horcrux, we need to add support for the additional handlers.
Reported by pom dapie in the horcrux telegram channel:
horcrux elect
multi:///2001:bc8:***:**::1:3333,2a02:c206:****:****::1:3333,2a01:***:**:****:2::115:3333
Error: rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = "transport: Error while dialing dial tcp: address 2a01:***:**:****:2::115:3333: too many colons in address"
There is some dead code in the repo right now. Go through and remove unused files/code.
$ horcrux version
{
"version": "2.0.0-rc3",
"commit": "0f7cbf5eef6ab43cf22bc6b1ed5cde210605205a",
"go_version": "go1.16.15 linux/amd64",
"cosmos_sdk_version": "v0.44.5",
"tendermint_version": "v0.34.14"
}
2nd & 3rd node are running
$ horcrux version
{
"version": "2.0.0",
"commit": "0ba5fda1d49ee18b2a452e94f8a9f29180776fd5",
"go_version": "go1.16.15 linux/amd64",
"cosmos_sdk_version": "v0.44.5",
"tendermint_version": "v0.34.14"
}
1st node shows
LimitNOFILE=4096
Sep 14 18:05:24 n-hel-7 horcrux[1224842]: panic: open /home/horcrux_phoenix-1-xxx/.horcrux/state/write-file-atomic-08132135906461776673: too many open files
Sep 14 18:05:24 n-hel-7 horcrux[1224842]: goroutine 20740 [running]:
Sep 14 18:05:24 n-hel-7 horcrux[1224842]: github.com/strangelove-ventures/horcrux/signer.(*SignState).save(0xc000e90598)
Sep 14 18:05:24 n-hel-7 horcrux[1224842]: /home/runner/work/horcrux/horcrux/signer/sign_state.go:158 +0x119
Sep 14 18:05:24 n-hel-7 horcrux[1224842]: github.com/strangelove-ventures/horcrux/signer.(*SignState).Save.func1(0xc000e90598)
Sep 14 18:05:24 n-hel-7 horcrux[1224842]: /home/runner/work/horcrux/horcrux/signer/sign_state.go:137 +0x2b
Sep 14 18:05:24 n-hel-7 horcrux[1224842]: created by github.com/strangelove-ventures/horcrux/signer.(*SignState).Save
Sep 14 18:05:24 n-hel-7 horcrux[1224842]: /home/runner/work/horcrux/horcrux/signer/sign_state.go:136 +0x30c
Sep 14 18:05:24 n-hel-7 systemd[1]: horcrux_phoenix-1-flipside.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 14 18:05:24 n-hel-7 systemd[1]: horcrux_phoenix-1-flipside.service: Failed with result 'exit-code'.
Sep 14 18:05:24 n-hel-7 systemd[1]: horcrux_phoenix-1-flipside.service: Consumed 2.116s CPU time.
2nd node just repeatedly shows (and shows 4/800%cpu on HTOP)
LimitNOFILE=infinity (was 4096)
Sep 14 18:10:52 n-ovh-lim-02 horcrux[3070]: D[2022-09-14|18:10:52.403] I am not the raft leader. Proxying request to the leader module=validator
Sep 14 18:10:52 n-ovh-lim-02 horcrux[3070]: D[2022-09-14|18:10:52.403] I am not the raft leader. Proxying request to the leader module=validator
Sep 14 18:10:52 n-ovh-lim-02 horcrux[3070]: D[2022-09-14|18:10:52.404] I am not the raft leader. Proxying request to the leader module=validator
Sep 14 18:10:52 n-ovh-lim-02 horcrux[3070]: D[2022-09-14|18:10:52.404] I am not the raft leader. Proxying request to the leader module=validator
Sep 14 18:10:52 n-ovh-lim-02 horcrux[3070]: D[2022-09-14|18:10:52.405] I am not the raft leader. Proxying request to the leader module=validator
3rd node behaving properly (by the looks of it)
LimitNOFILE=4096 in .service file.
Sep 14 18:13:30 n-ovh-war-03 horcrux[577736]: I[2022-09-14|18:13:30.702] Signed vote module=validator node=tcp://127.0.0.1:17097 height=1572925 round=0 type=SIGNED_MSG_TYPE_PRECOMMIT
Sep 14 18:13:36 n-ovh-war-03 horcrux[577736]: D[2022-09-14|18:13:36.338] I am the raft leader. Managing the sign process for this block module=validator
Sep 14 18:13:36 n-ovh-war-03 horcrux[577736]: D[2022-09-14|18:13:36.416] Have threshold peers module=validator
Sep 14 18:13:36 n-ovh-war-03 horcrux[577736]: D[2022-09-14|18:13:36.416] Number of eph parts for peer module=validator peer=2 count=1
Sep 14 18:13:36 n-ovh-war-03 horcrux[577736]: D[2022-09-14|18:13:36.416] Number of eph parts for peer module=validator peer=1 count=1
Sep 14 18:13:36 n-ovh-war-03 horcrux[577736]: D[2022-09-14|18:13:36.426] Received signature from 1 module=validator
Sep 14 18:13:36 n-ovh-war-03 horcrux[577736]: D[2022-09-14|18:13:36.485] Received signature from 2 module=validator
Sep 14 18:13:36 n-ovh-war-03 horcrux[577736]: D[2022-09-14|18:13:36.485] Done waiting for cosigners, assembling signatures module=validator
Sep 14 18:13:36 n-ovh-war-03 horcrux[577736]: I[2022-09-14|18:13:36.508] Signed vote module=validator node=tcp://127.0.0.1:17097 height=1572926 round=0 type=SIGNED_MSG_TYPE_PREVOTE
This doc should contain:
~/.horcrux directory, in addition to warnings about the state files
There are many types defined in the cmd/horcrux/cmd directory which can't be integrated into other programs.
For example:
https://github.com/strangelove-ventures/horcrux/blob/main/cmd/horcrux/cmd/config.go#L543
type CosignerConfig struct {
    Threshold int            `json:"threshold" yaml:"threshold"`
    Shares    int            `json:"shares" yaml:"shares"`
    P2PListen string         `json:"p2p-listen" yaml:"p2p-listen"`
    Peers     []CosignerPeer `json:"peers" yaml:"peers"`
    Timeout   string         `json:"rpc-timeout" yaml:"rpc-timeout"`
}
This struct can't be referenced from a third-party program.
For example, we had to recreate the same struct in our own program in order to use the same structures.
https://github.com/chillyvee/precrux/blob/master/snitch/generate.go#L22
Is it acceptable to extract all these types into https://github.com/strangelove-ventures/horcrux/horcrux/types.go
Or something else like that?
If so we can start to work on a PR. Guidance is welcome.
Is Oasis Network (ROSE) supported, or could it be supported?
Thanks
We have about a year for this but good to get it up in the issues