w3f / polkadot-validator-setup

Polkadot Validator Secure Setup

License: Apache License 2.0

Python 7.36% JavaScript 40.21% HCL 30.81% Shell 4.97% Jinja 16.65%
polkadot polkadot-validator validator blockchain substrate terraform ansible devops vpn wireguard

polkadot-validator-setup's Introduction

NOTE: this repository isn't actively maintained

Polkadot Validator Setup

This repo describes a potential setup for a Polkadot or Kusama validator that aims to prevent some types of potential attacks at the TCP layer and below. The Workflow section describes the Platform Layer and the Application Layer in more detail.

Usage

There are two ways of using this repository:

  • Platform & Application Layer

    Configure credentials for infrastructure providers such as AWS, Azure, GCP, DigitalOcean, and/or Packet, then execute the Terraform process to automatically deploy the required machines (Platform Layer) and set up the Application Layer.

    See the Complete Guide for more.

  • Application Layer

    Set up Debian-based machines yourself (they only need basic SSH access) and configure them in an inventory. The Ansible scripts will then set up the entire Application Layer.

    See the Ansible Guide for more.

Structure

The secure validator setup is composed of one or more validators that run with a local instance of NGINX as a reverse TCP proxy in front of them. The validators are instructed to:

  • advertise themselves with the public IP of the node and the port where the reverse proxy is listening.
  • bind to the localhost interface, so that they only allow incoming connections from the proxy.

The setup also configures a firewall in which the default p2p port is closed for incoming connections and only the proxy port is open.
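The arrangement above can be sketched with NGINX's stream module (a hypothetical sketch; the port numbers and exact directives in this repo's templates may differ):

```nginx
# Reverse TCP proxy in front of a validator (illustrative ports).
stream {
    server {
        # Public proxy port -- the only p2p-facing port open in the firewall.
        listen 10000;
        # Forward raw TCP to the validator's p2p port bound on localhost.
        proxy_pass 127.0.0.1:30333;
    }
}
```

The validator itself would then advertise the proxy port, e.g. with flags along the lines of --listen-addr /ip4/127.0.0.1/tcp/30333 --public-addr /ip4/&lt;public-ip&gt;/tcp/10000.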

Workflow

The secure validator setup is structured in two layers, an underlying platform and the applications that run on top of it.

Platform Layer

Validators are created using the Terraform modules located in the terraform directory. We have created code for several providers, but it is possible to add new ones; please reach out if you are interested in a provider that is currently not available.

Besides the machines themselves, the Terraform modules create the minimum networking infrastructure required for adding firewall rules to protect the nodes.

Application Layer

This is done through the Ansible playbook and the polkadot-validator role located in the ansible directory. The role performs these actions:

  • Set up the software firewall; on the validator only the proxy, SSH and, if enabled, node-exporter ports are allowed.
  • Configure journald to tune log storage.
  • Create the polkadot user and group.
  • Configure the NGINX proxy.
  • Set up the polkadot service, including the binary download.
  • Manage Polkadot sessions, creating session keys if they are not present.
  • Set up node-exporter if the configuration includes it.
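As an illustration, an Application Layer inventory for this role might look like the sketch below (enable_reverse_proxy, polkadot_network_id and the polkadot_additional_*_flags variables all appear elsewhere on this page; the group name, host and values are made up):

```ini
# Hypothetical Ansible inventory sketch -- names and values are illustrative.
[validator]
203.0.113.10

[validator:vars]
enable_reverse_proxy=true
polkadot_network_id=ksmcc3
polkadot_additional_common_flags=''
polkadot_additional_validator_flags=''
```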

Note about upgrades from the sentries setup

The current version of polkadot-secure-validator no longer supports creating and configuring sentry nodes. Although the Terraform files and Ansible roles of this latest version can be applied to setups created with previous versions, the validators will be configured to work without sentries and to connect to the network through the local reverse proxy instead.

If you created sentries with a previous version of this tool through Terraform, following the complete workflow, they will not be deleted automatically when you run this new version. In short, the old sentries will no longer be used by the validators, and it is up to you to remove them manually.

polkadot-validator-setup's People

Contributors

bjweaver, dependabot[bot], doubleotheven, drskalman, eantones, fgimenez, gavofyork, ironoa, kmhagan, knowerlittle, krzysztof-jelski, lamafab, ltfschoen, mmagician, mxinden, pampatzoglou, paradox-tt, remohammadi, rngkll, tripleight, wpank, zadmarbella


polkadot-validator-setup's Issues

"yarn sync" command erroring

Hi all,

I'm following the README instructions, but when I run "yarn sync -c config/main.json," I get the following error:

$ yarn sync -c config/main.json
yarn run v1.19.1
$ node . sync -c config/main.json
Syncing platform...
Could not sync platform: EEXIST: file already exists, mkdir '/home/tommy/.config/polkadot-secure-validator/build/w3f/terraform/remote-state0'
error Command failed with exit code 255.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Removing '/home/tommy/.config/polkadot-secure-validator' does not fix this problem.

Any advice would be appreciated!

Ansible looks in the wrong place for python

Setup

I'm using this tool with GCP & AWS sentries and a Packet validator.

Issue

When executing the Ansible playbook, the GCP & AWS instances throw "/bin/sh: 1: /usr/bin/python: not found" because Python's binary is located at /usr/bin/python3.

Temporary Solution

I manually SSHed into the servers and moved the binary location.

Permanent Solution

Ansible needs to confirm the location of Python before executing /usr/bin/python. I will try to make a pull request with this fix over the next few days.
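A common Ansible-side fix (not necessarily the one the pull request ended up taking) is to point the interpreter at python3 through an inventory variable; ansible_python_interpreter is a standard Ansible connection setting:

```ini
# Force Ansible to use python3 on hosts where /usr/bin/python is absent
# (group name is illustrative).
[sentries:vars]
ansible_python_interpreter=/usr/bin/python3
```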

digitalocean deployment fails with Could not retrieve the list of available versions for provider hashicorp/digitalocean

I guess the provider should actually be
https://registry.terraform.io/providers/digitalocean/digitalocean/latest
but how to change that?

Initializing provider plugins...

- Finding hashicorp/digitalocean versions matching "~> 1.16"...

- Finding hashicorp/google versions matching "~> 2.15"...

- Installing hashicorp/google v2.20.3...

- Installed hashicorp/google v2.20.3 (signed by HashiCorp)

╷
│ Warning: Version constraints inside provider configuration blocks are deprecated
│
│   on backend.tf line 3, in provider "google":
│    3:   version     = "~>2.15"
│
│ Terraform 0.13 and earlier allowed provider version constraints inside the
│ provider configuration block, but that is now deprecated and will be
│ removed in a future version of Terraform. To silence this warning, move the
│ provider version constraint into the required_providers block.
│
│ (and one more similar warning elsewhere)
╵


╷
│ Error: Failed to query available provider packages
│
│ Could not retrieve the list of available versions for provider
│ hashicorp/digitalocean: provider registry registry.terraform.io does not
│ have a provider named registry.terraform.io/hashicorp/digitalocean
│
│ Did you intend to use digitalocean/digitalocean? If so, you must specify
│ that source address in each module which requires that provider. To see
│ which modules are currently depending on hashicorp/digitalocean, run the
│ following command:
│     terraform providers
╵


Command execution failed with code: 1
(node:2894) UnhandledPromiseRejectionWarning: Error: 1
    at ChildProcess.<anonymous> (/mnt/d/integritee/polkadot-validator-setup/src/lib/cmd.js:45:18)
    at ChildProcess.emit (events.js:376:20)
    at maybeClose (internal/child_process.js:1055:16)
    at Socket.<anonymous> (internal/child_process.js:441:11)
    at Socket.emit (events.js:376:20)
    at Pipe.<anonymous> (net.js:673:12)
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2894) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:2894) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Done in 9.95s.
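Regarding the "how to change that?" question above: on Terraform 0.13+ the usual fix is to declare the provider's source address in a required_providers block, so it resolves to digitalocean/digitalocean rather than the default hashicorp/ namespace (a sketch; it needs to be added to each module that uses the provider):

```hcl
# Sketch: pin the DigitalOcean provider to its real registry namespace.
terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 1.16"
    }
  }
}
```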

Adding s3 backend for the Terraform

Hello, according to this issue there's a default Terraform backend configured for GCP. So I tried to change it to an AWS S3 bucket, since I am going to deploy to AWS. After some struggling I managed to run ./scripts/deploy.sh without an error, but it gets stuck on a command-line input where it asks for a key again (although I have it in the config). However, terraform.tfstate is created inside the bucket, so the connection to AWS is working. I am not really skilled with Terraform, so I would appreciate any help here :-)
This is the terraform/remote-state/main.tf:

terraform {
  backend "s3" {
    bucket = "polkadotters"
    key    = "terraform/state/terraform.tfstate"
    region = "eu-central-1"
  }
}

This is the output I see; I can press enter as many times as I want and nothing happens. Thanks a lot for looking at this.

Nginx proxy not being created?

Running the Ansible setup script on Ubuntu 18. It runs well, except it doesn't look like the NGINX config proxying the p2p port (30333) is being set up. Instead, polkadot is listening on 127.0.0.1 (--listen-addr=/ip4/127.0.0.1/tcp/30333). I like the idea of proxying through NGINX. Is the documentation not correct, or am I missing something in our inventory?

Add prometheus r-proxy

Hi, thanks for this fantastic code! It really makes our daily tasks more comfortable.

I would add a new reverse proxy, with almost the same tasks, for the Prometheus daemon port (9615).

  • New task in firewall.yml
  • New var for prometheus public port
  • New nginx server block on proxy.conf.j2 template

I am not sure if this should also use the basic-auth method, or if we could allow a "from IP" rule at the firewall level.

Monitoring

Federico!! Big fan of yours, @fgimenez!

Great collection of Ansible / Terraform code. It is really amazing. I am using part of it, so thanks a million!
Is there any plan to include some monitoring? Prometheus or similar,
I can't find it.
Regards

failing at : TASK [polkadot-validator : check if keys already exist]

With Ansible, it should check whether the keys are already in the validator's keystore:

fatal: [IP-Address]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'polkadot_network_id' is undefined\n\nThe error appears to be in '/polkadot-secure-validator/ansible/roles/polkadot-validator/tasks/main.yml': line 19, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: check if keys already exist\n ^ here\n"}

Add reserved nodes for public nodes

With more nodes coming online as the Kusama launch approaches, and the validator node programmed to connect only to the reserved nodes, the public nodes will serve anybody, which means they sometimes close their connection to the validator. I'm on different server providers, and with --reserved-nodes set on the public nodes, reserving a slot for the validator, it works much more stably. Without this it sometimes drops to 1 peer even though 4 are available.

As more nodes enter the network, this is necessary for a stable validator.

compiling wireguard kernel module failed

polkadot-secure-validator installs the headers for the current kernel only. After a kernel update, the WireGuard kernel module will be compiled automatically by DKMS; this will fail if the appropriate kernel headers are not installed. The best way, I think, is to install the linux-headers-generic package, which always depends on the package containing the current kernel headers.

hint:

  • this is true for kernels older than 5.6 (5.6 and later include the WireGuard kernel module)
  • the LTS kernel in Ubuntu 20.04 supports WireGuard too

Task Polkadot-validator : initialize server keys fails

I am running Ansible to secure the validator as instructed, but it fails with the message below:

TASK [polkadot-validator : initialize server keys] *************************************************************************************************************
fatal: [IP_ADDRESS]: FAILED! => {"changed": false, "content": "", "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "http://localhost:9933"}
to retry, use: --limit @/root/blockchain/ksm_validators/ansible/main.retry

I don't know what this error is or how it can be resolved. Please help.

Thank you.
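Connection refused on http://localhost:9933 usually means the polkadot service hasn't started, or isn't listening on the RPC port yet, so checking the service status and logs is the first step. For reference, the session-key step boils down to a JSON-RPC call against that endpoint; author_rotateKeys is the standard Substrate method, though the exact call this role makes may differ. The sketch below only constructs the payload, it does not contact a node:

```python
import json

def rotate_keys_request() -> bytes:
    """Build the JSON-RPC payload a session-key rotation would POST to
    the node's RPC endpoint (http://localhost:9933 by default)."""
    payload = {"jsonrpc": "2.0", "id": 1,
               "method": "author_rotateKeys", "params": []}
    return json.dumps(payload).encode()

if __name__ == "__main__":
    print(rotate_keys_request().decode())
```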

Additional Telemetry link

I am using this tool for deploying and upgrading my Kusama nodes. For a specific node I need just one extra input for an extra telemetry-url; for matching telemetry URLs I can use the polkadot_additional_validator_flags variable.

If I have the following:

telemetryUrl=wss://mi.private.telemetry.backend/
telemetryUrl=wss://mi.private.telemetry.backend2/

It just picks up backend2 and leaves out the first.
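In an ini-style inventory a repeated key simply overwrites the earlier value, which is why only backend2 survives. One workaround (a sketch, assuming the role passes the variable through to the binary verbatim) is to put both endpoints into the additional-flags variable, since the polkadot binary accepts --telemetry-url more than once:

```ini
# Sketch: two telemetry endpoints as repeated --telemetry-url flags
# (the trailing 0 is the telemetry verbosity level expected after each URL).
polkadot_additional_validator_flags='--telemetry-url "wss://mi.private.telemetry.backend/ 0" --telemetry-url "wss://mi.private.telemetry.backend2/ 0"'
```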

Running a deployment with a pass-phrased SSH key

I tried to use Terraform with SSH keys protected by a passphrase. However, the build always failed with Could not sync platform: Could not convert private key from PEM; PEM is encrypted. - which seems to come from some low-level JS library. Is it even supported, or should I only use passphrase-less keys?

Provider google present in backend.tf (terraform)

Hello, thank you for the work.

Checking the configuration of the Terraform section, I saw that the "google" provider appears in several providers that are not GCP.

% grep -irl "google" terraform/*
terraform/**aws**/backend.tf
terraform/**azure**/backend.tf
terraform/gcp/output.tf
terraform/gcp/main.tf
terraform/gcp/provider.tf
terraform/packet/backend.tf
terraform/remote-state/main.tf

deployment problem

When I try to deploy the secure validator project on Google Cloud, it shows me the following error after creating the instances... any idea what causes it?

Apply complete! Resources: 4 added, 0 changed, 0 destroyed.

Outputs:

ip_address = [
"34.74.221.199",
]

["34.74.221.199"]

["34.73.9.214"]

Done
Syncing application...
node:events:304
throw er; // Unhandled 'error' event
^

Error: spawn ansible-playbook ENOENT
at Process.ChildProcess._handle.onexit (node:internal/child_process:269:19)
at onErrorNT (node:internal/child_process:465:16)
at processTicksAndRejections (node:internal/process/task_queues:80:21)
Emitted 'error' event on ChildProcess instance at:
at Process.ChildProcess._handle.onexit (node:internal/child_process:275:12)
at onErrorNT (node:internal/child_process:465:16)
at processTicksAndRejections (node:internal/process/task_queues:80:21) {
errno: -2,
code: 'ENOENT',
syscall: 'spawn ansible-playbook',
path: 'ansible-playbook',
spawnargs: [
'main.yml',
'-f',
'30',
'-i',
]
}
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Invalid PEM formatted message

I have exported env variables

TF_VAR_do_token
GOOGLE_APPLICATION_CREDENTIALS
SSH_ID_RSA_PUBLIC
SSH_ID_RSA_VALIDATOR

and the ssh keys were generated with empty passphrase using ssh-keygen -m PEM -f <path> and added to ssh-agent
(I wasn't able to use keys protected by a passphrase; that caused another error...)

yarn run v1.15.2
$ node . sync -c config/main.json
Syncing platform...

Initializing the backend...


Initializing provider plugins...

- Finding hashicorp/google versions matching "~> 2.15"...

- Installing hashicorp/google v2.20.3...

- Installed hashicorp/google v2.20.3 (signed by HashiCorp)


Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.



╷
│ Warning: Version constraints inside provider configuration blocks are deprecated
│
│   on main.tf line 3, in provider "google":
│    3:   version     = "~>2.15"
│
│ Terraform 0.13 and earlier allowed provider version constraints inside the
│ provider configuration block, but that is now deprecated and will be
│ removed in a future version of Terraform. To silence this warning, move the
│ provider version constraint into the required_providers block.
╵

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.


Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_storage_bucket.imagestore will be created
  + resource "google_storage_bucket" "imagestore" {
      + bucket_policy_only = (known after apply)
      + force_destroy      = true
      + id                 = (known after apply)
      + location           = "US"
      + name               = "integritee-sv-tf-state"
      + project            = (known after apply)
      + self_link          = (known after apply)
      + storage_class      = "STANDARD"
      + url                = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

google_storage_bucket.imagestore: Creating...

╷
│ Warning: Version constraints inside provider configuration blocks are deprecated
│
│   on main.tf line 3, in provider "google":
│    3:   version     = "~>2.15"
│
│ Terraform 0.13 and earlier allowed provider version constraints inside the
│ provider configuration block, but that is now deprecated and will be
│ removed in a future version of Terraform. To silence this warning, move the
│ provider version constraint into the required_providers block.
╵

╷
│ Error: googleapi: Error 409: You already own this bucket. Please select another name., conflict
│
│   with google_storage_bucket.imagestore,
│   on main.tf line 6, in resource "google_storage_bucket" "imagestore":
│    6: resource "google_storage_bucket" "imagestore" {
│
╵

Command execution failed with code: 1
Allowed error creating state backend: 1
Could not sync platform: Invalid PEM formatted message.
error Command failed with exit code 255.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Cannot Finish the main.yml without hanging/freezing

I'm a standard user of this tool to upgrade Kusama Nodes. However it's not working for me at the moment when trying to upgrade nodes.

It freezes and hangs at the following task (though not only this one). I'm using additional flags:

polkadot_additional_common_flags=''
polkadot_additional_validator_flags=''

Starting it with the following command: ansible-playbook main.yml -i stakerspace.inventory

This is the message it stays stuck at for longer than 10-15 minutes:

TASK [polkadot-common : cronjob for restarting polkadot service] **************************************************************************************************************************************************
task path: /home/ilhan/dev/polkadot-secure-validator/ansible/roles/polkadot-common/tasks/main.yml:93
skipping: [redacted-IP] => {
    "changed": false, 
    "skip_reason": "Conditional result was False"
}
META: ran handlers
META: ran handlers

Terraform for OVH VPS

Hi all,
I'm wondering if it would be possible to add Terraform support for the OVH provider?
Let me know if I can be of any help.

Rename Repository

The name "secure validator" implies that this sets up everything needed for an entirely secure setup. This is not necessarily the case; this repo should be used as a template, to be further adapted to one's individual setup.

I would propose renaming this to polkadot-validator-template

Add Digital Ocean VM Option

Creating a new Droplet should be doable with one authenticated curl call.

Eg curl -X POST -H "Content-Type: application/json" -H "Authorization: Bearer b7d03a6947b217efb6f3ec3bd3504582" -d '{"name":"example.com","region":"nyc3","size":"s-1vcpu-1gb","image":"ubuntu-16-04-x64","ssh_keys":[107149],"backups":false,"ipv6":true,"user_data":null,"private_networking":null,"volumes": null,"tags":["web"]}' "https://api.digitalocean.com/v2/droplets"

Or using terraform.

And then simply apply the Ansible script.

AWS SSH Timeout

After running yarn sync -c config/main.json the build boots up everything in AWS. I am able to ssh into the box for a short period, but then I get an ssh timeout error when I try again. The yarn task also fails with a timeout with this as the last log message:

TASK [Wait for nodes to become reachable] **************************************

Here is my config/main.json

{
  "project": "kusama",
  "polkadotBinary": {
    "url": "https://github.com/paritytech/polkadot/releases/download/v0.8.30/polkadot",
    "checksum": "sha256:9dddd2ede827865c6e81684a138b0f282319e07f717c166b92834699f43274cd"
  },
  "nodeExporter": {
    "enabled": true,
    "binary": {
      "url": "https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz",
      "checksum": "sha256:3369b76cd2b0ba678b6d618deab320e565c3d93ccb5c2a0d5db51a53857768ae"
    }
  },
  "polkadotRestart": {
    "enabled": true,
    "minute": "50",
    "hour": "4,12,20"
  },
  "chain": "kusama",
  "polkadotNetworkId": "ksmcc3",
  "state": {
    "project": "kusama"
  },
"validators": {
    "telemetryUrl": "wss://telemetry-backend.w3f.community/submit",
    "additionalFlags": "--unsafe-pruning --pruning 1000 --execution=native",
    "dbSnapshot": {
      "url": "https://ksm-rocksdb.polkashots.io/kusama-7595617.RocksDb.7z",
      "checksum": "sha256:6159d3e3790f00455cd3dcc9c8238e7af07762d1a0d9956d9f407d5d22db0784"
    },
    "loggingFilter": "sync=trace,afg=trace,babe=debug",
    "nodes": [
      {
        "provider": "aws",
        "machineType": "t2.large",
        "count": 1,
        "location": "us-east-1",
        "zone": "us-east-1a",
        "projectId": "kusama-315119",
        "nodeName": "node-man",
        "sshUser": "admin",
        "image": "ami-09e67e426f25ce0d7"
      }
    ]
}
}

yarn clean will update the tfstate in Gcloud, but it does not tear down the infra in AWS.

Has anyone come across this? Is there something wrong with my config?
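As an aside on the config format above: the checksum fields are sha256: digests, and a downloaded binary or snapshot can be checked against one with a few lines of Python (a minimal sketch, not code from this repo):

```python
import hashlib

def verify_checksum(path: str, expected: str) -> bool:
    """Compare a file's digest against an '<algo>:<hex>' string
    such as the sha256: entries in config/main.json."""
    algo, _, digest = expected.partition(":")
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == digest
```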

Could not sync platform: Invalid PEM formatted message.

Getting this error after generating a key using ssh-keygen -m PEM -f <path>. Any ideas why it keeps saying it's invalid? Before, it complained the key wasn't prefaced with the right header, and once I moved past that this appeared. Unsure how to get this repo working.

EDIT: seems to be caused from this:

google_storage_bucket.imagestore: Creating...



Error: googleapi: Error 403: [email protected] does not have storage.buckets.create access to project <project-number>, forbidden


  on main.tf line 6, in resource "google_storage_bucket" "imagestore":

   6: resource "google_storage_bucket" "imagestore" {

Nominating: Has to be a way to copy and paste multiple validators vs 1 at a time

This issue arose on a different testnet than KSM/Polkadot, but the principle is the same. Is there a way to allow copying and pasting multiple addresses for the validators on the right-hand side? Currently it looks like you can only input one at a time, so when you need to change them up (in this case on a testnet), and often, it's a manual process. For UI, ease of use and user experience, perhaps allow copy and paste (and show the entire address)?

The ability to control color would be beneficial for those with bad eyesight; the light gray does not stand out for the visually impaired. An option to do so (a drop-down perhaps), or a hover hint that points out the option, may help some people.

Also in the nominating process, the little checkbox can barely be seen due to visual impairment. A box that is colored (or again with an option to change the color or boldness) may be helpful for the visually impaired.

Adding a new sentry wg-quick error

I have been adding new sentries to my validator; this has happened twice so far when provisioning different new nodes. Setting up everything else goes fine, but when it's time to start the wg0 service on the new node, it gives the following error in the Ansible logs:

FAILED! => {"changed": false, "msg": "Unable to start service wg-quick@wg0: Job for wg-quick@wg0.service failed because the control process exited with error code.\nSee \"systemctl status wg-quick@wg0.service\" and \"journalctl -xe\" for details.\n"}

The output of systemctl status wg-quick@wg0:

Jun 17 12:58:21  systemd[1]: Starting WireGuard via wg-quick(8) for wg0...
Jun 17 12:58:21  wg-quick[9839]: [#] ip link add wg0 type wireguard
Jun 17 12:58:21  wg-quick[9839]: RTNETLINK answers: Operation not supported
Jun 17 12:58:21  wg-quick[9839]: Unable to access interface: Protocol not supported
Jun 17 12:58:21  wg-quick[9839]: [#] ip link delete dev wg0
Jun 17 12:58:21  wg-quick[9839]: Cannot find device "wg0"
Jun 17 12:58:21  systemd[1]: wg-quick@wg0.service: Main process exited, code=exited, status=1/FAILURE
Jun 17 12:58:21  systemd[1]: wg-quick@wg0.service: Failed with result 'exit-code'.

What worked for me was installing wireguard-dkms; from then on I can manually bring it up with wg-quick up wg0, whereas before installing the DKMS version it wouldn't start.

Playbook for upgrading nodes

For initially setting up a secure validator this tool is perfect, and also for updating. But the ansible-playbook main.yml run goes through all the checks, even when everything works as it should, while in an upgrade you just want to replace the binary and restart the service.

Can we get a new playbook that just upgrades the nodes, replaces the old binary with the new one as written in the sample file?

--public-addr parameter not being added to systemd service file

Even though I set enable_reverse_proxy = true in the inventory.yml file, the template polkadot.service.j2 doesn't automatically add the --public-addr parameter to the polkadot.service file. So if you enable the NGINX reverse proxy, you need to add this parameter manually.
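For reference, the flags the template would need to render look roughly like this (a sketch assuming the proxy listens on a public port; the paths, ports and IP are illustrative):

```ini
# Relevant fragment of a rendered polkadot.service (illustrative values);
# --public-addr is the flag the template currently omits.
[Service]
ExecStart=/usr/local/bin/polkadot \
  --validator \
  --listen-addr /ip4/127.0.0.1/tcp/30333 \
  --public-addr /ip4/203.0.113.10/tcp/10000
```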

Failing to retrieve node-id

I already had a secure-validator setup and was testing how the latest version responds to updates. It gets stuck on the polkadot-public task "save log with peer id" when running it with ansible-playbook main.yml -i /PATH/inventory.sample

It either retries and fails, or gets stuck and doesn't give any log.

As a temporary fix you can free up journald with journalctl --vacuum-time=2d and then restart your public nodes, so that the node-id line appears when the node starts. The script then no longer gets stuck at this step.

Prometheus autoprovisioning

Prometheus has a folder where you can place one JSON file per machine you'd like to scrape. It is located by default in /etc/prometheus/provisioning.

An example of one of these files, with static configurations, looks like this:

cat provisioning/polkadot-sentry-2.json

[
    {
        "labels": {
            "job": "polkadot-sentry-2",
            "group": "substrate",
            "network": "polkadot"
        },
        "targets": [
            "XX.XX.XX.XX:9615",
            "XX.XX.XX.XX:9100"
        ]
    }
]

It would be really nice if the last of the roles, after a successful deployment, created and uploaded this file to your monitoring server.
There is no need to restart Prometheus. Proper use of labels will later help you configure your Grafana panels, and new machines should pop up automagically there.

Make telemetry endpoints configurable

Telemetry endpoints for the validator and/or public nodes should be defined in the config file. We should also add a note to the readme about the potential information leakage when using public telemetry endpoints (both when using the code in this repo and on setups based on it).
