= Proxmox HA Manager =

Note that this README was written as early development planning/documentation
in 2015. Even though small updates were made in 2023, it might be a bit out of
date. For usage documentation see the official reference docs shipped with your
Proxmox VE installation or, for your convenience, hosted at:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html

== History & Motivation ==

The `rgmanager` HA stack used in Proxmox VE 3.x has a number of drawbacks:

- no more development (Red Hat moved to pacemaker)
- highly depends on an old version of corosync
- complicated code (caused by the compatibility layer with the older
  cluster stack (cman))
- no self-fencing

For Proxmox VE 4.0 we thus required a new HA stack and also wanted to make HA
easier for our users while also making it possible to move to the newest
corosync, or even a totally different cluster stack. So, the following core
requirements were set out:

- possibility to run with any distributed key/value store which provides some
  kind of locking with timeouts (zookeeper, consul, etcd, ..)
- self fencing using a Linux watchdog device
- implemented in Perl, so that we can use the PVE framework
- only works with simple resources like VMs

We dropped the idea of assembling complex, dependent services, because we
think this is already covered by the VM/CT abstraction.

== Architecture ==

The architecture imposes the following requirements on the cluster stack.

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster-wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.
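
As a minimal, self-contained sketch of the required semantics (not of how
pmxcfs actually implements them), the following Perl simulation models locks
that expire unless the holder renews them in time; all names are illustrative:

  use strict;
  use warnings;

  my %locks;              # lockname => { owner, expire (unix time) }
  my $LOCK_TIMEOUT = 120; # seconds a lock stays valid without renewal

  sub acquire_or_renew {
      my ($name, $node, $now) = @_;
      my $lock = $locks{$name};
      # Free, expired, or already ours: take (or renew) the lock.
      if (!$lock || $lock->{expire} <= $now || $lock->{owner} eq $node) {
          $locks{$name} = { owner => $node, expire => $now + $LOCK_TIMEOUT };
          return 1;
      }
      return 0; # still held by another node
  }

  print acquire_or_renew('manager_lock', 'node1', 1000), "\n"; # 1: acquired
  print acquire_or_renew('manager_lock', 'node2', 1010), "\n"; # 0: still held
  print acquire_or_renew('manager_lock', 'node2', 1130), "\n"; # 1: expired, taken over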

=== Watchdog ===

We need a reliable watchdog mechanism which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the specified
timeout if we do not update the watchdog. Neither systemd nor the standard
watchdog(8) daemon appears to provide such guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.
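
The actual daemon in this repository (watchdog-mux) is written in C; as an
illustration only, the kernel-facing half of such a daemon boils down to the
standard Linux watchdog protocol, sketched here in Perl:

  use strict;
  use warnings;
  use IO::Handle;

  # Opening the device arms the watchdog (needs root and a watchdog
  # driver, e.g. softdog).
  open(my $wd, '>', '/dev/watchdog')
      or die "cannot open /dev/watchdog: $!\n";
  $wd->autoflush(1);

  # Every write pets the watchdog; if the writes stop, the node is
  # hard-reset once the driver's timeout expires.
  for (1 .. 10) { # a real daemon loops forever
      print $wd "\0";
      sleep 5;
  }

  # Magic close: writing 'V' before close() disarms the watchdog on
  # drivers that support it, instead of leaving it armed.
  print $wd "V";
  close($wd);

The single-user limitation is visible here: a second open() of /dev/watchdog
fails while the device is held, which is why the keep-alive service gets
handed out over a local socket instead.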

=== Self fencing ===

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
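
Putting locks and watchdog together, the self-fencing rule amounts to the
following loop; the helpers are stubs standing in for the real pmxcfs lock
and the watchdog daemon:

  use strict;
  use warnings;

  sub has_quorum        { return 1 } # stub: the real check asks corosync
  sub get_ha_agent_lock { return 1 } # stub: renew 'ha_agent_${node}_lock'
  sub watchdog_update   { print "watchdog updated\n" }

  for (1 .. 3) { # the real daemon loops forever
      if (has_quorum() && get_ha_agent_lock()) {
          watchdog_update(); # lock held, so keep the node alive
      } else {
          # Lock or quorum lost: do NOT update the watchdog, so the
          # node self-fences (hard reset) once the timeout expires.
      }
      sleep 10;
  }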

==== Problems with "two_node" Clusters ====

This corosync option depends on a fence race condition, and only
works using reliable HW fence devices.

The 'self fencing' algorithm above does not work if you use this option!

Note that you can use a QDevice, i.e., an external, simple node arbiter
process (no full corosync membership, so relaxed networking requirements).

=== Testing Requirements ===

We want to be able to simulate and test the behavior of an HA cluster, using
either a GUI or a CLI. This makes it easier to learn how the system behaves. We
also need a way to run regression tests.

== Implementation details ==

=== Cluster Resource Manager (class PVE::HA::CRM) ===

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).
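
A rough sketch of that control flow, with illustrative stubs rather than the
real PVE::HA::CRM code:

  use strict;
  use warnings;

  sub get_manager_lock     { return 1 } # stub: cluster-wide master lock
  sub read_service_config  { return { 'vm:100' => { state => 'started' } } } # stub
  sub write_manager_status { my ($status) = @_ } # stub: publish via pmxcfs

  for (1 .. 3) { # the real daemon loops forever
      if (get_manager_lock()) {
          # As master: derive the requested per-service states from the
          # configuration and publish them for the LRMs to execute.
          my $conf = read_service_config();
          my $status = {};
          $status->{$_} = { state => $conf->{$_}->{state} } for keys %$conf;
          write_manager_status($status);
      }
      sleep 10;
  }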

==== Service Relocation ====

Some services, like a QEMU Virtual Machine, support live migration,
so the LRM can migrate those services without stopping them (CRM service
state 'migrate').

Most other service types require the service to be stopped and then
restarted on the other node. Stopped services are moved by the CRM
(usually by simply changing the service configuration).
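
A sketch of the resulting decision, assuming for illustration that only 'vm'
services are live-migratable:

  use strict;
  use warnings;

  # Pick the CRM state used to move a service off its current node.
  sub move_state_for {
      my ($service_type) = @_;
      # QEMU VMs can move while running; other service types are
      # stopped, moved and restarted on the target node.
      return $service_type eq 'vm' ? 'migrate' : 'relocate';
  }

  print move_state_for('vm'), "\n"; # migrate
  print move_state_for('ct'), "\n"; # relocate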

==== Service Ordering and Co-location Constraints ====

There are no plans to implement this for the initial version, although it
would be possible and probably should be done for later versions.

==== Possible CRM Service States ====

stopped:      Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for 
              confirmation from LRM.

started:      Service is active and the LRM should start it asap.

fence:        Wait for node fencing (service node is not inside
              quorate cluster partition).

recovery:     Service gets recovered to a new node as its current one was
              fenced. Note that a service might be stuck here depending on
              the group/priority configuration.

freeze:       Do not touch. We use this state while we reboot a node,
              or when we restart the LRM daemon.

migrate:      Migrate (live) service to other node.

relocate:     Migrate (stop, move, start) service to other node.

error:        Service disabled because of LRM errors.


There's also an `ignored` state which tells the HA stack to ignore a service
completely, i.e., as if it wasn't under HA control at all.
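
For reference, the state set above as a Perl lookup table, e.g. to validate a
requested state (sketch only):

  use strict;
  use warnings;

  my %valid_crm_state = map { $_ => 1 } qw(
      stopped request_stop started fence recovery
      freeze migrate relocate error ignored
  );

  my $requested = 'started';
  die "unknown CRM state '$requested'\n" if !$valid_crm_state{$requested};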

=== Local Resource Manager (class PVE::HA::LRM) ===

The Local Resource Manager (LRM) daemon runs on each node and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster-wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to the 'service_${node}_status', and can be
read by the CRM.
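
Sketched with illustrative stubs (the real logic lives in PVE::HA::LRM):

  use strict;
  use warnings;

  sub read_manager_status { return { 'vm:100' => 'started' } } # stub
  sub service_is_running  { return 0 }                         # stub
  sub start_service       { my ($sid) = @_; print "starting $sid\n" }
  sub write_node_status   { my ($status) = @_ } # stub: 'service_${node}_status'

  my $requested = read_manager_status();
  my $result = {};
  for my $sid (keys %$requested) {
      if ($requested->{$sid} eq 'started' && !service_is_running($sid)) {
          start_service($sid); # other requested states handled analogously
      }
      $result->{$sid} = { state => $requested->{$sid} };
  }
  write_node_status($result); # lets the CRM see what actually happened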

=== Pluggable Interface for Cluster Environment (class PVE::HA::Env) ===

This class defines an interface to the actual cluster environment; a sketch
of its shape follows the list below:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files 
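
As a base-class sketch of that interface's shape (the method names paraphrase
the list above and are not guaranteed to match the real PVE::HA::Env API):

  package My::HA::EnvSketch;
  use strict;
  use warnings;

  sub new { my ($class) = @_; return bless {}, $class }

  # Concrete environments (test, simulator, real cluster) override these.
  sub quorate           { die "implement in subclass" } # membership/quorum info
  sub get_cluster_lock  { die "implement in subclass" } # lock with timeout
  sub release_lock      { die "implement in subclass" }
  sub get_time          { die "implement in subclass" } # mockable system time
  sub watchdog_update   { die "implement in subclass" } # watchdog interface
  sub read_status_file  { die "implement in subclass" } # cluster-wide status
  sub write_status_file { die "implement in subclass" }

  1;

Because the CRM and LRM only talk to this interface, the simulator and the
regression tests can drive the very same code paths with mocked time, locks
and watchdog.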

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster

