NAME
ha-manager - Proxmox VE HA Manager
SYNOPSIS
ha-manager <COMMAND> [ARGS] [OPTIONS]
ha-manager add <sid> [OPTIONS]

Create a new HA resource.

<sid>: <type>:<name>
    HA resource ID. This consists of a resource type followed by a resource specific name, separated with a colon (example: vm:100 / ct:100). For virtual machines and containers, you can simply use the VM or CT id as a shortcut (example: 100).

-comment string
    Description.

-group string
    The HA group identifier.

-max_relocate integer (0 - N) (default=1)
    Maximal number of service relocate tries when a service fails to start.

-max_restart integer (0 - N) (default=1)
    Maximal number of tries to restart the service on a node after its start failed.

-state (disabled | enabled) (default=enabled)
    Resource state.

-type (ct | vm)
    Resource type.
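For example, a VM could be added as an HA resource together with a group and a start failure policy like this (hypothetical VM ID and group name, shown for illustration only):

    ha-manager add vm:100 -group mygroup -state enabled -max_restart 2 -max_relocate 1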
ha-manager config [OPTIONS]

List HA resources.

-type (ct | vm)
    Only list resources of a specific type.
ha-manager disable <sid>

Disable a HA resource.

<sid>: <type>:<name>
    HA resource ID. This consists of a resource type followed by a resource specific name, separated with a colon (example: vm:100 / ct:100). For virtual machines and containers, you can simply use the VM or CT id as a shortcut (example: 100).

ha-manager enable <sid>

Enable a HA resource.

<sid>: <type>:<name>
    HA resource ID. This consists of a resource type followed by a resource specific name, separated with a colon (example: vm:100 / ct:100). For virtual machines and containers, you can simply use the VM or CT id as a shortcut (example: 100).
ha-manager groupadd <group> -nodes <string> [OPTIONS]

Create a new HA group.

<group>: string
    The HA group identifier.

-comment string
    Description.

-nodes <node>[:<pri>]{,<node>[:<pri>]}*
    List of cluster node names with optional priority. We use priority 0 as default. The CRM tries to run services on the node with highest priority (also see option nofailback).

-nofailback boolean (default=0)
    The CRM tries to run services on the node with the highest priority. If a node with higher priority comes online, the CRM migrates the service to that node. Enabling nofailback prevents that behavior.

-restricted boolean (default=0)
    Services on unrestricted groups may run on any cluster member if all group members are offline. But they will migrate back as soon as a group member comes online. One can implement a preferred node behavior using an unrestricted group with one member.

-type (group)
    Group type.
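For example, a group preferring node1 over node2 while keeping services restricted to these two nodes could be created like this (hypothetical group and node names):

    ha-manager groupadd prefer_node1 -nodes "node1:2,node2:1" -restricted 1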
ha-manager groupconfig
Get HA groups.
ha-manager groupremove <group>

Delete ha group configuration.

<group>: string
    The HA group identifier.
ha-manager groupset <group> [OPTIONS]

Update ha group configuration.

<group>: string
    The HA group identifier.

-comment string
    Description.

-delete string
    A list of settings you want to delete.

-digest string
    Prevent changes if current configuration file has different SHA1 digest. This can be used to prevent concurrent modifications.

-nodes <node>[:<pri>]{,<node>[:<pri>]}*
    List of cluster node names with optional priority. We use priority 0 as default. The CRM tries to run services on the node with highest priority (also see option nofailback).

-nofailback boolean (default=0)
    The CRM tries to run services on the node with the highest priority. If a node with higher priority comes online, the CRM migrates the service to that node. Enabling nofailback prevents that behavior.

-restricted boolean (default=0)
    Services on unrestricted groups may run on any cluster member if all group members are offline. But they will migrate back as soon as a group member comes online. One can implement a preferred node behavior using an unrestricted group with one member.
ha-manager help [<cmd>] [OPTIONS]

Get help about specified command.

<cmd>: string
    Command name.

-verbose boolean
    Verbose output format.
ha-manager migrate <sid> <node>

Request resource migration (online) to another node.

<sid>: <type>:<name>
    HA resource ID. This consists of a resource type followed by a resource specific name, separated with a colon (example: vm:100 / ct:100). For virtual machines and containers, you can simply use the VM or CT id as a shortcut (example: 100).

<node>: string
    The cluster node name.
ha-manager relocate <sid> <node>

Request resource relocation to another node. This stops the service on the old node, and restarts it on the target node.

<sid>: <type>:<name>
    HA resource ID. This consists of a resource type followed by a resource specific name, separated with a colon (example: vm:100 / ct:100). For virtual machines and containers, you can simply use the VM or CT id as a shortcut (example: 100).

<node>: string
    The cluster node name.
ha-manager remove <sid>

Delete resource configuration.

<sid>: <type>:<name>
    HA resource ID. This consists of a resource type followed by a resource specific name, separated with a colon (example: vm:100 / ct:100). For virtual machines and containers, you can simply use the VM or CT id as a shortcut (example: 100).
ha-manager set <sid> [OPTIONS]

Update resource configuration.

<sid>: <type>:<name>
    HA resource ID. This consists of a resource type followed by a resource specific name, separated with a colon (example: vm:100 / ct:100). For virtual machines and containers, you can simply use the VM or CT id as a shortcut (example: 100).

-comment string
    Description.

-delete string
    A list of settings you want to delete.

-digest string
    Prevent changes if current configuration file has different SHA1 digest. This can be used to prevent concurrent modifications.

-group string
    The HA group identifier.

-max_relocate integer (0 - N) (default=1)
    Maximal number of service relocate tries when a service fails to start.

-max_restart integer (0 - N) (default=1)
    Maximal number of tries to restart the service on a node after its start failed.

-state (disabled | enabled) (default=enabled)
    Resource state.
ha-manager status [OPTIONS]

Display HA manager status.

-verbose boolean (default=0)
    Verbose output. Include complete CRM and LRM status (JSON).
DESCRIPTION
Our modern society depends heavily on information provided by computers over the network. Mobile devices amplified that dependency, because people can access the network any time from anywhere. If you provide such services, it is very important that they are available most of the time.
We can mathematically define the availability as the ratio of (A) the total time a service is capable of being used during a given interval to (B) the length of the interval. It is normally expressed as a percentage of uptime in a given year.
Availability %    Downtime per year
-----------------------------------
99                3.65 days
99.9              8.76 hours
99.99             52.56 minutes
99.999            5.26 minutes
99.9999           31.5 seconds
99.99999          3.15 seconds
There are several ways to increase availability. The most elegant solution is to rewrite your software, so that you can run it on several hosts at the same time. The software itself needs to have a way to detect errors and do failover. This is relatively easy if you just want to serve read-only web pages. But in general this is complex, and sometimes impossible because you cannot modify the software yourself. The following solutions work without modifying the software:
- Use reliable "server" components
  Computer components with the same functionality can have varying reliability numbers, depending on the component quality. Most vendors sell components with higher reliability as "server" components - usually at a higher price.
- Eliminate single points of failure (redundant components)
  - use an uninterruptible power supply (UPS)
  - use redundant power supplies on the main boards
  - use ECC-RAM
  - use redundant network hardware
  - use RAID for local storage
  - use distributed, redundant storage for VM data
- Reduce downtime
  - rapidly accessible administrators (24/7)
  - availability of spare parts (other nodes in a Proxmox VE cluster)
  - automatic error detection (ha-manager)
  - automatic failover (ha-manager)
Virtualization environments like Proxmox VE make it much easier to reach high availability because they remove the "hardware" dependency. They also make it easy to set up and use redundant storage and network devices. So if one host fails, you can simply start those services on another host within your cluster.
Even better, Proxmox VE provides a software stack called ha-manager, which can do that automatically for you. It is able to automatically detect errors and do automatic failover.
Proxmox VE ha-manager works like an "automated" administrator. First, you configure what resources (VMs, containers, …) it should manage. ha-manager then observes correct functionality, and handles service failover to another node in case of errors. ha-manager can also handle normal user requests which may start, stop, relocate and migrate a service.
But high availability comes at a price. High quality components are more expensive, and making them redundant at least doubles the costs. Additional spare parts increase costs further. So you should carefully calculate the benefits, and compare them with those additional costs.
Note: Increasing availability from 99% to 99.9% is relatively simple. But increasing availability from 99.9999% to 99.99999% is very hard and costly. ha-manager has typical error detection and failover times of about 2 minutes, so you can get no more than 99.999% availability.
Requirements
- at least three cluster nodes (to get reliable quorum)
- shared storage for VMs and containers
- hardware redundancy (everywhere)
- hardware watchdog - if not available we fall back to the Linux kernel software watchdog (softdog)
- optional hardware fencing devices
Resources
We call the primary management unit handled by ha-manager a resource. A resource (also called "service") is uniquely identified by a service ID (SID), which consists of the resource type and a type-specific ID, e.g.: vm:100. That example would be a resource of type vm (virtual machine) with the ID 100.
For now we have two important resource types - virtual machines and containers. One basic idea here is that we can bundle related software into such a VM or container, so there is no need to compose one big service from other services, as was done with rgmanager. In general, an HA enabled resource should not depend on other resources.
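For example, both of the following commands add the virtual machine with ID 100 as an HA resource, the second one using the ID shortcut (a sketch with a hypothetical VM ID; the VM itself must already exist):

    ha-manager add vm:100
    ha-manager add 100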
How It Works
This section provides a detailed description of the Proxmox VE HA-manager internals. It describes how the CRM and the LRM work together.
To provide High Availability two daemons run on each node:
- pve-ha-lrm
    The local resource manager (LRM). It controls the services running on the local node. It reads the requested states for its services from the current manager status file and executes the respective commands.
- pve-ha-crm
    The cluster resource manager (CRM). It controls the cluster-wide actions on the services, processes the LRM results, and includes the state machine which controls the state of each service.
Note: Locks in the LRM & CRM. Locks are provided by our distributed configuration file system (pmxcfs). They are used to guarantee that each LRM is active exactly once and working. As an LRM only executes actions when it holds its lock, we can mark a failed node as fenced if we can acquire its lock. This lets us then recover any failed HA services securely, without any interference from the now unknown failed node. This all gets supervised by the CRM, which currently holds the manager master lock.
Local Resource Manager
The local resource manager (pve-ha-lrm) is started as a daemon on boot and waits until the HA cluster is quorate and thus cluster wide locks are working.
It can be in three states:
- wait for agent lock: the LRM waits for our exclusive lock. This is also used as the idle state if no service is configured.
- active: the LRM holds its exclusive lock and has services configured.
- lost agent lock: the LRM lost its lock. This means a failure happened and quorum was lost.
After the LRM gets into the active state, it reads the manager status file in /etc/pve/ha/manager_status and determines the commands it has to execute for the services it owns. For each command a worker gets started; these workers run in parallel and are limited to a maximum of 4 by default. This default setting may be changed through the datacenter configuration key "max_worker". When finished, the worker process gets collected and its result saved for the CRM.
Note: Maximal Concurrent Worker Adjustment Tips. The default value of 4 maximal concurrent workers may be unsuited for a specific setup. For example, 4 live migrations may happen at the same time, which can lead to network congestion with slower networks and/or big (memory wise) services. Ensure that no congestion happens even in the worst case, and lower the "max_worker" value if needed. On the contrary, if you have a particularly powerful, high end setup you may also want to increase it.
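A minimal sketch of how such a setting could look, assuming the key is placed in the cluster-wide datacenter configuration file /etc/pve/datacenter.cfg (key name as stated above; verify the exact spelling against your version's datacenter.cfg documentation):

    max_worker: 8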
Each command requested by the CRM is uniquely identifiable by a UID. When the worker finishes, its result is processed and written to the LRM status file /etc/pve/nodes/<nodename>/lrm_status. There the CRM may collect it and let its state machine - respective to the command's output - act on it.
The actions on each service between CRM and LRM are normally always synced. This means that the CRM requests a state uniquely marked by a UID; the LRM then executes this action one time and writes back the result, also identifiable by the same UID. This is needed so that the LRM does not execute an outdated command. The exceptions are the stop and the error command; those two do not depend on the result produced and are executed always in the case of the stopped state, and once in the case of the error state.
Note: Read the Logs. The HA stack logs every action it makes. This helps to understand what happened in the cluster, and also why it happened. Here it is important to see what both daemons, the LRM and the CRM, did. You may use journalctl -u pve-ha-lrm on the node(s) where the service is, and the same command for pve-ha-crm on the node which is the current master.
Cluster Resource Manager
The cluster resource manager (pve-ha-crm) starts on each node and waits there for the manager lock, which can only be held by one node at a time. The node which successfully acquires the manager lock gets promoted to the CRM master.
It can be in three states:
- wait for agent lock: the CRM waits for our exclusive lock. This is also used as the idle state if no service is configured.
- active: the CRM holds its exclusive lock and has services configured.
- lost agent lock: the CRM lost its lock. This means a failure happened and quorum was lost.
Its main task is to manage the services which are configured to be highly available, and to try to always enforce the wanted state. For example, an enabled service will be started if it is not running; if it crashes, it will be started again. Thus, it dictates to the LRM the actions it needs to execute.
When a node leaves the cluster quorum, its state changes to unknown. If the current CRM can then secure the failed node's lock, the services will be stolen and restarted on another node.
When a cluster member determines that it is no longer in the cluster quorum, the LRM waits for a new quorum to form. As long as there is no quorum, the node cannot reset the watchdog. This will trigger a reboot after the watchdog times out; this happens after 60 seconds.
Configuration
The HA stack is well integrated into the Proxmox VE API2. So, for example, HA can be configured via ha-manager or the PVE web interface, which both provide an easy to use tool.
The resource configuration file is located at /etc/pve/ha/resources.cfg and the group configuration file at /etc/pve/ha/groups.cfg. Use the provided tools to make changes; there shouldn't be any need to edit them manually.
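For orientation, entries in these files roughly follow the usual Proxmox VE section-style config format. A hypothetical excerpt (IDs, group and node names are examples only, and the exact layout may differ between versions) could look like this:

    # /etc/pve/ha/resources.cfg
    vm: 100
        group mygroup
        state enabled

    # /etc/pve/ha/groups.cfg
    group: mygroup
        nodes node1:2,node2:1
        nofailback 0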
Node Power Status
If a node needs maintenance, you should first migrate and/or relocate all services which are required to keep running to another node. After that you can stop the LRM and CRM services. But note that the watchdog triggers if you stop them while they still have active services.
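A possible maintenance sequence, assuming vm:100 is the only HA service on the node and node2 is the target (hypothetical names; adapt to your setup):

    # move the service away first
    ha-manager migrate vm:100 node2
    # then stop the HA services on the node going into maintenance
    systemctl stop pve-ha-lrm pve-ha-crm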
Package Updates
When updating the ha-manager you should do one node after the other, never all at once, for various reasons. First, while we test our software thoroughly, a bug affecting your specific setup cannot be totally ruled out. Upgrading one node after the other and checking the functionality of each node after finishing the update helps to recover from eventual problems, while updating all nodes at once could leave you with a broken cluster and is generally not good practice.
Also, the Proxmox VE HA stack uses a request acknowledge protocol to perform actions between the cluster and the local resource manager. For restarting, the LRM makes a request to the CRM to freeze all its services. This prevents them from being touched by the cluster during the short time the LRM is restarting. After that, the LRM may safely close the watchdog during a restart. Such a restart happens during an update and, as already stated, an active master CRM is needed to acknowledge the requests from the LRM. If this is not the case, the update process can take too long which, in the worst case, may result in a watchdog reset.
Fencing
What Is Fencing
Fencing ensures that on a node failure the failed node is rendered unable to do any damage, and that no resource runs twice when it gets recovered from the failed node. This is a really important task, and one of the base principles for making a system highly available.
If a node did not get fenced, it would be in an unknown state where it may still have access to shared resources. This is really dangerous! Imagine that every network but the storage one broke. Now, while not reachable from the public network, the VM still runs and writes to the shared storage. If we did not fence the node and just started this VM on another node, we would get dangerous race conditions and atomicity violations; the whole VM could be rendered unusable. The recovery could also simply fail if the storage protects against multiple mounts, thus defeating the purpose of HA.
How Proxmox VE Fences
There are different methods to fence a node, for example fence devices which cut off the power from the node or disable their communication completely.
Those are often quite expensive and bring additional critical components into a system, because if they fail you cannot recover any service.
We thus wanted to integrate a simpler method into the HA Manager first, namely self fencing with watchdogs.
Watchdogs have been widely used in critical and dependable systems since the beginning of microcontrollers. They are often independent, simple integrated circuits which programs can use to watch them. After opening the watchdog, the program needs to report to it periodically. If, for whatever reason, the program becomes unable to do so, the watchdog triggers a reset of the whole server.
Server motherboards often already include such hardware watchdogs; these need to be configured. If no watchdog is available or configured, we fall back to the Linux kernel softdog. While still reliable, it is not independent of the server's hardware and thus has lower reliability than a hardware watchdog.
Configure Hardware Watchdog
By default, all watchdog modules are blocked for security reasons, as they are like a loaded gun if not correctly initialized. If you have a hardware watchdog available, remove its kernel module from the blacklist, load it with insmod, and restart the watchdog-mux service or reboot the node.
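A rough sketch of the steps described above, assuming the board provides the Intel TCO watchdog (iTCO_wdt) and that the module is blocked via a file under /etc/modprobe.d/ (module name and path are examples only, not a recommendation for your hardware; modprobe is used here instead of insmod for convenience, as it resolves the module path itself):

    # remove or comment out the corresponding blacklist entry under /etc/modprobe.d/,
    # then load the module and restart the multiplexer
    modprobe iTCO_wdt
    systemctl restart watchdog-mux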
Recover Fenced Services
After a node has failed and its fencing was successful, we start to recover services to other available nodes and restart them there so that they can provide service again.
The selection of the node on which the services get recovered is influenced by the user's group settings, the currently active nodes, and their respective active service count. First, we build a set out of the intersection between user selected nodes and available nodes. Then the subset of those nodes with the highest priority gets chosen as possible nodes for recovery. We select the node with the currently lowest active service count as the new node for the service. That minimizes the possibility of an overload, which could otherwise cause an unresponsive node and, as a result, a chain reaction of node failures in the cluster.
Groups
A group is a collection of cluster nodes which a service may be bound to.
Group Settings
- nodes
    List of group node members, where a priority can be given to each node. A service bound to this group will run on the available nodes with the highest priority. If more nodes are in the highest priority class, the services will get distributed to those nodes if not already there. The priorities have a relative meaning only.
- restricted
    Resources bound to this group may only run on nodes defined by the group. If no group node member is available, the resource will be placed in the stopped state.
- nofailback
    The resource won't automatically fail back when a more preferred node (re)joins the cluster.
Start Failure Policy
The start failure policy comes into effect if a service failed to start on a node one or more times. It can be used to configure how often a restart should be triggered on the same node and how often a service should be relocated, so that it gets a chance to be started on another node. The aim of this policy is to circumvent temporary unavailability of shared resources on a specific node. For example, if a shared storage isn't available on a quorate node anymore, e.g. due to network problems, but is still available on other nodes, the relocate policy allows the service to be started nonetheless.
There are two service start recovery policy settings which can be configured specifically for each resource.

- max_restart
    Maximal number of tries to restart a failed service on the actual node. The default is set to one.
- max_relocate
    Maximal number of tries to relocate the service to a different node. A relocate only happens after the max_restart value is exceeded on the actual node. The default is set to one.
Note: The relocate count state will only reset to zero when the service had at least one successful start. That means if a service is re-enabled without fixing the error, only the restart policy gets repeated.
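For example, to allow one more local restart and one more relocation attempt for a resource (hypothetical resource ID):

    ha-manager set vm:100 -max_restart 2 -max_relocate 2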
Error Recovery
If, after all tries, the service state could not be recovered, it gets placed in an error state. In this state the service won't get touched by the HA stack anymore. To recover from this state you should follow these steps (a possible command sequence is sketched after the list):
- bring the resource back into a safe and consistent state (e.g. killing its process)
- disable the HA resource to place it in a stopped state
- fix the error which led to these failures
- after you fixed all errors you may enable the service again
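A possible recovery sequence for a VM resource stuck in the error state could look like this (hypothetical ID; the qm stop step assumes the resource is a VM whose process still has to be terminated):

    # bring the resource into a safe and consistent state
    qm stop 100
    # place the HA resource in the stopped state
    ha-manager disable vm:100
    # fix the root cause of the failure, then enable the resource again
    ha-manager enable vm:100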
Service Operations
This is how the basic user-initiated service operations (via ha-manager) work.
- enable
    The service will be started by the LRM if not already running.
- disable
    The service will be stopped by the LRM if running.
- migrate/relocate
    The service will be relocated (live) to another node.
- remove
    The service will be removed from the HA managed resource list. Its current state will not be touched.
- start/stop
    Start and stop commands can be issued to the resource specific tools (like qm or pct); they will forward the request to the ha-manager, which then will execute the action and set the resulting service state (enabled, disabled).
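For instance, a live migration of a VM resource and a relocation of a container resource could be requested like this (hypothetical IDs and node name):

    ha-manager migrate vm:100 node2
    ha-manager relocate ct:101 node2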
Service States
- stopped
    Service is stopped (confirmed by LRM). If detected running, it will get stopped again.
- request_stop
    Service should be stopped. Waiting for confirmation from LRM.
- started
    Service is active and the LRM should start it ASAP if not already running. If the service fails and is detected to be not running, the LRM restarts it.
- fence
    Wait for node fencing (the service node is not inside the quorate cluster partition). As soon as the node gets fenced successfully, the service will be recovered to another node, if possible.
- freeze
    Do not touch the service state. We use this state while we reboot a node, or when we restart the LRM daemon.
- migrate
    Migrate service (live) to another node.
- error
    Service disabled because of LRM errors. Needs manual intervention.
Copyright and Disclaimer
Copyright © 2007-2016 Proxmox Server Solutions GmbH
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/