ha-manager - Proxmox VE HA manager command line interface
ha-manager handles the management of user-defined cluster services. This includes handling user requests such as service start, service disable, service relocate, and service restart. The cluster resource manager daemon also handles restarting and relocating services in the event of failures.
The local resource manager (pve-ha-lrm) is started as a daemon on each node at system start and waits until the HA cluster is quorate and locks are working. After initialization, the LRM determines which services are enabled and starts them. The watchdog is also initialized at this stage.
The cluster resource manager (pve-ha-crm) starts on each node and waits there for the manager lock, which can only be held by one node at a time. The node which successfully acquires the manager lock gets promoted to the CRM; it handles cluster-wide actions like migrations and failures.
When a node leaves the cluster quorum, its state changes to unknown. If the current CRM can then secure the failed node's lock, the services will be 'stolen' and restarted on another node.
When a cluster member determines that it is no longer in the cluster quorum, the LRM waits for a new quorum to form. As long as there is no quorum, the node cannot reset the watchdog, which triggers a reboot after the watchdog times out (60 seconds).
The HA stack is well integrated into the Proxmox VE API2. For example, HA can be configured via ha-manager or via the PVE web interface, both of which provide an easy-to-use interface.
The resource configuration file is located at /etc/pve/ha/resources.cfg and the group configuration file at /etc/pve/ha/groups.cfg. Use the provided tools to make changes; there shouldn't be any need to edit them manually.
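For illustration only (the resource ID, group name, node names, and values are placeholders), an entry written by the tools into resources.cfg uses the usual Proxmox VE section-config format and might look roughly like this:

    vm: 100
            group prefer_node1
            state enabled
            max_restart 2

with a matching group entry in groups.cfg:

    group: prefer_node1
            nodes node1:2,node2
            restricted 1
            nofailback 0

The current configuration can always be listed with 'ha-manager config' and 'ha-manager groupconfig' instead of reading the files directly.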
A resource (also called a service) can be managed by the ha-manager. Currently we support virtual machines and containers.
A group is a collection of cluster nodes which a service may be bound to. A group has the following settings (an example follows below):
nodes
    List of group node members.
restricted
    Resources bound to this group may only run on nodes defined by the group. If no group node member is available, the resource will be placed in the stopped state.
nofailback
    The resource won't automatically fail back when a more preferred node (re)joins the cluster.
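As a sketch (the group name, node names, priorities, and resource ID are placeholders), a restricted group preferring node1, with a resource bound to it, could be set up like this:

    ha-manager groupadd prefer_node1 -nodes "node1:2,node2" -restricted 1 -nofailback 1
    ha-manager set vm:100 -group prefer_node1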
There are two service recovery policy settings which can be configured specifically for each resource (an example follows below):
max_restart
    Maximal number of tries to restart a failed service on the current node. The default is set to one.
max_relocate
    Maximal number of tries to relocate the service to a different node. A relocate only happens after the max_restart value is exceeded on the current node. The default is set to one.
Note that the relocate count will only reset to zero when the service had at least one successful start. That means that if a service is re-enabled without fixing the error, only the restart policy gets repeated.
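For example, to allow two restart attempts on the current node before a single relocation attempt is made, something like the following could be used (vm:100 is a placeholder resource ID):

    ha-manager set vm:100 -max_restart 2 -max_relocate 1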
If, after all tries, the service state could not be recovered, it gets placed in an error state. In this state, the service won't be touched by the HA stack anymore. To recover from this state, you should follow these steps (a command sketch follows the list):
bring the resource back into a safe and consistent state (e.g., killing its process)
disable the HA resource to place it in a stopped state
fix the error which led to these failures
after you have fixed all errors, you may enable the service again
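Assuming the failing resource is vm:100 (a placeholder) and its processes have already been brought into a safe state on the node, the ha-manager part of such a recovery could look like this:

    # place the resource in the stopped state
    ha-manager disable vm:100
    # ... fix the underlying problem (storage, network, configuration, ...) ...
    # then hand the service back to the HA stack
    ha-manager enable vm:100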
This is how the basic user-initiated service operations (via ha-manager) work:
enable
    The service will be started by the LRM if not already running.
disable
    The service will be stopped by the LRM if running.
migrate/relocate
    The service will be relocated (live) to another node.
remove
    The service will be removed from the HA managed resource list. Its current state will not be touched.
Start and stop commands can also be issued to the resource-specific tools (like qm or pct); they will forward the request to the ha-manager, which will then execute the action and set the resulting service state (enabled, disabled).
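For a resource vm:100 and a target node node2 (both placeholders), these operations map to the following commands:

    ha-manager enable vm:100           # start the service if not already running
    ha-manager disable vm:100          # stop the service
    ha-manager migrate vm:100 node2    # move the service (live) to another node
    ha-manager remove vm:100           # drop it from the HA managed resource list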
A managed service can be in one of the following states:
stopped
    Service is stopped (confirmed by LRM).
request_stop
    Service should be stopped. Waiting for confirmation from the LRM.
started
    Service is active, and the LRM should start it ASAP if not already running.
fence
    Wait for node fencing (service node is not inside the quorate cluster partition).
freeze
    Do not touch the service state. We use this state while we reboot a node, or when we restart the LRM daemon.
migrate
    Migrate service (live) to another node.
error
    Service disabled because of LRM errors. Needs manual intervention.
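The state of each managed service can be inspected with ha-manager status; the verbose form additionally includes the complete CRM and LRM status as JSON:

    ha-manager status
    ha-manager status -verbose 1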
ha-manager <COMMAND> [ARGS] [OPTIONS]
ha-manager groupadd <group> -nodes <string> [OPTIONS]
Create a new HA group.
<group> string
The HA group identifier.
-comment string
Description.
-nodes <node>[:<pri>]{,<node>[:<pri>]}*
List of cluster node names with optional priority. We use
priority '0' as default. The CRM tries to run services on the
node with highest priority (also see option 'nofailback').
-nofailback boolean (default=0)
The CRM tries to run services on the node with the highest
priority. If a node with higher priority comes online, the CRM
migrates the service to that node. Enabling nofailback
prevents that behavior.
-restricted boolean (default=0)
Services on unrestricted groups may run on any cluster member
if all group members are offline. But they will migrate back
as soon as a group member comes online. One can implement a
'preferred node' behavior using an unrestricted group with one
member.
-type (group)
Group type.
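Example (group and node names are placeholders; node1 gets the highest priority):

    ha-manager groupadd mygroup -nodes "node1:2,node2:1,node3" -comment "prefer node1"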
ha-manager groupconfig
Get HA groups.
ha-manager groupremove <group>
Delete HA group configuration.
<group> string
The HA group identifier.
ha-manager groupset <group> [OPTIONS]
Update HA group configuration.
<group> string
The HA group identifier.
-comment string
Description.
-delete string
A list of settings you want to delete.
-digest string
Prevent changes if current configuration file has different
SHA1 digest. This can be used to prevent concurrent
modifications.
-nodes <node>[:<pri>]{,<node>[:<pri>]}*
List of cluster node names with optional priority. We use
priority '0' as default. The CRM tries to run services on the
node with highest priority (also see option 'nofailback').
-nofailback boolean (default=0)
The CRM tries to run services on the node with the highest
priority. If a node with higher priority comes online, the CRM
migrates the service to that node. Enabling nofailback
prevents that behavior.
-restricted boolean (default=0)
Services on unrestricted groups may run on any cluster member
if all group members are offline. But they will migrate back
as soon as a group member comes online. One can implement a
'preferred node' behavior using an unrestricted group with one
member.
ha-manager add <sid> [OPTIONS]
Create a new HA resource.
<sid> <type>:<name>
HA resource ID. This consists of a resource type followed by a
resource specific name, separated with colon (example: vm:100
/ ct:100). For virtual machines and containers, you can simply
use the VM or CT id as a shortcut (example: 100).
-comment string
Description.
-group string
The HA group identifier.
-max_relocate integer (0 - N) (default=1)
Maximal number of service relocate tries when a service fails
to start.
-max_restart integer (0 - N) (default=1)
Maximal number of tries to restart the service on a node after
its start failed.
-state (disabled | enabled) (default=enabled)
Resource state.
-type (ct | vm)
Resource type.
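Example (VMIDs and group name are placeholders; for virtual machines and containers the bare ID works as a shortcut):

    ha-manager add vm:100 -state enabled -group mygroup
    ha-manager add 101 -max_restart 2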
ha-manager config [OPTIONS]
List HA resources.
-type (ct | vm)
Only list resources of a specific type.
ha-manager migrate <sid> <node>
Request resource migration (online) to another node.
<sid> <type>:<name>
HA resource ID. This consists of a resource type followed by a
resource specific name, separated with colon (example: vm:100
/ ct:100). For virtual machines and containers, you can simply
use the VM or CT id as a shortcut (example: 100).
<node> string
The cluster node name.
ha-manager relocate <sid> <node>
Request resource relocation to another node. This stops the service on
the old node, and restarts it on the target node.
<sid> <type>:<name>
HA resource ID. This consists of a resource type followed by a
resource specific name, separated with colon (example: vm:100
/ ct:100). For virtual machines and containers, you can simply
use the VM or CT id as a shortcut (example: 100).
<node> string
The cluster node name.
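Example (resource ID and node name are placeholders); migrate keeps the service running during the move, while relocate stops it on the old node and restarts it on the target node:

    ha-manager migrate vm:100 node2
    ha-manager relocate vm:100 node2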
ha-manager remove <sid>
Delete resource configuration.
<sid> <type>:<name>
HA resource ID. This consists of a resource type followed by a
resource specific name, separated with colon (example: vm:100
/ ct:100). For virtual machines and containers, you can simply
use the VM or CT id as a shortcut (example: 100).
ha-manager set <sid> [OPTIONS]
Update resource configuration.
<sid> <type>:<name>
HA resource ID. This consists of a resource type followed by a
resource specific name, separated with colon (example: vm:100
/ ct:100). For virtual machines and containers, you can simply
use the VM or CT id as a shortcut (example: 100).
-comment string
Description.
-delete string
A list of settings you want to delete.
-digest string
Prevent changes if current configuration file has different
SHA1 digest. This can be used to prevent concurrent
modifications.
-group string
The HA group identifier.
-max_relocate integer (0 - N) (default=1)
Maximal number of service relocate tries when a service fails
to start.
-max_restart integer (0 - N) (default=1)
Maximal number of tries to restart the service on a node after
its start failed.
-state (disabled | enabled) (default=enabled)
Resource state.
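Example (resource ID and values are placeholders); the second command removes the group setting from the resource again:

    ha-manager set vm:100 -max_restart 2 -comment "database server"
    ha-manager set vm:100 -delete group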
ha-manager disable <sid>
Disable a HA resource.
<sid> <type>:<name>
HA resource ID. This consists of a resource type followed by a
resource specific name, separated with colon (example: vm:100
/ ct:100). For virtual machines and containers, you can simply
use the VM or CT id as a shortcut (example: 100).
ha-manager enable <sid>
Enable a HA resource.
<sid> <type>:<name>
HA resource ID. This consists of a resource type followed by a
resource specific name, separated with colon (example: vm:100
/ ct:100). For virtual machines and containers, you can simply
use the VM or CT id as a shortcut (example: 100).
ha-manager status [OPTIONS]
Display HA manager status.
-verbose boolean (default=0)
Verbose output. Include complete CRM and LRM status (JSON).
ha-manager help [<cmd>] [OPTIONS]
Get help about specified command.
<cmd> string
Command name
-verbose boolean
Verbose output format.
Copyright (C) 2007-2015 Proxmox Server Solutions GmbH
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.