NAME

ha-manager - Proxmox VE HA manager command line interface

DESCRIPTION

ha-manager manages user-defined cluster services. This includes handling user requests such as service start, service disable, service relocate, and service restart. The cluster resource manager daemon also restarts and relocates services in the event of failures.

HOW IT WORKS

The local resource manager (pve-ha-lrm) is started as a daemon on each node at system start and waits until the HA cluster is quorate and locks are working. After initialization, the LRM determines which services are enabled and starts them. The watchdog is also initialized at this point.

The cluster resource manager (pve-ha-crm) starts on each node and waits there for the manager lock, which can only be held by one node at a time. The node which successfully acquires the manager lock is promoted to the CRM and handles cluster-wide actions like migrations and failures.

When a node leaves the cluster quorum, its state changes to unknown. If the current CRM can then secure the failed node's lock, the services will be 'stolen' and restarted on another node.

When a cluster member determines that it is no longer in the cluster quorum, the LRM waits for a new quorum to form. As long as there is no quorum, the node cannot reset the watchdog. This will trigger a reboot after 60 seconds.
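
The watchdog behaviour described above can be pictured with a small model. This is only an illustrative sketch in Python, not the pve-ha-lrm implementation: the class name Watchdog, its tick method, and the simulated clock are hypothetical. Only the 60 second timeout is taken from the text.

```python
WATCHDOG_TIMEOUT = 60  # seconds, as stated in the text above


class Watchdog:
    """Toy model: the timer may only be reset while the node is quorate."""

    def __init__(self, now=0):
        self.last_reset = now

    def tick(self, now, quorate):
        """Advance the simulated clock; return True if the node would reboot."""
        if quorate:
            # A quorate node keeps resetting the watchdog.
            self.last_reset = now
            return False
        # Without quorum the timer keeps running until it expires.
        return now - self.last_reset >= WATCHDOG_TIMEOUT


wd = Watchdog(now=0)
print(wd.tick(now=30, quorate=True))    # False
print(wd.tick(now=60, quorate=False))   # False, only 30 s since the last reset
print(wd.tick(now=90, quorate=False))   # True, 60 s without a reset
```

If quorum returns before the timeout expires, the next quorate tick resets the timer and no reboot happens.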

CONFIGURATION

The HA stack is well integrated into the Proxmox VE API2. For example, HA can be configured via ha-manager or the PVE web interface, both of which are easy to use.

The resource configuration file is located at /etc/pve/ha/resources.cfg and the group configuration file at /etc/pve/ha/groups.cfg. Use the provided tools to make changes; there should be no need to edit these files manually.
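
As a sketch of scripting against the provided tool instead of editing the files, the following Python helper builds an argument list for 'ha-manager add'. The function name build_add_command and the group name 'mygroup' are hypothetical; the option names (-group, -max_restart, -max_relocate, -state) are taken from the SYNOPSIS. The helper only constructs the command. Running it, for example with subprocess.run, requires a Proxmox VE node.

```python
def build_add_command(sid, group=None, max_restart=None, max_relocate=None, state=None):
    """Return the argv list for 'ha-manager add'; unset options are omitted."""
    argv = ["ha-manager", "add", sid]
    options = [
        ("-group", group),
        ("-max_restart", max_restart),
        ("-max_relocate", max_relocate),
        ("-state", state),
    ]
    for flag, value in options:
        if value is not None:
            argv += [flag, str(value)]
    return argv


# 'mygroup' is a made-up group name used for illustration only.
print(build_add_command("vm:100", group="mygroup", max_restart=2))
# ['ha-manager', 'add', 'vm:100', '-group', 'mygroup', '-max_restart', '2']
```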

RESOURCES/SERVICES AGENTS

A resource, also called a service, can be managed by ha-manager. Currently, virtual machines and containers are supported.
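
Resources are addressed by a service ID of the form <type>:<name>, where the type is vm or ct (see the SYNOPSIS). A minimal Python sketch of splitting such an ID follows. The function name parse_sid is hypothetical. A bare numeric shortcut is returned with type None, because resolving it to vm or ct needs cluster information that this sketch does not have.

```python
VALID_TYPES = ("vm", "ct")  # resource types listed in the synopsis


def parse_sid(sid):
    """Split '<type>:<name>' into (type, name); a bare ID yields (None, id)."""
    if ":" not in sid:
        if not sid.isdigit():
            raise ValueError(f"not a valid shortcut ID: {sid!r}")
        return (None, sid)
    rtype, name = sid.split(":", 1)
    if rtype not in VALID_TYPES:
        raise ValueError(f"unknown resource type: {rtype!r}")
    if not name:
        raise ValueError("missing resource name")
    return (rtype, name)


print(parse_sid("vm:100"))  # ('vm', '100')
print(parse_sid("100"))     # (None, '100')
```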

GROUPS

A group is a collection of cluster nodes which a service may be bound to.

GROUP SETTINGS

* nodes

list of group node members

* restricted

resources bound to this group may only run on nodes defined by the group. If no group node member is available the resource will be placed in the stopped state.

* nofailback

the resource won't automatically fail back when a more preferred node (re)joins the cluster.
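
The interplay of node priorities and the restricted setting can be sketched as follows. This is an illustrative Python model, not the CRM's actual selection code. The function names parse_nodes and pick_node and the node names are hypothetical, and breaking ties between equal priorities by name is an assumption of this sketch. The node list format <node>[:<pri>] with default priority 0 is taken from the SYNOPSIS.

```python
def parse_nodes(spec):
    """Parse 'node1:2,node2,node3:1' into {'node1': 2, 'node2': 0, 'node3': 1}."""
    priorities = {}
    for entry in spec.split(","):
        node, _, pri = entry.partition(":")
        priorities[node] = int(pri) if pri else 0  # priority defaults to 0
    return priorities


def pick_node(spec, online, restricted=False):
    """Return the preferred online node, or None if the service must stay stopped."""
    priorities = parse_nodes(spec)
    candidates = [n for n in priorities if n in online]
    if candidates:
        # Highest priority first; ties broken by name (an assumption of this sketch).
        return sorted(candidates, key=lambda n: (-priorities[n], n))[0]
    if restricted or not online:
        return None
    # Unrestricted group with all members offline: any cluster member may be used.
    return sorted(online)[0]


print(pick_node("node1:2,node2,node3:1", online={"node2", "node3"}))  # node3
print(pick_node("node1", online={"node4"}, restricted=True))           # None
print(pick_node("node1", online={"node4"}))                            # node4
```

The nofailback setting is not modelled here; it would only matter once a higher priority node rejoins while the service is already running elsewhere.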

RECOVERY POLICY

There are two service recovery policy settings, which can be configured individually for each resource.

* max_restart

maximum number of attempts to restart a failed service on the current node. The default is one.

* max_relocate

maximum number of attempts to relocate the service to a different node. A relocate only happens after the max_restart value is exceeded on the current node. The default is one.

Note that the relocate count will only reset to zero once the service has had at least one successful start. This means that if a service is re-enabled without fixing the error, only the restart policy gets repeated.
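
The following Python sketch shows one plausible reading of how the two settings combine. It is a simplified model, not the HA stack's implementation: the function name run_recovery and the start_ok callback are hypothetical, nodes are represented by plain indexes, and each node is assumed to get one start attempt plus up to max_restart restarts. The relocate count reset described above is not modelled.

```python
def run_recovery(start_ok, max_restart=1, max_relocate=1):
    """Return (final_state, attempts); attempts lists the node index of each try."""
    attempts = []
    for node in range(max_relocate + 1):      # node 0 plus up to max_relocate relocations
        for _ in range(max_restart + 1):      # first start plus up to max_restart restarts
            attempts.append(node)
            if start_ok(node):
                return ("started", attempts)
    # Every restart and relocate attempt failed.
    return ("error", attempts)


# A service that can never start ends in the error state after all tries.
print(run_recovery(lambda node: False))
# ('error', [0, 0, 1, 1])

# A service that only starts on the second node is recovered by one relocation.
print(run_recovery(lambda node: node == 1))
# ('started', [0, 0, 1])
```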

ERROR RECOVERY

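If the service state could not be recovered after all tries, the service is placed in an error state. In this state the service won't be touched by the HA stack anymore. To recover from this state, follow these steps:

* bring the resource back into a safe and consistent state (for example, by killing its process)

* disable the HA resource to place it in a stopped state

* fix the error which led to the failure

* after you have fixed all errors, you may enable the service again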

SERVICE OPERATIONS

This is how the basic user-initiated service operations (via ha-manager) work.

* enable

the service will be started by the LRM if not already running.

* disable

the service will be stopped by the LRM if running.

* migrate/relocate

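the service will be moved to another node. A migrate moves it live (online); a relocate stops the service on the old node and restarts it on the target node.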

* remove

the service will be removed from the HA managed resource list. Its current state will not be touched.

* start/stop

start and stop commands can be issued to the resource-specific tools (like qm or pct). These forward the request to ha-manager, which then executes the action and sets the resulting service state (enabled, disabled).

SERVICE STATES

stopped

Service is stopped (confirmed by LRM)

request_stop

Service should be stopped. Waiting for confirmation from LRM.

started

Service is active and the LRM should start it ASAP if it is not already running.

fence

Wait for node fencing (service node is not inside quorate cluster partition).

freeze

Do not touch the service state. We use this state while we reboot a node, or when we restart the LRM daemon.

migrate

Migrate service (live) to other node.

error

Service disabled because of LRM errors. Needs manual intervention.
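
For scripts that inspect service states, the list above can be captured as an enumeration. This is a small Python sketch. The names ServiceState and needs_manual_intervention are hypothetical and not part of any Proxmox VE API; the state names themselves are the ones listed above.

```python
from enum import Enum


class ServiceState(Enum):
    STOPPED = "stopped"
    REQUEST_STOP = "request_stop"
    STARTED = "started"
    FENCE = "fence"
    FREEZE = "freeze"
    MIGRATE = "migrate"
    ERROR = "error"


def needs_manual_intervention(state):
    """Only the error state requires an administrator to step in."""
    return state is ServiceState.ERROR


print(needs_manual_intervention(ServiceState("error")))   # True
print(needs_manual_intervention(ServiceState("freeze")))  # False
```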

SYNOPSIS

 ha-manager <COMMAND> [ARGS] [OPTIONS]

 ha-manager groupadd <group> -nodes <string> [OPTIONS]
 
   Create a new HA group.
 
   <group>    string
 
             The HA group identifier.
 
   -comment   string
 
             Description.
 
   -nodes     <node>[:<pri>]{,<node>[:<pri>]}*
 
             List of cluster node names with optional priority. We use
             priority '0' as default. The CRM tries to run services on the
             node with highest priority (also see option 'nofailback').
 
   -nofailback boolean  (default=0)
 
             The CRM tries to run services on the node with the highest
             priority. If a node with higher priority comes online, the CRM
             migrates the service to that node. Enabling nofailback
             prevents that behavior.
 
   -restricted boolean  (default=0)
 
             Services on unrestricted groups may run on any cluster members
             if all group members are offline. But they will migrate back
             as soon as a group member comes online. One can implement a
             'preferred node' behavior using an unrestricted group with one
             member.
 
   -type      (group)
 
             Group type.
 
 

 ha-manager groupconfig 
 
   Get HA groups.
 
 

 ha-manager groupremove <group>
 
   Delete HA group configuration.
 
   <group>    string
 
             The HA group identifier.
 
 

 ha-manager groupset <group> [OPTIONS]
 
   Update HA group configuration.
 
   <group>    string
 
             The HA group identifier.
 
   -comment   string
 
             Description.
 
   -delete    string
 
             A list of settings you want to delete.
 
   -digest    string
 
             Prevent changes if current configuration file has different
             SHA1 digest. This can be used to prevent concurrent
             modifications.
 
   -nodes     <node>[:<pri>]{,<node>[:<pri>]}*
 
             List of cluster node names with optional priority. We use
             priority '0' as default. The CRM tries to run services on the
             node with highest priority (also see option 'nofailback').
 
   -nofailback boolean  (default=0)
 
             The CRM tries to run services on the node with the highest
             priority. If a node with higher priority comes online, the CRM
             migrates the service to that node. Enabling nofailback
             prevents that behavior.
 
   -restricted boolean  (default=0)
 
             Services on unrestricted groups may run on any cluster members
             if all group members are offline. But they will migrate back
             as soon as a group member comes online. One can implement a
             'preferred node' behavior using an unrestricted group with one
             member.
 
 


 ha-manager add <sid> [OPTIONS]
 
   Create a new HA resource.
 
   <sid>      <type>:<name>
 
             HA resource ID. This consists of a resource type followed by a
             resource specific name, separated by a colon (example: vm:100
             / ct:100). For virtual machines and containers, you can simply
             use the VM or CT id as a shortcut (example: 100).
 
   -comment   string
 
             Description.
 
   -group     string
 
             The HA group identifier.
 
   -max_relocate integer (0 - N)   (default=1)
 
             Maximal number of service relocate tries when a service fails
             to start.
 
   -max_restart integer (0 - N)  (default=1)
 
             Maximal number of tries to restart the service on a node after
             its start failed.
 
   -state     (disabled | enabled)   (default=enabled)
 
             Resource state.
 
   -type      (ct | vm)
 
             Resource type.
 
 

 ha-manager config  [OPTIONS]
 
   List HA resources.
 
   -type      (ct | vm)
 
             Only list resources of a specific type.
 
 

 ha-manager migrate <sid> <node>
 
   Request resource migration (online) to another node.
 
   <sid>      <type>:<name>
 
             HA resource ID. This consists of a resource type followed by a
             resource specific name, separated by a colon (example: vm:100
             / ct:100). For virtual machines and containers, you can simply
             use the VM or CT id as a shortcut (example: 100).
 
   <node>     string
 
             The cluster node name.
 
 

 ha-manager relocate <sid> <node>
 
   Request resource relocation to another node. This stops the service on
   the old node, and restarts it on the target node.
 
   <sid>      <type>:<name>
 
             HA resource ID. This consists of a resource type followed by a
             resource specific name, separated by a colon (example: vm:100
             / ct:100). For virtual machines and containers, you can simply
             use the VM or CT id as a shortcut (example: 100).
 
   <node>     string
 
             The cluster node name.
 
 

 ha-manager remove <sid>
 
   Delete resource configuration.
 
   <sid>      <type>:<name>
 
             HA resource ID. This consists of a resource type followed by a
             resource specific name, separated by a colon (example: vm:100
             / ct:100). For virtual machines and containers, you can simply
             use the VM or CT id as a shortcut (example: 100).
 
 

 ha-manager set <sid> [OPTIONS]
 
   Update resource configuration.
 
   <sid>      <type>:<name>
 
             HA resource ID. This consists of a resource type followed by a
             resource specific name, separated by a colon (example: vm:100
             / ct:100). For virtual machines and containers, you can simply
             use the VM or CT id as a shortcut (example: 100).
 
   -comment   string
 
             Description.
 
   -delete    string
 
             A list of settings you want to delete.
 
   -digest    string
 
             Prevent changes if current configuration file has different
             SHA1 digest. This can be used to prevent concurrent
             modifications.
 
   -group     string
 
             The HA group identifier.
 
   -max_relocate integer (0 - N)   (default=1)
 
             Maximal number of service relocate tries when a service fails
             to start.
 
   -max_restart integer (0 - N)  (default=1)
 
             Maximal number of tries to restart the service on a node after
             its start failed.
 
   -state     (disabled | enabled)   (default=enabled)
 
             Resource state.
 
 


 ha-manager disable <sid>
 
   Disable an HA resource.
 
   <sid>      <type>:<name>
 
             HA resource ID. This consists of a resource type followed by a
             resource specific name, separated by a colon (example: vm:100
             / ct:100). For virtual machines and containers, you can simply
             use the VM or CT id as a shortcut (example: 100).
 
 

 ha-manager enable <sid>
 
   Enable an HA resource.
 
   <sid>      <type>:<name>
 
             HA resource ID. This consists of a resource type followed by a
             resource specific name, separated by a colon (example: vm:100
             / ct:100). For virtual machines and containers, you can simply
             use the VM or CT id as a shortcut (example: 100).
 
 

 ha-manager status  [OPTIONS]
 
   Display HA manager status.
 
   -verbose   boolean   (default=0)
 
             Verbose output. Include complete CRM and LRM status (JSON).
 
 


 ha-manager help [<cmd>] [OPTIONS]
 
   Get help about specified command.
 
   <cmd>      string
 
             Command name
 
   -verbose   boolean
 
             Verbose output format.
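
The -digest option of the groupset and set commands guards against concurrent modifications. The idea can be sketched in Python as below. This is an assumption-laden illustration: the function names config_digest and update_config are hypothetical, the placeholder strings stand in for real file contents, and computing the SHA1 over the raw file text is an assumption about how the digest is derived.

```python
import hashlib


def config_digest(text):
    """SHA1 hex digest of a configuration text (assumed to cover the raw content)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def update_config(current_text, new_text, expected_digest=None):
    """Return new_text, or raise if the config changed since the digest was taken."""
    if expected_digest is not None and config_digest(current_text) != expected_digest:
        raise RuntimeError("configuration file was modified, refusing to update")
    return new_text


original = "placeholder: original contents\n"
digest = config_digest(original)

# No one else changed the file: the update goes through.
print(update_config(original, "placeholder: my change\n", expected_digest=digest))

# Someone else changed the file in the meantime: the update is refused.
try:
    update_config("placeholder: other change\n", "placeholder: my change\n", expected_digest=digest)
except RuntimeError as err:
    print(err)
```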
 
 

COPYRIGHT AND DISCLAIMER

Copyright (C) 2007-2015 Proxmox Server Solutions GmbH

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.