Overview

FortiSOAR™ supports High Availability (HA) clusters in both Active-Passive and Active-Active configurations.

FortiSOAR™ supports the following High Availability (HA)/Disaster Recovery (DR) options:

Method Brief Description
Nightly database backups and incremental VM snapshots FortiSOAR™ provides backup scripts that are scheduled to run at pre-defined intervals and take a full database backup on a shared or backed-up drive. The full backups have to be supplemented with incremental Virtual Machine (VM) snapshots whenever there are changes made to the file system, such as connector installation changes, config file changes, upgrades, schedule changes, etc. For more information, see the Backing up and Restoring FortiSOAR™ chapter.
HA provided by the underlying virtualization platform Your Virtualization platform also provides HA, such as VMware HA and AWS EBS snapshots. This method relies on your expertise and infrastructure.
Externalized Database This method allows you to externalize your PostgreSQL database and rely on your own database's HA solution. VM snapshots have to be taken when there are changes made to the file system, such as connector installation changes, config file changes, upgrades, schedule changes, etc.
For more information on externalizing PostgreSQL database, see the Externalization of your FortiSOAR™ PostgreSQL database chapter.
High Availability (HA) clusters This chapter describes this method of HA/DR.

FortiSOAR™ High Availability Scenarios

You can configure FortiSOAR™ with either an externalized PostgreSQL database or an internal PostgreSQL database. For both cases you can configure Active-Active or Active-Passive high availability clusters.

High Availability with an internal PostgreSQL database

You can configure FortiSOAR™ for high availability (HA) with an internal PostgreSQL database in the following two ways:

  • In an Active-Active HA cluster configuration, at least two nodes are actively running the same kind of service simultaneously. The main aim of the active-active cluster is to achieve load balancing and horizontal scaling, while data is being replicated asynchronously. You should front multiple active nodes with a proxy or a load balancer to effectively direct requests to all nodes.
    FortiSOAR™ with an internal database and an Active/Active configuration
  • In an Active-Passive HA cluster configuration, one or more passive or standby nodes are available to take over if the primary node fails. Processing is done only by the primary node. However, when the primary node fails, then a standby node can be promoted as the primary node. In this configuration, you can have one active node and one or more passive nodes configured in a cluster, which provides redundancy, while data is being replicated asynchronously.
    FortiSOAR™ with an internal database and an Active/Passive configuration

High Availability with an externalized PostgreSQL database

In the case of an externalized database, you use your own database's HA solution. FortiSOAR™ ensures that file system changes on any of the cluster nodes, arising from connector installs/uninstalls or changes in the module definitions, are synced across every node, so that a secondary or passive node can take over in the shortest time if the primary node fails.

FortiSOAR™ with an external database and an Active/Active configuration

From version 5.0.0 onwards, when you deploy a FortiSOAR™ instance, the FortiSOAR Configuration Wizard configures the instance as a single node cluster, and it is created as an active primary node. You can join more nodes to this node to form a multi-node cluster. For more information on the FortiSOAR Configuration Wizard, see the Deploying FortiSOAR™ chapter in the "Deployment Guide."

Notes for FortiSOAR™ HA clusters:

  • One FortiSOAR™ cluster can have only one active primary node; all the other nodes are either active secondary nodes or passive nodes. The primary node is unique for the following reasons:
    • In case of an internal database, all active nodes talk to the database of the primary node for all reads/writes. The database of all other nodes is in the read-only mode and setup for replication from the primary node.
    • Although the queued workflows are distributed amongst all active nodes, the Workflow scheduler runs only on the primary node.
    • All active nodes index data for quick search into the Elasticsearch instance on the primary node.
    • All integrations or connectors that have a listener configured for notifications, such as IMAP, Exchange, Syslog, etc., run their listeners only on the primary node.
      Therefore, if the primary node goes down, one of the other nodes in the cluster must be promoted as the new primary node and the other nodes should rejoin the cluster connecting to the new primary.
  • Active secondary nodes connect to the database of the active primary node and serve FortiSOAR™ requests. However, passive nodes are used only for disaster recovery and they do not serve any FortiSOAR™ requests.

Prerequisites to configuring High Availability

  • Your FortiSOAR™ instance must be running version 5.0.0 or later, either as a fresh install of 5.0.0 or later, or as an instance upgraded to 5.0.0 or later.
  • All nodes of a cluster should be DNS resolvable from each other.
  • Ensure that the SSH session does not time out by entering the screen mode. For more information, see Handling session timeouts.
  • All nodes that are part of a HA cluster must have a similar license in terms of user count, multitenancy support and entitlements.

Handling session timeouts

Certain operations, such as takeover and join-cluster, might take a long time to run; therefore, you must ensure that your SSH session does not time out. It is likely that your SSH session will time out, since the timeout set for an SSH session is generally 5 minutes, while some FortiSOAR™ operations can take up to 15-20 minutes, depending on the data volume. Using the screen mode also ensures that the operation continues to run even if the terminal session gets disconnected.

To ensure that your session does not time out, use the screen command, which maintains the session until you manually terminate it.
To install screen, run the # yum install screen command.
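For example, to run a long HA operation inside a screen session (the session name ha-operations is illustrative):
    # screen -S ha-operations
    # csadm ha join-cluster --status active --role secondary --primary-node <DNS_Resolvable_Primary_Node_Name>
Press Ctrl+A followed by D to detach from the screen session; the operation continues to run in the background.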

For more information on the screen mode and to avoid issues due to session timeouts, see A Basic understanding of screen on Centos.

If your current SSH session does get disconnected, do the following:

  1. To list the current session, type the # screen -ls command.
  2. To restore the session, type # screen -r XXXX, where XXXX is the session ID of the last screen session.

Process for configuring High Availability

From version 5.1.0 onwards, the process for configuring HA has been simplified, i.e., the join-cluster operation is now a single-step operation that no longer requires you to perform the following steps:

  1. Export the configuration details of the active primary node to a configuration file named ha.conf, and then copy the ha.conf file to the node that you want to configure as a secondary node.
  2. Whitelist the hostnames of the secondary nodes on the active primary server.

Important: In version 5.1.0, you cannot join nodes to an HA cluster in parallel; you can only join nodes to an HA cluster sequentially.

Process that you can use for configuring HA for version 5.1.0 and later:

  1. Use the FortiSOAR™ Admin CLI (csadm) to configure HA for your FortiSOAR™ instances. For more information, see the FortiSOAR™ Admin CLI chapter. Connect to your VM as a root user and run the following command:
    # csadm ha
    This will display the options available to configure HA:
    FortiSOAR™ Admin CLI - csadm ha command output
  2. To configure a node as a secondary node, ensure that all HA nodes are resolvable through DNS and then SSH to the server that you want to configure as a secondary node and run the following command:
    # csadm ha join-cluster --status <active, passive> --role <primary, secondary> --primary-node <DNS_Resolvable_Primary_Node_Name>
    Once you enter this command, you will be prompted to enter the SSH password to access your primary node.
    In a cloud environment, where authentication is key-based, run the following command instead:
    # csadm ha join-cluster --status <active, passive> --role <primary, secondary> --primary-node <DNS_Resolvable_Primary_Node_Name> --primary-node-ssh-key <Path_To_Pem_File>
    This will add the node as a secondary node in the cluster. An illustrative invocation is provided after this procedure.
    Note: When you join a node to an HA cluster, the list-nodes command does not display that a node is in the process of joining the cluster. The newly added node will be displayed in the list-nodes command only after it has been added to the HA cluster.
  3. If you have upgraded to version 5.0.0 or later and are joining a freshly provisioned 5.0.0 (or later) node (with the join-cluster operation) to a cluster that has some connectors installed, then you must manually reinstall, on the new node, the connectors that were present on the existing nodes.
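For example, to join a node as an active secondary node to the primary node (the hostname fortisoar-primary.example.com and the key file path are illustrative; replace them with your primary node's DNS-resolvable name and your actual key file):
    # csadm ha join-cluster --status active --role secondary --primary-node fortisoar-primary.example.com
On a cloud instance with key-based authentication:
    # csadm ha join-cluster --status active --role secondary --primary-node fortisoar-primary.example.com --primary-node-ssh-key /home/csadmin/primary-node.pem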

Alternative process that can be followed to configure HA:

  1. Connect to your VM as a root user and run the following command:
    # csadm ha
    This will display the options available to configure HA.
  2. To configure a node as a secondary node, perform the following steps:
    1. SSH to the active primary node and run the csadm ha export-conf command to export the configuration details of the active primary node to a configuration file named ha.conf.
      You must copy the ha.conf file from the active primary node to the node that you want to configure as a secondary node.
    2. On the active primary server, whitelist the hostnames of the secondary nodes, using the following command:
      # csadm ha whitelist --nodes
      Add the comma-separated list of hostnames of the cluster nodes that you want to whitelist after the --nodes argument.
      Important: In case of an externalized database, you need to whitelist all nodes in a cluster in the pg_hba.conf file.
    3. Ensure that all HA nodes are resolvable through DNS and then SSH to the server that you want to configure as a secondary node and run the following command:
      # csadm ha join-cluster --status <active, passive> --role <primary, secondary> --conf <location of the ha.conf file>
      For example, # csadm ha join-cluster --status passive --role secondary --conf tmp/ha.conf
      This will add the node as a secondary node in the cluster. An end-to-end example is provided after this procedure.
      Note: If you run the csadm ha join-cluster command without whitelisting the hostnames of the secondary nodes, then you will get an error such as, Failed to verify....
      Also, when you join a node to an HA cluster, the list-nodes command does not display that a node is in the process of joining the cluster. The newly added node will be displayed in the list-nodes command only after it has been added to the HA cluster.
  3. If you have upgraded to version 5.0.0 or later and are joining a freshly provisioned 5.0.0 (or later) node (with the join-cluster operation) to a cluster that has some connectors installed, then you must manually reinstall, on the new node, the connectors that were present on the existing nodes.
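The following is an illustrative end-to-end run of the alternative process, assuming the ha.conf file is exported to /tmp, the secondary node's hostname is fortisoar-node2.example.com, and scp is used to copy the file (adjust these values to match your environment):
    On the active primary node:
    # csadm ha export-conf
    # scp /tmp/ha.conf csadmin@fortisoar-node2.example.com:/tmp/ha.conf
    # csadm ha whitelist --nodes fortisoar-node2.example.com
    On the node being joined as a secondary node:
    # csadm ha join-cluster --status active --role secondary --conf /tmp/ha.conf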

Important

In the case of an HA cluster, proxy settings get replicated only for FortiSOAR™ services on the secondary/passive nodes. OS services or commands such as 'yum', 'curl', or 'wget' do not honor the proxy settings of the primary node. Therefore, to configure proxy settings on the secondary node, you can either configure the proxy setting when the FortiSOAR Configuration Wizard is run on the first login of the 'csadmin' user (using SSH) or by using the csadm network {set-https-proxy|set-http-proxy|set-no-proxy} command.

Usage of the csadm ha command

Certain operations, such as takeover and join-cluster, might take a long time to run; therefore, you must ensure that your SSH session does not time out by entering the screen mode. For more information, see Handling session timeouts.

You can get help for the csadm ha command and subcommands using the --help parameter.

Note

It is recommended that you perform operations such as join-cluster, leave-cluster, etc., sequentially. For example, when you are adding nodes to a cluster, add the nodes one after the other rather than in parallel.

The following table lists all the subcommands that you can use with the csadm ha command:

Subcommand Brief Description
list-nodes Lists all the nodes that are available in the cluster with their respective node names and ID, status, role, and a comment that contains information about which nodes have joined the specific HA cluster and the primary server.
List Nodes command output
You can filter nodes for specific status, role, etc.
For example, if you want to retrieve only those nodes that are active, use the following command: csadm ha list-nodes --active, or if you want to retrieve secondary active nodes, then use the following command: csadm ha list-nodes --active --secondary.
Note: The list-nodes command will not display a node that is in the process of joining the cluster, i.e., it will display the newly added node only after it has been added to the HA cluster.
export-conf Exports the configuration details of the active primary node to a configuration file named ha.conf. For more details on export-conf, see the Process for configuring HA section.
whitelist Whitelists the hostnames of the secondary nodes in the HA cluster on the active primary node. For more details on whitelist, see the Process for configuring HA section.
Important: Ensure that incoming TCP traffic from the IP address(es) [xxx.xxx.xx.xxx] of your FortiSOAR™ instance(s) on port(s) 5432, 9200, and 6379 is not blocked by your organization's firewall.
join-cluster Adds a node to the cluster with the role and status you have specified. For more details on join-cluster, see the Process for configuring HA section.
firedrill Tests your disaster recovery configuration.
You can perform a firedrill on a secondary (active or passive) node only. Running the firedrill suspends the replication to the node's database and sets it up as a standalone node pointing to its local database. Since the firedrill is primarily performed to ensure that database replication is set up correctly, it is not applicable when the database is externalized.
Important: The node on which a firedrill is being performed will have its schedules and playbooks stopped, i.e., celerybeatd will be disabled on this node. This is done intentionally, as any configured schedules or playbooks should not run when the node is in the firedrill mode.
Once you have completed the firedrill, ensure that you perform a restore to get the node back to replication mode (see the example after this table).
restore Restores the node back to its original state in the cluster after you have performed a firedrill. That is, csadm ha restore restores the node that was converted to the active primary node after the firedrill back to its original state of a secondary node.
The restore command discards all activities, such as record creation, performed during the firedrill, since that data is assumed to be test data. This command restores the database from the content backed up prior to the firedrill.
takeover Performs a takeover when your active primary node is down. Therefore, you must run the csadm ha takeover command on the secondary node that you want to configure as your active primary node.
list-commands Lists all pending, in-progress, or failed commands that were propagated across the cluster nodes. You can filter this command for a specific nodeID or state.
For example, if you want to retrieve a list of failed commands use the following command: csadm ha list-commands --status failed.
In case of failed commands, you must check the reason for failure and re-run the failed command manually after resolving the error.
leave-cluster Removes a node from the cluster and the node goes back to the state it was in before joining the cluster.
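For example, a typical firedrill check on a secondary (active or passive) node with an internal database, followed by a restore, looks as follows (run both commands on that secondary node):
    # csadm ha firedrill
    Verify that the node comes up as a standalone node pointing to its local database.
    # csadm ha restore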

Points to be considered while working with High Availability configurations

  • When using an active-passive configuration with internal databases, ensure that replication between the nodes is working correctly using the following steps:
    • Perform the firedrill operation at regular intervals to ensure that the passive node can takeover successfully, when required.
    • Schedule full nightly backups at the active primary node using the FortiSOAR™ backup and restore scripts. For more information on backup and restore, see the Backing up and Restoring FortiSOAR™ chapter.
  • If your FortiSOAR™ instance is part of a High Availability cluster and you have added a secondary node, the License Manager page displays information about the nodes in the cluster, as shown in the following image:
    License Manager page when your FortiSOAR™ instance is part of a High Availability cluster
  • If you have built your own custom connector, then you must upload the .tgz file of the connector on all the nodes within the HA cluster.
    When you are uploading the .tgz file on all the nodes, you must ensure that you select the Delete all existing versions checkbox. You must also ensure that you have uploaded the same version of the connector to all the nodes.
  • For the procedure on how to upgrade a FortiSOAR™ High Availability Cluster to 6.0.0, see the Upgrading a FortiSOAR™ High Availability Cluster to 6.0.0 article.

Takeover

Use the csadm ha takeover command to perform a takeover when your active primary node is down. Run this command on the secondary node that you want to configure as your active primary node.

From version 5.1.0 onwards, takeover is a single-step operation, i.e., you do not need to manually reconfigure all the nodes in the cluster to point to the new active primary node. The takeover operation reconfigures the nodes to point to the new active primary node during the process.

However, if during takeover you specify no to the Do you want to invoke join-cluster on other cluster nodes? prompt, or if any node(s) is not reachable, then you will have to reconfigure all the nodes (or the node(s) that were not reachable) in the cluster to point to the new active primary node using the csadm ha join-cluster command.

In case of an internal database cluster, when the failed primary node comes online after the takeover, it still thinks of itself as the active primary node with all its services running. In case of an external database cluster, when the failed primary node comes online after the takeover, it detects its status as "Faulted" and disables all its services. In both cases, run the csadm ha join-cluster command to point all the nodes to the new active primary node. For details on join-cluster, see Process for configuring HA.
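For example, to promote a secondary node when the active primary node is down, run the following command on that secondary node:
    # csadm ha takeover
If any node could not be reconfigured automatically during the takeover, rejoin it to the new active primary node, using the status and role appropriate for that node (the hostname is illustrative):
    # csadm ha join-cluster --status active --role secondary --primary-node fortisoar-new-primary.example.com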

Tunables

You can tune the following configurations:

  • max_wal_senders = 10
    This attribute defines the maximum number of walsender processes. By default, this is set as 10.
  • wal_keep_segments = 320
    This attribute specifies the number of WAL segments to retain, which amounts to a maximum of 5 GB of data.
    Important: Both max_wal_senders and wal_keep_segments attributes are applicable when the database is internal.

Every secondary/passive node needs one WAL sender process on the primary node, which means that the default setting supports a maximum of 10 secondary/passive nodes.

If you have more than 10 secondary/passive nodes, then you need to edit the value of the max_wal_senders attribute in the /var/lib/pgsql/12/data/postgresql.conf file on the primary node and restart the PostgreSQL server using the following command: systemctl restart postgresql-12
Note: You might find multiple occurrences of the max_wal_senders attribute in the postgresql.conf file. Always edit the last occurrence of the attribute.
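For example, to support up to 16 secondary/passive nodes (16 is an illustrative value), edit the last occurrence of the attribute in the postgresql.conf file on the primary node and then restart PostgreSQL:
    # vi /var/lib/pgsql/12/data/postgresql.conf
    max_wal_senders = 16
    # systemctl restart postgresql-12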

The wal_keep_segments attribute is set to 320, which means that the secondary nodes can lag behind by a maximum of 5 GB. If the lag is more than 5 GB, then replication will not work properly, and you will need to reconfigure the secondary node by running the join-cluster command.

Also note that settings changes made in any configuration file on an instance, such as changing the log level, apply only to that instance. Therefore, if you want to apply a changed setting to all the nodes, you must make those changes across all the cluster nodes.

HAProxy

The clustered instances should be fronted by a TCP Load Balancer such as HAProxy, and clients should connect to the cluster using the address of the proxy.

Setting up HAProxy as a TCP load balancer fronting the two clustered nodes

The following steps describe how to install and configure HAProxy on a CentOS Virtual Machine:

  1. # yum install haproxy
  2. In the /etc/haproxy/haproxy.cfg file, add the policy as shown in the following image (an illustrative policy is also provided after this procedure):
    HA Proxy policy in the haproxy configuration file
  3. To reload the firewall, run the following commands:
    sudo firewall-cmd --zone=public --add-port=<port specified while binding HAProxy>/tcp --permanent
    sudo firewall-cmd --reload
  4. Restart haproxy using the following command:
    # systemctl restart haproxy
  5. Use the bind address (instead of the IP address of the node in the cluster) for accessing the FortiSOAR™ UI.
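The following is a minimal illustrative TCP policy for the /etc/haproxy/haproxy.cfg file, assuming two cluster nodes named fortisoar-node1.example.com and fortisoar-node2.example.com and HAProxy bound to port 443; adjust the bind port, balancing algorithm, and server entries to match your environment:
    frontend fortisoar_frontend
        bind *:443
        mode tcp
        default_backend fortisoar_nodes
    backend fortisoar_nodes
        mode tcp
        balance roundrobin
        server node1 fortisoar-node1.example.com:443 check
        server node2 fortisoar-node2.example.com:443 check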

Behavior that might be observed while publishing modules when you are accessing HA clusters using the HAProxy

When you have initiated a publish for any module management activity and you are accessing your HA cluster with one or more active secondary nodes using HAProxy, then you might observe the following behaviors:

  • While the Publish operation is in progress, you might see many publish status messages on the UI.
  • If you have added a new field to the module, or you have removed a field from the module, then you might observe that these changes are not reflected on the UI. In such cases, you must log out of FortiSOAR™ and log back into FortiSOAR™.
  • After a successful publish of the module(s), you might observe that the Publish button is still enabled and the modules still have the asterisk (*) sign. In such cases, you must log out of FortiSOAR™ and log back into FortiSOAR™ to view the correct state of the Publish operation.

Troubleshooting HA Issues

Unable to add a node to an HA cluster using join-cluster, and the node gets stuck at a service restart

This issue occurs when you run join-cluster on a node and the node gets stuck at a service restart, specifically at the PostgreSQL restart.

Resolution

Terminate the join-cluster process and retry join-cluster with the additional --fetch-fresh-backup parameter, as shown in the following example.
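For example, using the join-cluster syntax described in the Process for configuring High Availability section (the hostname is illustrative):
    # csadm ha join-cluster --status active --role secondary --primary-node fortisoar-primary.example.com --fetch-fresh-backup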

Fixing the HA cluster when the Primary node of that cluster is halted and then resumed

If your primary node is halted due to a system crash or other such events, and a new cluster is formed with the other nodes in the HA cluster, the list-nodes command on the other nodes displays the original primary node in the Faulted state. Since the administrator triggered the takeover on the other cluster nodes, the administrator is aware of the faulted primary node. Also, note that even after the halted primary node resumes, it still considers itself the primary node of its own cluster; therefore, after it resumes, the list-nodes command on that node displays it as Primary Active.

Resolution

To fix the HA cluster to have only one node as primary active node, do the following:

  1. On the primary node that was resumed, run the leave-cluster command, which removes this node from the HA cluster.
  2. Run the join-cluster command to join this node to the HA cluster with the new primary node, as illustrated in the following example.
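     For example, on the resumed node (the hostname of the new active primary node is illustrative):
     # csadm ha leave-cluster
     # csadm ha join-cluster --status active --role secondary --primary-node fortisoar-new-primary.example.com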

Unable to join a node to an HA cluster when a proxy is enabled

You are unable to join a node to an HA cluster using the join-cluster command when you have enabled a proxy through which clients connect to the HA cluster.

Resolution

Run the following commands on your primary node:

# sudo firewall-cmd --zone=trusted --add-source=<CIDR> --add-port=<ElasticSearchPort>/tcp --permanent

# sudo firewall-cmd --reload

For example,

# sudo firewall-cmd --zone=trusted --add-source=64.39.96.0/20 --add-port=9200/tcp --permanent

# sudo firewall-cmd --reload