HA Cluster

How Proxmox HA Cluster Works: Reliable and Highly Available Virtualization

Proxmox Virtual Environment (Proxmox VE) is a virtualization platform that enables easy management of virtualized resources and their clustering. One of the key components of Proxmox VE is the High Availability (HA) cluster which provides reliability and highly available resources for applications and services.

What is a Proxmox HA Cluster?

A Proxmox HA cluster is an aggregation of several physical or virtual servers into one cluster, allowing for automatic workload redistribution and recovery in case of failure of one of the servers. This functionality ensures that your applications and services remain available even in the event of a node failure within the cluster.

Proxmox HA Cluster Architecture

The Proxmox HA cluster consists of multiple nodes (servers) that are connected via network links for communication and data synchronization; at least three nodes are needed for a reliable quorum. Each node in the cluster has access to shared data storage, such as storage built on Ceph. Within the HA stack, one node is elected as the master of the cluster resource manager and coordinates HA decisions, while every node, including the current master, runs virtual machines and contributes compute and storage capacity. If the master node fails, another node takes over that role.

At Daktela, we currently operate two independent HA clusters. Each contains several nodes (blades), split 50:50 between our two primary data centers.

HA-PVE1

HA-PVE2

Ceph Storage

Ceph is an open-source distributed storage system that enables efficient and reliable management of data storage. Within the Proxmox HA cluster, Ceph provides the shared storage used by all nodes. Data is split up and replicated across different nodes, ensuring fault tolerance and high availability.
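Replication trades raw capacity for fault tolerance. The following minimal sketch shows that arithmetic. The replica count of 3 and the 120 TB raw capacity are assumptions chosen for illustration only; they do not describe our actual pool configuration.

```python
def usable_capacity_tb(raw_tb: float, replicas: int) -> float:
    """Usable capacity when every object is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_tb / replicas


def copies_that_can_be_lost(replicas: int) -> int:
    """How many copies of an object can be lost while one copy survives."""
    return replicas - 1


# Hypothetical numbers: 120 TB of raw disk space, 3 copies of every object.
print(usable_capacity_tb(120.0, 3))   # 40.0
print(copies_that_can_be_lost(3))     # 2
```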

Our Ceph storage is built on NVMe disks, so synchronization between individual nodes adds very little delay. Thanks to 100 Gbit/s connections between all clusters and both primary data centers, the additional write latency is negligible.

Corosync

Corosync is open-source software for communication and coordination between nodes in the cluster. It synchronizes state and events among nodes, which is crucial for the proper functioning of the Proxmox HA cluster. Corosync ensures that all nodes have a consistent view of the cluster's state and enables quick detection of failures, so that the cluster can adapt workload placement accordingly.
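A consistent view of the cluster rests on a majority vote (quorum): a group of nodes may act only while it holds more than half of all votes. The sketch below illustrates that rule under the assumption that every node has exactly one vote. The function names are ours and are not part of Corosync.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return expected_votes // 2 + 1


def has_quorum(expected_votes: int, online_votes: int) -> bool:
    """True if the online nodes hold a strict majority of all votes."""
    return online_votes >= quorum_threshold(expected_votes)


# Assumption: one vote per node.
for nodes in (2, 3, 4, 5):
    tolerated = nodes - quorum_threshold(nodes)
    print(f"{nodes} nodes: quorum={quorum_threshold(nodes)}, "
          f"tolerated failures={tolerated}")
```

With one vote per node, a two-node cluster cannot tolerate any failure, while three nodes tolerate one and five nodes tolerate two.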

Failure Detection and Virtual Machine Migration

When a node failure is detected, the remaining nodes learn about it through Corosync. The HA recovery mechanism is then activated: virtual machines that were running on the failed node are automatically restarted on another available node. The duration of this process depends on the configuration and load of the cluster but is typically measured in tens of seconds. At Daktela, our tests show that customer instances are fully operational again within 2 minutes of a failover.
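The recovery step can be pictured as reassigning the failed node's virtual machines to the surviving nodes. The sketch below is only an illustration of that idea: it places each VM on the surviving node that currently runs the fewest VMs. This is not the placement logic Proxmox itself uses, and the node and VM names are hypothetical.

```python
def recover_vms(placement: dict[str, list[str]],
                failed_node: str) -> dict[str, list[str]]:
    """Return a new placement with the failed node's VMs spread over survivors."""
    survivors = {node: list(vms)
                 for node, vms in placement.items() if node != failed_node}
    if not survivors:
        raise RuntimeError("no surviving node available for recovery")
    for vm in placement.get(failed_node, []):
        # Pick the node with the fewest VMs; break ties by node name.
        target = min(survivors, key=lambda node: (len(survivors[node]), node))
        survivors[target].append(vm)
    return survivors


# Hypothetical cluster state before the failure of "node-a".
before = {"node-a": ["vm101", "vm102"], "node-b": ["vm103"], "node-c": []}
print(recover_vms(before, "node-a"))
# {'node-b': ['vm103', 'vm102'], 'node-c': ['vm101']}
```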

Description of Proxmox Backup Server Functionality within the Proxmox HA Cluster

Proxmox Backup Server (PBS) is an open-source tool developed by Proxmox Server Solutions for data backup and recovery in virtualized environments. Within the Proxmox HA cluster, PBS plays a crucial role in providing reliable data protection and a highly available backup solution.

Virtual Machine Backup

Using PBS, we regularly back up virtual machines (VMs) running on nodes in the Proxmox HA cluster. Backups run automatically according to a defined schedule. In our implementation, backups are performed twice daily, which keeps the backed-up data current and limits data loss in case of failure. For selected customers, we are able to perform backups every hour.
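The backup frequency bounds how much data can be lost if a machine has to be restored from the most recent backup. A minimal sketch of that arithmetic follows, assuming backups are spaced evenly over the day (the exact schedule times are not stated here).

```python
def worst_case_loss_hours(backups_per_day: int) -> float:
    """Longest possible gap between a failure and the previous backup."""
    if backups_per_day < 1:
        raise ValueError("at least one backup per day is required")
    return 24 / backups_per_day


print(worst_case_loss_hours(2))    # twice daily -> 12.0
print(worst_case_loss_hours(24))   # hourly      -> 1.0
```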

PBS backs up using snapshots, so the backups are very fast.

We also use other backup technologies, such as Dirvish and XtraBackup. As a result, we can restore not only an entire machine 1:1 from the last backup, but also just part of an instance, such as individual files or only the database.

Georedundant Backup

Our Proxmox HA cluster uses two georedundant all-flash Proxmox Backup Servers for storing backups. Backups are therefore stored in two separate locations, which increases fault tolerance and keeps data restorable even if one of the data centers fails.

Data Deduplication and Compression

PBS uses data deduplication and compression techniques for efficient storage utilization. Deduplication identifies and removes duplicate data across different backups, while compression reduces the size of the stored data. This reduces storage requirements and saves space and bandwidth during data transfer between nodes. Thanks to this, we are able to back up over 100 TB of data daily.
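The idea behind chunk-based deduplication can be sketched in a few lines: data is cut into chunks, each chunk is identified by its hash, and a chunk that is already stored is not stored again. The example below is a simplified illustration, not the PBS implementation. The tiny chunk size is for demonstration only, zlib stands in for the compression step because it ships with Python, and the function names are hypothetical.

```python
import hashlib
import zlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the demo is easy to follow


def store_backup(data: bytes, store: dict[str, bytes],
                 chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into chunks, store unseen chunks, return the chunk index."""
    index = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:            # deduplication: skip known chunks
            store[digest] = zlib.compress(chunk)
        index.append(digest)
    return index


def restore_backup(index: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the original data from its chunk index."""
    return b"".join(zlib.decompress(store[digest]) for digest in index)


store: dict[str, bytes] = {}
first = store_backup(b"AAAABBBBCCCC", store)
second = store_backup(b"AAAABBBBDDDD", store)   # shares two chunks with first
print(len(first) + len(second), "chunks referenced,", len(store), "stored")
assert restore_backup(second, store) == b"AAAABBBBDDDD"
```

The two backups reference six chunks in total, but only four distinct chunks are stored, because the second backup reuses two chunks from the first.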

Backup Integrity and Verification

PBS performs regular integrity checks of backups to ensure that they are complete and undamaged. Backup verification is an important part of data security and helps minimize the risk of data loss when restoration is needed.
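One common way to check integrity is to recompute the checksum of each stored piece of data and compare it with the checksum recorded at backup time. The sketch below illustrates that principle with SHA-256. The in-memory store and the function name are hypothetical and do not reflect the PBS on-disk format.

```python
import hashlib


def find_corrupt_chunks(store: dict[str, bytes]) -> list[str]:
    """Return the keys whose stored bytes no longer match their checksum."""
    return [digest for digest, chunk in store.items()
            if hashlib.sha256(chunk).hexdigest() != digest]


good = b"customer data"
damaged_key = hashlib.sha256(b"other data").hexdigest()
store = {
    hashlib.sha256(good).hexdigest(): good,   # intact chunk
    damaged_key: b"other dat\x00",            # simulated corruption
}
print(find_corrupt_chunks(store) == [damaged_key])  # True
```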

Simple Management and Monitoring

Proxmox Backup Server provides a user-friendly interface for configuration, scheduling, and monitoring of backups. Administrators can easily monitor the backup status, perform data restoration, and manage storage through a central interface.

To further increase awareness of the backup status, we have implemented additional monitoring methods:

Monitoring the status using Zabbix
Notifications in a ticketing system
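A typical monitoring rule for backups is a freshness check: raise an alert when the newest backup is older than expected. The sketch below shows such a rule as a plain function. The 13-hour threshold (12 hours between twice-daily backups plus a 1-hour margin) and the timestamps are assumptions for illustration; this is not our actual Zabbix configuration.

```python
from datetime import datetime, timedelta, timezone


def is_backup_stale(last_backup: datetime, now: datetime,
                    max_age: timedelta) -> bool:
    """True if the newest backup is older than the allowed maximum age."""
    return now - last_backup > max_age


# Hypothetical threshold and timestamps.
max_age = timedelta(hours=13)
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
recent = datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc)    # 10 hours old
old = datetime(2024, 1, 1, 20, 0, tzinfo=timezone.utc)      # 16 hours old

print(is_backup_stale(recent, now, max_age))  # False
print(is_backup_stale(old, now, max_age))     # True
```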

Recovery from Backup

Thanks to geo-redundant storage and layered backups, we are able to restore operations even in the event of a complete data center outage or data inconsistencies in individual customer instances.