AE-2321 : Implement Docker swarm for the single & multi node deployments
Review Request #1285 — Created Jan. 6, 2026 and submitted
| Information | |
|---|---|
| pmurugaiyan | |
| AMP | |
| amp_4_0 | |
| AE-2321 | |
| Reviewers | |
| apoorva.sn, pradeep, shuinvy | |
- Introduced Docker Swarm based orchestration.
- Added single-node support and partial failover support for multi-node clusters.
- Added health checks and self-healing capabilities.
System Architecture: AMP Platform on Docker Swarm
1. Executive Summary
This document details the architecture of the AMP Platform deployed on a 3-Node Docker Swarm cluster. The current design utilizes a Hybrid High Availability (HA) model: it provides full redundancy for network ingress and stateless processing, while pinning stateful storage to a designated primary node to ensure data consistency without external shared storage dependencies (NFS/SAN).
2. Deployment Topology
The cluster consists of three nodes participating in a Docker Swarm.
Node Roles
| Node | Hostname | Swarm Role | Labels | Primary Responsibility |
|---|---|---|---|---|
| Node 1 | amp-node-1 | Manager (Leader) | type=storage | Storage + Compute. Hosts the database files (OpenSearch, TimescaleDB, Registry) on local high-speed disk. |
| Node 2 | amp-node-2 | Manager/Worker | type=compute | Compute Only. Runs stateless services (Logstash, Nginx, Dashboards). |
| Node 3 | amp-node-3 | Manager/Worker | type=compute | Compute Only. Runs stateless services (Logstash, Nginx, Dashboards). |
Component Diagram
graph TD
Client["Client / Log Sources"] --> VIP["Virtual IP (Keepalived)"]
subgraph cluster_swarm ["Docker Swarm Cluster"]
VIP --> Nginx_Service["Nginx (Web Port 443)"]
VIP --> Logstash["Logstash (Syslog Port 514)"]
subgraph cluster_stateless ["Stateless Layer (Any Node)"]
Nginx_Service --> Dashboards
Nginx_Service --> Grafana
end
subgraph cluster_stateful ["Stateful Layer (Node 1 Only)"]
Logstash --> OpenSearch[("OpenSearch Data")]
Grafana --> Timescale[("TimescaleDB Data")]
end
end
3. High Availability (HA) Strategy
3.1 Network Layer (Ingress)
- Technology: Keepalived (VRRP).
- Mechanism: A floating Virtual IP (VIP) is assigned to the active Leader (Node 1). If Node 1 fails, the VIP automatically migrates to Node 2 or Node 3 within seconds.
- Benefit: External systems (Syslog senders, User Browsers) never need to reconfigure IPs. The "Front Door" is always open.
3.2 Compute Layer (Stateless Services)
- Services: `nginx`, `logstash`, `opensearch-dashboards`, `grafana`.
- Mechanism: Docker Swarm Orchestration.
- Behavior: These services are not pinned. If a node fails, Swarm reschedules replicas to remaining healthy nodes.
- Benefit: Continuous request processing. Dashboards and Ingestion endpoints remain accessible.
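As an illustration of the unpinned behavior described above, a stateless service entry in `stack.yml` might look roughly like the following. This is a minimal sketch, not the stack file from this change: the image tag, replica count, restart delay, and health-check command are all assumptions.

```yaml
# Hypothetical stack.yml fragment for a stateless service.
# No placement constraint is set, so Swarm may schedule tasks on any node.
version: "3.8"
services:
  nginx:
    image: nginx:stable              # assumed image and tag
    ports:
      - "443:443"                    # web port from the component diagram
    deploy:
      replicas: 3                    # assumed: one task per node
      restart_policy:
        condition: any               # restart the task whenever it exits
        delay: 5s                    # assumed back-off between restarts
    healthcheck:
      # Assumes curl is available inside the image.
      test: ["CMD", "curl", "-fk", "https://localhost/"]
      interval: 30s
      timeout: 5s
      retries: 3
```

With a health check defined, Swarm replaces a task whose container is reported unhealthy, which is one way the self-healing mentioned in the summary can be realized.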
3.3 Storage Layer (Stateful Services)
- Services: `opensearch`, `timescaledb`.
- Mechanism: Pinned placement (`constraints: - node.labels.type == storage`) using Local Docker Volumes (`driver: local`).
- Behavior: These services MUST run on Node 1 because their data files exist physically on Node 1's disk. They cannot start on other nodes.
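The pinning described above could be expressed in `stack.yml` roughly as follows. Only the constraint expression and `driver: local` come from this description; the image tag and volume name are assumptions made for the sketch.

```yaml
# Hypothetical stack.yml fragment for a pinned stateful service.
version: "3.8"
services:
  opensearch:
    image: opensearchproject/opensearch:latest   # assumed image and tag
    volumes:
      - opensearch-data:/usr/share/opensearch/data
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.type == storage          # only Node 1 carries this label
volumes:
  opensearch-data:                               # assumed volume name
    driver: local                                # data lives on the scheduling node's disk
```

Because a `local` volume is created on whichever node runs the task, removing the constraint without changing the volume backend would let the service start elsewhere with an empty data directory, which is why the constraint and the volume driver have to be considered together.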
4. Architectural Analysis
4.1 Advantages
- High Performance (I/O): By using local disks (NVMe/SSD) on Node 1 instead of NFS, database write speeds are maximized. This is critical for high-volume log ingestion.
- Stateless Scalability: Heavy processing tasks (Log Parsing via Logstash, SSL Termination via Nginx) are distributed across 3 nodes. Node 1 is not overwhelmed by CPU tasks, leaving it free to handle Database I/O.
- Simplicity: No requirement for external NAS, SAN, or complex Ceph/GlusterFS configurations. Reduced maintenance overhead.
- Partial Failover: If Node 1 fails, the VIP remains reachable and the UI still loads (while showing backend connection errors), so clients do not get "Connection Refused" errors.
4.2 Limitations & Risks
- Single Point of Failure (Data): If Node 1 fails, OpenSearch and TimescaleDB go offline. No new logs can be written, and no historical data can be queried.
- Data Loss Risk (Buffer Overflow): Logstash instances on the surviving nodes buffer incoming logs in memory while OpenSearch is unreachable, but the buffer is finite. If Node 1 does not recover quickly, the buffer fills up and Logstash starts dropping data.
- Manual Recovery: If Node 1 has a hardware failure, recovering the data requires restoring from backups, as the data does not exist on Nodes 2/3.
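One possible way to widen the buffering window mentioned above is Logstash's disk-backed persistent queue. This is not part of the reviewed change; it is a sketch of a `logstash.yml` fragment, and the queue size shown is an assumed value that would need to be sized against the expected ingest rate.

```yaml
# Hypothetical logstash.yml fragment (not part of this change).
queue.type: persisted                          # buffer events on disk instead of in memory
path.queue: /usr/share/logstash/data/queue     # default location under path.data in the official image
queue.max_bytes: 4gb                           # assumed cap; once reached, inputs are back-pressured
```

The queue only survives a rescheduled task if the queue path is backed by a volume, so this option extends the window but does not remove the dependency on Node 1 recovering.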
5. Solution: Path to Full HA
To mitigate the storage limitation and achieve Zero Downtime, the cluster can be upgraded to use Shared Storage.
Implementation: NFS Migration
- Dependency: An external NFS Server accessible by all 3 nodes.
- Configuration:
  - Update `stack.yml` volumes to use `driver_opts: type: nfs`.
  - Remove `placement.constraints` from database services.
- Result: Database containers effectively become stateless. If Node 1 fails, Swarm simply restarts OpenSearch on Node 2, which mounts the same NFS path and resumes operations immediately.
This upgrade can be performed without reinstalling the cluster, by modifying the stack.yml as detailed in README.md.
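For reference, the volume change might look like the fragment below. It is a sketch under stated assumptions: the NFS server address, export path, NFS version, and volume name are hypothetical, and the authoritative steps are the ones in README.md.

```yaml
# Hypothetical stack.yml volume definition after the NFS migration.
volumes:
  opensearch-data:                           # assumed volume name
    driver: local
    driver_opts:
      type: nfs
      o: "addr=10.0.0.50,rw,nfsvers=4"       # hypothetical NFS server address and options
      device: ":/exports/amp/opensearch"     # hypothetical export path
```

Docker generally does not change the options of a named volume that already exists on a node, so the existing local volume would likely need to be backed up and recreated before the new definition takes effect.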
The changes have been tested locally.
