
Summary:

AE-2321 : Implement Docker swarm for the single & multi node deployments

Review Request #1285 — Created Jan. 6, 2026 and submitted Jan. 7, 2026, 2:06 p.m.

Information
Owner:	pmurugaiyan
Repository:	AMP
Branch:	amp_4_0
Bugs:	AE-2321
Depends On:
Reviewers
Groups:
People:	apoorva.sn, pradeep, shuinvy

Description

Introduced Docker Swarm based orchestration.
Added support for single-node deployments and partial failover for multi-node clusters.
Added health checks and self-healing capabilities.

System Architecture: AMP Platform on Docker Swarm

1. Executive Summary

This document details the architecture of the AMP Platform deployed on a 3-Node Docker Swarm cluster. The current design utilizes a Hybrid High Availability (HA) model: it provides full redundancy for network ingress and stateless processing, while pinning stateful storage to a designated primary node to ensure data consistency without external shared storage dependencies (NFS/SAN).

2. Deployment Topology

The cluster consists of three nodes participating in a Docker Swarm.

Node Roles

| Node | Hostname | Swarm Role | Labels | Primary Responsibility |
| --- | --- | --- | --- | --- |
| Node 1 | `amp-node-1` | Manager (Leader) | `type=storage` | Storage + Compute. Hosts the database files (OpenSearch, TimescaleDB, Registry) on local high-speed disk. |
| Node 2 | `amp-node-2` | Manager/Worker | `type=compute` | Compute Only. Runs stateless services (Logstash, Nginx, Dashboards). |
| Node 3 | `amp-node-3` | Manager/Worker | `type=compute` | Compute Only. Runs stateless services (Logstash, Nginx, Dashboards). |

Component Diagram

graph TD
    Client["Client / Log Sources"] --> VIP["Virtual IP (Keepalived)"]

    subgraph cluster_swarm ["Docker Swarm Cluster"]
        VIP --> Nginx_Service["Nginx (Web Port 443)"]
        VIP --> Logstash["Logstash (Syslog Port 514)"]

        subgraph cluster_stateless ["Stateless Layer (Any Node)"]
            Nginx_Service --> Dashboards
            Nginx_Service --> Grafana
        end

        subgraph cluster_stateful ["Stateful Layer (Node 1 Only)"]
            Logstash --> OpenSearch[("OpenSearch Data")]
            Grafana --> Timescale[("TimescaleDB Data")]
        end
    end

3. High Availability (HA) Strategy

3.1 Network Layer (Ingress)

Technology: Keepalived (VRRP).
Mechanism: A floating Virtual IP (VIP) is assigned to the active Leader (Node 1). If Node 1 fails, the VIP automatically migrates to Node 2 or Node 3 within seconds.
Benefit: External systems (Syslog senders, User Browsers) never need to reconfigure IPs. The "Front Door" is always open.

3.2 Compute Layer (Stateless Services)

Services: nginx, logstash, opensearch-dashboards, grafana.
Mechanism: Docker Swarm Orchestration.
Behavior: These services are not pinned. If a node fails, Swarm reschedules replicas to remaining healthy nodes.
Benefit: Continuous request processing. Dashboards and Ingestion endpoints remain accessible.
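
For illustration, an unpinned service with a health check might be declared in stack.yml roughly as follows. This is a minimal sketch, not the stack file from this change: the image name, replica count, probe command, and timing values are all assumptions.

```yaml
# Sketch of a stateless service in stack.yml (illustrative values only).
version: "3.8"

services:
  nginx:
    image: nginx:stable            # assumed image; the real stack may use a custom build
    ports:
      - "443:443"                  # web port from the component diagram
    healthcheck:
      # Assumed probe; it requires curl to be present inside the image.
      test: ["CMD-SHELL", "curl -fsk https://localhost/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      replicas: 3                  # assumed; no placement constraint, so any node may run a task
      restart_policy:
        condition: any             # restart the task whenever it exits
        delay: 5s
```

With no `placement.constraints` entry, Swarm is free to place each replica on any healthy node, and a task whose health check keeps failing is replaced by a new one.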

3.3 Storage Layer (Stateful Services)

Services: opensearch, timescaledb.
Mechanism: Pinned placement (constraint `node.labels.type == storage`) using local Docker volumes (`driver: local`).
Behavior: These services MUST run on Node 1 because their data files exist physically on Node 1's disk. They cannot start on other nodes.
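
The pinning described above can be sketched as the following stack.yml fragment. Only the constraint expression and the `local` driver come from this description; the image tag, volume name, and replica count are assumptions for illustration.

```yaml
# Sketch of a pinned stateful service (illustrative values only).
version: "3.8"

services:
  opensearch:
    image: opensearchproject/opensearch:latest    # assumed tag
    volumes:
      - opensearch-data:/usr/share/opensearch/data
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.type == storage           # only nodes labelled type=storage qualify

volumes:
  opensearch-data:                                # hypothetical volume name
    driver: local                                 # data lives on the disk of the node running the task
```

The label itself is attached to the node, for example with `docker node update --label-add type=storage amp-node-1`. If no node carrying the label is available, the task stays pending instead of starting elsewhere.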

4. Architectural Analysis

4.1 Advantages

High Performance (I/O): By using local disks (NVMe/SSD) on Node 1 instead of NFS, database write speeds are maximized. This is critical for high-volume log ingestion.
Stateless Scalability: Heavy processing tasks (Log Parsing via Logstash, SSL Termination via Nginx) are distributed across 3 nodes. Node 1 is not overwhelmed by CPU tasks, leaving it free to handle Database I/O.
Simplicity: No requirement for external NAS, SAN, or complex Ceph/GlusterFS configurations. Reduced maintenance overhead.
Partial Failover: If Node 1 fails, the UI still loads (although requests that depend on the databases return errors) and the VIP remains reachable, so clients do not see "Connection Refused" errors.

4.2 Limitations & Risks

Single Point of Failure (Data): If Node 1 fails, OpenSearch and TimescaleDB go offline. No new logs can be written, and no historical data can be queried.
Data Loss Risk (Buffer Overflow): Logstash on the surviving nodes buffers incoming logs in memory, but the buffer will eventually fill up and Logstash will start dropping data if Node 1 does not recover quickly.
Manual Recovery: If Node 1 has a hardware failure, recovering the data requires restoring from backups, as the data does not exist on Nodes 2/3.

5. Solution: Path to Full HA

To mitigate the storage limitation and achieve Zero Downtime, the cluster can be upgraded to use Shared Storage.

Implementation: NFS Migration

Dependency: An external NFS Server accessible by all 3 nodes.
Configuration:
1. Update the stack.yml volumes to use `driver_opts` with `type: nfs`.
2. Remove `placement.constraints` from the database services.
Result: Database containers effectively become stateless. If Node 1 fails, Swarm restarts OpenSearch on a surviving node (Node 2 or Node 3), which mounts the same NFS path and resumes operations once the container has started.

This upgrade can be performed without reinstalling the cluster, by modifying the stack.yml as detailed in README.md.
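
The two configuration steps above might look like the following after the change. This is a sketch under stated assumptions: the NFS server address, export path, mount options, image tag, and volume name are hypothetical, and the actual procedure is the one in README.md.

```yaml
# Sketch of the NFS-backed variant (illustrative values only).
version: "3.8"

services:
  opensearch:
    image: opensearchproject/opensearch:latest    # assumed tag
    volumes:
      - opensearch-data:/usr/share/opensearch/data
    deploy:
      replicas: 1
      # placement.constraints removed: any node may now run the task

volumes:
  opensearch-data:                                # hypothetical volume name
    driver: local
    driver_opts:
      type: nfs
      o: "addr=nfs.example.internal,rw,nfsvers=4" # hypothetical NFS server and mount options
      device: ":/exports/amp/opensearch"          # hypothetical export path
```

Each node mounts the same export when the task is scheduled there, which is what allows the database service to move between nodes.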

Testing Done

The changes have been tested locally.


Ship it!

