AE-2334 : Update build process to have the docker and other dependency images ready with image

Review Request #1313 — Created Jan. 14, 2026 and submitted — Latest diff uploaded

pmurugaiyan
AMP
amp_4_0
AE-2334
apoorva.sn, pradeep, shuinvy

AMP High Availability (HA) Cluster Deployment Guide

This guide describes how to deploy the Array Management Platform (AMP) on a 3-node (or larger) Docker Swarm cluster using the manage_amp.sh automation script.

1. Prerequisites

Hardware

  • 3 Nodes (Physical or Virtual Machines)
  • OS: Rocky Linux 9 / RHEL 9 (Recommended)
  • Resources: Minimum 8GB RAM, 4 vCPUs per node.

Network

  • All nodes must be on the same LAN/VLAN.
  • Static IPs are recommended for stability.

Firewall Configuration (On ALL Nodes)

You must open the following ports for Docker Swarm and AMP services:

# Docker Swarm Ports
firewall-cmd --add-port=2377/tcp --permanent
firewall-cmd --add-port=7946/tcp --permanent
firewall-cmd --add-port=7946/udp --permanent
firewall-cmd --add-port=4789/udp --permanent

# AMP Service Ports
firewall-cmd --add-port=80/tcp --permanent   # HTTP
firewall-cmd --add-port=443/tcp --permanent  # HTTPS
firewall-cmd --add-port=5000/tcp --permanent # Local Registry
firewall-cmd --add-port=5432/tcp --permanent # Database (HAProxy/PGBouncer)
firewall-cmd --add-port=5433/tcp --permanent # Database (Direct Patroni)
firewall-cmd --add-port=2379-2380/tcp --permanent # Etcd
firewall-cmd --add-port=8008/tcp --permanent # Patroni API
firewall-cmd --add-port=9200/tcp --permanent # OpenSearch REST (Inter-node & Dashboards)
firewall-cmd --add-port=9300/tcp --permanent # OpenSearch Transport (Cluster)
firewall-cmd --add-port=5601/tcp --permanent # OpenSearch Dashboards
# IMPORTANT: Traffic between nodes on Overlay (UDP 4789) and OpenSearch (9200) MUST be allowed!
firewall-cmd --add-port=3000/tcp --permanent # Grafana
firewall-cmd --add-port=514/tcp --permanent  # Logstash Syslog TCP
firewall-cmd --add-port=514/udp --permanent  # Logstash Syslog UDP

# Reload
firewall-cmd --reload

⚠️ WARNING: Do NOT manually add Docker interfaces (docker0, docker_gwbridge) to firewalld zones. Docker manages these interfaces itself. Manual zone assignments cause ZONE_CONFLICT errors that prevent Docker from starting.
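
After the reload, it can help to confirm that every port from the list above is actually open. The following is a minimal sketch, assuming the ports were added to the default firewalld zone; missing_ports is a hypothetical helper name (it is not part of manage_amp.sh) that compares the output of firewall-cmd --list-ports with the ports required by this guide.

```shell
# Ports required by this guide (Docker Swarm + AMP services).
REQUIRED_PORTS="2377/tcp 7946/tcp 7946/udp 4789/udp 80/tcp 443/tcp 5000/tcp 5432/tcp 5433/tcp 2379-2380/tcp 8008/tcp 9200/tcp 9300/tcp 5601/tcp 3000/tcp 514/tcp 514/udp"

# missing_ports OPEN_PORTS
# Prints each required port that does not appear in the space-separated
# OPEN_PORTS list (the format printed by `firewall-cmd --list-ports`).
missing_ports() {
    open_ports=" $1 "
    for port in $REQUIRED_PORTS; do
        case "$open_ports" in
            *" $port "*) ;;        # already open
            *) echo "$port" ;;     # required but not open
        esac
    done
}

# Example on a node (run as root, after `firewall-cmd --reload`):
#   missing_ports "$(firewall-cmd --list-ports)"
```

An empty result means all required ports are present in the zone. This only checks the local firewalld configuration; it does not prove that traffic between nodes is allowed by other network devices.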

System Tuning (On ALL Nodes)

OpenSearch requires increased virtual memory. Run on each node:

# Using manage_amp.sh (recommended)
./manage_amp.sh system_tune

# Or manually:
sudo sysctl -w vm.max_map_count=262144
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
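
To confirm the setting took effect, the current value can be compared with the required one. This is a small sketch; map_count_ok is a hypothetical helper name, and the threshold is the value used in the commands above.

```shell
REQUIRED_MAP_COUNT=262144

# map_count_ok CURRENT
# Succeeds (exit 0) when CURRENT is at least the value OpenSearch needs.
map_count_ok() {
    [ "$1" -ge "$REQUIRED_MAP_COUNT" ]
}

# Example on a node:
#   current=$(sysctl -n vm.max_map_count)
#   if map_count_ok "$current"; then
#       echo "vm.max_map_count=$current is sufficient"
#   else
#       echo "vm.max_map_count=$current is too low (need $REQUIRED_MAP_COUNT)"
#   fi
```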


2. Docker Swarm Setup

  1. Initialize Swarm on Manager Node (Node 1):

    docker swarm init --advertise-addr <NODE_1_IP>

    Copy the "docker swarm join" command output.

  2. Join Worker Nodes (Node 2, Node 3):
    Run the command copied from step 1 on the other nodes:

    docker swarm join --token <TOKEN> <NODE_1_IP>:2377

  3. Rename Nodes (Optional but Recommended):
    Assign readable hostnames if not already set (e.g., amp-node-1, amp-node-2). The script uses Docker Hostnames.

    To rename a node (run on the respective node):

    hostnamectl set-hostname <new_hostname>

  4. Promote Workers to Managers:
    Promote the newly joined worker nodes to managers, as all nodes function as managers in this setup.

    On Node 1 (Manager):

    docker node ls
    docker node promote <NODE_ID>
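
Because every node is a manager, the number of node failures the Swarm can survive follows from Raft quorum rules: with N managers, (N - 1) / 2 (rounded down) can fail. The sketch below illustrates that arithmetic; fault_tolerance and count_managers are hypothetical helper names, and count_managers counts every node that reports a manager status, whether or not it is currently reachable.

```shell
# fault_tolerance MANAGERS
# Prints how many manager nodes can fail while the Swarm keeps a Raft
# quorum: (N - 1) / 2, rounded down.
fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}

# count_managers reads `docker node ls --format '{{.ManagerStatus}}'` output
# on stdin and counts the non-empty lines (workers print an empty status).
count_managers() {
    grep -c .
}

# Example on any manager node:
#   n=$(docker node ls --format '{{.ManagerStatus}}' | count_managers)
#   echo "$n managers, tolerates $(fault_tolerance "$n") failure(s)"
```

For the 3-node cluster described here, this gives a tolerance of one failed node.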


3. Offline Deployment (Air-Gapped Environments)

If you've created an offline bundle using ./manage_amp.sh bundle on a build machine with internet access, follow these steps:

Prerequisites

  1. Complete Section 1: Ensure all prerequisites (hardware, network, firewall rules) are met on all nodes.
  2. Complete Section 2: Set up Docker Swarm cluster (init, join workers, promote to managers).

On the Offline Target Machine (Manager Node)

  1. Transfer Files: Copy these files to the offline machine:

    • amp_offline_bundle.tar.gz
    • tar-bootstrap.rpm (for minimal Rocky Linux installations)

  2. Install tar (if not present):

    rpm -ivh tar-bootstrap.rpm

  3. Extract Bundle:

    tar -xf amp_offline_bundle.tar.gz
    cd amp_offline_bundle

  4. Load Offline Bundle:

    ./manage_amp.sh load_offline

    This installs all dependencies (Docker, rsync, keepalived, Java, Python) and loads Docker images into the local registry.

  5. Continue with Standard Deployment: After load_offline completes successfully, jump to Section 4 below (starting from Step 0: Configure VIP).
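
Before starting the steps above, a quick check that both files arrived on the offline machine can save a failed run. This is only a sketch: check_transfer is a hypothetical helper name, and it treats tar-bootstrap.rpm as required even though it is only needed on minimal installations without tar.

```shell
# check_transfer DIR
# Prints the name of each expected file that is missing from DIR and
# returns non-zero if any file is missing.
check_transfer() {
    dir=$1
    status=0
    for f in amp_offline_bundle.tar.gz tar-bootstrap.rpm; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return $status
}

# Example (the directory is a placeholder for wherever the files were copied):
#   check_transfer /root
```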

4. Standard Online Deployment

All deployment actions are handled by the manage_amp.sh script on Node 1 (Manager).

Step 0: Configure Virtual IP (VIP) for HA

To ensure High Availability, configure a floating VIP that fails over automatically between nodes.

On Node 1 (Master):

./manage_amp.sh vip --vip <VIP_ADDRESS> --priority 101

On Node 2 (Backup):

./manage_amp.sh vip --vip <VIP_ADDRESS> --priority 100

This command configures Keepalived and updates your configuration to use this VIP.
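
In VRRP the node with the higher priority becomes master, so with the values above Node 1 should hold the VIP while it is healthy. A minimal way to see which node currently holds it is to look for the address in the output of ip addr. In the sketch below, holds_vip is a hypothetical helper name and 192.0.2.100 is a placeholder address.

```shell
# holds_vip VIP
# Reads `ip -o -4 addr show` output on stdin and succeeds if VIP is one of
# the IPv4 addresses configured on this node.
holds_vip() {
    # Field 4 of each line is "address/prefix", e.g. "192.0.2.100/32".
    awk -v vip="$1" '{ split($4, a, "/"); if (a[1] == vip) found = 1 }
                     END { exit !found }'
}

# Example on each node:
#   if ip -o -4 addr show | holds_vip 192.0.2.100; then
#       echo "$(hostname) currently holds the VIP"
#   fi
```

Running the same check on Node 2 after stopping Keepalived on Node 1 is a simple way to observe the failover.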

Step 1: Prepare Environment

Navigate to the container directory:

cd container/

Check the .env file (optional). The defaults are usually sufficient; you only need to edit it to set non-default passwords.

vi .env

Step 2: Build & Push Images (First Time or Updates)

This step pulls the required images from the internet (or loads them) and pushes them to the local registry so all Swarm nodes can access them.

./manage_amp.sh build

This may take a while depending on your internet connection.
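
Once the build finishes, the registry's catalog endpoint can show which repositories were pushed. The sketch below assumes the local registry answers plain HTTP on localhost:5000 without authentication, which this guide does not state; list_repos is a hypothetical helper that extracts repository names from the /v2/_catalog JSON response without requiring jq.

```shell
# list_repos reads a Docker Registry `/v2/_catalog` JSON response on stdin,
# e.g. {"repositories":["a","b"]}, and prints one repository name per line.
list_repos() {
    sed -e 's/.*\[//' -e 's/].*//' | tr -d '" ' | tr ',' '\n'
}

# Example on Node 1 (URL scheme and port are assumptions):
#   curl -s http://localhost:5000/v2/_catalog | list_repos
```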

Step 3: Auto-Configure & Deploy

Run the deploy command with the --auto flag. This will:

  1. Detect all Swarm nodes.
  2. Auto-populate IPs in .env.
  3. Generate the stack.yml dynamically (adding Etcd/DB services for each node).
  4. Deploy the stack.

./manage_amp.sh deploy --auto
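
The first two steps (detecting nodes and collecting their IPs) can be approximated with standard Docker commands. This is an illustrative sketch, not the logic manage_amp.sh actually uses: list_nodes and env_lines are hypothetical helpers, and the NODE_<n>_HOST / NODE_<n>_IP variable names are placeholders rather than the real .env keys. On some Docker versions a manager may report 0.0.0.0 as its own status address, so treat the output as a starting point.

```shell
# env_lines reads "hostname ip" pairs on stdin and prints one pair of
# assignments per node. The variable names are hypothetical placeholders.
env_lines() {
    awk '{ printf "NODE_%d_HOST=%s\nNODE_%d_IP=%s\n", NR, $1, NR, $2 }'
}

# list_nodes prints "hostname ip" for every Swarm node (run on a manager).
list_nodes() {
    for host in $(docker node ls --format '{{.Hostname}}'); do
        echo "$host $(docker node inspect --format '{{.Status.Addr}}' "$host")"
    done
}

# Example on a manager node:
#   list_nodes | env_lines
```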

Step 4: Setup Certificates (Automated)

The deployment script (deploy --auto) checks for certificates and generates them automatically if they are missing.

No manual action required.

Step 5: Initialize Security (First Time Only)

Initialize the OpenSearch security index.

./manage_amp.sh security_init

Step 6: Initialize Grafana DB (First Time Only)

Create the Grafana database user and schema in the HA Postgres cluster.

./manage_amp.sh create_grafana_db

Step 7: Configure OpenSearch Dashboards (First Time Only)

Import Dashboards, Index Patterns, and Index Templates (ISM Policies).

./manage_amp.sh configurator


5. Verification

Check Services

Ensure all services are up and running (expected: 3/3 replicas for global services, 1/1 for others).

docker service ls

Verify HA / Failover

  1. Web Access: Open https://<Any_Node_IP>/ or https://<VIP>/. You should see the AMP login.
  2. Database: Connect to Port 5432 on any node. It routes to the current Primary.

    psql -h 127.0.0.1 -p 5432 -U amp_ts_user amp_ts

  3. Failover Test: Reboot a node inside the cluster.

    • Result: The cluster should remain operational.
    • Services will reschedule to remaining nodes.
    • Database leadership will fail over automatically via Patroni/Etcd.

6. Troubleshooting

  • Logs: docker service logs -f amp_<service_name>
  • Manual Config Update: If you add a new node to the swarm, re-run:

    ./manage_amp.sh deploy --auto
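
When something looks wrong, a useful first step is to find the services that are below their desired replica count and then read their logs. The sketch below filters the output of docker service ls; degraded_services is a hypothetical helper name.

```shell
# degraded_services reads "name running/desired" lines on stdin (the format
# produced by the docker command in the example) and prints every service
# whose running count differs from its desired count.
degraded_services() {
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 " " $2 }'
}

# Example on a manager node:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | degraded_services
# Then inspect a service it reports, for example:
#   docker service logs -f amp_<service_name>
```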

Common Issues

| Symptom | Cause | Solution |
|---|---|---|
| 502 Bad Gateway on some requests | Dashboards not running on all nodes | Ensure mode: global in stack.yml |
| ZONE_CONFLICT Docker crash | Manual firewalld zone assignment | Remove Docker interfaces from manual zones |
| invalid mount config | Missing log directory on node | Create /var/log/amp/opensearch on all nodes |
| ECONNREFUSED to OpenSearch | Firewall blocking port 9200 | Open port 9200 on all nodes |
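
For connection errors such as ECONNREFUSED, it helps to test from one node whether a TCP port on another node is reachable at all. The sketch below uses bash's /dev/tcp redirection, so it assumes bash and the timeout utility are installed; port_open is a hypothetical helper name and the IP addresses are placeholders. It only covers TCP, so it cannot verify the overlay network port (UDP 4789).

```shell
# port_open HOST PORT
# Succeeds if a TCP connection to HOST:PORT can be opened within 2 seconds.
port_open() {
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example: from one node, probe OpenSearch on the other nodes.
#   for node in 192.0.2.12 192.0.2.13; do
#       if port_open "$node" 9200; then
#           echo "$node:9200 reachable"
#       else
#           echo "$node:9200 blocked or not listening"
#       fi
#   done
```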

Appendix A: Docker Images

The following Docker images are bundled/used by AMP:

| Service | Image | Description |
|---|---|---|
| opensearch | opensearchproject/opensearch | Search and analytics engine |
| opensearch-dashboards | opensearchproject/opensearch-dashboards | Visualization UI |
| timescaledb | Custom build (Patroni) | Time-series database with HA |
| pgbouncer | edoburu/pgbouncer | Connection pooling |
| grafana | grafana/grafana | Monitoring dashboards |
| telegraf | telegraf | Metrics collection agent |
| logstash | opensearchproject/logstash-oss-with-opensearch-output-plugin | Log ingestion |
| nginx | nginx | Reverse proxy |
| etcd | quay.io/coreos/etcd | Distributed key-value store |
| haproxy | haproxy | Database load balancer |
| registry | registry | Local Docker registry |
| busybox | busybox | Utility container |
| rocky | rockylinux | Base OS image |

Appendix B: RPM Packages (Offline Bundle)

The offline bundle includes these packages and their dependencies:

| Package | Purpose |
|---|---|
| docker-ce, docker-ce-cli, containerd.io | Docker runtime |
| docker-buildx-plugin, docker-compose-plugin | Docker plugins |
| keepalived | VIP failover (VRRP) |
| rsync | File synchronization |
| python3 | Scripting |
| java-17-openjdk | OpenSearch security tools |
| tar | Archive extraction |
| openssl, httpd-tools | Certificate generation |
| curl, jq | API calls and JSON parsing |
| bind-utils | DNS tools (dig, nslookup) |
| iputils | Network tools (ping) |
| net-tools | Network debugging (netstat, ifconfig) |

Appendix C: Service Deployment Modes

| Service | Mode | Port | Rationale |
|---|---|---|---|
| nginx | global | 80, 443 | Web access on every node |
| opensearch-dashboards | global | 5601 | Local proxy access |
| grafana | global | 3000 | Local proxy access |
| logstash | global | 514 | Syslog on all nodes |
| telegraf | global | - | Docker monitoring per node |
| haproxy | global | 5432 | Database LB per node |
| pgbouncer | global | - | Connection pooling |
| opensearch | replicated (3) | 9200, 9300 | Stateful cluster |
| timescaledb | replicated (3) | 5433 | Patroni HA cluster |
| etcd | replicated (3) | 2379-2380 | Raft consensus |

Appendix D: Default Ports

| Port | Protocol | Service | Notes |
|---|---|---|---|
| 80 | TCP | Nginx (HTTP) | Redirects to HTTPS |
| 443 | TCP | Nginx (HTTPS) | Web UI entry point |
| 514 | TCP/UDP | Logstash | Syslog ingestion |
| 2377 | TCP | Docker Swarm | Cluster management |
| 2379-2380 | TCP | Etcd | Cluster coordination |
| 3000 | TCP | Grafana | Monitoring UI |
| 4789 | UDP | Docker Overlay | Container networking |
| 5000 | TCP | Registry | Local image storage |
| 5432 | TCP | HAProxy | Database (via LB) |
| 5433 | TCP | TimescaleDB | Direct Patroni access |
| 5601 | TCP | Dashboards | OpenSearch UI |
| 7946 | TCP/UDP | Docker Swarm | Node communication |
| 8008 | TCP | Patroni | Health API |
| 9200 | TCP | OpenSearch | REST API |
| 9300 | TCP | OpenSearch | Cluster transport |

The changes have been tested locally.
