# Epic M — deployment architecture overview
Epic M Phase A (MA1–MA6) delivers the production container and Kubernetes infrastructure for the Arbitex Platform. This document describes the deployment architecture: how the services are containerized, how the Helm chart deploys them to AKS, how the CI/CD pipeline builds and pushes images, what security controls are in place, and where to find the operational runbooks.
This is an architecture overview. Step-by-step operational procedures are in the ops runbooks linked at the end of this document.
## Phase A summary

| Sub-phase | Scope |
|---|---|
| MA1 | Multi-stage Dockerfiles for all four services (backend, frontend, NER GPU, DeBERTa validator) |
| MA2 | Helm chart (deploy/helm/arbitex-platform/) for AKS deployment — Deployments, Services, Ingress, PVC, HPA, PodDisruptionBudget |
| MA3 | GitHub Actions CI/CD pipeline: lint → test → Trivy vulnerability scan → build → ACR push |
| MA4 | Security hardening: non-root containers, read-only root filesystems, Azure Key Vault secrets, HSTS, mTLS CA verification |
| MA5 | NGINX Ingress configuration: request body limits, SSE long-poll support, TLS termination |
| MA6 | Ops runbooks: Alembic rollback, Docker→AKS migration, incident response playbook |
## Container architecture

### Services

Four containers run as separate Kubernetes Deployments. All are built with multi-stage Dockerfiles pinned to digest-verified base images in production.
| Service | Image name | Internal port | Base image |
|---|---|---|---|
| FastAPI backend | platform-api | 8000 | python:3.12.8-slim |
| React/Nginx frontend | platform-frontend | 8080 | node:20.18-alpine → nginx:1.27.3-alpine |
| NER GPU microservice (GLiNER) | ner-gpu | 8200 | GPU-enabled PyTorch image |
| DeBERTa validator microservice | deberta-validator | 8201 | GPU-enabled PyTorch image |
### Backend (FastAPI)

Two build stages:
- **Builder** — installs Python dependencies into `/install` from `requirements.txt` (production deps only; dev tooling is excluded).
- **Production** — copies the installed packages from the builder stage, creates a non-root user (`appuser`, UID 1000), exposes port 8000, and runs `docker-entrypoint.sh`, which executes Alembic migrations and then starts `uvicorn`.
Security properties:
- `PYTHONDONTWRITEBYTECODE=1` and `PYTHONUNBUFFERED=1` for clean container logging.
- `PYTHONPATH=/app` to resolve `backend.app.*` import paths.
- Healthcheck uses the Python stdlib (`urllib.request`) — no `curl` installed.
- Entrypoint uses the `exec` form so Linux signals propagate correctly to `uvicorn`.
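The healthcheck can stay dependency-free because `urllib.request` ships with CPython. The Dockerfile's exact HEALTHCHECK command isn't reproduced here; a stdlib-only probe in this spirit (the `/healthz` path and the timeout are illustrative assumptions) might be:

```python
import urllib.request


def check_health(url: str = "http://localhost:8000/healthz", timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused, DNS failure, timeout, HTTPError (non-2xx), ...
        return False
```

A Dockerfile HEALTHCHECK would run this via `python -c` and translate the boolean into a zero/nonzero exit code for the container runtime.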
### Frontend (React/Nginx)

Two build stages:
- **Builder** — the Node 20 Alpine image runs `npm ci && npm run build` to produce the Vite/React dist bundle.
- **Production** — the Nginx 1.27.3 Alpine image serves the static bundle. The default nginx config is replaced with a custom `nginx.conf` (request size limits, SSE, proxy headers). Runs as the `nginx` user (UID 101).
Healthcheck: `wget -qO /dev/null http://localhost:8080/healthz`.
### NER GPU microservice

GLiNER zero-shot NER model (`urchade/gliner_medium-v2.1`) served over HTTP on port 8200. Runs on the GPU node pool (`accelerator=nvidia`). Circuit breaker: 3 consecutive failures / 60-second reset window.
### DeBERTa validator microservice

DeBERTa NLI validator served over HTTP on port 8201. Runs on the GPU node pool. Provides a `/validate` endpoint for fine-grained entity type classification. Circuit breaker: 3 consecutive failures / 60-second reset window.
Note: the Outpost also includes an in-process DeBERTa scanner (see DeBERTa Tier 3 admin guide) that runs independently of this microservice.
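Both GPU microservice clients apply the same breaker policy: open after 3 consecutive failures, allow a retry after a 60-second reset window. The Platform's actual client code isn't reproduced here; a minimal sketch of that policy, with names of my own choosing, might look like:

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a retry after `reset_s`."""

    def __init__(self, threshold: int = 3, reset_s: float = 60.0):
        self.threshold = threshold
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        """Should the next request be attempted?"""
        if self.opened_at is None:
            return True  # closed: normal operation
        if time.monotonic() - self.opened_at >= self.reset_s:
            return True  # half-open: permit a trial request
        return False     # open: fail fast, skip the GPU call

    def record(self, ok: bool) -> None:
        """Report the outcome of an attempted request."""
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

A caller checks `allow()` before each HTTP request and falls back (or returns a degraded result) while the breaker is open.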
## Helm chart structure

Chart location: `deploy/helm/arbitex-platform/`

```
deploy/helm/arbitex-platform/
├── Chart.yaml                     # name: arbitex-platform, version: 0.1.0, appVersion: 0026
├── values.yaml                    # default values
├── values-prod.yaml               # production resource overrides
└── templates/
    ├── deployment-api.yaml        # FastAPI backend Deployment
    ├── deployment-frontend.yaml   # React/Nginx frontend Deployment
    ├── deployment-ner-gpu.yaml    # GLiNER NER Deployment
    ├── deployment-deberta.yaml    # DeBERTa validator Deployment
    ├── service-api.yaml           # ClusterIP :8100
    ├── service-frontend.yaml      # ClusterIP :3100
    ├── service-ner-gpu.yaml       # ClusterIP :8200
    ├── service-deberta.yaml       # ClusterIP :8201
    ├── ingress.yaml               # NGINX Ingress with TLS
    ├── hpa.yaml                   # HorizontalPodAutoscaler
    ├── pdb.yaml                   # PodDisruptionBudget
    ├── job-alembic-migrate.yaml   # pre-upgrade Helm hook for migrations
    └── secret-provider-class.yaml # Azure Key Vault CSI SecretProviderClass
```

### Values hierarchy
```
values.yaml       ← base defaults
values-prod.yaml  ← production resource overrides
--set ...         ← CI/CD runtime overrides (image tags, digests)
```

### Key service endpoints

| Service | Cluster DNS | Port |
|---|---|---|
| Backend API | arbitex-platform-api | 8100 → container 8000 |
| Frontend | arbitex-platform-frontend | 3100 → container 8080 |
| NER GPU | arbitex-platform-ner-gpu | 8200 |
| DeBERTa | arbitex-platform-deberta | 8201 |
### Alembic init container

When `alembic.runAsInitContainer: true` (the default), every API pod rollout includes an `alembic-migrate` init container that runs `python -m alembic upgrade head` before the main container starts. If the migration fails, the pod never reaches the Ready state and Kubernetes blocks the rollout.
A pre-upgrade Helm hook Job also runs before each `helm upgrade` (`backoffLimit: 3`, `ttlSecondsAfterFinished: 300`).
### GPU node placement

NER GPU and DeBERTa pods require GPU nodes. The chart sets `nodeSelector: { accelerator: nvidia }` and tolerations for `nvidia.com/gpu: NoSchedule` on those Deployments. The NVIDIA device plugin DaemonSet must be installed in the cluster (`kube-system` namespace) for GPU resource scheduling.
## CI/CD pipeline

The pipeline runs on GitHub Actions. The trigger is a push to `main` or a release tag. Stages run in order; the pipeline halts on any failure.

```
Lint (ruff, eslint)
  ↓
Unit tests (pytest, vitest)
  ↓
Trivy vulnerability scan (CRITICAL/HIGH findings fail the build)
  ↓
Docker build (multi-stage, base image digest pinned)
  ↓
ACR push (four images: platform-api, platform-frontend, ner-gpu, deberta-validator)
  ↓
Helm upgrade (--atomic --wait, 10-minute timeout)
```

### Image tagging
Each build produces two tags per image:

- `<acr>.azurecr.io/platform-api:<semver>` — immutable release tag (e.g. `v0.29.0`)
- `<acr>.azurecr.io/platform-api:latest` — mutable floating tag
Production Helm deployments pin by digest (`--set api.image.digest=sha256:<digest>`) to prevent tag-overwrite incidents.
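As an illustration of what "pinned by digest" means mechanically, a hypothetical CI guard (not part of the actual pipeline) could refuse any image reference that lacks a `@sha256:` digest:

```python
import re

# <repo>[:<tag>]@sha256:<64 hex chars> — a tag is optional once a digest is present
_DIGEST_RE = re.compile(r"^[\w.\-/]+(:[\w.\-]+)?@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True only if the image reference is pinned by a sha256 digest."""
    return bool(_DIGEST_RE.match(image_ref))
```

A deploy job could assert this over every resolved image reference before invoking `helm upgrade`.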
### Trivy scan

Trivy scans all four images for OS package and Python dependency CVEs. CRITICAL and HIGH severity findings fail the build unless they are excluded by `--ignore-unfixed` (i.e. findings with no available fix). The scan report is uploaded as a GitHub Actions artifact.
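Conceptually, the gate is a filter over Trivy's JSON report: fixable CRITICAL/HIGH findings block the build, unfixed ones are exempt. A simplified sketch of that filter (field names follow Trivy's JSON output; the real pipeline relies on Trivy's own `--severity` and `--exit-code` flags rather than custom code):

```python
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def blocking_findings(report: dict) -> list[str]:
    """Return IDs of fixable CRITICAL/HIGH vulnerabilities in a Trivy JSON report.

    Findings without a FixedVersion are skipped, mirroring --ignore-unfixed.
    """
    found = []
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING_SEVERITIES and vuln.get("FixedVersion"):
                found.append(vuln["VulnerabilityID"])
    return found
```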
## NGINX Ingress configuration

The Ingress controller is NGINX (the `ingress-nginx/ingress-nginx` Helm chart). The Arbitex Ingress resource routes `api.arbitex.ai` to the backend and frontend services.
### Request body limits

The backend API accepts large file uploads and multi-turn conversation requests. The NGINX configuration sets:

```nginx
client_max_body_size 50m;
proxy_request_buffering off;
```

This is set via the Ingress annotation `nginx.ingress.kubernetes.io/proxy-body-size: 50m` in `templates/ingress.yaml`.
### SSE support

The chat completion endpoint streams responses as Server-Sent Events. NGINX is configured to disable proxy buffering for SSE routes:

```
proxy_buffering off;
proxy_cache off;
X-Accel-Buffering: no
```

This is applied via annotation on the Ingress resource.
### TLS termination

TLS terminates at the Ingress controller. Certificates are issued by cert-manager using a Let’s Encrypt ClusterIssuer (`letsencrypt-prod`). The Ingress references a TLS secret (`arbitex-tls`) that cert-manager populates.
## Security layers

### mTLS chain

Internal mTLS is used for Outpost-to-Platform communication. The Platform backend validates the Outpost’s client certificate against the Platform CA. Certificates are mounted into the pods from Kubernetes Secrets.
### Secret management — Azure Key Vault

Sensitive environment variables (`DATABASE_URL`, `SECRET_KEY`, `AUDIT_HMAC_KEY`, `REDIS_URL`, `POLICY_SIGNING_KEY`) are stored in Azure Key Vault and surfaced to pods via one of two methods:
| Method | When to use |
|---|---|
| Kubernetes Secret (manual) | Simpler setup; requires manual rotation |
| Key Vault CSI driver (`SecretProviderClass`) | Recommended for production; automatic rotation via AKS managed identity |
The Helm template expects a Kubernetes Secret named `<release-name>-api-secrets` with at minimum `DATABASE_URL` and `SECRET_KEY` keys.
### Container hardening

All containers run with these security context settings:

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000   # appuser (backend) / 101 nginx user (frontend)
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: [ALL]
```

The Ingress sets the Strict-Transport-Security response header via annotation:

```yaml
nginx.ingress.kubernetes.io/configuration-snippet: |
  add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
```

### SCIM token rotation
SCIM provisioning tokens are stored in Key Vault and rotated by the identity team. The `SCIM_BEARER_TOKEN` secret in Key Vault is updated out-of-band; the CSI driver re-syncs the Kubernetes Secret on the next sync interval (default: 2 minutes).
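Because the CSI driver rewrites the mounted file on rotation, consumers should re-read the file per use rather than cache the value at process start. An illustrative pattern — the `/mnt/secrets` mount path and `read_secret` helper are hypothetical, not the Platform's actual code:

```python
from pathlib import Path

SECRETS_DIR = Path("/mnt/secrets")  # hypothetical CSI volume mount path


def read_secret(name: str, secrets_dir: Path = SECRETS_DIR) -> str:
    """Read a secret from its mounted file on every call, so a value
    rotated by the CSI driver is picked up without a pod restart."""
    return (secrets_dir / name).read_text(encoding="utf-8").strip()
```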
### Pod Security Standard

Non-GPU pods run under the `restricted` Pod Security Standard. GPU pods (`ner-gpu`, `deberta-validator`) require the `baseline` PSS due to device plugin requirements. If GPU and non-GPU pods share the `arbitex` namespace, set the namespace enforcement level to `baseline` with a `warn=restricted` label.
## Ops runbooks reference

The following runbooks are maintained in `docs/ops/` of the platform repository:
| Runbook | Location | Use when |
|---|---|---|
| Alembic migration rollback | `docs/ops/alembic-rollback.md` | A migration breaks the API after deploy (startup errors, 500s, constraint violations) |
| Docker → AKS migration | `docs/ops/migration-runbook.md` | Migrating an existing Docker Compose deployment to AKS for the first time |
| Incident response | `docs/ops/incident-response.md` | Kill switch activation, provider failover, DLP bypass, HMAC integrity failure, platform outage |
### Rollback decision matrix

| Symptom | Action |
|---|---|
| Pods stuck in init container failure after deploy | Alembic rollback runbook (run `alembic downgrade -1` as a one-off Job) |
| API CrashLoopBackOff, no migration issues | `helm rollback arbitex-platform 0 --namespace arbitex` |
| Provider returning 5xx errors | Incident response → Incident 2 (provider failover / circuit breaker) |
| DLP blocking legitimate content | Incident response → Incident 3 (DLP bypass procedure) |
| HMAC audit chain failure | Incident response → Incident 4 (audit chain integrity) |
| Complete platform outage | Incident response → Incident 5 (diagnostic steps + restart sequence) |
## Current migration baseline

The Alembic migration chain as of Epic M Phase A close:
| Revision | Description |
|---|---|
| `051_user_timezone` | Current head — user timezone column |
| `050_org_scim_tokens` | SCIM token storage for orgs |
| `049_audit_hmac_key_id` | HMAC key ID for audit chain versioning |
The next migration will be `052_*`. The full revision chain is documented in `docs/ops/alembic-rollback.md`, Section 8.