Migrating from Azure App Service (PaaS) to Azure VM (IaaS)
A self-managed, debuggable, CDN-accelerated platform that stays inside the Azure ecosystem. It replaces the current $129.94/mo Azure App Service Premium v3 (P1V3) with an Azure VM + Dockerized PostgreSQL footprint at roughly $62–67/mo (a ~50% reduction), while fixing the SQLite data-loss and debuggability issues at the root. Hourly off-VM backups protect data, and a clear path to Azure Managed PostgreSQL is documented for when traffic grows.
1. Executive Summary
The current production environment runs on Azure App Service Premium v3 (P1V3 — 2 vCPU / 8GB RAM / 250GB), billed at $129.94/mo, with a SQLite database stored on the local instance filesystem. This combination is the root cause of recurring incidents: data loss on instance recycle, write contention under concurrent load, and an opaque runtime that cannot be debugged or log-traced effectively. The Premium v3 tier is also overprovisioned for the actual workload (~1–2GB RAM resident in practice), meaning the client is paying for unused capacity.
We recommend migrating to Azure Virtual Machines (IaaS) running Docker, with two isolated environments (staging + production). PostgreSQL runs as a Docker container on the same production VM, with hourly automated backups uploaded off-VM to Azure Blob Storage (7-day retention). Cloudflare sits in front for free CDN + DNS + DDoS protection, and a GitHub Actions CI/CD pipeline builds once and promotes the same image from staging to production. The client stays inside the Azure ecosystem (single invoice, same compliance boundary) while gaining full SSH access, real logs, and predictable cost.
Future scaling path: when traffic / data volume grows, the Dockerized Postgres can be migrated to Azure Database for PostgreSQL — Flexible Server with minimal app changes (only the connection string changes). This gives the project a low-cost starting point with a clear upgrade path — pay for managed services only when they're actually needed.
2. Why the Current Setup is Failing
| Problem | Root Cause | Business Impact |
|---|---|---|
| Data loss after deploy / restart | SQLite file lives on App Service ephemeral disk; wiped on instance recycle and not shared across scale-out replicas. | Patient/operational records disappearing — direct compliance and trust risk. |
| "Database is locked" errors | SQLite serializes writes globally; concurrent web requests collide. | 5xx errors during peak hours; no horizontal scaling possible. |
| Cannot debug production | App Service abstracts SSH, process list, real logs, and crash dumps. | Mean-time-to-diagnose measured in days, not minutes. |
| High monthly cost vs delivered value | Paying PaaS premium for features (autoscale, slot swap) that this workload does not use. | Operating budget consumed by infrastructure, not features. |
| No reproducible deploy pipeline | Manual or partly-manual deploys; no separation between staging and production. | Every release is risky; testing happens in production. |
3. Options Considered
Before recommending a path, we evaluated three directions. The two migration options both fix the SQLite data-loss issue; they differ in cost, operational burden, and how much of the existing Azure investment is preserved.
Option A — Stay as-is
Azure App Service + SQLite, no changes
- ✗ Data-loss incidents continue
- ✗ Concurrency errors continue
- ✗ Cannot debug production
- ✗ $120–200+/mo, no savings
Option B — Azure VM (IaaS) + Self-hosted Postgres
Stay in Azure, lean cost, clear scaling path
- ✓ Azure VM Linux (B1ms staging, B2s prod)
- ✓ Postgres in Docker on prod VM
- ✓ Hourly pg_dump → Azure Blob (7-day retention)
- ✓ Azure Container Registry (ACR)
- ✓ Cloudflare DNS+CDN+WAF (free)
- ✓ Full SSH / log access — debuggable
- ✓ Single Azure invoice — no new vendor
- ✓ Same compliance & networking boundary
- ✓ Clear upgrade path to Managed PG when traffic grows
- ✗ Single VM = app + DB share fate (mitigated by hourly off-VM backups)
- ✗ Postgres patching is in-house
Option C — AWS Lightsail + Cloudflare
Cheapest absolute cost, leaves Azure
- ✓ Lightsail Managed PostgreSQL bundled
- ✓ Full SSH + log access
- ✓ Cloudflare DNS+CDN+WAF (free)
- ✓ GitHub Actions → GHCR → SSH deploy
- ✓ Lowest recurring cost (~$56/mo)
- ✓ Standard Docker — portable anywhere
- ✗ Adds AWS as a second vendor
- ✗ Two cloud bills, two compliance reviews
- ✗ Existing Azure resources / RBAC not reusable
The remainder of this document details Option B. Option C remains a viable alternative if the client decides Azure is no longer required as a strategic standard — we can issue a separate detailed Lightsail proposal on request.
4. Proposed Target Architecture (Option B)
Two isolated Azure VM environments (staging + production), each running the application and PostgreSQL in Docker behind a Caddy reverse proxy with automatic TLS. Production data is protected by an hourly backup pipeline that uploads compressed Postgres dumps off the VM to Azure Blob Storage with 7-day retention. Cloudflare sits in front of both environments providing free DNS, CDN, SSL and DDoS protection.
```mermaid
flowchart TB
    Users[Users] --> DNS["Cloudflare DNS<br/>effect.healthcare<br/>staging.effect.healthcare<br/>FREE tier"]
    DNS --> CDN["Cloudflare CDN + WAF<br/>Global edge cache<br/>DDoS protection<br/>FREE tier"]
    subgraph PROD["Production — Azure VM B2s 2vCPU/4GB"]
        ProdProxy["Caddy<br/>Auto-TLS"]
        ProdApp["Next.js App<br/>Docker container"]
        ProdDB[("PostgreSQL<br/>Docker container<br/>volume on managed disk")]
        ProdCron["Hourly cron<br/>pg_dump + gzip"]
        ProdProxy --> ProdApp
        ProdApp --> ProdDB
        ProdDB -.dump.-> ProdCron
    end
    subgraph STG["Staging — Azure VM B1ms 1vCPU/2GB"]
        StgProxy["Caddy<br/>Auto-TLS"]
        StgApp["Next.js App<br/>Docker container"]
        StgDB[("PostgreSQL<br/>Docker container")]
        StgProxy --> StgApp
        StgApp --> StgDB
    end
    Blob[("Azure Blob Storage<br/>Cool tier<br/>168 hourly snapshots<br/>7-day lifecycle policy")]
    CDN -.staging.-> StgProxy
    CDN --> ProdProxy
    ProdCron --> Blob
    style PROD fill:#ecfdf5,stroke:#10b981
    style STG fill:#eff6ff,stroke:#3b82f6
    style CDN fill:#fff7ed,stroke:#f97316
    style DNS fill:#fff7ed,stroke:#f97316
    style Blob fill:#fef3c7,stroke:#ca8a04
```
Component Decisions
- ✓ Compute — Azure Virtual Machines (Linux, Ubuntu 22.04 LTS). Production sized at B2s (2 vCPU / 4GB / ~$30/mo); staging at B1ms (1 vCPU / 2GB / ~$15/mo). Both run Docker + Compose with full SSH access.
- ✓ Database — PostgreSQL 16 in Docker on the production VM. Data lives in a Docker named volume mounted on a 64GB Premium SSD managed disk. The same image runs on the staging VM. Cost: $0 for the database itself (only the disk + VM resources it consumes).
- ✓ Backups — hourly pg_dump → Azure Blob Storage (Cool tier). A cron job on the production VM runs `pg_dump -Fc | gzip` every hour, saves the dump locally, then uses `azcopy` to upload it to a private Blob container (a sketch of this job follows this list). A lifecycle policy auto-deletes blobs older than 7 days. The local copy keeps 24 hours of snapshots; the off-VM copy keeps 168 (a full week). Storage cost: ~$0.50/mo for ~5GB.
- ✓ Restore drill — monthly. On the first Friday of each month, we restore the latest blob snapshot to a throwaway VM and run smoke tests. A backup that has never been restored is not a backup.
- ✓ DNS + CDN — Cloudflare (FREE tier). Migrate the `effect.healthcare` nameservers from the current registrar (Namecheap — `dns1/2.registrar-servers.com`) to Cloudflare. This unlocks: free global CDN (300+ PoPs), free Universal SSL, free DDoS mitigation, free analytics, and a clear upgrade path to WAF / rate limiting / bot management. Static assets are cached aggressively; API endpoints are cached selectively with `s-maxage` headers.
- ✓ TLS — Cloudflare edge + Caddy origin (Full strict). Cloudflare terminates TLS at the edge with free Universal SSL; Caddy on the VM handles a second TLS leg with Let's Encrypt. End-to-end encryption, zero recurring cost.
- ✓ Origin protection — Azure NSG locked to Cloudflare IPs. The VM's Network Security Group only accepts HTTP/HTTPS from Cloudflare's published IP ranges, so attackers cannot bypass the CDN and hit the origin directly (see the NSG sketch after this list).
- ✓ Image registry — Azure Container Registry (Basic tier, $5/mo). Native Azure integration, RBAC tied to existing Azure identities. GHCR remains a viable alternative if preferred.
- ✓ Monitoring — Azure VM metrics + UptimeRobot. Azure provides CPU, memory, disk, and network metrics out of the box. UptimeRobot (free) monitors public endpoints with email/SMS alerts. Optional: Application Insights free tier (5GB ingest/month).
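The hourly backup job fits in one script. A minimal sketch, assuming a Postgres container named `postgres`, a database/role named `appdb`/`app`, a private Blob container named `pg-backups`, and a SAS token exported as `BACKUP_SAS` (all names hypothetical):

```bash
#!/usr/bin/env bash
# /opt/backups/pg-backup.sh: run hourly via cron (0 * * * * /opt/backups/pg-backup.sh)
set -euo pipefail

STAMP=$(date -u +%Y%m%dT%H%M%SZ)
LOCAL_DIR=/opt/backups/local
DUMP="$LOCAL_DIR/appdb-$STAMP.dump.gz"
mkdir -p "$LOCAL_DIR"

# Custom-format dump taken inside the container, compressed on the way out.
docker exec postgres pg_dump -Fc -U app appdb | gzip > "$DUMP"

# Copy off-VM; the Blob lifecycle policy deletes anything older than 7 days.
azcopy copy "$DUMP" \
  "https://<account>.blob.core.windows.net/pg-backups/${DUMP##*/}?${BACKUP_SAS}"

# Keep only the last 24 hours of snapshots on the VM itself.
find "$LOCAL_DIR" -name '*.dump.gz' -mmin +1440 -delete
```

The monthly restore drill reverses the pipe on a throwaway VM: download the latest blob, then `gunzip -c <dump> | pg_restore -U app -d appdb`.

Origin lock-down is likewise scriptable. A sketch using the Azure CLI, with the resource group and NSG names (`effect-rg`, `effect-prod-nsg`) assumed:

```bash
# Allow web traffic only from Cloudflare's published IPv4 ranges.
CF_IPS=$(curl -fsS https://www.cloudflare.com/ips-v4 | tr '\n' ' ')
az network nsg rule create \
  --resource-group effect-rg --nsg-name effect-prod-nsg \
  --name AllowCloudflareWeb --priority 100 \
  --direction Inbound --access Allow --protocol Tcp \
  --destination-port-ranges 80 443 \
  --source-address-prefixes $CF_IPS   # intentionally unquoted so the ranges word-split
```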
Future scaling path (when traffic justifies it)
The Dockerized Postgres design is intentionally chosen as a lean starting point. As the project grows, any of the following can be added without re-architecting:
- Migrate Postgres to Azure Database for PostgreSQL — Flexible Server (~$15–25/mo additional). Only the connection string changes in the app (see the sketch after this list). Adds: automated patching, point-in-time restore, optional zone-redundant HA.
- Add a second VM + Azure Load Balancer for HA on the app tier (~$20/mo additional).
- Upgrade the VM size in-place (B2s → B2ms → B4ms) when CPU/RAM monitoring shows pressure; a resize costs ~5 min of downtime (also sketched below).
- Enable Cloudflare WAF + rate limiting if attack traffic increases (Cloudflare Pro plan, $20/mo).
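Both upgrade moves are mechanical. A sketch with hypothetical resource, host, and database names:

```bash
# Managed PG cutover: only DATABASE_URL changes in the app's environment.
#   before: postgres://app:***@localhost:5432/appdb
#   after:  postgres://app:***@effect-pg.postgres.database.azure.com:5432/appdb

# In-place vertical resize; deallocating first guarantees the new size can be allocated.
az vm deallocate --resource-group effect-rg --name effect-prod-vm
az vm resize     --resource-group effect-rg --name effect-prod-vm --size Standard_B2ms
az vm start      --resource-group effect-rg --name effect-prod-vm
```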
5. CI/CD Pipeline
The pipeline follows a build-once, promote-many model: an image is built once on
merge to develop, deployed to staging,
and after QA approval the identical image SHA is re-tagged and deployed to production.
This eliminates the entire class of "works in staging, fails in production" bugs caused by rebuild drift.
```mermaid
flowchart LR
    Dev["Merge to develop"] --> CI1["GitHub Actions<br/>Build + Test"]
    CI1 --> REG1[("ACR<br/>image:sha-abc123<br/>tag: staging")]
    REG1 -->|SSH deploy| STG["Staging Azure VM<br/>staging.effect.healthcare"]
    STG -->|QA approves| Promote{"Manual approval<br/>required reviewer"}
    Promote --> Retag["Re-tag<br/>image:sha-abc123<br/>as :production"]
    Retag --> REG2[("ACR")]
    REG2 -->|SSH deploy| PROD["Production Azure VM<br/>effect.healthcare"]
    PROD -->|If incident| Rollback["Re-deploy previous SHA<br/>&lt; 30 seconds"]
    note1["Server NEVER builds<br/>Only docker pull + restart<br/>Zero CPU spike, zero downtime"]
    REG2 -.- note1
    style STG fill:#eff6ff,stroke:#3b82f6
    style PROD fill:#ecfdf5,stroke:#10b981
    style Promote fill:#fef3c7,stroke:#f59e0b
    style Rollback fill:#fee2e2,stroke:#ef4444
```
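The Promote step in the diagram is a registry-side re-tag, not a rebuild. A minimal sketch using `az acr import`, with the registry and repository names assumed:

```bash
# Point the :production tag at the QA-approved SHA without rebuilding or pulling.
az acr import --name effectacr \
  --source effectacr.azurecr.io/app:sha-abc123 \
  --image app:production --force
```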
Why the server never builds the image
All docker build work happens inside GitHub Actions runners (free, 2 vCPU / 7GB RAM). The Azure VM only runs `docker pull` and `docker compose up -d` — two operations that take seconds and consume minimal CPU/RAM. A deploy-script sketch follows the list below. This means:
- No build-time CPU spikes affecting live users
- No need to install Node, npm, build toolchains on the production server
- Smaller, hardened server image (only Docker + Caddy needed)
- Predictable deploy time regardless of dependency size
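The deploy step that the workflow runs over SSH can be a one-screen script. A minimal sketch, assuming the compose file references an `IMAGE` environment variable and that the registry name and paths shown are hypothetical:

```bash
#!/usr/bin/env bash
# /opt/app/deploy.sh: invoked over SSH by GitHub Actions with the image tag to run.
set -euo pipefail

SHA="${1:?usage: deploy.sh <image-tag>}"        # e.g. sha-abc123
export IMAGE="effectacr.azurecr.io/app:${SHA}"

docker pull "$IMAGE"                                   # fetch the prebuilt image
docker compose -f /opt/app/docker-compose.yml up -d    # healthcheck-gated container swap
docker image prune -f                                  # reclaim superseded layers
```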
Pipeline Guarantees
- Identical image SHA promoted — no second build that could differ.
- GitHub Environments isolate credentials; production requires reviewer approval.
- New container warms up and passes its healthcheck before the old one drains; ~3–5s overlap, no user-visible interruption.
- All prior images retained in ACR; revert is one command or one button click.
- Migrations run as a one-shot container; failure aborts the deploy before the app swaps in.
- Every deploy linked to a Git SHA, GitHub Actions run, and approver. Full traceability for compliance.
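"Revert is one command" in practice means re-running the deploy script with the last known-good tag. A sketch with a hypothetical host and tag:

```bash
# Roll production back to the previously deployed image in under 30 seconds.
ssh deploy@effect-prod-vm '/opt/app/deploy.sh sha-def456'
```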
6. Cost Comparison — All Three Options
Option A — Azure App Service Premium v3 (P1V3)

| Item | Cost/mo |
|---|---|
| App Service P1V3 (2 vCPU / 8GB / 250GB) | $129.94 |
| SQLite on local disk | $0* |
| Azure CDN (if used) | $5–15 |
| App Insights / logs | $5–20 |
| Manual deploy effort | ~2h/wk |
| Total | ~$140–165 |

* No direct cost — but causes data-loss incidents.
Option B — Azure VM + Self-hosted Postgres

| Item | Cost/mo |
|---|---|
| Azure VM B1ms staging (1 vCPU / 2GB) | $15 |
| Azure VM B2s prod (2 vCPU / 4GB) | $30 |
| PostgreSQL in Docker (on prod VM) | $0 |
| Managed disks (Premium SSD, ~64GB) | $8 |
| Azure Blob (backups, Cool tier, ~5GB) | $0.50 |
| Bandwidth (~50GB outbound) | $4–8 |
| Azure Container Registry (Basic) | $5 |
| Cloudflare DNS+CDN+SSL+DDoS | $0 |
| GitHub Actions CI/CD | $0 |
| Total | $62–67 |

Saves ~$73–98/mo vs current. Stays in Azure.
Option C — AWS Lightsail + Cloudflare

| Item | Cost/mo |
|---|---|
| Lightsail staging (2GB) | $17 |
| Lightsail prod (4GB) | $24 |
| Lightsail Managed PostgreSQL | $15 |
| Cloudflare (DNS+CDN+WAF+SSL) | $0 |
| GHCR storage | $0–5 |
| GitHub Actions CI/CD | $0 |
| Total | $56–61 |

Cheapest absolute, but adds AWS as a second vendor.
Note on right-sizing
Current P1V3 is 2 vCPU / 8GB RAM. The new production VM is proposed at 4GB RAM — this is a deliberate optimization, not an accidental downgrade. The current workload (Next.js SSR + Postgres) typically uses 1–2GB resident; the extra 4GB on P1V3 is paid for but unused. Azure VMs can be resized in-place (B2s → B2ms 8GB ~$60/mo → B4ms 16GB ~$120/mo) with ~5 min downtime if monitoring shows we need more headroom. CPU / RAM / disk alerts will be configured before cutover.
7. Decision Summary — Pros & Cons (Option B)
An honest, side-by-side comparison so the decision can be made with full context — not just upside.
+ Pros — What You Gain
- ~50% lower monthly cost — $62–67/mo vs $129.94/mo P1V3 (or ~$140–165 with extras). Annualized savings ~$880–1,180.
- Stays in Azure — single invoice, same compliance boundary, same RBAC / identities. No new vendor onboarding.
- Real PostgreSQL ends SQLite data-loss — durable storage, full concurrency, ACID guarantees. The original incident class disappears.
- Hourly off-VM backups — pg_dump uploaded to Azure Blob every hour; 7-day retention; survives total VM failure. Worst-case data loss: 1 hour.
- Full SSH + real logs — engineers can debug production. MTTR drops from days to minutes.
- Server never builds the image — GitHub Actions does the CPU-heavy build; the VM only does `docker pull`. No build-time spikes affecting live users.
- Zero-downtime deploys — healthcheck-gated container swap, ~3–5s overlap, no user-visible interruption.
- Sub-30-second rollback — every prior image kept in ACR; revert is one command.
- Real staging environment — QA on the same code, secrets, and image SHA that ships to production.
- Cloudflare CDN + DDoS for free — global edge caching, free Universal SSL, plus a clear path to WAF / rate limiting when needed.
- Clear future scaling path — Postgres → Azure Managed PG when traffic grows; add 2nd VM + LB for HA when SLA demands it. No re-architecture required.
- Auditable deploy history — every release linked to a Git SHA and approver — useful for healthcare compliance.
− Cons — What You Take On
- App + Postgres share one VM — single point of failure. If the VM fails, both go down. Mitigation: hourly off-VM backups (max 1h data loss), VM snapshot every 24h, documented restore runbook.
- You patch Postgres + OS yourself — App Service patches automatically. Mitigated by Ubuntu unattended-upgrades for OS CVEs; quarterly Postgres minor version reviews.
- No automatic horizontal scaling — fixed-size VM. Vertical resize (B2s → B2ms → B4ms) requires ~5 min downtime and is done manually based on monitoring.
- No HA out of the box — single VM = single failure domain. Recovery from VM failure is ~10–20 min via redeploy from the latest snapshot + backup restore. Optional upgrade: 2nd VM + Azure LB (+$20/mo) when the uptime SLA demands it.
- Hourly backup window = up to 1h data loss in worst case — acceptable for a healthcare landing site, but flag if any hourly write volume is mission-critical (would warrant streaming replication or migration to Managed PG).
- Backup restore must be drilled monthly — a backup that's never been restored isn't a backup. We'll add a monthly restore drill to a throwaway VM as part of the runbook.
- One-time migration risk — SQLite → Postgres cutover requires a maintenance window (~30 min, off-hours). Mitigated by two dry-runs and a documented rollback path (a migration sketch follows this list).
- Initial setup effort — ~8–10 working days to build the pipeline, backup automation, and migrate. One-time cost; ongoing maintenance is low.
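The data move itself is well-trodden. A minimal sketch of the SQLite → Postgres migration using pgloader, with the file path and credentials hypothetical:

```bash
# One-shot schema + data migration from the SQLite file into Postgres.
# Run against staging first as a dry-run; repeat the identical command at cutover.
pgloader sqlite:///opt/app/data/app.db \
         postgresql://app:***@localhost:5432/appdb
```

After the load, verify row counts and run the smoke-test suite against the migrated database before traffic is switched over.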
For a healthcare landing application with predictable traffic, Option B is the strongest fit: it directly fixes the SQLite data-loss and debuggability problems, cuts infrastructure spend by ~50%, and stays inside the Azure ecosystem — preserving the client's existing compliance boundary, billing relationship, and identity / RBAC investments. The cons are real but well-bounded: the single-VM topology is the explicit tradeoff for the cost saving, and is fully mitigated by hourly off-VM backups plus a documented upgrade path to Managed PostgreSQL and HA when traffic justifies them.