
LLM Cloud Deployment

Self-hosted Google Gemma 4 (E2B-it) streaming inference API on AWS, served with vLLM on GPU-backed ECS with scale-to-zero. Six Terraform modules: multi-AZ VPC, ALB + WAF, a FastAPI auth proxy, CloudFront + S3 chat UI, and CloudWatch dashboards with alerting.

Terraform · AWS (ECS, ALB, CloudFront, WAF, ECR, VPC) · vLLM · Gemma 4 · FastAPI · Docker
[Figure: LLM Cloud Deployment architecture diagram]

A scalable LLM inference service on AWS, built with ECS, Terraform, and Hugging Face Text Generation Inference (TGI). It deploys Microsoft Phi-3 Mini 3.8B (AWQ 4-bit) as a streaming inference API with a real-time chat frontend. Responses are delivered token by token via Server-Sent Events (SSE).
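To make the token-by-token delivery concrete, here is a minimal Python sketch of how a client might reassemble text from a Server-Sent Events stream. The event payload shape (`{"token": {"text": ...}}`) is an assumption modeled on TGI-style streaming responses, and `iter_sse_tokens` is a hypothetical helper, not code from this project.

```python
import json
from typing import Iterable, Iterator


def iter_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield token text from the lines of a Server-Sent Events stream.

    Assumes each event is a single `data:` line carrying JSON shaped like
    {"token": {"text": "..."}}. That shape is an assumption for illustration.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        text = event.get("token", {}).get("text")
        if text is not None:
            yield text


# Hand-written sample stream; a real stream would come from the HTTP response.
sample = [
    'data: {"token": {"text": "Hel"}}',
    "",
    'data: {"token": {"text": "lo"}}',
    "",
]
print("".join(iter_sse_tokens(sample)))
```

Because the parser takes any iterable of lines, the same function can be fed decoded lines from a live HTTP response or a recorded sample.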

Architecture

Users ──→ CloudFront ──→ S3 (static frontend)
Users ──→ ALB ──→ ECS Task ──→ nginx (:80) ──→ TGI + Phi-3 (:8080, GPU)
  • Compute: ECS on EC2 with g4dn.xlarge (NVIDIA T4, 16 GB VRAM)
  • Model: Phi-3 Mini 3.8B AWQ quantised (~2.3 GB), pre-baked into Docker image
  • Serving: HuggingFace TGI 3.x with continuous batching and SSE streaming
  • Networking: Private subnets, VPC Endpoints (no NAT Gateway)
  • Scaling: 0–3 instances via ECS Capacity Provider. Scales to zero when idle ($0 cost)
  • Security: API key auth (nginx), WAF, HTTPS, private subnets
  • IaC: Terraform with 6 modules (networking, ecr, alb, ecs, frontend, monitoring)
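The compute and model bullets above imply a rough memory budget, worked through below using only the figures stated in the list. The reading of the gap between raw 4-bit weights and the stated ~2.3 GB (quantisation metadata and layers kept at higher precision) is an assumption, and the headroom figure is an upper bound, since the serving runtime itself also uses GPU memory.

```python
# Figures taken from the bullet list above.
PARAMS = 3.8e9          # Phi-3 Mini parameter count
BITS_PER_WEIGHT = 4     # AWQ 4-bit quantisation
STATED_MODEL_GB = 2.3   # model size quoted in the list
VRAM_GB = 16            # NVIDIA T4 on g4dn.xlarge

# Raw size of the weights alone, in decimal gigabytes.
raw_weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9

# Part of the stated size not explained by raw 4-bit weights (assumed to be
# quantisation scales/zero-points and tensors stored at higher precision).
overhead_gb = STATED_MODEL_GB - raw_weights_gb

# Upper bound on memory left for KV cache, activations, and the runtime.
headroom_gb = VRAM_GB - STATED_MODEL_GB

print(f"raw 4-bit weights: ~{raw_weights_gb:.1f} GB")
print(f"overhead vs stated size: ~{overhead_gb:.1f} GB")
print(f"VRAM headroom: ~{headroom_gb:.1f} GB")
```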

Quick Start

# 1. Clone
git clone https://github.com/dinosmuc/phi3-cloud-deployment.git
cd phi3-cloud-deployment

# 2. Configure
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars — set your api_key

# 3. Initialise Terraform
cd terraform
terraform init

# 4. Deploy ECR first
terraform apply -target=module.ecr

# 5. Build and push Docker images
cd ..
./scripts/build_and_push.sh

# 6. Deploy everything
cd terraform
terraform apply

Terraform outputs your frontend_url, api_url, and api_key.
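The outputs can also be read from a script. The sketch below parses the JSON document that `terraform output -json` prints, in which each output is an object with `sensitive`, `type`, and `value` keys. The sample values are placeholders, and marking `api_key` as sensitive is an assumption about this project's configuration.

```python
import json


def parse_terraform_outputs(json_text: str) -> dict:
    """Map each Terraform output name to its value.

    `terraform output -json` wraps every output in an object that carries
    `sensitive`, `type`, and `value` keys; only `value` is kept here.
    """
    raw = json.loads(json_text)
    return {name: entry["value"] for name, entry in raw.items()}


# Placeholder document in the shape `terraform output -json` produces.
# One way to obtain the real text (run from the repository root):
#   subprocess.run(["terraform", "output", "-json"], cwd="terraform",
#                  capture_output=True, text=True, check=True).stdout
sample = json.dumps({
    "frontend_url": {"sensitive": False, "type": "string",
                     "value": "https://frontend.example"},
    "api_url": {"sensitive": False, "type": "string",
                "value": "https://api.example"},
    "api_key": {"sensitive": True, "type": "string",
                "value": "placeholder-key"},
})

outputs = parse_terraform_outputs(sample)
print(outputs["api_url"])
```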

Usage

  • Open the frontend_url in a browser
  • Enter your API key
  • Type a message and watch the response stream in real time
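The API can presumably be called without the browser as well. The following sketch only builds the HTTP request and does not send it. The `/generate_stream` route and the `inputs`/`parameters` body follow TGI's streaming endpoint, but whether the proxy exposes that route unchanged is an assumption, and the `x-api-key` header name is hypothetical; check the proxy configuration for the real one.

```python
import json
import urllib.request


def build_stream_request(api_url: str, api_key: str, prompt: str,
                         max_new_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST request for a streamed generation."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=api_url.rstrip("/") + "/generate_stream",  # assumed route
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # hypothetical header name
        },
        method="POST",
    )


# Placeholder URL and key; substitute the Terraform outputs in practice.
req = build_stream_request("https://api.example", "placeholder-key", "Hello")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` and iterating over the response would yield the raw event lines as bytes, one per iteration.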

Note: If the service has scaled to zero, the first request triggers a cold start (~3–5 minutes). The frontend retries automatically.
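A script calling the API directly needs its own cold-start handling. The sketch below shows one possible approach: retry with capped exponential backoff until a total wait budget is spent. It is not the frontend's actual retry logic; the function name, the delays, and the use of `ConnectionError` as the "not ready yet" signal are all assumptions, and a real client would likely also map HTTP 502/503 responses to that signal.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until_ready(
    request: Callable[[], T],
    max_wait_s: float = 360.0,   # a little above the ~3-5 minute cold start
    first_delay_s: float = 5.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request` until it succeeds or the wait budget is exhausted."""
    waited = 0.0
    delay = first_delay_s
    while True:
        try:
            return request()
        except ConnectionError:
            if waited >= max_wait_s:
                raise  # budget spent: surface the last failure
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay_s)  # capped exponential backoff


# Demonstration with a fake backend that fails twice, then answers.
attempts = {"count": 0}
sleeps = []


def fake_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("service still starting")
    return "ok"


result = retry_until_ready(fake_request, sleep=sleeps.append)
print(result, sleeps)
```

Injecting `sleep` keeps the demonstration instant; production code would leave the default `time.sleep`.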

Cost Estimate (eu-central-1)

  • ~20 hours active testing (on-demand): ~$17
  • ~20 hours active testing (spot): ~$9
  • Idle (scaled to zero): $0.00/hr
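The estimates above can be turned into implied hourly rates with simple arithmetic. The inputs are the approximate totals from the list, so the results are rough, and they describe everything billed while active rather than the GPU instance price alone.

```python
# Approximate totals from the cost estimate above.
ACTIVE_HOURS = 20
ON_DEMAND_TOTAL_USD = 17.0
SPOT_TOTAL_USD = 9.0

# Implied all-in hourly rate while the service is active.
on_demand_rate = ON_DEMAND_TOTAL_USD / ACTIVE_HOURS
spot_rate = SPOT_TOTAL_USD / ACTIVE_HOURS

# Fraction saved by running on spot instead of on-demand.
spot_saving = 1 - SPOT_TOTAL_USD / ON_DEMAND_TOTAL_USD

print(f"on-demand: ~${on_demand_rate:.2f}/hr")
print(f"spot: ~${spot_rate:.2f}/hr")
print(f"spot saving: ~{spot_saving:.0%}")
```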

Infrastructure Modules

  • networking — VPC, subnets, security groups, VPC endpoints
  • ecr — Docker image registry
  • alb — Load balancer, target group, WAF
  • ecs — Cluster, task definition, auto-scaling
  • frontend — S3 + CloudFront
  • monitoring — CloudWatch dashboard and alarms

Tear Down

cd terraform
terraform destroy