LLM Cloud Deployment

Self-hosted Google Gemma 4 (E2B-it) streaming inference API on AWS, served with vLLM on GPU-backed ECS with scale-to-zero. Six Terraform modules: multi-AZ VPC, ALB + WAF, a FastAPI auth proxy, CloudFront + S3 chat UI, and CloudWatch dashboards with alerting.

Terraform | AWS (ECS, ALB, CloudFront, WAF, ECR, VPC) | vLLM | Gemma 4 | FastAPI | Docker


Scalable LLM inference service on AWS using ECS, Terraform, and HuggingFace TGI. Deploys Microsoft Phi-3 Mini 3.8B (AWQ 4-bit) as a streaming inference API with a real-time chat frontend. Responses are delivered token-by-token via Server-Sent Events.
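
As a rough illustration of the token-by-token delivery, the sketch below parses an SSE stream in Python. It assumes each event arrives as a `data:` line carrying a JSON object with a `token.text` field, which is the shape TGI's streaming endpoint emits; the exact payload that reaches the client in this deployment is not documented here. The function name `iter_tokens` and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each generated token from a sequence of SSE lines."""
    for raw in lines:
        line = raw.strip()
        # Only "data:" lines carry a payload; blank separators and
        # ":" comment lines (keep-alives) are skipped.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        token = event.get("token", {})
        # Assumption: special tokens (e.g. end-of-sequence) are flagged
        # and should not be shown to the user.
        if token.get("special"):
            continue
        text = token.get("text")
        if text:
            yield text


# Hypothetical sample stream, not captured from a real deployment.
sample = [
    'data:{"token":{"id":1,"text":"Hel","special":false}}',
    "",
    'data:{"token":{"id":2,"text":"lo","special":false}}',
    "",
    ": keep-alive",
    'data:{"token":{"id":3,"text":"</s>","special":true}}',
]

print("".join(iter_tokens(sample)))
```

In a real client the `lines` argument would come from a streaming HTTP response rather than a list, so that each token can be rendered as soon as it arrives.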

Architecture

Users ──→ CloudFront ──→ S3 (static frontend)
Users ──→ ALB ──→ ECS Task ──→ nginx (:80) ──→ TGI + Phi-3 (:8080, GPU)

Compute: ECS on EC2 with g4dn.xlarge (NVIDIA T4, 16 GB VRAM)
Model: Phi-3 Mini 3.8B AWQ quantised (~2.3 GB), pre-baked into Docker image
Serving: HuggingFace TGI 3.x with continuous batching and SSE streaming
Networking: Private subnets, VPC Endpoints (no NAT Gateway)
Scaling: 0–3 instances via an ECS Capacity Provider. Scales to zero when idle ($0 GPU instance cost)
Security: API key auth (nginx), WAF, HTTPS, private subnets
IaC: Terraform with 6 modules (networking, ecr, alb, ecs, frontend, monitoring)

Quick Start

# 1. Clone
git clone https://github.com/dinosmuc/phi3-cloud-deployment.git
cd phi3-cloud-deployment

# 2. Configure
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars — set your api_key

# 3. Initialise Terraform
cd terraform
terraform init

# 4. Deploy ECR first
terraform apply -target=module.ecr

# 5. Build and push Docker images
cd ..
./scripts/build_and_push.sh

# 6. Deploy everything
cd terraform
terraform apply

Terraform outputs your frontend_url, api_url, and api_key.
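
If the outputs are needed in a script, they can be read back with `terraform output -json`. The sketch below is one way to do that in Python; the helper names `parse_outputs` and `read_outputs` are hypothetical, and the sample values are made up. Note that `-json` prints sensitive values in plain text, so treat the result accordingly if `api_key` is marked sensitive.

```python
import json
import subprocess


def parse_outputs(raw: str) -> dict:
    """Flatten `terraform output -json` into a {name: value} mapping."""
    # Each output is an object with "value", "type" and "sensitive" keys.
    return {name: entry["value"] for name, entry in json.loads(raw).items()}


def read_outputs(terraform_dir: str = "terraform") -> dict:
    """Run `terraform output -json` in the given directory and parse it."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=terraform_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_outputs(result.stdout)


# Made-up example of the JSON shape; real values come from your own apply.
sample_json = """
{
  "frontend_url": {"sensitive": false, "type": "string", "value": "https://frontend.example.com"},
  "api_url": {"sensitive": false, "type": "string", "value": "https://api.example.com"},
  "api_key": {"sensitive": true, "type": "string", "value": "example-key"}
}
"""

outputs = parse_outputs(sample_json)
print(outputs["frontend_url"])
```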

Usage

Open the frontend_url in a browser
Enter your API key
Type a message and watch the response stream in real time

Note: If the service has scaled to zero, the first request triggers a cold start (~3–5 minutes). The frontend retries automatically.
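
For API clients that call the endpoint directly, a similar retry can be sketched in Python. This is not the frontend's actual logic, which is not shown here. It assumes the load balancer answers with 502, 503, or 504 while no task is healthy, and the names `call_with_retry`, `max_wait_s`, and `delay_s` are hypothetical. The default budget of 360 seconds is chosen only to cover the 3–5 minute cold start mentioned above.

```python
import time
from typing import Callable

# Assumption: these status codes indicate that the backend is still starting.
RETRYABLE = (502, 503, 504)


def call_with_retry(
    send: Callable[[], int],
    max_wait_s: float = 360.0,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-retryable status or time runs out.

    send  -- performs one request and returns the HTTP status code
    sleep -- injectable so the loop can be exercised without real waiting
    """
    waited = 0.0
    while True:
        status = send()
        if status not in RETRYABLE:
            return status
        # Give up rather than exceed the total waiting budget.
        if waited + delay_s > max_wait_s:
            raise TimeoutError(f"backend still unavailable after {waited:.0f}s")
        sleep(delay_s)
        waited += delay_s


# Simulated cold start: two 503 responses, then success.
responses = iter([503, 503, 200])
pauses = []
final_status = call_with_retry(lambda: next(responses), sleep=pauses.append)
print(final_status, pauses)
```

A fixed delay keeps the sketch simple; an exponential backoff would reduce request volume during longer cold starts.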

Cost Estimate (eu-central-1)

~20 hours active testing (on-demand): ~$17
~20 hours active testing (spot): ~$9
Idle (scaled to zero): $0.00/hr for GPU instances (always-on resources such as the ALB are still billed hourly)
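
The active-testing figures can be turned into an implied hourly rate and projected to other usage levels. The short Python sketch below uses only the two totals quoted above; it does not look up AWS prices, and the function names and the 40-hour projection are illustrative.

```python
def implied_hourly_rate(total_usd: float, hours: float) -> float:
    """Average cost per active hour implied by a total for a given duration."""
    return total_usd / hours


def project_cost(hours: float, hourly_rate: float) -> float:
    """Estimated cost for a number of active hours at a given rate."""
    return hours * hourly_rate


# Totals quoted in this README for ~20 hours of active testing.
on_demand_rate = implied_hourly_rate(17, 20)  # about $0.85 per active hour
spot_rate = implied_hourly_rate(9, 20)        # about $0.45 per active hour

# Fraction saved by running on spot instead of on-demand.
spot_saving = 1 - spot_rate / on_demand_rate

print(f"on-demand: ${on_demand_rate:.2f}/hr, spot: ${spot_rate:.2f}/hr")
print(f"spot saving: {spot_saving:.0%}")
print(f"40 active hours on spot: ${project_cost(40, spot_rate):.2f}")
```

Because the inputs are approximate totals, the derived rates are rough averages rather than list prices.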

Infrastructure Modules

networking — VPC, subnets, security groups, VPC endpoints
ecr — Docker image registry
alb — Load balancer, target group, WAF
ecs — Cluster, task definition, auto-scaling
frontend — S3 + CloudFront
monitoring — CloudWatch dashboard and alarms

Tear Down

cd terraform
terraform destroy