Back to Projects
LLM Cloud Deployment
Self-hosted Google Gemma 4 (E2B-it)streaming inference API on AWS, served with vLLM on GPU-backed ECS with scale-to-zero. Six Terraform modules: multi-AZ VPC, ALB + WAF, a FastAPI auth proxy, CloudFront + S3 chat UI, and CloudWatch dashboards with alerting.
TerraformAWS (ECS, ALB, CloudFront, WAF, ECR, VPC)vLLMGemma 4FastAPIDocker

Scalable LLM inference service on AWS using ECS, Terraform, and HuggingFace TGI. Deploys Microsoft Phi-3 Mini 3.8B (AWQ 4-bit) as a streaming inference API with a real-time chat frontend. Responses are delivered token-by-token via Server-Sent Events.
Architecture
Users ──→ CloudFront ──→ S3 (static frontend)
Users ──→ ALB ──→ ECS Task ──→ nginx (:80) ──→ TGI + Phi-3 (:8080, GPU)- Compute: ECS on EC2 with g4dn.xlarge (NVIDIA T4, 16 GB VRAM)
- Model: Phi-3 Mini 3.8B AWQ quantised (~2.3 GB), pre-baked into Docker image
- Serving: HuggingFace TGI 3.x with continuous batching and SSE streaming
- Networking: Private subnets, VPC Endpoints (no NAT Gateway)
- Scaling: 0–3 instances via ECS Capacity Provider. Scales to zero when idle ($0 cost)
- Security: API key auth (nginx), WAF, HTTPS, private subnets
- IaC: Terraform with 6 modules (networking, ecr, alb, ecs, frontend, monitoring)
Quick Start
# 1. Clone
git clone https://github.com/dinosmuc/phi3-cloud-deployment.git
cd phi3-cloud-deployment
# 2. Configure
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars — set your api_key
# 3. Initialise Terraform
cd terraform
terraform init
# 4. Deploy ECR first
terraform apply -target=module.ecr
# 5. Build and push Docker images
cd ..
./scripts/build_and_push.sh
# 6. Deploy everything
cd terraform
terraform applyTerraform outputs your frontend_url, api_url, and api_key.
Usage
- Open the frontend_url in a browser
- Enter your API key
- Type a message and see the response stream in real-time
Note: If the service has scaled to zero, the first request triggers a cold start (~3–5 minutes). The frontend retries automatically.
Cost Estimate (eu-central-1)
- ~20 hours active testing (on-demand): ~$17
- ~20 hours active testing (spot): ~$9
- Idle (scaled to zero): $0.00/hr
Infrastructure Modules
- networking — VPC, subnets, security groups, VPC endpoints
- ecr — Docker image registry
- alb — Load balancer, target group, WAF
- ecs — Cluster, task definition, auto-scaling
- frontend — S3 + CloudFront
- monitoring — CloudWatch dashboard and alarms
Tear Down
cd terraform
terraform destroy