Back to Projects

Phi-3 Cloud Deployment

Infrastructure-as-Code project deploying Microsoft Phi-3 Mini (3.8B, 4-bit AWQ) as a streaming inference API on AWS. Six Terraform modules cover networking, ECS on GPU instances with auto-scaling to zero, ALB with WAF rate limiting, CloudFront + S3 frontend, and CloudWatch dashboards with alerting.

TerraformAWS (ECS, ALB, CloudFront, WAF, ECR, VPC)HuggingFace TGIDockernginx

Scalable LLM inference service on AWS using ECS, Terraform, and HuggingFace TGI. Deploys Microsoft Phi-3 Mini 3.8B (AWQ 4-bit) as a streaming inference API with a real-time chat frontend. Responses are delivered token-by-token via Server-Sent Events.

Architecture

Users ──→ CloudFront ──→ S3 (static frontend)
Users ──→ ALB ──→ ECS Task ──→ nginx (:80) ──→ TGI + Phi-3 (:8080, GPU)
  • Compute: ECS on EC2 with g4dn.xlarge (NVIDIA T4, 16 GB VRAM)
  • Model: Phi-3 Mini 3.8B AWQ quantised (~2.3 GB), pre-baked into Docker image
  • Serving: HuggingFace TGI 3.x with continuous batching and SSE streaming
  • Networking: Private subnets, VPC Endpoints (no NAT Gateway)
  • Scaling: 0–3 instances via ECS Capacity Provider. Scales to zero when idle ($0 cost)
  • Security: API key auth (nginx), WAF, HTTPS, private subnets
  • IaC: Terraform with 6 modules (networking, ecr, alb, ecs, frontend, monitoring)

Quick Start

# 1. Clone
git clone https://github.com/dinosmuc/phi3-cloud-deployment.git
cd phi3-cloud-deployment

# 2. Configure
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars — set your api_key

# 3. Initialise Terraform
cd terraform
terraform init

# 4. Deploy ECR first
terraform apply -target=module.ecr

# 5. Build and push Docker images
cd ..
./scripts/build_and_push.sh

# 6. Deploy everything
cd terraform
terraform apply

Terraform outputs your frontend_url, api_url, and api_key.

Usage

  • Open the frontend_url in a browser
  • Enter your API key
  • Type a message and see the response stream in real-time

Note: If the service has scaled to zero, the first request triggers a cold start (~3–5 minutes). The frontend retries automatically.

Cost Estimate (eu-central-1)

  • ~20 hours active testing (on-demand): ~$17
  • ~20 hours active testing (spot): ~$9
  • Idle (scaled to zero): $0.00/hr

Infrastructure Modules

  • networking — VPC, subnets, security groups, VPC endpoints
  • ecr — Docker image registry
  • alb — Load balancer, target group, WAF
  • ecs — Cluster, task definition, auto-scaling
  • frontend — S3 + CloudFront
  • monitoring — CloudWatch dashboard and alarms

Tear Down

cd terraform
terraform destroy