Skip to main content

Designing Infrastructure from the Loop

Infrastructure exists to keep the loop running reliably


Overview

Infrastructure decisions are easy to get wrong in both directions:

  • Over-engineering: Kubernetes, microservices, multi-region before you have users
  • Under-engineering: No monitoring, no backups, no deploy automation

The minimal user loop tells you what matters:

  • What must stay running? The loop's critical path
  • What performance matters? The loop's latency requirements
  • What failures hurt? The loop's reliability requirements

This prompt template helps you make infrastructure decisions proportional to your actual needs.


The Core Infrastructure Prompt

My minimal user loop:
**Action**: [WHAT USER DOES]
**Response**: [WHAT SYSTEM DOES]
**Value**: [WHAT USER GETS]

Loop requirements:
- Latency: [TARGET] ms (p95)
- Availability: [TARGET] % (how often loop must work)
- Users: [CURRENT] / [PROJECTED]

Design infrastructure to serve this loop:

1. **What must always be running?**
- [COMPONENT]: [WHY]

2. **What can tolerate downtime?**
- [COMPONENT]: [ACCEPTABLE DOWNTIME]

3. **What needs monitoring?**
- [METRIC]: [THRESHOLD]

4. **What's the deployment story?**
- How do changes get to production?
- How fast can we roll back?

Start minimal. Add complexity when the loop demands it.

Deployment Strategy from Loop

Use to decide how to deploy:

My loop's requirements:
- Uptime: [X]% (how critical is continuous availability?)
- Deploy frequency: [X] times per [PERIOD]
- Rollback need: [LIKELIHOOD]

Current scale:
- Traffic: [X] requests/day
- Data: [X] GB
- Team size: [X] people

Evaluate deployment options:

**Simple (single server)**:
- SSH + script, or simple CI push
- Pros: Simple, cheap, fast to set up
- Cons: Downtime during deploys, single point of failure
- Appropriate when: [CONDITIONS]

**Managed Platform (Heroku, Railway, Render, Fly.io)**:
- Git push deploy, managed scaling
- Pros: Zero-ops, automatic HTTPS, simple scaling
- Cons: Less control, can get expensive at scale
- Appropriate when: [CONDITIONS]

**Container Platform (Docker + VPS, ECS, Cloud Run)**:
- Containerized, more control
- Pros: Reproducible, portable, can scale
- Cons: More complexity, container management
- Appropriate when: [CONDITIONS]

**Kubernetes**:
- Full orchestration
- Pros: Industry standard, infinite scale, self-healing
- Cons: Massive complexity, operational overhead
- Appropriate when: [CONDITIONS]

For my specific loop and scale, what's the right choice?
What would trigger moving to the next level?

Hosting Decisions

My loop involves:
- Frontend: [STATIC / SSR / BOTH]
- Backend: [API / FULL SERVER]
- Database: [TYPE]
- File storage: [NEEDS?]

For each component, decide:

**Frontend hosting**:
| Option | Cost | Complexity | Performance | When to use |
|--------|------|------------|-------------|-------------|
| Vercel/Netlify | $ | Low | Excellent | Static/JAMstack |
| CDN + origin | $$ | Medium | Excellent | High traffic |
| Self-hosted | $$$ | High | Variable | Full control needed |

**Backend hosting**:
| Option | Cost | Complexity | When to use |
|--------|------|------------|-------------|
| Serverless (Lambda, Cloud Functions) | Pay-per-use | Low | Bursty traffic |
| Managed containers (Cloud Run, App Service) | $ | Medium | Steady traffic |
| VPS (DigitalOcean, Linode) | $$ | Medium | Predictable, control |
| Full cloud (EC2, GCE) | $$$ | High | Scale, specific needs |

**Database hosting**:
| Option | Cost | Complexity | When to use |
|--------|------|------------|-------------|
| Managed (RDS, PlanetScale, Supabase) | $ | Low | Most cases |
| Self-managed | $$ | High | Specific requirements |

For my current needs and budget, what combination makes sense?

Monitoring from Loop

Use to decide what to watch:

For my loop to work, these things must be true:
1. [CONDITION] -- e.g., "API responds in < 500ms"
2. [CONDITION] -- e.g., "Database is reachable"
3. [CONDITION] -- e.g., "External service X is available"

Design monitoring:

**Health checks** (is it up?):

Endpoint: [URL]
Frequency: [X] seconds
Timeout: [X] seconds
Alert if: [CONDITION]

**Performance metrics** (is it fast?):
- Response time (p50, p95, p99)
- Throughput (requests/second)
- Error rate (%)
Alert thresholds: [VALUES]

**Business metrics** (is the loop completing?):
- Loop completions per [PERIOD]
- Drop-off at each stage
- Value delivered
Alert thresholds: [VALUES]

**Resource metrics** (is it healthy?):
- CPU usage
- Memory usage
- Disk usage
- Connection pool usage
Alert thresholds: [VALUES]

What's the minimum viable monitoring to know if the loop is working?
What would I add after that?

Alerting Strategy

My loop's criticality:
- Business hours only? [Y/N]
- Weekend coverage needed? [Y/N]
- Middle-of-night alerts acceptable? [Y/N]

Design alerting tiers:

**P1 - Page immediately** (loop is broken for everyone):
- Conditions: [WHAT]
- Response time: < [X] minutes
- Escalation: [TO WHOM]

**P2 - Alert during business hours** (loop degraded):
- Conditions: [WHAT]
- Response time: < [X] hours
- Escalation: [TO WHOM]

**P3 - Review next business day** (concerning but not broken):
- Conditions: [WHAT]
- Response time: [X] days
- Review process: [HOW]

**Noise reduction**:
- What alerts should be suppressed during deploys?
- What's the de-duplication strategy?
- How do we prevent alert fatigue?

Logging Strategy

For debugging loop issues, I need to capture:

**Structured logs**:
```
{
timestamp: ISO8601,
level: info/warn/error,
request_id: [trace through system],
user_id: [if authenticated],
action: [what happened],
duration_ms: [how long],
metadata: { [relevant context] }
}
```

**What to log at each level**:

INFO (normal operations):
- Loop completions
- Key milestones
- Performance measurements

WARN (concerning but not broken):
- Slow operations
- Retry attempts
- Graceful degradation

ERROR (something broke):
- Unhandled exceptions
- Failed loop completions
- External service failures

**Log retention**:
- Hot (searchable): [X] days
- Warm (archived): [X] months
- Cold (compliance): [X] years

**Log aggregation**:
- Tool: [CloudWatch, Datadog, etc.]
- Search needs: [WHAT QUERIES DO YOU RUN?]

Backup & Recovery

My loop depends on this data:
- [DATA TYPE]: [SIZE] -- [CRITICALITY]
- [DATA TYPE]: [SIZE] -- [CRITICALITY]

Design backup strategy:

**Database backups**:
- Frequency: [HOURLY / DAILY / CONTINUOUS]
- Retention: [X] days
- Storage: [WHERE]
- Tested recovery: [WHEN DID YOU LAST TEST?]

**File/blob backups** (if applicable):
- Approach: [STRATEGY]
- Redundancy: [COPIES, LOCATIONS]

**Recovery objectives**:
- RPO (max data loss acceptable): [X] hours
- RTO (max downtime acceptable): [X] hours

**Disaster scenarios**:
- Database corruption: Recovery plan = [WHAT]
- Cloud region outage: Recovery plan = [WHAT]
- Accidental deletion: Recovery plan = [WHAT]

What's the minimum backup strategy that protects the loop?
What would trigger investing more in DR?

CI/CD from Loop

Use to design your deployment pipeline:

My deploy requirements:
- Deploy frequency: [X] per [PERIOD]
- Downtime tolerance: [NONE / BRIEF / ACCEPTABLE]
- Rollback speed: [TARGET]

Design the pipeline:

**On push to main**:
1. [STEP]: [WHAT HAPPENS] -- [TIME]
2. [STEP]: [WHAT HAPPENS] -- [TIME]
...

**Deploy process**:
1. [STEP]: [WHAT HAPPENS]
2. [STEP]: [WHAT HAPPENS]
...

**Rollback process**:
1. [STEP]: [WHAT HAPPENS]
2. [STEP]: [WHAT HAPPENS]

**Safety checks**:
- [ ] Tests must pass
- [ ] Build must succeed
- [ ] [OTHER GATES]

**Post-deploy verification**:
- [ ] Health check passes
- [ ] Smoke test passes
- [ ] Metrics look normal

Start simple. Add stages as needed.

Security Essentials

My loop handles:
- User data: [Y/N] -- Sensitivity: [LOW/MEDIUM/HIGH]
- Payments: [Y/N]
- PII: [Y/N]

Security checklist for infrastructure:

**Network security**:
- [ ] HTTPS everywhere
- [ ] Database not publicly accessible
- [ ] SSH keys only (no passwords)
- [ ] Firewall rules minimal

**Access control**:
- [ ] Principle of least privilege
- [ ] No shared credentials
- [ ] Secrets in vault/env, not code
- [ ] Audit log for admin actions

**Application security**:
- [ ] Dependencies updated
- [ ] Security headers set
- [ ] Rate limiting in place
- [ ] Input validation everywhere

**Incident response**:
- [ ] Know how to rotate compromised secrets
- [ ] Know how to revoke access
- [ ] Know who to contact for incidents

What's the minimum security posture for my loop?
What increases risk and triggers more investment?

Environment Strategy

My deployment needs:
- Production (real users)
- [OTHER ENVIRONMENTS?]

Design environments:

**Production**:
- Purpose: Serve real users
- Data: Real user data
- Access: [WHO]

**Staging** (if needed):
- Purpose: [WHAT]
- Data: [REAL COPY / SYNTHETIC]
- Parity with prod: [HOW CLOSE]

**Development** (if needed):
- Purpose: [WHAT]
- Data: [WHAT]
- Differences from prod: [ACCEPTABLE?]

**Preview/Review** (if needed):
- Purpose: Per-PR deployments
- Automatic cleanup: [WHEN]

For my current needs:
- What environments are actually necessary?
- What's the cost of each?
- What parity matters?

Start with prod. Add others when pain justifies complexity.

Cost Management

My current/expected usage:
- Compute: [SPEC]
- Database: [SIZE/CONNECTIONS]
- Storage: [GB]
- Bandwidth: [GB/MONTH]
- Users: [COUNT]

Cost analysis:

**Current/projected costs**:
| Component | Provider | Monthly Cost | Cost Driver |
|-----------|----------|--------------|-------------|
| | | | |

**Cost optimization opportunities**:
- [ ] Right-size instances
- [ ] Reserved capacity (if stable)
- [ ] Spot/preemptible (if tolerant)
- [ ] Caching to reduce compute
- [ ] CDN to reduce bandwidth

**Cost triggers to watch**:
- [METRIC] over [THRESHOLD] = investigate

**Budget alerts**:
- Warning at [X]% of budget
- Critical at [X]% of budget

What's my infrastructure budget relative to revenue/runway?
What would justify spending more?

Infrastructure as Code

My infrastructure components:
1. [COMPONENT]
2. [COMPONENT]
...

Decide IaC approach:

**No IaC** (click-ops):
- Acceptable when: Small, stable, single person
- Risk: Hard to reproduce, no history

**Minimal IaC** (scripts + docs):
- Approach: Documented manual steps + automation scripts
- Acceptable when: Simple infra, low change frequency

**Full IaC** (Terraform, Pulumi, CloudFormation):
- Approach: All infrastructure defined in code
- When necessary: Multi-environment, team > 1, complex infra

For my current situation:
- What level of IaC is appropriate?
- What would trigger moving to more IaC?
- What's the minimum documentation needed?

Pre-Launch Checklist

Before the loop goes to real users:

**Reliability**:
- [ ] Health checks configured
- [ ] Database backups working
- [ ] Tested recovery from backup
- [ ] Graceful handling of dependencies down

**Security**:
- [ ] HTTPS configured
- [ ] Secrets not in code
- [ ] Dependencies vulnerability scanned
- [ ] Access controls in place

**Observability**:
- [ ] Can see if loop is working
- [ ] Can see if loop is slow
- [ ] Errors are visible and alerted
- [ ] Can debug issues in production

**Operations**:
- [ ] Know how to deploy
- [ ] Know how to rollback
- [ ] Know how to scale up
- [ ] Know who to call if broken

**Documentation**:
- [ ] How to access systems
- [ ] How to deploy
- [ ] What to do when paged
- [ ] Architecture overview for new people

What's blocking launch?
What can be added after launch?

Checklist: Infrastructure from Loop

  • Deployment method chosen (matches scale)
  • Hosting decisions made (matches needs)
  • Monitoring exists (know if loop works)
  • Alerting configured (know if loop breaks)
  • Backups working (can recover data)
  • CI/CD pipeline exists (can ship safely)
  • Security basics covered (not negligent)
  • Costs understood (within budget)