
Application Scaling

Miren automatically scales your application instances based on traffic. This page explains how scaling works and how to configure it for your needs.

How Autoscaling Works

By default, Miren autoscales your app's web service. As traffic to your application increases, Miren automatically launches additional instances to handle the load. When traffic decreases, it scales instances back down.

This approach is similar to Google Cloud Run's scaling model: instead of guessing how many replicas you need, Miren observes actual demand and adjusts automatically.

Why Autoscaling by Default?

Guessing instance counts is error-prone:

  • Too few instances and your app can't handle traffic spikes
  • Too many instances waste resources when traffic is low
  • Manual scaling requires constant monitoring and adjustment

With autoscaling, Miren handles this automatically so you can focus on your application.

Scale to Zero

Miren can scale your application all the way down to zero instances when there's no traffic. This is particularly valuable for self-hosted deployments where resource efficiency matters:

  • Better utilization: Run dozens of apps on a single server—only active apps consume resources
  • Lower costs: Development, staging, and low-traffic production apps don't waste memory or CPU
  • No idle tax: Internal tools, webhooks, and scheduled tasks don't need dedicated instances waiting around

When a request arrives for a scaled-to-zero app, Miren starts an instance to handle it. That first request waits for the instance to start, so it may see higher latency. Subsequent requests go to the running instance without that startup delay.

Scaling Modes

Miren supports two scaling modes:

| Mode | Description | Use case |
| --- | --- | --- |
| auto | Scales instances based on traffic | Stateless web services, APIs |
| fixed | Runs a set number of instances | Databases, workers, stateful services |

Default Behavior

When you deploy without explicit configuration:

  • web service: Auto mode with 10 concurrent requests per instance, 15-minute scale-down delay
  • All other services: Fixed mode with 1 instance
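A minimal Python sketch of how these defaults could be resolved for a single service. The function and constant names are hypothetical, and the dict stands in for a parsed [services.<name>.concurrency] table. The default values come from this page; the assumption that an explicit table is merged key by key with the defaults for its mode (so omitted keys keep their defaults) is not something this page states.

```python
# Hypothetical sketch: resolve the scaling settings a service ends up with.
# Default values are the ones documented on this page; the merge behavior is assumed.

AUTO_DEFAULTS = {"mode": "auto", "requests_per_instance": 10, "scale_down_delay": "15m"}
FIXED_DEFAULTS = {"mode": "fixed", "num_instances": 1}


def effective_concurrency(service_name, concurrency=None):
    """Return effective settings for a service, given its optional concurrency table."""
    # Copy so the caller's dict is never modified.
    concurrency = dict(concurrency or {})
    # Without an explicit mode, `web` autoscales and every other service is fixed.
    default_mode = "auto" if service_name == "web" else "fixed"
    mode = concurrency.setdefault("mode", default_mode)
    if mode == "auto":
        base = AUTO_DEFAULTS
    elif mode == "fixed":
        base = FIXED_DEFAULTS
    else:
        raise ValueError(f"unknown scaling mode: {mode!r}")
    # Explicit keys override the defaults for the chosen mode.
    return {**base, **concurrency}


print(effective_concurrency("web"))
print(effective_concurrency("worker"))
print(effective_concurrency("web", {"requests_per_instance": 50}))
```

Under these assumptions, the last call keeps the default 15m scale-down delay while raising the per-instance target to 50.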

Configuring Scaling

Configure scaling in your .miren/app.toml file under [services.<name>.concurrency].

Auto Mode (Default for Web)

Auto mode scales instances based on concurrent requests:

[services.web.concurrency]
mode = "auto"
requests_per_instance = 10
scale_down_delay = "15m"

Auto Mode Options

| Option | Description | Default |
| --- | --- | --- |
| mode | Must be "auto" | "auto" for web |
| requests_per_instance | Target concurrent requests per instance | 10 |
| scale_down_delay | How long to wait before scaling down idle instances | 15m |

How Auto Mode Calculates Instances

Miren targets requests_per_instance concurrent requests per instance. Concurrency here means the number of in-flight requests being handled at the same moment, not the number of requests over a period of time. For example, with requests_per_instance = 10:

  • 5 concurrent requests → 1 instance
  • 15 concurrent requests → 2 instances
  • 100 concurrent requests → 10 instances

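This means a single instance handling fast requests can sustain a high request rate while staying under the concurrency limit. For example, if each request takes 10ms, an instance holding 10 requests in flight completes about 10 ÷ 0.01 s = 1,000 requests per second.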
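The instance counts in the list above are consistent with dividing concurrent requests by requests_per_instance and rounding up. The Python sketch below assumes that formula (inferred from the examples, not taken from Miren's source) and adds a Little's law calculation for the request-rate point. The function names are hypothetical, and the real autoscaler also applies scale_down_delay before removing instances, so treat desired_instances as a target rather than the live instance count.

```python
import math


def desired_instances(concurrent_requests, requests_per_instance=10):
    """Instances needed to keep each one at or below the concurrency target."""
    if requests_per_instance <= 0:
        raise ValueError("requests_per_instance must be positive")
    # Round up: 15 in-flight requests at 10 per instance need 2 instances.
    return math.ceil(concurrent_requests / requests_per_instance)


def throughput_rps(concurrent_requests, latency_seconds):
    """Little's law: requests/second = requests in flight / seconds per request."""
    return concurrent_requests / latency_seconds


for load in (5, 15, 100):
    print(f"{load} concurrent -> {desired_instances(load)} instance(s)")

# One instance holding 10 in-flight requests that take 10 ms each.
print(f"~{throughput_rps(10, 0.010):.0f} requests/second")
```

The loop reproduces the three cases listed above, and the last line shows why a concurrency target of 10 does not cap an instance at 10 requests per second.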

The scale_down_delay prevents thrashing when traffic fluctuates. An instance won't be terminated until it has been idle for this duration.
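The rule in the previous paragraph can be sketched as a simple eligibility check. This sketch assumes durations are written as a whole number followed by s, m, or h, like the values on this page; Miren may accept other formats, and parse_delay and may_terminate are hypothetical helpers, not Miren functions.

```python
import re

# Seconds per unit for the single-unit durations used on this page (assumed format).
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_delay(text):
    """Parse a duration such as "15m" or "90s" into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]


def may_terminate(idle_seconds, scale_down_delay="15m"):
    """An idle instance becomes eligible for shutdown only once the delay has elapsed."""
    return idle_seconds >= parse_delay(scale_down_delay)


print(may_terminate(10 * 60))        # idle 10 minutes with the 15m default
print(may_terminate(6 * 60, "5m"))   # idle 6 minutes with a 5m delay
```

An instance that becomes busy again before the delay elapses would, in this model, restart its idle clock, which is what keeps fluctuating traffic from repeatedly tearing instances down.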

Fixed Mode

Fixed mode runs a specific number of instances regardless of traffic:

[services.worker.concurrency]
mode = "fixed"
num_instances = 3

Fixed Mode Options

| Option | Description | Default |
| --- | --- | --- |
| mode | Must be "fixed" | "fixed" for non-web |
| num_instances | Exact number of instances to run | 1 |
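Before deploying, you might sanity-check a concurrency table against the options listed for each mode. This Python sketch is hypothetical and takes the table as a dict: it expects mode to be written out (as every example on this page does), and its positive-integer and unknown-key rules are assumptions rather than documented Miren validation.

```python
def validate_concurrency(table):
    """Return a list of problems found in a concurrency table (empty if none)."""
    mode = table.get("mode")
    problems = []
    if mode == "auto":
        allowed = {"mode", "requests_per_instance", "scale_down_delay"}
        rpi = table.get("requests_per_instance", 10)
        # Assumed rule: the per-instance target must be a positive integer.
        if not isinstance(rpi, int) or rpi < 1:
            problems.append("requests_per_instance must be a positive integer")
    elif mode == "fixed":
        allowed = {"mode", "num_instances"}
        count = table.get("num_instances", 1)
        # Assumed rule: whether Miren accepts 0 here is not stated on this page.
        if not isinstance(count, int) or count < 1:
            problems.append("num_instances must be a positive integer")
    else:
        return [f"mode must be 'auto' or 'fixed', got {mode!r}"]
    # Flag options that belong to the other mode (or to neither).
    for key in sorted(set(table) - allowed):
        problems.append(f"{key!r} is not an option for {mode} mode")
    return problems


print(validate_concurrency({"mode": "fixed", "num_instances": 3}))
print(validate_concurrency({"mode": "fixed", "requests_per_instance": 10}))
```

The second call flags a common mix-up: giving a fixed-mode service an auto-mode option.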

Examples

High-Traffic API

For an API that handles many concurrent requests:

[services.web.concurrency]
mode = "auto"
requests_per_instance = 50
scale_down_delay = "5m"

This configuration:

  • Allows 50 concurrent requests per instance (higher density)
  • Scales down idle instances after 5 minutes (faster scale-down)

Background Worker

For a background job processor:

[services.worker.concurrency]
mode = "fixed"
num_instances = 2

This runs exactly 2 worker instances at all times.

Database Service

For a database that should always be running:

[services.db]
image = "postgres:16"

[services.db.concurrency]
mode = "fixed"
num_instances = 1

Complete Multi-Service App

name = "myapp"

# Web service: autoscales based on traffic
[services.web.concurrency]
mode = "auto"
requests_per_instance = 20
scale_down_delay = "10m"

# Worker service: fixed at 3 instances
[services.worker.concurrency]
mode = "fixed"
num_instances = 3

# Database: single instance with persistent storage
[services.db]
image = "postgres:16"

[services.db.concurrency]
mode = "fixed"
num_instances = 1

[[services.db.disks]]
name = "postgres-data"
mount_path = "/var/lib/postgresql/data"
size_gb = 20

Scaling and Disks

Services with persistent disks must use fixed mode with exactly 1 instance:

[services.db.concurrency]
mode = "fixed"
num_instances = 1

[[services.db.disks]]
name = "db-data"
mount_path = "/data"
size_gb = 10

This restriction exists because disks use exclusive leasing: only one instance can mount a disk at a time.
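The disk rule can be checked mechanically. The Python sketch below takes a service as a dict shaped like its TOML (an optional concurrency table plus a disks array of tables) and applies the documented defaults when no concurrency table is present. The function name is hypothetical; Miren's own check, if any, may differ.

```python
def check_disk_constraint(name, service):
    """Return an error message if a service with disks could run more than one instance."""
    if not service.get("disks"):
        return None
    concurrency = service.get("concurrency", {})
    # Documented defaults: `web` autoscales, other services run 1 fixed instance.
    mode = concurrency.get("mode", "auto" if name == "web" else "fixed")
    count = concurrency.get("num_instances", 1)
    if mode != "fixed" or count != 1:
        return f"service {name!r} has disks, so it must use fixed mode with num_instances = 1"
    return None


db = {
    "image": "postgres:16",
    "concurrency": {"mode": "fixed", "num_instances": 1},
    "disks": [{"name": "db-data", "mount_path": "/data", "size_gb": 10}],
}
print(check_disk_constraint("db", db))
print(check_disk_constraint("db", {**db, "concurrency": {"mode": "fixed", "num_instances": 2}}))
```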

Tuning Tips

Choosing requests_per_instance

  • Lower values (5-10): More responsive scaling, higher resource usage
  • Higher values (50-100): More efficient resource usage, may have slower response to traffic spikes

Start with the default (10) and adjust based on your application's characteristics.

Choosing scale_down_delay

  • Shorter delays (2m-5m): Faster resource reclamation, but may cause more scaling churn
  • Longer delays (15m-30m): More stable instance counts, but holds resources longer

Consider your traffic patterns:

  • Bursty traffic: Use longer delays to avoid constant scaling
  • Predictable traffic: Shorter delays work well
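To see the trade-off, here is a toy Python simulation of two-minute bursts of 50 concurrent requests every ten minutes for one hour. It is a simplification, not Miren's algorithm: it samples load once per minute, scales up immediately, and shrinks the pool to the current target only after the target has stayed below the current count for delay_minutes consecutive minutes, rather than tracking each instance's own idle time.

```python
import math


def simulate(load_per_minute, requests_per_instance=10, delay_minutes=15):
    """Return (scaling_events, instance_minutes) for a per-minute load series."""
    current, minutes_below, events, instance_minutes = 0, 0, 0, 0
    for load in load_per_minute:
        target = math.ceil(load / requests_per_instance)
        if target > current:
            # Scale up right away when demand exceeds capacity.
            current, minutes_below = target, 0
            events += 1
        elif target < current:
            # Scale down only after demand has stayed lower for the full delay.
            minutes_below += 1
            if minutes_below >= delay_minutes:
                current, minutes_below = target, 0
                events += 1
        else:
            minutes_below = 0
        instance_minutes += current
    return events, instance_minutes


bursty = ([50, 50] + [0] * 8) * 6  # 2-minute bursts every 10 minutes, for 1 hour
for delay in (3, 15):
    print(delay, simulate(bursty, delay_minutes=delay))
```

With these inputs, a 3-minute delay produces 12 scaling events and 120 instance-minutes, while a 15-minute delay produces 1 scaling event but 300 instance-minutes, because the pool never shrinks between bursts.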

Monitoring Scaling

View your current instance counts:

# See app status including instance counts
miren app

# Watch instance counts in real-time
miren app --watch

# List all running sandboxes
miren sandbox list

Next Steps