ECS Fargate

Fargate runs the chio sidecar and your app as two containers in one task. The sidecar, which hosts the chio kernel, boots first; the app waits for it to report HEALTHY; and ECS replaces the task if the sidecar is marked UNHEALTHY. The reference manifest is deploy/ecs/task-definition.json in the Arc repo. Prerequisites: an ECS cluster; an ecsTaskExecutionRole for ECR pulls, secret retrieval, and CloudWatch log creation; a task role for runtime AWS access; an EFS file system for kernel config; and Secrets Manager entries for the signing key and capability authority URL.


Architecture

The task uses awsvpc network mode, so both containers share an ENI and a single private IP. The sidecar listens on 9090 and is registered to the ALB target group. The app listens on 8080 on the same loopback and is reachable only through the kernel.

ECS Fargate runs the chio sidecar and app as one task with awsvpc networking. The sidecar is the only ALB target. EFS mounts kernel config; Secrets Manager injects the signing key and capability authority URL via the task definition's secrets array.

Sidecar is the only target group target

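Register the ALB target group to port 9090, never 8080. The app needs no port mapping for ingress: containers in one awsvpc task share a network namespace, so the sidecar reaches the app at 127.0.0.1:8080 over loopback. Keep 8080 closed in the task security group.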
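One way to guard this rule in CI is to inspect the service's loadBalancers entries, for example from aws ecs describe-services output. A minimal sketch, assuming the container name and port from the reference manifest; the function name and the placeholder ARN are hypothetical:

```python
def misrouted_targets(load_balancers, container="chio-sidecar", port=9090):
    """Return loadBalancers entries that do not point at the sidecar port."""
    return [
        lb for lb in load_balancers
        if lb.get("containerName") != container or lb.get("containerPort") != port
    ]


# Entries shaped like the "loadBalancers" list of an ECS service description.
good = {"targetGroupArn": "TG_ARN_PLACEHOLDER", "containerName": "chio-sidecar", "containerPort": 9090}
bad = {"targetGroupArn": "TG_ARN_PLACEHOLDER", "containerName": "app", "containerPort": 8080}

for entry in misrouted_targets([good, bad]):
    print("target group must not point at", entry["containerName"], entry["containerPort"])
```

An empty result means every target group entry goes through the kernel.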

Manifest Walkthrough

Task-level shape

deploy/ecs/task-definition.json
{
  "family": "agent-tool-server",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::ACCOUNT_ID:role/chio-sidecar-task-role",
  "runtimePlatform": {
    "cpuArchitecture": "X86_64",
    "operatingSystemFamily": "LINUX"
  }

The task runs on Fargate with awsvpc networking. Total task budget is 0.5 vCPU and 1024 MiB; the per-container values below subdivide that ceiling. Two roles separate concerns:

  • executionRoleArn: ECS uses this at boot to pull the image from ECR / GHCR, decrypt Secrets Manager values into env vars, and create CloudWatch log groups.
  • taskRoleArn: the kernel and your app inherit this at runtime for AWS API calls (DynamoDB receipts, S3 policy reads, KMS, etc.).
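The budget arithmetic is easy to check mechanically before registering a revision. A minimal sketch, assuming the task-level cpu and memory are plain numeric strings as in the reference manifest; the function name is hypothetical:

```python
def check_budget(task_def):
    """Compare summed per-container cpu/memory against the task-level ceiling."""
    containers = task_def["containerDefinitions"]
    cpu_used = sum(c.get("cpu", 0) for c in containers)
    mem_used = sum(c.get("memory", 0) for c in containers)
    # Task-level values are strings in the task definition JSON.
    cpu_limit = int(task_def["cpu"])
    mem_limit = int(task_def["memory"])
    return {
        "cpu": (cpu_used, cpu_limit),
        "memory": (mem_used, mem_limit),
        "fits": cpu_used <= cpu_limit and mem_used <= mem_limit,
    }


# Values copied from the reference manifest.
reference = {
    "cpu": "512",
    "memory": "1024",
    "containerDefinitions": [
        {"name": "app", "cpu": 384, "memory": 896},
        {"name": "chio-sidecar", "cpu": 128, "memory": 128},
    ],
}
print(check_budget(reference))
```

The reference shape uses the whole ceiling: 384 + 128 CPU units and 896 + 128 MiB.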

Application container

deploy/ecs/task-definition.json
    {
      "name": "app",
      "image": "APP_IMAGE_PLACEHOLDER",
      "essential": true,
      "cpu": 384,
      "memory": 896,
      "portMappings": [
        { "containerPort": 8080, "hostPort": 8080, "protocol": "tcp" }
      ],
      "environment": [
        { "name": "CHIO_SIDECAR_URL", "value": "http://localhost:9090" },
        { "name": "CHIO_SIDECAR_HEALTH_URL", "value": "http://localhost:9090/chio/health" }
      ],
      "dependsOn": [
        { "containerName": "chio-sidecar", "condition": "HEALTHY" }
      ],
      "restartPolicy": {
        "enabled": true,
        "ignoredExitCodes": [],
        "restartAttemptPeriod": 60
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/agent-tool-server",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "app",
          "awslogs-create-group": "true"
        }
      }
    }

Two operational contracts here:

  • dependsOn.condition: HEALTHY blocks the app until the sidecar's container health check reports HEALTHY. The app never starts before the kernel is ready.
  • restartPolicy.enabled: true lets ECS restart a container that exits without replacing the entire task. restartAttemptPeriod: 60 means a container must have run for at least 60 seconds before ECS attempts a restart, which throttles crash loops.
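A HEALTHY dependency only works if the container it names defines a healthCheck. The check below is a sketch of a pre-registration lint over containerDefinitions; the function name and the message strings are hypothetical:

```python
def unmet_healthy_dependencies(container_defs):
    """List (dependent, dependency, problem) for broken HEALTHY dependencies."""
    by_name = {c["name"]: c for c in container_defs}
    problems = []
    for c in container_defs:
        for dep in c.get("dependsOn", []):
            if dep.get("condition") != "HEALTHY":
                continue
            target = by_name.get(dep["containerName"])
            if target is None:
                problems.append((c["name"], dep["containerName"], "no such container"))
            elif "healthCheck" not in target:
                problems.append((c["name"], dep["containerName"], "missing healthCheck"))
    return problems


# Trimmed copy of the two containers in the reference manifest.
defs = [
    {"name": "app", "dependsOn": [{"containerName": "chio-sidecar", "condition": "HEALTHY"}]},
    {"name": "chio-sidecar", "healthCheck": {"command": ["CMD", "/usr/bin/curl", "-fsS", "http://localhost:9090/chio/health"]}},
]
print(unmet_healthy_dependencies(defs))
```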

Sidecar container

deploy/ecs/task-definition.json
    {
      "name": "chio-sidecar",
      "image": "ghcr.io/backbay-labs/chio-sidecar:latest",
      "essential": true,
      "cpu": 128,
      "memory": 128,
      "portMappings": [
        { "containerPort": 9090, "hostPort": 9090, "protocol": "tcp" }
      ],
      "command": [
        "api",
        "protect",
        "--upstream",
        "http://127.0.0.1:8080",
        "--spec",
        "/etc/chio/spec/openapi.yaml",
        "--listen",
        "0.0.0.0:9090"
      ]

command overrides the image's default CMD (--help) so the container stays up and serves requests. The kernel reverse-proxies to 127.0.0.1:8080 after guard evaluation.

Sidecar environment and secrets

deploy/ecs/task-definition.json
      "environment": [
        { "name": "CHIO_LISTEN_ADDR", "value": "0.0.0.0:9090" },
        { "name": "CHIO_HEALTH_PATH", "value": "/chio/health" },
        { "name": "CHIO_KERNEL_CONFIG_PATH", "value": "/etc/chio/kernel.yaml" },
        { "name": "CHIO_POLICY_SOURCE", "value": "s3://ACCOUNT_ID-chio-config/policy.yaml" },
        { "name": "CHIO_RECEIPT_SINK", "value": "dynamodb://chio-receipts" },
        { "name": "CHIO_LOG_LEVEL", "value": "info" }
      ],
      "secrets": [
        {
          "name": "CHIO_SIGNING_KEY",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:chio/signing-key"
        },
        {
          "name": "CHIO_CAPABILITY_AUTHORITY_URL",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:chio/capability-authority-url"
        }
      ]

Plain values land in environment; sensitive values land in secrets, and ECS resolves them at task start using the execution role. The receipt sink is the DynamoDB table chio-receipts and the policy source is an S3 object; the task role needs access to both.
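If the container definition is generated rather than hand-edited, the split can be enforced in code. A small sketch, assuming settings arrive as two dictionaries; the function name is hypothetical and the ARN is the placeholder form used in the manifest:

```python
def render_container_env(plain, secret_arns):
    """Build the environment and secrets arrays for a container definition."""
    overlap = set(plain) & set(secret_arns)
    if overlap:
        # A name defined in both arrays is ambiguous, so refuse to render it.
        raise ValueError(f"defined in both environment and secrets: {sorted(overlap)}")
    return {
        "environment": [{"name": k, "value": v} for k, v in plain.items()],
        "secrets": [{"name": k, "valueFrom": arn} for k, arn in secret_arns.items()],
    }


rendered = render_container_env(
    {"CHIO_LOG_LEVEL": "info", "CHIO_RECEIPT_SINK": "dynamodb://chio-receipts"},
    {"CHIO_SIGNING_KEY": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:chio/signing-key"},
)
print(rendered["secrets"])
```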

Volume mounts and health check

deploy/ecs/task-definition.json
      "mountPoints": [
        {
          "sourceVolume": "chio-config",
          "containerPath": "/etc/chio",
          "readOnly": true
        }
      ],
      "healthCheck": {
        "command": ["CMD", "/usr/bin/curl", "-fsS", "http://localhost:9090/chio/health"],
        "interval": 10,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 15
      },
      "restartPolicy": {
        "enabled": true,
        "ignoredExitCodes": [],
        "restartAttemptPeriod": 60
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/agent-tool-server",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "chio-sidecar",
          "awslogs-create-group": "true"
        }
      },
      "readonlyRootFilesystem": true,
      "user": "65532:65532"
    }

The sidecar mounts EFS read-only at /etc/chio. The health check curls /chio/health every 10 seconds with a 5-second timeout, marks the container HEALTHY after the first success, and gives a 15-second grace period at startup. After 3 consecutive failures the container is marked unhealthy and ECS recycles the task. Two security defaults are worth keeping:

  • readonlyRootFilesystem: true prevents in-container writes outside mounted volumes.
  • user: "65532:65532" runs the kernel as the non-root distroless user baked into the sidecar image.
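The health check numbers translate into a rough detection window. This is a back-of-the-envelope sketch that assumes Docker-style scheduling, where each probe starts one interval after the previous probe finishes; the function name is hypothetical and real timings on Fargate may differ:

```python
def unhealthy_detection_window(interval, timeout, retries):
    """Rough bounds, in seconds, from first failure to the UNHEALTHY verdict.

    fastest: the failure begins just before a probe and every probe fails at once.
    slowest: the failure begins just after a probe and every probe hangs to timeout.
    """
    fastest = (retries - 1) * interval
    slowest = retries * (interval + timeout)
    return fastest, slowest


# interval, timeout, and retries from the reference health check.
print(unhealthy_detection_window(interval=10, timeout=5, retries=3))
```

With the reference values the window is roughly 20 to 45 seconds, which is worth comparing against the ALB target group health check thresholds.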

Volumes (EFS-mounted)

deploy/ecs/task-definition.json
  "volumes": [
    {
      "name": "chio-config",
      "efsVolumeConfiguration": {
        "fileSystemId": "EFS_FILESYSTEM_ID",
        "rootDirectory": "/chio-config",
        "transitEncryption": "ENABLED",
        "authorizationConfig": {
          "iam": "ENABLED"
        }
      }
    }
  ]
}

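The kernel config and OpenAPI spec live in EFS at /chio-config/kernel.yaml and /chio-config/spec/openapi.yaml, which the mount exposes as /etc/chio/kernel.yaml and /etc/chio/spec/openapi.yaml. With IAM authorization enabled, EFS evaluates the file system policy against the task role. Transit encryption must be ENABLED whenever IAM authorization is used.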
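The mapping between EFS paths and in-container paths follows from rootDirectory and containerPath. A minimal sketch using the values from the manifest as defaults; the function name is hypothetical:

```python
from pathlib import PurePosixPath


def container_path(efs_path, root_directory="/chio-config", mount_point="/etc/chio"):
    """Translate a path on the EFS file system to the path seen in the container."""
    # relative_to raises ValueError when efs_path is outside root_directory.
    relative = PurePosixPath(efs_path).relative_to(root_directory)
    return str(PurePosixPath(mount_point) / relative)


print(container_path("/chio-config/kernel.yaml"))
print(container_path("/chio-config/spec/openapi.yaml"))
```

The second path matches the --spec argument in the sidecar command, and the first matches CHIO_KERNEL_CONFIG_PATH.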


IAM Roles

Two roles separate boot-time and runtime concerns. The execution role is consumed by ECS itself before any container runs; the task role is what the kernel and app see at runtime.

| Role | Required actions | Scoped to |
|---|---|---|
| Execution | ecr:BatchGetImage, ecr:GetDownloadUrlForLayer, ecr:GetAuthorizationToken | Image pulls |
| Execution | logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents | /ecs/agent-tool-server:* |
| Execution | secretsmanager:GetSecretValue | chio/signing-key-*, chio/capability-authority-url-* |
| Task | s3:GetObject | Policy bucket object |
| Task | dynamodb:PutItem, dynamodb:BatchWriteItem, dynamodb:UpdateItem | Receipts table |
| Task | elasticfilesystem:ClientMount, elasticfilesystem:ClientWrite | EFS file system |
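The execution-role rows can be rendered into an IAM policy document. The sketch below is illustrative rather than the repo's policy: the function name is hypothetical, the ECR statement uses Resource "*" because ecr:GetAuthorizationToken cannot be scoped to a repository, and the log group and secret ARN patterns are assumptions built from the scopes listed above:

```python
import json


def execution_role_policy(region, account_id, log_group, secret_names):
    """Build an execution-role policy document from the table's scopes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Image pulls; the pull actions can be narrowed to repository ARNs.
                "Effect": "Allow",
                "Action": ["ecr:GetAuthorizationToken", "ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"],
                "Resource": "*",
            },
            {   # awslogs driver with awslogs-create-group enabled.
                "Effect": "Allow",
                "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*",
            },
            {   # Secrets Manager appends a random suffix to secret ARNs, hence "-*".
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": [
                    f"arn:aws:secretsmanager:{region}:{account_id}:secret:{name}-*"
                    for name in secret_names
                ],
            },
        ],
    }


policy = execution_role_policy(
    "us-east-1", "ACCOUNT_ID", "/ecs/agent-tool-server",
    ["chio/signing-key", "chio/capability-authority-url"],
)
print(json.dumps(policy, indent=2))
```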

Container Ordering

The dependsOn array on the app forces the sidecar to reach HEALTHY before the app starts. ECS evaluates the docker health check on the sidecar:

  1. Sidecar starts, opens listener on :9090.
  2. ECS runs curl /chio/health every 10 seconds. Failures during the 15-second startPeriod do not count toward the retry limit.
  3. The first successful curl marks the sidecar HEALTHY, even if it lands inside the startPeriod.
  4. App container becomes eligible to start. The app's own probe (configured at the load balancer or your runtime) takes over after that.
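The ordering can be replayed offline to reason about slow starts. This is a simplified model, assuming Docker-style semantics in which probes run during startPeriod but failures in that window are ignored until the first success; the function name is hypothetical:

```python
def health_status(probes, start_period=15, retries=3):
    """Replay (seconds_since_start, succeeded) probe results in time order."""
    status = "UNKNOWN"   # no verdict yet
    started = False      # set once any probe has succeeded
    failures = 0         # consecutive failures that count toward retries
    for elapsed, ok in probes:
        if ok:
            status, started, failures = "HEALTHY", True, 0
            continue
        if elapsed < start_period and not started:
            continue     # grace period: this failure is ignored
        failures += 1
        if failures >= retries:
            return "UNHEALTHY"
    return status


# Kernel needs about 20 seconds to load config, then answers the probe.
print(health_status([(0, False), (10, False), (20, True)]))
```

The app becomes eligible to start only once this model would report HEALTHY.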

Bake curl into the sidecar image

The health check uses /usr/bin/curl. The published ghcr.io/backbay-labs/chio-sidecar image includes curl. If you build a slimmer derivative without curl, switch the health check to ["CMD-SHELL", "wget -qO- http://localhost:9090/chio/health || exit 1"] and confirm wget is present.

Secrets

Create the two Secrets Manager entries and the EFS-resident config before registering the task definition.

bash
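# Signing key. ECS injects only text secrets (SecretString), so store the key
# in a text encoding; base64 here is an assumption about what the sidecar accepts.
$ aws secretsmanager create-secret --name chio/signing-key \
    --secret-string "$(base64 < signing-key.bin | tr -d '\n')"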

# Capability authority URL.
$ aws secretsmanager create-secret --name chio/capability-authority-url \
    --secret-string 'https://ctl-a.chio.internal:8940'

# Stage config into EFS (DataSync or a one-shot helper task), layout:
#   /chio-config/kernel.yaml
#   /chio-config/spec/openapi.yaml

For SSM Parameter Store instead of Secrets Manager, swap the valueFrom ARN to the SSM parameter ARN and grant ssm:GetParameters on the execution role.


Networking

The task gets its own ENI in the subnet you assign at service creation. Two security groups matter: the task SG (ingress on 9090 from the ALB SG only, no ingress on 8080, egress to outbound dependencies) and the EFS SG (NFS port 2049 ingress from the task SG). Register the ALB target group to port 9090 on the chio-sidecar container:

bash
$ aws ecs create-service --cluster prod-cluster --service-name agent-tool-server \
    --task-definition agent-tool-server:1 --desired-count 2 --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-task],assignPublicIp=DISABLED}" \
    --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:ACCOUNT_ID:targetgroup/chio-sidecar-tg/abc,containerName=chio-sidecar,containerPort=9090"

Configure the ALB target group health check to GET /chio/health on port 9090, expecting HTTP 200. That doubles as load-balancer-level draining: once a sidecar fails the target group health check, the ALB stops routing to it, typically before the container health check exhausts its retries and ECS replaces the task.


Health Probes and Graceful Shutdown

ECS sends SIGTERM to all containers on task stop and waits up to stopTimeout seconds before sending SIGKILL. stopTimeout is set per container in the task definition; it defaults to 30 and can be raised to at most 120 on Fargate for a longer drain. The sidecar handles SIGTERM by stopping ingress, draining in-flight evaluations, flushing receipts to the configured sink, and exiting.

Pair this with ALB connection draining: set the target group's deregistration delay to a few seconds longer than the in-flight request budget so the load balancer stops sending traffic before the task gets SIGTERM.
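Both timers can be derived from one request budget. A rough sketch, assuming the sidecar needs the in-flight request budget plus a receipt-flush allowance after SIGTERM; the function name, the flush allowance, and the 5-second margin are illustrative, while the 120-second cap is the Fargate limit on stopTimeout:

```python
FARGATE_MAX_STOP_TIMEOUT = 120  # seconds; Fargate upper bound for stopTimeout


def drain_timers(request_budget_s, flush_budget_s, margin_s=5):
    """Derive the ALB deregistration delay and the container stopTimeout."""
    # The load balancer keeps draining slightly longer than any in-flight request.
    deregistration_delay = request_budget_s + margin_s
    # After SIGTERM the sidecar finishes requests, then flushes receipts.
    stop_timeout = request_budget_s + flush_budget_s + margin_s
    if stop_timeout > FARGATE_MAX_STOP_TIMEOUT:
        raise ValueError(f"stopTimeout {stop_timeout}s exceeds the Fargate cap")
    return {"deregistration_delay_s": deregistration_delay, "stopTimeout": stop_timeout}


print(drain_timers(request_budget_s=20, flush_budget_s=5))
```

A 20-second request budget with a 5-second flush allowance fits inside the default 30-second stopTimeout.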


Scaling

Scale horizontally with the ECS service's desiredCount and Application Auto Scaling target tracking. CPU and ALBRequestCountPerTarget are the two predefined metrics worth wiring first.

bash
# Register the service as a scalable target.
$ aws application-autoscaling register-scalable-target \
    --service-namespace ecs \
    --resource-id service/prod-cluster/agent-tool-server \
    --scalable-dimension ecs:service:DesiredCount \
    --min-capacity 2 --max-capacity 50

# Track average CPU at 60% across the service.
$ aws application-autoscaling put-scaling-policy \
    --service-namespace ecs \
    --resource-id service/prod-cluster/agent-tool-server \
    --scalable-dimension ecs:service:DesiredCount \
    --policy-name cpu60 --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{
      "TargetValue": 60.0,
      "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
      },
      "ScaleOutCooldown": 60, "ScaleInCooldown": 300
    }'

# Blend Fargate Spot for cost: one on-demand base, the rest on Spot.
$ aws ecs update-service --cluster prod-cluster --service agent-tool-server \
    --capacity-provider-strategy \
        capacityProvider=FARGATE,weight=1,base=1 \
        capacityProvider=FARGATE_SPOT,weight=4,base=0
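To see what the base and weight values imply for a given desired count, the split can be approximated offline. This sketch fills each base first and divides the remainder by weight; the rounding rule (leftover tasks go to the heaviest provider) is an assumption, so actual ECS placement may differ by a task:

```python
def split_tasks(desired, strategy):
    """Approximate how many tasks land on each capacity provider."""
    counts = {}
    remaining = desired
    for s in strategy:                      # satisfy each base first
        take = min(s["base"], remaining)
        counts[s["capacityProvider"]] = take
        remaining -= take
    total_weight = sum(s["weight"] for s in strategy)
    if remaining and total_weight:
        shares = {
            s["capacityProvider"]: remaining * s["weight"] // total_weight
            for s in strategy
        }
        leftover = remaining - sum(shares.values())
        heaviest = max(strategy, key=lambda s: s["weight"])["capacityProvider"]
        shares[heaviest] += leftover        # assumed rounding rule
        for name, n in shares.items():
            counts[name] += n
    return counts


# Strategy from the update-service call above.
strategy = [
    {"capacityProvider": "FARGATE", "weight": 1, "base": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 4, "base": 0},
]
print(split_tasks(11, strategy))
```

At 11 tasks the model puts 3 on FARGATE (1 base plus 2 weighted) and 8 on FARGATE_SPOT.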

Observability

The task definition routes both containers to /ecs/agent-tool-server with stream prefixes per container. Tail with:

bash
# Live tail (aws logs tail is built into AWS CLI v2)
$ aws logs tail /ecs/agent-tool-server --follow --since 5m

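# Filter denied receipts. date -d is GNU date syntax; pass the returned queryId
# to aws logs get-query-results to read the rows.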
$ aws logs start-query \
    --log-group-name /ecs/agent-tool-server \
    --start-time $(date -d '1 hour ago' +%s) \
    --end-time $(date +%s) \
    --query-string 'fields @timestamp, @message
                    | filter event = "receipt" and verdict = "deny"
                    | sort @timestamp desc
                    | limit 50'

Set retention on the log group explicitly; CloudWatch defaults to never expire:

bash
$ aws logs put-retention-policy \
    --log-group-name /ecs/agent-tool-server \
    --retention-in-days 30

For metrics and tracing, add a third container running the OTel collector and point the kernel at it via CHIO_OTEL_ENDPOINT=http://localhost:4317. See Observability for details.
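The denied-receipt filter can also be applied to exported or tailed log lines. A minimal sketch, assuming the sidecar emits one JSON object per line with the event and verdict fields that the Logs Insights query relies on; the function name and the sample lines are hypothetical:

```python
import json


def denied_receipts(lines):
    """Return parsed log records where event is "receipt" and verdict is "deny"."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup banners
        if (isinstance(record, dict)
                and record.get("event") == "receipt"
                and record.get("verdict") == "deny"):
            matches.append(record)
    return matches


sample = [
    '{"event": "receipt", "verdict": "allow", "route": "/api/search"}',
    '{"event": "receipt", "verdict": "deny", "route": "/api/admin"}',
    'kernel ready',
]
for record in denied_receipts(sample):
    print(record["route"])
```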


Cost Considerations

Fargate bills per second for vCPU and memory, plus storage and data transfer. Three knobs dominate:

  • Task size: the reference shape is 0.5 vCPU / 1024 MiB; drop to 0.25 vCPU only if app and sidecar together fit.
  • Spot share: the FARGATE_SPOT capacity provider can cut per-task cost by up to about 70%, and the sidecar drains cleanly inside the 2-minute interruption notice.
  • Log retention: receipts live in DynamoDB, so cap CloudWatch retention at 7 to 30 days.


Operations

Deploy is two calls: register a task definition revision, then roll the service forward. Rollback points the service at a previous revision number. ECS Exec attaches a shell to a running container.

bash
# Deploy a new revision.
$ aws ecs register-task-definition --cli-input-json file://deploy/ecs/task-definition.json
$ aws ecs update-service --cluster prod-cluster --service agent-tool-server \
    --task-definition agent-tool-server --force-new-deployment

# Roll back to revision 42.
$ aws ecs update-service --cluster prod-cluster --service agent-tool-server \
    --task-definition agent-tool-server:42

# Open a shell on a running sidecar (requires --enable-execute-command on the service).
$ aws ecs execute-command --cluster prod-cluster --task <task-arn> \
    --container chio-sidecar --interactive --command "/bin/sh"

Worked Example

After IAM roles, secrets, and EFS config are in place:

bash
# Register the task definition.
$ aws ecs register-task-definition \
    --cli-input-json file://deploy/ecs/task-definition.json

# Create the service behind the ALB.
$ aws ecs create-service \
    --cluster prod-cluster \
    --service-name agent-tool-server \
    --task-definition agent-tool-server:1 \
    --desired-count 2 \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-task]}" \
    --load-balancers "targetGroupArn=...,containerName=chio-sidecar,containerPort=9090" \
    --enable-execute-command

# Wait for steady state.
$ aws ecs wait services-stable --cluster prod-cluster --services agent-tool-server

# Verify via the ALB.
$ ALB=$(aws elbv2 describe-load-balancers --names chio-alb \
    --query 'LoadBalancers[0].DNSName' --output text)

$ curl -fsS "https://$ALB/chio/health" | jq
{ "ok": true, "kernel": "ready", "policy_loaded": true, "authority_generation": 7 }

# The app is unreachable except through the kernel.
$ curl -fsS "https://$ALB/api/search" \
    -H "Authorization: Bearer $CHIO_CAPABILITY_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"query":"hello"}'

If the task keeps recycling

The most common cause is the dependency chain: the sidecar never reaches HEALTHY, so the app never starts. A typical reason is an EFS directory that is empty or missing kernel.yaml or the spec. Start with aws ecs describe-tasks and look at stoppedReason plus per-container exit codes. A secret the execution role cannot read, or an EFS mount denial, surfaces as ResourceInitializationError before any container runs.
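The relevant fields can be pulled out of the describe-tasks JSON in a few lines. A small sketch that reads output saved with --output json; the function name and the sample payload are hypothetical, while stoppedReason, exitCode, reason, and healthStatus are fields of the describe-tasks response:

```python
def summarize_stopped(describe_tasks_output):
    """Flatten describe-tasks output into one summary per task."""
    summaries = []
    for task in describe_tasks_output.get("tasks", []):
        summaries.append({
            "stoppedReason": task.get("stoppedReason", ""),
            "containers": [
                {
                    "name": c.get("name"),
                    "exitCode": c.get("exitCode"),      # None if it never ran
                    "reason": c.get("reason", ""),
                    "healthStatus": c.get("healthStatus", "UNKNOWN"),
                }
                for c in task.get("containers", [])
            ],
        })
    return summaries


# Hypothetical payload: the sidecar failed its health check and the app never started.
payload = {"tasks": [{
    "stoppedReason": "Task failed container health checks",
    "containers": [
        {"name": "chio-sidecar", "exitCode": 1, "healthStatus": "UNHEALTHY"},
        {"name": "app"},
    ],
}]}
for summary in summarize_stopped(payload):
    print(summary["stoppedReason"])
    for c in summary["containers"]:
        print(" ", c["name"], c["exitCode"], c["healthStatus"], c["reason"])
```

A container with no exitCode never ran, which points back at the dependency on the sidecar.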

For other deployment shapes, see Cloud Run and Azure Container Apps. For the Lambda Extension flavour of AWS-native chio, see AWS Lambda.
