Splitting a CloudFormation Stack at the 500-Resource Limit

Every AWS engineer who runs a non-trivial Serverless Framework service eventually meets the 500-resource limit. This is what I wish someone had handed me about a year before I needed it.

The limit

A CloudFormation stack can contain at most 500 resources.

That's the hard limit. It's per stack, it can't be raised by Support, and you only find out you're close to it when a deploy starts spitting:

Resource limit exceeded (Service: CloudFormation, Status Code: 400)

Or, if you're lucky, this deploy squeaks through and the next one fails, because the next commit is the one that crosses the line.

How fast resources accumulate (the 8-per-function rule)

Serverless Framework makes one Lambda look like one thing. CloudFormation sees more. A function with an HTTP event roughly produces:

Resource	Count
`AWS::Lambda::Function`	1
`AWS::Logs::LogGroup`	1
`AWS::Lambda::Version`	1
`AWS::ApiGateway::Method`	1
`AWS::Lambda::Permission`	1
Integration (synthesized)	1
CORS `OPTIONS` method	1
CORS `OPTIONS` integration	1
≈ 8 resources per HTTP-fronted function

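Scheduled events add another ~2 (AWS::Events::Rule + AWS::Lambda::Permission). IAM roles, alarms, and DynamoDB streams pile on. SQS workers are a little cheaper, but not by much. The exact per-function count shifts with API type and path depth (on a REST API, each new path segment adds an AWS::ApiGateway::Resource), so treat 8 as a budgeting figure rather than an inventory.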

The arithmetic that bit me:

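60 HTTP-fronted Lambdas ≈ 480 resources. Add the resources every stack carries regardless (the API itself, the deployment bucket, IAM roles) and a few more functions put you over.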

I had ~57 functions and the next feature would have crossed the line.
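
As a rough sanity check, that arithmetic fits in a few lines of shell. This is a sketch, not a measurement: the 8-per-function figure is the estimate from the table above, the shared-overhead figure of 20 is a made-up placeholder to replace with your own stack's number, and both function names are hypothetical.

```shell
# Estimate a stack's resource count: functions * per-function cost + shared overhead.
# Defaults are assumptions: 8 resources per HTTP function, 0 shared resources.
estimate_resources() {
  local functions="$1"
  local per_function="${2:-8}"
  local shared="${3:-0}"
  echo $(( functions * per_function + shared ))
}

# Largest function count that still fits under a ceiling (integer division).
max_functions() {
  local ceiling="$1"
  local per_function="${2:-8}"
  local shared="${3:-0}"
  echo $(( (ceiling - shared) / per_function ))
}

estimate_resources 60        # prints 480
estimate_resources 57 8 20   # prints 476 (20 shared resources is a placeholder)
max_functions 500 8 20       # prints 60
```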

Why "just delete things" doesn't work

The naive answer is "consolidate functions." It's mostly bad advice. The reasons:

Cold-start blast radius. A 500-line "general API handler" Lambda loads every route's code and dependencies on each cold start, including for requests that use almost none of it. Five 100-line Lambdas each load only their own.
IAM blast radius. Each consolidated function ends up with the union of every embedded path's IAM permissions. That's the opposite of least-privilege.
Deploy blast radius. A bug in a consolidated handler can take down ten endpoints. A bug in a small one takes down one.

There are exceptions — utility handlers that genuinely do the same job over different inputs can collapse — but in general, fewer functions to dodge the limit is fixing the wrong problem. The right move is to split the stack.

What "splitting" actually means

A "stack split" means taking a single Serverless Framework service (one serverless.yml, one CloudFormation stack) and turning it into two or more services that:

Live in the same git repo (usually — separate repos is an option, see below)
Each have their own serverless.yml (or serverless-<domain>.yml)
Each deploy to their own CloudFormation stack
Share underlying AWS resources (DynamoDB tables, S3 buckets, IAM roles) via CloudFormation exports

The shared resources keep living in whichever stack owns them — typically the "core" stack — and the other stacks reference them by export name. You don't duplicate infrastructure. You partition the deployable surface.

Where to draw the boundaries

This is the actual hard part. Pick the wrong boundaries and you'll be reshuffling functions across stacks for months. Things that worked for me, in order of importance:

1. Split by service domain, not by file structure

The functions that move together are the ones that:

Are triggered by the same kind of event (HTTP API for a feature area; or all cron; or all SQS workers for one queue)
Share a deployment cadence — content automation changes weekly, the data pipeline changes monthly
Share a blast radius — if one breaks, the same people need to know

In my own platform, the natural splits were:

Domain	Lives together because...
Core / data pipeline	Owns shared DynamoDB, S3, CloudFront. Touched least often.
Social	Heavy churn (new sources, new platforms), runs on its own schedule.
Analytics	Mostly scheduled cron — has its own change cadence.
Blog	Content generation, mostly HTTP. Independent product surface.
Pricing	Margin / MAP / discontinuation — touches Shopify, separate concerns.

2. Shared resources go in the smallest stack that owns them

The DynamoDB table that every service writes to belongs in core. The Pinterest content queue table belongs in social. Don't keep everything in core because "it's shared" — only the things multiple downstream stacks actually read.

3. Cross-stack dependencies are a one-way ladder

If social imports from core, core must not import from social. Cyclic imports between CloudFormation stacks aren't allowed and you'll find out at deploy time. Sketch the dependency graph on paper before you write any YAML.
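
One way to keep that ladder honest is to write the graph down as data and let `tsort` (a standard Unix tool) check it. A minimal sketch, assuming each line is an "owner consumer" pair; the stack names are the domains from the table above, and `deploy_order` is a hypothetical helper name.

```shell
# Print a valid deploy order for "owner consumer" pairs, one pair per line.
# Fails if the pairs contain a cycle (tsort reports loops on stderr).
deploy_order() {
  local edges="$1"
  local errors
  # Capture only stderr: 2>&1 first, then send stdout to /dev/null.
  errors=$(printf '%s\n' "$edges" | tsort 2>&1 >/dev/null)
  if [ -n "$errors" ]; then
    echo "cyclic stack dependency: $errors" >&2
    return 1
  fi
  printf '%s\n' "$edges" | tsort
}

# core owns the shared resources; every other stack consumes from it.
# core is printed first; the order of the remaining stacks is up to tsort.
deploy_order "core social
core analytics
core blog
core pricing"
```

Adding a "social core" line to that input makes the function fail, which is the same mistake CloudFormation would otherwise report at deploy time.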

The mechanics

A second Serverless service with its own config looks like:

# serverless-social.yml
service: wcs-dashboard-social
provider:
  name: aws
  runtime: nodejs20.x
  stackName: wcs-dashboard-social-${sls:stage}

# Reference resources owned by the core stack
custom:
  coreStack: wcs-seo-dashboard-${sls:stage}

functions:
  publishScheduledPost:
    handler: src/social/publish.handler
    environment:
      METRICS_TABLE_NAME: ${cf:${self:custom.coreStack}.MetricsTableName}
      CONTENT_BUCKET_NAME: ${cf:${self:custom.coreStack}.ContentBucketName}

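And in the core stack, you publish those values as outputs. The ${cf:...} lookup above resolves the output key (MetricsTableName) when the consumer deploys; the Export name matters only if a template uses Fn::ImportValue instead: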

# serverless.yml (core)
resources:
  Resources:
    MetricsTable:
      Type: AWS::DynamoDB::Table
      # ... etc
  Outputs:
    MetricsTableName:
      Value: !Ref MetricsTable
      Export:
        Name: ${self:service}-${sls:stage}-MetricsTableName
    # ContentBucketName (read by the social stack) needs an output declared the same way.

Two deploy commands now:

npx serverless deploy                                # core
npx serverless deploy --config serverless-social.yml # social

In CI, the order matters: deploy the stack that owns exports first, the ones that consume them after. Otherwise the consumer tries to resolve an export that doesn't exist yet.
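
A minimal sketch of that ordering as a script you can commit next to the configs. `deploy_in_order` and the `SLS` override are hypothetical names introduced here; `SLS` exists only so the loop can be dry-run with `SLS=echo`.

```shell
# Deploy Serverless configs strictly in the order given, stopping at the first failure.
# SLS defaults to "npx serverless"; set SLS=echo to print the commands instead of running them.
deploy_in_order() {
  local config
  for config in "$@"; do
    echo "deploying $config" >&2
    # Left unquoted on purpose so "npx serverless" splits into two words.
    ${SLS:-npx serverless} deploy --config "$config" || return 1
  done
}

# Usage (owner of the outputs first, consumers after):
#   deploy_in_order serverless.yml serverless-social.yml
```

Stopping at the first failure matters here: if core fails to deploy, a consumer deploy would resolve stale or missing outputs.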

Doing the migration without an outage

The migration is the scary part because CloudFormation does not let you move a resource between stacks without either:

Deleting it from stack A and recreating it in stack B (downtime + data loss for stateful resources), or
Using a manual CloudFormation resource import, which is a change set of type IMPORT rather than a dedicated command (no Serverless Framework support, easy to get wrong)
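
For reference, here is roughly what the CLI side of the import route looks like. This is a hedged sketch only: it assumes the resource has already been detached from stack A with DeletionPolicy: Retain, that the template passed in declares it (with a DeletionPolicy attribute) alongside the target stack's existing resources, and that the file names and the `import_resource` helper are placeholders invented here. Setting `AWS=echo` prints the commands instead of running them.

```shell
# Import an existing, already-detached resource into another stack.
# Arguments: stack name, change set name, template file, resources-to-import file.
# The resources file is JSON along the lines of (values are placeholders):
#   [{"ResourceType": "AWS::DynamoDB::Table",
#     "LogicalResourceId": "MetricsTable",
#     "ResourceIdentifier": {"TableName": "<existing table name>"}}]
import_resource() {
  local stack="$1"
  local change_set="$2"
  local template="$3"
  local resources="$4"
  # 1. Create a change set of type IMPORT.
  ${AWS:-aws} cloudformation create-change-set \
    --stack-name "$stack" \
    --change-set-name "$change_set" \
    --change-set-type IMPORT \
    --template-body "file://$template" \
    --resources-to-import "file://$resources" || return 1
  # 2. Wait until CloudFormation has finished building it.
  ${AWS:-aws} cloudformation wait change-set-create-complete \
    --stack-name "$stack" --change-set-name "$change_set" || return 1
  # 3. Execute it; only now does the stack take ownership of the resource.
  ${AWS:-aws} cloudformation execute-change-set \
    --stack-name "$stack" --change-set-name "$change_set"
}
```

Review the change set in the console before the execute step when doing this for real; the sketch runs all three steps back to back for brevity.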

The pragmatic playbook:

Stateless resources first. Lambdas, API Gateway methods, EventBridge rules. Removing one from stack A and adding it to stack B is brief downtime per function (seconds), no data loss. Do these one at a time, deploy stack B before deploying stack A.
Keep stateful resources where they were born. DynamoDB tables, S3 buckets, SNS topics. Even after the split, these stay in core. Just export them. Don't try to move data.
Move one function at a time unless you're feeling brave. Each move is reversible by reverting two commits. A big-bang split is reversible by crying.
Watch the resource count after every deploy.
```
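# list-stack-resources paginates through the whole stack;
# describe-stack-resources stops at 100 resources.
aws cloudformation list-stack-resources \
  --stack-name wcs-seo-dashboard-prod \
  --output json \
  --query 'length(StackResourceSummaries)'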
```
I aim for 400 / 500 as a soft ceiling. The remaining 100 is headroom for emergencies and partial deploy states (where CloudFormation temporarily creates new resources before deleting old ones during an update).
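
That ceiling is easy to enforce in CI. A small sketch, assuming the count is passed in from the AWS CLI; `check_ceiling` is a hypothetical helper, and its default of 400 simply mirrors the soft ceiling above.

```shell
# Fail when a stack's resource count exceeds a soft ceiling (default 400).
check_ceiling() {
  local count="$1"
  local ceiling="${2:-400}"
  if [ "$count" -gt "$ceiling" ]; then
    echo "resource count $count exceeds soft ceiling $ceiling" >&2
    return 1
  fi
  echo "ok: $count / $ceiling"
}

# Usage in CI (the count comes from the AWS CLI):
#   check_ceiling "$(aws cloudformation list-stack-resources \
#     --stack-name wcs-seo-dashboard-prod --output json \
#     --query 'length(StackResourceSummaries)')"
```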

What I'd tell past me

Plan the split when you hit 60% of the limit, not 95%. Stack-splitting under deploy-failure pressure is much worse than stack-splitting on a quiet afternoon.
Service domains are the right boundary almost always. It's tempting to split by "infrastructure" vs "business logic," or by language, or by which engineer wrote it. None of those track with how change actually flows. Domains do.
Cross-stack exports are a contract. Once social depends on MetricsTableName, renaming that export breaks the deploy. Treat export names like public API.
Keep the deploy order in version control. Whichever CI system you use, write down the explicit order. Don't make the next person discover it from a failed deploy.
Watch out for partial-deploy resource spikes. During an update, CloudFormation can briefly hold both old and new versions of a resource. Your soft ceiling should account for that.

When splitting isn't the answer

Two cases where I'd reach for something else:

You have ~480 resources and a single, coherent product surface. Splitting along service domains that don't exist is just adding YAML files for no reason. Consider whether you can move non-product resources (DynamoDB tables, IAM roles) to a separate ops stack rather than re-partitioning the product. For tables, that means a resource import with DeletionPolicy: Retain, never a delete-and-recreate.
You're a team of two and the limit is six months away. The split is still coming, but you have time to be deliberate. Spend a sprint mapping domains before writing any YAML.

The 500-resource limit feels arbitrary until you've thought through why a limit like it plausibly exists (my guesses, not anything AWS documents: state-machine size, lock contention during updates, IAM evaluation cost). After the second split, it stops feeling arbitrary and starts feeling like a useful forcing function for service boundaries you should have had anyway.

That's the part that surprised me. The limit isn't the problem — it's a deadline on architectural decisions you were going to make eventually.