Right-sizing SQL is one of those jobs that looks simple until it isn't. Scale down and you risk timeouts, angry app owners, and a rollback at 2 a.m. Do nothing and the bill keeps drifting up.

This framework keeps you out of both traps. It is not a list of SKUs. It is a repeatable decision tree you can run every sprint: measure, classify the bottleneck, change one thing, and prove the outcome.

Framework at a glance

  • Start with the workload and the SLO, not the SKU.

  • Baseline for at least 7 days so you catch peaks, not just averages.

  • Decide based on the limiting signal (CPU, data IO, log IO, concurrency, tempdb, or storage).

  • Make one change at a time, with a rollback trigger defined before you change anything.

  • Treat right-sizing as an operating loop, not a one-time project.

Who this is for

Platform engineers, SREs, FinOps leads, and database owners who need to reduce spend without turning performance into a guessing game.

Step 0: Define 'safe' before you touch compute

Right-sizing is only safe if you agree on success criteria up front. Pick one primary SLO and one rollback trigger. Keep it simple.

Examples of good success criteria:

  • p95 app query latency stays within X% of baseline during peak hours

  • no increase in deadlocks/timeouts

  • CPU stays below a defined ceiling under peak load

Examples of rollback triggers:

  • p95 latency exceeds your threshold for more than N minutes

  • error rate/timeouts breach your SRE alert threshold

  • DTU/vCore saturation (or IO saturation) pins at 95-100% for sustained periods

Write these down in a change note. If you do not define rollback conditions, you will debate them in the middle of the incident.

Step 1: Baseline the right signals (the minimum set)

You do not need a perfect lab. You need a clean baseline that answers one question: what resource is the database begging for when it is slow?

  • CPU % / vCore saturation — read in Azure Monitor metrics; sys.dm_db_resource_stats. High CPU with stable IO usually means compute-bound queries or high concurrency.

  • Data IO % / read latency — read in Azure Monitor metrics; Query Store for heavy readers. If reads are slow, scaling compute may not help; you may be IO-bound or missing indexes.

  • Log IO % / write latency — read in Azure Monitor metrics; log usage patterns. Heavy writes and slow log flushes push you toward faster storage/tier or transaction tuning.

  • Workers / sessions / concurrency — read in Azure Monitor metrics; sys.dm_exec_sessions; wait stats. If you run out of workers or see scheduler pressure, you need more compute or less blocking.

  • Tempdb pressure — read in tempdb-related waits; Query Store patterns; workload traits. Sorts, hashes, and temp tables can burn tempdb. Some tiers handle this better.

  • Top queries by CPU/reads/writes — read in Query Store (runtime stats + plans). Right-sizing without query awareness is gambling. Know what drives the peaks.

Baseline window: aim for 7 to 14 days. If you only have 24 hours of data, treat every decision as higher risk.
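One practical note on where to read the baseline: sys.dm_db_resource_stats only retains about one hour of 15-second samples, so for a 7-to-14-day window you want sys.resource_stats in the master database, which keeps roughly 14 days of 5-minute samples. A minimal sketch of a daily-peak baseline (the database name is a placeholder for your own):

```sql
-- Run in the master database. sys.resource_stats keeps ~14 days of
-- 5-minute samples, covering the full baseline window
-- (sys.dm_db_resource_stats only retains about one hour).
SELECT CAST(start_time AS date)    AS [day],
       MAX(avg_cpu_percent)       AS peak_cpu_pct,
       MAX(avg_data_io_percent)   AS peak_data_io_pct,
       MAX(avg_log_write_percent) AS peak_log_io_pct,
       MAX(max_worker_percent)    AS peak_worker_pct,
       MAX(max_session_percent)   AS peak_session_pct
FROM sys.resource_stats
WHERE database_name = 'YourDb'  -- placeholder: your database name
  AND start_time > DATEADD(day, -14, GETUTCDATE())
GROUP BY CAST(start_time AS date)
ORDER BY [day];
```

If one column's peak is consistently far above the others, you already have a strong candidate for the limiting signal.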

The decision tree

Use this in order. Do not jump to 'scale up' or 'move tiers' until you identify the limiting signal.

What to do for each limiter

1) CPU bound

Signals you typically see:

  • CPU stays high during peak, while Data IO and Log IO are not saturated

  • Top queries show high CPU time in Query Store

  • Waits suggest compute pressure more than IO pressure

First moves that usually work:

  • Find the top 3 queries by CPU in Query Store and fix the obvious: missing indexes, bad parameterization, scans on hot tables.

  • If the app has bursty peaks but a low average, consider serverless or scheduled scale up/down (if your workload tolerates it).

  • If tuning cannot meet the SLO in time, scale up one compute step, validate, then come back for tuning later.
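Finding the top CPU consumers is a straightforward Query Store query. A sketch, assuming Query Store is enabled on the database (avg_cpu_time is reported in microseconds):

```sql
-- Top 3 queries by total CPU over the last 7 days.
SELECT TOP (3)
       q.query_id,
       SUM(rs.avg_cpu_time * rs.count_executions) / 1000000.0 AS total_cpu_sec,
       SUM(rs.count_executions)                               AS executions,
       qt.query_sql_text
FROM sys.query_store_runtime_stats rs
JOIN sys.query_store_plan p        ON p.plan_id = rs.plan_id
JOIN sys.query_store_query q       ON q.query_id = p.query_id
JOIN sys.query_store_query_text qt ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_runtime_stats_interval i
     ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
WHERE i.start_time > DATEADD(day, -7, GETUTCDATE())
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_cpu_sec DESC;
```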

2) Data IO bound

Signals you typically see:

  • Data IO % pins or read latency spikes during peak

  • CPU is not pinned, but requests are slow

  • Queries show high logical reads / physical reads

First moves that usually work:

  • Fix query plans and indexing before buying faster storage. A missing index can look like a tier problem.

  • Check for noisy scans caused by non-sargable predicates, implicit conversions, or stale stats.

  • If the workload truly needs higher read IOPS/latency, evaluate moving tiers (for example, to a tier with faster storage).
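Before concluding you are IO-bound, rank the heavy readers. A sketch, again assuming Query Store is enabled; high logical reads that also show high physical reads is the classic signature of a missing index rather than a tier problem:

```sql
-- Top 5 queries by logical reads over the last 7 days.
SELECT TOP (5)
       q.query_id,
       SUM(rs.avg_logical_io_reads * rs.count_executions)  AS total_logical_reads,
       SUM(rs.avg_physical_io_reads * rs.count_executions) AS total_physical_reads,
       qt.query_sql_text
FROM sys.query_store_runtime_stats rs
JOIN sys.query_store_plan p        ON p.plan_id = rs.plan_id
JOIN sys.query_store_query q       ON q.query_id = p.query_id
JOIN sys.query_store_query_text qt ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_runtime_stats_interval i
     ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
WHERE i.start_time > DATEADD(day, -7, GETUTCDATE())
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_logical_reads DESC;
```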

3) Log IO bound

Signals you typically see:

  • Log IO % pins during write-heavy periods

  • High write latency, long commit times, or transaction log pressure

  • Batch jobs that hammer a single hot table

First moves that usually work:

  • Batch writes (smaller transactions) and avoid long-running open transactions.

  • Reduce churn: avoid needless index maintenance during business hours, and watch for chatty app patterns.

  • If the write rate is real and sustained, evaluate a tier change where the log throughput is higher.
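To confirm the log is really the limiter (and not CPU or data IO dressed up as slow writes), look for windows where log write percent saturates while the other signals stay modest. A sketch against the last hour of 15-second samples (the 80% threshold is an illustrative cutoff, not a standard):

```sql
-- Recent samples where log IO was the dominant pressure.
SELECT end_time,
       avg_log_write_percent,
       avg_cpu_percent,
       avg_data_io_percent
FROM sys.dm_db_resource_stats
WHERE avg_log_write_percent > 80  -- illustrative threshold
ORDER BY end_time DESC;
```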

4) Concurrency bound

Signals you typically see:

  • Workers/sessions run near the ceiling; queueing and wait time rise

  • Blocking chains, lock waits, or high parallelism waits

  • Small queries get slow only when many users are active

First moves that usually work:

  • Kill blocking before you scale: identify the top blockers and fix the transaction pattern.

  • Tune the top wait categories (locks, latches, parallelism) and address the specific root cause.

  • If concurrency is simply higher than the compute can handle, scale compute after you remove obvious contention.
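Identifying the top blockers is a live query against the DMVs. A minimal sketch that lists each currently blocked request, who is blocking it, and what it is running:

```sql
-- Current blocked requests and the sessions blocking them.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time AS wait_ms,
       t.text      AS blocked_sql
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;
```

Follow the blocking_session_id values up the chain to find the head blocker; fixing that one transaction pattern often clears the whole queue.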

5) tempdb bound

Signals you typically see:

  • tempdb-related waits and memory grant pressure during sorts/hashes

  • Queries that spill to tempdb (Query Store shows spills or high temp usage patterns)

  • ETL/reporting workloads sharing the same database as OLTP

First moves that usually work:

  • Separate batch/reporting workloads from OLTP if possible (even by schedule).

  • Tune queries that spill, and reduce unnecessary temp tables.

  • If tempdb is consistently the ceiling, consider tier changes and/or more compute.
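To see who is burning tempdb right now, the task space usage DMV breaks allocations down per session. A sketch (pages are 8 KB, hence the conversion to MB); internal objects are sorts, hashes, and spills, while user objects are temp tables and table variables:

```sql
-- Live tempdb allocation per session, in MB.
SELECT su.session_id,
       SUM(su.user_objects_alloc_page_count)     * 8 / 1024.0 AS user_mb,
       SUM(su.internal_objects_alloc_page_count) * 8 / 1024.0 AS internal_mb
FROM sys.dm_db_task_space_usage su
GROUP BY su.session_id
HAVING SUM(su.user_objects_alloc_page_count
         + su.internal_objects_alloc_page_count) > 0
ORDER BY internal_mb DESC;
```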

Right-size down without breaking things

Most cost wins come from scaling down. The trick is to do it like an operator: in small moves, with proof.

  • Downsize one step at a time (one compute size or one tier step).

  • Do it during a known low-traffic window, then observe during the next normal peak.

  • Compare before/after in Query Store: top query durations, regressions, and plan changes.

  • Keep the rollback plan simple: revert to the previous size and re-check the limiter signals.
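The before/after comparison can be done directly in Query Store by splitting runtime stats at the change time. A sketch, with a placeholder timestamp you would replace with your actual resize time; large positive deltas are the regression candidates to inspect first:

```sql
-- Per-query average duration before vs after the resize.
DECLARE @change datetime2 = '2024-01-01T00:00:00';  -- placeholder: your change time
WITH stats AS (
    SELECT p.query_id,
           CASE WHEN i.start_time < @change THEN 'before' ELSE 'after' END AS phase,
           SUM(rs.avg_duration * rs.count_executions)
             / SUM(rs.count_executions) AS avg_duration_us
    FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
    JOIN sys.query_store_runtime_stats_interval i
         ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
    GROUP BY p.query_id,
             CASE WHEN i.start_time < @change THEN 'before' ELSE 'after' END
)
SELECT b.query_id,
       b.avg_duration_us                     AS before_us,
       a.avg_duration_us                     AS after_us,
       a.avg_duration_us - b.avg_duration_us AS delta_us
FROM stats b
JOIN stats a ON a.query_id = b.query_id
WHERE b.phase = 'before' AND a.phase = 'after'
ORDER BY delta_us DESC;
```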

If you are running an elastic pool, treat the pool as the unit of safety. Pool right-sizing works best when you also track per-database peak patterns so one noisy neighbor does not force an oversized pool.
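Per-database peak patterns inside a pool can be pulled from the same sys.resource_stats view in master, which makes noisy neighbors visible before you resize. A sketch:

```sql
-- Run in master: per-database peaks over the last 14 days,
-- so one outlier database is visible before you resize the pool.
SELECT database_name,
       MAX(avg_cpu_percent)       AS peak_cpu_pct,
       MAX(avg_data_io_percent)   AS peak_data_io_pct,
       MAX(avg_log_write_percent) AS peak_log_io_pct
FROM sys.resource_stats
WHERE start_time > DATEADD(day, -14, GETUTCDATE())
GROUP BY database_name
ORDER BY peak_cpu_pct DESC;
```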

Cost lens: do not confuse 'cheaper SKU' with 'cheaper outcome'

Right-sizing is about the total cost for a stable SLO. Sometimes that means scaling down. Sometimes it means scaling up a bit because it avoids a more expensive architecture problem.

  • If the workload is predictable and always on, reserved capacity can beat repeated micro-optimizations.

  • If the workload is spiky, scheduled scaling or serverless can reduce idle spend (but confirm your cold-start tolerance).

  • If you are in an elastic pool, savings often come from smoothing peaks and removing outliers, not just shrinking the pool.

Make it a loop: the 30-minute weekly right-sizing ritual

  1. Pull last week's peak metrics and the top 10 queries from Query Store.

  2. Mark the limiter for the slowest period (CPU, data IO, log IO, concurrency, tempdb, storage).

  3. Pick one change: tune one query, add one index, or adjust one size.

  4. Run the change with a defined rollback trigger.

  5. Log the outcome (cost delta + performance delta) so you build trust and repeatability.

Closing thought

The fastest way to break SQL performance is to right-size by guesswork. The fastest way to reduce cost is to right-size by signals. If you run this decision tree consistently, your cost and your incidents both drift down.
