Cost blowups make good stories. Cost drift is what actually eats your budget.

Drift comes from normal, boring behavior:

  • A team enables diagnostics “for a week” and forgets.

  • Autoscale does what it’s designed to do, then never scales back down.

  • Backups and snapshots pile up like receipts in a junk drawer.

  • A new service rolls out under a shared subscription, and nobody notices the new meter until Finance asks.

Here’s the uncomfortable part: drift isn’t a tooling problem. It’s an attention problem.

So we fix attention first.

The operator loop (what you do daily)

This is a 10- to 15-minute daily loop, Monday through Friday. No heroics. No war room.

Goal: Detect drift at the management group scope, route it to a real owner, apply the safest action first, then file the “risky” work for change control.

Source of truth (best practice stack):

  • Cost exports (daily) for numbers you can trust and automate against

  • Azure Resource Graph for “what resources changed or exist right now” (example query below)

  • KQL for “why did logs or telemetry explode”

  • Lightweight alerting to email/Teams when something moves outside your tolerance band

If you want a mental model: this is a smoke detector, not a fire investigation.
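
If Resource Graph is new to you, here’s the flavor: a minimal inventory sketch you can run in Resource Graph Explorer (or through the Azure CLI’s resource-graph extension), grouping everything that exists right now by type and subscription. Treat it as a starting point, not a finished report.

    // Azure Resource Graph: quick "what exists right now" inventory.
    // Run at tenant or management group scope.
    resources
    | summarize ResourceCount = count() by type, subscriptionId
    | order by ResourceCount desc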

The 7-day radar checklist (daily signals)

Run these in order. Top-down only. You don’t start at the resource group unless you like suffering.

1) “What moved?” (Azure Management Group scope)

  • Compare the last 7 days vs the prior 7 days at the management group scope (query sketch after this list).

  • Pull the top 3 subscriptions contributing to the change.

  • Operator rule: if it isn’t a top 3 mover, it doesn’t exist today.
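
Cost data has no native KQL surface, so this sketch assumes you’ve landed the daily cost export somewhere queryable (Azure Data Explorer or a custom Log Analytics table). The table and column names (CostDaily, UsageDate, SubscriptionName, CostUSD) are placeholders for whatever your export schema looks like; the Cost analysis UI can do the same comparison by hand.

    // Hypothetical table CostDaily(UsageDate, SubscriptionName, CostUSD),
    // populated from the daily cost export. Adjust names to your schema.
    CostDaily
    | where UsageDate >= startofday(ago(14d))
    | extend Window = iff(UsageDate >= startofday(ago(7d)), "current", "prior")
    | summarize Cost = sum(CostUSD) by SubscriptionName, Window
    | evaluate pivot(Window, sum(Cost))
    | extend DeltaUSD = current - prior
    | top 3 by DeltaUSD desc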

2) “Is it new spend or more spend?”

  • New spend: new Azure resource group, new service, new region, new SKU, new workload (the sketch below surfaces resources created this week).

  • More spend: same workload, higher usage, bigger scale, more retention, more throughput.
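
One way to split the two, assuming the resourcechanges table is available to you in Resource Graph (it keeps roughly two weeks of change history): anything with a Create event this week is a “new spend” candidate; the rest is probably “more spend”.

    // Azure Resource Graph: resources created in the last 7 days ("new spend" candidates).
    resourcechanges
    | extend changeTime = todatetime(properties.changeAttributes.timestamp),
             changeType = tostring(properties.changeType),
             targetId = tostring(properties.targetResourceId)
    | where changeType == "Create" and changeTime > ago(7d)
    | project changeTime, targetId
    | order by changeTime desc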

3) “Top service drivers”

Look for a service jumping into your top 5 that normally isn’t there (query below):

  • Logs / telemetry ingestion

  • Data egress

  • Containers scaling (AKS/VMSS)

  • Database compute or storage growth

  • Backup and snapshot growth
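
Against the same placeholder CostDaily table from step 1, and assuming it carries the export’s MeterCategory column, the top-5 check is a single summarize. Compare the result against last week’s list; the newcomer is your lead.

    // Top service categories by cost, last 7 days.
    CostDaily
    | where UsageDate >= startofday(ago(7d))
    | summarize Cost = sum(CostUSD) by MeterCategory
    | top 5 by Cost desc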

4) “Orphans and barnacles”

This is the easiest money you’ll ever save; the queries after this list find most of it:

  • Unattached disks

  • Old snapshots

  • Idle public IPs

  • Load balancers with no backend

  • Forgotten dev/test resources that never shut off
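
All of these show up in Resource Graph. The queries below are the usual suspects; run them one at a time, and treat the property names as a starting point rather than gospel, since resource schemas shift.

    // Unattached managed disks
    resources
    | where type =~ "microsoft.compute/disks"
    | where tostring(properties.diskState) =~ "Unattached"
    | project name, resourceGroup, subscriptionId, sizeGB = properties.diskSizeGB

    // Snapshots older than 90 days
    resources
    | where type =~ "microsoft.compute/snapshots"
    | where todatetime(properties.timeCreated) < ago(90d)
    | project name, resourceGroup, subscriptionId, created = properties.timeCreated

    // Public IPs attached to nothing
    resources
    | where type =~ "microsoft.network/publicipaddresses"
    | where isnull(properties.ipConfiguration) and isnull(properties.natGateway)
    | project name, resourceGroup, subscriptionId

    // Load balancers with no backend pools
    resources
    | where type =~ "microsoft.network/loadbalancers"
    | where array_length(properties.backendAddressPools) == 0
    | project name, resourceGroup, subscriptionId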

5) “Scale changed”

Scan for sudden shifts in any of these (current values via the queries below):

  • AKS node counts

  • VMSS instances

  • App Service plan size or instance count

  • Database tier or vCore changes
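
Resource Graph gives you today’s values; the drift signal comes from comparing them with yesterday’s notes or your cost export. Three separate queries, one per resource type (verify the property names against your own results):

    // VM scale set instance counts
    resources
    | where type =~ "microsoft.compute/virtualmachinescalesets"
    | project name, resourceGroup, instances = toint(sku.capacity)

    // AKS node pool counts
    resources
    | where type =~ "microsoft.containerservice/managedclusters"
    | mv-expand nodePool = properties.agentPoolProfiles
    | project cluster = name, poolName = tostring(nodePool.name), nodeCount = toint(nodePool["count"])

    // App Service plan tier and instance count
    resources
    | where type =~ "microsoft.web/serverfarms"
    | project name, resourceGroup, tier = tostring(sku.tier), instances = toint(sku.capacity)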

6) “Logging and retention drift”

This is the silent killer because it feels harmless; the two checks after this list make it visible.

  • Retention bumped up “temporarily”

  • Debug logging left on

  • Diagnostic settings pushed broadly without a baseline

  • One noisy workload flooding ingestion
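
Two different query surfaces here: ingestion is visible inside the Log Analytics workspace, retention settings in Resource Graph. Both checks take under a minute.

    // In the Log Analytics workspace: billable ingestion by data type, last 7 days.
    // Usage.Quantity is reported in MB.
    Usage
    | where TimeGenerated > ago(7d) and IsBillable == true
    | summarize IngestedGB = round(sum(Quantity) / 1024, 2) by DataType
    | order by IngestedGB desc

    // In Azure Resource Graph: workspace retention, highest first (spot the "temporary" bumps).
    resources
    | where type =~ "microsoft.operationalinsights/workspaces"
    | project name, resourceGroup, retentionInDays = toint(properties.retentionInDays)
    | order by retentionInDays desc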

7) “Owner and tag integrity”

You can’t route drift without an owner; the Resource Graph sweep below finds the gaps.

  • Does the spend map to an Owner, App, and Environment?

  • If not, your radar still works, but your response becomes a Slack scavenger hunt.
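
A quick sweep for those gaps. Tag keys are matched case-sensitively in this query, so adjust Owner/App/Environment to your actual tagging standard.

    // Azure Resource Graph: resources missing any of the routing tags.
    resources
    | where isnull(tags["Owner"]) or isnull(tags["App"]) or isnull(tags["Environment"])
    | summarize Untagged = count() by subscriptionId, type
    | order by Untagged desc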

Ownership model (who does what)

This only works when roles are boring and explicit.

FinOps (signals + triage)

  • Runs the daily radar

  • Classifies drift (new vs more, service driver, likely cause)

  • Routes it to an owner with a short note: “what moved, where, and what we think caused it”

Platform engineering (guardrails + safe actions)

  • Owns shared settings that cause drift (diagnostics defaults, retention baselines, shared services patterns)

  • Implements safe automation (tag enforcement, dev/test schedules, guardrails)

  • Keeps a “known cost traps” list that teams can actually follow

App owners (intent + workload fixes)

  • Confirm whether the spend is expected

  • Own app-level changes (scaling rules, query tuning, caching, architecture)

Change control (only for risky moves)

  • Anything that can impact availability gets a ticket + change window

  • Reversible changes happen first, always

Action playbook (fixes that don’t break prod)

Start with what’s reversible. Save the “big brain” changes for when you have daylight.

Tier 1: Safe and reversible

  • Delete unattached disks, old snapshots, idle IPs

  • Reduce log retention where it’s clearly excessive for the environment

  • Turn off debug logs that were meant to be temporary

  • Apply dev/test shutdown schedules (policy-backed)

Tier 2: Low risk, but get a quick sign-off

  • Right-size down one step (not three)

  • Tighten autoscale rules to prevent runaway growth

  • Split noisy logs or reduce verbosity

Tier 3: Real engineering work

  • Fix chatty systems driving egress

  • Optimize database queries and indexes

  • Rework telemetry patterns to reduce ingestion without losing signal

Operator rule: never start with the riskiest fix just because it’s the biggest bar on the chart.

Copy/paste checklist block (reuse this every issue)

7-Day Cost Drift Radar (Daily)

  • Azure Management Group scope: last 7 days vs prior 7 days trend

  • Top 3 subscription movers identified

  • Top 3 Azure Resource Group movers identified inside each subscription

  • Classify: new spend vs more spend

  • Check orphans (disks, snapshots, IPs)

  • Check scale changes (AKS/VMSS/App Service/DB tiers)

  • Check logs + retention drift

  • Confirm tags: Owner, App, Environment

  • Route: assign platform/app owner with a 2-sentence summary

  • Apply Tier 1 fixes today, queue Tier 2/3 via ticket + change window

The next 15 minutes (do this today)

  1. Pick one management group and run the 7-day comparison.

  2. Write down the top 3 movers and label each as new or more.

  3. Assign an owner for each mover, even if the owner is “unknown” for now. Unknown is still a status.

  4. Apply one Tier 1 fix. Just one. Prove the loop works.
