Right-Sizing Your Fabric Capacity: Find & Recover Wasted CUs

An F64 capacity costs $8,409.60/month on pay-as-you-go (64 CUs × $0.18/CU-hour × 730 hours, as of June 2026). Drop to an F32 and that line becomes $4,204.80. The difference — $4,204.80/month, $50,457.60/year — is yours to recover if your smoothed load actually fits the smaller SKU. Most teams never check. They bought the SKU from a vendor estimate, the capacity Metrics app shows green, and the bill stays where it is.

Right-sizing a Fabric capacity means reading the 24-hour smoothed background usage, not the peak CU spike, and picking the smallest SKU whose smoothed numbers stay under 100% with room to breathe. Done against real Metrics data, it recovers money from an over-built capacity without triggering throttling. The math is concrete enough to work through in a single afternoon, although validating the result on the smaller SKU takes longer. For the broader cost-reduction playbook this fits into, see how to reduce Microsoft Fabric costs. For picking an initial SKU before you have Metrics data, the Fabric capacity sizing guide has the heuristic table.

Why peak CU is the wrong number to size against

Fabric uses bursting and smoothing to absorb short-lived spikes above your SKU's baseline. A Spark job that runs for 20 minutes and consumes 4× your F32's CU rate doesn't immediately blow through your capacity — the burst is tracked in 30-second intervals and the cost is spread forward over the relevant smoothing window.
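To see why a short burst barely registers, here is a simplified Python sketch. It assumes a single background job whose CU-seconds are spread evenly across the 2,880 thirty-second timepoints of a 24-hour window. Real Fabric accounting overlaps many jobs and carries overages forward, so treat this as an order-of-magnitude illustration, not the service's algorithm. The function name `smoothed_share` is hypothetical.

```python
# Simplified sketch (assumption): one background job's CU-seconds are spread
# evenly over the 24-hour smoothing window, ignoring other jobs and carry-forward.
SMOOTHING_WINDOW_SECONDS = 24 * 60 * 60
TIMEPOINT_SECONDS = 30


def smoothed_share(job_cu: float, job_seconds: int, sku_cus: int) -> float:
    """Fraction of the SKU's per-timepoint capacity the job adds once smoothed."""
    cu_seconds = job_cu * job_seconds
    timepoints = SMOOTHING_WINDOW_SECONDS // TIMEPOINT_SECONDS  # 2,880 per day
    per_timepoint_cu_seconds = cu_seconds / timepoints
    # At 100% utilization, each timepoint holds sku_cus * 30 CU-seconds.
    return per_timepoint_cu_seconds / (sku_cus * TIMEPOINT_SECONDS)


# A 20-minute Spark job running at 4x an F32's rate (128 CUs).
share = smoothed_share(job_cu=4 * 32, job_seconds=20 * 60, sku_cus=32)
print(f"{share:.1%} of F32 per timepoint after smoothing")
```

Under these assumptions the job adds roughly 5.6% of an F32 to each timepoint. That is how a burst that reads as 400% in the moment can leave the 24-hour background nearly untouched.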

For background operations (pipelines, Spark jobs, semantic-model refreshes, Dataflows Gen2), the smoothing window is 24 hours (Understand your Fabric capacity throttling, Microsoft Learn, checked June 2026). That means the question isn't "did we spike above baseline?" but "does our total smoothed consumption for the day stay under our daily CU budget?" Background jobs are rejected only when that 24-hour smoothed total crosses 100% of the SKU's daily allowance. Interactive requests are throttled earlier, at lower levels of carried-forward usage. The stages, in order of severity (Metrics app calculations, Microsoft Learn, checked June 2026), are:

Throttling stage	Future-capacity usage band	What happens
Overage protection	Up to 10 min	Burst absorbed silently; no user impact
Interactive delay	10–60 min	A 20-second throttle is applied to interactive requests
Interactive rejection	60 min–24 h	New interactive requests are rejected; users see errors
Background rejection	Over 24 h	All requests rejected, including background jobs
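The table reduces to a threshold lookup. The sketch below encodes it, taking carried-forward usage expressed in minutes of the SKU's full capacity. The function name is hypothetical, and the thresholds are the ones in the table.

```python
def throttling_stage(future_minutes: float) -> str:
    """Map carried-forward usage (minutes of full SKU capacity) to a throttling stage."""
    if future_minutes <= 10:
        return "Overage protection"      # absorbed silently
    if future_minutes <= 60:
        return "Interactive delay"       # 20-second delay on interactive requests
    if future_minutes <= 24 * 60:
        return "Interactive rejection"   # new interactive requests rejected
    return "Background rejection"        # everything rejected, including background jobs


for minutes in (5, 30, 240, 1800):
    print(minutes, "->", throttling_stage(minutes))
```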

A background rejection reading of 250% in the Metrics app means you have consumed 2.5× your daily CU budget — not that a CPU ran at 250% utilization. The metric measures future capacity consumed, not instantaneous load. Source: Metrics app calculations, Microsoft Learn, checked June 2026.
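Converting the percentage back into capacity makes the 250% example concrete. This sketch assumes, as the paragraph above does, that the background reading is committed future consumption divided by one day of the SKU's capacity. `committed_cu_hours` is a hypothetical helper.

```python
def committed_cu_hours(background_pct: float, sku_cus: int) -> float:
    """CU-hours of future capacity that a background rejection reading represents."""
    daily_budget_cu_hours = sku_cus * 24  # one full day of the SKU's capacity
    return background_pct / 100 * daily_budget_cu_hours


print(committed_cu_hours(250, sku_cus=64))  # 3840.0
```

An F64's daily budget is 64 × 24 = 1,536 CU-hours, so a 250% reading corresponds to 3,840 CU-hours already committed. That is 2.5 days of the capacity running flat out.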

The practical upshot: teams that see a 400% CU spike in the raw ribbon chart and immediately upsize the SKU are buying capacity for a burst that smoothing already absorbed. The first chart to check is the 24-hour background rejection chart, not the raw CU chart.

The recoverable-dollars worked example

Scenario: A team runs an F64 capacity ($8,409.60/mo PAYG) for a mixed workload: nightly pipeline runs (midnight to 4 AM), business-hours Power BI report renders, and weekly Spark training jobs on Saturdays. The capacity has been running for six weeks. They pull 14 days of data from the Capacity Metrics app.

Step 1 — Read the background rejection ribbon.

The 24-hour background rejection % across 14 days peaks at 68% on the heaviest Saturday (Spark + pipeline overlap) and averages 41% on normal weeknights. No day crosses 100%. The capacity has never fired a background rejection event.

Step 2 — Read the interactive delay/rejection ribbons.

The 10-minute interactive delay % peaks at 72% during business hours on report-heavy Tuesdays. It never crosses 100% (no interactive delay throttle fired). The 60-minute interactive rejection % stays below 20% every day.

Step 3 — Project the load onto the next SKU down.

The next SKU down the doubling ladder from F64 is F32 (32 CUs, half the daily budget). Dropping to F32 doubles all the percentages:

Metric	On F64 (measured)	On F32 (projected)
Peak 24-h background rejection %	68%	136%
Avg 24-h background rejection % (weeknights)	41%	82%
Peak 10-min interactive delay %	72%	144%

F32 doesn't work. The Saturday Spark + pipeline overlap would push the 24-hour background to 136%, triggering background rejection and blocking all jobs. On report-heavy Tuesdays the 10-minute interactive delay would reach 144%, so business-hours queries would hit the 20-second throttle.
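The projection in Step 3 is a single ratio. The sketch below applies it to the measured F64 readings. It assumes the workload is unchanged after the move and that each percentage scales inversely with the SKU's CU count. The dictionary keys are illustrative names, not Metrics app field names.

```python
MEASURED_ON_F64 = {
    "peak_24h_background_pct": 68,
    "avg_24h_background_pct_weeknights": 41,
    "peak_10min_interactive_delay_pct": 72,
}


def project(pct: float, from_cus: int, to_cus: int) -> float:
    """Same workload on a SKU with to_cus CUs: the reading scales by from_cus / to_cus."""
    return pct * from_cus / to_cus


for metric, pct in MEASURED_ON_F64.items():
    projected = project(pct, from_cus=64, to_cus=32)
    status = "under 100%" if projected < 100 else "over 100%, throttles"
    print(f"{metric}: {pct}% on F64 -> {projected:.0f}% on F32 ({status})")
```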

Step 4 — Check the intermediate case.

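There is no F48; the ladder doubles. So the question shifts: can you reach F32 by moving or rescheduling workloads? If the Saturday Spark job moved to a Sunday window (no pipeline overlap), the peak background would fall from 68% to approximately 51% on F64. That projects to 102% on F32: still over the line, but barely. With minor Spark optimization (shared session, incremental training rather than full retrain), that 51% drops below 50%, so F32 fits on paper. To keep the 10–20% margin recommended for the final SKU, however, the F64 peak needs to land at 40–45% or lower (80–90% on F32). Rescheduling Spark also leaves the Tuesday interactive peak untouched. That 72% on F64 still projects to 144% on F32, so the business-hours report load has to come down as well before the move is safe.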
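Step 4 runs the same ratio in reverse: how low must the F64 peak go for F32 to land under a target? Here is a minimal sketch with a hypothetical helper name. The 80% target reflects the ceiling this guide uses when validating the smaller SKU on PAYG.

```python
def max_current_peak(target_on_smaller: float, from_cus: int = 64, to_cus: int = 32) -> float:
    """Highest peak % on the current SKU that projects at or under the target on the smaller one."""
    return target_on_smaller * to_cus / from_cus


after_reschedule = 51.0              # approximate F64 peak with Spark moved to Sunday
break_even = max_current_peak(100)   # 50.0: F32 would sit at exactly 100%
with_margin = max_current_peak(80)   # 40.0: F32 would sit at 80%, a 20% margin
print(f"Points to cut just to fit: {after_reschedule - break_even:.0f}")
print(f"Points to cut for a 20% margin: {after_reschedule - with_margin:.0f}")
```

The gap between the two results (1 point versus 11) is the difference between "fits on paper" and "safe to run".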

Step 5 — Price the recoverable gap.

Scenario	Monthly cost (PAYG, June 2026)	Monthly saving vs F64
Current F64 (no change)	$8,409.60	—
F32 after workload rescheduling + Spark optimization (est.)	$4,204.80	$4,204.80
F64 → 1-yr reserved (no right-sizing)	$5,002.87 (× 0.5949 factor)	$3,406.73
F32 after optimization → 1-yr reserved	$2,501.44	$5,908.16

The reservation saves $3,406.73/mo without touching the SKU. Right-sizing to F32 first and then reserving saves $5,908.16/mo, about 1.7 times as much. All figures are estimates computed from the published PAYG rate ($0.18/CU-hour) and a reserved-to-PAYG price factor of 0.5949. Your actual number depends on your smoothed load, not this table.
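You can rerun Step 5 with your own region's prices in a few lines of Python. This sketch hard-codes the article's June 2026 assumptions ($0.18/CU-hour, 730 hours per month, reserved price = 0.5949 × PAYG) and leaves out OneLake storage, which is billed separately. `monthly_cost` is a hypothetical helper.

```python
PAYG_RATE_PER_CU_HOUR = 0.18  # article's June 2026 figure; varies by region
HOURS_PER_MONTH = 730
RESERVED_FACTOR = 0.5949      # 1-yr reserved price as a fraction of PAYG (article's figure)


def monthly_cost(cus: int, reserved: bool = False) -> float:
    """Monthly compute cost of an always-on F-SKU; storage is not included."""
    cost = cus * PAYG_RATE_PER_CU_HOUR * HOURS_PER_MONTH
    if reserved:
        cost *= RESERVED_FACTOR
    return round(cost, 2)


baseline = monthly_cost(64)
scenarios = {
    "F64 PAYG": baseline,
    "F32 PAYG": monthly_cost(32),
    "F64 reserved": monthly_cost(64, reserved=True),
    "F32 reserved": monthly_cost(32, reserved=True),
}
for name, cost in scenarios.items():
    print(f"{name:<13} ${cost:>9,.2f}/mo  saves ${baseline - cost:>9,.2f}/mo vs F64 PAYG")
```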

Step 6 — Apply the settlement-as-floor check before any reservation.

This is the honest constraint that right-sizing guides tend to gloss over. A reserved capacity is a billing floor you cannot undercut. Suppose you commit to a 1-year F32 reservation and discover three months in that the workload has grown past F32's headroom. The reservation doesn't flex with you: any CUs you scale up to beyond it are billed at PAYG rates on top of the reservation charge. Nor can you pause your way out of the cost. Pausing a reserved capacity still bills the reservation charge plus any smoothed overage at full PAYG rates. The floor is locked.

The implication: right-size on PAYG first. Run the smaller SKU for at least four to six weeks, confirm the smoothed percentages hold, then move to reserved. Committing before you have smoothed data locks you into a floor that may not match reality. This is what makes right-sizing an empirical process, not a one-afternoon decision: the 14-day compute retention in the Capacity Metrics app (Understand the metrics app compute page, Microsoft Learn, checked June 2026) is the minimum; four to six weeks on the candidate SKU under PAYG is the responsible standard.

The blast-radius qualifier

One detail the worked example doesn't capture: dropping to a smaller SKU shrinks the total CU pool available to absorb a single greedy workload. On an F64, a rogue Spark job consuming 60% of the daily background budget leaves 40% for everything else. On an F32, the same job consumes 120% of the daily budget on its own, enough to push the capacity into background rejection before anything else runs. That is the throttling blast radius. There is one capacity and one shared pool, so a poorly scoped workload doesn't just hurt itself: it can push the 24-hour background above 100% and reject every other job on the capacity.

Workspace-level surge protection (preview, January 2026) lets capacity admins set a per-workspace CU percentage limit over the rolling 24-hour window and block workspaces that exceed it (Surge protection, Microsoft Learn, checked June 2026). This limits how much background budget any single workspace can consume, which helps on a right-sized F32 where headroom is tighter. It does not give each workspace a guaranteed CU slice, since every workspace on the capacity still draws from one shared pool, but it prevents one workspace from consuming the entire daily allowance. Enable it before you drop a SKU if you have more than one active workspace competing for background budget.

What to look for in the Metrics app

Pull the Capacity Metrics app's compute page and focus on these three charts, in order:

Background rejection % (24-hour window). This is the right-sizing constraint for any capacity with background workloads. Find the highest value across 14 days. A one-tier drop halves the CUs and doubles the reading. A peak under 40% therefore projects to under 80% on the smaller SKU, so you likely have room to drop. A peak of 40–50% projects to 80–100%, which fits only without margin, so trim load before moving. Above 50%, a downsize will trigger background rejection. The compute page shows data for the last 14 days (Understand the metrics app compute page, Microsoft Learn, checked June 2026).
Interactive delay % (10-minute window). This constrains report and DAX query response during business hours. Above 80%, your users are already close to the 20-second throttle. Above 50%, a one-tier drop projects past 100% and pushes them into it.
Interactive rejection % (60-minute window). Users see hard errors only when this crosses 100%. If it has reached 100% at any point in the 14 days, a downsize would make the situation worse, not better; fix the headroom on the current SKU first. Below that, apply the same doubling test as the other two charts.
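The three checks can be combined into one go/no-go sketch. It assumes the same doubling rule as the worked example and an 80% ceiling on the smaller SKU. `PeakReadings` and `one_tier_drop_check` are hypothetical names, and the input values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class PeakReadings:
    """Highest values over the 14-day window, in percent (hypothetical structure)."""
    background_rejection: float   # 24-hour window
    interactive_delay: float      # 10-minute window
    interactive_rejection: float  # 60-minute window


def one_tier_drop_check(peaks: PeakReadings, ceiling: float = 80.0) -> list[str]:
    """Return reasons a one-tier drop looks unsafe; an empty list means go."""
    problems = []
    for name in ("background_rejection", "interactive_delay", "interactive_rejection"):
        current = getattr(peaks, name)
        projected = current * 2  # half the CUs, same workload
        if current >= 100:
            problems.append(f"{name} already at {current:.0f}% on the current SKU")
        elif projected > ceiling:
            problems.append(f"{name} projects to {projected:.0f}% on the smaller SKU")
    return problems


# Illustrative readings shaped like the worked example's F64 data.
for reason in one_tier_drop_check(PeakReadings(68, 72, 20)):
    print(reason)
```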

The storage page covers the last 30 days (Understand the metrics app storage page, Microsoft Learn, checked June 2026) — a separate retention window from compute. Storage charges continue while a capacity is paused, so factor ongoing OneLake storage (~$0.023/GB-month, as of June 2026) separately from compute when you model the savings from a downsize.

Ordering the moves correctly

Right-sizing sits third in the cost-reduction sequence. Run it out of order and you're optimizing on top of a broken baseline:

Kill idle PAYG compute first. Automate pause/resume around your actual active windows — the biggest, cleanest recovery before you touch the SKU. More detail at the Fabric cost-reduction playbook.
Read 14 days of smoothed data before making any SKU decision. The Capacity Metrics app compute page is the primary source; the item history page (preview, available since August 2025) provides a 30-day view of compute consumption if you need longer history.
Right-size on PAYG. Move to the candidate SKU and run it for four to six weeks. Confirm the 24-hour background rejection stays under 80% under real load, including your heaviest scheduled window.
Enable surge protection on the smaller capacity to cap any single workspace's background share before a rogue job consumes the tighter pool.
Then commit to a reservation — not before. The Fabric smoothing and bursting mechanics article has the full debt-carry math for understanding how carry-forward interacts with your reservation floor.

The enemy this sequence defeats is the throttling blast radius: the way one greedy workload on an undersized capacity can block every other job on that capacity. Right-sizing is not just about saving money in the over-built case. It's also about not falling into the under-built case without first knowing where your smoothed floor sits.

Frequently asked questions

How do I right-size a Microsoft Fabric capacity? Pull at least 14 days of data from the Capacity Metrics app and look at the 24-hour smoothed background percentage, not the raw CU spike. Size to the smallest F-SKU whose 24-hour smoothed background stays under 100% with 10–20% margin. Dropping one tier on the doubling ladder cuts that capacity line by exactly 50%, so the payoff for getting the sizing right is large.

What does the Fabric Capacity Metrics app show me for right-sizing? The compute page shows 14 days of CU detail and three throttling charts: interactive delay (10-min window), interactive rejection (60-min window), and background rejection (24-h window). The background rejection % is the binding constraint for right-sizing — it tells you how much of your daily CU budget is consumed after smoothing. A background rejection reading of 70% means you have 30% of your daily allowance left; one sitting at 250% means you have consumed 2.5× your daily budget and all requests are being rejected (Microsoft Learn, checked June 2026).

Is it safe to drop one F-SKU tier on a Fabric capacity? Only if your 14-day Capacity Metrics data shows the 24-hour smoothed background staying under 100% at the smaller SKU's allowance. On the doubling ladder, an F32 carries exactly half the daily CU budget of an F64, so the same load reads twice as high. If your F64 peaks at 60% smoothed background, an F32 would push that same load to 120%, which triggers background rejection. The peak must sit under 50% on the current SKU just to fit after the drop, and at 40–45% or lower to keep a 10–20% margin.

Why is the 24-hour smoothed background the right metric for right-sizing? Because background operations are smoothed over 24 hours before throttling fires. Peak CU reads tell you about bursting; the 24-h smoothed background tells you whether that bursting created a debt that exceeds your daily allowance. Right-sizing against the peak almost always leads to over-buying.

Does a reserved capacity protect you from right-sizing mistakes? No — it makes them worse. On pay-as-you-go you can rescale at any time. On a reserved capacity you are locked into a floor you pay regardless of use. Pausing a reserved capacity still bills the reservation charge plus any smoothed overage at full PAYG rates. Get the sizing right on PAYG before committing to a reservation.

Researched with AI assistance, written and fact-checked by Jonathan Flach, verified against Microsoft Learn.