Real-Time Throttling Triage When the Metrics App Can't Keep Up
By Jonathan Flach · Published 2026-06-20 · Reviewed 2026-06-20
An F64 capacity costs $8,409.60 per month at PAYG rates — and the moment it throttles, your Metrics app shows you data that is already 10–15 minutes old. That gap is not a reporting inconvenience; it is a triage gap. By the time the Compute page confirms an interactive rejection, users have been hitting errors for a quarter of an hour. Fabric throttling triage under a live incident requires a different signal stack than the one you use for post-hoc analysis — and one firm rule: do not pause the capacity to clear the throttle.
This is the step-by-step runbook for a live throttle. The underlying monitoring architecture — why the Metrics app has that lag, and how to build the alerting layer that prevents you from ever needing this runbook mid-incident — is covered in the Microsoft Fabric capacity monitoring guide. For predicting throttles before they land, see predicting Fabric throttling. For alerting that fires before users feel the impact, see Fabric capacity throttling alerts.
Why the Metrics app can't triage a live throttle
The Capacity Metrics app's data "becomes available within 10 to 15 minutes after the activity occurs" (What is the Microsoft Fabric Capacity Metrics app?, Microsoft Learn, checked June 2026). That lag is intentional — the app processes and aggregates telemetry before writing it to the semantic model, then your report refreshes on top of that. The Compute page's 30-second timepoint granularity looks precise, but each point you see is at minimum 10 minutes in the past.
Under normal operations, 10–15 minutes is fine for trending and sizing decisions. Under an active throttle, it means:
- You cannot confirm from the Metrics app that throttling is happening right now.
- You cannot tell from the Metrics app whether the throttle has worsened or begun to clear.
- You cannot identify the live offending workload — the top-consuming items list reflects 10–15 minutes ago, not the current moment.
This is the metrics-latency gap problem, and it is the named enemy this runbook exists to defeat: the Metrics app becomes a lagging post-mortem tool the moment a throttle is actively in progress.
The real-time signal lives elsewhere.
The live signal: Real-Time hub capacity events
Fabric capacity overview events in the Real-Time hub emit a Microsoft.Fabric.Capacity.Summary event every 30 seconds for each active capacity (Explore Fabric capacity overview events in Fabric Real-Time hub, Microsoft Learn, checked June 2026). That summary carries the fields you need for live triage:
| Field | What it tells you |
|---|---|
| interactiveDelayThresholdPercentage | % of the 10-minute interactive delay window consumed. Exceeds 100% → delay active. |
| interactiveRejectionThresholdPercentage | % of the 60-minute interactive rejection window consumed. Exceeds 100% → rejections active. |
| backgroundRejectionThresholdPercentage | % of the 24-hour background rejection window consumed. Exceeds 100% → all requests blocked. |
| overageTotalCapacityUnitMs | Total carry-forward CU debt in the current window: the number that needs to burn to zero for throttling to clear. |
| overageBurndownCapacityUnitMs | CU debt being retired in this 30-second window. Positive and rising → throttle is clearing. Near zero → capacity is maxed out, debt is not clearing. |
| capacityUnitMs vs baseCapacityUnits × 30,000 | Raw utilization for this window. Compare to see how far over budget the current 30 seconds are. |
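The last row of the table is a simple ratio. A minimal sketch of that comparison follows, assuming the two fields are available as plain numbers; the function name and the sample values are hypothetical.

```python
WINDOW_MS = 30_000  # one 30-second summary window, in milliseconds


def utilization_pct(capacity_unit_ms: float, base_capacity_units: float) -> float:
    """Return the window's consumption as a percentage of its CU budget."""
    budget_ms = base_capacity_units * WINDOW_MS
    return 100.0 * capacity_unit_ms / budget_ms


# Hypothetical F64 window: 2,304,000 CU-ms consumed against a 1,920,000 CU-ms budget.
print(utilization_pct(2_304_000, 64))  # 120.0
```

Anything above 100 means the window spent more than its baseline budget and added to the carry-forward.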
A Microsoft.Fabric.Capacity.State event fires on every state transition — including the transition to overloaded (throttling). If you have an Eventstream routing these events into an Eventhouse and a Data Activator alert on them, that alert fires within 30 seconds of the throttle starting. If you do not have that wired up yet, you are dependent on user complaints. The alerting setup guide covers building that pipeline.
Note: capacity events are best-effort delivery — rare duplicates and missed events can occur. The Microsoft Learn documentation notes this and provides deduplication patterns for KQL (checked June 2026).
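If you post-process exported events outside KQL, the same idea can be sketched in Python. This is an illustration only, not the pattern from Microsoft Learn: it assumes that a summary event is uniquely identified by capacityId plus windowStartTime, and that windowStartTime is an ISO 8601 string that sorts chronologically.

```python
from typing import Iterable


def dedupe_summaries(events: Iterable[dict]) -> list[dict]:
    """Keep the first event seen per (capacityId, windowStartTime), sorted by window."""
    first_seen: dict[tuple, dict] = {}
    for event in events:
        # Assumed dedup key; adjust if your events carry a better identifier.
        key = (event["capacityId"], event["windowStartTime"])
        first_seen.setdefault(key, event)
    return sorted(first_seen.values(), key=lambda e: (e["capacityId"], e["windowStartTime"]))


# Hypothetical sample: the second event is a duplicate delivery of the first.
sample = [
    {"capacityId": "cap-a", "windowStartTime": "2026-06-20T10:00:30Z"},
    {"capacityId": "cap-a", "windowStartTime": "2026-06-20T10:00:30Z"},
    {"capacityId": "cap-a", "windowStartTime": "2026-06-20T10:00:00Z"},
]
print(len(dedupe_summaries(sample)))  # 2
```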
The triage runbook (active throttle)
This is the operational sequence for a capacity that is actively throttling right now. Follow it in order.
Step 1 — Confirm the throttle is real and identify its stage (0–2 min)
If you have capacity events flowing into an Eventhouse, run this KQL against your summary table:
    _summaryTable
    | where capacityId == "<your-capacity-id>"
    | where windowStartTime > ago(10m)
    | project windowStartTime,
              interactiveDelayThresholdPercentage,
              interactiveRejectionThresholdPercentage,
              backgroundRejectionThresholdPercentage,
              overageTotalCapacityUnitMs,
              overageBurndownCapacityUnitMs
    | order by windowStartTime desc
Read the threshold columns. Values above 100 tell you which throttle stage is active and how far past the threshold you are. A backgroundRejectionThresholdPercentage of 340 means you are carrying 3.4× the 24-hour background window in smoothed debt — background jobs will not start, and the burndown will take a long time unless you intervene.
If you do not have capacity events, open the Metrics app, go to the Compute page, select the capacity and the current day, and read the Throttling tab. Accept that you are looking at 10–15-minute-old data. Look at the Interactive Delay %, Interactive Rejection %, and Background Rejection % line charts. Any of those above 100 confirms throttling was occurring at least 10–15 minutes ago — but the situation may have changed since.
Determine stage:
| What you see | Stage | User impact |
|---|---|---|
| Interactive Delay % > 100, others < 100 | Interactive delay | Every report click gets a ~20-second added wait |
| Interactive Rejection % > 100 | Interactive rejection | New report loads, queries rejected with errors |
| Background Rejection % > 100 | Background rejection | All new operations rejected; pipelines, refreshes blocked |
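The stage table can be expressed as a small classifier. This is a sketch under the assumption that the three threshold percentages are read from the most recent summary event (or the Metrics app chart); the function name and return labels are hypothetical. It checks the most severe stage first, because a capacity in background rejection will normally also exceed the two interactive thresholds.

```python
def throttle_stage(
    interactive_delay_pct: float,
    interactive_rejection_pct: float,
    background_rejection_pct: float,
) -> str:
    """Map the three threshold percentages to the most severe active stage."""
    if background_rejection_pct > 100:
        return "background rejection"
    if interactive_rejection_pct > 100:
        return "interactive rejection"
    if interactive_delay_pct > 100:
        return "interactive delay"
    return "not throttling"


print(throttle_stage(130, 40, 5))     # interactive delay
print(throttle_stage(900, 150, 12))   # interactive rejection
print(throttle_stage(900, 600, 340))  # background rejection
```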
Step 2 — Identify the probable offender (2–5 min)
The Metrics app's Compute page Items (1 day) matrix, sorted by CU descending, shows the top-consuming items in the last 24 hours. This is lagged but directionally correct — the item that has consumed the most CUs today is almost certainly the source of the debt.
Look for:
- Dataflow Gen2 items: CI/CD Dataflow Gen2 (the default for all new items since April 2026) bills at 12 CU/s for the first 10 minutes of a run, then 1.5 CU/s for every second beyond that — background-smoothed over 24 hours. Non-CI/CD dataflows (legacy items only) bill at 16 CU/s flat for the entire duration with no tiering. Either way, a dataflow running for 6 hours accumulates significant smoothed debt spread across the following 24-hour window. A single large dataflow refresh that ran 6 hours ago can be silently pushing every timepoint since over budget.
- Spark notebook / pipeline items: long-running jobs burst heavily and spread debt far into the future.
- Semantic model refresh: background operations. Large models refreshing in parallel spread heavy CU debt across the 24-hour background smoothing window. Semantic model queries (report DAX) are interactive, but scheduled refreshes are not.
- Runaway AI / Copilot items: Copilot and AI Functions are classified as background operations and smoothed over 24 hours. High-volume concurrent AI use adds to background smoothing debt — not the interactive window — and can silently push every timepoint over budget for hours after the sessions end.
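To get a feel for the Dataflow Gen2 numbers in the list above, here is a rough worked calculation. It is a sketch that treats the quoted rates as the only meter on a single run and spreads the total evenly over the 24-hour smoothing window; both are simplifying assumptions, and the function names are hypothetical.

```python
def dataflow_cu_seconds(duration_s: int, cicd: bool = True) -> float:
    """Total CU-seconds for one dataflow run, using the rates quoted in the text."""
    if not cicd:
        return 16.0 * duration_s               # legacy: 16 CU/s flat
    first_tier = min(duration_s, 600)          # first 10 minutes at 12 CU/s
    second_tier = max(duration_s - 600, 0)     # remainder at 1.5 CU/s
    return 12.0 * first_tier + 1.5 * second_tier


def smoothed_cu_per_second(cu_seconds: float, smoothing_s: int = 86_400) -> float:
    """Average CU/s the run contributes once spread over the smoothing window."""
    return cu_seconds / smoothing_s


six_hours = 6 * 3600
print(dataflow_cu_seconds(six_hours, cicd=True))    # 38700.0
print(dataflow_cu_seconds(six_hours, cicd=False))   # 345600.0
print(smoothed_cu_per_second(dataflow_cu_seconds(six_hours, cicd=False)))  # 4.0
```

Several such runs overlapping in the same 24 hours add together, which is how individually unremarkable jobs end up pushing every timepoint over budget.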
If you have capacity events in an Eventhouse, use capacityUnitUtilizationBreakdown workload codes to see which workload type is driving consumption in the most recent 30-second windows — this is faster than waiting for the Metrics app to catch up.
Step 3 — Estimate burndown time and decide your intervention (5–10 min)
This is the decision that determines whether you intervene aggressively or wait.
Read the carry-forward total. From overageTotalCapacityUnitMs in the capacity events (or the Metrics app's cumulative overage chart if you're reading lagged data), estimate the total CU debt outstanding.
Estimate idle headroom per timepoint. At the current utilization level, how many CU-ms is each 30-second window not consuming? If the capacity is running at 120% of baseline (capacityUnitMs is 120% of baseCapacityUnits × 30,000), idle headroom is negative — the debt is growing, not clearing. If utilization drops below 100%, the idle difference each timepoint burns down the carry-forward.
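The headroom arithmetic can be sketched as follows. It assumes utilization stays constant at the value you observed and that no new smoothed operations land, so treat the result as an optimistic estimate; the function name and sample figures are hypothetical.

```python
import math

WINDOW_MS = 30_000  # one 30-second timepoint, in milliseconds


def windows_to_clear(overage_total_cu_ms: float, capacity_unit_ms: float, base_capacity_units: float):
    """Number of 30-second windows needed to burn the debt, or None if it is not shrinking."""
    headroom_ms = base_capacity_units * WINDOW_MS - capacity_unit_ms
    if headroom_ms <= 0:
        return None  # at or above 100% utilization: debt is flat or growing
    return math.ceil(overage_total_cu_ms / headroom_ms)


debt = 460_800_000  # hypothetical overageTotalCapacityUnitMs
used = 1_536_000    # hypothetical capacityUnitMs (80% of an F64 window)

f64 = windows_to_clear(debt, used, 64)
f128 = windows_to_clear(debt, used, 128)
print(f64, f64 * 30 / 3600)    # 1200 10.0  (windows, hours)
print(f128, f128 * 30 / 60)    # 200 100.0  (windows, minutes)
print(windows_to_clear(debt, 2_304_000, 64))  # None (running at 120%)
```

The same debt that takes about ten hours to clear on the smaller SKU clears in under two on the larger one, which is the whole case for a temporary scale-up.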
Three intervention options:
| Option | When to use | What it does | Caution |
|---|---|---|---|
| Wait it out | Utilization is falling below 100% and debt is modest (< 2× the 24h window) | Idle capacity burns the debt; throttle clears on its own | Can take hours if debt is large |
| Scale the SKU up | Utilization is still above 100% or debt is large (> 2× the 24h window) | Each timepoint gets more baseline CUs → more idle headroom per window → faster burndown | Cost: F128 bills at twice the F64 rate for as long as you stay scaled up. At PAYG rates, F128 = $23.04/hr vs F64 = $11.52/hr |
| Kill the offending workload | You can identify the specific item (Dataflow, Spark job, pipeline) still running | Stops new CU-seconds being added to the smoothing queue | Does not instantly clear existing debt; it just stops the debt from growing |
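The cost side of the scale-up option is easy to bound before you commit. A minimal sketch, assuming the $0.18 per CU-hour PAYG rate used elsewhere in this article applies in your region; the function names are hypothetical.

```python
PAYG_RATE_PER_CU_HOUR = 0.18  # USD, the rate quoted in this article; check your region


def hourly_cost_usd(capacity_units: int) -> float:
    """PAYG cost of running a SKU with the given CU count for one hour."""
    return capacity_units * PAYG_RATE_PER_CU_HOUR


def scale_up_premium_usd(from_cu: int, to_cu: int, hours: float) -> float:
    """Extra spend from running the larger SKU instead of the smaller one."""
    return (to_cu - from_cu) * PAYG_RATE_PER_CU_HOUR * hours


print(f"${hourly_cost_usd(64):.2f}")                # $11.52
print(f"${hourly_cost_usd(128):.2f}")               # $23.04
print(f"${scale_up_premium_usd(64, 128, 2):.2f}")   # $23.04 for two hours at F128
```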
Step 4 — The pause trap (do not do this)
Pausing the capacity does end the active throttle — the platform resets. But it bills all accumulated smoothed CU debt as a one-time PAYG charge at the moment the capacity is paused. When the capacity resumes, it starts with zero carry-forward. Microsoft's own documentation notes this: "When you pause your capacity, the remaining cumulative overages and smoothed operations on your capacity are summed, and added to your Azure bill" (Pause and resume your capacity, Microsoft Learn, checked June 2026). A spike of up to 288,000% of normal can appear in the capacity events for the window in which the pause occurs — all the accumulated debt, billed at once.
On PAYG, you pay the entire smoothed debt at $0.18 per CU-hour, immediately. For reserved capacities, the precise overage rating at pause is not separately documented by Microsoft — consult your reservation agreement for how carryforward overages are rated (checked June 2026).
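To see what a pause would cost before anyone presses the button, convert the outstanding debt to CU-hours. This is a rough sketch: it prices only the carry-forward you can read from overageTotalCapacityUnitMs, and because the quoted documentation says remaining smoothed operations are summed in as well, the real charge may be higher. The function name and the sample debt are hypothetical.

```python
MS_PER_HOUR = 3_600_000
PAYG_RATE_PER_CU_HOUR = 0.18  # USD, the rate quoted above


def pause_charge_usd(outstanding_cu_ms: float, rate: float = PAYG_RATE_PER_CU_HOUR) -> float:
    """Lower-bound estimate of the one-time charge for pausing with this much debt."""
    cu_hours = outstanding_cu_ms / MS_PER_HOUR
    return cu_hours * rate


# Hypothetical F64 carrying 3.4x a 24-hour window of debt:
# 3.4 * 64 CU * 86,400,000 ms = 18,800,640,000 CU-ms.
debt_cu_ms = 18_800_640_000
print(f"${pause_charge_usd(debt_cu_ms):.2f}")  # $940.03
```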
Pause only if: you need the capacity offline for maintenance and the billing hit is understood and acceptable. Never pause to clear a throttle as a first response.
Step 5 — Confirm the throttle is clearing (10–30 min post-intervention)
Via capacity events (best): Watch overageTotalCapacityUnitMs decreasing over successive 30-second windows. A consistently falling value confirms burndown. Watch overageBurndownCapacityUnitMs — it should be positive and substantial.
Via the Metrics app (lagged): Check the Throttling tab 15–20 minutes after intervention. The delay/rejection % lines should be falling toward 100% (delay easing) or below 100% (throttle lifted). Flat or still-rising lines after intervention mean your fix is not working and you need to escalate to a scale-up or additional workload kills.
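The capacity-events check in this step can be automated with a simple trend test. The sketch below assumes readings are in chronological order (oldest first, so reverse the output of a descending query) and treats four consecutive strictly falling windows as "consistently falling"; that window count is an arbitrary assumption, not a documented threshold.

```python
def is_burning_down(overage_series: list[float], min_windows: int = 4) -> bool:
    """True if the latest min_windows overage readings are strictly decreasing."""
    recent = overage_series[-min_windows:]
    if len(recent) < min_windows:
        return False  # not enough data to call a trend
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical overageTotalCapacityUnitMs readings, oldest first.
print(is_burning_down([900, 950, 940, 910, 870, 820]))  # True
print(is_burning_down([900, 880, 885, 860]))            # False
```

Because event delivery is best-effort, a missed window shortens the series rather than breaking the comparison, so a False on a short series means "keep watching", not "the fix failed".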
Step 6 — Root cause and post-mortem (after throttle clears)
Once throttling has resolved, go to the Metrics app's Compute page and drill into the Timepoint detail for the spike timepoints. The Timepoint detail report shows every operation running in a 30-second window ranked by CU impact — this is the only native way to see which specific item was the largest contributor at the moment of the overload.
Note: the Metrics app's Compute page keeps detailed throttling charts and timepoint data for 14 days. The newer Item History page (Preview, August 2025) extends item-level compute visibility to 30 days — useful for trend analysis — but does not replicate the granular timepoint drill-through. If a post-mortem spans beyond 30 days, the data is not in any native tool. This is the metrics-retention wall, and it is why extracting data to your own store before the window closes matters — see the capacity monitoring guide for the extraction patterns.
The triage-runbook summary table
The table below consolidates the entire runbook into one scannable reference. It is purpose-built for the live-throttle scenario, where every column in the Metrics app is 10–15 minutes behind the incident.
| Phase | Time from incident | Signal to read | Action |
|---|---|---|---|
| Confirm | 0–2 min | Real-Time hub capacity events: threshold % fields | Confirm throttle stage; rule out false alarm |
| Identify | 2–5 min | Metrics app Items (1 day) matrix by CU; events breakdown by workload | Find top-consuming item |
| Decide | 5–10 min | Carry-forward total + current utilization | Choose: wait / scale up SKU / kill workload |
| Do NOT | Any time | — | Pause the capacity — costly billing trap |
| Confirm clearing | 10–30 min post | Events: overageTotalCapacityUnitMs declining; Metrics app Throttling tab falling | Validate intervention worked |
| Post-mortem | After clear | Metrics app Timepoint detail (within 14-day window) | Attribute root cause; prevent recurrence |
Why the blast-radius makes triage harder
Fabric has no native per-workspace CU isolation. One workload's carry-forward debt throttles the entire capacity, affecting every team and every report on that SKU. Workspace-level surge protection shipped in preview (January 2026), but it limits a workspace's total CU consumption against an admin-set threshold: when a workspace exceeds that threshold it is placed in a Blocked state that rejects all operations, both interactive and background. It does not give each workspace a guaranteed CU reservation. The capacity-wide blast radius still applies.
This means your triage has to consider the entire capacity, not just the team that complained. The team whose report is erroring is almost certainly not the team that caused the debt. Finding the actual source requires reading the Items matrix across all workspaces on the capacity, not just the workspace of the affected report.
What to fix so you don't need this runbook next time
The underlying problem in every live-throttle triage is the same: the alerting was absent, so the first signal was user complaints instead of an automated alert. The complete preventive stack is:
- Wire the Real-Time hub capacity events to an Eventstream and Eventhouse. This gives you the 30-second live signal and a permanent history beyond 14 days.
- Set a Data Activator alert on interactiveDelayThresholdPercentage > 80, before it crosses 100 and becomes a user-visible delay. By the time the Metrics app catches up, you are already past interactive delay.
- Track the carry-forward burndown rate over time. A capacity that regularly runs a large carry-forward but clears overnight is not in danger. A capacity where the carry-forward is growing week over week is trending toward a guaranteed throttle event. This pattern is detectable from the events data; it is invisible to the Metrics app's 14-day rolling window.
- Set a SKU-scale automation rule. If interactiveRejectionThresholdPercentage exceeds 100, trigger an automatic scale-up. Scale back down once the carry-forward returns to zero. This approach costs a fraction of what interactive rejections cost in lost productivity and emergency escalation time.
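The alert and scale rules in the list above reduce to a few comparisons per summary event. The following is a sketch of the decision logic only: the function name, the action labels, and the scaled_up flag are hypothetical, and the wiring to Data Activator or to whatever performs the actual SKU change is not shown.

```python
def evaluate_summary(summary: dict, scaled_up: bool = False) -> list[str]:
    """Return the actions the preventive rules would trigger for one summary event."""
    actions = []
    if summary["interactiveDelayThresholdPercentage"] > 80:
        actions.append("alert")       # early warning, before delay becomes user-visible
    if summary["interactiveRejectionThresholdPercentage"] > 100 and not scaled_up:
        actions.append("scale_up")    # rejections active: add headroom
    if scaled_up and summary["overageTotalCapacityUnitMs"] == 0:
        actions.append("scale_down")  # debt cleared: return to the base SKU
    return actions


# Hypothetical event: past the early-warning line, rejections not yet active.
event = {
    "interactiveDelayThresholdPercentage": 85,
    "interactiveRejectionThresholdPercentage": 14,
    "overageTotalCapacityUnitMs": 120_000,
}
print(evaluate_summary(event))  # ['alert']
```

Keeping the rules in one pure function makes them easy to replay against historical events before you trust them with an automatic scale operation.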
The monitoring architecture behind those four steps is covered in Microsoft Fabric capacity monitoring. The alerting wiring specifically is in Fabric capacity throttling alerts.
Frequently asked questions
How do I tell if Microsoft Fabric is throttling right now? The Capacity Metrics app lags 10–15 minutes, so it can't confirm a live throttle. The real-time signal is the Real-Time hub Fabric capacity overview events feed: an active capacity emits a Microsoft.Fabric.Capacity.Summary event every 30 seconds carrying interactiveDelayThresholdPercentage, interactiveRejectionThresholdPercentage, and backgroundRejectionThresholdPercentage. If any of them exceeds 100%, the capacity is actively throttling.
Should I pause my Fabric capacity to stop throttling? No. Pausing does technically end the throttle by resetting the capacity, but it triggers an immediate billing event for all accumulated smoothed CU debt. On PAYG, that debt is billed at $0.18 per CU-hour; for reserved capacities, consult your reservation agreement for how carryforward overages are rated (checked June 2026). The correct fix is to scale up the SKU temporarily or stop the offending workload.
What are the stages of Fabric throttling? Fabric throttling progresses through four stages based on future-capacity time windows: overage protection (up to 10 min — no impact), interactive delay (10–60 min — 20-second wait added), interactive rejection (60 min–24 h — requests denied), and background rejection (over 24 h — all operations blocked).
How long does it take for Fabric throttling to clear on its own? Throttling clears when idle capacity burns the carry-forward CU debt to zero. This can take many hours if new workloads keep adding to the debt. Scaling the SKU up temporarily accelerates burndown by giving each 30-second timepoint more idle headroom.
Why does the Capacity Metrics app show normal utilization even when users are hitting throttling errors? Two reasons. First, the app lags 10–15 minutes. Second, throttling is triggered by carry-forward debt accumulated from smoothed background operations — a capacity can show 80% instantaneous utilization while carrying hours of smoothed debt from earlier jobs. That hidden debt triggers the throttle, not the number on the chart.
Researched with AI assistance, written and fact-checked by Jonathan Flach, verified against Microsoft Learn.