Why Dataflows Gen2 Quietly Drain Your Fabric Capacity

Q: When should I use a Spark notebook instead of Dataflows Gen2?

Use Spark notebooks when transforms run longer than about 2.5 minutes, involve large datasets, or need to run frequently. Under the Autoscale Billing model for Spark (an opt-in serverless tier), a job draws approximately 0.5 CU-hours of compute, billed for active runtime only. A CI/CD dataflow matches that cost at 150 seconds (2.5 minutes): 150 s × 12 CU/s = 1,800 CU-seconds = 0.5 CU-hours. Above 2.5 minutes, Spark is cheaper and the gap widens: at 5 minutes a CI/CD dataflow costs 1.0 CU-hours, twice the Spark autoscale rate. On standard Fabric capacity, Spark consumption depends on pool size and job duration; a medium starter pool (8 vCores = 4 CUs) running 30 minutes costs roughly 2.0 CU-hours. Neither Spark model applies a tiered rate to long runs. Use Dataflows Gen2 when the transform is short (under 2.5 minutes), data volumes are small, you need the low-code Power Query interface, or the refresh frequency is low (say, once or twice a day) and total CU consumption stays low.

A Dataflows Gen2 refresh is not free. Standard compute charges start at 12 CU per second for the first ten minutes of query duration and drop to 1.5 CU per second beyond that (Pricing for Dataflow Gen2, Microsoft Learn, checked June 2026). High-scale staging compute adds 6 CU per second, and when a supported connector fires Fast Copy, a conditional data-movement charge adds 1.5 CU per second. Stacked together, a single moderately complex dataflow can consume more capacity in 30 minutes than a Spark notebook job running the same transform under Spark's Autoscale Billing model, which the comparison below treats as a flat 0.5 CU-hours per job. The hidden costs of Microsoft Fabric playbook flags Dataflows Gen2 as one of the most reliably mis-sized ETL choices in a tenant, and the full Fabric cost-optimization playbook puts dataflow sprawl in the broader context of capacity waste. This article shows you the exact math, the CI/CD rate change that took effect in April 2026, and the decision table for choosing when to use a dataflow versus a notebook or stored procedure.

How the billing meters actually stack

Dataflows Gen2 do not use a single meter. Every refresh draws from up to four simultaneous meters, and the combination is what surprises most teams.

Standard Compute (CI/CD — the only rate for new items since April 2026)

Microsoft documented this directly: as of April 2026, all new Dataflows Gen2 items are created with CI/CD and Git integration support by default, and the option to create non-CI/CD Dataflows Gen2 is no longer available (Create your first Microsoft Fabric dataflow, Microsoft Learn, checked June 2026). The CI/CD rate uses two tiers:

For every second up to 10 minutes of query duration: 12 CU per second
For every second beyond 10 minutes: 1.5 CU per second

This tiered structure means the cost per additional second drops sharply once a query crosses 10 minutes, but total CU-seconds always rise with duration. A query that finishes in 8 minutes burns 8 × 60 × 12 = 5,760 CU-seconds. A query that runs 12 minutes burns the first 10 minutes at 12 CU/s (7,200 CU-s) plus 2 minutes at 1.5 CU/s (180 CU-s), for 7,380 CU-seconds total: 28% more CU-seconds for 50% more runtime. Each second beyond 10 minutes is billed at 1.5 CU/s, 87.5% less than the first-tier rate.
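The two-tier arithmetic above can be sketched as a small function. This is a minimal illustration that assumes only the two published rates and the 10-minute boundary quoted in this section; the function and constant names are made up for the example.

```python
# Rates quoted in this article (Pricing for Dataflow Gen2, Microsoft Learn).
TIER1_RATE_CU_PER_S = 12.0   # first 10 minutes of query duration
TIER2_RATE_CU_PER_S = 1.5    # every second beyond 10 minutes
TIER_BOUNDARY_S = 10 * 60


def cicd_standard_cu_seconds(duration_s: float) -> float:
    """CU-seconds billed by the CI/CD Standard Compute meter for one query."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    tier1_seconds = min(duration_s, TIER_BOUNDARY_S)
    tier2_seconds = max(duration_s - TIER_BOUNDARY_S, 0)
    return tier1_seconds * TIER1_RATE_CU_PER_S + tier2_seconds * TIER2_RATE_CU_PER_S


eight_min = cicd_standard_cu_seconds(8 * 60)    # 5760.0
twelve_min = cicd_standard_cu_seconds(12 * 60)  # 7380.0
```

Feeding in the per-query durations from your own refresh history gives the per-refresh CU-seconds for the standard meter only; staging and Fast Copy are separate meters.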

Standard Compute (legacy non-CI/CD — existing items only)

Existing non-CI/CD dataflows continue to run and bill at a flat 16 CU per second for the full query duration. At 16 CU/s there is no tier break: a 2-minute query burns 1,920 CU-seconds, and a 12-minute query burns 11,520 CU-seconds. The legacy rate is 33% higher than the CI/CD first-tier rate (12 CU/s) and more than ten times the CI/CD second-tier rate (1.5 CU/s) that applies past the 10-minute mark.

High Scale Compute (when staging is enabled)

When staging is turned on, queries route through a Lakehouse or Warehouse SQL engine. This adds 6 CU per second, billed at the workspace level rather than the item level. Because the meter is reported per workspace, you may not be able to isolate it to a specific dataflow in the Capacity Metrics app unless you filter carefully. For a 30-minute dataflow with staging active for the full run, that is 30 × 60 × 6 = 10,800 additional CU-seconds on top of the standard compute.

Fast Copy (Data movement)

When a Fast Copy-enabled connector is used and Fast Copy fires, the Data movement meter charges 1.5 CU per second for the duration of the copy activity, reported per item (Pricing for Dataflow Gen2, Microsoft Learn, checked June 2026). This is a conditional, time-based charge — it only applies when Fast Copy is active. Dataflows that do not use Fast Copy-capable connectors, or where Fast Copy is disabled, incur no Data movement charge.

The complete CU rate table

Engine	Rate	Scope
Standard Compute — CI/CD (first 10 min)	12 CU/second	Per item
Standard Compute — CI/CD (beyond 10 min)	1.5 CU/second	Per item
Standard Compute — non-CI/CD (legacy)	16 CU/second	Per item
High Scale Compute (staging enabled)	6 CU/second	Per workspace
Fast Copy (Data movement, conditional)	1.5 CU/second	Per item
VNET Data Gateway	4 CU (uptime)	Infrastructure

Source: Pricing for Dataflow Gen2, Microsoft Learn (checked June 2026).
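To see how the meters in the table stack for a single refresh, here is a rough sketch. It assumes you already know how many seconds staging and Fast Copy were active (the `staging_s` and `fast_copy_s` parameters are hypothetical inputs, not fields from any Microsoft API), and it leaves out the VNET Data Gateway line because that charge follows gateway uptime rather than query time.

```python
# CU-per-second rates from the rate table above.
CICD_TIER1, CICD_TIER2, TIER_BOUNDARY_S = 12.0, 1.5, 600
LEGACY_RATE = 16.0
HIGH_SCALE_RATE = 6.0
FAST_COPY_RATE = 1.5


def dataflow_cu_seconds(query_s, *, cicd=True, staging_s=0.0, fast_copy_s=0.0):
    """Total CU-seconds for one refresh across the three per-second meters.

    staging_s and fast_copy_s are assumed inputs: the seconds during which
    staging (High Scale Compute) and Fast Copy were active for this refresh.
    """
    if cicd:
        standard = (min(query_s, TIER_BOUNDARY_S) * CICD_TIER1
                    + max(query_s - TIER_BOUNDARY_S, 0) * CICD_TIER2)
    else:
        standard = query_s * LEGACY_RATE
    return standard + staging_s * HIGH_SCALE_RATE + fast_copy_s * FAST_COPY_RATE


thirty_min = 30 * 60
staged_refresh = dataflow_cu_seconds(thirty_min, staging_s=thirty_min)  # 19800.0
legacy_refresh = dataflow_cu_seconds(thirty_min, cicd=False)            # 28800.0
```

Keeping each meter as a separate term makes it easy to see which one dominates a given refresh.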

Dataflow Gen2 vs notebook vs stored procedure — a CU comparison

This comparison exists nowhere in the official documentation. It uses the published rates above plus the Spark billing rate of 0.5 CU-hours per Spark job for active runtime (Apache Spark billing, Microsoft Learn, checked June 2026), and assumes a Warehouse SQL query draws from the Fabric capacity at the standard Data Warehouse CU meter (charged only for active query execution).

Spark billing model note: The 0.5 CU-h per job figure applies to the Autoscale Billing for Spark model (an opt-in, serverless pay-as-you-go tier — separate from standard Fabric capacity). On standard Fabric capacity, Spark consumes CUs proportional to actual vCores × active session time. A single-job 30-minute run on a medium starter pool (8 vCores = 4 CUs) on standard capacity would cost approximately 4 CU × 0.5 h = 2.0 CU-hours — still cheaper than a CI/CD dataflow at 2.5 CU-h, but a narrower gap. The table below uses the Autoscale Billing rate; scope the comparison to that billing model accordingly.
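The standard-capacity figure in the note above follows from a one-line formula. The sketch below assumes the 2 vCores per CU conversion implied by "8 vCores = 4 CUs"; the function name is illustrative, and the 16-vCore case is a hypothetical larger session included only to show how the result scales.

```python
VCORES_PER_CU = 2  # conversion implied by "8 vCores = 4 CUs" in the note above


def spark_standard_cu_hours(vcores: int, active_hours: float) -> float:
    """CU-hours for a Spark session on standard capacity: CUs x active time."""
    return vcores / VCORES_PER_CU * active_hours


medium_pool_run = spark_standard_cu_hours(vcores=8, active_hours=0.5)   # 2.0
larger_session = spark_standard_cu_hours(vcores=16, active_hours=0.5)   # 4.0 (hypothetical)
```

On standard capacity the result grows with the vCores actually allocated, so the comparison with a dataflow depends on how large the session is.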

Scenario: a moderate ETL transform — 30 minutes of processing, no staging, CI/CD dataflow.

Approach	Active duration	CU calculation	Total CU-seconds	CU-hours	Estimated cost (PAYG)
Dataflow Gen2 (CI/CD, no staging)	30 min	600s × 12 CU/s + 1,200s × 1.5 CU/s	9,000 CU-s	2.5 CU-h	~$0.45 est.
Dataflow Gen2 (legacy non-CI/CD)	30 min	1,800s × 16 CU/s	28,800 CU-s	8.0 CU-h	~$1.44 est.
Dataflow Gen2 (CI/CD + staging)	30 min	9,000 CU-s standard + 10,800 CU-s high-scale	19,800 CU-s	5.5 CU-h	~$0.99 est.
Spark notebook (Autoscale Billing)	30 min	0.5 CU-h per job × 1 job	1,800 CU-s	0.5 CU-h	~$0.09 est.
Warehouse stored procedure	30 min	Billed only for active SQL engine time; typical 1-2 CU/s for DW queries	~1,800–3,600 CU-s	0.5–1.0 CU-h	~$0.09–$0.18 est.

All cost estimates are computed from $0.18/CU-hour (PAYG, as of June 2026). They are estimates: your actual consumption depends on query complexity, data volume, and node configuration. Spark billing: Apache Spark billing, Microsoft Learn. Warehouse billing: Fabric operations, Microsoft Learn (both checked June 2026).
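The CU-hour and dollar columns in the table can be reproduced with a short script. This is only a sketch of the arithmetic: it uses the rates and the $0.18/CU-hour price quoted in this article, and the dictionary keys are labels chosen for the example.

```python
PAYG_USD_PER_CU_HOUR = 0.18  # price used in the table above (as of June 2026)

# CU-seconds for the 30-minute scenario, built from the published rates.
scenarios_cu_seconds = {
    "Dataflow Gen2 (CI/CD, no staging)": 600 * 12 + 1200 * 1.5,
    "Dataflow Gen2 (legacy non-CI/CD)": 1800 * 16,
    "Dataflow Gen2 (CI/CD + staging)": 600 * 12 + 1200 * 1.5 + 1800 * 6,
    "Spark notebook (0.5 CU-h per job)": 0.5 * 3600,
}


def to_cu_hours(cu_seconds: float) -> float:
    return cu_seconds / 3600


def estimated_cost_usd(cu_seconds: float) -> float:
    return round(to_cu_hours(cu_seconds) * PAYG_USD_PER_CU_HOUR, 2)


table = {name: (to_cu_hours(cu_s), estimated_cost_usd(cu_s))
         for name, cu_s in scenarios_cu_seconds.items()}

for name, (cu_hours, usd) in table.items():
    print(f"{name:<36} {cu_hours:>4.1f} CU-h  ~${usd:.2f}")
```

Swapping in your own region's CU-hour price or a reserved rate changes only the constant at the top.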

What the table shows:

Under Autoscale Billing, a 30-minute Spark notebook job costs roughly one fifth of a CI/CD dataflow of the same duration on standard compute, and about one sixteenth of the legacy non-CI/CD rate. With staging enabled, the CI/CD dataflow costs about 11 times the Spark figure.

For short transforms (say, a 2-minute refresh) the comparison flips. The table counts the Spark job at a flat 0.5 CU-hours under Autoscale Billing (billed for active session time, with no per-minute minimum like pipelines). A 2-minute CI/CD dataflow query bills 120 s × 12 CU/s = 1,440 CU-seconds = 0.4 CU-hours, slightly cheaper than Spark autoscale for that run. The crossover point is 2.5 minutes: 150 s × 12 CU/s = 1,800 CU-seconds = exactly 0.5 CU-hours, matching the Spark autoscale charge. Above 2.5 minutes the gap grows steadily in Spark's favor: at 5 minutes a CI/CD dataflow costs 1.0 CU-hours, which is already twice the Spark autoscale rate.
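The crossover can also be computed by inverting the tiered rate. The sketch below assumes the alternative's cost is a fixed number of CU-hours (as with the 0.5 CU-h Autoscale figure used here) and returns the dataflow query duration at which the two are equal; `breakeven_seconds` is a name invented for this example.

```python
TIER1_RATE, TIER2_RATE, TIER_BOUNDARY_S = 12.0, 1.5, 600


def breakeven_seconds(alternative_cu_hours: float) -> float:
    """Query duration at which a CI/CD dataflow has used the given CU-hours."""
    target_cu_s = alternative_cu_hours * 3600
    tier1_cap_cu_s = TIER_BOUNDARY_S * TIER1_RATE  # 7,200 CU-s at the 10-minute mark
    if target_cu_s <= tier1_cap_cu_s:
        return target_cu_s / TIER1_RATE
    # Past the boundary, the remainder accrues at the second-tier rate.
    return TIER_BOUNDARY_S + (target_cu_s - tier1_cap_cu_s) / TIER2_RATE


spark_autoscale_breakeven = breakeven_seconds(0.5)  # 150.0 seconds
```

Any dataflow query shorter than the returned duration consumes fewer CU-seconds than the fixed-cost alternative; anything longer consumes more.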

Scenario: ten daily refreshes of the same 30-minute dataflow (CI/CD, no staging)

10 × 2.5 CU-h = 25 CU-hours per day = 750 CU-hours per month (est.). At $0.18: ~$135/month from a single dataflow item refreshing ten times daily. If that same logic ran as a scheduled Spark notebook ten times daily: 10 × 0.5 CU-h = 5 CU-hours per day = 150 CU-hours per month = ~$27/month. The difference: ~$108/month per item, from a single moderate transform running ten times a day.
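The monthly projection is a straight multiplication, sketched below. It assumes a 30-day month and the $0.18/CU-hour PAYG price used throughout this article; both are parameters you would replace with your own values.

```python
def monthly_cost_usd(cu_hours_per_run: float, runs_per_day: int,
                     days_per_month: int = 30,
                     usd_per_cu_hour: float = 0.18) -> float:
    """Estimated monthly cost for one scheduled item."""
    monthly_cu_hours = cu_hours_per_run * runs_per_day * days_per_month
    return round(monthly_cu_hours * usd_per_cu_hour, 2)


dataflow_monthly = monthly_cost_usd(2.5, runs_per_day=10)        # 135.0
spark_monthly = monthly_cost_usd(0.5, runs_per_day=10)           # 27.0
monthly_difference = round(dataflow_monthly - spark_monthly, 2)  # 108.0
```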

The CI/CD vs legacy rate distinction (April 2026)

This is the most important operational change for teams managing existing dataflows. Before April 2026, Dataflows Gen2 existed in two forms: CI/CD (with Git integration) and non-CI/CD (the default for new items). Since April 2026, you can no longer create the non-CI/CD type. Every new item is CI/CD, so every new ETL workload benefits from the tiered rate instead of the flat 16 CU/s legacy rate.

Existing non-CI/CD dataflows are not automatically migrated. They keep running and billing at 16 CU/second. That means a team with a mix of old and new dataflows may have two very different cost profiles running side-by-side, which looks like noise in the Capacity Metrics app unless you filter by item.

The migration path is straightforward: recreate the non-CI/CD dataflows as CI/CD items, connect them to your Git repo, and decommission the old ones. For items running transforms under 10 minutes, the CI/CD rate (12 CU/s) is 25% cheaper than legacy (16 CU/s). For items running longer, the saving grows with every second past the 10-minute mark, because the CI/CD rate drops to 1.5 CU/s while the legacy rate stays at 16 CU/s.
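To estimate what a given legacy item would save after migration, the two rate structures can be compared directly. This sketch assumes the query duration stays the same after the item is recreated as CI/CD, which may not hold in practice; the function names are illustrative.

```python
def legacy_cu_seconds(query_s: float) -> float:
    return query_s * 16.0  # flat legacy non-CI/CD rate


def cicd_cu_seconds(query_s: float) -> float:
    # 12 CU/s for the first 600 seconds, 1.5 CU/s afterwards.
    return min(query_s, 600) * 12.0 + max(query_s - 600, 0) * 1.5


def migration_saving_pct(query_s: float) -> float:
    """Percentage of standard-compute CU-seconds saved by moving to CI/CD."""
    if query_s <= 0:
        raise ValueError("query_s must be positive")
    return round((1 - cicd_cu_seconds(query_s) / legacy_cu_seconds(query_s)) * 100, 2)


five_min_saving = migration_saving_pct(5 * 60)     # 25.0
thirty_min_saving = migration_saving_pct(30 * 60)  # 68.75
```

Ranking legacy items by this percentage multiplied by their refresh frequency is one way to decide which to migrate first.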

When to use Dataflows Gen2 vs alternatives

Situation	Recommendation
Short transforms, under 2.5 minutes, low-code requirement	Dataflows Gen2 (CI/CD): at or below Spark autoscale cost, easier to author
Complex transforms, 10+ minutes per refresh	Spark notebook or Warehouse stored procedure: significantly fewer CUs per run
High-frequency refreshes (10+ per day) on heavy transforms	Warehouse stored procedure: low per-execution cost (billed only for active SQL engine time), reusable
Workloads still on legacy non-CI/CD dataflows	Migrate to CI/CD items or to Spark; the legacy rate is the most expensive option
Staging enabled for a wide transform	Profile before committing: staging adds 6 CU/s, billed per workspace, on top of standard compute
VNET gateway requirement	Factor in the 4 CU infrastructure uptime charge; assess whether gateway sharing reduces cost

For the broader pipeline vs ADF cost question, see Fabric vs ADF pipeline cost. For the full cost-reduction playbook that includes dataflow sprawl as one of its line items, see how to reduce Microsoft Fabric costs.

Finding the waste in the Capacity Metrics app

Dataflows Gen2 appear in the Metrics app under three separate meter lines: Dataflows Standard Compute, High Scale Dataflow Compute (workspace-level), and Data movement (Fast Copy). If you see unexplained background CU burn, filter for Dataflow Gen2 items in the item kind selector and look for:

Items refreshing far more frequently than business value justifies
Long-running queries that cross the 10-minute tier boundary repeatedly
Workspace-level High Scale Compute that dwarfs the per-item standard compute

The Capacity Metrics app stores compute detail for 14 days. That window is enough to establish a baseline per-item CU cost and project it forward. The Dataflow refresh history (available in the workspace) gives you per-query duration needed to apply the formulas above.
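One way to turn refresh history into a review list is sketched below. The `RefreshProfile` rows are a hypothetical hand-built summary (average query duration and refreshes per day per item), not an export format of the Capacity Metrics app, and the 5 CU-hour daily budget is an arbitrary threshold chosen for illustration.

```python
from dataclasses import dataclass

TIER1_RATE, TIER2_RATE, TIER_BOUNDARY_S = 12.0, 1.5, 600


@dataclass
class RefreshProfile:
    """Hypothetical summary row built by hand from dataflow refresh history."""
    item: str
    avg_query_s: float
    refreshes_per_day: float


def daily_cu_hours(p: RefreshProfile) -> float:
    per_run_cu_s = (min(p.avg_query_s, TIER_BOUNDARY_S) * TIER1_RATE
                    + max(p.avg_query_s - TIER_BOUNDARY_S, 0) * TIER2_RATE)
    return per_run_cu_s * p.refreshes_per_day / 3600


def review_candidates(profiles, daily_cu_hour_budget=5.0):
    """Items over the (arbitrary) daily budget or crossing the 10-minute tier."""
    flagged = []
    for p in profiles:
        reasons = []
        if daily_cu_hours(p) > daily_cu_hour_budget:
            reasons.append("over daily CU-hour budget")
        if p.avg_query_s > TIER_BOUNDARY_S:
            reasons.append("crosses 10-minute tier")
        if reasons:
            flagged.append((p.item, round(daily_cu_hours(p), 2), reasons))
    return sorted(flagged, key=lambda row: row[1], reverse=True)


profiles = [
    RefreshProfile("sales_daily", avg_query_s=90, refreshes_per_day=2),
    RefreshProfile("orders_hourly", avg_query_s=240, refreshes_per_day=24),
    RefreshProfile("ledger_rebuild", avg_query_s=1800, refreshes_per_day=10),
]
flagged = review_candidates(profiles)
```

The item names and numbers here are made up; the point is that frequency and duration together, not either alone, determine which dataflows deserve a closer look.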

The named enemy: the attribution void

Dataflows Gen2 cost is a textbook case of the attribution void — the inability to tell which specific workload is eating your capacity. The High Scale Compute meter bills at the workspace level, not the item level, which means if you have ten dataflows in one workspace, the workspace-level staging charge is pooled. You can see the total; you cannot easily see which dataflow drove it without cross-referencing item-level refresh history with the workspace-level meter. This is the same attribution gap that makes per-user cost reporting impossible natively, and it is why migrating heavy ETL to Spark notebooks or stored procedures can simplify both your cost profile and your cost visibility: Spark charges land per-item in the metrics app, and stored procedures report as Warehouse queries traceable to a specific item.
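The cross-referencing described above can be approximated with a pro-rata split. This is a heuristic sketch, not how Fabric attributes the charge: it assumes you have the workspace-level High Scale Compute total and, from refresh history, the seconds each staged dataflow ran, and it divides the total in proportion to those seconds.

```python
def allocate_high_scale(workspace_cu_seconds: float,
                        staged_seconds_by_item: dict[str, float]) -> dict[str, float]:
    """Split a workspace-level staging charge across items by staged runtime.

    Assumption: each item's share is proportional to its staged seconds.
    """
    total_staged_s = sum(staged_seconds_by_item.values())
    if total_staged_s <= 0:
        raise ValueError("no staged runtime to allocate against")
    return {item: workspace_cu_seconds * seconds / total_staged_s
            for item, seconds in staged_seconds_by_item.items()}


# Hypothetical workspace: 10,800 CU-s of High Scale Compute, two staged dataflows.
allocation = allocate_high_scale(
    10_800, {"ledger_rebuild": 1200, "orders_hourly": 600}
)
# {'ledger_rebuild': 7200.0, 'orders_hourly': 3600.0}
```

A proportional split is only as good as the duration data behind it, but it gives a first estimate of which dataflow is driving the pooled charge.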

SpendWeave reads your Capacity Metrics data and maps CU consumption to item kind, so the dataflow vs notebook cost split is visible in your actual numbers rather than a generic benchmark.

Frequently asked questions

How much do Dataflows Gen2 cost in Microsoft Fabric? Dataflows Gen2 pull from your Fabric capacity using up to four meters simultaneously. Standard compute for a CI/CD dataflow runs at 12 CU per second for the first 10 minutes of query duration, dropping to 1.5 CU per second beyond that. High-scale compute (when staging is enabled) adds 6 CU per second on top, billed per workspace. Fast Copy (when a supported connector is used and Fast Copy is enabled) charges 1.5 CU per second for the duration of the copy activity; this is a conditional, time-based charge, not a standing overhead. A VNET data gateway, where one is required, adds a 4 CU uptime charge. All figures are from Microsoft Learn, checked June 2026.

Are Dataflows Gen2 more expensive than Spark notebooks? For heavy ETL (transforms running more than a few minutes, large dataset sizes, or repeated high-frequency refreshes), yes, substantially. Under Autoscale Billing for Spark, a 30-minute notebook job draws roughly 0.5 CU-hours of compute (billed only for active runtime, no idle cost). The same 30-minute dataflow transform on standard CI/CD compute costs about 2.5 CU-hours, five times as much. The gap widens further with staging enabled. For complex, repeated heavy transforms, Spark notebooks or Warehouse stored procedures are cheaper at scale.

What is the difference between CI/CD and non-CI/CD Dataflows Gen2 pricing? Since April 2026, all new Dataflows Gen2 are CI/CD by default — the non-CI/CD item type can no longer be created. The CI/CD rate uses a two-tier model (12 CU/s for the first 10 minutes, then 1.5 CU/s). Legacy non-CI/CD dataflows run at a flat 16 CU/s for the full duration. Existing non-CI/CD dataflows continue to work, but any new ETL work is automatically at the CI/CD rate.

Does enabling staging in Dataflows Gen2 increase cost? Yes. Staging routes queries through a Lakehouse or Warehouse SQL engine and adds 6 CU per second on top of your standard compute charges. That additional meter is billed at the workspace level, so it can be harder to isolate in the Capacity Metrics app. For wide or complex transforms, staging can improve performance, but the cost premium is real and compounds with long query durations.

When should I use a Spark notebook instead of Dataflows Gen2? Use Spark notebooks when transforms run longer than about 2.5 minutes, involve large datasets, or need to run frequently. Under the Autoscale Billing model for Spark (an opt-in serverless tier), a job draws approximately 0.5 CU-hours of compute, billed for active runtime only. On standard Fabric capacity, Spark consumption depends on pool size and job duration; a medium starter pool running 30 minutes costs roughly 2.0 CU-hours. Neither Spark model applies a tiered rate to long runs. Use Dataflows Gen2 when the transform is short (under 2.5 minutes), data volumes are small, you need the low-code Power Query interface, or the refresh frequency is low, once or twice a day at most.

Researched with AI assistance, written and fact-checked by Jonathan Flach, verified against Microsoft Learn.