Why Modern Data Engineering Is Failing Audits

SCD Type 2 is painful. Avoiding it is negligent.

I’m going to call out your favorite data influencer right now, and I might get canceled for it, but this concept is too dangerous to ignore.

That is why you fail. ~Yoda

I read the article on DataExpert.io. It is persuasive. It points out that managing valid_from and valid_to dates manually in SQL is a nightmare. It correctly notes that backfilling history in complex Type 2 tables is miserable.

And, to be fair to the author, if you are analyzing “Likes” or “Friend Requests,” they are absolutely right.

But here’s the reality. You do not work at a social media company.

If you are building a dashboard for “Daily Active Users” at Facebook, following this advice will save you time. But if you work in finance, healthcare, insurance, or operations (like 99% of us do) following that advice won’t just save you time. It might end your career.

Today I’m going to tell you the difference between “Product Engineering” and “Production Reality.”

The “Snapshot” Seduction

The argument for killing SCD 2 is simple: Storage is cheap.

Instead of tracking changes row-by-row, just append a full copy of the table every night with a ds (date stamp) column. When you need to know what the world looked like on November 14th, you just query WHERE ds = '2024-11-14'. (The reality is, with time travel it is even simpler than this… but that is another gap I don't even want to get into here.)
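
To make the pattern concrete, here is a rough sketch; the table and column names (dim_users_snapshot, source_users) are mine, not anything from the article:

-- Nightly append: write a full copy of the source, stamped with today's date
INSERT INTO dim_users_snapshot
SELECT current_date() AS ds, u.*
FROM source_users u;

-- "What did the world look like on November 14th?"
SELECT *
FROM dim_users_snapshot
WHERE ds = '2024-11-14';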

For Product Metrics, this is brilliant.

  • Trend Analysis: “How many followers did User X have last month?” Easy.

  • Backfills: “We need to re-process metrics.” Just re-read the partition. (Again, time travel)

  • Simplicity: No complex merge logic. No overlapping date bugs.

If your job is to calculate “Retention,” Snapshots are the correct tool. But “Retention” is a derivative metric. It is not the State of Record.

The Reality For The Non-Top 1%: Human-Edited Data

The “Snapshot” approach assumes that the state at the time of your batch run is the only state that matters. It treats intra-day volatility as noise. And that is where the rubber meets the road.

In my world, and yours, intra-day volatility is not noise. It is where reality lives. It’s where auditability lives, and it is the difference between getting promoted and getting laid off.

Let’s look at a couple Salesforce scenarios (because we all have some SFDC PTSD).

Scenario 1: The Fraudulent Close

  • 10:00 AM: Sales Rep opens a deal. Value: $10,000.

  • 12:00 PM: Rep talks to the client. They inflate the value to $500,000 to hit a quota milestone and mark it Closed Won.

  • 4:00 PM: The Regional Sales Director catches the error (it fails the sniff test) and rejects it.

  • 4:30 PM: Rep panics, edits the deal back to $0, and closes it as Closed Lost.

  • 2:00 AM: Your pipeline runs and takes a Snapshot.

Scenario 2: The Fat Finger

  • 9:30 AM: Sales Rep updates a $50,000 deal to $80,000.

  • 10:05 AM: Sales Rep accidentally updates the same deal to $25,000 (wrong record).

  • 11:32 AM: Sales Rep creates a new lead for $25,000.

  • 2:00 PM: Sales Rep starts looking for the $80,000 lead but can’t find it. Decides to “wait until tomorrow” to find it on the Power BI report.

  • 2:00 AM: Your pipeline runs and takes a Snapshot.

  • 9:00 AM: The Rep still can’t find the deal and blames the data team for “bad data.”

The Results

The Snapshot Engineer (The Victim):

  • Scenario 1: Your report shows the deal was $0 on that day. You have zero record that a $500k pipeline opportunity existed. The Director doesn’t have the “paper trail” to audit the Rep’s behavior. You have lost data, and HR won’t allow a reprimand without the receipts.

  • Scenario 2: The sales team loses trust in your data team. You potentially missed $55,000 of opportunity. Your funnel reporting is wrong, and you have no way to prove it wasn’t a pipeline bug.

The Senior Engineer (The Hero):

  • Scenario 1: Because you used SCD Type 2 (fed by Change Data Capture), you have three rows. You can reconstruct the crime scene. Sales Rep gets fired for unethical behavior.

  • Scenario 2: You query the history and show the Rep exactly when they updated the wrong lead. The Rep corrects the account, funnel reports remain accurate, and they close the $80,000 lead.
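
That “query the history” step in Scenario 2 is nothing exotic, by the way. Here is a sketch of the audit query, assuming a hypothetical dim_opportunities_scd table (every name and column here is illustrative):

-- Every version of the deal, in order, with the window during which each version was true
SELECT
    opportunity_id,
    amount,
    stage,
    last_modified_by,
    valid_from,
    valid_to
FROM dim_opportunities_scd
WHERE opportunity_id = :deal_id   -- the deal the Rep swears they updated
ORDER BY valid_from;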

Code to Cash Rule: The business value here isn’t the final number. The business value is the audit trail of how we got there. If you cannot produce that trail, you are exposing the company to risk.

The Hidden Cost of “Cheap” Storage

The second argument for Snapshots is that “Storage is cheap.”

This is true if you are a startup with 50GB of data. It is a lie if you are an enterprise with 500TB.

You don’t go out and buy a second loaf of bread just because you have empty counter space. It is still a waste if you know you aren’t going to use it before it molds.

But let’s do the math.

  • Scenario: You have a user table with 100 Million users.

  • Change Rate: Only 1% of users change their profile daily.

  • The Snapshot Approach: You write 100 Million rows every single night. 99% of your write cost is wasted on duplicate data (Bread that is molding).

  • The SCD Type 2 Approach: You write 1 Million rows (the changes).

Over a year, the Snapshot approach writes 36.5 Billion rows. The SCD approach writes roughly 365 Million.

When your CFO asks why the Snowflake bill doubled, “The SCD logic was complex” is not an acceptable answer. And with CDF and CDC available, it’s not even an accurate one.

The Verdict: The “Hybrid” Architecture

The article argues SCD 2 is harmful because “backfilling is hard” and “querying is annoying.”

Let me be brutal… SCD 2 is only hard because you are writing bad SQL.

If you are manually writing UPDATE statements to close out date ranges in 2025, you are doing it wrong. Modern table formats (Delta Lake, Iceberg) and tools (Databricks, Snowflake, dbt snapshots) handle the merge logic for you.
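
To show what “handled for you” looks like, here is a minimal sketch of the nightly SCD 2 maintenance as a Delta-style MERGE plus an INSERT, assuming a hypothetical staging table stg_users that holds only today’s changed rows (all names are illustrative):

-- Step 1: close out the current version of any user whose tracked attribute changed
MERGE INTO dim_users_scd AS tgt
USING stg_users AS src
  ON tgt.user_id = src.user_id
 AND tgt.is_current = true
WHEN MATCHED AND tgt.risk_profile <> src.risk_profile THEN
  UPDATE SET valid_to = src.updated_at,
             is_current = false;

-- Step 2: open a new current row for new users and for users we just closed out
INSERT INTO dim_users_scd (user_id, risk_profile, valid_from, valid_to, is_current)
SELECT src.user_id, src.risk_profile, src.updated_at, NULL, true
FROM stg_users src
LEFT JOIN dim_users_scd cur
  ON cur.user_id = src.user_id
 AND cur.is_current = true
WHERE cur.user_id IS NULL;

dbt snapshots generate essentially this same pattern for you from a few lines of config, which is the point: the merge logic is a solved problem.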

The Production-Grade Standard: Don’t choose between Snapshots and SCD 2. That is a false dichotomy. Layer them.

  1. The Immutable Log (Your Truth Layer): Use CDC or high-frequency ingestion to capture every change. Store this raw. This is your bronze table.

  2. The SCD Layer (Your Audit Protection): Build an SCD Type 2 table from the logs. This minimizes storage and maximizes auditability. This is your silver table.

  3. The Snapshot View (Your Convenience Store): Do not materialize a snapshot table. Create a View on top of your SCD table that simulates snapshot logic. This is probably your gold table.

Here’s the code:

CREATE VIEW dim_users_daily_snapshot AS
SELECT
    date_spine.ds,
    u.*
FROM date_spine
JOIN dim_users_scd u
    ON date_spine.ds >= u.valid_from
   AND (date_spine.ds < u.valid_to OR u.valid_to IS NULL);

Now your Analysts get their easy “Snapshot” queries (WHERE ds = ‘...’), but your Auditors get their full history, and your CFO gets a lower storage bill.
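
In practice, the two audiences simply hit two different objects (names carried over from the view above; the specific user_id is just an example):

-- The Analyst: the same easy query they'd run against a materialized snapshot
SELECT *
FROM dim_users_daily_snapshot
WHERE ds = '2024-11-14';

-- The Auditor: the full version history for one user, straight from the SCD table
SELECT *
FROM dim_users_scd
WHERE user_id = 42
ORDER BY valid_from;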

The “DeLorean” Fix: Hydrating SCD 2 from the Log

The critics of SCD 2 complain that backfilling is hard. They are right, if you are parsing raw CSV files from six months ago.

But if you are on Databricks and you were smart enough to enable Change Data Feed (CDF) or you can rely on Delta Time Travel, backfilling isn’t complex or difficult.

It is a window function.

Here is the scenario: You started with a Snapshot table (dim_customers). You didn’t build SCD 2. Now Legal needs to know exactly when a customer changed their risk_profile last month.

If you enabled delta.enableChangeDataFeed = true, you don’t need to beg the DBA for backups. You just query the feed.
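
For reference, turning the feed on is a single table property. This is standard Delta Lake syntax; the table name is just the one used in the query below:

-- Enable the Change Data Feed on an existing Delta table (captures changes from this point forward)
ALTER TABLE source_dim_customers
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);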

The “Magic” Query: This SQL transforms a raw transaction log (CDF) into an SCD Type 2 table in one pass.

/*
Step 1: Read the raw history from the Delta Change Data Feed.
We grab the state of the row AFTER every update (update_postimage) or insert.
*/

WITH raw_changes AS (
    SELECT
        customer_id,
        risk_profile,
        -- The timestamp of the commit becomes our 'valid_from'
        _commit_timestamp AS change_ts
    FROM table_changes('source_dim_customers', '2024-01-01', '2024-02-01')
    WHERE _change_type IN ('insert', 'update_postimage')
),

/*
Step 2: Use window functions to calculate the 'valid_to' date.
This is where the magic happens. The 'valid_to' of this row
is simply the 'valid_from' of the NEXT row.
*/

scd_logic AS (
    SELECT
        customer_id,
        risk_profile,
        change_ts AS valid_from,
        -- Look ahead to the next change for this ID.
        -- If no future change exists, this is the current record (valid_to defaults to a far-future date).
        LEAD(change_ts, 1, TIMESTAMP '9999-12-31') OVER (PARTITION BY customer_id ORDER BY change_ts) AS valid_to,
        -- Simple flag logic
        CASE WHEN LEAD(change_ts) OVER (PARTITION BY customer_id ORDER BY change_ts) IS NULL THEN true
             ELSE false
        END AS is_current
    FROM raw_changes
)

-- Step 3: Write your backfilled history
INSERT INTO dim_customers_scd2
SELECT * FROM scd_logic;
-- But don't be lazy… actually write out your fields here!


Why This Matters:

  • Zero “Stitching”: We didn’t have to join 30 daily snapshot tables. We read the stream of changes.

  • Total Fidelity: Unlike snapshots, which miss updates that happen and revert between midnight runs, CDF captures every commit.

  • The Remaining Risk: This relies on your retention settings. If you ran VACUUM with a short retention period, this history is gone. That is why Anti-Fragility means configuring your storage retention before you need it.
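
Configuring that up front looks like this. These are standard Delta table properties; the 90-day window is only an example, so size it to your audit requirements:

-- Keep the transaction log and removed data files (which also back time travel and the change feed)
-- around longer than the defaults
ALTER TABLE source_dim_customers
SET TBLPROPERTIES (
    delta.logRetentionDuration = 'interval 90 days',
    delta.deletedFileRetentionDuration = 'interval 90 days'
);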

Final Thought

When a hiring manager asks you about data modeling, they aren’t testing your syntax. They are testing your judgment.

The “Junior” answer is: “Snapshots are easier to write.” The “Senior” answer is: “Snapshots are convenient, but I use SCD 2 for the underlying storage to ensure auditability and manage costs at scale.”

Stop optimizing for “easy code.” Start optimizing for “keeping your job.”


About the Author

To the 100+ new subscribers who joined this week: You are not here for a coding tutorial. You are here for a survival guide.

I am a Senior Data Strategy and Engineering Leader with over 25 years of experience building systems that handle billions of dollars and petabytes of data. I have survived outages, divestitures, budget cuts, and platform migrations.

I am here to share the brutal truth, Data Platform PTSD, and lessons that bootcamps won’t show you.

Gambill Data operates on three core pillars:

  1. Code to Cash: If your code doesn’t solve a business problem or save money, it doesn’t have a reason for existence.

  2. Anti-Fragility: The goal is not to avoid errors; it is to build systems, revenue streams, and careers that survive them.

  3. Strategy > Syntax: Anyone can write code. Few can architect for production.

If you are ready to stop thinking like a Junior Engineer and start thinking like a Strategic Partner, you are in the right place.

Join me in the trenches:

Get to work.
