Data anomalies are the termites of modern business. By the time you see the hole in the wall, the structure behind it is already compromised. Unlike a system outage or a regulatory fine, anomalies do not announce themselves. They accumulate quietly, passed along from one process to the next, until the point at which they surface is no longer the point at which they can be contained.

In an AI-driven workflow, this problem is structurally worse than it has ever been. A single decimal error, a mislabelled category, a stale data source used to train a model — any of these can scale from a minor inconsistency to an operational crisis in the time it takes a system to complete a run. We have written about the importance of foundational data skills as the prerequisite for working safely with AI. This article goes one level deeper into a specific and practical consequence of those skills: the ability to catch what is wrong before it compounds into something that cannot be undone.

Anomaly detection is not glamorous work. It rarely comes with a presentation slot. Nobody builds a case study around the crisis that did not happen. But for any professional working with AI-generated outputs — and that is increasingly most professionals — it is one of the highest-value skills available. Consider it the ultimate job-security skill for the modern data-literate worker.

When the World Moves and the Model Stays Still

Data is not a static archive. It is a living record of a world that keeps changing. A model trained on last year's patterns, still running on this year's reality, is not making predictions anymore — it is making assumptions. The longer the gap between when the model learned and when it is being applied, the more dangerous those assumptions become.

This is what data professionals call drift: the gradual divergence between the conditions a model was trained on and the conditions it is now operating in. It is insidious precisely because it is gradual. No single data point looks catastrophically wrong. The model keeps producing outputs. They keep getting approved. And the gap between the model's picture of reality and reality itself quietly widens — the data equivalent of termites working through a load-bearing wall.

The practical response to drift is not to rebuild models constantly — that is neither feasible nor always necessary. It is to establish guardrail metrics: defined thresholds beyond which a shift in the data requires a human to pause and ask why. If a weekly revenue figure moves by more than one standard deviation without an obvious external catalyst, that is not a number to explain away in a footnote. It is a signal to investigate before the next automated process acts on it.
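A guardrail metric of this kind is straightforward to express in code. The sketch below is illustrative (the revenue figures are invented, and the one-standard-deviation threshold mirrors the rule described above rather than being a universal constant):

```python
import statistics

def guardrail_check(history, new_value, n_sigma=1.0):
    """Return True when a new reading deviates from the historical mean
    by more than n_sigma standard deviations, the signal to pause and
    investigate before any automated process acts on it."""
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(new_value - mean) > n_sigma * sigma

# Illustrative weekly revenue figures
weeks = [102_000, 98_500, 101_200, 99_800, 100_400, 97_900]
print(guardrail_check(weeks, 96_000))   # a drop beyond one sigma: flag it
print(guardrail_check(weeks, 99_000))   # within normal variation: proceed
```

The threshold is a starting point, not a verdict: a `True` here means a human asks why, not that the number is necessarily wrong.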

Case Study: Zillow Offers — 2021
$300M+ inventory write-down, division shuttered

Zillow's algorithmic home-buying programme, Zillow Offers, was built on a pricing model designed to predict residential property values accurately enough to buy homes, renovate them, and resell at a profit. For a period, the model performed well. Then the housing market began to cool, and the model did not notice in time.

The algorithm had been trained on a period of sustained bullish trends. As conditions shifted, it continued making purchase offers that reflected the market as it had been rather than the market as it was becoming. Zillow found itself holding a growing inventory of homes it had overpaid for, in a market that was no longer willing to pay the prices the model had anticipated. By the time the scale of the problem was visible, the company had accumulated losses that required shuttering the entire division and writing down over $300 million in inventory.

The core failure was not a technical one. It was an absence of drift monitoring — of the kind of ongoing human scrutiny that would have flagged, at an earlier stage, that the model's predictions were diverging from market reality in a systematic and worsening way.

Not Every Outlier Is a Problem. Some of Them Are Clues.

Drift is a slow-moving phenomenon. Outliers are its opposite: sudden, jarring departures from expected patterns that appear in a single data point or a single time period. The challenge with outliers is not finding them. With the right tools, they are easy to spot. The challenge is knowing what they mean — whether you are looking at a genuine signal worth investigating or statistical noise worth ignoring.

Not all outliers are threats. Some are what you might call golden outliers: a new sales channel that has suddenly taken off, a product that has gone viral in an unexpected market, a campaign that dramatically outperformed its projections. Treating these as errors to be corrected would be a costly mistake. The less welcome category — a toxic outlier — is a system fault dressed up as a data point: a volume spike caused by a bot, a return rate made implausibly perfect by a data entry shortcut, a transaction that no human being actually initiated. The discipline of outlier triage is learning to tell one from the other before acting on either.

A simple and underused practice for this is random auditing — selecting a small percentage of automated entries, perhaps five percent, for manual review on a regular basis. If that audit reveals a customer showing four thousand interactions in a single hour, you have found a bot, not a superfan. If it reveals a product category with implausibly perfect return rates, you have found a data entry shortcut someone adopted three months ago and never mentioned. These things do not show up in aggregate dashboards. They show up when someone actually looks.
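That audit loop is easy to sketch. In the illustrative Python below, the five-percent sample and the 1,000-interactions-per-hour ceiling are assumptions chosen for the example, not fixed rules:

```python
import random

def random_audit(entries, fraction=0.05, max_per_hour=1_000, seed=None):
    """Draw a random slice of automated entries for manual review, and
    pre-flag any that are mechanically implausible (a 'customer' with
    thousands of interactions in an hour is a bot, not a superfan)."""
    rng = random.Random(seed)
    k = max(1, int(len(entries) * fraction))
    sample = rng.sample(entries, k)
    flagged = [e for e in sample if e["interactions_per_hour"] > max_per_hour]
    return sample, flagged

# 99 ordinary customers plus one bot-like entry
entries = [{"id": i, "interactions_per_hour": 12} for i in range(99)]
entries.append({"id": "bot-1", "interactions_per_hour": 4_000})
sample, flagged = random_audit(entries, seed=42)
```

The pre-flagging only narrows the reviewer's attention; the value of the exercise is that a human looks at the sample at all.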

Case Study: The 2010 Flash Crash
~$1 trillion in market value temporarily erased in roughly 36 minutes

On 6 May 2010, the US stock market experienced one of the most dramatic single-day collapses in its history. In the space of roughly 36 minutes, the Dow Jones Industrial Average dropped nearly 1,000 points before partially recovering. Approximately a trillion dollars in market value temporarily disappeared.

The trigger was a large automated sell order — a single institutional trade executed by an algorithm without price or time constraints, which flooded the market with sell pressure. High-frequency trading algorithms, interpreting this as a genuine market trend rather than a mechanical outlier, responded by pulling liquidity and accelerating the sell-off. Each system was behaving rationally according to its own logic. None of them were designed to ask whether the input they were responding to was itself anomalous.

The absence of a meaningful human checkpoint — a pause mechanism triggered by volume or velocity readings that fell far outside any normal distribution — meant there was no moment at which anyone could intervene before the cascade was already underway. By the time circuit breakers engaged, the damage was done. The outlier was not noise. It was a fire alarm that nobody had been assigned to hear.

When Two Teams Are Doing the Maths in Different Languages

Sometimes the problem is not that the data is wrong. It is that two parts of the organisation have different definitions of what "right" means. Sales and Marketing both track leads. Finance and Operations both report costs. Two departments can share a spreadsheet, use identical column headers, and be measuring entirely different things without anyone having noticed, because the conversation about what the columns actually mean never happened.

This is what might be called a unit hallucination: the false confidence that shared vocabulary means shared understanding. It is an anomaly that hides not in the numbers themselves but in the assumptions behind them, and it tends to surface only at the worst possible moment — when a major decision is being made on data that two key stakeholders have been interpreting differently for months.

The practical fix is a plumbing audit: a cross-team definitions review conducted before any dataset is merged or any model is built on combined organisational data. It means agreeing, in writing, on what every key metric means.

The Plumbing Audit

Before merging datasets or building any AI model on combined organisational data, agree in writing on what every key metric means: what counts as a "lead" versus a "qualified buyer," whether revenue figures are gross or net, which time zone timestamps are recorded in, and how currency conversions are handled. This conversation is tedious. It is also the only thing standing between your organisation and the kind of structural mismatch that cost NASA a spacecraft.
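The written agreement can even be made machine-checkable. A minimal sketch follows, with hypothetical team registries and metric definitions invented for the example; the point is that every shared column must resolve to one agreed definition before datasets merge:

```python
# Hypothetical per-team metric definitions; in practice these would be
# maintained as part of each dataset's documentation.
sales_defs = {
    "lead": "any form submission",
    "revenue": "gross, USD",
    "timestamp": "UTC",
}
marketing_defs = {
    "lead": "qualified buyer only",
    "revenue": "gross, USD",
    "timestamp": "UTC",
}

def plumbing_audit(*team_defs):
    """Return every metric whose definition disagrees across teams;
    each one is a conversation to have before the datasets merge."""
    all_metrics = set().union(*(d.keys() for d in team_defs))
    conflicts = {}
    for metric in sorted(all_metrics):
        seen = {d[metric] for d in team_defs if metric in d}
        if len(seen) > 1:
            conflicts[metric] = sorted(seen)
    return conflicts

print(plumbing_audit(sales_defs, marketing_defs))
# "lead" surfaces as a conflict; "revenue" and "timestamp" pass
```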

Case Study: Mars Climate Orbiter — 1999
$125M spacecraft lost to a unit conversion error

NASA's Mars Climate Orbiter was a sophisticated piece of engineering representing years of work and $125 million in investment. On 23 September 1999, it approached Mars at far too low an altitude, entered the atmosphere, and was destroyed. The cause was not a software bug, a component failure, or an engineering miscalculation. It was a units mismatch. One engineering team had been supplying thruster data in imperial units; the team receiving it was processing the figures as metric. Both teams were doing their jobs correctly. The data was flowing accurately. Nobody had checked whether the two systems were speaking the same language.

The anomaly was present in the navigation data throughout the approach phase. In retrospect, the trajectory deviations were visible. The problem was that without a defined threshold for flagging unit-related inconsistencies, nobody was looking for them in the right way. A plumbing audit conducted before the mission's final approach phase might have saved the spacecraft. A definitions check costs nothing. A $125 million probe costs rather more.

Where Anomalies Go to Hide: Inside the Average

Averages are enormously useful. They are also one of the most reliable places for a serious problem to disappear from view. A stable average can mask two populations moving in opposite directions simultaneously. A flat trend line can sit on top of a dataset that is quietly bifurcating into two groups with entirely different trajectories. As long as you are looking at the mean, everything appears normal. The anomaly only becomes visible when you look at the distribution underneath it.

This is not an abstract statistical concern. It has direct operational consequences whenever AI systems are used to allocate resources, assess risk, or make decisions about people. A model that reports an average outcome has answered a question — but not necessarily the right one. The right question, in most contexts, is whether that average is concealing meaningful variation between groups, and if so, what the consequences of that variation are for the people on the wrong end of it.
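Asking that question is a one-step disaggregation: break the aggregate down by subgroup and compare. In the sketch below, the group labels and credit-limit figures are invented for illustration:

```python
import statistics
from collections import defaultdict

def subgroup_means(records, group_key, value_key):
    """Break an aggregate metric down by subgroup, so that a stable
    overall mean cannot hide groups moving in opposite directions."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[value_key])
    return {g: statistics.mean(v) for g, v in groups.items()}

# Invented credit-limit data: the overall mean looks unremarkable,
# the subgroup means do not.
records = [
    {"group": "A", "limit": 20_000}, {"group": "A", "limit": 22_000},
    {"group": "B", "limit": 2_000},  {"group": "B", "limit": 2_500},
]
overall = statistics.mean(r["limit"] for r in records)
print(overall, subgroup_means(records, "group", "limit"))
```

A single mean of roughly eleven thousand answers the question the dashboard asked; the ten-fold gap between the subgroup means answers the question that mattered.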

Case Study: Apple Card Credit Limits — 2019

When Apple Card launched, its credit limit algorithm appeared to be functioning normally at a system level. Aggregate metrics showed nothing unusual. It was only when individual cases were compared side by side that a pattern emerged: women were routinely being assigned significantly lower credit limits than men with similar or identical financial profiles. In some cases, the disparity was dramatic enough that spouses sharing the same assets and liabilities received limits that differed by a factor of ten or more.

As Wired reported, the issue was invisible in the model's overall performance metrics. It only became visible through subgroup analysis — comparing outcomes not across the full user base, but within comparable financial profiles broken down by gender. The algorithm was not flagging anything wrong because nothing was wrong by the only measure it was being held to. The anomaly lived entirely in the distribution beneath the average, and it required the kind of deliberate disaggregation that most post-deployment audits never perform.

The practical takeaway is a simple habit shift: wherever an AI system produces an average or an aggregate score, ask what the distribution behind it looks like. A histogram showing how data points are clustering across the range will reveal patterns that a single summary statistic never can. If the data is bimodally distributed — clumping at two separate points rather than spreading naturally around a centre — that is almost always a signal worth investigating before any decisions are made on the back of the average.
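A crude version of that histogram check needs nothing beyond the standard library. The ten-bin width and the "two separated clusters" test here are simplifications (a real analysis would reach for a proper statistical test of multimodality), but they are enough to flag the bimodal shape described above:

```python
def histogram(values, bins=10):
    """Count how data points cluster across the range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

def looks_bimodal(counts):
    """Crude two-peak check: the counts rise, fall to empty bins, then
    rise again, meaning the data clumps at two separate points."""
    clusters, in_cluster = 0, False
    for c in counts:
        if c > 0 and not in_cluster:
            clusters += 1
            in_cluster = True
        elif c == 0:
            in_cluster = False
    return clusters >= 2

scores = [0.11, 0.12, 0.14, 0.13, 0.88, 0.91, 0.90]  # invented scores
print(looks_bimodal(histogram(scores)))  # two clumps: investigate
```

If the check fires, the next step is not to correct the data but to ask what the two populations are and why the model is treating them differently.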

The Red Button Philosophy

Every organisation that operates with AI-augmented workflows needs what might be called a red button culture: the shared understanding that any employee, at any level, who spots something that looks anomalous has both the authority and the expectation to pause, flag it, and ask why, before the process continues.

This sounds straightforward. It is not, in practice. Anomalies are easy to explain away. A metric that shifts by more than a standard deviation without an obvious cause is uncomfortable to raise in a meeting where everyone else has accepted the output. A data point that does not fit the expected pattern can always be framed as an edge case. The institutional pressure in most organisations runs toward completion, not investigation. Red button culture runs directly against that pressure — and it requires active, deliberate cultivation.

The cases in this article — Zillow, the Flash Crash, the Mars Orbiter, the Apple Card algorithm — share a common thread. In each of them, the anomaly was present before the crisis. The data was there to be interrogated. What was absent was either the skill to recognise it, the tool to flag it, or the organisational permission to act on it. None of those absences are inevitable. All of them are addressable with the right training and the right culture.

The Core Principle

Anomalies are the smoke. The crisis is the fire. A data-literate organisation does not wait until there are flames to start looking for the source. It builds the habit of investigating the smoke — even when the smoke looks statistically insignificant, even when it is inconvenient, even when everyone else in the room has already moved on. If you manage the data, you manage the risk. Do not let the termites win.

Want to build these skills across your organisation?

Our data literacy courses cover the practical competencies behind anomaly detection — from working with distributions and guardrail metrics to auditing AI outputs before they scale. If your team is making decisions on data they cannot yet interrogate, we can help change that.