Yair Knijn
The SaaS outage that took down a business process nobody had mapped to a tool
The COO assumed that anything critical was in the systems inventory, and the inventory was the thing the IT team owned and the auditors had seen. The tool that took the workflow down was not on it. Someone in operations had signed up for it two years earlier to chase document sign-offs, expensed it on a card, and it quietly absorbed a step that the compliance attestation process now could not run without.
Nobody decided that this tool would become load-bearing. It just did. And on the morning it returned a 503 for four hours, the question on the bridge call was not "how do we fail over?" but "wait, what runs through this thing?"
How a convenience tool quietly becomes critical infrastructure
Criticality is rarely assigned. It accretes. A team picks a tool because it removes friction from one task, and then a second task leans on the first, and a control attestation starts citing the output, and three quarters later a regulated workflow has a single point of failure that exists on nobody's diagram. The tool did not change. Its blast radius did.
The tell is who panics during the outage. If the people scrambling are not the people who own the vendor relationship, you have a process-to-tool dependency that was never written down. That gap is exactly where recovery planning falls apart, because you cannot set an objective for a dependency you have not named.
Business-impact analysis the estate never had
A business-impact analysis asks a plain set of questions for every workflow: if this stops, what breaks, how fast does it hurt, and what does it cost per hour of downtime? DORA, which became applicable to EU financial entities on 17 January 2025, makes this explicit. Its ICT business-continuity requirements rest on a documented impact analysis that identifies critical functions, the dependencies between them, and the recovery requirements for each. The word "dependencies" is doing the heavy lifting. You cannot recover a function whose supporting tools you never enumerated.
Most estates have done the opposite of this. They have an asset list (servers, the big platforms, the contracts finance knows about) and they have a set of workflows that live in people's heads. The two were never joined. The outage is what joins them, badly, in real time, with a regulator's clock running.
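The join that the outage performs in real time can be run ahead of it. Here is a minimal sketch, assuming you already hold the asset list as a set of tool names and have interviewed teams to learn which tools each workflow touches. Every name in it (`inventory`, `process_tools`, `signoff-tracker`, and so on) is hypothetical and only illustrates the shape of the check.

```python
# Hypothetical asset list: what IT owns and the auditors have seen.
inventory = {"erp", "grc-platform", "email"}

# Hypothetical workflow knowledge, gathered from the people who run each process.
process_tools = {
    "Quarterly control attestation": {"grc-platform", "signoff-tracker"},
    "Vendor invoice approval": {"erp"},
}


def unlisted_dependencies(process_tools, inventory):
    """Map each process to the tools it uses that are missing from the inventory."""
    gaps = {}
    for process, tools in process_tools.items():
        missing = tools - inventory  # set difference: used but never inventoried
        if missing:
            gaps[process] = sorted(missing)
    return gaps


print(unlisted_dependencies(process_tools, inventory))
# prints {'Quarterly control attestation': ['signoff-tracker']}
```

The code is the easy part. The real effort goes into filling `process_tools` honestly, which usually means asking operations staff and reading expense reports instead of trusting the contracts finance knows about.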
Mapping process to tool to recovery objective in one place
The fix is one table, not another monitoring dashboard. Each row is a business process. Each row names the tools it actually touches, the owner, the maximum tolerable downtime, and the fallback. When that record exists, an outage becomes a lookup instead of an investigation.
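- Process: the workflow in business terms, e.g. quarterly control attestation.
- Tools it depends on: every system the step touches, including the card-expensed one.
- Owner: the person accountable for the process, who is not necessarily the person who holds the vendor relationship.
- RTO: the recovery time objective, set no longer than the downtime the process can tolerate before it causes real harm.
- Fallback: the documented manual or alternate path, and who runs it.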
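To make "a lookup instead of an investigation" concrete, the table can be sketched as a small record type plus one query. This is an illustration under stated assumptions, not how any particular product stores it: the `ProcessRecord` fields, the tool names, the owners, and the RTO figures are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessRecord:
    process: str             # the workflow in business terms
    tools: tuple[str, ...]   # every system the process touches
    owner: str               # who is accountable for the process
    rto_hours: float         # recovery time objective, in hours
    fallback: str            # documented manual or alternate path


# Hypothetical rows; names and figures are illustrative only.
DEPENDENCY_MAP = [
    ProcessRecord(
        process="Quarterly control attestation",
        tools=("grc-platform", "signoff-tracker"),
        owner="Head of Compliance",
        rto_hours=4,
        fallback="Collect sign-offs by email using the attestation checklist",
    ),
    ProcessRecord(
        process="Vendor invoice approval",
        tools=("erp",),
        owner="Finance Operations",
        rto_hours=24,
        fallback="Paper approval form, keyed in after recovery",
    ),
]


def processes_hit_by(tool: str, records: list[ProcessRecord]) -> list[ProcessRecord]:
    """Return the processes that depend on `tool`, tightest RTO first."""
    hit = [r for r in records if tool in r.tools]
    return sorted(hit, key=lambda r: r.rto_hours)


# The bridge-call question, answered as a lookup.
for record in processes_hit_by("signoff-tracker", DEPENDENCY_MAP):
    print(f"{record.process}: owner={record.owner}, "
          f"RTO={record.rto_hours}h, fallback={record.fallback}")
```

Sorting by RTO is a design choice: when one tool sits under several processes, the row with the least time to spare is the one to read first.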
Designing fallback for dependencies you finally made visible
Visibility is the prerequisite, not the deliverable. Once the map exists, you size the response to the harm. A tool whose workflow can wait a day needs a written manual procedure and a known contact. A tool whose workflow has a regulatory deadline needs a tested alternate path, a contractual recovery commitment from the vendor, and a rehearsal on the calendar at least once a year. An RTO you have never exercised is just a number in a document.
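The two tiers described here can be written down as a rule, along with a check for rehearsals that have gone stale. The sketch below assumes a simple yes/no `regulatory_deadline` flag per process and treats "at least once a year" as 365 days. Both are simplifications for illustration, and the function names are hypothetical.

```python
from __future__ import annotations

from datetime import date, timedelta

# Safeguards for a workflow that can wait a day.
BASELINE = ["written manual procedure", "known contact"]

# Safeguards for a workflow with a regulatory deadline.
REGULATED = [
    "tested alternate path",
    "contractual vendor recovery commitment",
    "rehearsal at least once a year",
]


def required_safeguards(regulatory_deadline: bool) -> list[str]:
    """Return the safeguards a process needs, sized to the harm of an outage."""
    return list(REGULATED) if regulatory_deadline else list(BASELINE)


def rehearsal_overdue(last_rehearsal: date | None, today: date,
                      max_age_days: int = 365) -> bool:
    """True if the fallback was never rehearsed or the last rehearsal is too old."""
    if last_rehearsal is None:
        return True
    return today - last_rehearsal > timedelta(days=max_age_days)


print(required_safeguards(regulatory_deadline=True))
print(rehearsal_overdue(None, today=date(2025, 6, 1)))
```

A fallback that has never been rehearsed counts as overdue by default here, which matches the point that an unexercised RTO is only a number.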
This is the work an estate record makes routine. In Spot Suite, a Customer Environment holds the systems, owners, and the process-to-tool dependencies in one place, so when something returns a 503 you already know which workflow it stops and what the fallback is. See how the products handle this if your last outage turned into an investigation instead of a lookup.