Detection

The Gap Between Failure and Discovery

Reliability work often starts too late because the system only tells the truth after someone asks.

The gap between failure and discovery is operational debt accruing in real time. During that gap, the system is already broken, but the team still behaves as if it were healthy. This is why postmortems often reveal that the triggering failure happened long before anyone raised the incident in chat.

Background processes are especially vulnerable. A job may stop making progress without crashing. A scraper may continue running while collecting empty results. A drift report may complete but publish a warning nobody sees.
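The three cases above share one trait: a check that only asks "is the process still running?" passes in all of them. A minimal sketch of a check that looks further might resemble the following. The RunSnapshot fields, the silent_failures function, and the 300-second idle threshold are all hypothetical names and values chosen for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunSnapshot:
    """Hypothetical summary of what a background process last reported."""
    alive: bool                    # the process has not crashed
    seconds_since_progress: float  # time since the last unit of work finished
    items_collected: int           # results produced in the most recent cycle
    unread_warnings: int           # warnings published but never acknowledged


def silent_failures(snap: RunSnapshot, max_idle_seconds: float = 300.0) -> List[str]:
    """Return the silent failure modes visible in a snapshot (empty if none)."""
    problems = []
    # Still running, but no unit of work has finished recently.
    if snap.alive and snap.seconds_since_progress > max_idle_seconds:
        problems.append("stalled")
    # Still running, but the latest cycle produced nothing.
    if snap.alive and snap.items_collected == 0:
        problems.append("empty_results")
    # Completed or not, a warning was published that nobody has read.
    if snap.unread_warnings > 0:
        problems.append("unseen_warning")
    return problems
```

A scraper that is alive and recently active but collected zero items would return ["empty_results"] here, even though a liveness check alone would report it as healthy.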

Close the gap at the source

The best place to reduce discovery time is inside the process doing the work. When software emits useful events as it moves, discovery becomes a property of the system instead of an accident of human attention.
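As a rough illustration of emitting events from inside the work itself, the sketch below has a job report each step as a JSON line. The names make_emitter and run_job, the event names, and the in-memory list used as a sink are assumptions made for this example; a real system would more likely send the records to a log pipeline or metrics service.

```python
import json
import time
from typing import Callable, Iterable, List


def make_emitter(sink: List[str], job: str, clock: Callable[[], float] = time.time):
    """Return an emit(event, **fields) function that appends JSON lines to sink."""
    def emit(event: str, **fields) -> None:
        record = {"job": job, "event": event, "ts": clock(), **fields}
        sink.append(json.dumps(record, sort_keys=True))
    return emit


def run_job(batches: Iterable[List[str]], emit) -> int:
    """Process batches, emitting an event at every step of the work."""
    emit("started")
    total = 0
    for index, batch in enumerate(batches):
        total += len(batch)
        # Progress is reported as it happens, not reconstructed afterwards.
        emit("batch_done", index=index, items=len(batch))
        if not batch:
            # An empty batch is reported explicitly instead of passing silently.
            emit("empty_batch", index=index)
    emit("finished", total_items=total)
    return total
```

Called as run_job([["a", "b"], []], make_emitter(events, "scraper")) with events as an empty list, the job leaves behind a started event, two batch_done events, an empty_batch event, and a finished event. Anything watching that stream can notice a missing finished event or an unexpected empty_batch without a person having to ask.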