The Problem Wasn't the Failure.
It Wasn't Knowing.
Failures are inevitable. Long periods of uncertainty are a design choice.
Many incidents are painful not because the initial failure was exotic, but because nobody knew it was happening. A worker stopped, a data source changed shape, a queue stopped draining, or a job completed with the wrong result. The technical failure may be ordinary. The discovery failure is what makes it costly.
Not knowing slows every response. It delays triage, hides scope, creates duplicate investigation, and forces people to reason from symptoms instead of facts. The team repairs the bug and also repairs the timeline.
Design for awareness
OpenTrace treats awareness as an output of the system. Code should report what it is doing in terms people recognise. That does not remove failures, but it makes them easier to find, explain, and bound.