Operations

Turning Production Knowledge Into Telemetry

The things experienced operators know should not live only in their heads.

Every production system has unwritten knowledge around it. Someone knows that a batch job normally finishes before 06:00. Someone knows that 200 skipped rows is suspicious but 3 skipped rows is routine. Someone knows that a supplier file is late if it has not arrived by the second retry.

That knowledge is valuable, but it is fragile when it exists only in memory, chat history, runbooks, or the habits of a few people. It helps during an incident only if the right person is available and remembers the right context at the right time.

Telemetry should capture judgement

Good telemetry is not just a stream of values. It encodes the operational judgement around those values. The system should not only say that 4,000 records were imported. It should help explain whether 4,000 records is normal, incomplete, late, or surprising.
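As a minimal sketch of that idea, the Python below attaches an expected range and a verdict to a raw count, so the event says both what happened and how it compares with normal. The names (ExpectedRange, judge_count, import_event) and the range 3,500 to 4,500 are illustrative assumptions, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedRange:
    """Hypothetical encoding of what operators treat as a normal count."""
    low: int
    high: int


def judge_count(count: int, expected: ExpectedRange) -> str:
    """Classify a raw count against the range operators consider normal."""
    if count < expected.low:
        return "below_expected"
    if count > expected.high:
        return "above_expected"
    return "normal"


def import_event(count: int, expected: ExpectedRange) -> dict:
    """Build an event that carries the value together with its judgement."""
    return {
        "metric": "records_imported",
        "value": count,
        "expected_low": expected.low,
        "expected_high": expected.high,
        "verdict": judge_count(count, expected),
    }


# Illustrative range only; a real one would come from the team's own history.
nightly_range = ExpectedRange(low=3500, high=4500)
event = import_event(4000, nightly_range)
```

Keeping the range inside the event means a reader of the dashboard does not need to remember what a normal night looks like.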

Start with repeated questions

The best instrumentation points are often hiding in repeated questions: is it running, how far did it get, is that count normal, what did it skip, did the report go out, and who is waiting on the result? Each repeated question points to production knowledge that could become telemetry.
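One way to picture this is a status snapshot whose fields each answer one of those questions. The sketch below is an assumption-laden illustration in Python: JobStatus, its field names, and the "supplier_import" job are hypothetical, and a real process would publish the snapshot through whatever telemetry tool the team uses.

```python
from dataclasses import dataclass, field


@dataclass
class JobStatus:
    """Hypothetical status record; each field answers a repeated question."""
    name: str
    total_rows: int
    waiting_on: list[str] = field(default_factory=list)
    handled: int = 0
    skipped: dict[str, int] = field(default_factory=dict)

    def row_done(self) -> None:
        """Count one row that was processed normally."""
        self.handled += 1

    def row_skipped(self, reason: str) -> None:
        """Count one handled row that was skipped, keyed by reason."""
        self.handled += 1
        self.skipped[reason] = self.skipped.get(reason, 0) + 1

    def snapshot(self) -> dict:
        """Return the current answers as a plain dictionary."""
        if self.total_rows:
            percent = round(100 * self.handled / self.total_rows, 1)
        else:
            percent = 0.0
        return {
            "job": self.name,                                # which job is this?
            "running": self.handled < self.total_rows,       # is it running?
            "percent_complete": percent,                     # how far did it get?
            "skipped_total": sum(self.skipped.values()),     # what did it skip?
            "skipped_by_reason": dict(self.skipped),
            "waiting_on_result": list(self.waiting_on),      # who is waiting?
        }


job = JobStatus("supplier_import", total_rows=200, waiting_on=["finance"])
for _ in range(47):
    job.row_done()
for _ in range(3):
    job.row_skipped("missing_sku")
snap = job.snapshot()
```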

Make assumptions visible

Production systems are full of assumptions about timing, volume, order, dependencies, and acceptable failure. Telemetry can make those assumptions explicit. A milestone can show that validation started. A note can explain why a fallback path was used. A metric can report the count that experienced operators already check manually.
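The sketch below shows one possible shape for this in Python, using plain JSON lines rather than any specific product's API. The emit helper, the event kinds, and the event names are hypothetical; the 06:00 deadline reuses the batch job example from earlier in the article.

```python
import json
from datetime import datetime, time


def emit(kind: str, name: str, **fields) -> dict:
    """Hypothetical emitter: print one structured event as a JSON line."""
    event = {"kind": kind, "name": name, **fields}
    print(json.dumps(event))
    return event


def check_deadline(finished_at: datetime, deadline: time) -> dict:
    """Make the timing assumption explicit by reporting it next to the result.

    Simplification: this compares time of day only, so it assumes the job
    starts and finishes on the same calendar day.
    """
    on_time = finished_at.time() <= deadline
    return emit(
        "metric",
        "finished_before_deadline",
        value=on_time,
        deadline=deadline.isoformat(timespec="minutes"),
        finished_at=finished_at.isoformat(),
    )


# A milestone shows that validation started.
milestone = emit("milestone", "validation_started")

# A note explains why a fallback path was used (the reason text is invented).
note = emit("note", "fallback_used", reason="primary supplier file missing")

# A metric reports the check operators would otherwise do by hand.
result = check_deadline(datetime(2024, 1, 1, 6, 30), time(6, 0))
```

Carrying the deadline inside the event keeps the assumption visible next to the outcome instead of in someone's memory.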

Lower the dependency on memory

Runbooks still matter, but they should not be the first place a team discovers what normal looks like. When production knowledge becomes telemetry, the system carries more of its own context. New team members learn faster. Incidents involve less guessing. Status updates become less dependent on one person checking logs.

Where OpenTrace fits

OpenTrace is designed to make this translation small enough to do in real code. A process can publish progress, metrics, notes, payloads, durations, and milestones where the production knowledge already lives. The aim is simple: turn the things operators know into signals the system can show.
