Lessons From Managing Technical Operations
Technical operations is invisible until it breaks. The dashboard that loads on the first try, the support ticket that routes to the right queue, the handoff between engineering and operations that just happens — nobody notices any of it, and that is the whole point. The day people start noticing your work is usually the day something is on fire. So the scoreboard is upside down: you win by making sure nothing interesting happens.
I came to this work from the engineering side, which warped my instincts in a useful way. I assumed the hard part would be the code, the integrations, the cloud architecture. It almost never was. The systems themselves were usually fine. The failures lived everywhere the systems touched each other and touched people. That is the lesson under all the other lessons, so I want to start there.
The real work is the seams between systems
Pick any individual tool in a business-systems stack — Zendesk, HubSpot, a Talkdesk phone tree, a Power BI dashboard, an AWS service — and on its own it usually does what it says. The outages and the slow bleed of data quality almost always live in the gaps: the nightly export nobody actually owns, the customer field that support defines one way and finance defines another, the manual reconciliation a person quietly does every Friday afternoon because the two systems never agreed and someone decided a human would just paper over it.
Those seams are invisible on an architecture diagram because no box represents them. They are the arrows, and arrows do not have owners. So the first real job of technical operations is to make the seams visible: name them, assign them, and decide on purpose whether each one should be automated, monitored, or deleted. Most of the recurring pain I ever chased traced back to a seam that everyone assumed someone else was watching.
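One way to make that inventory concrete is a small seam register. This is a minimal sketch, not a description of any particular tool: the `Seam` and `Plan` names, the fields, and the example entries are all hypothetical, chosen only to mirror the name, assign, and decide steps above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Plan(Enum):
    """The three deliberate outcomes for a seam (names are illustrative)."""
    AUTOMATE = "automate"
    MONITOR = "monitor"
    DELETE = "delete"


@dataclass
class Seam:
    name: str                     # what the arrow on the diagram actually is
    source: str                   # system or team the data leaves
    target: str                   # system or team the data arrives at
    owner: Optional[str] = None   # None means nobody has been assigned
    plan: Optional[Plan] = None   # None means nobody has decided on purpose


def unresolved(seams):
    """Return seams that still lack an owner or a deliberate plan."""
    return [s for s in seams if s.owner is None or s.plan is None]


# Hypothetical entries, loosely based on the examples in the text.
seams = [
    Seam("nightly export", "support desk", "reporting"),
    Seam("customer field definition", "support", "finance",
         owner="ops", plan=Plan.MONITOR),
    Seam("Friday manual reconciliation", "system A", "system B", owner="ops"),
]

for seam in unresolved(seams):
    missing = "owner" if seam.owner is None else "plan"
    print(f"{seam.name}: missing {missing}")
```

The point of the sketch is the `unresolved` check: any seam that shows up there is one that everyone may be assuming someone else is watching.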
A skipped process today is a defect tomorrow. You just pay for it later, with interest, and usually in front of a customer.
Make the decisions legible
The decisions that actually run a company rarely live in a document. They live in someone’s head. Why a particular ticket priority maps to a particular SLA, why one customer segment gets routed differently, why we never touch that one integration on a Monday — this is tribal knowledge, and tribal knowledge is a single point of failure that takes vacations and eventually quits.
I got in the habit of writing decisions down in a flat, boring format: what we chose, why we chose it, and what would make us change our mind. That last part matters more than it looks. A decision without its reversal condition is just an opinion that calcified. When you write down what would change your mind, you give the next person permission to revisit it with evidence instead of treating it as scripture. It is the cheapest insurance a team can buy, and almost nobody buys it until after the expensive incident that would have been a footnote if anyone had written the thing down.
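The flat format described above can be sketched as a small record type. This is an illustration under assumptions: the `DecisionRecord` class, its field names, and the example decision are hypothetical, and a plain text file with the same three headings would serve equally well.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DecisionRecord:
    title: str
    chose: str            # what we chose
    why: str              # why we chose it
    would_change_if: str  # the reversal condition
    decided_on: date

    def __post_init__(self):
        # Refuse records that skip any of the three parts, in particular
        # the reversal condition, which is the easiest one to leave out.
        for field_name in ("chose", "why", "would_change_if"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"decision record is missing '{field_name}'")

    def render(self) -> str:
        """Render the record as flat, boring text."""
        return "\n".join([
            f"{self.title} ({self.decided_on.isoformat()})",
            f"What we chose: {self.chose}",
            f"Why: {self.why}",
            f"Would change our mind if: {self.would_change_if}",
        ])


# Hypothetical example; the details are invented for illustration.
record = DecisionRecord(
    title="No changes to integration X on Mondays",
    chose="Freeze deploys to integration X every Monday",
    why="Monday carries the heaviest load and past changes caused incidents",
    would_change_if="Monday volume drops or the integration gains a safe rollback",
    decided_on=date(2024, 1, 15),
)
print(record.render())
```

Making the reversal condition a required field is the design choice that matters here: the record cannot be saved as an opinion that has already calcified.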
Measure the toil before you fix it
Early on I fixed whatever was loudest. The loud problem is the one the angriest person is standing next to. It is almost never the most expensive one. The expensive problems are quiet: a fifteen-minute manual step done by four people every day, a report that three teams rebuild from scratch every month, a data-quality issue that silently inflates a number leadership makes decisions on.
So before changing a workflow I started forcing two numbers out of it: how often does this happen, and what does each occurrence cost in time or rework? You do not need a perfect measurement. You need an order of magnitude, because an order of magnitude is enough to sort the queue. Half the time the act of measuring ended the debate on its own: the thing everyone wanted me to automate turned out to happen twice a quarter, and the thing nobody mentioned was eating an afternoon a week.
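The two-number sort can be sketched in a few lines. The figures below are rough assumptions, not measurements: about 250 working days and 50 working weeks a year, an "afternoon" taken as four hours, and a guessed two hours per occurrence for the loud request. The `ToilItem` name and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    name: str
    occurrences_per_year: float    # rough estimate, order of magnitude only
    minutes_per_occurrence: float  # time or rework per occurrence
    people: int = 1                # how many people repeat the step

    def hours_per_year(self) -> float:
        """Frequency times cost per occurrence, in person-hours per year."""
        return (self.occurrences_per_year
                * self.minutes_per_occurrence
                * self.people) / 60


def rank_by_cost(items):
    """Sort the queue by estimated yearly cost, most expensive first."""
    return sorted(items, key=lambda item: item.hours_per_year(), reverse=True)


items = [
    # Twice a quarter; the two hours per occurrence is a guess.
    ToilItem("loud automation request", 8, 120),
    # Fifteen minutes, four people, every working day (assumed 250 days).
    ToilItem("daily manual step", 250, 15, people=4),
    # An afternoon a week (assumed 4 hours, 50 weeks).
    ToilItem("quiet weekly task", 50, 240),
]

for item in rank_by_cost(items):
    print(f"{item.name}: ~{item.hours_per_year():.0f} person-hours/year")
```

Under these assumptions the daily step costs roughly 250 person-hours a year and the weekly task roughly 200, against about 16 for the loud request. The estimates could each be off by a factor of two without changing the order, which is why an order of magnitude is enough.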
Three habits that did most of the work
1. Own the boring middle. The glamorous work is the new dashboard or the new automation. The work that compounds is keeping the field definitions consistent across four systems so the dashboard is not quietly lying. Boring is where reliability comes from.
2. Write the runbook before you are the runbook. If the answer to “how does this work” is “ask me,” you have built yourself a cage. The relief of being needed is real, and it is a trap. The goal is to make yourself replaceable on every individual thing so you are free to work on the next one.
3. Treat every incident as a process telling you something. A thing that broke once is bad luck. A thing that broke twice the same way is a missing control. The postmortem question I cared about was not “who missed it” but “what would have caught it automatically,” because people will always miss things and systems can be taught not to.
None of this is heroic, and that is the point I keep landing on. The version of operations that gets celebrated is the all-nighter, the save, the person who knew the one thing. Good operations is the unglamorous opposite: building the systems and writing the decisions down so that nobody ever has to be a hero, because the work simply does the right thing by default. The best week I ever had in operations was one where, on paper, nothing happened at all.
Andrew Nguyen
Technical Operations Manager