Skills Are Where Your Judgment Goes

Geoff Stearns · June 14, 2026

In my last post I argued you should give agents jobs, not processes: tell an agent what it’s responsible for, and split the work so tools enforce the rules while the skill’s instructions carry the goals and the judgment. So now let’s talk about how to build a good one.

Suppose you want to teach an agent to handle a real job: responding to escalated customer support tickets. Here’s a skill you’ve seen before: pull the ticket history, check the help center, draft a reply in the brand voice, tag and close. It looks responsible. It’s almost worthless as a skill, because every line is either generic model work or tool work. Now compare: a credit above $200 needs approval. A customer who mentions a deadline or a migration is churn-risk, not venting. Stop recommending the CSV workaround, it corrupted a customer’s data last month. We’d rather eat a $50 credit than spend an hour of an engineer’s time. The first skill writes down the steps. The second writes down the judgment.

The first one is the trap. Your current process feels like the work, but it isn’t. It’s the work bent around the tools you had and the handoffs between people, and the steps that were easy to write down were the mechanical ones. Hand those to the agent and you’ve automated the version of the job you already do, a little faster. That’s iteration. It’s not nothing, but it’s not what you’re here for. The agent can do far more than run your old playbook quicker, and getting there means letting go of the playbook and asking what the job actually needs. The parts that matter are the ones only you can answer, and they’re rarely the parts you wrote down.

It comes down to four questions. What can the model already do? Where does experience change the answer? What’s true here that isn’t true everywhere? What must the tool enforce? The answers decide what belongs in the skill, what belongs in the tool, and what shouldn’t be written down at all. A skill is for the part the model doesn’t get from general competence: your judgment, your local knowledge, and your rules for what good means here.

What’s mechanical?

Start by splitting the job into two piles.

The first pile is mechanical work. It’s the stuff where there’s a known way to do it, where you can follow the steps and get the result. Sometimes that’s a fixed procedure, sometimes it’s a bounded task you can hand to a tool, but either way, doing it well doesn’t take an expert. It takes correct execution, and a capable model already has that. Looking up an account, running a query, formatting a reply, fetching the relevant docs: mechanical. You might still point the agent at it, but that’s a cheap, shallow kind of teaching. Hand it a tool and, if the tool is well built, it explains itself, what it’s for, what it expects, what it returns, so a single line in the skill does the whole job: this is the customer CRM, use it to look up customer info. The agent takes it from there. The most a skill should say about a mechanical step is which tool to reach for and when, never how the tool works.

The second pile is judgment, a word that gets thrown around loosely. Here it means something specific: the part of the work where two things are true at once. There’s no procedure that settles it, because if a checklist could settle it, it would just be mechanical work. And yet experience reliably gets you better answers. Put a researcher who knows the method from the textbook next to one who has watched it fail in production, and the second one will catch that the sample can’t support the claim, that the method is trendy but shaky, that the real question isn’t the one being asked. No formula got them there. They’ve just seen what goes wrong. That’s judgment: no rule to follow, but a right and a wrong you can still be held to.

It shows up in every field. A new-grad engineer reaches for the clever abstraction that demos beautifully and becomes a maintenance sinkhole two years later; the staff engineer knows which corners are safe to cut on a throwaway prototype and which will haunt a system that lives for a decade. A junior analyst trusts the dashboard; the senior one knows which metric quietly lies. The judgment pile is where experience changes the answer, and that’s the pile a skill is actually for.

Where does experience change the call?

Finding that judgment is the step people skip, because it’s rarely written down. The runbook recorded the mechanical steps and left the judgment in someone’s head, for the same reason it was easy to write down: nobody records what feels like second nature. And you won’t get it by asking your expert to write down their reflexes. They’ll stare at a blank page.

You get it by watching the work. Shadow the person who does the job well. Put two of their outputs side by side and ask why one is better. Watch for the calls they make on reflex, the places where two good practitioners would choose differently, the moment a junior turns in something plausible and wrong and can’t say why. That gap, the one the junior couldn’t name, is the judgment. Write it down, all of it. The next question sorts it.

Some of what you write down won’t actually be yours. A fair amount of expertise is just general skill the model already has: knowing to check your work, not trusting a thin result, reading the whole file before editing it. Useful, but the model already does it, so you don’t need to write it down. Cross that off. What’s left is the part that’s specific to you, and that’s where the value is.

What’s true here that isn’t true everywhere?

This is the heart of it, the question almost nobody asks. The model knows the general version of your job. It has read more about customer support, or research, or your codebase’s language, than you ever will. What it doesn’t have is what’s true in your particular corner of the work.

There are two kinds. First, what’s current and local: the workaround your team stopped recommending last month, the method your field has quietly moved past, the convention this codebase follows that no other one does. Take the support job: the model will happily suggest the CSV export to a frustrated customer, because that was good advice in all the docs it trained on, and it has no way to know your team pulled it three weeks ago after it corrupted someone’s data. That isn’t something it gets from knowing customer support in general; it lives in your team’s heads until you write it into the skill. The model’s default picture of your world is built from what’s been written down elsewhere, which averages toward the past and the popular. It doesn’t know what changed in your shop last quarter, and it can’t, because that knowledge lives with you, not on the internet. Letting it search helps, but search mostly returns what someone wrote down. The part that makes your work yours is often the part nobody has written cleanly yet.

Second, what you value. A better model gets better at the task, not at knowing which tradeoff you favor when speed and rigor pull apart, what “good enough to ship” means here, or what’s even worth doing. Those are choices, and they’re yours. Left without them the model doesn’t pause to ask; it fills the gap with the most common answer, which is to say someone else’s. The thing you’re adding when you write a skill isn’t skill in the general sense, which the model already has. You’re adding the judgment that comes from being in this specific job, at this specific place, with these specific goals. That’s the part you’re there to add, and no model arrives with it, because it was never written down anywhere for the model to learn from.

What must the tool enforce?

One last pile, and it comes out for the opposite reason. Some rules you should never leave to judgment at all, no matter how good the agent gets. Two teams must not contact the same customer. A research participant must not get pulled into three studies in a month. A discount past a threshold needs sign-off. A smarter model doesn’t make these safe to ask for politely, so they don’t belong in the skill’s instructions as prose. They belong in the tool, enforced, which is the shared-truth argument from earlier in this series. The skill can explain why the rule exists. The tool makes it unbreakable.

Skills can bundle code, which muddies this. A skill is the whole folder: the instructions in SKILL.md, plus any scripts and reference files it ships with. But a script the agent chooses to run is not the same as an enforced rule. It’s a convenience the agent can skip, reorder, or talk itself out of, which makes it closer to a suggestion than a guarantee. If a rule must hold every time, it goes in the tool or the API the agent can’t route around, not in a script the instructions merely point at.

It’s also why the tools should be cut to fit the skill, not lifted whole from your software. The reflex is to wrap a system’s entire API and hand it over. That gives the agent something shaped like your software instead of the task it’s doing. Don’t give the agent the whole machine. Give it the one part this skill is allowed to use, with the rules built in. Smaller tools confuse the agent less and limit what it can do wrong.

There’s a sharper version of that mistake: skills that are really just instructions for how to talk to your backend. How to call the CLI, what the endpoints are, which flags to pass. That isn’t a skill, it’s developer documentation, and writing it into a skill is a sign the tooling is wrong. If the agent needs to know how to drive your system, the answer is a self-describing tool it can introspect, a CLI with real --help, an API that explains itself, not a markdown file restating the docs by hand. Build the tools so the agent can ask them what they do, and the skill stays free for the part that actually needs you.

The method, end to end

Back to the support job, with all four questions applied.

Start with the mechanical: pulling the history, looking up the account, fetching known issues, producing a first draft. The model can already do the basic version of all of it, so none of it needs step-by-step instructions in the skill.

Then the part where experience changes the call. Senior support people don’t treat tickets equally. They read the same angry email and tell venting from churn-risk. They notice when three unrelated tickets are the same underlying bug. They know when “works as intended” is technically true and still the wrong answer.

Then what’s true here that isn’t true everywhere: the workaround you stopped recommending, the escalation path that changed last quarter, how much goodwill you’ll spend to keep a customer (which a startup and an enterprise vendor answer very differently), and the bar that says a ticket isn’t resolved just because it’s closed. This is the part you actually have to write, because it exists nowhere else.

And finally what the tool enforces: credits above the threshold need approval, never reveal another account’s data, no commitments on ship dates.

It plays out the same way if you’re teaching an agent to recruit research participants. Querying the panel and sending invites are mechanical. Whether this sample can answer the question is where experience changes the call. Your team’s current read on incentive size, no-show risk, over-recruited segments, and speed versus rigor is what’s true here and not everywhere. Consent and over-use are rules, enforced in the recruiting tool.

What you’re left with

A skill built this way is short on procedure and long on judgment. It holds what’s current and local, so the agent works from how things are done here and now rather than the averaged-out version it trained on. It holds the tradeoffs, so it knows what to give up when it can’t have everything, and the bar for good, so “done” means more than “finished.” It puts a good call and a bad one side by side, so the judgment it’s using is visible, and it says when to hand off to a person, so the agent has a way out when it’s unsure. And it carries the few narrow tools it’s allowed to use, with the hard rules built in.

The counterargument is that this all dissolves: a capable enough model will infer your judgment from your repo history, your tickets, your past decisions. Some of that will happen. But inferring the pattern of your past choices isn’t the same as knowing what you’re trying to do now, or whether you’d still make those choices today. The model can learn what you did. It can’t decide what the work is for. That part was never in the training data, and it isn’t yours to hand off.

Lightly edited for tone on June 17, 2026.

Geoff Stearns is a Product Manager at Google building AI agents and skills.