Tools should return data. Language models should return language
How I stopped trying to make my MCP tool write emails and let Claude do its job
I’ve spent the last few weeks building something called Dark Matter co-pilot (cool name right?). It’s an MCP server that gives Claude access to my studio’s actual data. Case studies, leads, pricing, processes I follow, findings I’ve recorded for prospective websites.
The goal is that when I ask Claude for help drafting an outreach email, it has real context and doesn’t have to guess anything.
The first two weeks were great. I was learning a lot and not making any major design mistakes. But around the third week, I was about to make one, or in technical words, an oopsie.
It was the kind that doesn’t blow up the code, it just makes the performance worse.
I want to write about it, because I haven’t seen it addressed or articulated cleanly anywhere, and I don’t want anyone who might be leaning into MCPs to make this mistake.
The mistake: trying to make my tool generate text.
The scenario
I was building a tool called draft_outreach_email. The job: When I tell Claude “Draft a cold outreach email for lead 5.” or “Draft an outreach email for the lead Acme Marketing”, it should produce a proper sendable email. It should be personalized. It should reference one of my actual case studies, and it should be written in my studio’s voice by referencing a markdown document through a defined resource.
The first version I sketched was the obvious one. The tool would:
1. Fetch the lead’s information from the database
2. Fetch the most relevant case study
3. Read my positioning.md document for the studio’s voice
4. Compose the email
5. Return the email as a string
The mistake? Step 4. That’s where I almost went wrong. I still remember when I wrote this flow on my whiteboard, I felt like a genius. Lol.
I, being the genius that I am… or so I thought, was going to have my tool do the actual writing. You read that right. The tool’s function would call another LLM internally, through Anthropic’s or OpenAI’s API, generate a small string, and return that. My Claude Desktop would then just return that string to me.
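For concreteness, that first design might have looked roughly like the sketch below. Everything in it is hypothetical: the `LEADS`, `CASE_STUDIES` and `POSITIONING` values stand in for the database and positioning.md, and `call_llm` is a stub standing in for a request to a second model, not a real SDK call.

```python
from typing import Callable

# Hypothetical stand-ins for the database and positioning.md.
LEADS = {5: {"name": "Acme Marketing", "site": "acme.example"}}
CASE_STUDIES = [{"title": "Example redesign", "summary": "Placeholder summary."}]
POSITIONING = "Placeholder for the contents of positioning.md."


def call_llm(prompt: str) -> str:
    """Stub for a second model; a real version would call an LLM API here."""
    return f"[text generated by a second model from {len(prompt)} chars of prompt]"


def draft_outreach_email_v1(lead_id: int, llm: Callable[[str], str] = call_llm) -> str:
    """The design I almost shipped: the tool itself does the writing."""
    lead = LEADS[lead_id]              # step 1: fetch the lead
    case_study = CASE_STUDIES[0]       # step 2: pick a case study (naively, the first one)
    prompt = (                         # step 3: fold in the studio's voice
        f"Voice: {POSITIONING}\n"
        f"Lead: {lead['name']}\n"
        f"Case study: {case_study['title']}\n"
        "Write a cold outreach email."
    )
    return llm(prompt)                 # steps 4 and 5: nested LLM call, return a string
```

Notice that the caller only ever sees the finished string. The lead, the case study and the voice document never reach the model the user is actually talking to.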
If you’ve built LLM tooling, it probably sounds reasonable. It’s not. It’s wrong. Let me explain why:
The architectural mistake
Here’s the thing. I was already talking to Claude. It’s the most capable LLM I know and have access to. It’s the interface that the user (genius me) is actually looking at and interacting with. And yet, I was about to implement a system where my tool would call a DIFFERENT language model to generate text, which would then return that text to Claude, and then Claude would show that to me.
Genius. Right?
Anyways, I was gaming with my friends. I was stuck on the loading screen and that whiteboard with that genius flow written on it was staring back at me. Out of nowhere, it hit me. My Eureka moment. WHY? Why am I introducing a second language model into my system? Claude is literally right there. Claude can draft an amazing email. The only sensible reason to have my tool generate the text for the email is if my tool’s LLM is somehow better than Claude. It isn’t.
It isn’t just inefficient. It’s the wrong abstraction.
The right abstraction: my tool gathers and returns the contextual data. Claude’s job is to use that contextual data to generate language. Each layer doing what it’s good at. Abstraction.
So, I rewrote the tool. Now, draft_outreach_email returns structured context:
The lead’s information
Any prior findings I’ve recorded about their website
All the case studies
My voice, i.e. positioning.md
A document describing my outreach structure
An optional angle parameter
That’s it. Just data. Context. No text generation inside my tool. No additional LLM call.
When Claude calls this tool, it gets a JSON object back. It reads the object, gets the context, and then drafts an email on its own. If I say “Make it shorter” or “Lead with the angle”, it can adjust accordingly since Claude is doing the writing. If my tool had been doing the writing, every revision would require another tool call with different parameters. Dumb.
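A minimal sketch of what the rewritten tool could look like, using only the standard library. The in-memory dictionaries are hypothetical stand-ins for my database and documents, and the JSON key names are assumptions for illustration, not the exact schema from the project.

```python
import json
from typing import Optional

# Hypothetical in-memory stand-ins for the database and markdown documents.
LEADS = {5: {"id": 5, "name": "Acme Marketing", "website": "acme.example"}}
FINDINGS = {5: ["Placeholder finding about the lead's website."]}
CASE_STUDIES = [{"title": "Placeholder case study", "summary": "Placeholder."}]
DOCUMENTS = {
    "positioning": "Placeholder for positioning.md.",
    "outreach_structure": "Placeholder for the outreach structure document.",
}


def draft_outreach_email(lead_id: int, angle: Optional[str] = None) -> str:
    """Gather context for an outreach email and return it as JSON.

    No text generation happens here; the calling model writes the email.
    """
    if lead_id not in LEADS:
        # Return a structured error so the model can report it cleanly.
        return json.dumps({"error": f"No lead with id {lead_id}"})
    context = {
        "lead": LEADS[lead_id],
        "findings": FINDINGS.get(lead_id, []),
        "case_studies": CASE_STUDIES,
        "voice": DOCUMENTS["positioning"],
        "outreach_structure": DOCUMENTS["outreach_structure"],
        "angle": angle,  # optional steer, passed through untouched
    }
    return json.dumps(context, indent=2)
```

In a real MCP server this function would be registered as a tool through whatever SDK you use. That wiring is left out here so the sketch runs without dependencies.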
The principle
The actual takeaway is:
Tools should return data. Language models should return language.
Any AI-augmented system has two layers, whether you've thought about it that way or not. A deterministic layer (your code, your database, your APIs, the boring stuff) and a language layer (the LLM, doing reasoning and composition). The deterministic layer is for things with rules. The language layer is for things without them. There’s always abstraction, always separation of concerns. The deterministic layer fetches, filters, validates, computes. It’s reliable, debuggable and cheap. The language layer composes, adapts, reasons over fuzzy context and adjusts tone. Flexible and contextual.
When you make your tools generate text, you’re forcing the deterministic layer to do language work it’s bad at, and you’re making the language layer do the downstream work. You get the worst of both worlds.
When you make your language layer do retrieval or validation, you get the inverse problem. Hallucinations and inconsistent behavior.
The clean abstraction: code provides data, the LLM provides language.
What this looks like in practice
A few rules of the road:
Your tools should be boring. Boring is good. get_user_recent_orders should return a list of orders, not a “natural language summary of recent orders”. The LLM can summarize it. Let it.
Your prompts can stay simple. Most “complex prompts” I see are trying to compensate for missing context. Give the model the context it needs and let it do its magic. The prompt can stay simple.
You can swap models without rewriting your tools. If the tool returns data, you can swap Claude for whatever comes next. If the tool generates text, you’re stuck with whichever model you wired in. Future you will thank present you for avoiding this trap.
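To make the "boring tool" rule concrete, here is a rough sketch of get_user_recent_orders. The `Order` fields, the sample data and the `limit` parameter are all made up for illustration.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class Order:
    order_id: str
    placed_on: date
    total_cents: int


# Hypothetical sample data standing in for a real orders table.
ORDERS = {
    "u1": [
        Order("o-1", date(2024, 1, 5), 1999),
        Order("o-2", date(2024, 3, 2), 4500),
        Order("o-3", date(2024, 2, 10), 1250),
    ]
}


def get_user_recent_orders(user_id: str, limit: int = 2) -> list[dict]:
    """Return the user's most recent orders, newest first, as plain dicts."""
    orders = sorted(ORDERS.get(user_id, []), key=lambda o: o.placed_on, reverse=True)
    # Dates become ISO strings so the result serializes cleanly to JSON.
    return [
        {**asdict(o), "placed_on": o.placed_on.isoformat()}
        for o in orders[:limit]
    ]
```

Nothing in this function knows which model will read the result, which is why swapping models later costs nothing on the tool side.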
The hard case
The argument above is easy when your tool is doing something obviously data-esque, like fetching leads. It gets tougher when the tool’s job feels language shaped from the start.
Take draft_outreach_email as an example. It’s tempting to put the language work inside the tool, because the tool’s literal name involves writing.
But notice what the tool actually needs to know to do that job: who's the lead, what have I observed about their site, what past work is relevant, what's my voice, what's my structure. All of that is data. The language work is downstream of having that data assembled.
The instinct should be: separate the gathering from the composing. Tools gather. LLMs compose. The moment you blur this, you're doomed.
“But what about LLM-as-a-judge?”
Yeah, I know. Someone is going to ask. What about agentic systems where one agent calls another? Aren’t those tools that return language?
Not really. An LLM agent isn’t a tool in the sense I’m talking about. It’s another instance of the language layer, just one that’s been delegated a sub-task. The same principle applies recursively: each agent’s tools should return data, and each agent should do its own language work.
If you’re using LLM-as-a-judge to evaluate an output, that’s evaluation, which is a language task, and the language layer is the right place for it. But you wouldn’t have your retrieval tool call a judge to “decide which document is most relevant” if you could measure relevance deterministically. Same principle, different costume.
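As one example of what "measure relevance deterministically" can mean, here is a small sketch that ranks documents by word overlap (Jaccard similarity) with a query. This is a technique I'm swapping in for illustration, not what my tool does, and real systems often use something stronger such as BM25 or embeddings. The function names are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_by_overlap(query: str, documents: dict[str, str]) -> list[tuple[str, float]]:
    """Rank documents by Jaccard overlap with the query, highest first."""
    query_words = tokenize(query)
    scored = []
    for name, text in documents.items():
        doc_words = tokenize(text)
        union = query_words | doc_words
        score = len(query_words & doc_words) / len(union) if union else 0.0
        scored.append((name, score))
    # Sort by score descending, then by name so ties come out in a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

The same input always produces the same ranking, so it can be unit tested and debugged like any other function. No judge required.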
Why this matters more than it looks
I think this is a pattern AI engineers will keep getting wrong for a while, because the tooling makes it easy to blur the layers. Most LLM frameworks let you slap a @tool decorator on a function that calls another LLM and nothing yells at you. The framework doesn’t care.
But once the layers blur, the system becomes much harder to reason about. The tools become opaque. It becomes harder to debug. Costs scale weirdly because you’re paying for tokens at every layer instead of just at the surface.
Tools are deterministic. LLMs are not. Tools have schemas, return types, and unit tests. LLMs have evals. They're different kinds of things, and you treat them differently.
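Here is a tiny sketch of what "tools have unit tests" looks like in practice, using Python's built-in unittest module. The `get_lead` tool and its data are hypothetical.

```python
import unittest

# Hypothetical stand-in for the leads table.
LEADS = {5: {"id": 5, "name": "Acme Marketing"}}


def get_lead(lead_id: int) -> dict:
    """Return a copy of the lead record, or raise KeyError if it does not exist."""
    return dict(LEADS[lead_id])


class GetLeadTests(unittest.TestCase):
    def test_returns_expected_fields(self):
        self.assertEqual(get_lead(5), {"id": 5, "name": "Acme Marketing"})

    def test_unknown_lead_raises(self):
        with self.assertRaises(KeyError):
            get_lead(999)

    def test_returns_a_copy(self):
        # Mutating the result must not change the stored record.
        get_lead(5)["name"] = "changed"
        self.assertEqual(LEADS[5]["name"], "Acme Marketing")
```

Saved in a file, these run with `python -m unittest`. You can't write assertions this exact against a function that returns freshly generated prose, which is the whole point of keeping generation out of the tool.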
A rough test: If I removed the LLM from my tool, would the tool do something useful? If the answer is no, if the tool’s only job is to format an LLM call, then it probably shouldn’t be a tool at all. It’s just a prompt template in disguise.
The part that surprised me most
Once I separated the layers cleanly, Claude got better. Not because the model changed. Because the prompts got simpler and the context got cleaner. Claude could focus on writing instead of on figuring out what to write about. The context was already there.
I think that's the deepest version of this principle. You don't make language models better by giving them more elaborate instructions. You make them better by giving them better context. Tools that return data are how you do that.
If you want to see the actual architecture (with a mermaid diagram and a demo), the README has it. The code is here.

