AI Agent Outputting Garbage? The Problem Is You're Not Willing to Burn Tokens

Author: Systematic Long Short

Translation: Deep Tide TechFlow

Deep Tide Guide: The core point of this article is simple: AI Agent output quality is directly proportional to the number of Tokens you invest.

The author isn’t just discussing theory; they provide two practical methods you can start using today and clearly define the boundary where Token stacking doesn’t work—“The Novelty Problem.”

For readers using Agents to write code or run workflows, the information density and actionability are high.

Introduction

Well, you have to admit, this title is quite eye-catching—but honestly, it’s no joke.

In 2023, when we were already using LLMs to generate production code, people around us were stunned, because the common perception was that LLMs could only produce useless garbage. But we knew something others didn't: the output quality of an Agent is a function of the Tokens you put in. That's it.

You can see this yourself with a few experiments. Have an Agent complete a complex, somewhat obscure programming task—like implementing a constrained convex optimization algorithm from scratch. First, run it at the lowest thinking level; then switch to the highest level, have it review its own code, and see how many bugs it can find. Try intermediate and high levels as well. You’ll intuitively see that the number of bugs decreases monotonically as you increase the Token input.
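
This experiment is easy to script. The sketch below assumes a hypothetical `run_agent(task, level)` call that returns generated code and a `count_bugs` review step; both are stand-ins for whatever model API and review process you actually use.

```python
from typing import Callable, Dict, List

def compare_thinking_levels(
    run_agent: Callable[[str, str], str],
    count_bugs: Callable[[str], int],
    task: str,
    levels: List[str],
) -> Dict[str, int]:
    """Run the same task at each thinking level and count bugs in the output.

    `run_agent` and `count_bugs` are hypothetical stand-ins: plug in your own
    model call and your own review pass (human, or a second agent).
    """
    results: Dict[str, int] = {}
    for level in levels:
        code = run_agent(task, level)
        results[level] = count_bugs(code)
    return results

# Demo with fake callables, just to show the shape of the result:
fake_outputs = {"low": "aaa", "medium": "aa", "high": "a"}
report = compare_thinking_levels(
    run_agent=lambda task, level: fake_outputs[level],
    count_bugs=len,  # pretend every character is a bug
    task="implement constrained convex optimization from scratch",
    levels=["low", "medium", "high"],
)
print(report)  # {'low': 3, 'medium': 2, 'high': 1}
```

If the real bug counts decrease monotonically as the level rises, you have reproduced the effect the article describes.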

It’s not hard to understand, right?

More Tokens = Fewer Errors. You can push this logic further—this is essentially the core idea behind code review products (albeit simplified). In a completely new context, feeding in massive amounts of Tokens (for example, having it parse code line-by-line to identify bugs) can detect most, if not all, bugs. This process can be repeated ten, a hundred times, each time examining the codebase from a different perspective, ultimately uncovering all bugs.
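
The "examine the codebase from a different perspective each time" idea can be sketched as a simple union over review passes. The `review` callable below is a hypothetical stand-in for one agent pass run in a fresh context with a single focus (e.g. concurrency, error handling).

```python
from typing import Callable, Iterable, Set

def multi_pass_review(
    review: Callable[[str, str], Set[str]],
    code: str,
    perspectives: Iterable[str],
) -> Set[str]:
    """Review the same code once per perspective, each in a fresh context,
    and merge every bug found. `review(code, perspective)` is a stand-in
    for a single focused agent pass."""
    found: Set[str] = set()
    for perspective in perspectives:
        found |= review(code, perspective)
    return found

# Fake reviewer: each perspective spots a partly overlapping set of bugs.
fake = {
    "concurrency": {"race in cache", "off-by-one"},
    "error handling": {"swallowed exception", "off-by-one"},
}
bugs = multi_pass_review(lambda code, p: fake[p], "...", fake)
print(sorted(bugs))  # ['off-by-one', 'race in cache', 'swallowed exception']
```

Each extra perspective costs a full pass of Tokens; the payoff is that the union catches bugs no single pass would.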

The idea that “more Tokens improve Agent quality” is also empirically supported: teams claiming to use Agents to write production-ready code directly are either the model providers themselves or companies with extremely deep pockets.

So, if you’re still struggling with Agents not producing production-level code—frankly, the problem is on your side. Or, more precisely, in your wallet.

How to judge if you’ve used enough Tokens

I’ve written an entire article arguing that the problem is definitely not your framework (harness) and that “keeping it simple” can still produce excellent results, and I still hold that view. You read that article, followed its advice, but still found the Agent’s output disappointing. You DM’d me; I read it but didn’t reply.

This article is my reply.

In most cases, poor Agent performance and inability to solve problems stem from insufficient Token investment.

How many Tokens a problem needs depends entirely on its scale, complexity, and novelty.

“2+2 equals what?” doesn’t require many Tokens.

“Help me write a bot that scans all markets between Polymarket and Kalshi, identifies semantically similar markets that should settle before or after each other, sets arbitrage boundaries, and automatically trades with low latency when arbitrage opportunities appear”—that requires a huge amount of Tokens.

In practice, we’ve found an interesting phenomenon.

If you invest enough Tokens to handle problems caused by scale and complexity, the Agent can solve them no matter what. In other words, if you want to build something extremely complex with many components and lines of code, just pour enough Tokens into these problems, and they will eventually be thoroughly solved.

There is a small but important exception.

Your problem can’t be too novel. At this stage, no amount of Tokens can solve the “novelty” problem. While enough Tokens can reduce errors caused by complexity to zero, they can’t make the Agent invent things it doesn’t know.

This conclusion is actually a relief.

We spent enormous effort, burning many, many Tokens, trying to see whether we could get the Agent to reconstruct an institutional investment process with almost no guidance. Part of the reason was to understand how many years it will take before we (quantitative researchers) are fully replaced by AI. The result? The Agent simply can’t approximate a proper institutional investment process. We believe this is because it has never seen such a thing; institutional investment processes are simply absent from the training data.

So, if your problem is novel, don’t expect to solve it by stacking Tokens. You need to guide the exploration yourself. But once you’ve determined the implementation plan, you can confidently pour Tokens into execution—no matter how large the codebase or how complex the components.

Here’s a simple heuristic: your Token budget should grow proportionally with the number of lines of code.
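
As a back-of-the-envelope version of that heuristic: pick a tokens-per-line coefficient and scale linearly. The 200 tokens/line figure below is an illustrative assumption, not a measured constant; calibrate it against your own projects.

```python
def token_budget(lines_of_code: int, tokens_per_line: int = 200) -> int:
    """Linear token-budget heuristic: budget grows with expected LOC.

    The default of 200 tokens per line is an assumed, illustrative
    coefficient, not a measured one; tune it to your own codebase.
    """
    return lines_of_code * tokens_per_line

print(token_budget(50))     # small script: 10000 tokens
print(token_budget(5_000))  # mid-size codebase: 1000000 tokens
```

The point is not the exact coefficient but the shape: if the codebase grows 10x, be prepared to spend roughly 10x the Tokens.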

What exactly are more Tokens doing?

In practice, additional Tokens usually improve engineering quality through several means:

  • Allowing the Agent to spend more time reasoning in a single attempt, giving it a chance to discover logical errors itself. Deeper reasoning = better planning = higher hit rate.

  • Permitting multiple independent attempts, exploring different solution paths. Some paths are better than others. Multiple tries enable selecting the best.

  • Similar to above, more independent planning attempts let it abandon weak directions and keep the most promising ones.

  • More Tokens allow critique of previous work in a new context, giving the Agent a chance to improve rather than stay stuck in reasoning inertia.

  • And my favorite: more Tokens enable testing and tooling. Running code to see if it works is the most reliable way to verify correctness.
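
The second and third bullets above amount to a best-of-N selection loop. The sketch below assumes hypothetical `attempt` and `score` callables standing in for an independent agent run (fresh context per seed) and a verifier (tests passed, reviewer rating, and so on).

```python
from typing import Callable, Tuple

def best_of_n(
    attempt: Callable[[int], str],
    score: Callable[[str], float],
    n: int,
) -> Tuple[str, float]:
    """Spend n times the Tokens: run n independent attempts, each in a
    fresh context, then keep the highest-scoring candidate. `attempt`
    and `score` are stand-ins for a model call and a verifier."""
    candidates = [attempt(seed) for seed in range(n)]
    scored = [(c, score(c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

# Fake attempt/score pair: longer "solutions" score higher here.
best, best_score = best_of_n(attempt=lambda seed: "x" * seed, score=len, n=5)
print(len(best), best_score)  # 4 4
```

The cost is linear in N, but weak paths get discarded instead of polished, which is exactly the "abandon weak directions" behavior described above.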

This logic works because engineering failures of the Agent are not random. They are almost always caused by choosing the wrong path too early, not checking whether that path is feasible (early on), or lacking enough budget to recover and backtrack after errors are found.

That’s the story. Tokens are, quite literally, the decision quality you buy. Think of it as research work: ask a person to answer a difficult question on the spot, and the quality of their answer drops as time pressure increases.

Research is ultimately what produces the underlying “knowing the answer.” Humans spend biological time to produce better answers; Agents spend more computational time.

How to improve your Agent

You might still be skeptical, but many papers support this—honestly, the existence of a “reasoning” knob is proof enough that you need it.

One paper I particularly like fine-tuned a model on a small set of carefully curated reasoning samples, then forced the model to keep thinking when it wanted to stop by appending “Wait” at the point where it intended to halt. This alone boosted performance on a benchmark from 50% to 57%.
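
The trick itself is mechanically simple. The sketch below is a loose illustration, not the paper’s implementation: `generate` is a hypothetical completion call that extends the text, and `stop_token` is an assumed marker for the end of the reasoning trace.

```python
from typing import Callable

def budget_forcing(
    generate: Callable[[str], str],
    prompt: str,
    min_continuations: int,
    stop_token: str = "</think>",
) -> str:
    """Illustrative budget forcing: whenever the model tries to end its
    reasoning, strip the stop token, append " Wait", and make it continue.
    `generate` and `stop_token` are assumed stand-ins, not a real API."""
    text = prompt
    for _ in range(min_continuations):
        text = generate(text)
        if text.endswith(stop_token):
            text = text[: -len(stop_token)] + " Wait"
        else:
            break  # model is still thinking; nothing to force
    return text

# Fake generator that always tries to stop immediately:
out = budget_forcing(lambda t: t + " step.</think>", "Q:", min_continuations=2)
print(out)  # Q: step. Wait step. Wait
```

Each forced continuation is pure extra Token spend, and that spend is where the benchmark gain comes from.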

Let me be straightforward: if you’re always complaining that your Agent’s code is mediocre, single-shot maximum reasoning levels are probably still not enough.

I have two very simple solutions for you.

Simple Solution 1: WAIT

Start today: set up an automatic loop—after building the initial prompt, have the Agent review in a new context N times, fixing issues each time.
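
That loop fits in a few lines. The `review` and `fix` callables below are hypothetical stand-ins for two agent calls, each run in a fresh context; the loop stops early if a pass comes back clean.

```python
from typing import Callable, List

def review_fix_loop(
    review: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
    artifact: str,
    n: int,
) -> str:
    """Solution 1 as code: after the initial build, run up to n review
    passes, each in a new context, fixing whatever each pass finds.
    `review` and `fix` are stand-ins for two separate agent calls."""
    for _ in range(n):
        issues = review(artifact)
        if not issues:
            break  # a pass found nothing; stop burning Tokens
        artifact = fix(artifact, issues)
    return artifact

# Fake pair: each fix removes one trailing "!" (a pretend bug).
fixed = review_fix_loop(
    review=lambda a: ["trailing bang"] if a.endswith("!") else [],
    fix=lambda a, issues: a[:-1],
    artifact="code!!!",
    n=10,
)
print(fixed)  # code
```

N is your Token dial: each extra iteration is one more full review-and-fix pass bought with compute.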

If you find this simple trick improves your engineering results, then you understand that your problem is just Token quantity—so join the Token-burning club.

Simple Solution 2: VERIFY

Have the Agent verify its work early and often. Write tests to confirm that the chosen path actually works. This is especially useful for highly complex, deeply nested projects—where one function is called by many downstream functions. Catching errors upstream saves you a lot of subsequent computation (Tokens). So, if possible, set verification checkpoints throughout the process.

After the Agent says “done,” have a second Agent verify the output. Different streams of thought can catch each other’s systematic biases.
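
Both ideas, checkpoints throughout the process and a final second-agent pass, can be wired as an ordered chain that halts at the first failure, so upstream errors are caught before more Tokens are spent downstream. Each check below is a hypothetical stand-in for “run this layer’s tests” or “have a second agent verify this output in a fresh context.”

```python
from typing import Callable, Dict, List

def run_checkpoints(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run verification checkpoints in order and stop at the first
    failure. Each check is a stand-in for a test suite or a
    second-agent verification pass."""
    passed: List[str] = []
    for name, check in checks.items():
        if not check():
            raise RuntimeError(f"checkpoint failed: {name}")
        passed.append(name)
    return passed

# Fake checks: the data layer passes, the API layer fails.
checks = {"data layer": lambda: True, "api layer": lambda: False}
try:
    run_checkpoints(checks)
except RuntimeError as err:
    print(err)  # checkpoint failed: api layer
```

Failing at the “api layer” checkpoint means nothing downstream of it runs, which is precisely the upstream-error saving the paragraph above describes.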

That’s about it. I could write much more on this topic, but I believe that just understanding these two points and executing them well can solve 95% of your problems. I firmly believe in doing simple things to the extreme and adding complexity only as needed.

I mentioned that “novelty” is a problem that cannot be solved with Tokens alone. I want to emphasize this again because you will eventually encounter this trap and come crying to me that stacking Tokens is useless.

When your problem isn’t in the training set, you are the one who truly needs to provide a solution. Domain expertise remains critically important.
