AI Agent Outputting Garbage? The Problem Is You're Not Willing to Burn Tokens
The issue isn’t with the prompt!
Author: Systematic Long Short
Translation: Deep Tide TechFlow
Deep Tide Guide: The core point of this article is simple: the quality of AI Agent outputs is directly proportional to the number of Tokens you invest.
The author isn’t just discussing theory; they provide two practical methods you can start using today, clearly defining the boundary where Token stacking no longer helps—the “novelty problem.”
For readers using Agents to write code or run workflows, the information density and actionability are high.
Introduction
Well, you have to admit, this title is pretty eye-catching—but honestly, it’s no joke.
Back in 2023, when we were already using LLMs to generate production code, everyone around us was stunned, because the common belief was that LLMs could only produce useless junk. But we knew something others didn't: the output quality of an Agent is a function of the Tokens you put in. That's it.
You can see this yourself with a few experiments. Have an Agent complete a complex, somewhat obscure programming task—like implementing a constrained convex optimization algorithm from scratch. First, run it at the lowest thinking level; then switch to the highest level, have it review its own code, and see how many bugs it can find. Try intermediate and high levels as well. You’ll intuitively see that the number of bugs decreases monotonically as you increase the Token input.
It’s not hard to understand, right?
More Tokens = fewer errors. You can push this logic further—this is essentially the core idea behind code review tools (albeit simplified). In a completely new context, feeding in a massive number of Tokens (for example, having it parse code line-by-line to identify bugs) can detect most, if not all, bugs. This process can be repeated ten, a hundred times, each time examining the code from a different angle, ultimately uncovering all bugs.
The idea that “more Token expenditure improves Agent quality” is also empirically supported: teams claiming to use Agents to write production-ready code directly are either the model providers themselves or companies with extremely deep pockets.
So, if you’re still struggling with Agents not producing production-level code—frankly, the problem is on your side. Or rather, in your wallet.
How to judge if you’re spending enough Tokens
I've written an entire article arguing that the problem isn't your framework (harness), and that "keeping it simple" can still produce excellent results; I still hold that view. But perhaps you read that article, followed its advice, and still found the Agent's output disappointing. You DMed me; I read it but didn't reply.
This article is my reply.
Most of the time, poor Agent performance and inability to solve problems come down to insufficient Token expenditure.
How many Tokens a problem needs depends entirely on its scale, complexity, and novelty.
“2+2 equals what?” doesn’t require many Tokens.
“Help me write a bot that scans all markets between Polymarket and Kalshi, identifies semantically similar markets that should settle around the same event, sets arbitrage boundaries, and automatically trades with low latency when arbitrage opportunities appear”—that requires a huge amount of Tokens.
In practice, we’ve found an interesting phenomenon.
If you invest enough Tokens to handle problems driven by scale and complexity, the Agent can solve them no matter what. In other words, if you want to build something extremely complex with many components and lines of code, just pour enough Tokens into these problems, and they will eventually be thoroughly solved.
There’s a small but important exception.
Your problem can’t be too novel. At the current stage, no amount of Tokens can solve the “novelty” problem. Sufficient Tokens can eliminate errors caused by complexity, but they can’t make the Agent invent things it doesn’t know.
This conclusion actually relieves us.
We spent enormous effort, burning many, many Tokens, trying to see if we could get the Agent to reconstruct an institutional investment process with almost no guidance. Part of the reason was to understand how many years it will take before we (quant researchers) are fully replaced by AI. The result? The Agent simply can't approximate a proper institutional investment process. We believe this is because it has never seen one: institutional investment processes are simply absent from the training data.
So, if your problem is novel, don’t expect to solve it by stacking Tokens. You need to guide the exploration yourself. But once you’ve determined the implementation plan, you can confidently pour Tokens into execution—no matter how large the codebase or how complex the components.
Here’s a simple heuristic: your Token budget should grow proportionally with the number of code lines.
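As a toy illustration of that heuristic, a budget function might look like the sketch below. The base cost and per-line multiplier are made-up placeholder numbers, not measurements from the article:

```python
def token_budget(lines_of_code: int,
                 base: int = 20_000,
                 tokens_per_line: int = 500) -> int:
    """Illustrative token budget: a fixed base for planning plus a
    linear term for every line of code the Agent must produce.
    Both constants are placeholders, not measured values."""
    return base + tokens_per_line * lines_of_code
```

Under these made-up constants, a 50-line script would get a budget of 45,000 tokens, while a 2,000-line project would get just over a million.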
What exactly are more Tokens doing?
In practice, additional Tokens usually improve engineering quality through several means:
Letting the Agent spend more time reasoning in a single attempt, giving it a chance to discover logical errors itself. Deeper reasoning = better planning = higher hit rate.
Allowing multiple independent attempts, exploring different solution paths. Some paths are better than others. Multiple tries enable it to pick the best.
Similar to the above, more independent planning attempts let it abandon weak directions and retain promising ones.
More Tokens enable critique of its previous work in a new context, giving it a chance to improve rather than being stuck in a “reasoning inertia.”
And my favorite: more Tokens mean it can use tests and tools to verify its work. Running code to see if it works is the most reliable way to confirm correctness.
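The "multiple independent attempts" idea above can be sketched as a best-of-N loop. Here `generate` and `score` are hypothetical stand-ins for your model call and your verifier (for example, the fraction of tests that pass):

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def best_of_n(generate: Callable[[], T],
              score: Callable[[T], float],
              n: int) -> T:
    """Spend n independent attempts (roughly n times the Tokens)
    and keep the attempt that scores highest under your verifier."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

The Token cost scales linearly with `n`, which is exactly the trade the article describes: you buy extra solution paths and keep the best one.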
This logic works because engineering failures of the Agent are not random. They are almost always caused by prematurely choosing the wrong path, not checking whether that path is viable early on, or lacking enough budget to recover and backtrack after errors are found.
That’s the story. Tokens, in a literal sense, are the decision quality you buy. Think of it as research work: if you ask a person to answer a difficult question on the spot, the quality of their answer drops as time pressure increases.
Research is ultimately the work of producing a known answer. Humans spend biological time to produce better answers; Agents spend more computational time.
How to improve your Agent
You might still be skeptical, but many papers support this—honestly, the existence of a “reasoning” knob is all the proof you need.
One paper I particularly like trained on a small set of carefully curated reasoning samples, then used a method to force the model to keep thinking when it wanted to stop—by appending “Wait” at the stopping point. Just that alone boosted a benchmark from 50% to 57%.
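The "Wait" trick the paper describes can be sketched as the loop below. `generate` is a hypothetical completion function, and the exact concatenation is an assumption for illustration, not the paper's implementation:

```python
from typing import Callable

def budget_force(generate: Callable[[str], str],
                 prompt: str,
                 extra_rounds: int = 2) -> str:
    """Sketch of forced extra thinking: when the model would stop,
    append 'Wait' to its own trace and ask it to keep reasoning."""
    trace = generate(prompt)
    for _ in range(extra_rounds):
        # Feed the model its own output plus 'Wait' to force more Tokens.
        trace = trace + "\nWait" + generate(prompt + trace + "\nWait")
    return trace
```

Each extra round is pure additional Token spend on the same problem, which is the knob the benchmark improvement came from.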
Let me be straightforward: if you’ve been complaining that your Agent’s code is mediocre, the single highest thinking level might still be insufficient.
Here are two very simple solutions:
Simple Solution 1: WAIT
Start today: build an automatic loop—after constructing the initial prompt, have the Agent review its output N times in a new context, fixing issues each time.
If you find this simple trick improves your engineering results, at least you understand that the problem is just Token count—so join the Token-burning club.
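A minimal sketch of that review loop. `run_agent` is a hypothetical function wrapping your model call, and the review prompt wording is an assumption rather than the author's exact setup:

```python
from typing import Callable

def review_loop(task: str,
                run_agent: Callable[[str], str],
                n_reviews: int = 3) -> str:
    """WAIT: generate once, then spend extra Tokens on n_reviews
    self-review passes, each one built as a fresh prompt."""
    output = run_agent(task)
    for _ in range(n_reviews):
        critique_prompt = (
            f"Task:\n{task}\n\nCurrent solution:\n{output}\n\n"
            "Review this solution in a fresh context. Find bugs and "
            "logic errors, then return a corrected solution."
        )
        output = run_agent(critique_prompt)  # fresh context each pass
    return output
```

Total spend is roughly `(n_reviews + 1)` model calls per task; raising `n_reviews` is the literal "burn more Tokens" dial.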
Simple Solution 2: VERIFY
Have the Agent verify its work early and often. Write tests to confirm that the chosen path actually works. This is especially useful for highly complex, deeply nested projects—where one function is called by many downstream functions. Catching errors upstream can save you a lot of subsequent computation (Tokens). So, if possible, set verification checkpoints throughout the build process.
After the main Agent says "done," have a second Agent verify its work. An unrelated reasoning trace can catch systematic biases the first one missed.
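One way to wire in those verification checkpoints is to gate each build step on a green test run. The test command and gating policy below are assumptions, a sketch rather than the author's implementation:

```python
import subprocess
import sys

def checkpoint(test_cmd: list[str]) -> bool:
    """Run the project's test command; return True only if it passes.
    Call this after each component the Agent builds, so errors are
    caught upstream instead of burning Tokens downstream."""
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    return result.returncode == 0

# Example gate (hypothetical test layout):
# if not checkpoint([sys.executable, "-m", "pytest", "tests/"]):
#     ...send the failure back to the Agent for another pass...
```

Because it only inspects the exit code, this works with any test runner, not just pytest.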
That’s about it. I could write much more on this topic, but I believe that just being aware of these two points and executing them well can solve 95% of your problems. I firmly believe in doing simple things to the extreme and adding complexity only as needed.
I mentioned that “novelty” is a problem that cannot be solved with Tokens alone. I want to emphasize this again because you’ll eventually encounter this trap and come crying to me that stacking Tokens doesn’t work.
When your problem isn’t in the training set, you’re the one who truly needs to provide a solution. Domain expertise remains critically important.