Why You Keep Hitting Token Limits (And How to Fix It)

You have been there. You are mid-flow, building something with an LLM API, and then it happens: Error: maximum context length exceeded. Or worse -- you do not hit a hard limit, but your monthly invoice is three times what you budgeted because every request is stuffing 80,000 tokens into a 128K context window.

The frustrating part? Most of that token usage is waste. Not because your code is bad, but because nobody taught us how to think about tokens the way we think about memory or bandwidth.

This post breaks down what tokens actually are, why your context window fills up so fast, the five most common waste patterns we see in production codebases, and how to find and fix them.

What Are Tokens, Really?

Before we fix the problem, let us make sure we agree on what a token is.

A token is the smallest unit of text that a language model processes. It is not a character and it is not a word -- it is somewhere in between. For English text:

  • 1 token is roughly 4 characters or about 0.75 words
  • The word "optimization" is 2-3 tokens, depending on the tokenizer
  • A line of Python code like response = client.chat.completions.create( is about 12 tokens
  • A 1,000-word document is roughly 1,300 tokens (use tiktoken for exact counts — line count is unreliable since a dense one-liner can be 50+ tokens while a blank line is 1)
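
These rules of thumb are easy to turn into code. A minimal sketch of the 4-characters-per-token heuristic (a budgeting estimate only -- use a real tokenizer such as tiktoken when you need exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text: ~4 characters per token.

    Good enough for budgeting; use a real tokenizer (e.g. tiktoken)
    when exact counts matter.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("optimization"))  # 12 chars -> ~3 tokens
print(estimate_tokens("a" * 4000))      # ~1,000 tokens
```

The `max(1, ...)` floor reflects that even tiny strings cost at least one token.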

Every token costs money. OpenAI, Anthropic, and Google all price their APIs per 1 million tokens. The rates vary by model and provider, but the principle is the same: more tokens = more money.

Here is what catches people off guard: both your input (prompt + context) and the model's output (response) count toward your bill. And input tokens are not free just because they are "cheaper" than output tokens -- when you are sending 50K input tokens per request, that adds up fast.
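
Both sides of the bill fit in one small cost model. A sketch with illustrative per-million-token prices (check your provider's current rate card -- these numbers go stale):

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    in_price: float = 2.50,   # USD per 1M input tokens (illustrative)
    out_price: float = 10.00,  # USD per 1M output tokens (illustrative)
) -> float:
    """USD cost of a single request, input and output combined."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A 50K-token prompt with a modest 1K-token reply
print(f"${request_cost(50_000, 1_000):.3f}")  # $0.135 per request
```

Note that even with a short reply, the "cheap" input tokens dominate the bill here.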

Why Your Context Window Fills Up So Fast

Modern models advertise massive context windows. Claude supports 200K tokens. GPT-4o handles 128K. Gemini 2.0 Flash goes up to 1M. So why are you still hitting limits?

Because context windows are not just for your question. Here is what actually fills them:

  1. System prompt: 200-2,000 tokens. Present on every single request.
  2. Conversation history: Grows linearly. By message 15, you might have 10K+ tokens of history.
  3. RAG context: Your retrieval pipeline pulls "relevant" documents. Often 5,000-20,000 tokens.
  4. Code context: Pasting a file for analysis? A single 500-line file is 4,000-6,000 tokens.
  5. Tool/function definitions: If you are using function calling, each tool definition is 100-500 tokens.
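
Of the five, conversation history is the only one that grows without bound. A sliding-window trimmer is the usual fix; here is a sketch assuming messages are role/content dicts and using a rough 4-characters-per-token estimate:

```python
def trim_history(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    """Keep the system message plus the newest turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    kept, used = [], 0
    for msg in reversed(rest):  # walk newest to oldest
        cost = len(msg["content"]) // 4 + 4  # rough estimate + per-message overhead
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost

    return system + list(reversed(kept))
```

Smarter variants summarize the dropped turns into a single message instead of discarding them outright, trading a little output cost for preserved context.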

Add those up for a typical RAG-enhanced coding assistant request:

| Component | Tokens |
|-----------|--------|
| System prompt | 800 |
| Conversation history (10 turns) | 8,000 |
| RAG context (5 chunks) | 6,000 |
| Current code file | 4,500 |
| Tool definitions (8 tools) | 2,400 |
| User question | 200 |
| Total input | 21,900 |

That is 21,900 input tokens per request. At GPT-4o pricing ($2.50 per 1M input tokens), that is $0.055 per request. At 1,000 requests per engineer per day across a team of five, you are looking at roughly $275/day -- over $8,000 per month just on input tokens.
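
That arithmetic is worth automating so you can budget before you ship. A small cost model (the price is hardcoded as of writing -- check your provider's current pricing):

```python
def daily_input_cost(
    tokens_per_request: int,
    requests_per_day: int,
    usd_per_million_tokens: float = 2.50,  # GPT-4o input price at time of writing
) -> float:
    """Daily spend on input tokens alone."""
    return tokens_per_request * requests_per_day * usd_per_million_tokens / 1_000_000

# 1,000 requests per engineer, 5 engineers
print(f"${daily_input_cost(21_900, 5_000):.2f}/day")  # $273.75/day
```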

The worst part? Most of those 21,900 tokens are doing nothing useful.

The 5 Most Common Token Waste Patterns

After scanning thousands of production codebases, these are the patterns we see again and again.

1. Sending Entire Files When You Need 10 Lines

This is the most common waste pattern. You need the model to understand one function, but you send the entire 800-line file.

The wasteful pattern:

# Wasteful: sending entire file for a single function analysis
with open("services/user_service.py") as f:
    full_file = f.read()  # 800 lines, ~6,000 tokens

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Review this function for bugs:\n\n{full_file}"
    }]
)

The optimized pattern:

# Optimized: extract only the relevant function
import ast

def extract_function(filepath: str, func_name: str) -> str:
    with open(filepath) as f:
        source = f.read()  # read once; reused for parsing and extraction
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Include async defs so "async def create_user" is found too
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == func_name:
            return ast.get_source_segment(source, node)
    return ""

relevant_code = extract_function(
    "services/user_service.py", "create_user"
)  # ~40 lines, ~300 tokens

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Review this function for bugs:\n\n{relevant_code}"
    }]
)

Savings: ~5,700 tokens per request (95% reduction)

2. Redundant System Prompts on Every API Call

Your system prompt is sent with every single request. If it is 2,000 tokens and you make 500 requests per hour, that is 1 million tokens per hour just on system prompts.

The wasteful pattern:

// Wasteful: massive system prompt repeated on every call
const SYSTEM_PROMPT = `You are an expert software engineer...
[800 words of instructions, examples, formatting rules,
persona details, edge case handling, output schemas,
few-shot examples, and chain-of-thought instructions]`;

// This runs 500 times per hour
async function analyzeCode(code: string) {
  return openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: SYSTEM_PROMPT },  // 2,000 tokens every time
      { role: "user", content: code },
    ],
  });
}

The optimized pattern:

// Optimized: minimal system prompt + task-specific context
const CORE_PROMPT = `You are a code reviewer. Return JSON with
{issues: [{line, severity, message}]}.`;  // 35 tokens

// Few-shot examples only when needed for complex tasks
const FEW_SHOT = `Example: {issues: [{line: 12, severity: "high",
message: "SQL injection via string concatenation"}]}`;

async function analyzeCode(code: string, needsExamples = false) {
  const system = needsExamples
    ? `${CORE_PROMPT}\n${FEW_SHOT}`  // 80 tokens for complex cases
    : CORE_PROMPT;                     // 35 tokens for simple cases

  return openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: system },
      { role: "user", content: code },
    ],
  });
}

Savings: ~1,965 tokens per request (98% reduction on system prompt)

Important caveat: This optimization only applies when your system prompt genuinely contains padding — conversational filler, redundant examples, or instructions the model does not need. If your task requires detailed multi-step reasoning instructions, few-shot examples for edge cases, or strict output schemas, stripping the prompt will hurt quality. Always A/B test prompt changes against your evaluation suite before deploying to production.

3. Over-Stuffed RAG Retrieval

RAG (Retrieval-Augmented Generation) is powerful, but most implementations retrieve far more context than the model needs. Pulling 20 chunks when 3 would suffice is like loading an entire database table when you need one row.

The wasteful pattern:

# Wasteful: retrieve 20 chunks regardless of query complexity
results = vectorstore.similarity_search(
    query=user_question,
    k=20  # Always 20 chunks, each ~500 tokens = 10,000 tokens
)

context = "\n\n".join([doc.page_content for doc in results])

The optimized pattern:

# Optimized: adaptive retrieval with relevance filtering
results = vectorstore.similarity_search_with_score(
    query=user_question,
    k=10  # Fetch candidates
)

# Filter by relevance score -- only keep high-quality matches.
# Note: some stores return a distance (lower = better) instead of a
# similarity (higher = better); flip the comparison if yours does.
relevant = [
    doc for doc, score in results
    if score > 0.78  # Threshold tuned to your embedding model
][:5]  # Cap at 5 chunks max = ~2,500 tokens

context = "\n\n".join([doc.page_content for doc in relevant])

Savings: 5,000-7,500 tokens per request (50-75% reduction)

4. Re-Embedding Unchanged Documents on Every Deploy

Every time you deploy, your RAG pipeline re-embeds every document in your knowledge base. If nothing changed, you just burned thousands of embedding API calls for nothing.

The wasteful pattern:

# Wasteful: re-embed everything on startup
def refresh_knowledge_base():
    docs = load_all_documents()  # 500 documents
    embeddings = embed_documents(docs)  # 500 API calls every deploy
    vectorstore.add(docs, embeddings)

The optimized pattern:

# Optimized: content-hash check, only embed changes
import hashlib

def refresh_knowledge_base():
    docs = load_all_documents()
    existing_hashes = vectorstore.get_metadata("content_hash")

    new_or_changed = [
        doc for doc in docs
        if hashlib.sha256(doc.content.encode()).hexdigest()
        not in existing_hashes
    ]

    if new_or_changed:
        embeddings = embed_documents(new_or_changed)
        vectorstore.upsert(new_or_changed, embeddings)

Savings: 90-99% reduction in embedding API calls on unchanged deploys

5. Verbose Prompt Templates That Could Be 60% Shorter

Developers write prompts like documentation -- thorough, readable, full of examples. But the model does not need your prompt to be readable to humans. Every unnecessary word is a token you pay for.

The wasteful pattern:

# Wasteful: verbose, human-readable prompt (380 tokens)
prompt = """
Please carefully analyze the following piece of source code
and identify any potential security vulnerabilities that may
exist within it. For each vulnerability that you find, please
provide the following information in a structured format:

1. The specific line number where the vulnerability occurs
2. A classification of the severity (using the scale: low,
   medium, high, or critical)
3. A detailed description of the vulnerability
4. A concrete suggestion for how to fix the vulnerability

Please be thorough in your analysis and consider common
vulnerability categories including but not limited to:
SQL injection, XSS, CSRF, authentication bypass, insecure
deserialization, and path traversal.
"""
# Optimized: concise, same output quality (95 tokens)
prompt = """Analyze for security vulnerabilities. Return JSON array:
[{line: int, severity: "low"|"medium"|"high"|"critical",
  vuln: string, fix: string}]
Categories: SQLi, XSS, CSRF, auth bypass, deserialization, path traversal."""

Savings: ~285 tokens per request (75% reduction)

The optimized version produces the same quality output. Models are trained on structured instructions -- they do not need conversational padding to understand what you want.

How to Find Token Waste in Your Codebase

You have two options.

The Manual Approach

  1. Search your codebase for all LLM API calls (openai.chat.completions.create, anthropic.messages.create, genai.GenerativeModel, etc.)
  2. For each call, estimate the token count of every input
  3. Check for the five patterns above
  4. Calculate costs using provider pricing tables
  5. Prioritize fixes by estimated savings

This works for small codebases (under 10 files that make LLM calls). For anything larger, it is tedious and error-prone. You will miss calls buried in utility functions, overlook dynamic prompt construction, and underestimate token counts for variable-length inputs.

The Automated Approach

Run a scanner that does all of this automatically:

  1. Parse every file using AST analysis (not regex) to find all LLM API calls
  2. Trace token flow from construction to API call, including dynamic inputs
  3. Calculate actual costs using live model pricing
  4. Flag waste patterns with specific line numbers and fix suggestions
  5. Generate a report with total spend, savings potential, and actionable diffs

How erabot.ai Solves This

erabot.ai automates the entire process. Upload your code, connect your GitHub repo, or run the CLI in your project directory. In under five minutes, you get:

  • Exact cost breakdown: How much each LLM call costs per request and per month
  • Waste pattern detection: Every instance of the five patterns above, with line numbers
  • Actionable fix diffs: Copy-paste code changes that reduce token usage
  • Savings projection: How much you will save after applying each fix
  • AI-ready report: A markdown file you can feed directly to Claude Code or Cursor to auto-apply the fixes

The free tier gives you 3 scans per month with full findings. No credit card required.

Most teams find 30-60% token waste on their first scan. That translates to thousands of dollars per month in savings -- often paying for itself before you finish reading the report.

Start Scanning

If you are spending more than $100/month on AI APIs, you almost certainly have token waste hiding in your codebase. The question is not whether it is there -- it is how much.

Scan your codebase free at erabot.ai and find out in five minutes.