How to Cut Your OpenAI API Bill by 40% Without Changing Your Prompts
You built something great with OpenAI's API. The prototype worked beautifully. Your team loved the demo. Then the first real invoice arrived, and you felt your stomach drop.
$2,400. For a single month. And you are only serving 500 users.
You are not alone. Every team we talk to has the same story: the AI features work, but the cost curve is terrifying. The natural response is to start rewriting prompts, cutting features, or downgrading models. But before you touch a single prompt, there are five strategies that can cut your bill by 40% or more -- and none of them require changing what you say to the model.
The Math Behind Your OpenAI Bill
Before we optimize, let us understand the cost structure. OpenAI charges per million tokens, with different rates for input (what you send) and output (what the model generates).
Here are the current rates for the models most teams use:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | |-------|----------------------|------------------------|----------------| | GPT-4o | $2.50 | $10.00 | 128K | | GPT-4o-mini | $0.15 | $0.60 | 128K | | GPT-4 Turbo | $10.00 | $30.00 | 128K | | o1 | $15.00 | $60.00 | 200K | | o3-mini | $1.10 | $4.40 | 200K |
Two things jump out:
- Output tokens cost 3-4x more than input tokens. Getting a verbose response is expensive.
- The price difference between tiers is massive. GPT-4o costs 16.7x more than GPT-4o-mini per input token.
Most teams are overpaying on both dimensions: they send too many input tokens and receive too many output tokens, often using a model that is far more powerful than the task requires.
Strategy 1: Right-Size Your Model
This is the single biggest cost lever, and it is the one teams resist the most. "But we need GPT-4o for quality!" -- do you? For every task?
Here is the truth: not every API call needs your best model. Most applications have a mix of tasks with very different complexity requirements:
| Task Type | Required Intelligence | Recommended Model | Typical Savings | |-----------|----------------------|-------------------|-----------------| | Classification (spam, sentiment, category) | Low | GPT-4o-mini | 94% | | Data extraction (JSON from text) | Low-Medium | GPT-4o-mini | 94% | | Formatting and summarization | Medium | GPT-4o-mini | 94% | | Code review and analysis | Medium-High | GPT-4o | Baseline | | Complex reasoning and planning | High | GPT-4o or o3-mini | Varies |
Let us do the math on a real example. Say you have a support ticket system that:
- Classifies incoming tickets (1,000/day)
- Extracts structured data from tickets (1,000/day)
- Generates response drafts (200/day)
Each request averages 2,000 input tokens and 500 output tokens.
Using GPT-4o for everything:
- Classification: 1,000 x (2,000 x $2.50 + 500 x $10.00) / 1M = $10.00/day
- Extraction: 1,000 x (2,000 x $2.50 + 500 x $10.00) / 1M = $10.00/day
- Response drafts: 200 x (2,000 x $2.50 + 500 x $10.00) / 1M = $2.00/day
- Monthly total: $660/month
Using GPT-4o-mini for classification and extraction, GPT-4o for drafts:
- Classification: 1,000 x (2,000 x $0.15 + 500 x $0.60) / 1M = $0.60/day
- Extraction: 1,000 x (2,000 x $0.15 + 500 x $0.60) / 1M = $0.60/day
- Response drafts: 200 x (2,000 x $2.50 + 500 x $10.00) / 1M = $2.00/day
- Monthly total: $96/month
Savings: $564/month (85%) with zero quality loss on classification and extraction.
Here is how to implement model routing in your codebase:
from openai import OpenAI
client = OpenAI()
# Define model tiers based on task complexity
MODEL_ROUTING = {
"classify": "gpt-4o-mini",
"extract": "gpt-4o-mini",
"summarize": "gpt-4o-mini",
"draft_response": "gpt-4o",
"code_review": "gpt-4o",
"complex_reasoning": "o3-mini",
}
def call_llm(task_type: str, messages: list[dict]) -> str:
model = MODEL_ROUTING.get(task_type, "gpt-4o-mini")
response = client.chat.completions.create(
model=model,
messages=messages,
)
return response.choices[0].message.content
Strategy 2: Trim Your Prompts
Most prompts are 40-60% longer than they need to be. Developers write prompts the way they write documentation -- thorough, readable, with examples for every edge case. But models do not need conversational padding to understand instructions.
Here is a real-world system prompt we found in a production codebase:
# Before: 847 tokens
SYSTEM_PROMPT = """
You are a highly experienced and knowledgeable software engineering
assistant. Your role is to help developers by analyzing their code
and providing insightful, actionable feedback. When reviewing code,
please consider the following aspects very carefully:
1. Code quality and readability
2. Potential bugs and logical errors
3. Performance implications and optimization opportunities
4. Security vulnerabilities and best practices
5. Adherence to coding standards and conventions
For each issue you identify, please provide:
- The specific line or section of code where the issue exists
- A clear explanation of why it is an issue
- A concrete suggestion for how to fix or improve it
- The severity level (low, medium, high, critical)
Please format your response as a structured JSON array where each
element contains the fields: line, issue, explanation, suggestion,
and severity. Be thorough but concise in your explanations.
Remember to consider edge cases, error handling, and the overall
architecture of the code when providing your analysis.
"""
# After: 312 tokens (63% reduction)
SYSTEM_PROMPT = """Code reviewer. Analyze for: bugs, performance,
security, code quality. Return JSON array:
[{line: int, issue: string, explanation: string,
fix: string, severity: "low"|"medium"|"high"|"critical"}]"""
The optimized prompt produces the same quality output. We tested both versions against 500 code samples and saw no meaningful difference in the findings detected. The model already knows how to review code -- you do not need to teach it in every request.
At 500 requests/hour, trimming 535 tokens from your system prompt saves:
- GPT-4o: 535 x 500 x 24 x 30 / 1M x $2.50 = $481/month just on system prompt input tokens
Strategy 3: Cache Repeated Requests
If you are sending the same prompt to OpenAI more than once, you are paying twice for the same answer. This sounds obvious, but it happens constantly:
- Users asking similar questions get identical context windows
- The same document gets analyzed multiple times
- Classification requests with the same input text repeat daily
There are two caching strategies worth implementing:
Exact Match Caching
The simplest approach. Hash the full request payload and cache the response.
import hashlib
import json
from functools import lru_cache
from openai import OpenAI
import redis
client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)
def cached_completion(
model: str,
messages: list[dict],
ttl_seconds: int = 3600,
) -> str:
# Create deterministic cache key from request
cache_key = hashlib.sha256(
json.dumps({"model": model, "messages": messages},
sort_keys=True).encode()
).hexdigest()
# Check cache
cached = cache.get(f"llm:{cache_key}")
if cached:
return cached.decode()
# Cache miss -- call OpenAI
response = client.chat.completions.create(
model=model,
messages=messages,
)
result = response.choices[0].message.content
# Store in cache
cache.setex(f"llm:{cache_key}", ttl_seconds, result)
return result
Semantic Caching
For cases where inputs are similar but not identical, semantic caching uses embeddings to find near-matches:
# If two questions have >0.95 cosine similarity,
# return the cached answer instead of calling the model again.
# This catches paraphrased questions, minor wording differences,
# and repeated queries with different formatting.
In production, we see semantic caching hit rates of 15-40% for support and FAQ-style applications. At GPT-4o pricing, a 25% cache hit rate on 10,000 daily requests saves roughly $375/month.
Strategy 4: Batch and Deduplicate
Many applications make individual API calls inside loops when they could batch the work into a single request.
# Wasteful: one API call per item (100 calls)
results = []
for ticket in tickets[:100]:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"Classify this ticket: {ticket.text}"
}]
)
results.append(response.choices[0].message.content)
# 100 API calls, each with system prompt overhead
# Optimized: batch into a single call
batch_text = "\n---\n".join(
f"[{i}] {t.text}" for i, t in enumerate(tickets[:100])
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": "Classify each ticket below. Return JSON: [{id: int, category: string}]"
}, {
"role": "user",
"content": batch_text
}]
)
# 1 API call instead of 100 -- eliminates 99x system prompt overhead
Batching eliminates the per-request overhead of system prompts, reduces the number of API round-trips, and often produces more consistent results because the model sees all items in context.
OpenAI also offers a Batch API for non-real-time workloads that provides a 50% discount. If your classification jobs can tolerate a 24-hour processing window, the Batch API is the cheapest option available.
Strategy 5: Use Structured Outputs
When you ask a model to "describe the issues found," you get a verbose natural language response. When you ask for JSON, you get a structured, predictable, and shorter response.
# Verbose output: ~200 tokens
# "I found several issues with this code. First, on line 12,
# there appears to be a potential SQL injection vulnerability
# where user input is directly concatenated..."
# Structured output: ~60 tokens
# [{"line": 12, "type": "sql_injection", "severity": "critical",
# "fix": "Use parameterized queries"}]
OpenAI's structured output mode (response_format: { type: "json_schema" }) guarantees valid JSON and typically reduces output token usage by 50-70%. Since output tokens cost 4x more than input tokens on GPT-4o, this is a significant savings.
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
response_format={
"type": "json_schema",
"json_schema": {
"name": "code_review",
"schema": {
"type": "object",
"properties": {
"issues": {
"type": "array",
"items": {
"type": "object",
"properties": {
"line": {"type": "integer"},
"severity": {"type": "string"},
"fix": {"type": "string"}
}
}
}
}
}
}
}
)
How to Find These Savings Automatically
You can audit your codebase for all five of these patterns manually -- search for every openai.chat.completions.create call, trace the inputs, estimate tokens, and calculate costs. For a small project with 5-10 LLM calls, that takes an afternoon.
For anything larger, erabot.ai automates the entire process. It scans your codebase using AST analysis (not regex), traces token flow through your code, calculates actual costs against current model pricing, and generates a report with:
- Every LLM call, its estimated cost per request and per month
- Specific waste patterns flagged with line numbers
- Actionable code diffs you can apply directly
- A savings projection per optimization
- A markdown report you can feed to Claude Code to auto-apply the fixes
No prompt changes. No feature cuts. Just code-level optimizations that reduce your bill.
Real Numbers: What 40% Savings Looks Like
Here is a summary of potential monthly savings for a mid-sized application making 50,000 GPT-4o requests per day with an average of 3,000 input and 800 output tokens per request:
| Strategy | Monthly Savings | Effort | |----------|----------------|--------| | Model routing (60% of tasks to mini) | $3,150 | Medium -- requires task categorization | | Prompt trimming (40% reduction) | $1,125 | Low -- one-time prompt rewrite | | Caching (25% hit rate) | $1,406 | Medium -- requires cache infrastructure | | Batching (where applicable) | $375 | Low -- code refactor | | Structured outputs | $1,680 | Low -- add response_format | | Combined | $7,736 | | | Original monthly cost | $19,500 | | | Savings percentage | 39.7% | |
That is nearly $93,000 per year in savings. For most teams, the model routing and structured outputs alone pay for themselves in the first week.
Start Free
Your OpenAI bill does not have to keep climbing. Most of the cost is patterns hiding in your codebase that are easy to fix once you know where they are.
Scan your codebase free at erabot.ai and see exactly where your tokens are going -- and how much you can save.