Stop Burning Money: How an AI Consultancy Can Optimize Your RAG System Costs


Your RAG System Is Probably Costing You Three Times What It Should

If you've deployed a Retrieval-Augmented Generation (RAG) system and watched your monthly LLM costs climb without a clear explanation, you're not alone. Most organisations build their first RAG pipeline to prove the concept works, then leave it running in that proof-of-concept state indefinitely. The retrieval logic is naive, the context windows are bloated, and every query hits the model cold - no caching, no filtering, no cost controls. An AI consultancy that specialises in production knowledge systems will tell you the same thing: the architecture that gets you to demo day is rarely the architecture that survives contact with real usage volumes.

This article covers the specific levers that reduce RAG operating costs in production - semantic caching, token budgeting, retrieval precision tuning, and infrastructure choices - with enough technical detail to act on.


Why RAG Costs Spiral in Production

RAG costs spiral in production because most implementations treat every query as unique and pass oversized context windows to the LLM on every call. The two primary cost drivers in a RAG system are the number of LLM API calls made and the total tokens processed per call. When neither is controlled, costs scale linearly - or worse - with query volume.

Consider a mid-sized Australian professional services firm running an internal knowledge base over 50,000 documents. Their initial RAG implementation retrieved the top 10 chunks per query, each chunk averaging 400 tokens, and passed them all to GPT-4 with a 500-token system prompt. That's roughly 4,500 input tokens per query before the user's question is even included. At a modest 10,000 queries per month, that adds up to approximately 45 million input tokens. At GPT-4 input pricing of about $30 USD per million tokens, the bill is around $1,350 USD in input costs alone, before output tokens.
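
The arithmetic behind that figure can be checked in a few lines. The sketch below assumes a price of $30 USD per million input tokens, which is the rate implied by the numbers above rather than a quoted price list; the function and parameter names are illustrative. Substitute your provider's current rate.

```python
# Worked example of the input-token bill described above.
# ASSUMPTION: 30 USD per million input tokens, the rate implied by
# "45 million tokens is around 1,350 USD"; check your provider's pricing.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 30.0


def monthly_input_cost(chunks: int, tokens_per_chunk: int,
                       system_prompt_tokens: int, queries_per_month: int,
                       price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS_USD):
    """Return (tokens per query, tokens per month, monthly cost in USD)."""
    tokens_per_query = chunks * tokens_per_chunk + system_prompt_tokens
    tokens_per_month = tokens_per_query * queries_per_month
    cost_usd = tokens_per_month / 1_000_000 * price_per_million
    return tokens_per_query, tokens_per_month, cost_usd


per_query, per_month, cost = monthly_input_cost(
    chunks=10, tokens_per_chunk=400,
    system_prompt_tokens=500, queries_per_month=10_000,
)
print(per_query, per_month, cost)  # 4500 45000000 1350.0
```

Halving either the chunk count or the chunk size roughly halves the bill, which is why the sections below focus on context size and call volume.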

The fix isn't switching models. The fix is the architecture.


Semantic Caching: The Fastest Win Available

Semantic caching is the single highest-leverage optimisation for most RAG deployments. Semantic caching refers to storing the results of previous LLM queries and returning those cached results when a new query is semantically similar - not just lexically identical - to a prior one.

Unlike traditional key-value caching, semantic caching uses vector similarity to match queries. A user asking "What's our parental leave policy?" and another asking "How much parental leave do employees get?" are different strings but semantically equivalent. A properly configured semantic cache handles both with one LLM call.

Implementation approach:

  1. Embed each incoming query using the same embedding model as your retrieval layer (e.g. text-embedding-3-small from OpenAI or a locally hosted bge-m3 model).
  2. Query a vector store (Pinecone, Qdrant, or pgvector) containing embeddings of previously answered questions.
  3. Set a similarity threshold - typically cosine similarity ≥ 0.92 for high-confidence cache hits.
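  4. If a match is found above the threshold, return the cached response directly. If not, proceed with the full RAG pipeline and store the result.

def query_with_cache(user_query: str, cache_store, llm_pipeline, threshold: float = 0.92):
    # embed() is assumed to wrap the same embedding model as the retrieval layer
    query_embedding = embed(user_query)

    # Look up the single nearest previously answered question
    cached_result = cache_store.search(query_embedding, top_k=1)

    # Cache hit: similar enough to reuse the stored answer without an LLM call
    if cached_result and cached_result[0].score >= threshold:
        return cached_result[0].payload["response"]

    # Cache miss: run the full RAG pipeline, then store the answer for next time
    response = llm_pipeline.run(user_query)
    cache_store.upsert(query_embedding, {"response": response})
    return response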

In practice, semantic caching reduces LLM API calls by 30-60% on knowledge base workloads where users ask overlapping questions - which is almost every enterprise deployment.
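
To experiment with the function above before wiring in a real vector store, an in-memory stand-in is enough. The sketch below is a toy under stated assumptions: `InMemoryCacheStore`, `Hit`, and the linear scan are hypothetical names and choices that mirror only the `search`/`upsert` interface used above. Pinecone, Qdrant, and pgvector each expose their own client APIs, which this does not attempt to reproduce.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    score: float    # cosine similarity between the query and a stored embedding
    payload: dict   # whatever was stored alongside that embedding


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid dividing by zero for empty or all-zero vectors
    return dot / (norm_a * norm_b)


class InMemoryCacheStore:
    """Hypothetical stand-in for a vector store; linear scan, no persistence."""

    def __init__(self):
        self._entries = []

    def upsert(self, embedding, payload):
        self._entries.append((list(embedding), payload))

    def search(self, embedding, top_k=1):
        hits = [Hit(cosine_similarity(embedding, e), p) for e, p in self._entries]
        hits.sort(key=lambda h: h.score, reverse=True)  # best match first
        return hits[:top_k]


store = InMemoryCacheStore()
store.upsert([1.0, 0.0], {"response": "cached answer"})
print(store.search([0.99, 0.05])[0].payload["response"])  # cached answer
```

A linear scan is fine for a few thousand cached questions in a prototype; beyond that, an indexed vector store is the more likely choice.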


Token Budgeting: Stop Sending Junk to the Model

Token budgeting is the practice of setting explicit limits on the number of tokens passed to an LLM in each pipeline stage, enforced programmatically rather than left to chance. Most RAG systems have no token budget - they retrieve chunks and concatenate them until the context is full, or until an arbitrary chunk count is hit.

A disciplined token budget has three components:

  • System prompt ceiling: Lock your system prompt at a fixed token count. Audit it quarterly. A system prompt that has grown to 1,200 tokens through accumulated additions is a cost leak.
  • Retrieved context ceiling: Set a hard limit on context tokens - typically 1,500-2,500 tokens for most enterprise Q&A tasks. If your retrieval is returning 6,000 tokens of context, your retrieval precision is the problem, not the budget.
  • Output token ceiling: For structured tasks (summaries, classifications, extractions), set max_tokens explicitly. Leaving it unbounded on a summarisation task is how you generate 800-word answers to yes/no questions.

A practical token budget for a customer support RAG system might look like this:

System prompt:     400 tokens
Retrieved context: 2,000 tokens
User query:        ~150 tokens (95th percentile)
Output ceiling:    500 tokens
─────────────────────────────
Total ceiling:     3,050 tokens per call

Compare that to an unbudgeted system processing 5,000-8,000 tokens per call. At scale, the difference is a 40-60% reduction in per-query cost with no degradation in answer quality - provided your retrieval is returning relevant chunks.
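
Enforcing the retrieved-context ceiling programmatically might look like the sketch below. It is a minimal illustration, not a definitive implementation: `fit_context_to_budget` and `approx_token_count` are hypothetical names, and the whitespace word count is a crude stand-in for your model's real tokenizer.

```python
from typing import Callable, List


def approx_token_count(text: str) -> int:
    # ASSUMPTION: whitespace word count as a rough proxy for tokens;
    # swap in your model's tokenizer for accurate numbers.
    return len(text.split())


def fit_context_to_budget(chunks: List[str], max_context_tokens: int = 2000,
                          count_tokens: Callable[[str], int] = approx_token_count
                          ) -> List[str]:
    """Keep chunks, in ranked order, until the next one would exceed the ceiling."""
    selected, used = [], 0
    for chunk in chunks:  # chunks are assumed to be sorted most-relevant first
        cost = count_tokens(chunk)
        if used + cost > max_context_tokens:
            break  # stop at the first overflow so rank order is preserved
        selected.append(chunk)
        used += cost
    return selected


ranked_chunks = ["a b c", "d e", "f g h i"]
print(fit_context_to_budget(ranked_chunks, max_context_tokens=5))  # ['a b c', 'd e']
```

Stopping at the first overflow, rather than skipping ahead to smaller chunks, keeps the most relevant material and makes the budget's effect easy to reason about.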


Retrieval Precision: Fewer Chunks, Better Answers

Poor retrieval precision is the root cause of most token waste in RAG systems. Retrieving 10 loosely relevant chunks when 2 highly relevant chunks would suffice multiplies your context cost fivefold and typically degrades answer quality by introducing noise.

Three techniques materially improve retrieval precision:

Hybrid search combines dense vector retrieval with BM25 keyword search. Dense retrieval handles semantic similarity; BM25 handles exact term matching. For enterprise knowledge bases with domain-specific terminology, hybrid search reduces irrelevant chunk retrieval by 20-35% compared to dense-only retrieval.
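
One common way to merge the two result lists is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one option among several. In the sketch below, the function name, the document IDs, and the smoothing constant `k=60` (a conventional default) are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(dense_ranked: List[str], bm25_ranked: List[str],
                           k: int = 60, top_n: int = 5) -> List[str]:
    """Fuse two ranked lists of document IDs; higher fused score ranks first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in (dense_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents that rank well
            # in both lists accumulate the highest totals.
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda d: scores[d], reverse=True)
    return fused[:top_n]


dense = ["a", "b", "c"]   # hypothetical IDs from vector search
bm25 = ["b", "d", "a"]    # hypothetical IDs from keyword search
print(reciprocal_rank_fusion(dense, bm25, top_n=4))  # ['b', 'a', 'd', 'c']
```

RRF only needs ranks, not comparable scores, which sidesteps the problem of normalising cosine similarities against BM25 scores.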

Re-ranking adds a cross-encoder model as a second-pass filter. After your initial retrieval returns the top 20 candidates, a re-ranker (e.g. cross-encoder/ms-marco-MiniLM-L-6-v2) scores each candidate against the query and returns only the top 3-5. Re-ranking adds 50-150ms of latency but reduces context tokens sent to the LLM by 60-70%.
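
The second-pass filter can be sketched independently of any particular model. Below, `rerank` and `score_pairs` are hypothetical names: the scorer is passed in as a callable so the logic can be tested with a toy function, and the trailing comment shows how a cross-encoder from the sentence-transformers package could be plugged in, assuming that package is installed.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(query: str, candidates: List[str],
           score_pairs: Callable[[Sequence[Tuple[str, str]]], Sequence[float]],
           top_n: int = 3) -> List[str]:
    """Score every (query, candidate) pair and keep the top_n candidates."""
    if not candidates:
        return []
    pairs = [(query, text) for text in candidates]
    scores = score_pairs(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]


# With sentence-transformers installed, a cross-encoder can act as the scorer
# (not executed here, since it downloads model weights):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   top_chunks = rerank(query, candidates, model.predict, top_n=3)
```

Keeping the scorer injectable also makes it straightforward to measure the latency cost of re-ranking separately from retrieval.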

Chunk strategy review is often overlooked. Fixed-size chunking at 512 tokens is a default, not a recommendation. For structured documents like policy manuals or legal contracts, semantic chunking - splitting on meaningful boundaries rather than token counts - produces chunks that are more self-contained and require fewer of them to answer a query.
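
As a rough illustration of boundary-aware chunking, the sketch below splits on blank lines and merges neighbouring paragraphs up to a size limit. It is a simple proxy for semantic chunking, not a full implementation: paragraph breaks stand in for "meaningful boundaries", the function name is hypothetical, and word count approximates tokens.

```python
import re
from typing import List


def chunk_on_paragraphs(text: str, max_tokens: int = 512) -> List[str]:
    """Split on blank lines, then merge neighbouring paragraphs up to max_tokens."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        para_len = len(para.split())  # ASSUMPTION: word count approximates tokens
        if current and current_len + para_len > max_tokens:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A single paragraph longer than max_tokens becomes its own oversize
        # chunk rather than being cut mid-paragraph.
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i"
print(chunk_on_paragraphs(sample, max_tokens=5))  # ['a b c\n\nd e', 'f g h i']
```

For policy manuals or contracts, splitting on section or clause headings instead of blank lines would follow the same pattern.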


Model Selection and Routing: Not Every Query Needs GPT-4

Intelligent model routing is the practice of directing queries to the least expensive model capable of handling them adequately. Not every query in a RAG system requires a frontier model. Factual lookups from a well-structured knowledge base, simple classifications, and short-form extractions are tasks where GPT-4o mini, Claude Haiku, or a locally hosted Llama 3.1 8B model performs comparably to GPT-4 at 10-20x lower cost.

A routing layer classifies incoming queries by complexity before dispatch:

  • Simple factual queries (e.g. "What are our office hours?") → small, fast model
  • Multi-hop reasoning queries (e.g. "Compare our enterprise and SMB pricing tiers and explain the compliance implications") → frontier model

Implementing this routing layer reduces average per-query cost by 35-50% on mixed-complexity workloads, with no perceptible quality loss on simple queries.
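
A routing layer can start as a few heuristics before graduating to a trained classifier. The sketch below is illustrative only: the tier labels, the cue words, and the 12-word cutoff are hypothetical choices, not values taken from a production system.

```python
import re

# ASSUMPTION: hypothetical tier labels; map these to your actual model names.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

# Words that tend to signal multi-step reasoning rather than a lookup.
REASONING_CUES = re.compile(
    r"\b(compare|explain|why|implications?|trade-?offs?|analy[sz]e|versus|vs)\b",
    re.IGNORECASE,
)


def route_query(query: str, max_simple_words: int = 12) -> str:
    """Send short queries without reasoning cues to the small model."""
    word_count = len(query.split())
    if word_count > max_simple_words or REASONING_CUES.search(query):
        return FRONTIER_MODEL
    return SMALL_MODEL


print(route_query("What are our office hours?"))  # small-model
print(route_query(
    "Compare our enterprise and SMB pricing tiers "
    "and explain the compliance implications"
))  # frontier-model
```

Heuristics like these misroute some queries, so logging the chosen tier alongside user feedback is a sensible way to find where the rules need tightening.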

This is exactly the kind of architectural decision that an experienced AI consultancy makes during system design - not after six months of unexpected invoices.


How to Audit Your Existing RAG System in 5 Steps

Auditing an existing RAG system for cost inefficiencies follows a repeatable process. Here is the standard approach used in production reviews:

  1. Instrument your pipeline. Log token counts at each stage - system prompt, retrieved context, user query, and output - for every call. If you're not measuring it, you can't optimise it.

  2. Calculate your cost-per-query. Divide your monthly LLM spend by total query volume. A well-optimised enterprise RAG system typically costs $0.002-$0.008 USD per query on GPT-4-class models. Above $0.015 per query is a signal that architectural issues exist.

  3. Analyse your context token distribution. Plot a histogram of context tokens per query. A long right tail (queries consuming 6,000+ tokens) points to retrieval precision problems. A uniformly high distribution points to a chunking or system prompt problem.

  4. Measure your cache hit potential. Sample 1,000 queries and cluster them by semantic similarity. If more than 25% of queries fall into clusters with cosine similarity ≥ 0.90, semantic caching will produce significant savings.

  5. Benchmark retrieval precision. For a sample of 100 queries, manually review the retrieved chunks. Score each chunk as relevant or irrelevant. A precision rate below 70% means your retrieval layer is sending noise to the LLM on roughly one in three chunks, or more.
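
Several of the audit steps above reduce to simple calculations over logged data. The sketch below assumes a hypothetical `CallLog` record holding the per-stage token counts from step 1, and shows the metrics for steps 2, 3, and 5. The names and the 6,000-token threshold default are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CallLog:
    # Token counts recorded for one LLM call (step 1).
    system_tokens: int
    context_tokens: int
    query_tokens: int
    output_tokens: int


def cost_per_query(monthly_spend_usd: float, query_count: int) -> float:
    """Step 2: average LLM spend per query."""
    return monthly_spend_usd / query_count


def long_tail_share(logs: List[CallLog], context_threshold: int = 6000) -> float:
    """Step 3: fraction of calls whose retrieved context meets the threshold."""
    heavy = sum(1 for log in logs if log.context_tokens >= context_threshold)
    return heavy / len(logs)


def retrieval_precision(relevance_labels: List[bool]) -> float:
    """Step 5: share of manually reviewed chunks labelled relevant."""
    return sum(relevance_labels) / len(relevance_labels)


logs = [
    CallLog(400, 2000, 100, 300),
    CallLog(400, 6500, 120, 400),
    CallLog(400, 1800, 90, 250),
    CallLog(400, 7000, 110, 350),
]
print(long_tail_share(logs))                            # 0.5
print(retrieval_precision([True, True, False, True]))   # 0.75
```

Running these over a week of logs is usually enough to show which of the optimisations above deserves attention first.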


What to Do Next

If your RAG system is in production and you haven't applied at least two of the optimisations above, your LLM costs are higher than they need to be. The good news is that semantic caching and token budgeting are implementable in days, not months, and typically produce a 40-60% cost reduction without touching your underlying knowledge base or user experience.

The harder work - retrieval precision tuning, model routing, and re-ranking - requires a clear view of your query distribution and document structure before you start. Getting that wrong costs time and introduces regression risk.

If you want a structured assessment of where your current system is leaking cost, Exponential Tech offers RAG architecture reviews as part of our broader AI consulting work. You can explore our services or use our AI ROI calculator to estimate what optimised infrastructure would mean for your budget.


Frequently Asked Questions

Q: What is semantic caching in a RAG system?

Semantic caching is a technique that stores the results of previous LLM queries and retrieves them when a new query is sufficiently similar in meaning - measured by vector cosine similarity - rather than requiring an exact string match. It reduces redundant LLM API calls by 30-60% on typical enterprise knowledge base workloads.

Q: How much can token budgeting reduce RAG costs?

Token budgeting reduces per-query LLM costs by 40-60% on unoptimised systems by enforcing hard limits on system prompt size, retrieved context, and output length. The reduction is achieved without degrading answer quality, provided retrieval precision is also addressed.

Q: When should I use a smaller model instead of GPT-4 in my RAG pipeline?

Use a smaller model for simple factual lookups, short-form extractions, and classification tasks where the answer is directly present in the retrieved context. Reserve frontier models for multi-hop reasoning, synthesis across multiple documents, and tasks requiring nuanced judgement. Intelligent routing between model tiers reduces average per-query cost by 35-50% on mixed workloads.

Q: What does an AI consultancy actually do to optimise a RAG system?

An AI consultancy audits the full RAG pipeline - retrieval strategy, chunking logic, prompt architecture, model selection, and caching infrastructure - to identify where cost and latency inefficiencies exist. The output is a prioritised set of architectural changes with projected cost impact, implemented either by the consultancy or handed to the client's engineering team with detailed specifications.
