AI Cost Visibility Before the Invoice: How to Trace, Measure and Optimize Token Spend

by Nikolay Iliev

Last Updated: June 04, 2026 | Published: May 21, 2026 | 13 min read | AI, Release


Learn how to measure and optimize AI token spend before billing surprises hit. Discover why production AI costs diverge from estimates and how trace-level observability helps teams control LLM spending.

The Cost Visibility Problem

Most AI teams don’t realize they’ve overspent until the invoice arrives. The math seems simple—you know the per-token rate, you estimate average usage and you multiply. But production agent behavior introduces compounding factors that make those estimates wildly inaccurate: retries, context window growth, multi-step reasoning chains, automated evaluations and framework-level overhead that’s invisible without proper instrumentation.

The real problem isn’t just cost—it’s the lack of visibility into what caused the cost.

The Spreadsheet Model

When evaluating an LLM provider, the cost model looks deterministic:

Monthly cost = (avg_input_tokens + avg_output_tokens) × per_token_rate × monthly_invocations
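Because most providers price input and output tokens at different rates, the formula is usually applied per direction. A minimal sketch, assuming a workload of 500 input and 200 output tokens per call at $0.40 and $1.60 per million tokens (the function and parameter names are illustrative, not from any SDK):

```python
def monthly_cost(avg_input_tokens, avg_output_tokens,
                 input_rate_per_m, output_rate_per_m, monthly_invocations):
    """Spreadsheet-style estimate; rates are USD per 1M tokens."""
    per_call = (avg_input_tokens * input_rate_per_m
                + avg_output_tokens * output_rate_per_m) / 1_000_000
    return per_call * monthly_invocations


estimate = monthly_cost(500, 200, 0.40, 1.60, 10_000)
print(f"${estimate:.2f}/month")  # $5.20/month
```

Keeping the two rates separate matters because output tokens are often several times more expensive than input tokens, so a single blended rate can understate the bill.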

For a simple weather agent: ~500 input tokens + ~200 output tokens per query, at GPT-4.1-mini rates ($0.40/1M input, $1.60/1M output), running 10,000 queries/month = (500 × $0.40 + 200 × $1.60) / 1M × 10,000, or roughly $5.20/month. Budget approved.

The Production Reality

In production, that same agent exhibits behavior the spreadsheet never captured:

Context accumulation – Each tool result is appended to the conversation context. A weather agent that calls one tool adds ~300 tokens of tool output to every subsequent LLM call. An agent with 5 tools might accumulate 2,000+ tokens of context before generating the final response.
Retry amplification – Most agent frameworks implement automatic retry logic. LangChain’s create_agent, for example, can retry on parsing failures. If the LLM returns a malformed tool call 20% of the time, you’re paying at least 1.2x the expected token cost, and each retry resends the full accumulated context.
Multi-step reasoning – ReAct-style agents loop: Think → Act → Observe → Think → Act → Observe → Final Answer. Each iteration sends the entire conversation history. A 3-iteration agent sends roughly 3 × (system_prompt + accumulated_context + new_thought), which is not 3× the base cost but more like 5-7× due to context growth.
Framework overhead – LangChain, LlamaIndex and other frameworks inject system prompts, format instructions and intermediate parsing prompts that aren’t visible in your application code. These add 200-500 tokens per call that never appear in your cost estimate.
Invisible evaluations – Platforms that run automatic quality evaluations consume tokens for the judge LLM call. If evaluations run on every trace, they can effectively double your token spend.

The result: That $5.20/month estimate becomes $25-40/month in practice. At enterprise scale with multiple agents, the gap between estimated and actual spend can reach thousands of dollars monthly.
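The compounding is easier to see with numbers. The sketch below models only input tokens for a loop that resends its full history on every iteration. The 500-token base prompt, the 500 tokens appended per step (tool output plus reasoning) and the 20% retry rate are illustrative assumptions, not measurements:

```python
def react_input_tokens(base_prompt: int, added_per_step: int, iterations: int) -> int:
    """Total input tokens when each iteration resends the whole history."""
    total = 0
    context = base_prompt
    for _ in range(iterations):
        total += context           # this call sends everything accumulated so far
        context += added_per_step  # thought + tool result join the history
    return total


single_call = react_input_tokens(500, 500, 1)  # 500 tokens: the spreadsheet view
three_steps = react_input_tokens(500, 500, 3)  # 500 + 1000 + 1500 = 3000 tokens
multiplier = three_steps / single_call         # 6.0x, not 3x

# One extra attempt on 20% of requests adds a further 1.2x on top.
with_retries = multiplier * 1.2

print(f"{multiplier:.1f}x from context growth, {with_retries:.1f}x with retries")
```

Under these assumptions a three-step loop already lands inside the 5-7× range, before framework overhead or evaluations are counted.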

Why This Becomes Urgent with Usage-Based Billing

This problem is about to get significantly worse. The AI tooling industry is rapidly shifting from flat-rate to usage-based pricing, making every token a direct billing event.

GitHub recently announced that Copilot is moving to usage-based billing starting June 2026. The reasoning is explicit: “a quick chat question and a multi-hour autonomous coding session can cost the user the same amount” under flat pricing—so they’re replacing premium request counts with “GitHub AI Credits” consumed based on token usage (input, output and cached tokens) at published API rates per model.

This isn’t an isolated decision. It reflects a structural reality: as AI tools become agentic—running multi-step sessions, invoking tools, iterating across codebases—the cost variance between “light” and “heavy” usage becomes too large for flat pricing to absorb. Providers are pushing that variance downstream to users.

The implication for engineering teams is clear: in a token-based billing world, cost optimization requires the same granular visibility as performance optimization. Every retry, long context window, tool call, model switch and evaluation becomes a measurable billing event. You can’t optimize what you can’t measure and you can’t measure token economics with infrastructure metrics.

What Teams Need for AI Cost Visibility and Management

Before choosing any tool or platform, it helps to define what metrics and dimensions actually matter for AI cost control. These are the building blocks of cost visibility regardless of implementation:

Token Metrics

Input, output, total and cached tokens – the raw cost drivers behind every LLM call. Cached tokens matter because many providers price them differently (often at a discount) and knowing your cache hit rate affects cost projections.
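To make the cached-token effect concrete, here is a small sketch of a per-call cost function. It assumes the provider reports cached tokens as a subset of input tokens and bills them at a lower rate. The rates and the function name are illustrative, so substitute your provider's published prices:

```python
def call_cost(input_tokens, cached_tokens, output_tokens,
              input_rate, cached_rate, output_rate):
    """Cost in USD for one call; rates are USD per 1M tokens.

    cached_tokens is the portion of input_tokens served from the prompt cache.
    """
    uncached = input_tokens - cached_tokens
    return (uncached * input_rate
            + cached_tokens * cached_rate
            + output_tokens * output_rate) / 1_000_000


RATES = dict(input_rate=0.40, cached_rate=0.10, output_rate=1.60)  # illustrative

cold = call_cost(2_000, 0, 200, **RATES)      # no cache hits
warm = call_cost(2_000, 1_500, 200, **RATES)  # 75% of the prompt cached
print(f"cold ${cold:.5f}  warm ${warm:.5f}  saving {1 - warm / cold:.0%}")
```

A projection that ignores the cache hit rate can therefore be off in either direction, depending on how stable your prompt prefix is.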

Cost Dimensions

Cost by model and provider – shows where routing decisions affect spending. If 90% of cost comes from one model, that’s where optimization has the highest leverage.
Cost by request, trace, span, workflow, service and agent – progressively broader views from a single LLM call up to an entire service, enabling drill-down from aggregate anomalies to root causes.
Cost by customer, team, environment, release and experiment – attribution dimensions that answer, “who or what caused the spend?” rather than just “how much did we spend?”
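Once every call carries these dimensions, slicing cost is a simple group-by. A minimal sketch over hypothetical per-call records (the field names, model names and values are made up for illustration):

```python
from collections import defaultdict

# Hypothetical per-call records; field names are illustrative.
calls = [
    {"model": "small-model", "customer": "acme",   "release": "2.5.0", "cost_usd": 0.0004},
    {"model": "small-model", "customer": "globex", "release": "2.5.1", "cost_usd": 0.0002},
    {"model": "large-model", "customer": "acme",   "release": "2.5.1", "cost_usd": 0.0300},
    {"model": "large-model", "customer": "acme",   "release": "2.5.1", "cost_usd": 0.0250},
]


def cost_by(records, dimension):
    """Sum cost per value of one dimension, e.g. 'model' or 'customer'."""
    totals = defaultdict(float)
    for record in records:
        totals[record.get(dimension, "unknown")] += record["cost_usd"]
    return dict(totals)


by_model = cost_by(calls, "model")
by_customer = cost_by(calls, "customer")
top_model = max(by_model, key=by_model.get)
share = by_model[top_model] / sum(by_model.values())
print(f"{top_model} drives {share:.0%} of spend")
```

The same function answers "how much?" and "who caused it?" by changing only the dimension name, which is why attribution tags need to be attached at capture time.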

Behavioral Metrics

Retry count and agent iteration count – expose hidden cost amplification. A retry rate of 20% means you’re paying at least 1.2x what you expected, compounded by context size.
Tool calls and retrieval steps – each tool invocation may trigger additional LLM calls or add context that inflates subsequent calls.
Context growth across a conversation or workflow – the silent cost multiplier. If context grows linearly with conversation turns, cost grows quadratically.
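Both amplification effects have simple closed forms. The sketch below assumes independent failures for retries and a fixed number of tokens added per turn for context growth. The function names and example values are illustrative:

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average LLM calls per request when each failed attempt is retried,
    up to max_retries times (failures assumed independent)."""
    return sum(failure_rate ** k for k in range(max_retries + 1))


def cumulative_input_tokens(base: int, growth_per_turn: int, turns: int) -> int:
    """Input tokens summed over a conversation where turn k sends
    base + k * growth_per_turn tokens (k = 0 .. turns - 1)."""
    return turns * base + growth_per_turn * turns * (turns - 1) // 2


print(expected_attempts(0.2, 1))  # one retry allowed: 1.2 calls per request
print(expected_attempts(0.2, 3))  # three retries allowed: about 1.25

ten = cumulative_input_tokens(500, 300, 10)     # 18,500 tokens
twenty = cumulative_input_tokens(500, 300, 20)  # 67,000 tokens
print(twenty / ten)                             # ~3.6x the tokens for 2x the turns
```

The turns × (turns − 1) term is the quadratic part: doubling the conversation length pushes total input tokens toward four times the original as the fixed base prompt becomes a smaller share.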

Quality and Efficiency Metrics

Evaluation usage and judge-model cost – quality checks that use LLM judges have their own token cost, which can rival or exceed the primary inference cost if unchecked.
Latency, error status and failed/partial responses – failed responses still consume tokens. High latency may indicate retries or queuing that affects both cost and user experience.

Meta-Metrics

Observability usage and cost – the observability layer itself should be measurable. If you can’t quantify what observability costs you, it may become an uncontrolled expense.

When Observability Itself Becomes a Cost Problem

Teams need observability to manage AI cost but the observability layer itself needs to be transparent, predictable and intentional.

A recent Reddit post from a developer using Azure AI Foundry illustrates the issue:

“I am noticing very high, unexpected charges coming from ‘Observability’. I do not need these logs, metrics or trace data right now and my main goal is to stop these charges completely.”

The root cause: Azure AI Foundry enables playground evaluations by default (which consume LLM tokens and are billed) and automatically configures Application Insights tracing for hosted agents. The developer was being charged for observability features they never consciously enabled.

A Microsoft PM confirmed the fix: navigate to the agents playground, select metrics in the upper right and unselect all evaluators. The developer had to disable all observability to stop the charges—trading cost visibility for cost control.

This creates a fundamental tension: you need observability to control AI costs but if observability itself is an uncontrolled cost with hidden defaults, it becomes part of the problem. The solution requires an observability platform where:

Instrumentation is explicitly opt-in (nothing runs unless you add it to your code)
Pricing is predictable and based on discrete units, not data volume
The overall model is predictable and ROI-positive, helping teams reduce avoidable AI spend without creating a new cost surprise

Why These Metrics Matter

Understanding what each metric tells you—and what decisions it enables—is the difference between collecting data and controlling cost:

Token counts show the raw cost driver. If you don’t know how many tokens a workflow consumes, you can’t estimate or optimize its cost.
Model and provider data shows where routing decisions affect spend. Switching from a flagship model to a smaller model for classification tasks can reduce cost 10-50× for that step with minimal quality impact.
Trace-level cost shows which step in a workflow created the spike. Without this, you know that cost went up but not why.
Tags (customer, release, environment, experiment) make cost attributable. They turn “we spent $4,000 this month” into “customer X’s workflow costs 8× more than average because of context length.”
Retry and iteration counts expose hidden cost amplification. An agent that retries 3 times on 10% of requests makes about 30% more LLM calls overall than expected, because each affected request costs roughly 4× the baseline.
Evaluation metrics prevent quality checks from becoming invisible spend. If your judge model runs on every trace and costs $0.02 per evaluation, that’s $200/month at 10,000 traces—potentially more than the inference cost you’re trying to optimize.
Latency helps teams optimize cost without degrading user experience. A cheaper model that adds 2 seconds of latency may not be an acceptable tradeoff but you need both metrics to make that decision.
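The retry and evaluation figures above are quick to reproduce. In the sketch below, sample_rate is a hypothetical knob for evaluating only a fraction of traces, and the retry estimate ignores context growth, so treat it as a lower bound:

```python
def monthly_eval_cost(traces: int, cost_per_eval: float, sample_rate: float = 1.0) -> float:
    """Judge-model spend when a fraction of traces is evaluated."""
    return traces * sample_rate * cost_per_eval


def retry_overhead(share_of_requests: float, retries: int) -> float:
    """Extra LLM calls as a fraction of baseline calls."""
    return share_of_requests * retries


every_trace = monthly_eval_cost(10_000, 0.02)    # $200/month
sampled = monthly_eval_cost(10_000, 0.02, 0.10)  # $20/month at 10% sampling
extra_calls = retry_overhead(0.10, 3)            # 0.30, i.e. 30% more calls

print(f"${every_trace:.2f} vs ${sampled:.2f}, retries add {extra_calls:.0%}")
```

Sampling trades evaluation coverage for cost, so whether 10% is enough depends on how quickly you need to detect a quality regression.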

Vendor-Neutral Workflow for Tracing, Observing and Optimizing AI Cost

Before introducing any specific tool, here is the general process teams should follow to move from reactive invoice surprises to proactive cost control:

Instrument – Capture AI requests, model calls, tool calls, retrieval steps and evaluations. Every step that could consume tokens or trigger billing should emit telemetry.
Capture – Record token counts, model metadata, latency, status and cost estimates for each instrumented operation.
Attribute – Add tags for customer, release, environment, team and experiment so cost can be sliced by any business dimension.
Baseline – Establish a cost baseline for normal operations. Without a baseline, you can’t distinguish a spike from expected variance.
Monitor – Watch for spikes or regressions after deployments, prompt changes, model switches or traffic shifts.
Investigate – Drill into expensive traces to identify the root cause: was it a retry loop, context bloat, a model routing error or an evaluation storm?
Optimize – Fix the identified problem: optimize prompts, adjust routing, cap retries, manage context windows, prune unnecessary tool calls or limit evaluations.
Validate – Confirm that cost improved without hurting quality or latency. Cost optimization that degrades user experience isn’t optimization—it’s a tradeoff that needs explicit approval.

This loop takes your cost discovery time from “30 days (when the invoice arrives)” to “same day (when the trace appears).”
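The Baseline and Monitor steps can start very simply. One possible approach, sketched here with made-up daily cost figures, is to flag any day that exceeds the recent mean by a few standard deviations. The three-sigma threshold is an assumption to tune, not a recommendation from any platform:

```python
from statistics import mean, stdev


def is_spike(history, today, sigmas=3.0):
    """Flag today's cost if it exceeds the baseline mean by `sigmas` standard deviations."""
    baseline = mean(history)
    threshold = baseline + sigmas * stdev(history)
    return today > threshold


last_week = [4.1, 4.3, 3.9, 4.2, 4.0, 4.4, 4.1]  # hypothetical daily cost in USD
print(is_spike(last_week, 4.5))  # False: within normal variance
print(is_spike(last_week, 5.9))  # True: worth opening the traces
```

A check like this only says that something changed. The Investigate step still needs trace-level data to say what.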

Progress Observability as a Practical Example

Here is what this framework looks like when implemented using Progress Observability. The sections below demonstrate each capability as a concrete example of the general principles.

Instrumented Weather Agent—Example of Capturing Telemetry

A real Python agent instrumented with the Progress Observability SDK, using LangChain with OpenAI to answer weather questions:

import os
from dotenv import load_dotenv
from langchain_community.utilities import OpenWeatherMapAPIWrapper
from langchain_community.tools import OpenWeatherMapQueryRun
from langchain_openai import ChatOpenAI
from langchain.agents import create_agent
from progress.observability import Observability, ObservabilityInstruments
from progress.observability import agent, workflow, task, tool

load_dotenv()

# Workaround (likely environment-specific): unset SSL_CERT_FILE so HTTPS
# clients fall back to their default CA bundle
os.environ.pop("SSL_CERT_FILE", None)

# Initialize observability - explicitly opt-in, called before LLM usage
Observability.instrument(
    app_name=os.getenv("OBSERVABILITY_APP_NAME"),
    api_key=os.getenv("OBSERVABILITY_API_KEY"),
    trace_content=True,
    instruments={
        ObservabilityInstruments.OPENAI,
        ObservabilityInstruments.LANGCHAIN,
    },
    additional_tags=["production", "release:2.5.1"],
)

model = ChatOpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-5.4-mini",
)

# Setup Tools
weather_api = OpenWeatherMapAPIWrapper(
    openweathermap_api_key=os.getenv("OPENWEATHERMAP_API_KEY")
)
weather_tool = OpenWeatherMapQueryRun(api_wrapper=weather_api)
tools = [weather_tool]

# Create Agent
lang_agent = create_agent(model, tools=tools)

Manual Instrumentation—Example of Cost Attribution

Auto-instrumentation captures LLM calls and framework operations, but your own business logic (data pipelines, custom tool orchestration) needs explicit decoration to appear in traces with cost attribution:

@tool(name="weather-lookup")
def fetch_weather_data(city: str) -> str:
    """Fetch raw weather data from OpenWeatherMap API."""
    return weather_api.run(city)

@task(name="normalize-weather", attributes={"team": "ml"}, tags=["experiment-a"])
def normalize_weather_data(raw: str) -> dict:
    """Transform raw weather string into a structured dict."""
    return {"raw_report": raw, "source": "openweathermap", "format": "normalized"}


@workflow(name="data-pipeline", version=2)
def retrieve_weather_context(query: str) -> dict:
    """Retrieve and normalize weather data for a given query."""
    raw = fetch_weather_data(query)
    return normalize_weather_data(raw)


@agent(name="weather-agent")
def handle_weather_request(query: str) -> str:
    """Top-level agent handler that orchestrates the full weather request."""
    # Note: the city is hardcoded and `context` is not passed to the agent below;
    # in this demo the call serves to emit the workflow, task and tool spans.
    context = retrieve_weather_context("London")
    result = lang_agent.invoke({
        "messages": [{"role": "user", "content": query}]
    })
    return result["messages"][-1].content

# Run with proper shutdown to flush telemetry
try:
    response = handle_weather_request("What is the weather in London?")
    print(response)
finally:
    Observability.shutdown()

What This Captures—Example of Trace-Level Cost Granularity

Each decorator (@agent, @workflow, @task, @tool) creates a span with a specific kind. The auto-instrumentation adds spans for every LLM call with token counts and model information. Together, the trace tree for this agent looks like:

▼ agent: weather-agent
  ▼ workflow: data-pipeline (v2)
    ▼ tool: weather-lookup
    ▼ task: normalize-weather [tags: experiment-a]
  ▼ llm_call: ChatOpenAI.chat (model: gpt-5.4-mini, 364 tokens, $0.0004)
  ▼ tool: OpenWeatherMapQueryRun
  ▼ llm_call: ChatOpenAI.chat (model: gpt-5.4-mini, 128 tokens, $0.0002)

Every span records: duration, token count (for LLM calls), model name and estimated cost. This is the granularity needed to answer “why did costs spike?”—you can see exactly which step consumed tokens and whether it was expected.
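To illustrate what that granularity allows, the sketch below rebuilds the trace tree above as a plain data structure and rolls up tokens and cost. The Span class is a hypothetical stand-in for exported trace data, not the SDK's own type:

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    kind: str
    name: str
    tokens: int = 0
    cost_usd: float = 0.0
    children: list["Span"] = field(default_factory=list)

    def walk(self):
        """Yield this span and every descendant."""
        yield self
        for child in self.children:
            yield from child.walk()


trace = Span("agent", "weather-agent", children=[
    Span("workflow", "data-pipeline", children=[
        Span("tool", "weather-lookup"),
        Span("task", "normalize-weather"),
    ]),
    Span("llm_call", "ChatOpenAI.chat", tokens=364, cost_usd=0.0004),
    Span("tool", "OpenWeatherMapQueryRun"),
    Span("llm_call", "ChatOpenAI.chat", tokens=128, cost_usd=0.0002),
])

spans = list(trace.walk())
total_tokens = sum(s.tokens for s in spans)
total_cost = sum(s.cost_usd for s in spans)
most_expensive = max(spans, key=lambda s: s.cost_usd)

print(len(spans), total_tokens, f"${total_cost:.4f}", most_expensive.name)
```

Only the LLM call spans carry tokens and cost, which is why the most expensive step stands out immediately once the tree is flattened.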

Cost Analytics Dashboard—Example of Aggregate Cost Visibility

The Cost Analytics Dashboard aggregates token usage and costs across your organization, calculated server-side by the Collector (which maps model names to current per-token pricing—meaning costs stay accurate even as providers change rates).

The dashboard surfaces two critical dimensions:

Cost by model – immediately see that gpt-5.5 accounts for 90%+ of spend while gpt-5.4-mini is orders of magnitude cheaper. This drives model routing decisions: can you use the cheaper model for initial reasoning and reserve the expensive model for final responses only?
Cost by service/application – attribute spend to specific agents and workflows. Your “weather-agent” costs $0.0092/day while your “document-processor” costs $4.30/day. Now you know where to focus optimization efforts.

The Cost Analytics dashboard breaks down spending by model and by service, with totals for cost and token usage over any selected time range.

This is the view you check weekly (or daily during rollouts) to catch cost regressions before they hit the invoice.

Trace Explorer—Example of Trace-Level Cost Investigation

When the Cost Analytics dashboard shows a spike, the Observations page lets you drill into individual traces to find the root cause.

The Observations page lists all traces with span kind, model, tokens, cost and status. Click any trace to expand the full span tree.

Each trace shows the full span tree with token counts and costs per span. You can filter by:

Time range (isolate the spike window)
Application name
Tags (e.g., customer:acme, env:production)
Success/failure status

For cost investigation, the typical workflow is:

Dashboard shows Tuesday had 40% higher cost than Monday.
Filter Observations to Tuesday, sort by token count descending.
Find the expensive traces—maybe a subset of users triggers a 5-iteration agent loop.
Open the trace, see that iterations 4 and 5 add 8,000 tokens with no useful output.
Fix the agent’s termination logic, deploy, verify cost drops Wednesday.
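The same steps can be run against exported trace data. The records below are hypothetical, including the field names and the produced_output flag, and only illustrate the filter, sort and inspect pattern:

```python
from datetime import date

# Hypothetical exported traces; field names are illustrative.
traces = [
    {"id": "t1", "day": date(2026, 6, 1), "tokens": 900, "iterations": [
        {"tokens": 500, "produced_output": True},
        {"tokens": 400, "produced_output": True},
    ]},
    {"id": "t2", "day": date(2026, 6, 2), "tokens": 12_400, "iterations": [
        {"tokens": 1_200, "produced_output": True},
        {"tokens": 1_500, "produced_output": True},
        {"tokens": 1_700, "produced_output": True},
        {"tokens": 3_800, "produced_output": False},
        {"tokens": 4_200, "produced_output": False},
    ]},
    {"id": "t3", "day": date(2026, 6, 2), "tokens": 1_100, "iterations": [
        {"tokens": 1_100, "produced_output": True},
    ]},
]


def expensive_traces(records, day, top_n=10):
    """Traces from one day, most token-hungry first."""
    same_day = [t for t in records if t["day"] == day]
    return sorted(same_day, key=lambda t: t["tokens"], reverse=True)[:top_n]


def wasted_tokens(trace):
    """Tokens spent in iterations that produced no useful output."""
    return sum(i["tokens"] for i in trace["iterations"] if not i["produced_output"])


spike_day = date(2026, 6, 2)
worst = expensive_traces(traces, spike_day)[0]
print(worst["id"], worst["tokens"], wasted_tokens(worst))
```

Sorting by tokens rather than by cost keeps the investigation independent of pricing changes during the period.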

LLM Requests—Example of Real-Time Token Monitoring

The LLM Requests view provides a live stream of individual LLM calls. Each entry shows model, provider, token count and cost. This is your real-time cost monitor during deployments—if a new prompt template increases average tokens per call from 400 to 1,200, you’ll see it immediately, not on next month’s invoice.

Tag-Based Attribution—Example of Customer/Release/Environment Cost Breakdown

Tags enable multi-dimensional cost analysis. The SDK supports three levels:

# 1. Global tags - applied to ALL spans
Observability.instrument(
    app_name="weather-agent",
    api_key="...",
    additional_tags=["production", "release:2.5.1"]
)

# 2. Scoped tags - applied within a context block
from progress.observability import propagate_attributes

with propagate_attributes(tags=["customer:acme-corp", "request:req-abc-123"]):
    result = handle_weather_request(query)
    # All spans (including LLM calls) inside this block get these tags

# 3. Decorator tags - applied to a single function's span
@task(tags=["cohort-a", "experiment:new-prompt"])
def my_function():
    ...

With customer-level tags, you can answer: “Which customers trigger the most expensive agent paths?” With experiment tags: “Did the new prompt template reduce or increase token consumption?” With release tags: “Did v2.5.1 introduce a cost regression vs v2.5.0?”

All three levels merge and deduplicate automatically. Each tag is limited to 200 characters.
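As an illustration of what merging and deduplicating means for the final tag set, here is a standalone sketch. It is not the SDK's implementation, and raising an error for over-long tags is an assumption, since how the SDK handles the 200-character limit is not described here:

```python
MAX_TAG_LENGTH = 200  # limit stated above


def merge_tags(*levels):
    """Merge tag lists in order (global, scoped, decorator), dropping duplicates."""
    merged = list(dict.fromkeys(tag for level in levels for tag in level))
    too_long = [tag for tag in merged if len(tag) > MAX_TAG_LENGTH]
    if too_long:
        # Assumption: this sketch rejects over-long tags; the SDK may behave differently.
        raise ValueError(f"{len(too_long)} tag(s) exceed {MAX_TAG_LENGTH} characters")
    return merged


global_tags = ["production", "release:2.5.1"]
scoped_tags = ["customer:acme-corp", "production"]  # "production" repeats
decorator_tags = ["experiment:new-prompt"]

print(merge_tags(global_tags, scoped_tags, decorator_tags))
```

Because dict.fromkeys preserves insertion order, the first occurrence of each tag wins and the output order is stable.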

SDK Instrumentation—Example of How Teams Capture the Required Telemetry

Every SDK parameter can be overridden via environment variables, enabling the same code to run across dev/staging/prod with different cost configurations:

export OBSERVABILITY_APP_NAME="weather-agent"
export OBSERVABILITY_API_KEY="ac_p_001_..."
export OBSERVABILITY_ENDPOINT="https://collector.observability.progress.com:443"
export OBSERVABILITY_TRACE_CONTENT="true"

This means you can:

Use trace_content=True in development (full prompt/completion capture for debugging)
Use trace_content=False in production (reduces telemetry size, still captures token counts and costs)
Tag environments differently for cost comparison

For teams using .NET with IChatClient from Microsoft.Extensions.AI:

using Microsoft.Extensions.AI;
using Progress.Observability.Extensions.AI;

IChatClient chatClient = new OpenAI.Chat.ChatClient("gpt-4.1-mini", openAIApiKey)
    .AsIChatClient();

chatClient = chatClient.AddObservability(options => {
    options.AppName = "Weather Agent";
    options.ApiKey = "ac_p_001_.....";
    options.RecordInputs = true;
    options.RecordOutputs = true;
    options.AdditionalTags = new List<string> { "production", "v2.1.0" };
});

// Tool observability for cost attribution on tool calls
ChatOptions chatOptions = new() { Tools = [..tools] };
chatOptions.AddToolObservability();

The .NET SDK provides the same cost visibility—every chat completion and tool invocation generates a span with token counts, model name and cost calculated server-side by the Collector.

Units and Pricing—Example of Making Observability Usage Predictable

The Progress Observability Platform prices usage in units, which keeps the model straightforward and predictable:

1 telemetry span = 1 unit
1 evaluation = 2 units, since the judge LLM generates internal spans

The free tier includes 20,000 units/month, 1 seat and 7 days of data retention. In the weather agent example above, a single execution produces about 7 spans total (agent + workflow + task + 2 tool spans + 2 LLM calls, as in the trace tree shown earlier), which means each run consumes about 7 units. At that rate, the free tier supports roughly 2,850 agent executions per month, which is enough for development, testing and smaller production workloads.

The key characteristic of this model is that cost is fixed per span, regardless of how much content that span contains. A span carrying a 10,000-token prompt still costs the same 1 unit as a span carrying a 50-token prompt. That removes the perverse incentive common in traditional APM systems, where richer AI traces become disproportionately expensive to observe—exactly the failure mode highlighted by the Azure AI Foundry example.
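Because the unit model is linear, capacity planning is simple arithmetic. The sketch below uses the unit values stated above and treats the span count per run as an input: six covers the four decorated spans plus two LLM calls, and seven also counts the framework's own tool span from the trace tree. The function names are illustrative:

```python
FREE_TIER_UNITS = 20_000
UNITS_PER_SPAN = 1
UNITS_PER_EVALUATION = 2


def units_per_run(spans: int, evaluations: int = 0) -> int:
    """Units consumed by one agent execution."""
    return spans * UNITS_PER_SPAN + evaluations * UNITS_PER_EVALUATION


def runs_supported(monthly_units: int, spans: int, evaluations: int = 0) -> int:
    """Whole agent executions that fit in a monthly unit budget."""
    return monthly_units // units_per_run(spans, evaluations)


for spans in (6, 7):
    print(spans, "spans ->", runs_supported(FREE_TIER_UNITS, spans), "runs")

# Adding one evaluation per run raises a 6-span run to 8 units.
print(runs_supported(FREE_TIER_UNITS, 6, evaluations=1), "runs with evaluation")
```

Note that prompt size never appears in the calculation, which is the point of per-span pricing.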

Shutdown and Telemetry Flushing

Always call shutdown before your process exits so that buffered spans (including cost-critical token count data) are flushed:

try:
    run_agent()
finally:
    Observability.shutdown()

Without this, the last batch of spans may be lost—especially in short-lived processes like serverless functions or CLI tools. Lost spans mean lost cost data, which defeats the purpose of instrumentation.

Closing Thoughts

AI cost surprises are visibility problems. The per-token rates are published and stable. What’s unpredictable is agent behavior—retries, context growth, multi-step reasoning, automated evaluations—and that behavior is invisible to traditional monitoring. Teams need trace-level cost context before the invoice arrives.

The workflow is straightforward: instrument, attribute, baseline, monitor, investigate, optimize, validate. Any platform that gives you token-level cost data per trace, per span and per tag will close the visibility gap. Progress Observability is one way to put that into practice—with explicit opt-in instrumentation, predictable unit-based pricing and trace-level cost visibility from the first call.

If you want to explore this approach, get started free at telerik.com/ai-observability-platform or reach out to discuss how cost visibility fits your team’s workflow.

AI, Observability, Release

About the Author

Nikolay Iliev

Nikolay Iliev is a senior technical support engineer and, as such, is a part of the Fiddler family. He joined the support team in 2016 and has been striving to deliver customer satisfaction ever since. Nick usually rests with a console game or a sci-fi book.
