Most AI teams don’t realize they’ve overspent until the invoice arrives. The math seems simple—you know the per-token rate, you estimate average usage and you multiply. But production agent behavior introduces compounding factors that make those estimates wildly inaccurate: retries, context window growth, multi-step reasoning chains, automated evaluations and framework-level overhead that’s invisible without proper instrumentation.
The real problem isn’t just cost—it’s the lack of visibility into what caused the cost.
When evaluating an LLM provider, the cost model looks deterministic:
Monthly cost = (avg_input_tokens × input_rate + avg_output_tokens × output_rate) × monthly_invocations
For a simple weather agent: ~500 input tokens + ~200 output tokens per query, at GPT-4.1-mini rates ($0.40/1M input, $1.60/1M output), running 10,000 queries/month = roughly $5.20/month ($2.00 for input plus $3.20 for output). Budget approved.
In production, that same agent exhibits behavior the spreadsheet never captured: retries, context window growth, multi-step reasoning chains, automated evaluations and framework-level overhead.
The result: that $5.20/month estimate becomes $25-40/month in practice. At enterprise scale with multiple agents, the gap between estimated and actual spend can reach thousands of dollars monthly.
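The spreadsheet estimate is easy to reproduce in a few lines, which also makes it easy to see how quickly it drifts. The sketch below is illustrative only: `ModelRates`, `monthly_cost` and the `llm_calls_per_invocation` and `retry_rate` parameters are hypothetical names, and the production multipliers are assumed values rather than measurements. Input and output tokens are priced separately because the two rates differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRates:
    """Per-million-token prices in USD."""
    input_per_million: float
    output_per_million: float


def monthly_cost(avg_input_tokens, avg_output_tokens, invocations, rates,
                 llm_calls_per_invocation=1.0, retry_rate=0.0):
    """Estimate monthly spend; the last two knobs model production overhead."""
    per_call = (avg_input_tokens * rates.input_per_million
                + avg_output_tokens * rates.output_per_million) / 1_000_000
    billed_calls = invocations * llm_calls_per_invocation * (1.0 + retry_rate)
    return per_call * billed_calls


# Rates and volumes quoted in the text for the weather agent.
rates = ModelRates(input_per_million=0.40, output_per_million=1.60)
baseline = monthly_cost(500, 200, 10_000, rates)

# Assumed production behavior: 3 LLM calls per request, 10% of calls retried.
inflated = monthly_cost(500, 200, 10_000, rates,
                        llm_calls_per_invocation=3.0, retry_rate=0.10)

print(f"baseline: ${baseline:.2f}/month, with overhead: ${inflated:.2f}/month")
```

Even these two modest multipliers more than triple the estimate, before context growth or evaluations are counted.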
This problem is about to get significantly worse. The AI tooling industry is rapidly shifting from flat-rate to usage-based pricing, making every token a direct billing event.
GitHub recently announced that Copilot is moving to usage-based billing starting June 2026. The reasoning is explicit: “a quick chat question and a multi-hour autonomous coding session can cost the user the same amount” under flat pricing - so they’re replacing premium request counts with “GitHub AI Credits” consumed based on token usage (input, output and cached tokens) at published API rates per model.
This isn’t an isolated decision. It reflects a structural reality: as AI tools become agentic - running multi-step sessions, invoking tools, iterating across codebases - the cost variance between “light” and “heavy” usage becomes too large for flat pricing to absorb. Providers are pushing that variance downstream to users.
The implication for engineering teams is clear: in a token-based billing world, cost optimization requires the same granular visibility as performance optimization. Every retry, long context window, tool call, model switch and evaluation becomes a measurable billing event. You can’t optimize what you can’t measure and you can’t measure token economics with infrastructure metrics.
Before choosing any tool or platform, it helps to define what metrics and dimensions actually matter for AI cost control. These are the building blocks of cost visibility regardless of implementation:
Teams need observability to manage AI cost but the observability layer itself needs to be transparent, predictable and intentional.
A recent Reddit post from a developer using Azure AI Foundry illustrates the issue:
“I am noticing very high, unexpected charges coming from ‘Observability’. I do not need these logs, metrics or trace data right now and my main goal is to stop these charges completely.”
The root cause: Azure AI Foundry enables playground evaluations by default (which consume LLM tokens and are billed) and automatically configures Application Insights tracing for hosted agents. The developer was being charged for observability features they never consciously enabled.
A Microsoft PM confirmed the fix: navigate to the agents playground, select metrics in the upper right and unselect all evaluators. The developer had to disable all observability to stop the charges—trading cost visibility for cost control.
This creates a fundamental tension: you need observability to control AI costs but if observability itself is an uncontrolled cost with hidden defaults, it becomes part of the problem. The solution requires an observability platform where:
Understanding what each metric tells you - and what decisions it enables - is the difference between collecting data and controlling cost:
Before introducing any specific tool, here is the general process teams should follow to move from reactive invoice surprises to proactive cost control: instrument, attribute, baseline, monitor, investigate, optimize and validate.
This loop takes your cost discovery time from “30 days (when the invoice arrives)” to “same day (when the trace appears).”
Here is what this framework looks like when implemented using Progress Observability. The sections below demonstrate each capability as a concrete example of the general principles.
A real Python agent instrumented with the Progress Observability SDK, using LangChain with OpenAI to answer weather questions:
import os

from dotenv import load_dotenv
from langchain_community.utilities import OpenWeatherMapAPIWrapper
from langchain_community.tools import OpenWeatherMapQueryRun
from langchain_openai import ChatOpenAI
from langchain.agents import create_agent
from progress.observability import Observability, ObservabilityInstruments
from progress.observability import agent, workflow, task, tool

load_dotenv()
os.environ.pop("SSL_CERT_FILE", None)

# Initialize observability - explicitly opt-in, called before LLM usage
Observability.instrument(
    app_name=os.getenv("OBSERVABILITY_APP_NAME"),
    api_key=os.getenv("OBSERVABILITY_API_KEY"),
    trace_content=True,
    instruments={
        ObservabilityInstruments.OPENAI,
        ObservabilityInstruments.LANGCHAIN,
    },
    additional_tags=["production", "release:2.5.1"],
)

model = ChatOpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-5.4-mini",
)

# Setup Tools
weather_api = OpenWeatherMapAPIWrapper(
    openweathermap_api_key=os.getenv("OPENWEATHERMAP_API_KEY")
)
weather_tool = OpenWeatherMapQueryRun(api_wrapper=weather_api)
tools = [weather_tool]

# Create Agent
lang_agent = create_agent(model, tools=tools)
Auto-instrumentation captures LLM calls and framework operations, but your own business logic (data pipelines, custom tool orchestration) needs explicit decoration to appear in traces with cost attribution:
@tool(name="weather-lookup")
def fetch_weather_data(city: str) -> str:
    """Fetch raw weather data from OpenWeatherMap API."""
    return weather_api.run(city)


@task(name="normalize-weather", attributes={"team": "ml"}, tags=["experiment-a"])
def normalize_weather_data(raw: str) -> dict:
    """Transform raw weather string into a structured dict."""
    return {"raw_report": raw, "source": "openweathermap", "format": "normalized"}


@workflow(name="data-pipeline", version=2)
def retrieve_weather_context(query: str) -> dict:
    """Retrieve and normalize weather data for a given query."""
    raw = fetch_weather_data(query)
    return normalize_weather_data(raw)


@agent(name="weather-agent")
def handle_weather_request(query: str) -> str:
    """Top-level agent handler that orchestrates the full weather request."""
    # Demo only: the city is hardcoded and `context` is not passed on to the agent.
    context = retrieve_weather_context("London")
    result = lang_agent.invoke({
        "messages": [{"role": "user", "content": query}]
    })
    return result["messages"][-1].content


# Run with proper shutdown to flush telemetry
try:
    response = handle_weather_request("What is the weather in London?")
    print(response)
finally:
    Observability.shutdown()
Each decorator (@agent, @workflow, @task, @tool) creates a span with a specific kind. The auto-instrumentation adds spans for every LLM call with token counts and model information. Together, the trace tree for this agent looks like:
▼ agent: weather-agent
    ▼ workflow: data-pipeline (v2)
        ▼ tool: weather-lookup
        ▼ task: normalize-weather [tags: experiment-a]
    ▼ llm_call: ChatOpenAI.chat (model: gpt-5.4-mini, 364 tokens, $0.0004)
    ▼ tool: OpenWeatherMapQueryRun
    ▼ llm_call: ChatOpenAI.chat (model: gpt-5.4-mini, 128 tokens, $0.0002)
Every span records: duration, token count (for LLM calls), model name and estimated cost. This is the granularity needed to answer “why did costs spike?” - you can see exactly which step consumed tokens and whether it was expected.
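To show how per-span data answers the "which step consumed tokens" question, here is a minimal sketch that rebuilds the trace above as plain Python objects and rolls it up. `Span`, `walk` and `trace_totals` are hypothetical helpers, not SDK types, and the nesting is inferred from the decorator code rather than taken from an exported trace.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    kind: str
    name: str
    tokens: int = 0
    cost_usd: float = 0.0
    children: list = field(default_factory=list)


def walk(span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def trace_totals(root):
    """Sum tokens and cost over every span in the trace."""
    spans = list(walk(root))
    return sum(s.tokens for s in spans), sum(s.cost_usd for s in spans)


# Token counts and costs copied from the example trace in the text.
trace = Span("agent", "weather-agent", children=[
    Span("workflow", "data-pipeline", children=[
        Span("tool", "weather-lookup"),
        Span("task", "normalize-weather"),
    ]),
    Span("llm_call", "ChatOpenAI.chat", tokens=364, cost_usd=0.0004),
    Span("tool", "OpenWeatherMapQueryRun"),
    Span("llm_call", "ChatOpenAI.chat", tokens=128, cost_usd=0.0002),
])

total_tokens, total_cost = trace_totals(trace)
priciest = max(walk(trace), key=lambda s: s.cost_usd)
print(f"{total_tokens} tokens, ${total_cost:.4f}; "
      f"most expensive span: {priciest.kind} ({priciest.tokens} tokens)")
```

In this trace all of the spend sits in the two LLM calls, and the first call accounts for roughly two thirds of it.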
The Cost Analytics Dashboard aggregates token usage and costs across your organization, calculated server-side by the Collector (which maps model names to current per-token pricing - meaning costs stay accurate even as providers change rates).
The dashboard surfaces two critical dimensions: spending by model and spending by service, each with totals for cost and token usage over any selected time range.
This is the view you check weekly (or daily during rollouts) to catch cost regressions before they hit the invoice.
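The same by-model and by-service rollup can be approximated offline if you export per-call records. The following is a rough sketch, not the Collector's logic: `cost_by` is a hypothetical helper, the record shape is assumed, and the `support-bot` row is made-up sample data added so there is more than one service to compare.

```python
from collections import defaultdict


def cost_by(records, dimension):
    """Total tokens and cost for each distinct value of `dimension`."""
    totals = defaultdict(lambda: {"tokens": 0, "cost_usd": 0.0})
    for record in records:
        bucket = totals[record[dimension]]
        bucket["tokens"] += record["tokens"]
        bucket["cost_usd"] += record["cost_usd"]
    return dict(totals)


# Assumed record shape; the first two rows mirror the example trace.
records = [
    {"service": "weather-agent", "model": "gpt-5.4-mini", "tokens": 364, "cost_usd": 0.0004},
    {"service": "weather-agent", "model": "gpt-5.4-mini", "tokens": 128, "cost_usd": 0.0002},
    {"service": "support-bot", "model": "gpt-4.1-mini", "tokens": 900, "cost_usd": 0.0010},
]

for dimension in ("model", "service"):
    for key, total in sorted(cost_by(records, dimension).items()):
        print(f"{dimension}={key}: {total['tokens']} tokens, ${total['cost_usd']:.4f}")
```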
When the Cost Analytics dashboard shows a spike, the Observations page lets you drill into individual traces to find the root cause.
The Observations page lists all traces with span kind, model, tokens, cost and status. Click any trace to expand the full span tree.
Each trace shows the full span tree with token counts and costs per span. You can filter by:
- Time range (isolate the spike window)
- Application name
- Tags (e.g., customer:acme, env:production)
- Success/failure status
For cost investigation, the typical workflow is:
1. Dashboard shows Tuesday had 40% higher cost than Monday
The LLM Requests view provides a live stream of individual LLM calls. Each entry shows model, provider, token count and cost. This is your real-time cost monitor during deployments - if a new prompt template increases average tokens per call from 400 to 1,200, you’ll see it immediately, not on next month’s invoice.
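A check like the 400-to-1,200 example can also be automated against exported token counts. The sketch below is one possible approach under stated assumptions: `token_regression` is a hypothetical helper, the 25% tolerance is an arbitrary threshold, and both samples are made-up numbers chosen to average 400 and 1,200 tokens per call.

```python
from statistics import mean


def token_regression(baseline_tokens, current_tokens, tolerance=0.25):
    """Return (flagged, ratio) comparing mean tokens per call across two samples."""
    if not baseline_tokens or not current_tokens:
        raise ValueError("both samples must be non-empty")
    ratio = mean(current_tokens) / mean(baseline_tokens)
    return ratio > 1.0 + tolerance, ratio


before_rollout = [380, 410, 395, 415]      # averages 400 tokens per call
after_rollout = [1150, 1240, 1190, 1220]   # averages 1,200 tokens per call

flagged, ratio = token_regression(before_rollout, after_rollout)
print(f"regression={flagged}, ratio={ratio:.2f}x")
```

Comparing means keeps the check simple; a noisier workload might call for medians or percentiles instead.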
Tags enable multi-dimensional cost analysis. The SDK supports three levels:
# 1. Global tags - applied to ALL spans
Observability.instrument(
    app_name="weather-agent",
    api_key="...",
    additional_tags=["production", "release:2.5.1"]
)

# 2. Scoped tags - applied within a context block
from progress.observability import propagate_attributes

with propagate_attributes(tags=["customer:acme-corp", "request:req-abc-123"]):
    result = handle_weather_request(query)
    # All spans (including LLM calls) inside this block get these tags

# 3. Decorator tags - applied to a single function's span
@task(tags=["cohort-a", "experiment:new-prompt"])
def my_function():
    ...
With customer-level tags, you can answer: “Which customers trigger the most expensive agent paths?” With experiment tags: “Did the new prompt template reduce or increase token consumption?” With release tags: “Did v2.5.1 introduce a cost regression vs v2.5.0?”
All three levels merge and deduplicate automatically. Each tag is limited to 200 characters.
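To picture what merging and deduplicating means for a single span, here is a small illustration. It is not the SDK's implementation: `merge_tags` is a hypothetical function, preserving first-seen order is an assumption, and so is rejecting over-length tags with an error (the text only states that the 200-character limit exists).

```python
MAX_TAG_LENGTH = 200  # limit stated in the text


def merge_tags(global_tags, scoped_tags, decorator_tags):
    """Combine the three tag levels, dropping duplicates but keeping order."""
    merged = []
    seen = set()
    for tag in [*global_tags, *scoped_tags, *decorator_tags]:
        if len(tag) > MAX_TAG_LENGTH:
            # Assumed behavior: fail loudly rather than truncate silently.
            raise ValueError(f"tag exceeds {MAX_TAG_LENGTH} characters: {tag[:20]}...")
        if tag not in seen:
            seen.add(tag)
            merged.append(tag)
    return merged


span_tags = merge_tags(
    ["production", "release:2.5.1"],        # global
    ["customer:acme-corp", "production"],   # scoped (repeats a global tag)
    ["cohort-a"],                           # decorator
)
print(span_tags)
```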
Every SDK parameter can be overridden via environment variables, enabling the same code to run across dev/staging/prod with different cost configurations:
export OBSERVABILITY_APP_NAME="weather-agent"
export OBSERVABILITY_API_KEY="ac_p_001_..."
export OBSERVABILITY_ENDPOINT="https://collector.observability.progress.com:443"
export OBSERVABILITY_TRACE_CONTENT="true"
This means you can:
- Use trace_content=True in development (full prompt/completion capture for debugging)
- Use trace_content=False in production (reduces telemetry size, still captures token counts and costs)
- Tag environments differently for cost comparison
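If your own code needs to branch on the same setting, for example to skip building large debug payloads when content capture is off, one way to read it is sketched here. The SDK does its own parsing of `OBSERVABILITY_TRACE_CONTENT`; the helper name `trace_content_enabled` and the accepted truthy values are assumptions made for this example.

```python
import os

TRUTHY = {"1", "true", "yes"}  # assumed spellings, not taken from the SDK


def trace_content_enabled(environ=None, default=False):
    """Read OBSERVABILITY_TRACE_CONTENT from a mapping (os.environ by default)."""
    env = os.environ if environ is None else environ
    raw = env.get("OBSERVABILITY_TRACE_CONTENT")
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


print(trace_content_enabled({"OBSERVABILITY_TRACE_CONTENT": "true"}))
print(trace_content_enabled({}))
```

Passing the mapping in explicitly keeps the function testable without touching the real process environment.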
For teams using .NET with IChatClient from Microsoft.Extensions.AI:
using Microsoft.Extensions.AI;
using Progress.Observability.Extensions.AI;

IChatClient chatClient = new OpenAI.Chat.ChatClient("gpt-4.1-mini", openAIApiKey)
    .AsIChatClient();

chatClient = chatClient.AddObservability(options =>
{
    options.AppName = "Weather Agent";
    options.ApiKey = "ac_p_001_.....";
    options.RecordInputs = true;
    options.RecordOutputs = true;
    options.AdditionalTags = new List<string> { "production", "v2.1.0" };
});

// Tool observability for cost attribution on tool calls
ChatOptions chatOptions = new() { Tools = [..tools] };
chatOptions.AddToolObservability();
The .NET SDK provides the same cost visibility—every chat completion and tool invocation generates a span with token counts, model name and cost calculated server-side by the Collector.
The Progress Observability Platform prices usage in units, which keeps the model straightforward and predictable:
- 1 telemetry span = 1 unit
- 1 evaluation = 2 units, since the judge LLM generates internal spans
The free tier includes 20,000 units/month, 1 seat and 7 days of data retention. In the weather agent example above, a single execution produces about 6 spans total (agent + workflow + task + tool + 2 LLM calls), which means each run consumes 6 units. At that rate, the free tier supports roughly 3,333 agent executions per month - enough for development, testing and smaller production workloads.
The key characteristic of this model is that cost is fixed per span, regardless of how much content that span contains. A span carrying a 10,000-token prompt still costs the same 1 unit as a span carrying a 50-token prompt. That removes the perverse incentive common in traditional APM systems, where richer AI traces become disproportionately expensive to observe - exactly the failure mode highlighted by the Azure AI Foundry example.
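Because units depend only on span and evaluation counts, capacity planning reduces to integer arithmetic. A quick sketch using the figures quoted above; the function names are hypothetical, and the one-evaluation-per-run scenario is an assumed example rather than a documented default.

```python
FREE_TIER_UNITS = 20_000   # monthly allowance quoted in the text
UNITS_PER_SPAN = 1
UNITS_PER_EVALUATION = 2


def units_per_run(spans, evaluations=0):
    """Units consumed by one agent execution."""
    return spans * UNITS_PER_SPAN + evaluations * UNITS_PER_EVALUATION


def runs_within(budget_units, spans, evaluations=0):
    """Whole agent executions that fit in a unit budget."""
    return budget_units // units_per_run(spans, evaluations)


plain_runs = runs_within(FREE_TIER_UNITS, spans=6)
evaluated_runs = runs_within(FREE_TIER_UNITS, spans=6, evaluations=1)
print(f"without evaluations: {plain_runs} runs/month")
print(f"with one evaluation per run: {evaluated_runs} runs/month")
```

Prompt length never appears in the calculation, which is the point of per-span pricing.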
Always call shutdown before your process exits so that buffered spans (including cost-critical token count data) are flushed:
try:
    run_agent()
finally:
    Observability.shutdown()
Without this, the last batch of spans may be lost - especially in short-lived processes like serverless functions or CLI tools. Lost spans mean lost cost data, which defeats the purpose of instrumentation.
AI cost surprises are visibility problems. The per-token rates are published and stable. What’s unpredictable is agent behavior (retries, context growth, multi-step reasoning, automated evaluations), and that behavior is invisible to traditional monitoring. Teams need trace-level cost context before the invoice arrives.
The workflow is straightforward: instrument, attribute, baseline, monitor, investigate, optimize, validate. Any platform that gives you token-level cost data per trace, per span and per tag will close the visibility gap. Progress Observability is one way to put that into practice - with explicit opt-in instrumentation, predictable unit-based pricing and trace-level cost visibility from the first call.
If you want to explore this approach, get started free at telerik.com/ai-observability-platform or reach out to discuss how cost visibility fits your team’s workflow.
Nikolay Iliev is a senior technical support engineer and, as such, is a part of the Fiddler family. He joined the support team in 2016 and has been striving to deliver customer satisfaction ever since. Nick usually rests with a console game or a sci-fi book.