Research

Notes on studying agents in production.

Agent Etna is also a research effort. The agents we work on are not in laboratories. They are in production, doing real jobs for real users, failing in ways no benchmark anticipated. We think the path from today's narrow agents to broader, more reliable ones runs through the unglamorous work of studying those failures — and the people doing the studying should be the same people building the fixes. This page is a partial account of how we read what we see.

1. Production is the laboratory.

There is a long tradition in computer science of arguing that systems should be evaluated under the conditions they will be deployed under. For AI agents this has not yet happened at any meaningful scale. Most evaluation is done on curated benchmarks — synthetic conversations, hand-written cases, problems with known answers. The agents that actually ship are judged, if at all, by aggregated thumbs-ups.

The gap between those two regimes is the central methodological problem in agent reliability. Closing it is the work. Every fix is a controlled experiment; every rollback a refuted hypothesis; every adversarial probe an attempt to falsify a claim about what the agent will do. The customers' production agents — with their permission — are the only ground truth a leaderboard cannot reach.

2. The agents already shipping are the only honest data.

Synthetic suites tell you what a model can do. Production tells you what it does. The distinction matters because the failure modes that get a system pulled from production are almost never the failure modes a benchmark was designed to catch. They are quieter: a slow drift in refusal behaviour after a prompt change, a regression in formatting that only matters because the output is downstream of another agent, an edge case in tool use that emerges when one user phrases a request in one particular way.

We instrument the unglamorous parts — latency at the percentile, refusal rates, drift across model versions, regressions after a prompt change — and treat them as primary research material rather than ops data. Findings shape the rubric. The rubric shapes the next fix.

3. Failure as a first-class object.

A failed reply is not a bug to be silenced. It is the most informative thing the system produces that day. Hiding it in a log is throwing away the signal that produced it.

Failures are scored on five dimensions, clustered, and surfaced rather than aggregated away. Patterns become regression tests; edge cases become rubric entries. The agent's weaknesses, once visible, are what let the next iteration be measurably better than the last. The opposite practice — averaging the failures into a single composite score — is dishonest by construction. It tells you the system is fine on average while burying the few interactions that decide whether anyone trusts it.

4. A fix is a unit of research.

A single fix is just a fix. A system of fixes, scored and signed, is a research programme. Every accepted patch should leave behind more than a working agent — it should leave a sharper definition of what "working" means in that context.

Over time the rubric becomes a more honest specification of the system than the system itself. The fixes feed the rubric, the rubric feeds the simulator, the simulator feeds the next fix. Self-improvement and the research effort are the same machine, viewed from two angles.

5. On where this might lead.

We are not building AGI, and we do not believe anyone shipping a developer tool in 2026 is either. The honest claim is more modest: the path from today's narrow agents to broader, more reliable ones runs through the unglamorous work of fixing the agents that already exist. We would rather be part of that work than a spectator to it.

Every capability the product gains — sandboxed verification, signed provenance, adversarial probes, transparent scoring — is also a research instrument. The same tooling that makes a customer's agent safer to ship is the tooling that lets us measure whether the field is, in fact, getting better.

6. On publishing what we find.

Findings that stay locked inside one company's dashboards are not research; they are trivia. The point of looking closely at how agents fail is to make the next agent — anyone's next agent — fail less.

Aggregate findings, anonymised and customer-consented, flow into public write-ups, open rubric components, and the changelog. The product gets sharper for the customer who paid for it. The field gets sharper because that customer paid for it.

7. Self-improvement is a slope.

Most platforms hand you the same agent at the end of the year you started with. Every accepted fix, every rolled-back change, every thumbs-up is a teaching signal — the patterns that worked become the starting point for the next iteration, the ones that failed become things the agent learns to avoid. None of it is wasted.

The teams that get the most from this are not the ones that set everything up perfectly on day one. They are the ones that use Agent Etna for a year. Every week of real use is another week of quality the agent did not have before. Improvement, we have come to think, is not an event. It is a slope, and the only thing that matters is its sign.

Be a data point, not a test subject.

Connect a repository, run a fix, and your agent's signals — with your permission — help shape what comes next, for you and for everyone else trying to ship something that works.

Get started