Kaizen: From 94% to Superhuman Performance on the ServiceNow Platform

Through continuous improvement, our AI agents now achieve 86% first-try accuracy (S@1) and 100% success within five attempts (S@5) on the most common ServiceNow tasks - with zero catastrophic errors observed. This means enterprise teams can trust tasks will be completed reliably without human intervention.

In the same protocol, trained contractors achieved 55% S@1 and 83% S@5, showing that our continual-learning approach produces AI agents that reliably exceed trained humans on enterprise back-office work while operating at a consistent, fatigue-free level.

December 1, 2025

The Leap to Reliable Autonomous Agents

The philosophy of Kaizen teaches that true excellence is not achieved through singular algorithmic breakthroughs, but through continuous, disciplined, iterative improvement. In the realm of AI automation, this principle is paramount. Reliability is an engineering outcome, not merely an algorithmic one.

We previously announced reaching the two-sigma milestone: 94% reliability. While a significant achievement, the reality of enterprise operations dictates that a 6% failure rate is unacceptable for mission-critical platforms like ServiceNow. At scale, this level of failure creates excessive exceptions, demands constant human oversight, and rapidly erodes the ROI of automation.

Where we are now:

  • S@5: 100%. The agent succeeds within five non-destructive attempts every time.
  • S@1: 86%. The agent completes the task on the first attempt the vast majority of the time.

These metrics were validated with the same externally verified protocol as our previous post, "63%->94%: Silverstream Hits 2σ Reliability for Web Agents".

Human vs Agent performance

                          S@5     S@1
Contractor (untrained)    33%     17%
Contractor (trained)      83%     55%
Silverstream Agent        100%    86%
Highly skilled worker     100%    91%

To compare with human performance, we hired external contractors to compete with our agent. The untrained group was given the same information the agent is given; the trained group was taught the ServiceNow platform features necessary to solve the tasks; finally, the highly skilled workers had graduate-level (or higher) education, platform mastery and relevant work experience.

Given a budget of five non-destructive attempts to complete a task, our agents now outperform trained contractors, who are prone to error, distraction, or fatigue. Agents are highly reliable and perform at a constant, predictable level.

The critical differentiator of the Silverstream managed platform is the ability to close the reliability gap for agents through continual learning. This "delta" validates our strategic decision to own the entire agentic stack, from orchestration and memory management down to the browser environment, enabling our partners to benefit from reliable automation and realize ROI quickly. As the world puts a premium on speed of adoption, this is achieved without a rip-and-replace project or specialized internal hires.

Success@K as the Metric for Agentic Evals

The guiding principle of our engineering process is simple: Reliability is the gate to ROI and trust. We designed these metrics to translate directly into financial levers for your business, making the impact clear to all stakeholders. The Success@K (S@K) framework, a metric already established as a standard in AI model evaluation, serves as the operational metric that allows us to measure progress across three axes: Efficiency, Reliability, and Safety.

S@1 – Efficiency (86%): How often the agent completes the task correctly on the very first attempt, without exploring or getting lost. Higher S@1 lowers compute costs and reduces time per task.

S@5 – Reliability (100%): Guarantees the outcome within our evaluation protocol. S@5 is the metric that ensures your critical workflows are completed without exceptions, removing the risk of an overnight failure impacting your SLAs.

Definition: Success is achieved within five attempts without escalation to a human and without destructive actions (e.g., sending the wrong email or irreversibly overwriting a field).
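The definition above can be turned into a small scoring routine. The following is a minimal sketch, not the actual evaluation harness: the `Attempt` record, its field names, and both functions are hypothetical, and it assumes that a destructive action or a human escalation before the first clean success disqualifies the task.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Attempt:
    """One attempt at a task (hypothetical record layout)."""
    succeeded: bool            # task goal verified as met
    destructive: bool = False  # e.g. wrong email sent, field irreversibly overwritten
    escalated: bool = False    # handed off to a human


def success_at_k(attempts: List[Attempt], k: int) -> bool:
    """True if a clean success occurs within the first k attempts.

    A destructive action or an escalation at or before that success
    disqualifies the task, mirroring the definition in the text.
    """
    for attempt in attempts[:k]:
        if attempt.destructive or attempt.escalated:
            return False
        if attempt.succeeded:
            return True
    return False


def s_at_k(tasks: Dict[str, List[Attempt]], k: int) -> float:
    """Fraction of tasks that satisfy success_at_k."""
    if not tasks:
        raise ValueError("no tasks to score")
    return sum(success_at_k(a, k) for a in tasks.values()) / len(tasks)
```

With this shape, S@1 and S@5 are the same function called with `k=1` and `k=5` over the same attempt logs, so S@5 can never be lower than S@1.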

Failure per Million – Safety (0 observed): Measures critical failures: leaking sensitive data, destroying information, or executing unauthorized destructive changes. These failures are the digital equivalent of a reprimand or fireable offense. We target the 5-sigma quality band or better, meaning fewer than 233 failures per million trials. Robust escalation and guardrail systems mean that these failures are vanishingly rare relative to the S@5 volume.
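The 233-per-million figure matches the conventional Six Sigma conversion, which takes the one-sided normal tail after a 1.5-sigma long-term shift. The sketch below assumes that convention (the post does not state which one it uses); the function names are hypothetical.

```python
from statistics import NormalDist

# Assumption: the conventional Six Sigma 1.5-sigma long-term shift.
SIGMA_SHIFT = 1.5


def sigma_to_failures_per_million(sigma_level: float, shift: float = SIGMA_SHIFT) -> float:
    """Failures per million trials implied by a sigma level (one-sided tail)."""
    tail_probability = 1.0 - NormalDist().cdf(sigma_level - shift)
    return tail_probability * 1_000_000


def observed_failures_per_million(failures: int, trials: int) -> float:
    """Scale an observed critical-failure count to a per-million rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return failures / trials * 1_000_000


def meets_target(failures: int, trials: int, sigma_level: float = 5.0) -> bool:
    """True if the observed rate is below the rate implied by sigma_level."""
    return observed_failures_per_million(failures, trials) < sigma_to_failures_per_million(sigma_level)
```

Under this convention, `sigma_to_failures_per_million(5.0)` rounds to 233, and any run with zero observed critical failures passes `meets_target`.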

The "Last Mile" Challenge (94% to 100%)

Closing the final 6% to reach 100% S@5 is exponentially harder than getting to 94%.

In complex, customized environments like ServiceNow, that means handling dynamic forms, intricate dependencies, platform instability, customized UI configurations, and rare UI glitches, while mitigating hallucinations and overfitting in the underlying models.

This gap cannot be closed simply by waiting for better foundation models. It requires owning the full stack and a commitment to engineering discipline. Agent adoption is a competitive advantage for our customers; as with hiring the best person for the job, improvements in agentic performance yield non-linear ROI, and we provide these results ahead of the curve.

Owning the full agentic stack and seamless integration

Controlling the agentic execution stack is crucial for reliability, but no single foundation model is optimal for every scenario. Our architecture is designed to be modular and adaptable. We strategically select and adapt the best foundation models for our customers, optimizing based on individual strengths identified through ad-hoc benchmarking, specific privacy and in-house deployment requirements, performance needs, and cost considerations.

This sophistication does not complicate integration. Our agents operate directly on the same UI your employees use, minimizing integration friction and infrastructure changes. You get the benefits of automation without a costly rip‑and‑replace project.

Inefficiency Analysis: Deconstructing behavior

To achieve 100% S@5, we focused intensely on understanding the 6% of attempts that failed. Our automated internal evals pipeline aggregates logs across every layer of the stack, from LLM logits to cloud telemetry logs, and performs root cause analysis, providing detailed interaction replays.

A single attempt generates: OpenTelemetry infrastructure traces, replay videos, Playwright traces, DOM snapshots, prompt–action pairs, a textual root‑cause analysis (what went wrong and why), and a closed ontology classification. We can replay behavior forward and backward in time to reproduce issues.
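One way to picture the per-attempt bundle is as a single record that ties the artifacts together. The sketch below is illustrative only: the class, field names, and file layout are hypothetical and do not describe the actual Silverstream schema.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List, Tuple


@dataclass
class AttemptRecord:
    """Artifacts produced by one attempt (hypothetical layout)."""
    task_id: str
    attempt_index: int            # 1-based, within the five-attempt budget
    otel_trace_id: str            # OpenTelemetry infrastructure trace
    replay_video: Path            # recorded replay of the session
    playwright_trace: Path        # Playwright trace archive
    dom_snapshots: List[Path] = field(default_factory=list)
    prompt_action_pairs: List[Tuple[str, str]] = field(default_factory=list)
    root_cause: str = ""          # textual analysis: what went wrong and why
    category: str = ""            # label from the closed ontology, e.g. "FORM"

    def to_json(self) -> str:
        """Serialize for aggregation; paths are stored as strings."""
        return json.dumps(asdict(self), default=str, sort_keys=True)
```

Keeping every artifact keyed by the same task and attempt index is what makes it possible to line up a DOM snapshot, a prompt–action pair, and an infrastructure trace for the same moment in a run.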

The Key Finding: No Critical Errors

The analysis confirmed that the agent does not hallucinate incorrect data, corrupt the underlying system, execute unsafe actions, or enter unrecoverable states. The errors were overwhelmingly transient, environmental, and recoverable.

Examples of recovered failures:
  1. FORM (7 cases): The agent forgot to fill optional elements on long forms and had to go back after the order review.
  2. SORT (6 cases): A top‑performing agent over‑fit to an older UI version; it eventually switched to specialized actions.
  3. KNOWLEDGE (1 case): The agent did not count an element outside the browser viewport.
  4. FILTER (2 cases): The agent got temporarily confused with dropdowns, and the ServiceNow platform went offline.
  5. SERVICE_CATALOG (2 cases): The agent tried to select the desired value with arrow keys instead of the more efficient direct clicking.

In every case, the Silverstream harness instantly detected the non‑critical error (e.g., "element failed to load" or "interaction method failed") and retried or replanned within the S@5 window.
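The detect-and-retry behavior described above can be sketched as a bounded loop that retries only recoverable errors. This is a simplified illustration under stated assumptions, not the Silverstream harness: the exception classes, `RunResult`, and `run_with_budget` are hypothetical names, and passing the last error message to the next attempt stands in for replanning.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class RecoverableError(Exception):
    """Transient or environmental failure, e.g. 'element failed to load'."""


class CriticalError(Exception):
    """Unsafe or destructive condition; never retried."""


@dataclass
class RunResult:
    succeeded: bool
    attempts_used: int
    errors: List[str]


def run_with_budget(
    attempt_task: Callable[[int, Optional[str]], None],
    max_attempts: int = 5,
) -> RunResult:
    """Run attempt_task until it succeeds or the attempt budget is spent.

    attempt_task receives the 1-based attempt number and the previous
    error message (None on the first attempt) so it can replan.
    CriticalError is deliberately not caught: it propagates to the caller,
    which is where escalation to a human would happen.
    """
    errors: List[str] = []
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        try:
            attempt_task(attempt, last_error)
        except RecoverableError as exc:
            last_error = str(exc)
            errors.append(last_error)
            continue
        return RunResult(True, attempt, errors)
    return RunResult(False, max_attempts, errors)
```

Separating the two exception types keeps the retry budget reserved for transient problems, so a retry can never paper over an unsafe action.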

An agent triaging customer support requests autonomously

Conclusion: Reliable automation in enterprise settings

Reaching 100% S@5 marks an important milestone: autonomous agents on ServiceNow have moved from unreliable to reliable operational tools. The gate to ROI is now open for agentic AI.

You can deploy with confidence rather than run prolonged, unreliable pilots, because the outcomes are made explicit and measurable by our S@K framework.

The Kaizen process continues. Next, we’re pushing S@1 higher, on public third‑party benchmarks like ServiceNow’s and on the private digital twins we set up with customers.

