
The philosophy of Kaizen teaches that true excellence is not achieved through singular algorithmic breakthroughs, but through continuous, disciplined, iterative improvement. In the realm of AI automation, this principle is paramount. Reliability is an engineering outcome, not merely an algorithmic one.
We previously announced reaching the two-sigma milestone: 94% reliability. While a significant achievement, the reality of enterprise operations dictates that a 6% failure rate is unacceptable for mission-critical platforms like ServiceNow. At scale, this level of failure creates excessive exceptions, demands constant human oversight, and rapidly erodes the ROI of automation.
Where we are now:
These metrics were validated with the same externally verified protocol as our previous post: 63%->94%: Silverstream Hits 2σ Reliability for Web Agents.
| | S@5 | S@1 |
|---|---|---|
| Contractor (untrained) | 33% | 17% |
| Contractor (trained) | 83% | 55% |
| Silverstream Agent | 100% | 86% |
| Highly skilled worker | 100% | 91% |
To compare with human performance, we hired external contractors to compete with our agent. The untrained group was given the same information the agent receives; the trained group was additionally taught the ServiceNow platform features needed to solve the tasks; the highly skilled workers had graduate-level (or higher) education, platform mastery, and relevant work experience.
Given a budget of five non-destructive attempts per task, our agents now outperform trained contractors, who remain prone to error, distraction, and fatigue. Agents, by contrast, perform at a constant, predictable level.
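As a back-of-the-envelope intuition (under the idealized assumption that attempts are independent, which real retries are not), a per-attempt success rate $p$ compounds over a budget of $K$ attempts:

$$S@K \approx 1 - (1 - p)^K$$

With $p = 0.86$ (our measured S@1), this would predict $S@5 \approx 1 - 0.14^5 \approx 0.99995$. Attempts are correlated in practice, which is why we measure S@5 directly rather than extrapolate it.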
The critical differentiator of the Silverstream managed platform is its ability to close the reliability gap for agents through continual learning. This "delta" validates our strategic decision to own the entire agentic stack, from orchestration and memory management down to the browser environment, enabling our partners to benefit from reliable automation and realize ROI quickly. Because the world puts a premium on speed of adoption, this is achieved without a rip-and-replace project or specialized internal hires.
The guiding principle of our engineering process is simple: Reliability is the gate to ROI and trust. We designed these metrics to translate directly into financial levers for your business, making the impact clear to all stakeholders. The Success@K (S@K) framework, already established as a standard in AI model evaluation, serves as the operational metric that lets us measure progress across three axes: Efficiency, Reliability, and Safety. A minimal computation sketch follows the definitions below.
S@1 – Efficiency (86%): How often the agent completes the task correctly on the very first attempt, without exploring or getting lost. Higher S@1 lowers compute costs and reduces time per task.
S@5 – Reliability (100%): Guarantees the outcome within our evaluation protocol. S@5 is the metric that ensures your critical workflows are completed without exceptions, removing the risk of an overnight failure impacting your SLAs.
Definition: Success is achieved within five attempts without escalation to a human and without destructive actions (e.g., sending the wrong email or irreversibly overwriting a field).
Failure per Million – Safety (0 observed): Measures critical failures: leaking sensitive data, destroying information, or executing unauthorized destructive changes. These failures are the digital equivalent of a reprimand or a fireable offense. We target a 5-sigma quality band, i.e., fewer than 233 failures per million trials. Robust escalation and guardrail systems mean these failures are vanishingly rare relative to the S@5 volume.
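To make the arithmetic behind these definitions concrete, here is a minimal sketch in Python. The `TaskResult` record and its fields are illustrative assumptions for this post, not our internal schema.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """One task's outcome (illustrative fields, not our internal schema)."""
    attempts_to_success: int | None  # 1-based index of the first successful attempt; None if never solved
    escalated: bool                  # handed off to a human
    critical_failure: bool           # leaked data, destroyed information, or unauthorized destructive change

def success_at_k(results: list[TaskResult], k: int) -> float:
    """S@K: fraction of tasks solved within k attempts, with no escalation and no destructive action."""
    wins = sum(
        1 for r in results
        if r.attempts_to_success is not None
        and r.attempts_to_success <= k
        and not r.escalated
        and not r.critical_failure
    )
    return wins / len(results)

def failures_per_million(results: list[TaskResult]) -> float:
    """Critical-failure rate scaled to one million trials (a 5-sigma band is ~233 FPM)."""
    critical = sum(1 for r in results if r.critical_failure)
    return 1_000_000 * critical / len(results)

# Toy run: three tasks, two solved on the first try, one on the third attempt.
demo = [TaskResult(1, False, False), TaskResult(3, False, False), TaskResult(1, False, False)]
print(success_at_k(demo, 1))       # 0.666...  (S@1)
print(success_at_k(demo, 5))       # 1.0       (S@5)
print(failures_per_million(demo))  # 0.0
```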
To achieve 100% S@5, we focused intensely on understanding the 6% of attempts that failed. Our automated internal evals pipeline aggregates logs across every layer of the stack, from LLM logits to cloud telemetry logs, and performs root cause analysis with detailed interaction replays.
A single attempt generates: OpenTelemetry infrastructure traces, replay videos, Playwright traces, DOM snapshots, prompt–action pairs, a textual root‑cause analysis (what went wrong and why), and a closed ontology classification. We can replay behavior forward and backward in time to reproduce issues.
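For illustration, one attempt's evidence bundle could be modeled as the record below; the names are hypothetical stand-ins for this sketch, not our production schema.

```python
from dataclasses import dataclass

@dataclass
class AttemptArtifacts:
    """Evidence bundle for one attempt (hypothetical field names)."""
    otel_trace_id: str                          # OpenTelemetry infrastructure trace
    replay_video_path: str                      # video recording of the attempt
    playwright_trace_path: str                  # low-level browser trace
    dom_snapshots: list[str]                    # serialized DOM at each step
    prompt_action_pairs: list[tuple[str, str]]  # (prompt, chosen action) per step
    root_cause: str                             # textual analysis: what went wrong and why
    ontology_label: str                         # label from a closed failure ontology

def is_transient(attempt: AttemptArtifacts, transient_labels: set[str]) -> bool:
    """Transient, environment-level failures are the retryable ones within the S@5 budget."""
    return attempt.ontology_label in transient_labels
```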
The analysis confirmed that the agent does not hallucinate incorrect data, corrupt the underlying system, execute unsafe actions, or enter unrecoverable states. The errors were overwhelmingly transient, environmental, and recoverable.
In every case, the Silverstream harness instantly detected the non-critical error (e.g., "element failed to load," "interaction method failed") and retried or replanned within the S@5 window.
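A minimal sketch of that detect-retry-replan loop under the five-attempt budget; the callables, labels, and return values below are hypothetical stand-ins, not the actual Silverstream harness API.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative closed-ontology labels treated as transient and retryable.
TRANSIENT = {"element_failed_to_load", "interaction_method_failed"}

@dataclass
class AttemptOutcome:
    success: bool
    error_label: str | None = None  # closed-ontology label when the attempt fails

def execute_with_budget(
    run_attempt: Callable[[], AttemptOutcome],  # one non-destructive attempt
    replan: Callable[[AttemptOutcome], None],   # adjust the plan after a transient error
    max_attempts: int = 5,                      # the S@5 budget
) -> str:
    for _ in range(max_attempts):
        outcome = run_attempt()
        if outcome.success:
            return "done"
        if outcome.error_label not in TRANSIENT:
            return "escalate"  # unrecognized error: hand off rather than risk a destructive retry
        replan(outcome)        # transient, environmental error: retry within the budget
    return "escalate"          # budget exhausted without success
```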
Reaching 100% S@5 marks an important milestone: autonomous agents on ServiceNow have moved from unreliable to reliable operational tools. The gate to ROI is now open for agentic AI.
You can deploy with confidence rather than run prolonged, unreliable pilots; the outcomes are made explicit and measurable by our S@K framework.
The Kaizen process continues. Next, we’re pushing S@1 higher on public third‑party benchmarks like ServiceNow’s and on the private digital twins we set up with customers.
Want to explore the Pasta-1T dataset and join the Silverstream AI community? Fill out this form to request access. For collaboration inquiries or more information, feel free to contact us.
Tell us about your enterprise use case. We will follow up with a tailored demo.