How One Enterprise Rolled Out AI Testing Across Ten Applications

Published: May 15, 2026

Updated: April 16, 2026

Frameworks and principles are useful, but there is no substitute for seeing how AI adoption actually unfolds in practice. This is the story of how a large organization moved from initial exploration to daily use of AI testing tools across ten applications. The path was not straightforward, and the lessons apply broadly.

Where They Started

The organization, a multinational company with a complex software portfolio, had been watching AI in testing for several years. Leadership faced the question many enterprises face: when to move from observation to action. Competitors were announcing AI initiatives. Vendors were making bold claims. There was pressure to show progress.

The initial instinct was to issue an RFP for an enterprise-wide AI testing solution. The evaluation criteria focused on speed: how quickly AI could generate tests, how much it could accelerate execution, and what reduction in manual effort could be expected. These are the metrics vendors tend to emphasize.

Early conversations revealed a problem with this approach. Vendors responded with impressive numbers, but the numbers were hard to verify and even harder to compare across different contexts. Speed improvements depend heavily on the nature of the application, the quality of existing test assets, and the maturity of the testing process. Claims that sounded compelling in presentations did not map clearly to this organization’s specific situation.

Changing the Evaluation

Phil Lew helped them rethink their criteria. The client initially wanted to know how many test cases the tool could generate and how fast. We redirected the conversation: speed is not the key metric; quality of results is.

The evaluation shifted from speed to precision. Instead of asking how fast AI could generate tests, the focus became whether those tests would be useful. Instead of measuring acceleration, the question was whether AI would catch defects that actually mattered.

This reframing changed the vendor conversation significantly. Vendors that had led with speed struggled to provide evidence about precision, false positive rates, and real-world defect detection. Others, particularly those with experience in enterprise deployments, could speak to these concerns with specific data.

“Probably the most important feature of a very successful AI testing tool is the precision or accuracy. If you’ve got 30-40% of the defects that the AI testing tool finds and they’re basically not defects, then that’s a problem.” — Phil Lew

The organization also changed how they thought about the vendor relationship. Rather than treating tool selection as a procurement exercise, they started looking for a partnership. AI testing tools are not mature enough to deploy and forget. They require ongoing tuning, integration work, and collaboration. The vendor’s ability to engage as a partner became a key criterion.

Setting Expectations

Before moving forward, leadership needed to align on what AI could and could not do. The initial vision, shaped by vendor marketing, imagined AI operating autonomously with minimal human involvement.

The more accurate picture was AI as supplement. AI would generate candidate test cases that humans would review and refine. AI would execute tests and flag potential issues that humans would investigate. AI would reduce maintenance through self-healing, but humans would still handle significant application changes.

Lew describes the resulting mindset: the client does not expect AI to totally replace their old automation and throw it away. It supplements what they have and adds coverage they did not have before.

This adjustment was critical. It prevented the disappointment that comes when reality fails to match inflated projections. It also shaped planning for staffing and process changes. Rather than reducing headcount, the plan called for shifting how testing resources spent their time.

Starting Small

Rather than rolling out across all ten applications at once, the organization started with two that were relatively stable and well-understood. This let the team learn how the tool behaved, calibrate expectations, and develop processes for reviewing AI output before scaling up.

Early results were mixed, which is typical. AI generated test cases covering scenarios the existing suite had missed. It also generated cases that were redundant, imprecise, or based on a misunderstanding of how the application worked. Testers reviewing the output needed time to develop judgment about which suggestions to accept, modify, or reject.

They discovered that some of their existing test documentation required interpretation that AI could not provide. Converting those test cases to AI-ready formats became an early focus.
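
As an illustration of that conversion work, here is a minimal sketch contrasting a free-text legacy case with a structured, AI-ready version. The test case itself and the field names are hypothetical, not taken from the organization's actual suite.

```python
# Illustrative contrast between a legacy, free-text test case that needs
# human interpretation and a structured, AI-ready version.
# All field names and values are hypothetical.

legacy_case = """
Verify that checkout works correctly for a typical customer.
"""

ai_ready_case = {
    "id": "TC-1042",
    "title": "Checkout with a saved credit card",
    "preconditions": [
        "User is logged in",
        "Cart contains exactly one in-stock item",
        "A valid credit card is saved on the account",
    ],
    "steps": [
        {"action": "Open the cart and select 'Checkout'",
         "expected": "Payment page lists the saved card as the default"},
        {"action": "Confirm the order without editing any fields",
         "expected": "Order confirmation page shows an order number"},
    ],
}
```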

Precision was tracked rigorously. Initial false positive rates were higher than hoped but improved as the tool was calibrated to the specific applications and as testers learned to write inputs the AI could interpret more reliably.
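
A minimal sketch of how that tracking can work: AI-flagged defects are triaged by a human reviewer, and precision is the share of flags the reviewer confirms as real defects. The data structure and labels below are illustrative, not the organization's actual tooling.

```python
# Sketch of precision / false-positive tracking for AI-flagged defects.
# Field names and the triage workflow are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class FlaggedDefect:
    defect_id: str
    confirmed: bool  # set by a human reviewer during triage

def precision_metrics(flags: list[FlaggedDefect]) -> dict[str, float]:
    """Share of AI-flagged defects that reviewers confirmed as real."""
    if not flags:
        return {"precision": 0.0, "false_positive_rate": 0.0}
    confirmed = sum(1 for f in flags if f.confirmed)
    precision = confirmed / len(flags)
    return {"precision": precision, "false_positive_rate": 1.0 - precision}

# Example: 7 of 10 flagged issues confirmed -> precision 0.70, false
# positive rate 0.30, roughly the 30-40% range Lew describes as a problem.
flags = [FlaggedDefect(f"D-{i}", confirmed=i < 7) for i in range(10)]
print(precision_metrics(flags))
```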

Expanding

With lessons from the first two applications, the organization expanded to four more, then eventually to all ten. Each expansion surfaced new challenges. Applications with more complex workflows required more sophisticated prompting. Applications with unusual interfaces tested the limits of self-healing.

The team developed internal expertise. Certain testers proved skilled at translating requirements into AI-friendly formats. Others became good at reviewing AI output and providing feedback that improved performance over time. These people became internal champions who helped colleagues adopt the tool effectively.

Integration required ongoing attention. The organization used specific test management and CI/CD tools. Getting AI output to flow smoothly into these systems without manual data transfer took more effort than initially expected.
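
The specific tools are not named in this case study, so as a hedged illustration, here is one common pattern for that kind of glue work: converting an AI tool's results export (the JSON fields below are assumptions) into JUnit-style XML, a format most CI servers and test management systems can already ingest.

```python
# Hypothetical glue script: convert an AI testing tool's JSON results export
# into JUnit-style XML so CI and test management systems can ingest them
# without manual data transfer. The JSON field names are assumptions.

import json
import xml.etree.ElementTree as ET

def ai_results_to_junit(results_json: str) -> ET.ElementTree:
    results = json.loads(results_json)
    suite = ET.Element("testsuite", name="ai-generated-tests",
                       tests=str(len(results)))
    for r in results:
        case = ET.SubElement(suite, "testcase",
                             name=r["name"], time=str(r.get("duration", 0)))
        if r["status"] == "failed":
            failure = ET.SubElement(case, "failure",
                                    message=r.get("message", ""))
            failure.text = r.get("details", "")
    return ET.ElementTree(suite)

if __name__ == "__main__":
    sample = json.dumps([
        {"name": "login_happy_path", "status": "passed", "duration": 3.2},
        {"name": "checkout_saved_card", "status": "failed",
         "message": "Total mismatch", "details": "Expected $42.00, got $0.00"},
    ])
    ai_results_to_junit(sample).write("ai-results.xml",
                                      encoding="utf-8", xml_declaration=True)
```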

Where They Are Now

The organization now uses AI testing daily across all ten applications. The tool has become part of normal workflow rather than a separate initiative. Testers move between manual testing, traditional automation, and AI-based testing depending on what each situation calls for.

Lew describes their operating model: three overlapping circles of manual testing, automation scripts, and AI-based testing. Engineers work across all three. Each iteration, they look at where AI can take over work that was previously manual or scripted.

Coverage has increased. AI finds issues in areas where the team previously lacked bandwidth, like accessibility, performance edge cases, and security surface checks. Testers have been freed from repetitive work to focus on exploratory testing and complex workflow validation.

The vendor relationship evolved into genuine partnership. Weekly calls address issues and feature requests. Problems get fixed quickly. The tool improves based on real usage rather than theoretical requirements.

What They Learned

Several lessons emerged that apply beyond this particular organization.

Evaluation criteria shape outcomes. Focusing on precision rather than speed changed which vendors could compete and what evidence they needed to provide. Organizations evaluating AI tools should define success metrics before engaging vendors, then demand evidence against those specific metrics.

Expectations matter. Framing AI as supplement rather than replacement let them see value that would have been invisible through a replacement lens. Teams expecting autonomous AI see failure. Teams expecting augmented capability see success.

Preparing for AI has value regardless. The work of converting ambiguous test cases to clear documentation improved testing even before AI was involved.

People adapt. Initial skepticism among testers gave way to appreciation as they experienced AI handling tedious work. The shift required patience and transparency about intentions. Lew emphasizes that you have to position AI clearly, explaining what it will be used for, so testers feel confident they are still needed.

Partnership with vendors matters. AI tools are not finished products. They require collaboration to tune, integrate, and improve. Selecting a vendor willing to engage as a partner proved important.

Timelines exceed projections. What was imagined as a six-month rollout took over a year. This is typical for technology adoption of this complexity.

The XBOSoft Perspective

This engagement taught us as much as it taught the client. We refined our approach to helping organizations evaluate AI tools, set expectations, and manage rollouts based on what worked and what did not. The partnership continues, and the organization continues to expand how they use AI in testing. What made the difference was willingness to learn, adjust, and treat adoption as a process rather than an event.

Next Steps

See the full framework

This case illustrates principles covered throughout the pillar guide, including evaluation, readiness, and team dynamics.

Explore AI-Informed QA: Going Beyond the Hype

Talk through your situation

Every organization’s context is different. A conversation can help clarify how lessons from this case apply to your circumstances.

Contact XBOSoft

Related Articles and Resources

Looking for more insights on Agile, DevOps, and quality practices? Explore our latest articles for practical tips, proven strategies, and real-world lessons from QA teams around the world.

AI-Informed QA: Going Beyond the Hype (Industry Expertise, January 2, 2026)

Where AI Helps and Where It Falls Short in QA (Industry Expertise, February 6, 2026)

Assessing Your Team’s Readiness for AI Testing (Industry Expertise, March 6, 2026)