Published: April 16, 2026
Updated: April 16, 2026
There is real pressure to move quickly on AI. Boards want to see AI initiatives. Competitors are announcing them. The temptation is to pick a tool, roll it out broadly, and demonstrate progress. That approach usually produces poor results. A better path is to start with a well-designed pilot that generates actual learning about how AI works in your specific context. The learning from a good pilot makes everything that comes after more likely to succeed.
A pilot serves a different purpose than a full implementation. The goal is not to prove that AI works, which vendor demos have already suggested. The goal is to learn how AI works with your applications, your data, your processes, and your team. What adjustments does it need? Where does it deliver value? Where does it fall short?
These questions cannot be answered in the abstract. You need direct experience with the technology under conditions that resemble how you would actually use it. Vendor sandboxes and demo environments do not provide this. They are set up to showcase capabilities under ideal conditions, not to reveal how the tool performs when things are messy.
A pilot also gives you the opportunity to fail cheaply. According to S&P Global research, 42 percent of companies abandoned most of their AI initiatives in 2025, up from 17 percent the year before. If your tool does not work as expected, or integration is harder than anticipated, or your team struggles to incorporate AI into their workflow, it is much better to discover that in a contained pilot than in an enterprise-wide rollout.
Phil Lew’s advice on scope is straightforward: start small and simple. Pick your simplest application or website. Start with your simplest set of test cases.
A practical starting point is a single application with somewhere between twenty and fifty test cases. That is enough to see patterns in how the AI behaves without getting overwhelmed. The application should be one your team knows well, so you can evaluate AI output against known expectations. It should also be reasonably stable, so that application changes do not confuse your assessment of the tool.
Resist the temptation to pilot against your most complex or troubled application. The goal right now is to learn how the tool works, not to solve your hardest problems immediately. Complex applications introduce too many variables. You will not be able to tell whether issues are coming from the tool, the application, or the interaction between them.
Lew describes a current client engagement where they started with about 24 test cases. The purpose was to learn how to take existing test cases, which were written in ways that AI could not interpret well, and convert them into formats that would work. That foundational learning set them up for success when they expanded scope later.
“The purpose of doing this is learning how to take their existing test cases, which are not written in a way that can be interpreted by AI, and being able to transpose these into test cases that can be.” — Phil Lew
Set a defined timeline. Open-ended pilots drift and lose focus. Four to eight weeks is usually enough to get through setup, run the tool, review results, and draw conclusions.
Before you start, establish what success looks like. This sounds obvious, but many pilots proceed without clear criteria, which makes it impossible to know whether they succeeded or failed.
Success criteria should be specific and measurable. “Evaluate whether AI improves our testing” is too vague to be useful. Better criteria might include: test case generation time reduced by a specific percentage, false positive rate below a defined threshold, successful integration with your test management system, or positive feedback from the testers using the tool.
Lew emphasizes three metrics in particular for measuring AI results. First, coverage increase: by what percentage did your test coverage expand because of AI? Second, AI-versus-human ratio: what share of the work is now done by AI compared with humans or traditional automation? Third, precision: the ratio of true positives to all positives flagged, that is, how many of the issues the AI flags turn out to be real.
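The three metrics are simple ratios once the underlying counts are recorded. The following is a minimal sketch, not a prescribed method: the `PilotCounts` fields, the choice of "scenarios" and "tasks" as units, and the example numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PilotCounts:
    """Raw counts gathered during the pilot (all field names are hypothetical)."""
    scenarios_before: int   # scenarios covered before the pilot
    scenarios_after: int    # scenarios covered once AI-generated tests are added
    tasks_by_ai: int        # testing tasks completed by the AI tool
    tasks_total: int        # all testing tasks: AI, human, and traditional automation
    true_positives: int     # AI-flagged issues confirmed as real defects
    false_positives: int    # AI-flagged issues that turned out not to be defects


def coverage_increase_pct(c: PilotCounts) -> float:
    """Percentage growth in covered scenarios relative to the starting point."""
    if c.scenarios_before == 0:
        raise ValueError("need a non-zero baseline to compute an increase")
    return 100.0 * (c.scenarios_after - c.scenarios_before) / c.scenarios_before


def ai_share_pct(c: PilotCounts) -> float:
    """Share of all testing tasks that the AI tool performed."""
    if c.tasks_total == 0:
        raise ValueError("no tasks recorded")
    return 100.0 * c.tasks_by_ai / c.tasks_total


def precision(c: PilotCounts) -> float:
    """True positives divided by everything the AI flagged."""
    flagged = c.true_positives + c.false_positives
    if flagged == 0:
        raise ValueError("the AI flagged nothing, so precision is undefined")
    return c.true_positives / flagged


# Made-up example numbers, not results from any real pilot.
counts = PilotCounts(
    scenarios_before=40, scenarios_after=50,
    tasks_by_ai=30, tasks_total=120,
    true_positives=18, false_positives=6,
)
print(coverage_increase_pct(counts))  # 25.0
print(ai_share_pct(counts))           # 25.0
print(precision(counts))              # 0.75
```

Agreeing on the unit of counting (scenarios, requirements, test cases) before the pilot starts matters more than the arithmetic, because the ratios are only comparable if the unit stays fixed.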
He also suggests tracking a trust metric: what percentage of defects found by AI do you manually re-verify because you do not trust the results? If you are duplicating everything the AI does, you are doubling your work, which defeats the purpose.
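One way to watch the trust metric is to record, for each defect the AI finds, whether someone manually re-verified it, and then compare the share across the pilot. This is a rough sketch under that assumption; the function name, the weekly grouping, and the sample values are hypothetical.

```python
def reverification_rate(reverified_flags):
    """Share of AI-found defects that were manually re-verified.

    reverified_flags: one boolean per AI-found defect,
    True if a tester re-checked it by hand.
    """
    flags = list(reverified_flags)
    if not flags:
        raise ValueError("no AI-found defects recorded")
    return sum(flags) / len(flags)


# Made-up example: early on everything is re-checked, later only some defects are.
weekly = {
    "week 2": [True] * 8,
    "week 6": [True] * 3 + [False] * 5,
}
trend = {week: reverification_rate(flags) for week, flags in weekly.items()}
print(trend)  # {'week 2': 1.0, 'week 6': 0.375}
```

A rate that stays near 1.0 for the whole pilot suggests the team is still doing the work twice.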
Include criteria that address your concerns, not just your hopes. If false positives worry you, define an acceptable rate and measure against it. If you are concerned about integration complexity, define what successful integration looks like.
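Writing the criteria down as explicit thresholds before the pilot starts makes the end-of-pilot review mechanical. The sketch below is one possible shape for that, assuming each criterion is a single number with a target; the criterion names and target values are invented for illustration and are not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    target: float
    higher_is_better: bool  # False for "stay below" criteria such as error rates

    def met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def evaluate(criteria, measured):
    """Return name -> True/False, or None when nothing was measured."""
    results = {}
    for c in criteria:
        value = measured.get(c.name)
        results[c.name] = None if value is None else c.met(value)
    return results


# Hypothetical criteria: one hope (time saved) and one concern (false positives).
criteria = [
    Criterion("generation_time_reduction_pct", target=30.0, higher_is_better=True),
    Criterion("false_positive_rate", target=0.20, higher_is_better=False),
    Criterion("tester_satisfaction_1_to_5", target=3.5, higher_is_better=True),
]

# Made-up measurements; satisfaction was never collected.
measured = {"generation_time_reduction_pct": 35.0, "false_positive_rate": 0.28}
results = evaluate(criteria, measured)
print(results)
```

Returning `None` for an unmeasured criterion, rather than treating it as a pass or a fail, keeps gaps in data collection visible in the final review.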
The first few weeks will be about setup and calibration. Getting the tool connected to your systems, configured for your environment, and tuned to produce useful output takes longer than vendor estimates usually suggest. Build in extra time for troubleshooting.
Once the tool is running, focus on learning rather than proving. Run it against your target application and review the results carefully. For test case generation, look at whether the suggestions are valid, whether they cover meaningful scenarios, and whether they would actually catch defects. For execution tools, examine what gets flagged and whether those flags turn out to be real issues.
Lew warns against being impressed by volume. A tool that generates a large number of test cases is not necessarily useful. What matters is quality and coverage. A hundred login tests with nothing covering your reporting functionality is poor coverage, regardless of how quickly those tests were generated.
Track metrics that align with your success criteria. If you are measuring false positive rate, log each flagged issue and whether it turned out to be a genuine problem. If you are measuring time savings, track how long tasks take with and without AI assistance.
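The logging described here can be as light as one record per flagged issue and one duration per task. The following sketch assumes that structure; the field names and numbers are hypothetical. "False positive rate" is computed as the share of flagged issues that were not genuine, which is one minus precision.

```python
from statistics import mean


def false_positive_rate(flag_log):
    """Share of AI-flagged issues that were not genuine problems."""
    if not flag_log:
        raise ValueError("no flagged issues logged")
    not_genuine = sum(1 for entry in flag_log if not entry["genuine"])
    return not_genuine / len(flag_log)


def time_saved_pct(baseline_minutes, assisted_minutes):
    """Percentage reduction in mean task time with AI assistance."""
    baseline = mean(baseline_minutes)
    assisted = mean(assisted_minutes)
    return 100.0 * (baseline - assisted) / baseline


# Made-up log: each entry is filled in after a tester triages the flag.
flag_log = [
    {"id": "F-1", "genuine": True},
    {"id": "F-2", "genuine": True},
    {"id": "F-3", "genuine": False},
    {"id": "F-4", "genuine": True},
    {"id": "F-5", "genuine": True},
]

# Made-up task durations in minutes, without and with AI assistance.
without_ai = [40, 50, 60]
with_ai = [30, 35, 40]

print(false_positive_rate(flag_log))        # 0.2
print(time_saved_pct(without_ai, with_ai))  # 30.0
```

For the time comparison to mean anything, the tasks timed with and without AI should be of similar kind and difficulty.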
Collect qualitative feedback from your team as well. Are testers finding the tool useful? Where are they running into friction? What works well, and what feels awkward? This kind of feedback often surfaces issues that metrics alone would miss.
Do not declare success too early. Initial results sometimes look better than they are because the tool is handling the easy cases first, or because novelty is generating enthusiasm. Give the pilot enough time for real patterns to emerge.
Expecting replacement instead of supplementation. If you frame the pilot as AI replacing your existing testing, you will be disappointed when AI output requires review, when false positives take time to investigate, and when the tool misses things human testers would catch. The right framing is AI supplementing human work, handling certain tasks faster so people can focus on judgment-intensive activities.
Piloting against production data without thinking about privacy. AI tools may send data to external services for processing. If that data includes personally identifiable information or other sensitive content, you may be creating compliance exposure. Understand where your data goes before you start.
Not involving testers in the pilot design. Pilots imposed from above without input from the people who will use the tool often run into resistance. Testers may feel threatened or have legitimate concerns. Involving them early improves buy-in and produces better feedback.
At the end of the pilot, you should have data and feedback that address your success criteria.
If the pilot met its criteria, that is a good signal, but not a guarantee that broader rollout will go smoothly. Pilots operate under favorable conditions: focused attention, motivated participants, limited scope. Expand with the awareness that you were seeing a best-case scenario.
If the pilot fell short, resist the temptation to explain it away. Sometimes the tool genuinely does not fit your context, or your organization is not ready. Honest interpretation of disappointing results is more valuable than spin.
Mixed results are common. The tool may work well for some things and poorly for others. That is useful information. Maybe you adopt it for specific use cases while continuing to evaluate alternatives for others.
We have run enough AI testing pilots to recognize what works and what does not. The organizations that get the most value approach pilots as genuine learning exercises rather than demonstrations of predetermined conclusions. They define success criteria honestly, involve the right people, and interpret results without bias toward the answer they were hoping for. What we bring to pilot design is experience with the questions that actually matter and the mistakes that tend to derail things.