Machine learning and generative AI are at the heart of our Gravity product development and roadmap. In this post, we outline three recent AI explorations for Gravity and provide concrete examples of the value of AI for E2E testing.
AI for Gravity – A look back at some of our explorations
- E2E test cases not aligned with actual user behavior and usage patterns
- Inadequate criteria for selecting and prioritizing E2E test cases
- Too much time and effort spent fixing broken E2E test scripts
- Inefficiency in maintaining the relevance of our E2E test suite
Gravity is an AI-based tool by nature: the usage traces collected provide the data on which the AI algorithms/models deliver value to the testers.
A word about this usage data:
- It consists of user interactions with the web application in production, fully anonymized. The Gravity collector records no Personally Identifiable Information (PII), ensuring privacy and legal compliance.
- It is collected continuously to monitor changes in usage patterns and flows. A representative sample of traces is enough; it is unnecessary to collect everything.
Observing anonymous usage flows allows us to fulfill every tester’s dream: to ensure that our tests effectively cover the real uses of the application. But it also opens up the possibility of AI supporting and automating time-consuming E2E testing activities, including test design, automation, and maintenance tasks. The three explorations covered in this post are:
- Suggesting E2E test cases from usage traces
- Repairing broken E2E test code
- Detecting functional anomalies on usage pattern changes
Suggesting E2E test cases from usage traces
The goal is to ensure that the E2E tests are representative of the actual usage of the application by its different categories of users. To achieve this, we have defined a metric that characterizes this representativeness: the Usage-centric Coverage (UcC) metric.
The UcC metric is based on calculating usage patterns from session data. The goal is to assess the coverage of usage patterns in relation to their frequency of occurrence. A usage pattern is a sequence of user actions that occur with a certain frequency in user traces. For example, for sessions on an eCommerce site, the user sequence “click on an item, select a quantity, add an item to cart” occurs in 13% of user sessions in our dataset. This sequence is a pattern calculated by our usage pattern extraction algorithm.
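To make the idea of pattern frequency and frequency-weighted coverage concrete, here is a minimal sketch. It is not the Gravity implementation: the exact UcC formula is not given in this post, so the weighting below, the choice to treat a pattern as a contiguous run of actions, and all function names and sample data are assumptions made for illustration.

```python
def contains(session, pattern):
    """True if `pattern` occurs as a contiguous run of actions in `session`."""
    n = len(pattern)
    return any(
        tuple(session[i:i + n]) == tuple(pattern)
        for i in range(len(session) - n + 1)
    )


def pattern_frequency(sessions, pattern):
    """Fraction of sessions in which the pattern occurs at least once."""
    if not sessions:
        return 0.0
    return sum(contains(s, pattern) for s in sessions) / len(sessions)


def usage_coverage(sessions, patterns, tests):
    """Assumed coverage formula: share of pattern frequency exercised by the tests."""
    weights = {p: pattern_frequency(sessions, p) for p in patterns}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    covered = sum(
        w for p, w in weights.items() if any(contains(t, p) for t in tests)
    )
    return covered / total


# Hypothetical sample data: each session and test is a tuple of action names.
sessions = [
    ("home", "click_item", "select_quantity", "add_to_cart"),
    ("home", "search", "click_item"),
    ("home", "click_item", "select_quantity", "add_to_cart", "checkout"),
    ("home", "search"),
]
patterns = [
    ("click_item", "select_quantity", "add_to_cart"),
    ("home", "search"),
]
tests = [("home", "click_item", "select_quantity", "add_to_cart")]

print(pattern_frequency(sessions, patterns[0]))
print(usage_coverage(sessions, patterns, tests))
```

In this toy dataset the single test exercises the add-to-cart pattern but not the search pattern, so the coverage score reflects how often the uncovered pattern occurs rather than simply counting patterns.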
To suggest the test cases, we conducted experiments using various AI models and algorithms. These included classic machine learning algorithms and deep learning algorithms based on neural networks. One notable example of the latter is the Transformer architecture, which underlies large language models like GPT-4.
Clustering and selecting candidate sessions guided by the UcC metric
We are currently working on an implementation based on clustering and selecting candidate sessions guided by the UcC metric. The implementation choices are linked to several parameters:
- The quality of the results: we are building a test proposal that guarantees over 90% coverage of the usage patterns by the proposed test cases on our evaluation datasets.
- The execution time of the machine learning algorithm: this is an important factor, because the suggestion will be invoked during development iterations to update the E2E test cases and keep the tests continuously relevant. Currently, the algorithm refines suggestions in less than a minute on representative datasets.
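One way to picture metric-guided selection is a greedy loop that keeps adding the candidate session that covers the most not-yet-covered pattern weight, until a target coverage is reached. The sketch below is a simplified stand-in, not the Gravity algorithm: it skips the clustering step entirely, and the function names, the weights, and the 0.9 target used as a parameter are assumptions for illustration.

```python
def contains(session, pattern):
    """True if `pattern` occurs as a contiguous run of actions in `session`."""
    n = len(pattern)
    return any(
        tuple(session[i:i + n]) == tuple(pattern)
        for i in range(len(session) - n + 1)
    )


def select_sessions(candidates, pattern_weights, target=0.9):
    """Greedily pick sessions until weighted pattern coverage reaches `target`.

    Returns the chosen sessions and the coverage they achieve.
    """
    total = sum(pattern_weights.values())
    if total == 0:
        return [], 0.0
    uncovered = dict(pattern_weights)
    chosen = []
    covered_weight = 0.0
    while uncovered and covered_weight / total < target:
        # Find the candidate that adds the most uncovered pattern weight.
        best, best_gain = None, 0.0
        for session in candidates:
            gain = sum(w for p, w in uncovered.items() if contains(session, p))
            if gain > best_gain:
                best, best_gain = session, gain
        if best is None:
            break  # no remaining candidate covers anything new
        chosen.append(best)
        covered_weight += best_gain
        uncovered = {p: w for p, w in uncovered.items() if not contains(best, p)}
    return chosen, covered_weight / total


# Hypothetical candidates and pattern frequencies (weights).
search_session = ("home", "search", "click_item")
cart_session = ("home", "click_item", "select_quantity", "add_to_cart")
login_session = ("home", "login")
candidates = [search_session, cart_session, login_session]
pattern_weights = {
    ("click_item", "select_quantity", "add_to_cart"): 0.13,
    ("home", "search"): 0.40,
    ("home", "login"): 0.05,
}

chosen, coverage = select_sessions(candidates, pattern_weights, target=0.9)
print(chosen)
print(round(coverage, 3))
```

Each chosen session is a candidate E2E test scenario. A greedy loop like this is cheap to rerun at every iteration, which matters given the execution-time constraint described above.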
These results are a big step forward: within Gravity, we can use AI to continuously help the Agile team and testers maintain the relevance of their E2E test cases by analyzing usage data and suggesting tests based on the UcC metric.
In this work, we have learned the following crucial lessons about the success of AI for testing: clean, representative data (provided by the Gravity collector) and a guiding metric to evaluate AI algorithm results (our UcC metric) are essential. Gravity fulfills these conditions for test case suggestions.
Repairing broken E2E test scripts
In this exploration of repairing E2E test scripts, we focused on repairing broken GUI object selectors. Selectors in E2E test scripts break for several reasons. Selectors that depend on language, time, or description can change and cause tests to fail. Selectors that rely on generated or cloned framework IDs can also cause issues, as these IDs can differ for every user. Unspecific selectors that return more than one DOM element can lead to test failures. Finally, changes to the application, such as new features or UI changes, can leave test suites outdated and degrade them over time.
Our AI agent
Our agent doesn’t just analyze your test cases; it runs the tests to check and validate the modified code. The process begins with the agent replaying the test steps that still succeed. This step is crucial as it puts the application in the state before the breakage occurs. Once the application is in this state, our AI agent swings into action. It generates candidate fixes and evaluates them by trying to complete the test case. If a fix allows the test case to be replayed entirely, we have a winner! The agent then submits an updated test version, effectively repairing the broken selector.
This method represents a significant shift in approaching E2E test automation. By allowing for dynamic interaction with the application and real-time evaluation of potential fixes, our repair agent finds the solution step by step. It validates this solution by running the test and returns the corrected code.
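The replay, propose, and validate loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the agent's actual code: `run_steps` and `propose_fixes` are hypothetical callbacks standing in for a real test runner and for the AI component that generates candidate selectors, and the sketch replays the whole test from a fresh state for each candidate rather than resuming from the state just before the breakage.

```python
from typing import Callable, Iterable, List, Optional


def repair_test(
    steps: List[str],
    run_steps: Callable[[List[str]], int],
    propose_fixes: Callable[[str], Iterable[str]],
) -> Optional[List[str]]:
    """Return a repaired copy of `steps`, or None if no candidate works.

    `run_steps` replays the steps and returns the index of the first failing
    step, or len(steps) when every step passes.
    """
    failed_at = run_steps(steps)
    if failed_at == len(steps):
        return list(steps)  # nothing to repair
    for candidate in propose_fixes(steps[failed_at]):
        patched = steps[:failed_at] + [candidate] + steps[failed_at + 1:]
        # A candidate is accepted only if the whole test replays successfully.
        if run_steps(patched) == len(patched):
            return patched
    return None


# Hypothetical stand-ins: the "application" is the set of selectors that resolve.
valid_selectors = {"#login", "#user", "button[data-test=submit]"}


def fake_run_steps(steps: List[str]) -> int:
    for index, selector in enumerate(steps):
        if selector not in valid_selectors:
            return index
    return len(steps)


def fake_propose_fixes(broken_selector: str) -> List[str]:
    return ["button.submit", "button[data-test=submit]"]


broken_test = ["#login", "#user", "#btn-4f2a"]
repaired = repair_test(broken_test, fake_run_steps, fake_propose_fixes)
print(repaired)
```

Accepting a fix only when the full test passes is what separates this approach from static analysis of the script: a candidate that merely resolves to some element but breaks a later step is rejected.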
We focused our tests on repairing a broken selector. The next stage will be to extend the AI agent to data repair, learning from the context of the test runs carried out, and inventing new data when necessary.
Detecting functional anomalies through analysis of usage patterns
Gravity extracts usage patterns from anonymized user action sequences. As we saw earlier, this data is used to generate and maintain relevant tests. But the evolution of these usage patterns, for example, after a release, can reveal newly-introduced defects in your product.
Let’s take an example. During the latest release of your product, a change is made to the GUI. Analysis of the usage patterns shows that in 12% of the anonymized user sessions, a new pattern appeared. This pattern performs a usual task but doubles the user actions. Why is this? You will notice that this 12% of users have one characteristic in common: a small screen, which hides the form validation button. They have found a workaround, but your product has degraded the User Experience.
This detection is automated by data mining and the AI algorithms we have explored:
- calculating usage patterns and their representativeness from anonymized usage data
- clustering of usage sessions over a period (e.g., weekly) to determine usage classes
- comparison of usage metrics (patterns, clusters) by period to highlight aberrations, likely to indicate defects in the application
Pattern Mining and Machine Learning
The AI algorithms used here combine Pattern Mining (a category of Data Mining techniques that apply to sequences of actions), and Machine Learning (in this case, a usage session clustering algorithm). Detection thresholds can be adjusted to pinpoint potential defects. The algorithms carry out these data analysis tasks, but at the end of the day, it is the human who will confirm whether there is a defect to be corrected.
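The period-over-period comparison with an adjustable threshold can be illustrated with a small sketch. It covers only the pattern-frequency comparison, not session clustering, and it is not the Gravity implementation: the function name, the threshold value, and the sample frequencies are assumptions for illustration.

```python
def pattern_shifts(before, after, threshold=0.05):
    """Return patterns whose share of sessions changed by at least `threshold`.

    `before` and `after` map each pattern to the fraction of sessions that
    contain it in the respective period. Patterns missing from a period are
    treated as having frequency 0.
    """
    shifts = {}
    for pattern in set(before) | set(after):
        delta = after.get(pattern, 0.0) - before.get(pattern, 0.0)
        if abs(delta) >= threshold:
            shifts[pattern] = delta
    return shifts


# Hypothetical frequencies before and after a release.
before_release = {
    ("fill_form", "submit"): 0.30,
    ("open_menu", "logout"): 0.10,
}
after_release = {
    ("fill_form", "submit"): 0.18,
    ("fill_form", "scroll", "zoom_out", "submit"): 0.12,
    ("open_menu", "logout"): 0.11,
}

flagged = pattern_shifts(before_release, after_release, threshold=0.05)
for pattern, delta in flagged.items():
    print(pattern, f"{delta:+.2f}")
```

Here the usual form submission drops while a longer variant of the same task appears, which is the kind of signal a tester would then investigate. Raising or lowering `threshold` trades missed anomalies against noise.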
This is our vision at Smartesting: to develop Machine Learning-based services that help test managers, testers, and test automation engineers to move faster, reduce rework, and retain control of the testing process.
This blog article presents three recent explorations by Smartesting’s AI team, aimed at strengthening the Gravity product. If this interests you, and you’d like to improve your end-to-end testing, contact us so you can explore Gravity and our AI features in more detail.