Mastering AI Prompting for Effective Software Testing

Published on July 30, 2024
- Increasing Productivity and Speed with Generative AI
- LLM Models: Rapid Progress Useful for Testing Activities
- Learning to Prompt for Test Activities
- Results Obtained with the LLM Claude-3.5-Sonnet (July 2024)
- Analysis of the Results Obtained by the AI
- Meta-Prompting: Improving the Prompt by the LLM
- Improved Version of the Prompt
- Conclusion
Increasing Productivity and Speed with Generative AI
Mastering AI Prompting has become indispensable! The expectations for generative AI in software testing primarily focus on gains in productivity and speed. The goal is to perform testing tasks faster and with less effort thanks to AI. These expectations are well illustrated by the annual “World Quality Report 2023”: improvements in productivity and greater speed are highlighted by 65% and 53% of the survey respondents, respectively.

These results also align with the expectations expressed by participants in the training sessions “Accelerate Your Testing Processes with Generative AI” offered by Smartesting and its partners. Through practice, participants acquire the essential skills to productively use generative AI, master AI Prompting, and achieve the expected gains in productivity and speed.
To achieve these productivity and speed gains with generative AI, it is essential to learn how to prompt AI models. That is what we present here, but acquiring the required skills takes practice and evaluation of that practice, as we do during the training.
LLM Models: Rapid Progress Useful for Testing Activities
Generative AI can be used directly through an AI Chatbot via conversational prompts with a Large Language Model (LLM) or for predefined tasks through a testing tool that uses an LLM.
This article focuses on the first mode, i.e., querying an LLM for testing tasks, as this mode is directly accessible via available AI Chatbots and offers great flexibility for testers.
Since the release of ChatGPT by OpenAI in November 2022, the range of LLMs has significantly expanded, and their capabilities are rapidly progressing.
These advancements provide usage opportunities for software testing. Here are a few examples:
- The vision module, present in the main current LLMs such as GPT-4, Claude-3, or Gemini-1.5, enables image analysis, for example of screenshots of the application under test. We can thus pair textual data (user stories, acceptance criteria, bug reports, etc.) with screenshots (of the application, error messages, etc.) to provide more accurate input to the LLM (see the sketch after this list).
- The optimization of LLMs has improved, allowing faster and less costly processing of our queries with equal performance. Claude-3.5-Sonnet and GPT-4o-Mini are examples of recent LLM releases that are efficient, less resource-intensive, and faster in response time than previous versions.
- The increased availability of LLMs under open-source or community licenses, such as Mixtral-8x22B or Llama-3, makes it easier to integrate generative AI on an organization's limited computing resources and avoids costly subscription models.
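To make the multimodal point concrete, here is a minimal sketch of such a query using the Anthropic Python SDK. This is our own illustration rather than part of the original experiment: the model identifier, file name, and prompt text are placeholders, and the same pattern applies to other providers' APIs.

```python
# A minimal sketch of a multimodal testing query: a screenshot of the page
# under test plus the textual User Story in a single request.
# Assumes ANTHROPIC_API_KEY is set; model ID and file name are illustrative.
import base64
import anthropic

client = anthropic.Anthropic()

with open("cart_screenshot.png", "rb") as f:
    screenshot_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64}},
            {"type": "text",
             "text": "<Structured prompt with the User Story and "
                     "acceptance criteria goes here.>"},
        ],
    }],
)
print(response.content[0].text)  # the generated analysis and scenarios
```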
With rapid performance improvements, new features, and enhanced accessibility, it is now up to us to exploit the capabilities of LLMs by mastering the best practices of querying – also known as “Prompt Engineering” or “Prompting.”
Learning to Prompt for Test Activities
Prompting techniques aim to design high-quality queries that guide LLMs to produce accurate and relevant outputs for the task at hand. This involves working on the structure, style, and content of the queries, and optimizing their length for the test task the LLM is to perform. Your results will depend on both the LLM used and the quality of your prompts.
Here are the techniques we review in this article, illustrated with test generation tasks:
- Structured Prompt in 6 Parts: Role, Context, Instructions, Constraints, Format, Data (see the skeleton sketched after this list);
- Breaking Down Instructions into Steps to assist the model;
- Asking the Model to Verify Its Results;
- Meta-Prompting: Asking the model to evaluate, improve, or generate your prompt.
These techniques are simple to use but essential for achieving quality outcomes.
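Before applying the first technique, the 6-part structure can be captured as a reusable skeleton. The following Python sketch is our own illustration: the section markers mirror the prompts shown later in this article, and the filled-in wording is placeholder text.

```python
# Reusable skeleton of the structured prompt: Role, Context, Instructions,
# Constraints, Format, and Data, grouped under the same section markers
# used in the prompts below. All wording here is placeholder text.
PROMPT_TEMPLATE = """\
#Role and Context
{role_and_context}

#Instructions, Format, and Constraints
{instructions}

#Self-Verification of Results
{self_verification}

#Data
{data}
"""

prompt = PROMPT_TEMPLATE.format(
    role_and_context="You are my assistant for writing Gherkin test scenarios.",
    instructions="Determine the equivalence classes, then write the scenarios.",
    self_verification="Check that all acceptance criteria are covered.",
    data="<User Story, acceptance criteria, and screenshot reference here>",
)
print(prompt)
```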
Implementation for Generating Test Cases for a User Story and Acceptance Criteria
Our test subject is the demo application of the open-source Spree Commerce platform, which provides a front-end/back-end structure for developing an online store. The User Story that serves as the basis for our test case generation is the management of the shopping cart, as shown in the screenshot below.

Shopping Cart Screenshot – obtained with the Spree Commerce demo application – https://spree-multi-vendor-demo.herokuapp.com/ – July 2024
Here is the User Story “Managing Cart Items” that we will be testing, along with its acceptance criteria:
As a customer of the e-commerce site,
I want to be able to change the quantity of items in my cart,
So that I can see the prices and the total of the cart update accordingly.
Acceptance Criteria:
- The quantity of an item should be modifiable using the “-” and “+” buttons.
- The update of an item’s quantity should be immediate.
- The total per item should be updated immediately after changing the quantity (unit price * quantity).
- The cart subtotal (sum of item totals) should be updated immediately after any change in the quantity of one or more items.
- Each item line should have a trash icon to allow the item to be removed from the cart.
- Removing an item from the cart should update the cart subtotal immediately.
Here is the structured prompt we are using:
#Role and Context
You are my assistant for writing test scenarios in Gherkin format (Given / When / Then) based on a given User Story and acceptance criteria. A screenshot corresponding to the tested page is also provided.
#Instructions, Format, and Constraints
By analyzing these elements, perform the following tasks:
– Determine Exhaustively the Equivalence Classes to Test: Provide some examples of typical and boundary values for valid and invalid classes. Explain your reasoning.
– Create Test Scenarios in Gherkin Format: Save them in a <feature_name>.feature file. Cover all the given acceptance criteria and use the screenshot to determine the test data.
#Self-Verification of Results
Ensure that all acceptance criteria are fully covered and complete them if they are not.
#Data
<Insert the User Story and acceptance criteria here.>
<Attach the screenshot of the page under test.>
Results Obtained with the LLM Claude-3.5-Sonnet (July 2024)
For this User Story “Managing Cart Items,” the analysis of equivalence classes and the corresponding test scenarios are as follows:
Equivalence Classes
1. Quantity Modification
- Valid classes:
- Positive integers (1, 2, 3, …)
- Typical values: 1, 5, 10
- Boundary values: 1 (minimum), maximum allowed (e.g., 99 or 999, depending on system limits)
- Invalid classes:
- Zero (0)
- Negative integers (-1, -2, …)
- Non-integer values (1.5, 2.3, …)
- Extremely large values (e.g., 1000000, depending on system limits)
2. Price
- Valid classes:
- Positive numbers with two decimal places
- Typical values: $10.00, $50.00, $100.00
- Boundary values: $0.01 (minimum), maximum allowed (e.g., $9999.99, depending on system limits)
- Invalid classes:
- Negative numbers
- Numbers with more than two decimal places
3. Total per Item
- Valid classes:
- Positive numbers with two decimal places
- Calculated as (unit price * quantity)
- Invalid classes:
- Any value that doesn’t match the calculation (unit price * quantity)
4. Subtotal
- Valid classes:
- Positive numbers with two decimal places
- Calculated as the sum of all item totals
- Invalid classes:
- Any value that doesn’t match the sum of all item totals
5. Item Removal
- Valid classes:
- Removing an existing item
- Invalid classes:
- Attempting to remove a non-existent item
Now, let’s create the test scenarios in Gherkin format:
Feature: Cart Item Management
Background:
Given the user is on the shopping cart page
And the shopping cart contains the following items:
| Product | Price | Quantity | Total |
| Checked Shirt | $53.99 | 1 | $53.99 |
| Polo T Shirt | $36.99 | 2 | $73.98 |
And the subtotal is $127.97
Scenario: Increase item quantity
When the user clicks the "+" button for the "Checked Shirt"
Then the quantity for "Checked Shirt" should immediately update to 2
And the total for "Checked Shirt" should immediately update to $107.98
And the subtotal should immediately update to $181.96
Scenario: Decrease item quantity
When the user clicks the "-" button for the "Polo T Shirt"
Then the quantity for "Polo T Shirt" should immediately update to 1
And the total for "Polo T Shirt" should immediately update to $36.99
And the subtotal should immediately update to $90.98
Scenario: Remove item from cart
When the user clicks the trash can icon for the "Checked Shirt"
Then the "Checked Shirt" should be immediately removed from the cart
And the subtotal should immediately update to $73.98
Scenario: Attempt to decrease quantity below 1
Given the quantity of "Checked Shirt" is 1
When the user clicks the "-" button for the "Checked Shirt"
Then the quantity for "Checked Shirt" should remain 1
And the total for "Checked Shirt" should remain $53.99
And the subtotal should remain $127.97
Scenario Outline: Update quantity using multiple clicks
When the user clicks the "<button>" button for the "<product>" <clicks> times
Then the quantity for "<product>" should immediately update to <final_quantity>
And the total for "<product>" should immediately update to $<final_total>
And the subtotal should immediately update to $<final_subtotal>
Examples:
| product | button | clicks | final_quantity | final_total | final_subtotal |
| Checked Shirt | + | 3 | 4 | 215.96 | 289.94 |
| Polo T Shirt | - | 1 | 1 | 36.99 | 90.98 |
| Polo T Shirt | + | 5 | 7 | 258.93 | 312.92 |
Scenario: Verify immediate update of item quantity
When the user clicks the "+" button for the "Checked Shirt"
Then the quantity for "Checked Shirt" should update to 2 without page refresh
Scenario: Verify immediate update of item total
When the user clicks the "+" button for the "Checked Shirt"
Then the total for "Checked Shirt" should update to $107.98 without page refresh
Scenario: Verify immediate update of subtotal
When the user clicks the "+" button for the "Checked Shirt"
Then the subtotal should update to $181.96 without page refresh
Scenario: Verify presence of trash can icon
Then each item in the cart should have a trash can icon for removal
Scenario: Attempt to add invalid quantity
When the user attempts to enter "-1" as the quantity for "Checked Shirt"
Then the quantity for "Checked Shirt" should remain 1
And the total for "Checked Shirt" should remain $53.99
And the subtotal should remain $127.97
Scenario: Verify correct calculation of item total
When the user changes the quantity of "Polo T Shirt" to 3
Then the total for "Polo T Shirt" should be exactly $110.97
Scenario: Verify correct calculation of subtotal
Given the shopping cart contains the following items:
| Product | Price | Quantity | Total |
| Checked Shirt | $53.99 | 2 | $107.98 |
| Polo T Shirt | $36.99 | 3 | $110.97 |
Then the subtotal should be exactly $218.95
Scenario: Remove all items from cart
When the user removes all items from the cart
Then the cart should be empty
And the subtotal should be $0.00
Scenario: Verify maximum quantity limit
When the user attempts to set the quantity of "Checked Shirt" to 1000
Then an error message should be displayed
And the quantity for "Checked Shirt" should remain at the maximum allowed
Scenario: Verify minimum quantity limit
Given the quantity of "Checked Shirt" is 1
When the user attempts to set the quantity of "Checked Shirt" to 0
Then the quantity for "Checked Shirt" should remain 1
And the total for "Checked Shirt" should remain $53.99
And the subtotal should remain $127.97
These additional scenarios cover various aspects of the user story and acceptance criteria, including:
- Multiple quantity updates
- Immediate updates without page refresh
- Presence of trash can icons
- Handling of invalid inputs
- Correct calculation of item totals and subtotal
- Removing all items from the cart
- Testing maximum and minimum quantity limits
Self-verification:
- Modifying quantity using “-” and “+” buttons: Covered in multiple scenarios
- Immediate update of item quantity: Specifically covered in “Verify immediate update of item quantity” scenario
- Immediate update of item total: Specifically covered in “Verify immediate update of item total” scenario
- Immediate update of basket subtotal: Specifically covered in “Verify immediate update of subtotal” scenario
- Presence of trash can icons: Specifically covered in the “Verify presence of trash can icon” scenario
- Immediate subtotal update after item removal: Covered in the “Remove item from cart” and “Remove all items from cart” scenarios
Analysis of the Results Obtained by the AI
Several observations emerge from the results obtained with the structured prompt and the provided data. Below, we analyze various aspects of the quality of the generated tests.
Test Data and Scenario Variability
We can see that adding the screenshot of the page under test (the cart) to the prompt allows the model to derive precise test data that references what is shown on screen. This is highly relevant and makes the variability easy to manage dynamically: swapping in a screenshot that shows different items in the cart is enough to obtain Gherkin scenarios with different test data.
Writing Gherkin Scenarios and Completeness of Acceptance Criteria Coverage
Here are some observations from analyzing the Gherkin scenarios generated by the LLM:
- Coverage of Acceptance Criteria: The coverage of the acceptance criteria is good and in line with what is requested in the prompt. The division of instructions into two parts, one for equivalence classes and the other for generating scenarios, is effective.
- Calculation of Expected Results: Manually verifying the output shows that the expected results are computed correctly in all test scenarios. The “Total” and “Subtotal” values used to establish the verdict are accurate (a few lines of Python, shown after this list, are enough to double-check them).
- Use of the “Background” Clause: The “Background” in Gherkin defines the initial conditions common to all scenarios of a feature, avoiding repetition and improving readability. It sets up the basic test environment, allowing individual scenarios to focus solely on their specific actions and verifications. The Background generated by the LLM relies entirely on the screenshot, so it can be updated dynamically by providing another screenshot for the LLM to analyze in the prompt data.
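Incidentally, this manual verification of the arithmetic can itself be scripted. The following Python sketch is our own review helper (not part of the prompt or of the LLM output); the prices come from the generated Background table.

```python
# Recompute the LLM's expected results from the Background table:
# total per item = unit price * quantity, subtotal = sum of item totals.
from decimal import Decimal

PRICES = {"Checked Shirt": Decimal("53.99"), "Polo T Shirt": Decimal("36.99")}

def line_total(product: str, quantity: int) -> Decimal:
    return PRICES[product] * quantity

def subtotal(quantities: dict) -> Decimal:
    return sum(line_total(p, q) for p, q in quantities.items())

# "Increase item quantity": Checked Shirt goes from 1 to 2.
assert line_total("Checked Shirt", 2) == Decimal("107.98")
assert subtotal({"Checked Shirt": 2, "Polo T Shirt": 2}) == Decimal("181.96")

# "Verify correct calculation of subtotal": 2 shirts and 3 polos.
assert subtotal({"Checked Shirt": 2, "Polo T Shirt": 3}) == Decimal("218.95")
print("All expected results check out.")
```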
The result is a set of correct, precise scenarios directly aligned with the provided data (the User Story with its acceptance criteria, and the screenshot).
This assessment is positive with respect to our productivity and speed expectations: the proposed prompt is generic, and the User Story and screenshot data are easy to obtain. The result is produced in a few seconds by the LLM, and verifying it takes a few minutes. It also allows for iteration with the LLM, for example to request additional scenarios and expand the coverage of the acceptance criteria if desired.
Meta-Prompting: Improving the Prompt by the LLM
Meta-prompting involves asking the LLM for improvements on an initial prompt. This technique saves time by providing suggestions that facilitate refining the prompt through successive trials. The LLM’s suggestions help clarify and specify the instructions, reducing ambiguities and optimizing results by identifying and correcting gaps for more relevant outcomes.
Here is a meta-prompt for our previously discussed Gherkin scenario generation prompt:
#Instructions
Here is a prompt for generating test scenarios in Gherkin format. Analyze this prompt and suggest any possible improvements.
#Prompt to Analyze
<Copy and paste the prompt to be analyzed and improved>
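For testers who query LLMs through an API rather than a chatbot, this meta-prompt is easy to script. Here is a minimal sketch reusing the `client` and `prompt` from the earlier examples; the template text mirrors the meta-prompt above, and the model identifier remains a placeholder.

```python
# Wrap an existing prompt in the meta-prompt template and ask the LLM
# for improvement suggestions (reuses `client` from the earlier sketch).
META_TEMPLATE = """#Instructions
Here is a prompt for generating test scenarios in Gherkin format. \
Analyze this prompt and suggest any possible improvements.

#Prompt to Analyze
{prompt_to_analyze}"""

def improve_prompt(prompt_to_analyze: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": META_TEMPLATE.format(prompt_to_analyze=prompt_to_analyze),
        }],
    )
    return response.content[0].text

suggestions = improve_prompt(prompt)  # `prompt` built with the earlier skeleton
print(suggestions)
```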
With Claude-3.5-Sonnet, here is the result of this meta-prompt:
The prompt provided is generally well-structured and comprehensive. However, there are a few areas where it could be improved to make it even more effective:
1. Clarity on Test Data Source:
- Explicitly state that test data should be derived from both the user story/acceptance criteria and the provided screenshot.
- Encourage the use of real product names, prices, and quantities from the screenshot in the test scenarios.
2. Performance Considerations:
- Add a note about including scenarios that test the responsiveness of updates (e.g., testing with a large number of items or rapid consecutive changes).
3. Error Handling:
- Include a point about creating scenarios for error conditions (e.g., network issues, server errors) and how the system should behave in these cases.
4. Usability Testing:
- Suggest including scenarios that test the user experience, such as keyboard navigation or accessibility features.
5. Edge Cases:
- Explicitly mention the need to consider edge cases, such as maximum allowed quantity or price precision issues.
6. Mobile Responsiveness:
- If applicable, add a note about creating scenarios to test the cart’s behavior on different device sizes.
7. Internationalization:
- If relevant, suggest including scenarios for different currency formats or language settings.
8. Security Considerations:
- Add a point about including scenarios that test for basic security issues, such as input validation.
9. Background Context:
- Suggest including a “Background” section in the Gherkin scenarios to set up the initial state of the cart.
10. Consistency Checks:
- Add a point about verifying consistency between different parts of the UI (e.g., individual item totals matching the subtotal).
11. Output Format:
- Here’s a suggested addition to the prompt:
#Additional Considerations
- Include scenarios for performance testing, error handling, and usability.
- Consider edge cases, mobile responsiveness, and internationalization if applicable.
- Include basic security testing scenarios.
- Use a "Background" section in your Gherkin scenarios to set up the initial cart state.
- Include scenarios that verify consistency across the UI.
- Provide your scenarios in a single Markdown code block.
By incorporating these suggestions, the prompt will guide the creation of more comprehensive and robust test scenarios.
Improved Version of the Prompt
# Role and Context
You are my assistant for writing test scenarios in Gherkin format (Given / When / Then) based on a given User Story, acceptance criteria, and a provided screenshot of the page under test.
# Instructions, Format, and Constraints
By analyzing these elements, perform the following tasks:
1. Determine exhaustively the equivalence classes to be tested, giving examples of typical and boundary values for valid and invalid classes. Explain your reasoning.
2. Create test scenarios in Gherkin format for a <feature_name>.feature file. Cover the given set of acceptance criteria, and use both the screenshot and the user story to determine the test data.
3. Use real product names, prices, and quantities from the screenshot in your test scenarios.
4. Include a "Background" section in your Gherkin scenarios to set up the initial cart state.
5. Provide your scenarios in a single Markdown code block.
# Additional Considerations
- Include scenarios for performance testing (e.g., large number of items, rapid consecutive changes), error handling (e.g., network issues, server errors), and usability (e.g., keyboard navigation, accessibility).
- Consider edge cases such as maximum allowed quantity or price precision issues.
- If applicable, add scenarios for mobile responsiveness and different device sizes.
- If relevant, include scenarios for internationalization (different currency formats or language settings).
- Add basic security testing scenarios, such as input validation.
- Include scenarios that verify consistency across different parts of the UI (e.g., individual item totals matching the subtotal).
# Self-verification
- Check that all acceptance criteria are fully covered and complete them if not.
- Ensure that scenarios use data from both the user story and the screenshot.
- Verify that edge cases, error conditions, and usability aspects are addressed.
# Data
Here's the "Cart item management" User Story we're going to test, with its acceptance criteria:
As an e-commerce customer,
I want to be able to modify the quantity of items in my shopping cart,
In order to see the prices and cart total update accordingly.
Acceptance criteria:
1. It must be possible to modify the quantity of an item using the "-" and "+" buttons.
2. Update of item quantity must be immediate.
3. The total per item must be updated immediately after quantity modification (unit price * quantity).
4. The basket subtotal (sum of totals per item) must be updated immediately after any change in the quantity of one or more items.
5. Each item line must have a trash can icon to allow the item to be removed from the shopping cart.
6. Deleting an item from the basket should update the basket subtotal immediately.
[Insert shopping cart screenshot here]
This improved prompt provides a more comprehensive framework for generating test scenarios. It incorporates all the suggested improvements, including clearer instructions on data sources, additional testing considerations, and specific output formatting requirements. This should result in more thorough and well-structured test scenarios that cover a wider range of use cases and potential issues.
Explanatory Note on the Choice of LLM and the Experiments Conducted:
We chose Claude-3.5-Sonnet because this generative AI model offers good performance for testing activities when combining text and screenshots in a query. The version used is dated June 20, 2024. We ran the various queries through the Smartesting LLM portal, an internal portal with an educational purpose, used for learning AI prompting techniques. The portal provides access to more than 12 LLMs, those most relevant for supporting testing activities.
Conclusion
In this article, we illustrated the importance of learning how to master AI prompting to obtain good results with generative AI for software testing activities. The example we used is a common activity of designing and writing test scenarios in Gherkin format. For other testing tasks, such as analyzing User Stories, optimizing existing test cases, generating automated test scripts, and analyzing bug reports, the same observation applies: to achieve good results and effectively use generative AI, mastering prompting techniques is essential.
Mastering Prompting techniques is accessible to all testers and will be increasingly used as AI Chatbots become available in companies and their usage becomes part of testers’ daily routines to gain productivity and speed.
It is a skill acquired through training and practice. Learning to master AI prompting is one of the goals of the training course “Accelerate Your Testing Processes with Generative AI.” Our training offers 8 workshops on software testing use cases, provides access to 12 different LLMs (under open-source or commercial licenses) for practice, and presents the various prompting techniques useful for software testing, which are studied and experimented with during the workshops. Nearly two-thirds of the training is dedicated to guided practice, allowing participants to acquire the know-how needed to use generative AI properly in software testing.