ui-test-automation-agent

AI-Powered UI Test Automation Agent

This project is a Java-based agent that leverages Generative AI models and Retrieval-Augmented Generation (RAG) to execute test cases written in natural language at the graphical user interface (GUI) level. It understands explicit test case instructions (both actions and verifications), performs the corresponding actions using its tools (such as the mouse and keyboard), locates the required UI elements on the screen (if needed), and verifies whether the actual results correspond to the expected ones using computer vision capabilities.

Here is the corresponding article on Medium: AI Agent That’s Rethinking UI Test Automation

This agent can be part of any distributed testing framework that uses the A2A protocol for communication between agents. An example of such a framework is the Agentic QA Framework. This agent has been tested as part of that framework by executing a sample test case inside Google Cloud.

Key Features

Test Case Execution Workflow

The test execution process, orchestrated by the Agent class, follows these steps:

  1. Test Case Processing: The agent loads the test case defined in a JSON file (e.g., this one). This file contains the overall test case name, optional preconditions (a natural language description of the required state before execution), and a list of TestSteps. Each TestStep includes a stepDescription (a natural language instruction), optional testData (inputs for the step), and expectedResults (a natural language description of the expected state after the step). A minimal sketch of this structure is shown after this list.
  2. Precondition Verification: If preconditions are defined, the agent verifies them against the current UI state using a vision model. If preconditions are not met, the test case execution fails.
  3. Test Case Execution Plan Generation: The agent generates a TestCaseExecutionPlan using an instruction model, outlining the specific tool calls and arguments for each TestStep.
  4. Step Iteration: The agent iterates through each TestStep sequentially, executing the planned tool calls.
  5. Action Processing (for each Action step):
    • Tool Execution: The appropriate tool method is invoked with the arguments provided by the execution plan.
    • Element Location (if required by the tool): If the requested tool needs to interact with a specific UI element (e.g., clicking an element), the element is located using the ElementLocator class based on the element’s description (provided as a parameter for the tool). (See “UI Element Location Workflow” below for details).
    • Retry/Rerun Logic: If a tool execution reports an error for which a retry makes sense (e.g., an element was not found on the screen), the agent retries the execution after a short delay, up to a configured timeout (test.step.execution.retry.timeout.millis). If the error persists after the deadline, the test case execution is marked as ERROR. (A sketch of this retry loop is shown after this list.)
  6. Verification Processing (for each Verification step):
    • Delay: A short delay (action.verification.delay.millis) is introduced to allow the UI state to change after the preceding action.
    • Screenshot: A screenshot of the current screen is taken.
    • Vision Model Interaction: A verification prompt containing the expected results description and the current screenshot is sent to the configured vision AI model. The model analyzes the screenshot and compares it against the expected results description.
    • Result Parsing: The model’s response contains information indicating whether the verification passed, and extended information with the justification for the result.
    • Retry Logic: If the verification fails, the agent retries the verification process after a short interval (test.step.execution.retry.interval.millis) until a timeout (verification.retry.timeout.millis) is reached. If it still fails after the deadline, the test case execution is marked as FAILED.
  7. Completion/Termination: Execution continues until all steps are processed successfully or an interruption (error, verification failure, user termination) occurs. The final TestExecutionResult (including TestExecutionStatus and detailed TestStepResult for each step) is returned.
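
For illustration, the test case structure described in step 1 might look like the following minimal sketch (the exact JSON key names are assumptions; see the sample file linked above for the authoritative format):

    {
      "name": "Create a new note",
      "preconditions": "The notes application is open on its main screen",
      "testSteps": [
        {
          "stepDescription": "Type the note text into the input field and press Enter",
          "testData": ["My first note"],
          "expectedResults": "The note 'My first note' appears in the notes list"
        }
      ]
    }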
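
The retry behavior described in steps 5 and 6 is essentially a deadline loop. Here is a minimal sketch in Java; everything except the documented configuration keys is hypothetical:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.BooleanSupplier;

    // Hypothetical sketch of the retry-until-deadline logic used for both tool
    // executions and verifications. The timeout would come from
    // test.step.execution.retry.timeout.millis (or verification.retry.timeout.millis),
    // the interval from test.step.execution.retry.interval.millis.
    public class RetryLoopSketch {
        public static boolean runWithRetry(BooleanSupplier attempt,
                                           long timeoutMillis,
                                           long intervalMillis) throws InterruptedException {
            Instant deadline = Instant.now().plus(Duration.ofMillis(timeoutMillis));
            while (true) {
                if (attempt.getAsBoolean()) {
                    return true; // attempt succeeded
                }
                if (Instant.now().isAfter(deadline)) {
                    return false; // deadline exceeded -> step ends in ERROR/FAILED
                }
                Thread.sleep(intervalMillis); // short delay before the next attempt
            }
        }
    }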

UI Element Location Workflow

The ElementLocator class is responsible for finding the coordinates of a target UI element based on its natural language description provided by the instruction model during an action step. This involves a combination of RAG, computer vision, analysis, and potentially user interaction (if run in attended mode):

  1. RAG Retrieval: The provided UI element description is used to query the vector database: the top N (retriever.top.n) most semantically similar UiElement records are retrieved based on their stored names, using embeddings generated by the all-MiniLM-L6-v2 model. Results are filtered against the configured minimum similarity scores (element.retrieval.min.target.score for high confidence, element.retrieval.min.general.score for potential matches) and against element.retrieval.min.page.relevance.score for relevance to the current page. (A filtering sketch is shown after this list.)
  2. Handling Retrieval Results:
    • High-Confidence Match(es) Found: If one or more elements exceed the MIN_TARGET_RETRIEVAL_SCORE and/or MIN_PAGE_RELEVANCE_SCORE:
      • Hybrid Visual Matching:
        • A vision model is used to identify potential bounding boxes for UI elements that visually resemble the target element on the current screen.
        • Concurrently, OpenCV’s ORB and Template Matching algorithms are used to find additional visual matches of the element’s stored screenshot on the current screen (a template matching sketch is shown after this list).
        • The results from both the vision model and algorithmic matching are combined and analyzed to find common or best-fitting bounding boxes.
      • Disambiguation (if needed): If multiple candidate bounding boxes are found, the vision model is employed to select the single best match that corresponds to the target element’s description and the description of surrounding elements (anchors), based on a screenshot showing all candidate bounding boxes highlighted with distinctly colored labels.
    • Low-Confidence/No Match(es) Found: If no elements meet the MIN_TARGET_RETRIEVAL_SCORE or MIN_PAGE_RELEVANCE_SCORE, but some meet the MIN_GENERAL_RETRIEVAL_SCORE:
      • Attended Mode: The agent displays a popup showing a list of the low-scoring potential UI element candidates. The user can choose to:
        • Update one of the candidates by refining its name, description, anchors, or page summary and save the updated information to the vector DB.
        • Delete a deprecated element from the vector DB.
        • Create New Element (see below).
        • Retry Search (useful if elements were manually updated).
        • Terminate the test execution (e.g., due to an AUT bug).
      • Unattended Mode: The location process fails.
    • No Matches Found: If no elements meet even the MIN_GENERAL_RETRIEVAL_SCORE:
      • Attended Mode: The user is guided through the new element creation flow:
        1. The user draws a bounding box around the target element on a full-screen capture.
        2. The captured element screenshot and its description are sent to the vision model to generate a suggested detailed name, self-description, surrounding elements (anchors) description, and page summary.
        3. The user reviews and confirms/edits the information suggested by the model.
        4. The new UiElement record (with UUID, name, descriptions, page summary, screenshot) is stored into the vector DB.
      • Unattended Mode: The location process fails.
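
To make the scoring thresholds in step 1 above concrete, the filtering logic can be sketched roughly as follows (all class and field names are hypothetical; only the configuration keys come from this document):

    import java.util.List;

    // Hypothetical sketch of score-based filtering of retrieved UiElement candidates.
    public class RetrievalFilterSketch {
        record ScoredElement(String name, double similarityScore, double pageRelevanceScore) {}

        // minTargetScore ~ element.retrieval.min.target.score
        // minPageScore   ~ element.retrieval.min.page.relevance.score
        static List<ScoredElement> highConfidenceMatches(List<ScoredElement> topN,
                                                         double minTargetScore,
                                                         double minPageScore) {
            return topN.stream()
                    .filter(e -> e.similarityScore() >= minTargetScore
                            || e.pageRelevanceScore() >= minPageScore)
                    .toList();
        }

        // minGeneralScore ~ element.retrieval.min.general.score
        static List<ScoredElement> potentialMatches(List<ScoredElement> topN,
                                                    double minGeneralScore) {
            return topN.stream()
                    .filter(e -> e.similarityScore() >= minGeneralScore)
                    .toList();
        }
    }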
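
The algorithmic half of the hybrid visual matching in step 2 can be illustrated with OpenCV’s Java bindings. A minimal template matching sketch, assuming the native OpenCV library is already loaded and using an illustrative confidence threshold (not the project’s actual value):

    import org.opencv.core.Core;
    import org.opencv.core.Mat;
    import org.opencv.core.Rect;
    import org.opencv.core.Size;
    import org.opencv.imgcodecs.Imgcodecs;
    import org.opencv.imgproc.Imgproc;

    // Minimal sketch: locate the element's stored screenshot (the "template")
    // on the current screen capture via normalized cross-correlation.
    public class TemplateMatchSketch {
        public static Rect findBestMatch(String screenPath, String templatePath) {
            Mat screen = Imgcodecs.imread(screenPath);
            Mat template = Imgcodecs.imread(templatePath);
            Mat result = new Mat();
            Imgproc.matchTemplate(screen, template, result, Imgproc.TM_CCOEFF_NORMED);
            Core.MinMaxLocResult mmr = Core.minMaxLoc(result);
            if (mmr.maxVal < 0.8) { // illustrative threshold
                return null; // no sufficiently confident match
            }
            return new Rect(mmr.maxLoc, new Size(template.cols(), template.rows()));
        }
    }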

Setup Instructions

Prerequisites

Maven Setup

This project uses Maven for dependency management and building.

  1. Clone the Repository:
    git clone <repository_url>
    cd <project_directory>
    
  2. Build the Project:
    mvn clean package
    

    This command downloads dependencies, compiles the code, runs tests (if any), and packages the application into a standalone JAR file in the target/ directory.

Vector DB Setup

Instructions for setting up Chroma DB, currently the only supported vector database, can be found on its official website.

Configuration

Configure the agent by editing the config.properties file or by setting environment variables. Environment variables override properties file settings.

Key Configuration Properties:
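
The authoritative list lives in config.properties itself; the keys referenced throughout this README include the following (placeholder values shown, not defaults):

    # Retry and timing behavior
    test.step.execution.retry.timeout.millis=<millis>
    test.step.execution.retry.interval.millis=<millis>
    verification.retry.timeout.millis=<millis>
    action.verification.delay.millis=<millis>

    # UI element retrieval (RAG)
    retriever.top.n=<N>
    element.retrieval.min.target.score=<score>
    element.retrieval.min.general.score=<score>
    element.retrieval.min.page.relevance.score=<score>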

How to Run

Standalone Mode

Runs a single test case defined in a JSON file.

  1. Ensure the project is built (mvn clean package).
  2. Create a JSON file containing the test case (see this one for an example).
  3. Run the Agent class directly using Maven Exec Plugin (add configuration to pom.xml if needed):
    mvn exec:java -Dexec.mainClass="org.tarik.ta.Agent" -Dexec.args="<path/to/your/testcase.json>"
    

    Or run the packaged JAR:

    java -jar target/<your-jar-name.jar> <path/to/your/testcase.json>
    

Server Mode

Starts a web server that listens for test case execution requests.

  1. Ensure the project is built.
  2. Run the Server class using Maven Exec Plugin:
    mvn exec:java -Dexec.mainClass="org.tarik.ta.Server"
    

    Or run the packaged JAR:

    java -jar target/<your-jar-name.jar>
    
  3. The server will start listening on the configured port (default 8005).
  4. Send a POST request to the root endpoint (/) with the test case JSON in the request body.
  5. The server will respond immediately with 200 OK if it accepts the request (i.e., not already running a test case) or 429 Too Many Requests if it’s busy. The test case execution runs asynchronously.
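
    For example, assuming the default port and a local test case file named testcase.json (both illustrative):

    curl -X POST http://localhost:8005/ -H "Content-Type: application/json" -d @testcase.json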

Deployment

This section provides detailed instructions for deploying the UI Test Automation Agent, both to Google Cloud Platform (GCP) and locally using Docker.

Cloud Deployment (Google Compute Engine)

The agent can be deployed as a containerized application on a Google Compute Engine (GCE) virtual machine, providing a robust and scalable environment for automated UI testing. Because the agent needs at least two exposed ports (one for communicating with other agents and one for the noVNC connection), the financially more efficient Google Cloud Run is not an option. Using Spot VMs, however, is a viable way to reduce costs.

Prerequisites for Cloud Deployment

Deploying Chroma DB (Vector Database)

The agent relies on a vector database; Chroma DB is currently the only supported option. You can deploy Chroma DB to Google Cloud Run using the provided cloudbuild_chroma.yaml configuration.

  1. Configure cloudbuild_chroma.yaml:
    • Update _CHROMA_BUCKET with the name of a Google Cloud Storage bucket where Chroma DB will store its data.
    • Update _CHROMA_DATA_PATH if you want a specific path within the bucket.
    • Update _PORT if you want Chroma DB to run on a different port (default is 8000).
  2. Deploy using Cloud Build:
    gcloud builds submit . --config deployment/cloudbuild_chroma.yaml --substitutions=_CHROMA_BUCKET=<your-chroma-bucket-name>,_CHROMA_DATA_PATH=chroma,_PORT=8000 --project=<your-gcp-project-id>
    

    After deployment, note the URL of the deployed Chroma DB service; this will be your VECTOR_DB_URL, which you need to set as a secret.
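
    For example, the secret could be created via the gcloud CLI (assuming Secret Manager is enabled and your deployment expects a secret with this name):

    echo -n "<your-chroma-service-url>" | gcloud secrets create VECTOR_DB_URL --data-file=- --project=<your-gcp-project-id>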

Building and Deploying the Agent on GCE

  1. Navigate to the project root:
    cd <project_root_directory>
    
  2. Adapt the deployment script: The deployment/cloud/deploy_gce.sh script has some predefined values that need to be adapted (e.g., network name, exposed ports). If you want to use the agent as part of an already existing network (e.g., together with the Agentic QA Framework), carefully adapt all parameters so that you do not destroy any existing settings.
  3. Execute the deployment script:
    ./deployment/cloud/deploy_gce.sh
    

    This script will:

    • Enable necessary GCP services.
    • Build the agent application using Maven.
    • Build the Docker image for the agent using deployment/cloud/Dockerfile.cloudrun and push it to Google Container Registry.
    • Set up VPC network and firewall rules (if they don’t exist).
    • Create a GCE Spot VM instance.
    • Start the agent container inside the created VM using a corresponding startup script.

    Note: The script uses default values for region, zone, instance name, etc. You can override these by adjusting your gcloud CLI configuration.

Accessing the Deployed Agent

Local Docker Deployment

For local development and testing, you can run the agent within a Docker container on your machine.

Prerequisites for Local Docker Deployment

Building and Running the Docker Image

The build_and_run_docker.bat script (for Windows) simplifies the process of building the Docker image and running the container.

  1. Build the project: Maven must be used for this; be sure to activate the Maven profiles “server” and “linux” for the build (e.g., mvn clean package -P server,linux).
  2. Adapt deployment/local/Dockerfile:
    • IMPORTANT: Before running the script, open deployment/local/Dockerfile and replace the placeholder VNC_PW environment variable with a strong password of your choice. For example:
      ENV VNC_PW="your_strong_vnc_password"
      

      (Note: The build_and_run_docker.bat script also sets VNC_PW to 123456 for convenience, but it’s recommended to set it directly in the Dockerfile for consistency and security.)

  3. Execute the batch script:
    deployment\local\build_and_run_docker.bat
    

    This script will:

    • Build the Docker image named ui-test-automation-agent using deployment/local/Dockerfile.
    • Stop and remove any existing container named ui-agent.
    • Run a new Docker container, mapping ports 5901 (VNC), 6901 (noVNC), and 8005 (agent server) to your local machine.
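
If you prefer to run the container manually instead of using the batch script, an equivalent docker run invocation could look roughly like this (image and container names follow the script’s conventions described above):

    docker run -d --name ui-agent -p 5901:5901 -p 6901:6901 -p 8005:8005 ui-test-automation-agent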

Accessing the Local Agent

Remember to use the VNC password you set in the Dockerfile when prompted.

Contributing

Please refer to the CONTRIBUTING.md file for guidelines on contributing to this project.

TODOs

Final Notes