This project is a Java-based agent that leverages Generative AI models and Retrieval-Augmented Generation (RAG) to execute test cases written in natural language at the graphical user interface (GUI) level. It understands explicit test case instructions (both actions and verifications), performs the corresponding actions using its tools (such as the mouse and keyboard), locates the required UI elements on the screen (if needed), and verifies whether the actual results correspond to the expected ones using computer vision capabilities.
Here is the corresponding article on Medium: *AI Agent That's Rethinking UI Test Automation*.
This agent can be part of any distributed testing framework that uses the A2A protocol for communication between agents. An example of such a framework is the Agentic QA Framework. This agent has been tested as part of that framework, executing a sample test case inside Google Cloud.
The agent's key features:

- **Model configuration**: Configurable via `config.properties` and `AgentConfig.java`, allowing specification of providers, model names (`instruction.model.name`, `vision.model.name`), API keys/tokens, endpoints, and generation parameters (temperature, topP, max output tokens, retries). Logging of model interactions (`model.logging.enabled`) and output of the model's thinking process (`thinking.output.enabled`) can be toggled as well.
- **Vector database**: UI element data is stored in a vector database; Chroma is currently the only supported provider (`AgentConfig.getVectorDbProvider` -> `chroma`), configured via `vector.db.url` in `config.properties`. Elements are persisted as `UiElement` records, which include a name, a self-description, a description of the surrounding elements (anchors), a page summary, and a screenshot (`UiElement.Screenshot`).
- **Retrieval**: The top N (`retriever.top.n` in config) most relevant UI elements are retrieved based on semantic similarity between the query (derived from the test step action) and the stored element names. Minimum similarity scores (`element.retrieval.min.target.score`, `element.retrieval.min.general.score`, `element.retrieval.min.page.relevance.score` in config) are used to filter results for target element identification and potential refinement suggestions.
- **Vision**: A vision-capable model (`ModelFactory.getVisionModel`) is used to locate and disambiguate UI elements on the current screen and to verify expected results. In addition, OpenCV (`org.bytedeco.opencv`) is used for visual pattern matching (ORB and Template Matching) to find occurrences of an element's stored screenshot on the current screen (see the sketch after this list). `ElementLocator` combines the results from the vision model and the algorithmic matching, considering intersections and relevance, to determine the best match.
- **Action execution**: Mouse and keyboard actions are performed via Java's `Robot` class.
- **Execution modes**: Controlled by the `unattended.mode` flag in `config.properties`.
  - **Attended mode** (`unattended.mode=false`): Designed for initial test case runs, or for debugging and fixing when execution in unattended mode fails. In this mode the agent behaves as a trainee who needs assistance from a human tutor/mentor in order to gather all the information required for the later unattended (unsupervised) execution of the test case.
  - **Unattended mode** (`unattended.mode=true`): The agent executes the test case without any human assistance. It relies entirely on the information stored in the RAG database and the AI models' ability to interpret instructions and locate elements based on the stored data. Errors during element location or verification will cause the execution to fail. This mode is suitable for integration into CI/CD pipelines.
- **Server mode**: Exposes a REST endpoint (`/`); the port is configured via `port` in `config.properties`. The server accepts only one test case execution at a time (the agent has been designed as a static utility for simplicity). Upon receiving a valid request while idle, it returns `200 OK` and starts the test case execution; if busy, it returns `429 Too Many Requests`.
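For illustration, the Template Matching part of the visual search can be sketched with the bytedeco OpenCV bindings roughly as follows. This is a minimal, self-contained example and not the project's actual `ElementLocator` code; the file names are placeholders, and only the single best match is extracted here:

```java
import org.bytedeco.javacpp.DoublePointer;
import org.bytedeco.opencv.opencv_core.Mat;
import org.bytedeco.opencv.opencv_core.Point;

import static org.bytedeco.opencv.global.opencv_core.minMaxLoc;
import static org.bytedeco.opencv.global.opencv_imgcodecs.imread;
import static org.bytedeco.opencv.global.opencv_imgproc.TM_CCOEFF_NORMED;
import static org.bytedeco.opencv.global.opencv_imgproc.matchTemplate;

public class TemplateMatchSketch {
    public static void main(String[] args) {
        // Current screen capture and the stored element screenshot (placeholder paths)
        Mat screen = imread("screen.png");
        Mat element = imread("element.png");

        // Slide the element screenshot over the screen and score every position
        Mat scores = new Mat();
        matchTemplate(screen, element, scores, TM_CCOEFF_NORMED);

        // Extract the best-scoring location
        DoublePointer minVal = new DoublePointer(1);
        DoublePointer maxVal = new DoublePointer(1);
        Point minLoc = new Point();
        Point maxLoc = new Point();
        minMaxLoc(scores, minVal, maxVal, minLoc, maxLoc, new Mat());

        // Accept the match only above a similarity threshold, analogous to
        // element.locator.visual.similarity.threshold (default 0.8)
        if (maxVal.get() >= 0.8) {
            System.out.printf("Best match at (%d, %d), score %.2f%n",
                    maxLoc.x(), maxLoc.y(), maxVal.get());
        } else {
            System.out.println("No sufficiently similar match found");
        }
    }
}
```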
The test execution process, orchestrated by the `Agent` class, follows these steps:
1. **Parsing**: The incoming test case contains `preconditions` (a natural language description of the required state before execution) and a list of `TestStep`s. Each `TestStep` includes a `stepDescription` (a natural language instruction), optional `testData` (inputs for the step), and `expectedResults` (a natural language description of the expected state after the step); see the example after this list.
2. **Planning**: A `TestCaseExecutionPlan` is generated using an instruction model, outlining the specific tool calls and arguments for each `TestStep`.
3. **Execution**: Each `TestStep` is processed sequentially, executing the planned tool calls.
4. **Action retries**: A failed action is retried until a deadline (`test.step.execution.retry.timeout.millis`) is reached. If the error persists after the deadline, the test case execution is marked as `ERROR`.
5. **Verification delay**: Before each verification, a delay (`action.verification.delay.millis`) is introduced to allow the UI state to change after the preceding action.
6. **Verification retries**: A failed verification is retried at a fixed interval (`test.step.execution.retry.interval.millis`) until a timeout (`verification.retry.timeout.millis`) is reached. If it still fails after the deadline, the test case execution is marked as `FAILED`.
7. **Result**: A `TestExecutionResult` (including the `TestExecutionStatus` and a detailed `TestStepResult` for each step) is returned.
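As an illustration, a test case file might look roughly like the following. The step fields match the record fields described in step 1; the top-level field names (`name`, `testSteps`) and the shape of `testData` are assumptions and should be checked against the project's model classes:

```json
{
  "name": "Create a new note",
  "preconditions": "The Notes application is open and its main window is visible",
  "testSteps": [
    {
      "stepDescription": "Click the 'New Note' button",
      "testData": [],
      "expectedResults": "An empty note editor is displayed"
    },
    {
      "stepDescription": "Type the provided text into the note editor",
      "testData": ["Hello, world"],
      "expectedResults": "The note editor contains the entered text"
    }
  ]
}
```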
The `ElementLocator` class is responsible for finding the coordinates of a target UI element based on its natural language description provided by the instruction model during an action step. This involves a combination of RAG, computer vision, analysis, and potentially user interaction (when running in attended mode).
First, the top N (`retriever.top.n`) most semantically similar `UiElement` records are retrieved based on their stored names, using embeddings generated by the `all-MiniLM-L6-v2` model. The results are filtered against the configured minimum similarity scores (`element.retrieval.min.target.score` for high confidence, `element.retrieval.min.general.score` for potential matches) and against `element.retrieval.min.page.relevance.score` for relevance to the current page.
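Conceptually, the score-based filtering amounts to the following self-contained sketch. This is illustrative only: the actual similarity search runs inside the vector DB, and `ScoredElement` is a hypothetical record introduced just for this example:

```java
import java.util.List;

public class RetrievalFilterSketch {

    // Hypothetical retrieval result: an element name with its similarity scores
    record ScoredElement(String name, double similarity, double pageRelevance) {}

    // Thresholds analogous to element.retrieval.min.target.score
    // and element.retrieval.min.page.relevance.score
    static final double MIN_TARGET_SCORE = 0.8;
    static final double MIN_PAGE_RELEVANCE = 0.5;

    // Cosine similarity between two embedding vectors
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Keep only the elements that qualify as target candidates
    static List<ScoredElement> targetCandidates(List<ScoredElement> retrieved) {
        return retrieved.stream()
                .filter(e -> e.similarity() >= MIN_TARGET_SCORE)
                .filter(e -> e.pageRelevance() >= MIN_PAGE_RELEVANCE)
                .toList();
    }
}
```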
Depending on the retrieval scores, three cases are possible:

- **Some elements meet `MIN_TARGET_RETRIEVAL_SCORE` and/or `MIN_PAGE_RELEVANCE_SCORE`**: these candidates are treated as potential target elements and passed on to the vision model and the OpenCV matching for disambiguation.
- **No elements meet `MIN_TARGET_RETRIEVAL_SCORE` or `MIN_PAGE_RELEVANCE_SCORE`, but some meet the `MIN_GENERAL_RETRIEVAL_SCORE`**: in attended mode, these near matches are presented to the operator, who may update one of them (e.g., after UI changes) or capture the target element anew; in unattended mode, element location fails.
- **No elements meet `MIN_GENERAL_RETRIEVAL_SCORE`**: in attended mode, the operator is asked to capture and describe the element, and a new `UiElement` record (with UUID, name, descriptions, page summary, screenshot) is stored into the vector DB; in unattended mode, element location fails.

This project uses Maven for dependency management and building.
git clone <repository_url>
cd <project_directory>
mvn clean package
This command downloads dependencies, compiles the code, runs tests (if any), and packages the application into a standalone JAR file in the `target/` directory.
Instructions for setting up Chroma DB, currently the only supported vector database, can be found on its official website.
Configure the agent by editing the `config.properties` file or by setting environment variables. **Environment variables override properties file settings.**
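For example, a minimal `config.properties` for a local setup with Google AI Studio might look like this (the values below are placeholders; all keys and defaults are documented in the list that follows):

```properties
# Execution mode: start with attended (trainee) mode
unattended.mode=false

# Vector database
vector.db.provider=chroma
vector.db.url=http://localhost:8020

# Models
instruction.model.provider=google
instruction.model.name=gemini-2.5-flash
vision.model.provider=google
vision.model.name=gemini-2.5-flash
google.api.provider=studio_ai
google.api.token=<your-ai-studio-api-key>

# Server mode port
port=8005
```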
Key Configuration Properties:
- `unattended.mode` (Env: `UNATTENDED_MODE`): `true` for unattended execution, `false` for attended (trainee) mode. Default: `false`.
- `debug.mode` (Env: `DEBUG_MODE`): `true` enables debug mode, which saves intermediate screenshots (e.g., with bounding boxes drawn) during element location for debugging purposes; `false` disables this. Default: `false`.
- `port` (Env: `PORT`): Port for the server mode. Default: `8005`.
- `host` (Env: `AGENT_HOST`): Host address for the server mode. Default: `localhost`.
- `vector.db.provider` (Env: `VECTOR_DB_PROVIDER`): Vector database provider. Default: `chroma`.
- `vector.db.url` (Env: `VECTOR_DB_URL`): Required URL for the vector database connection. Default: `http://localhost:8020`.
- `retriever.top.n` (Env: `RETRIEVER_TOP_N`): Number of top similar elements to retrieve from the vector DB based on semantic element name similarity. Default: `5`.
- `instruction.model.provider` (Env: `INSTRUCTION_MODEL_PROVIDER`): AI model provider for the instruction model (`google`, `openai`, or `groq`). Default: `google`.
- `vision.model.provider` (Env: `VISION_MODEL_PROVIDER`): AI model provider for the vision model (`google`, `openai`, or `groq`). Default: `google`.
- `instruction.model.name` (Env: `INSTRUCTION_MODEL_NAME`): Name/deployment ID of the model for processing test case actions and verifications. Default: `gemini-2.5-flash`.
- `vision.model.name` (Env: `VISION_MODEL_NAME`): Name/deployment ID of the vision-capable model. Default: `gemini-2.5-flash`.
- `model.max.output.tokens` (Env: `MAX_OUTPUT_TOKENS`): Maximum number of tokens in model responses. Default: `8192`.
- `model.temperature` (Env: `TEMPERATURE`): Sampling temperature for model responses. Default: `0.0`.
- `model.top.p` (Env: `TOP_P`): Top-P sampling parameter. Default: `1.0`.
- `model.max.retries` (Env: `MAX_RETRIES`): Maximum number of retries for model API calls. Default: `10`.
- `model.logging.enabled` (Env: `LOG_MODEL_OUTPUT`): Enable/disable model logging. Default: `false`.
- `thinking.output.enabled` (Env: `OUTPUT_THINKING`): Enable/disable thinking process output. Default: `true`.
- `gemini.thinking.budget` (Env: `GEMINI_THINKING_BUDGET`): Budget for the Gemini thinking process. Default: `0`.
- `google.api.provider` (Env: `GOOGLE_API_PROVIDER`): Google API provider (`studio_ai` or `vertex_ai`). Default: `studio_ai`.
- `google.api.token` (Env: `GOOGLE_AI_TOKEN`): API key for Google AI Studio. Required if using AI Studio.
- `google.project` (Env: `GOOGLE_PROJECT`): Google Cloud project ID. Required if using Vertex AI.
- `google.location` (Env: `GOOGLE_LOCATION`): Google Cloud location (region). Required if using Vertex AI.
- `azure.openai.api.key` (Env: `OPENAI_API_KEY`): API key for Azure OpenAI. Required if using OpenAI.
- `azure.openai.endpoint` (Env: `OPENAI_API_ENDPOINT`): Endpoint URL for Azure OpenAI. Required if using OpenAI.
- `groq.api.key` (Env: `GROQ_API_KEY`): API key for Groq. Required if using Groq.
- `groq.endpoint` (Env: `GROQ_ENDPOINT`): Endpoint URL for Groq. Required if using Groq.
- `test.step.execution.retry.timeout.millis` (Env: `TEST_STEP_EXECUTION_RETRY_TIMEOUT_MILLIS`): Timeout for retrying failed test case actions. Default: `5000` ms.
- `test.step.execution.retry.interval.millis` (Env: `TEST_STEP_EXECUTION_RETRY_INTERVAL_MILLIS`): Delay between test case action retries. Default: `1000` ms.
- `verification.retry.timeout.millis` (Env: `VERIFICATION_RETRY_TIMEOUT_MILLIS`): Timeout for retrying failed verifications. Default: `5000` ms.
- `action.verification.delay.millis` (Env: `ACTION_VERIFICATION_DELAY_MILLIS`): Delay after executing a test case action before performing the corresponding verification. Default: `500` ms.
- `element.bounding.box.color` (Env: `BOUNDING_BOX_COLOR`): Required color name (e.g., `green`) for the bounding box drawn during element capture in attended mode. This value should be tuned so that the color contrasts as much as possible with the average UI element color.
- `element.retrieval.min.target.score` (Env: `ELEMENT_RETRIEVAL_MIN_TARGET_SCORE`): Minimum semantic similarity score for vector DB UI element retrieval. Elements reaching this score are treated as target element candidates and used for further disambiguation by a vision model. Default: `0.8`.
- `element.retrieval.min.general.score` (Env: `ELEMENT_RETRIEVAL_MIN_GENERAL_SCORE`): Minimum semantic similarity score for vector DB UI element retrieval. Elements reaching this score will be displayed to the operator in case they decide to update any of them (e.g., due to UI changes). Default: `0.5`.
- `element.retrieval.min.page.relevance.score` (Env: `ELEMENT_RETRIEVAL_MIN_PAGE_RELEVANCE_SCORE`): Minimum page relevance score for vector DB UI element retrieval. Default: `0.5`.
- `element.locator.visual.similarity.threshold` (Env: `VISUAL_SIMILARITY_THRESHOLD`): OpenCV template matching threshold. Default: `0.8`.
- `element.locator.top.visual.matches` (Env: `TOP_VISUAL_MATCHES_TO_FIND`): Maximum number of visual matches of a single UI element from OpenCV to pass to the AI model for disambiguation. Default: `6`.
- `element.locator.min.intersection.area.ratio` (Env: `MIN_INTERSECTION_PERCENTAGE`): Minimum intersection area ratio for a visual match to be considered valid. Default: `0.8`.
- `element.locator.found.matches.dimension.deviation.ratio` (Env: `FOUND_MATCHES_DIMENSION_DEVIATION_RATIO`): Maximum allowed deviation ratio for the dimensions of a found visual match compared to the original element. Default: `0.3`.
- `element.locator.visual.grounding.model.vote.count` (Env: `VISUAL_GROUNDING_MODEL_VOTE_COUNT`): The number of times the visual grounding model is asked to identify potential locations of a UI element on the screen. A higher number can increase accuracy through consensus but also increases processing time and cost. Default: `5`.
- `element.locator.validation.model.vote.count` (Env: `VALIDATION_MODEL_VOTE_COUNT`): The number of times the validation model is asked to confirm the best match from a set of candidates. This is used to create a quorum and improve the reliability of element identification. Default: `3`.
- `element.locator.bbox.clustering.min.intersection.ratio` (Env: `BBOX_CLUSTERING_MIN_INTERSECTION_RATIO`): When using multiple votes from the visual grounding model, this value determines the minimum intersection-over-union (IoU) ratio for clustering bounding boxes. It controls how close bounding boxes need to be in order to be grouped into a single, averaged bounding box (see the sketch after this list). Default: `0.7`.
- `dialog.default.horizontal.gap`, `dialog.default.vertical.gap`, `dialog.default.font.type`, `dialog.user.interaction.check.interval.millis`, `dialog.default.font.size`: Cosmetic and timing settings for interactive dialogs.
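The bounding box clustering mentioned above relies on an intersection-over-union criterion. As a rough, self-contained illustration (not the project's actual `ElementLocator` code), the IoU check that decides whether two candidate boxes belong to the same cluster can be computed like this:

```java
import java.awt.Rectangle;

public class IouSketch {

    // Intersection-over-union of two bounding boxes
    static double iou(Rectangle a, Rectangle b) {
        Rectangle inter = a.intersection(b);
        if (inter.isEmpty()) {
            return 0.0; // no overlap at all
        }
        double interArea = (double) inter.width * inter.height;
        double unionArea = (double) a.width * a.height
                + (double) b.width * b.height - interArea;
        return interArea / unionArea;
    }

    public static void main(String[] args) {
        // Two hypothetical votes from the visual grounding model
        Rectangle voteA = new Rectangle(100, 100, 80, 30);
        Rectangle voteB = new Rectangle(105, 102, 80, 30);

        // Boxes whose IoU reaches the configured ratio (default 0.7)
        // would be grouped into a single, averaged bounding box
        double iou = iou(voteA, voteB);
        System.out.printf("IoU = %.2f, same cluster: %b%n", iou, iou >= 0.7);
    }
}
```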
**CLI mode**: Runs a single test case defined in a JSON file.
Ensure the project is built (`mvn clean package`). Run the `Agent` class directly using the Maven Exec Plugin (add the configuration to `pom.xml` if needed):
mvn exec:java -Dexec.mainClass="org.tarik.ta.Agent" -Dexec.args="<path/to/your/testcase.json>"
Or run the packaged JAR:
java -jar target/<your-jar-name.jar> <path/to/your/testcase.json>
**Server mode**: Starts a web server that listens for test case execution requests.
Run the `Server` class using the Maven Exec Plugin:
mvn exec:java -Dexec.mainClass="org.tarik.ta.Server"
Or run the packaged JAR:
java -jar target/<your-jar-name.jar>
The server listens on the configured port (default `8005`). Send a `POST` request to the root endpoint (`/`) with the test case JSON in the request body. The server responds with `200 OK` if it accepts the request (i.e., it is not already running a test case) or with `429 Too Many Requests` if it is busy; the accepted test case is then executed asynchronously.
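For example, with the default port and a test case stored in a (hypothetical) `testcase.json`, a request can be sent like this:

```bash
curl -X POST http://localhost:8005/ \
  -H "Content-Type: application/json" \
  -d @testcase.json
```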
This section provides detailed instructions for deploying the UI Test Automation Agent, both to Google Cloud Platform (GCP) and locally using Docker.
The agent can be deployed as a containerized application on a Google Compute Engine (GCE) virtual machine, providing a robust and scalable environment for automated UI testing. Because the agent needs at least two exposed ports (one for communicating with other agents and one for the noVNC connection), the financially more efficient Google Cloud Run is not an option. Using Spot VMs, however, is a viable way to reduce costs.
Make sure the `gcloud` command-line tool is installed and configured. The deployment also expects the following secrets:

- `GROQ_API_KEY`: Your API key for the Groq platform.
- `GROQ_ENDPOINT`: The endpoint URL for the Groq platform.
- `VECTOR_DB_URL`: The URL of your vector DB instance (see the deployment instructions below).
- `VNC_PW`: The password for accessing the noVNC session via the browser.

You can create these secrets using the GCP Console.
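If you prefer the CLI over the Console, the secrets can also be created with `gcloud` (the value is read from stdin; repeat for each secret listed above):

```bash
printf '%s' '<your-groq-api-key>' | \
  gcloud secrets create GROQ_API_KEY --replication-policy="automatic" --data-file=-
```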
The agent relies on a vector database; Chroma DB is currently the only supported option. You can deploy Chroma DB to Google Cloud Run using the provided `cloudbuild_chroma.yaml` configuration.
In `cloudbuild_chroma.yaml`:

- Replace `_CHROMA_BUCKET` with the name of a Google Cloud Storage bucket where Chroma DB will store its data.
- Adjust `_CHROMA_DATA_PATH` if you want a specific path within the bucket.
- Adjust `_PORT` if you want Chroma DB to run on a different port (the default is `8000`).

Then submit the build:
).gcloud builds submit . --config deployment/cloudbuild_chroma.yaml --substitutions=_CHROMA_BUCKET=<your-chroma-bucket-name>,_CHROMA_DATA_PATH=chroma,_PORT=8000 --project=<your-gcp-project-id>
After deployment, note the URL of the deployed Chroma DB service; this will be your `VECTOR_DB_URL`, which you need to set as a secret.
cd <project_root_directory>
The `deployment/cloud/deploy_gce.sh` script has some predefined values which need to be adapted, e.g. the network name, exposed ports, etc. If you want to use the agent as part of an already existing network (e.g. together with the Agentic QA Framework), you must carefully adapt all parameters so as not to destroy any existing settings.

./deployment/cloud/deploy_gce.sh
This script will, among other steps, build the agent's Docker image using `deployment/cloud/Dockerfile.cloudrun`, push it to Google Container Registry, and create the GCE instance from it.

Note: The script uses default values for the region, zone, instance name, etc. You can override these via the `gcloud` CLI.
Once the instance is running, the agent server listens on the port defined by `AGENT_SERVER_PORT` (default `443`). The internal hostname can be retrieved by executing `curl "http://metadata.google.internal/computeMetadata/v1/instance/hostname" -H "Metadata-Flavor: Google"` inside the VM. This hostname can later be used for communication with the other agents of the framework inside the network.
You can access the noVNC session in a browser at `https://<EXTERNAL_IP>:<NO_VNC_PORT>`, where `<EXTERNAL_IP>` is the external IP of your GCE instance and `<NO_VNC_PORT>` is the noVNC port (default `6901`). The VNC password is set via the `VNC_PW` secret. The SSL/TLS certificate is self-signed, so you will have to confirm visiting the page the first time.

For local development and testing, you can run the agent within a Docker container on your machine.
The `build_and_run_docker.bat` script (for Windows) simplifies the process of building the Docker image and running the container.
Before building, open `deployment/local/Dockerfile` and replace the placeholder `VNC_PW` environment variable with a strong password of your choice. For example:
ENV VNC_PW="your_strong_vnc_password"
(Note: the `build_and_run_docker.bat` script also sets `VNC_PW` to `123456` for convenience, but it is recommended to set it directly in the Dockerfile for consistency and security.) Then run:
deployment\local\build_and_run_docker.bat
This script will:
- Build a Docker image named `ui-test-automation-agent` using `deployment/local/Dockerfile`.
- Run a container named `ui-agent`.
- Map ports `5901` (VNC), `6901` (noVNC), and `8005` (agent server) to your local machine.

Once the container is running, you can connect via VNC at `localhost:5901`, open the noVNC session at `http://localhost:6901/vnc.html`, and reach the agent server at `http://localhost:8005`. Remember to use the VNC password you set in the Dockerfile when prompted.
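If you are not on Windows, the script's essential steps correspond roughly to the following commands (illustrative; check the script itself for the authoritative options):

```bash
docker build -t ui-test-automation-agent -f deployment/local/Dockerfile .
docker run -d --name ui-agent -p 5901:5901 -p 6901:6901 -p 8005:8005 ui-test-automation-agent
```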
Please refer to the CONTRIBUTING.md file for guidelines on contributing to this project.
The application bundles the embedding model (`all-MiniLM-L6-v2`) as a dependency of LangChain4j, as well as the native OpenCV libraries required for visual element location.

The number of available bounding box colors is limited (see the `availableBoundingBoxColors` field in `ElementLocator`). If more visual matches are found than there are available colors, an error will occur. This might happen if `element.locator.visual.similarity.threshold` is too low or if there are many visually similar elements on the screen (e.g., identical check-boxes in a list of items). In this case you might need a different labelling method for visual matches (the primary approach during development of this project was to use numbers placed outside the bounding box as labels, which proved less efficient than distinct bounding box colors, but remains a good option if the latter cannot be applied).

New code contributed to the `main` branch should include relevant unit tests. Contributing by adding new unit tests to existing code is, as always, welcome.