Evaluating the Accuracy of GUI Testing Using Multi-Modal Large Language Models
2026 (English) Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Utvärdering av noggrannheten i GUI-testning med multimodala stora språkmodeller (Swedish)
Abstract [en]
Recent advances in Multi-modal Large Language Models (M-LLMs) have created new opportunities for automating Graphical User Interface (GUI) testing through screenshot-based interaction guided by natural-language instructions. This thesis investigates how accurately these models can execute GUI test actions, how sensitive they are to controlled variations in GUI layout and instruction wording, how reliable their final verdicts are, and how model size affects execution efficiency.
To study this, a proof-of-concept pipeline called GUIOracle was implemented. The pipeline combines specification, interaction, and verification stages to execute natural-language GUI test instructions using screenshot capture, GUI parsing, local multi-modal model inference, and automated final-state assessment. The approach was evaluated in two environments: Tacsi, an industrially relevant simulator at Saab, and OpenScope, a controlled GUI environment used for comparative experimentation. Five Qwen 3.5 model sizes, from 0.8B to 27B parameters, were included in the evaluation, although the 27B model was tested only in OpenScope.
The results show that GUI action accuracy, scenario success, and verification-stage reliability improved clearly with model size. The smallest models were often unreliable and frequently timed out, whereas the larger models were better at executing correct GUI actions and completing scenarios successfully. Controlled GUI layout changes generally had a stronger negative effect than instruction wording changes, indicating that the approach was more sensitive to layout variation than to simplified phrasing. Runtime analysis further showed that smaller models were not automatically the most practical, since weaker action selection often led to longer interaction traces. The 27B model achieved the highest action accuracy, but the 9B model provided the best balance between correctness and execution efficiency. These findings suggest that local M-LLM-based GUI testing is promising in controlled environments, although the approach is best viewed as a complement to existing testing methods rather than as a complete replacement.
Place, publisher, year, edition, pages
2026, p. 83
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:liu:diva-224405
ISRN: LIU-IDA/LITH-EX-A--26/020--SE
OAI: oai:DiVA.org:liu-224405
DiVA, id: diva2:2064667
External cooperation
Saab
Supervisors
Examiners
2026-06-02 2026-06-02 2026-06-02 Bibliographically approved