SmartFlow: Robotic Process Automation using LLMs

Arushi Jain, Shubham Paliwal, Monika Sharma, Lovekesh Vig, Gautam Shroff
2024-05-21
Abstract: Robotic Process Automation (RPA) systems face challenges in handling complex processes and diverse screen layouts that require advanced human-like decision-making capabilities. These systems typically rely on pixel-level encoding through drag-and-drop or automation frameworks such as Selenium to create navigation workflows, rather than visual understanding of screen elements. In this context, we present SmartFlow, an AI-based RPA system that uses pre-trained large language models (LLMs) coupled with deep-learning based image understanding. Our system can adapt to new scenarios, including changes in the user interface and variations in input data, without the need for human intervention. SmartFlow uses computer vision and natural language processing to perceive visible elements on the graphical user interface (GUI) and convert them into a textual representation. This information is then utilized by LLMs to generate a sequence of actions that are executed by a scripting engine to complete an assigned task. To assess the effectiveness of SmartFlow, we have developed a dataset that includes a set of generic enterprise applications with diverse layouts, which we are releasing for research use. Our evaluations on this dataset demonstrate that SmartFlow exhibits robustness across different layouts and applications. SmartFlow can automate a wide range of business processes such as form filling, customer service, invoice processing, and back-office operations. SmartFlow can thus assist organizations in enhancing productivity by automating an even larger fraction of screen-based workflows. The demo-video and dataset are available at
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty that current Robotic Process Automation (RPA) systems have with complex processes and diverse screen layouts. Existing RPA systems usually build navigation workflows through pixel-level encoding, using drag-and-drop tools or automation frameworks such as Selenium, rather than by visually understanding screen elements. Because they lack human-like decision-making capabilities, they adapt poorly to changes in the user interface and struggle with tasks that require visual analysis and natural language understanding. To address this, the paper proposes SmartFlow, an AI-driven RPA system that combines pre-trained large language models (LLMs) with deep-learning-based image understanding. SmartFlow automatically identifies and locates screen elements and generates navigation workflows using information from the HTML source code. It adapts to new scenarios, including changes in the user interface and variations in input data, without human intervention. Using computer vision and natural language processing, SmartFlow perceives the visible elements of the graphical user interface (GUI) and converts them into a textual representation; an LLM then generates a sequence of actions, which a scripting engine executes to complete the assigned task. According to the authors' evaluation, this makes SmartFlow robust across different layouts and applications, helping organizations improve productivity by automating a larger share of screen-based workflows.
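The perceive, describe, plan, execute flow summarized above can be illustrated with a small sketch. This is not the authors' implementation: the `ScreenElement` and `Action` types, the `verb | target | value` action format, and every function name are hypothetical, chosen only for illustration. The perception step is assumed to have already produced a list of elements, the LLM is passed in as a plain callable, and the "scripting engine" only logs what it would do instead of driving a real GUI.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScreenElement:
    """A visible GUI element as a perception step might report it (hypothetical schema)."""
    kind: str   # e.g. "textbox" or "button"
    label: str  # visible text or nearby label
    x: int      # screen coordinates of the element
    y: int


@dataclass
class Action:
    """One step of a generated workflow (hypothetical format)."""
    verb: str        # "type" or "click"
    target: str      # label of the element to act on
    value: str = ""  # text to enter, for "type" actions


def describe_screen(elements: List[ScreenElement]) -> str:
    """Convert detected elements into the textual representation given to the LLM."""
    return "\n".join(
        f"[{i}] {e.kind} '{e.label}' at ({e.x}, {e.y})"
        for i, e in enumerate(elements)
    )


def build_prompt(task: str, screen_text: str) -> str:
    """Combine the task and the screen description into one prompt (assumed wording)."""
    return (
        f"Task: {task}\n"
        f"Screen elements:\n{screen_text}\n"
        "Reply with one action per line as: verb | target | value"
    )


def parse_actions(llm_output: str) -> List[Action]:
    """Parse 'verb | target | value' lines; lines that do not match are ignored."""
    actions = []
    for raw in llm_output.splitlines():
        parts = [p.strip() for p in raw.split("|")]
        if len(parts) >= 2 and parts[0].lower() in {"type", "click"}:
            value = parts[2] if len(parts) > 2 else ""
            actions.append(Action(parts[0].lower(), parts[1], value))
    return actions


def execute(actions: List[Action], elements: List[ScreenElement]) -> List[str]:
    """Stand-in for a scripting engine: resolve targets by label and log each step."""
    by_label = {e.label: e for e in elements}
    log = []
    for a in actions:
        el = by_label.get(a.target)
        if el is None:
            log.append(f"skip {a.verb}: no element labelled '{a.target}'")
        elif a.verb == "type":
            log.append(f"type '{a.value}' into '{el.label}' at ({el.x}, {el.y})")
        else:
            log.append(f"click '{el.label}' at ({el.x}, {el.y})")
    return log


def run_task(task: str, elements: List[ScreenElement],
             llm: Callable[[str], str]) -> List[str]:
    """Run the whole pipeline: describe the screen, ask the LLM, execute its actions."""
    prompt = build_prompt(task, describe_screen(elements))
    return execute(parse_actions(llm(prompt)), elements)


if __name__ == "__main__":
    screen = [
        ScreenElement("textbox", "Name", 10, 20),
        ScreenElement("button", "Submit", 10, 60),
    ]
    # A canned reply stands in for a real LLM call.
    fake_llm = lambda prompt: "type | Name | Alice\nclick | Submit"
    for line in run_task("Fill in the form for Alice", screen, fake_llm):
        print(line)
```

Because targets are resolved by label rather than by fixed pixel positions, the same generated actions would still apply if the elements moved on screen, which is the kind of layout robustness the summary describes.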