Leveraging Generative AI for Extracting Process Models from Multimodal Documents

Marvin Voelter, Raheleh Hadian, Timotheus Kampik, Marius Breitmayer, Manfred Reichert
2024-06-07
Abstract: This paper presents an investigation of the capabilities of Generative Pre-trained Transformers (GPTs) to auto-generate graphical process models from multi-modal (i.e., text- and image-based) inputs. More precisely, we first introduce a small dataset as well as a set of evaluation metrics that allow for a ground truth-based evaluation of multi-modal process model generation capabilities. We then conduct an initial evaluation of commercial GPT capabilities using zero-, one-, and few-shot prompting strategies. Our results indicate that GPTs can be useful tools for semi-automated process modeling based on multi-modal inputs. More importantly, the dataset and evaluation metrics as well as the open-source evaluation code provide a structured framework for continued systematic evaluations moving forward.
Software Engineering
What problem does this paper attempt to address?
The problem this paper addresses is: **how to use generative AI (especially multi-modal generative models) to automatically generate graphical business process models from multi-modal documents containing text and images**. Specifically, the authors introduce a multi-modal dataset and a set of evaluation metrics in order to systematically evaluate how well multi-modal generative models (such as GPTs) turn text and image inputs into business process models.

### Main problems

1. **Processing of multi-modal inputs**: Traditional business process model generation methods usually handle only a single type of input (such as pure text or pure image), whereas this paper explores processing text and images together.
2. **Establishment of an evaluation framework**: Comparing the performance of different models rigorously requires an evaluation framework based on ground-truth data. The authors therefore created a small dataset containing 123 models and defined a set of evaluation metrics.
3. **Verification of model performance**: Commercial GPT models (such as GPT-4V) are evaluated on multi-modal inputs using zero-shot, one-shot, and few-shot prompting strategies to assess their feasibility in practical applications.

### Solutions

- **Dataset construction**: The authors created a multi-modal document dataset containing text and images based on the SAP-SAM dataset and provided the corresponding ground truth in JSON format.
- **Evaluation framework**: The authors propose an evaluation framework based on element decomposition and an adjusted Sørensen–Dice coefficient to quantify the similarity between a generated model and its ground truth.
- **Experimental verification**: Experiments were carried out on GPT-4V using zero-shot, one-shot, and few-shot prompting strategies. The results show that multi-modal GPT performs very well on some element types (such as task names and task types) but still struggles with gateway labels and flows.

### Conclusions

The research shows that multi-modal generative models can, to a certain extent, automatically generate business process models from multi-modal documents, but they still need improvement, especially in handling complex relationships and detailed information. The authors also emphasize that this work provides a structured evaluation framework that future research in this field can build on.
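To make the evaluation idea concrete, the following is a minimal sketch of scoring a generated model against its ground truth by decomposing both into element sets and computing the plain Sørensen–Dice coefficient per element type. It is not the paper's implementation: the summary does not say how the authors adjust the coefficient, so no adjustment is applied here. The dictionary layout, the keys `tasks` and `flows`, the function names, and the choice to score two empty sets as 1.0 are all assumptions made for illustration.

```python
def dice(generated, truth):
    """Plain Sørensen–Dice coefficient between two collections of elements."""
    generated, truth = set(generated), set(truth)
    if not generated and not truth:
        # Assumption: two empty sets count as a perfect match.
        return 1.0
    return 2 * len(generated & truth) / (len(generated) + len(truth))


def score_by_element_type(generated_model, truth_model):
    """Decompose both models by element type and score each type separately."""
    return {
        element_type: dice(generated_model.get(element_type, []), elements)
        for element_type, elements in truth_model.items()
    }


# Hypothetical ground truth and generated model (not taken from the dataset).
truth_model = {
    "tasks": ["Check order", "Ship goods", "Send invoice"],
    "flows": [("Check order", "Ship goods"), ("Ship goods", "Send invoice")],
}
generated_model = {
    "tasks": ["Check order", "Ship goods"],
    "flows": [("Check order", "Ship goods")],
}

scores = score_by_element_type(generated_model, truth_model)
for element_type, value in scores.items():
    print(f"{element_type}: {value:.3f}")
```

In this toy example the generated model recovers two of three tasks, giving 2·2/(2+3) = 0.8, and one of two flows, giving 2·1/(1+2) ≈ 0.667. Scoring each element type separately is what lets an evaluation report that task names are recovered well while flows lag behind.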