Experimenting with modeling-specific word embeddings

López, José Antonio Hernández; Durá, Carlos; Cuadrado, Jesús Sánchez
DOI: https://doi.org/10.1007/s10270-024-01250-5
2024-12-13
Software & Systems Modeling
Abstract: The application of machine learning techniques to address MDE problems often requires transforming raw information (e.g., software models) into a numerical representation that can be used by machine learning algorithms. To this end, pretrained embeddings are a key technology to facilitate the construction of such applications. However, previous works have demonstrated that these embeddings struggle to generalize effectively in the MDE domain due to their training on general-purpose corpora. To tackle this issue, we developed WordE4MDE, a set of specialized word embeddings trained specifically on modeling documents. In this study, we aim to overcome several limitations of WordE4MDE and conduct additional experiments to assess its efficacy. Key limitations we address include: (1) mitigating the out-of-vocabulary issue through the use of sub-word embeddings, (2) adding contextualization to the embeddings by training a BERT model on our specific modeling corpus, and (3) addressing the constraint of limited training data by investigating the augmentation of our modeling corpus with StackOverflow and StackExchange data.
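Limitation (1) in the abstract can be illustrated with a minimal sketch of how sub-word embeddings avoid the out-of-vocabulary problem. It is not the authors' implementation: it follows the general fastText-style idea of composing a word vector from its character n-grams, and the function names (`char_ngrams`, `ngram_vector`, `embed`), the n-gram range, and the hash-derived toy vectors that stand in for trained n-gram vectors are all assumptions made for illustration.

```python
import hashlib


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word wrapped in boundary markers."""
    token = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    # Very short tokens yield no n-grams; fall back to the whole token.
    return grams or [token]


def ngram_vector(gram, dim=8):
    """Toy stand-in for a learned n-gram vector (assumption): values derived from a hash."""
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return [(b / 255.0) - 0.5 for b in digest[:dim]]


def embed(word, dim=8):
    """Compose a word vector as the average of its n-gram vectors."""
    vectors = [ngram_vector(g, dim) for g in char_ngrams(word)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# A word never seen as a whole still gets a vector, because it shares
# n-grams (e.g. "<me", "mod", "del") with words that were seen.
unseen = embed("metamodels")
shared = set(char_ngrams("metamodel")) & set(char_ngrams("metamodels"))
```

In a trained model the n-gram vectors would be learned from the modeling corpus, so the shared n-grams would place `metamodels` close to `metamodel` rather than at an arbitrary point.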
computer science, software engineering