J. Ignacio Deza, Hisham Ihshaish, Lamine Mahdjoubi
Abstract: We introduce the first automated models for classifying natural language descriptions provided in cost documents called "Bills of Quantities" (BoQs), popular in the infrastructure construction industry, into the International Construction Measurement Standard (ICMS). The models we deployed and systematically evaluated for multi-class text classification are learnt from a dataset of more than 50 thousand descriptions of items retrieved from 24 large infrastructure construction projects across the United Kingdom. We describe our approach to language representation and subsequent modelling to examine the strength of contextual semantics and temporal dependency of language used in construction project documentation. To do that, we evaluate two experimental pipelines for inferring ICMS codes from text, on the basis of two different language representation models and a range of state-of-the-art sequence-based classification methods, including recurrent and convolutional neural network architectures. The findings indicate that a highly effective and accurate ICMS automation model is within reach, with reported accuracy results above 90% F1 score on average, on 32 ICMS categories. Furthermore, due to the specific nature of language use in the BoQs text (short, largely descriptive and technical), we find that simpler models compare favourably in achieving higher accuracy results. Our analysis suggests that information is more likely embedded in local key features in the descriptive text, which explains why a simpler generic temporal convolutional network (TCN) exhibits comparable memory to recurrent architectures with the same capacity, and subsequently outperforms these at this task.
What problem does this paper attempt to address?
This paper addresses the problem of automatically classifying the natural-language descriptions in construction cost documents into the International Construction Measurement Standard (ICMS) using machine-learning methods, in an industry that lacks uniform standards. Specifically, it aims to develop automated models that efficiently and accurately assign the descriptive texts in "Bills of Quantities" (BoQs) to the corresponding ICMS categories, thereby promoting wider adoption of ICMS and providing an effective tool for project comparison and cost-performance analysis.
### Background and Objectives of the Paper
1. **Standardization Challenges**:
- The degree of standardization in the construction industry is much lower than that in other industries (such as manufacturing, software, financial services, and medical services), which has led to problems of project budget overruns and delays.
- Globally, 98% of infrastructure projects have budget overruns or delays, with an average budget overrun of 80% and a delay of at least 20 months.
- The productivity in the construction industry is more than 30% lower than the global average.
2. **ICMS Standard**:
- The International Construction Measurement Standard (ICMS) aims to provide globally consistent standards for construction project cost classification, definition, measurement, analysis, and reporting.
- ICMS is a high-level cost classification system designed to make projects in different countries and different fields comparable.
3. **Research Motivation**:
- Manually classifying BoQs into the ICMS standard is a time-consuming and error-prone task, which hinders the wide adoption of ICMS.
- By introducing machine-learning models, this paper aims to automate this process, improve the accuracy and efficiency of classification, and thus promote the popularization of the ICMS standard.
### Research Methods
1. **Data Acquisition and Pre-processing**:
- 124,000 material and cost items from 24 projects were collected from a large-scale infrastructure construction company in the UK; each project contains thousands of lines of cost descriptions.
- The data set was pre-processed by removing duplicate samples, special characters, and numbers, among other steps.
- Finally, 51,906 samples were retained, covering 32 ICMS categories.
2. **Models and Methods**:
- **Language Representation**: Two methods were used to represent texts: a "Bag-of-Words" (BoW) model based on word frequency, and a model based on word embeddings.
- **Classification Models**: Multiple classification models were evaluated, including Support Vector Machines (SVM), Random Forest, Multi-Layer Perceptron (MLP), Bidirectional Long Short-Term Memory network (BiLSTM), Bidirectional Gated Recurrent Unit (BiGRU), and Temporal Convolutional Network (TCN).
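The cleaning and Bag-of-Words steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' pipeline: the function names (`clean_description`, `preprocess`, `bag_of_words`), the exact cleaning rules, the choice to deduplicate on the cleaned (text, label) pair, and the sample rows with placeholder labels are all assumptions made for the example.

```python
import re
from collections import Counter


def clean_description(text: str) -> str:
    """Lowercase, drop digits and special characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep ASCII letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()


def preprocess(rows):
    """Clean (description, label) pairs and drop duplicates (first one wins)."""
    seen = set()
    kept = []
    for description, label in rows:
        key = (clean_description(description), label)
        if key[0] and key not in seen:  # skip empty texts and repeats
            seen.add(key)
            kept.append(key)
    return kept


def bag_of_words(descriptions):
    """Return a sorted vocabulary and one word-count vector per description."""
    vocab = sorted({tok for d in descriptions for tok in d.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for d in descriptions:
        vec = [0] * len(vocab)
        for tok, count in Counter(d.split()).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


# Hypothetical BoQ-style rows; the labels are placeholders, not real ICMS codes.
rows = [
    ("Excavation of trench, 1.5m deep", "CAT_A"),
    ("EXCAVATION of trench 2.0m deep!", "CAT_A"),  # duplicate after cleaning
    ("Supply 300mm dia. concrete pipe", "CAT_B"),
]
samples = preprocess(rows)
vocab, vectors = bag_of_words([text for text, _ in samples])
```

The resulting count vectors are the kind of input a frequency-based classifier such as an SVM, Random Forest, or MLP would consume; the embedding-based pipeline would instead map each token to a dense vector and keep the token order.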
### Experimental Results
1. **Performance Evaluation**:
- Most models performed well in inferring the ICMS standard from BoQs texts, especially simple models such as TCN and MLP.
- The Random Forest model performed well in dealing with unbalanced data and noise, and its performance was close to that of the best model.
- The MLP model achieved an F1-score of more than 90% in most ICMS categories, indicating the effectiveness of simple models in handling BoQs texts.
2. **Key Findings**:
- The discriminative information sits mainly in local key features of the BoQs texts; wider context has little influence on classification.
- Simple models such as TCN and MLP perform excellently in handling such short-text tasks, even better than complex Recurrent Neural Network (RNN) architectures.
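To make the per-category F1 figures concrete, the sketch below computes an F1-score for each class and then an unweighted (macro) average over classes. It is only an illustration: the summary does not say whether the reported average is macro or weighted, and the helper names and toy labels are hypothetical.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels standing in for ICMS categories (placeholders only).
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]

scores = per_class_f1(y_true, y_pred)
average = macro_f1(y_true, y_pred)
```

With imbalanced categories, a macro average weights rare classes as heavily as common ones, so it can differ noticeably from a support-weighted average on the same predictions.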
### Conclusion
The paper successfully demonstrates the feasibility of automatically classifying BoQs into the ICMS standard through machine-learning methods, providing strong support for promoting the wide adoption of the ICMS standard. The research results show that simple models have high accuracy and efficiency in handling BoQs text classification tasks, laying the foundation for further research and application in the future.