Estimating oil and gas recovery factors via machine learning: Database-dependent accuracy and reliability

Alireza Roustazadeh, Behzad Ghanbarian, Mohammad B. Shadmand, Vahid Taslimitehrani, Larry W. Lake
DOI: https://doi.org/10.1016/j.engappai.2023.107500
2022-10-23
Abstract: With recent advances in artificial intelligence, machine learning (ML) approaches have become an attractive tool in petroleum engineering, particularly for reservoir characterization. A key reservoir property is the hydrocarbon recovery factor (RF), whose accurate estimation would provide decisive insights into drilling and production strategies. Therefore, this study aims to estimate the hydrocarbon RF for exploration from various reservoir characteristics, such as porosity, permeability, pressure, and water saturation, via ML. We applied three regression-based models, namely extreme gradient boosting (XGBoost), support vector machine (SVM), and stepwise multiple linear regression (MLR), and various combinations of three databases to construct ML models and estimate the oil and/or gas RF. Using two databases and the cross-validation method, we evaluated the performance of the ML models. In each iteration, 90% and 10% of the data were used to train and test the models, respectively. The third, independent database was then used to further assess the constructed models. For both oil and gas RFs, we found that the XGBoost model estimated the RF for the train and test datasets more accurately than the SVM and MLR models. However, the performance of all the models was unsatisfactory for the independent databases. Results demonstrated that the ML algorithms were highly dependent on, and sensitive to, the databases on which they were trained. Statistical tests revealed that such unsatisfactory performance arose because the distributions of input features and target variables in the train datasets were significantly different from those in the independent databases (p-value < 0.05).
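The 90%/10% train/test split per iteration described in the abstract is consistent with 10-fold cross-validation, although the abstract does not state the number of folds. The following is a minimal sketch of that evaluation loop under stated assumptions: the data are synthetic and made up for illustration, a single-feature least-squares line stands in for the XGBoost, SVM, and MLR models, and the helper names (`k_fold_indices`, `fit_line`, `cross_validate`) are hypothetical rather than taken from the paper.

```python
import math
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each test fold holds ~1/n_folds of the rows."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (a stand-in for a real regressor)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cross_validate(x, y, n_folds=10, seed=0):
    """Fit on each training split and return the per-fold test RMSE."""
    scores = []
    for train, test in k_fold_indices(len(x), n_folds, seed):
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        preds = [a + b * x[i] for i in test]
        scores.append(rmse([y[i] for i in test], preds))
    return scores


# Synthetic, made-up data: porosity (fraction) vs. recovery factor (fraction).
rng = random.Random(42)
porosity = [rng.uniform(0.05, 0.35) for _ in range(200)]
rf = [0.1 + 0.8 * p + rng.gauss(0, 0.03) for p in porosity]

fold_rmse = cross_validate(porosity, rf)
print(f"mean test RMSE over {len(fold_rmse)} folds: {sum(fold_rmse) / len(fold_rmse):.3f}")
```

A low cross-validated error of this kind only describes data drawn from the same databases as the training set, which is why the study adds a separate check on an independent database.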
Machine Learning
What problem does this paper attempt to address?
This paper aims to estimate the recovery factor (RF) of oil and gas reservoirs accurately with machine-learning methods and to evaluate how the choice of database affects model accuracy. Specifically, the research aims to:

1. **Apply the eXtreme Gradient Boosting (XGBoost) algorithm**: Develop a machine-learning-based model to estimate the recovery factor of oil and gas reservoirs.
2. **Compare the performance of different machine-learning algorithms**: Compare XGBoost with Multiple Linear Regression (MLR) and Support Vector Machine (SVM) to evaluate their performance in estimating oil and gas RF.
3. **Explore the database-dependence issue**: Analyze the impact of different database combinations on the accuracy of machine-learning models and evaluate the reliability and uncertainty of the models on independent databases.

### Research Background

Traditional methods for estimating the recovery factor of oil and gas reservoirs, such as history matching and volume reserve estimation, have relatively large uncertainties and are time-consuming. With the development of artificial intelligence and data analysis technologies, machine-learning methods provide a new approach to estimating the recovery factor and can make more efficient use of early-stage data for prediction.

### Main Objectives

- **Develop an XGBoost model**: Estimate the recovery factor of oil and gas reservoirs.
- **Performance comparison**: Train models with multiple database combinations and compare the performance of XGBoost, MLR, and SVM.
- **Evaluate database dependence**: Use independent databases to further evaluate the accuracy and reliability of the models and reveal the impact of different databases on model performance.

### Key Findings

- The XGBoost model performs better than the SVM and MLR models on both the training set and the test set.
- The performance of the models is highly dependent on the database used for training, and the differences in feature distributions among databases significantly affect the models' ability to generalize.
- Statistical tests show significant differences in the distributions of input features and target variables between the training data and the independent database (p-value < 0.05), which results in poor model performance on the independent database.

Through these studies, the authors hope to provide a more efficient and accurate method for estimating the recovery factor of oil and gas reservoirs and to show how strongly database selection affects model performance.
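The distribution comparison behind the last finding can be illustrated with a two-sample test. The summary does not name the statistical test the authors used, so the two-sample Kolmogorov-Smirnov (KS) test below is an assumption chosen for illustration. The porosity samples are synthetic and made up, and the function names `ks_statistic` and `ks_p_value` are hypothetical. The p-value uses the standard asymptotic Kolmogorov series, which is only an approximation for small samples.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The gap between two step functions can only change at observed values.
    for v in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_p_value(d, n_a, n_b, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    lam = math.sqrt(n_a * n_b / (n_a + n_b)) * d
    if lam < 0.2:
        # The series converges slowly here, and the true value is ~1.
        return 1.0
    total = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, total))


# Synthetic, made-up porosity samples: the "independent" database is shifted.
rng = random.Random(7)
train_porosity = [rng.gauss(0.20, 0.04) for _ in range(300)]
independent_porosity = [rng.gauss(0.26, 0.04) for _ in range(300)]

d = ks_statistic(train_porosity, independent_porosity)
p = ks_p_value(d, len(train_porosity), len(independent_porosity))
print(f"KS statistic = {d:.3f}, p-value = {p:.3g}")
if p < 0.05:
    print("Distributions differ significantly; expect degraded accuracy on the independent data.")
```

In practice, a library routine such as `scipy.stats.ks_2samp` would normally replace the hand-written version. Running such a check per input feature and on the target before deployment gives an early warning that a trained model is being applied outside the range of its training database.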