Abstract: As the world's energy demand continues to expand, shale oil has a substantial influence on global energy reserves. The third submember of Member 3 of the Shahejie Formation, characterized by complicated mudrock lithofacies, is one of the significant shale oil enrichment intervals of the Bohai Bay Basin. The classification and identification of lithofacies are key to shale oil exploration and development. However, the efficiency and reliability of lithofacies identification can be compromised by qualitative classification resulting from an incomplete workflow. To address this issue, a comprehensive technical workflow for mudrock lithofacies classification and logging prediction was designed based on machine learning. Principal component analysis (PCA) and hierarchical cluster analysis (HCA) were used to classify lithofacies automatically. This approach groups samples according to the internal relationships in the data, without the disturbance of human factors, and provides an accurate lithofacies result in a much shorter time. The PCA and HCA results showed that the third submember can be split into five lithofacies: massive argillaceous limestone lithofacies (MAL), laminated calcareous claystone lithofacies (LCC), intermittent lamellar argillaceous limestone lithofacies (ILAL), continuous lamellar argillaceous limestone lithofacies (CLAL), and laminated mixed shale lithofacies (LMS). Random forest (RF) was then used to establish an identification model for each of the lithofacies, and the resulting model was optimized by grid search (GS) and K-fold cross validation (KCV) so that it could be used to predict the lithofacies of the non-coring section. The three validation methods all gave accuracies above 93% for the GS–KCV–RF model. The performance of the models could be enhanced further by resampling, incorporating domain knowledge, and using attention mechanisms.
Our method addresses two problems in previous work on lithofacies prediction: the subjective and time-consuming manual interpretation of lithofacies classification, and the insufficient generalization ability of machine-learning methods. It also greatly improves the accuracy of mudrock lithofacies prediction. The lithofacies machine-learning workflow introduced in this study has the potential to be applied in the Bohai Bay Basin and comparable reservoirs to enhance exploration efficiency and reduce economic costs.
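
The grid-search and K-fold cross-validation loop mentioned in the abstract can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`k_fold_indices`, `cross_val_accuracy`, `grid_search`), the toy feature values, and the `n_neighbors` grid are all hypothetical, and a small nearest-neighbour voter stands in for the random forest so that the example has no dependencies. The source does not state which RF hyperparameters were tuned or which value of K was used.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def knn_predict(train_X, train_y, x, n_neighbors):
    """Stand-in classifier (NOT a random forest): majority vote of nearest samples."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = [train_y[i] for i in order[:n_neighbors]]
    return max(sorted(set(votes)), key=votes.count)


def cross_val_accuracy(X, y, k, **params):
    """Mean accuracy over k folds for one hyperparameter setting."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(knn_predict(train_X, train_y, X[i], **params) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return mean(scores)


def grid_search(X, y, param_grid, k=5):
    """Try every combination in param_grid; keep the best cross-validated score."""
    names = list(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(X, y, k, **params)
        if score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical toy data: two scaled log responses per sample, two lithofacies labels.
X = [(0.10, 0.20), (0.90, 0.80), (0.20, 0.10), (0.80, 0.90), (0.15, 0.25),
     (0.85, 0.75), (0.05, 0.10), (0.95, 0.90), (0.20, 0.30), (0.70, 0.80)]
y = ["MAL", "LMS"] * 5

best_params, best_score = grid_search(X, y, {"n_neighbors": [1, 3, 5]}, k=5)
print(best_params, best_score)
```

With a real random forest the same loop would iterate over forest hyperparameters instead of `n_neighbors`; in practice this is usually delegated to a library routine such as scikit-learn's `GridSearchCV`.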

What problem does this paper attempt to address?

The paper aims to improve the efficiency and accuracy of mudrock lithofacies classification and logging prediction, specifically in the shale oil reservoirs of the third submember of Member 3 of the Shahejie Formation in the Bonan Sag, Bohai Bay Basin, eastern China. It focuses on how to classify mudrock lithofacies automatically and predict them from logs using machine learning, namely principal component analysis (PCA), hierarchical cluster analysis (HCA), and an optimized random forest (RF) algorithm. Solving these problems supports shale oil exploration and development: it reduces the subjectivity and time cost of manual interpretation while improving the generalization ability and prediction accuracy of the model.

### Main research contents:

1. **Background introduction**:
   - Shale oil is increasingly important to global energy reserves.
   - The mudrock lithofacies of the third submember of Member 3 of the Shahejie Formation are complex, and the interval is an important shale oil enrichment interval in the Bohai Bay Basin.
   - The classification and identification of mudrock lithofacies are crucial for shale oil exploration and development, but traditional manual classification is inefficient and unreliable.
2. **Research methods**:
   - **Data collection**: Geological parameters are obtained through drill-core observation, thin-section preparation, and X-ray diffraction (XRD) analysis.
   - **Principal component analysis (PCA)**: Used for dimension reduction, extracting the principal components that reflect the characteristics of the mudrock lithofacies.
   - **Hierarchical cluster analysis (HCA)**: Clusters the samples automatically on the principal components extracted by PCA, generates a dendrogram, and determines the lithofacies types.
   - **Random forest (RF) model**: Establishes a lithofacies identification model, with the model parameters optimized through grid search (GS) and K-fold cross-validation (KCV) to improve prediction accuracy.
3. **Results**:
   - **Lithofacies classification**: Through PCA and HCA, the third submember is divided into five lithofacies: massive argillaceous limestone (MAL), laminated calcareous claystone (LCC), intermittent lamellar argillaceous limestone (ILAL), continuous lamellar argillaceous limestone (CLAL), and laminated mixed shale (LMS).
   - **Model performance**: The optimized RF model predicts the lithofacies of non-cored sections with an accuracy of more than 93%.

### Significance of the paper:

- **Improve efficiency**: The automated method significantly reduces the time and cost of manual classification.
- **Improve accuracy**: Machine learning improves the accuracy of mudrock lithofacies classification and logging prediction.
- **Application prospects**: The method can be applied to the Bohai Bay Basin and similar reservoirs to improve exploration efficiency and reduce economic costs.

### Formula display:

- **Principal component analysis (PCA)**:
  - Calculation of eigenvalue and variance contribution rate:

\[ \text{PC1} = 0.194\times\text{clay} - 0.208\times\text{carb} + 0.195\times\text{felsic} + 0.113\times\text{chlorite} + 0.089\times\text{porosity} + 0.097\times\text{permeability} + 0.007\times\text{So} - 0.196\times\text{density} - 0.001\times\text{structure} + 0.183\times\text{TOC} \]

\[ \text{PC2} = -0.027\times\text{clay} + 0.035\times\text{carb} + 0.023\times\text{felsic} + 0.335\times\text{chlorite} - 0.390\times\text{porosity} + 0.066\times\text{permeability} + 0.412\times\text{So} + 0.069\times\text{de
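
The PC1 expression quoted above is a weighted sum, so a principal component score for one sample can be computed directly from the coefficients. The sketch below assumes the inputs are already standardized (the summary does not say how the variables were scaled or how `structure` was encoded), and the sample values are hypothetical. Only PC1 is used, because the PC2 expression is cut off in the source.

```python
# PC1 coefficients exactly as quoted in the summary.
PC1_LOADINGS = {
    "clay": 0.194,
    "carb": -0.208,
    "felsic": 0.195,
    "chlorite": 0.113,
    "porosity": 0.089,
    "permeability": 0.097,
    "So": 0.007,
    "density": -0.196,
    "structure": -0.001,
    "TOC": 0.183,
}


def pc_score(sample, loadings):
    """Return the weighted sum of the sample's variables for one component."""
    missing = set(loadings) - set(sample)
    if missing:
        raise KeyError(f"sample is missing variables: {sorted(missing)}")
    return sum(loadings[name] * sample[name] for name in loadings)


# Hypothetical sample: every standardized variable set to 1.0, so the score
# is simply the sum of the coefficients.
sample = {name: 1.0 for name in PC1_LOADINGS}
pc1 = pc_score(sample, PC1_LOADINGS)
print(round(pc1, 3))
```

Scores computed this way for each retained component would form the input to the clustering step.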

Data-Driven Classification and Logging Prediction of Mudrock Lithofacies Using Machine Learning: Shale Oil Reservoirs in the Eocene Shahejie Formation, Bonan Sag, Bohai Bay Basin, Eastern China

Data-driven lithofacies prediction in complex tight sandstone reservoirs: a supervised workflow integrating clustering and classification models

Application and Comparison of Machine Learning Methods for Mud Shale Petrographic Identification

Lithofacies identification of shale formation based on mineral content regression using LightGBM algorithm: A case study in the Luzhou block, South Sichuan Basin, China

Integrated Carbonate Lithofacies Modeling Based on the Deep Learning and Seismic Inversion and Its Application

Lithofacies logging identification for strongly heterogeneous deep-buried reservoirs based on improved Bayesian inversion: The Lower Jurassic sandstone, Central Junggar Basin, China

Evaluation Techniques for Shale Oil Lithology and Mineral Composition Based on Principal Component Analysis Optimized Clustering Algorithm

The application of machine learning under supervision in identification of shale lamina combination types — A case study of Chang 73 sub-member organic-rich shales in the Triassic Yanchang Formation, Ordos Basin, NW China

The controlling factors and prediction model of pore structure in global shale sediments based on random forest machine learning

A novel hybrid CNN–SVM method for lithology identification in shale reservoirs based on logging measurements

Well-Logging-Based Lithology Classification Using Machine Learning Methods for High-Quality Reservoir Identification: A Case Study of Baikouquan Formation in Mahu Area of Junggar Basin, NW China

Prediction of igneous lithology and lithofacies based on ensemble learning with data optimization

Lithofacies identification of shale reservoirs using a tree augmented Bayesian network: A case study of the lower Silurian Longmaxi formation in the changning block, South Sichuan basin, China

Data-driven diagenetic facies classification and well-logging identification based on machine learning methods: A case study on Xujiahe tight sandstone in Sichuan Basin

An Approach for the Classification of Rock Types Using Machine Learning of Core and Log Data

Multi-scale classification and evaluation of shale reservoirs and ‘sweet spot’ prediction of the second and third members of the Qingshankou Formation in the Songliao Basin based on machine learning

Integrating deep learning and logging data analytics for lithofacies classification and 3D modeling of tight sandstone reservoirs

The Application of Geostatistical Inversion in Shale Lithofacies Prediction: a Case Study of the Lower Silurian Longmaxi Marine Shale in Fuling Area in the Southeast Sichuan Basin, China

Lithofacies Classification and Origin of the Eocene Lacustrine Fine-Grained Sedimentary Rocks in the Jiyang Depression, Bohai Bay Basin, Eastern China

Data-driven machine learning approaches for precise lithofacies identification in complex geological environments

Prediction of TOC in Lishui–Jiaojiang Sag Using Geochemical Analysis, Well Logs, and Machine Learning