Machine Learning-Based Identification of Contaminated Images in Light Curves Data Preprocessing

Hui Li, Rong-Wang Li, Peng Shu, Yu-Qiang Li
DOI: https://doi.org/10.1088/1674-4527/ad339e
2024-04-02
Abstract: Attitude is one of the crucial parameters for space objects and plays a vital role in collision prediction and debris removal. Analyzing light curves to determine attitude is the most commonly used method. In photometric observations, outliers may exist in the obtained light curves for various reasons, so preprocessing is required to remove them and obtain high-quality light curves. Through statistical analysis, the causes of outliers can be categorized into two main types: first, the brightness of the object significantly increases due to the passage of a star nearby, referred to as "stellar contamination," and second, the brightness markedly decreases due to cloud cover, referred to as "cloudy contamination." The traditional approach of manually inspecting images for contamination is time-consuming and labor-intensive. We therefore propose machine learning methods as a substitute. Convolutional Neural Networks (CNN) and Support Vector Machines (SVM) are employed to identify cases of stellar contamination and cloudy contamination, achieving F1 scores of 1.00 and 0.98 on the test set, respectively. We also explored other machine learning methods, such as Residual Network-18 (ResNet-18) and Light Gradient Boosting Machine (lightGBM), and conducted comparative analyses of the results.
Instrumentation and Methods for Astrophysics
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to automatically identify and remove contaminated images during the preprocessing stage of light-curve data in astronomical observations. Specifically, the paper focuses on two main types of contamination:

1. **Stellar contamination**: When a star passes near the target celestial body, it causes a significant increase in the image brightness.
2. **Cloudy contamination**: Due to cloud cover, the image brightness is significantly reduced.

Traditional manual inspection methods are time-consuming and labor-intensive, especially when dealing with large amounts of data. Therefore, the paper proposes to use machine-learning methods to replace manual inspection in order to improve efficiency and accuracy. Specifically, the following machine-learning models are used for classification:

- **Convolutional Neural Network (CNN)**: Used for binary classification to distinguish between stellar-contaminated images and normal images.
- **Support Vector Machine (SVM)**, **Light Gradient Boosting Machine (lightGBM)** and **ResNet-18**: Used for binary classification to distinguish between cloudy-contaminated images and normal images.

Through these methods, the paper aims to achieve high-precision automated image classification, thereby ensuring the acquisition of high-quality light-curve data and providing a reliable basis for subsequent research and analysis.

### Key Formulas

- **Calculation formula of GLCM (Gray-Level Co-occurrence Matrix)**: \[ g(i, j) = \#\{f(x_1, y_1) = i,\ f(x_2, y_2) = j \mid (x_1, y_1), (x_2, y_2) \in M \times N\} \] where \(x\) and \(y\) are the coordinates in the image, \(i\) and \(j\) are the row and column indices of the matrix \(g\), \(M\) and \(N\) are the number of rows and columns of the image, \(g\) is the gray-level co-occurrence matrix of the image \(f\), and \(\#\) represents the number of elements in the set.
- **Contrast**: \[ CON = \sum_{i = 1}^{M_g}\sum_{j = 1}^{N_g}(i - j)^2 g(i, j) \]
- **Inverse Difference Moment (IDM)**: \[ IDM = \sum_{i = 1}^{M_g}\sum_{j = 1}^{N_g}\frac{g(i, j)}{1 + (i - j)^2} \]
- **Energy**: \[ ENE = \sum_{i = 1}^{M_g}\sum_{j = 1}^{N_g}g(i, j)^2 \]
- **Correlation**: \[ COR = \sum_{i = 1}^{M_g}\sum_{j = 1}^{N_g}\frac{(i - \mu)(j - \mu)g(i, j)}{\sigma^2} \]
- **Entropy**: \[ ENT = -\sum_{i = 1}^{M_g}\sum_{j = 1}^{N_g}g(i, j)\log(g(i, j)) \]
- **Gray-level non-uniformity (G)**: \[ G = 10\lg\left(\frac{P_s}{P_n}\right) \] where \(P_s\) and \(P_n\) are the maximum and minimum standard deviations of the local image, respectively.

These formulas are used to extract features from the images; the extracted features are then used to train the support vector machine model for the classification task.
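To make the GLCM-based features concrete, the following is a minimal pure-Python sketch of how the matrix and five of the texture features above (CON, IDM, ENE, COR, ENT) could be computed. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the pixel offset, so a horizontal neighbor offset `(dx, dy) = (1, 0)` is assumed; pairs are counted in both directions so that \(g\) is symmetric (which is what lets a single \(\mu\) and \(\sigma^2\) be used in COR); \(g\) is normalized to sum to 1; gray levels are indexed from 0; and the natural logarithm is used for ENT. The function names `glcm` and `glcm_features` are hypothetical.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized gray-level co-occurrence matrix.

    image  : list of rows, each a list of integer gray levels in [0, levels)
    dx, dy : pixel offset defining which pairs are counted (assumed, not from the paper)
    """
    g = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < cols and 0 <= y2 < rows:
                i, j = image[y][x], image[y2][x2]
                g[i][j] += 1.0
                g[j][i] += 1.0  # count the pair in both directions (symmetric GLCM)
    total = sum(sum(row) for row in g)
    if total == 0:
        raise ValueError("no pixel pairs exist for this offset")
    return [[v / total for v in row] for row in g]


def glcm_features(g):
    """Contrast, IDM, energy, correlation and entropy of a normalized GLCM."""
    n = len(g)
    cells = [(i, j, g[i][j]) for i in range(n) for j in range(n)]
    con = sum((i - j) ** 2 * p for i, j, p in cells)
    idm = sum(p / (1 + (i - j) ** 2) for i, j, p in cells)
    ene = sum(p * p for _, _, p in cells)
    # Empty cells contribute nothing to the entropy (p * log p -> 0 as p -> 0).
    ent = -sum(p * math.log(p) for _, _, p in cells if p > 0)
    # For a symmetric GLCM the row and column marginals coincide,
    # so a single mean and variance suffice, as in the COR formula.
    mu = sum(i * p for i, _, p in cells)
    var = sum((i - mu) ** 2 * p for i, _, p in cells)
    if var > 0:
        cor = sum((i - mu) * (j - mu) * p for i, j, p in cells) / var
    else:
        cor = 0.0  # constant image: correlation is undefined, report 0
    return {"CON": con, "IDM": idm, "ENE": ene, "COR": cor, "ENT": ent}


# Small 4-level example image (illustrative data only).
demo_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
demo_features = glcm_features(glcm(demo_image, levels=4))
print(demo_features)
```

In practice, a feature vector like `demo_features` would be computed per image (possibly for several offsets) and passed to a classifier such as an SVM; real images would first need their pixel values quantized to a small number of gray levels.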