Boost recall in QSO selection from highly imbalanced photometric datasets

Giorgio Calderone, Francesco Guarneri, Matteo Porru, Stefano Cristiani, Andrea Grazian, Luciano Nicastro, Manuela Bischetti, Konstantina Boutsia, Guido Cupani, Valentina D'Odorico, Chiara Feruglio, Fabio Fontanot
2023-12-21
Abstract: Context. The identification of bright QSOs is of great importance to probe the intergalactic medium and address open questions in cosmology. Several approaches have been adopted to find such sources in currently available photometric surveys, including machine learning methods. However, the rarity of bright QSOs at high redshifts compared to contaminating sources (such as stars and galaxies) makes the selection of reliable candidates a difficult task, especially when high completeness is required. Aims. We present a novel technique to boost recall (i.e., completeness within the considered sample) in the selection of QSOs from photometric datasets dominated by stars, galaxies, and low-z QSOs (imbalanced datasets). Methods. Our method operates by iteratively removing sources whose probability of belonging to a noninteresting class exceeds a user-defined threshold, until the remaining dataset contains mainly high-z QSOs. Any existing machine learning method can be used as underlying classifier, provided it allows for a classification probability to be estimated. We applied the method to a dataset obtained by cross-matching PanSTARRS1, Gaia, and WISE, and identified the high-z QSO candidates using both our method and its direct multi-label counterpart. Results. We ran several tests by randomly choosing the training and test datasets, and achieved significant improvements in recall which increased from 50% to 85% for QSOs with z>2.5, and from 70% to 90% for QSOs with z>3. Also, we identified a sample of 3098 new QSO candidates on a sample of 2.6x10^6 sources with no known classification. We obtained follow-up spectroscopy for 121 candidates, confirming 107 new QSOs with z>2.5. Finally, a comparison of our candidates with those selected by an independent method shows that the two samples overlap by more than 90% and that both methods are capable of achieving a high level of completeness.
Instrumentation and Methods for Astrophysics
What problem does this paper attempt to address?
This paper addresses the difficulty of selecting high-redshift quasars (QSOs) from highly imbalanced photometric datasets. Specifically, the authors propose a method that improves recall (i.e., completeness within the considered sample) when identifying high-redshift QSOs in datasets dominated by stars, galaxies, and low-redshift QSOs. The main objectives of the study are:

1. **Improve recall**: High-redshift QSOs are very rare in such datasets, so standard classification methods struggle to identify them. The proposed method automatically rebalances the input dataset by iteratively removing sources whose probability of belonging to a noninteresting class exceeds a user-defined threshold, which raises recall.
2. **Address the dataset imbalance**: High-redshift QSOs are far outnumbered by other sources (such as stars and galaxies). This imbalance biases machine-learning models toward the majority classes and reduces their ability to recognize the minority class. The method reduces the imbalance by gradually removing sources that are unlikely to be high-redshift QSOs.
3. **Optimize multi-label classification**: The method is not limited to binary classification and extends to multi-label problems. Applied to datasets with several categories (such as stars, galaxies, low-redshift QSOs, and high-redshift QSOs), it identifies high-redshift QSOs more effectively.
4. **Verify new candidates**: To test the method, the authors applied it to a dataset obtained by cross-matching the PanSTARRS1 (DR2), Gaia (DR3), and WISE catalogs, and identified 3,098 new QSO candidates. Of these, 121 were subsequently observed spectroscopically, confirming 107 new QSOs with redshift greater than 2.5.

In summary, the core contribution of the paper is a novel reverse-selection method that significantly improves recall when selecting high-redshift QSOs from highly imbalanced photometric datasets, thereby providing more high-redshift QSO samples for cosmological research.
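The iterative removal described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`centroid_proba`, `iterative_removal`, `recall`), the toy nearest-centroid classifier, the class labels, and the one-dimensional toy data are all assumptions made for illustration. The summary only states that any classifier yielding class probabilities can be used and that sources whose probability of belonging to a noninteresting class exceeds a user-defined threshold are removed iteratively. Applying the same rejection rule to the labelled training set and stopping when nothing more is removed are also assumptions.

```python
import math


def centroid_proba(train_X, train_y, X):
    """Toy stand-in classifier (assumption): softmax of the negative squared
    distance to each class centroid. Returns (classes, probability rows)."""
    classes = sorted(set(train_y))
    centroids = []
    for c in classes:
        members = [x for x, y in zip(train_X, train_y) if y == c]
        centroids.append([sum(col) / len(members) for col in zip(*members)])
    probs = []
    for x in X:
        scores = [-sum((a - b) ** 2 for a, b in zip(x, m)) for m in centroids]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return classes, probs


def iterative_removal(train_X, train_y, X, target, threshold=0.9,
                      proba=centroid_proba, max_iter=20):
    """Return the indices of X that survive repeated rejection of sources
    whose probability for any non-target class exceeds `threshold`."""
    train = list(zip(train_X, train_y))
    keep = list(range(len(X)))
    for _ in range(max_iter):
        labels = {y for _, y in train}
        # Stop when only one class is left, the target is gone, or X is empty.
        if target not in labels or len(labels) < 2 or not keep:
            break
        tx = [x for x, _ in train]
        ty = [y for _, y in train]
        classes, p_train = proba(tx, ty, tx)
        _, p_keep = proba(tx, ty, [X[i] for i in keep])

        def rejected(row):
            return any(p > threshold
                       for c, p in zip(classes, row) if c != target)

        new_train = [t for t, row in zip(train, p_train) if not rejected(row)]
        new_keep = [i for i, row in zip(keep, p_keep) if not rejected(row)]
        if len(new_train) == len(train) and len(new_keep) == len(keep):
            break  # nothing was removed in this pass
        train, keep = new_train, new_keep
    return keep


def recall(selected, true_targets):
    """Fraction of the true targets that appear among the selected indices."""
    true_targets = set(true_targets)
    return len(true_targets & set(selected)) / len(true_targets)


# Toy one-dimensional "colour" feature; the values are made up.
train_X = [[0.0], [0.2], [-0.2], [5.0], [5.2], [4.8], [10.0], [10.2]]
train_y = ["star"] * 3 + ["galaxy"] * 3 + ["highz"] * 2
X = [[0.1], [5.1], [9.9], [8.5]]

survivors = iterative_removal(train_X, train_y, X, target="highz")
survivor_recall = recall(survivors, true_targets=[2, 3])
```

The design point this sketch tries to capture is that a source is never required to look like a high-redshift QSO to be kept. It only has to avoid being confidently assigned to a noninteresting class, which is why borderline objects such as the toy source at 8.5 can survive.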