Boost recall in QSO selection from highly imbalanced photometric datasets

Giorgio Calderone, Francesco Guarneri, Matteo Porru, Stefano Cristiani, Andrea Grazian, Luciano Nicastro, Manuela Bischetti, Konstantina Boutsia, Guido Cupani, Valentina D'Odorico, Chiara Feruglio, Fabio Fontanot

2023-12-21

Abstract:

Context. The identification of bright QSOs is of great importance to probe the intergalactic medium and address open questions in cosmology. Several approaches have been adopted to find such sources in currently available photometric surveys, including machine learning methods. However, the rarity of bright QSOs at high redshifts compared to contaminating sources (such as stars and galaxies) makes the selection of reliable candidates a difficult task, especially when high completeness is required.

Aims. We present a novel technique to boost recall (i.e., completeness within the considered sample) in the selection of QSOs from photometric datasets dominated by stars, galaxies, and low-z QSOs (imbalanced datasets).

Methods. Our method operates by iteratively removing sources whose probability of belonging to a noninteresting class exceeds a user-defined threshold, until the remaining dataset contains mainly high-z QSOs. Any existing machine learning method can be used as underlying classifier, provided it allows for a classification probability to be estimated. We applied the method to a dataset obtained by cross-matching PanSTARRS1, Gaia, and WISE, and identified the high-z QSO candidates using both our method and its direct multi-label counterpart.

Results. We ran several tests by randomly choosing the training and test datasets, and achieved significant improvements in recall which increased from 50% to 85% for QSOs with z>2.5, and from 70% to 90% for QSOs with z>3. Also, we identified a sample of 3098 new QSO candidates on a sample of $2.6\times10^6$ sources with no known classification. We obtained follow-up spectroscopy for 121 candidates, confirming 107 new QSOs with z>2.5. Finally, a comparison of our candidates with those selected by an independent method shows that the two samples overlap by more than 90% and that both methods are capable of achieving a high level of completeness.
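To make the terminology concrete, the sketch below computes recall (completeness) and, for contrast, precision on a small made-up selection. The labels, the `highz_qso` class name, and the helper functions are illustrative assumptions and are not taken from the paper. The last lines compute the fraction of followed-up candidates confirmed as z>2.5 QSOs from the numbers quoted in the abstract; note that this is a confirmation fraction of the observed candidates, not a recall.

```python
def recall(true_labels, selected, target="highz_qso"):
    """Fraction of true target sources that were selected (completeness)."""
    tp = sum(1 for lab, sel in zip(true_labels, selected) if lab == target and sel)
    fn = sum(1 for lab, sel in zip(true_labels, selected) if lab == target and not sel)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def precision(true_labels, selected, target="highz_qso"):
    """Fraction of selected sources that are true targets (purity)."""
    n_selected = sum(1 for sel in selected if sel)
    tp = sum(1 for lab, sel in zip(true_labels, selected) if lab == target and sel)
    return tp / n_selected if n_selected else float("nan")


# Made-up toy sample: 4 true high-z QSOs among 7 sources.
true_labels = ["star", "highz_qso", "galaxy", "highz_qso",
               "highz_qso", "star", "highz_qso"]
# Hypothetical selection: 5 sources picked, 3 of them true high-z QSOs.
selected = [False, True, True, True, False, True, True]

toy_recall = recall(true_labels, selected)        # 3 of 4 targets recovered
toy_precision = precision(true_labels, selected)  # 3 of 5 selections correct

# Numbers from the abstract: 121 candidates observed, 107 confirmed at z>2.5.
confirmed_fraction = 107 / 121

print(f"recall={toy_recall:.2f} precision={toy_precision:.2f} "
      f"confirmed={confirmed_fraction:.3f}")
```

A selection tuned for high recall typically accepts more contaminants, which is why the two quantities are reported separately.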

Instrumentation and Methods for Astrophysics

What problem does this paper attempt to address?

This paper addresses the difficulty of selecting high-redshift quasars (QSOs) from highly imbalanced photometric datasets. Specifically, the authors propose a new method to improve recall (i.e., completeness within the sample) when identifying high-redshift QSOs in datasets dominated by stars, galaxies, and low-redshift QSOs. The main objectives of the study are:

1. **Improve recall**: In highly imbalanced datasets, high-redshift QSOs are so rare that conventional classification methods struggle to identify them. The proposed method rebalances the input dataset by iteratively removing sources whose probability of belonging to a noninteresting class exceeds a user-defined threshold, thereby improving recall.
2. **Address the dataset imbalance problem**: High-redshift QSOs are far less numerous than other types of sources (such as stars and galaxies), so the dataset is highly imbalanced. This imbalance biases machine-learning models toward the majority classes and reduces their ability to recognize the minority class (here, high-redshift QSOs). The method reduces the imbalance by progressively removing sources that are unlikely to be high-redshift QSOs.
3. **Optimize multi-label classification**: The method applies to binary classification problems and also extends to multi-label classification problems. Applied to datasets containing several classes (such as stars, galaxies, low-redshift QSOs, and high-redshift QSOs), it identifies high-redshift QSOs more effectively than the direct multi-label classifier it was compared against.
4. **Verify new candidates**: To verify the effectiveness of the method, the authors applied it to a dataset obtained by cross-matching the PanSTARRS1 (DR2), Gaia (DR3), and WISE databases, and identified 3,098 new QSO candidates. Follow-up spectroscopy of 121 of these candidates confirmed 107 new QSOs with redshift greater than 2.5.

In summary, the core aim of the paper is to improve recall when selecting high-redshift QSOs from highly imbalanced photometric datasets through a novel reverse-selection method (in the authors' tests, recall rose from 50% to 85% for z > 2.5 and from 70% to 90% for z > 3), thereby providing more high-redshift QSO samples for cosmological research.
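The iterative removal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy centroid-based classifier, the one-dimensional features, the choice to apply the same rejection rule to both the training and the target set, and the 0.9 threshold are all hypothetical. Any classifier that returns per-class probabilities could be substituted for `fit_centroid_softmax`.

```python
import math


def fit_centroid_softmax(features, labels):
    """Toy stand-in classifier (hypothetical): softmax over the negative
    squared distance to each class centroid. Returns row -> {label: prob}."""
    sums, counts = {}, {}
    for row, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, value in enumerate(row):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict_proba(row):
        scores = {lab: -sum((a - b) ** 2 for a, b in zip(row, c))
                  for lab, c in centroids.items()}
        top = max(scores.values())  # subtract the max for numerical stability
        weights = {lab: math.exp(s - top) for lab, s in scores.items()}
        total = sum(weights.values())
        return {lab: w / total for lab, w in weights.items()}

    return predict_proba


def iterative_rejection(fit, train_x, train_y, target_x, keep_label,
                        threshold=0.9, max_iter=10):
    """Return indices of target rows that survive the iterative removal."""
    train_idx = list(range(len(train_x)))
    target_idx = list(range(len(target_x)))
    for _ in range(max_iter):
        labels = [train_y[i] for i in train_idx]
        if len(set(labels)) < 2:  # nothing left to discriminate against
            break
        predict_proba = fit([train_x[i] for i in train_idx], labels)

        def rejected(row):
            # Reject when any noninteresting class exceeds the threshold.
            probs = predict_proba(row)
            return any(p > threshold for lab, p in probs.items()
                       if lab != keep_label)

        new_train = [i for i in train_idx if not rejected(train_x[i])]
        new_target = [i for i in target_idx if not rejected(target_x[i])]
        if len(new_train) == len(train_idx) and len(new_target) == len(target_idx):
            break  # converged: no source was removed in this pass
        train_idx, target_idx = new_train, new_target
    return target_idx


# Made-up 1-D "color" feature: stars near 0, galaxies near 5, high-z QSOs near 10.
train_x = [[0.0], [0.2], [-0.1], [0.1], [5.0], [5.2], [4.9], [10.0], [9.8]]
train_y = ["star"] * 4 + ["galaxy"] * 3 + ["highz_qso"] * 2
target_x = [[0.05], [5.1], [9.9], [7.6]]

survivors = iterative_rejection(fit_centroid_softmax, train_x, train_y,
                                target_x, keep_label="highz_qso")
print(survivors)  # [2, 3]
```

In this toy run the ambiguous source at 7.6 survives because no noninteresting class reaches the threshold for it. Keeping such borderline sources is how a reject-the-obvious-contaminants strategy favors recall over purity.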

Efficient Selection of Quasar Candidates Based on Optical and Infrared Photometric Data Using Machine Learning

Identifying type II quasars at intermediate redshift with few-shot learning photometric classification

Quasar Photometric Redshifts and Candidate Selection: A New Algorithm Based on Optical and Mid-infrared Photometric Data

Machine Learning-based Search of High-redshift Quasars

Finding Quasars behind the Galactic Plane. I. Candidate Selections with Transfer Learning

Optimal Time-Series Selection of Quasars

Blind QSO reconstruction challenge: Exploring methods to reconstruct the Ly$\alpha$ emission line of QSOs

Efficient Identification of Broad Absorption Line Quasars using Dimensionality Reduction and Machine Learning

Discovering Bright Quasars at Intermediate Redshifts Based on Optical/Near-Infrared Colors

An automated method for finding the most distant quasars

Discovering the missing 2.2 < z < 3 quasars by combining optical variability and optical/near-infrared colors

Detecting the Highest Redshift (z > 8) QSOs in a Wide, Near Infrared Slitless Spectroscopic Survey

Fine-grained Photometric Classification Using Multi-model Fusion Method with Redshift Estimation

Photometric Selection of type 1 Quasars in the XMM-LSS Field with Machine Learning and the Disk-Corona Connection

Photometric Redshift Estimation of BASS DR3 Quasars by Machine Learning

Multi-wavelength properties of three new radio-powerful $z\sim5.6$ QSOs discovered from RACS

Measuring galaxy abundance and clustering at high redshift from incomplete spectroscopic data: Tests on mock catalogs

Measuring photometric redshifts for high-redshift radio source surveys