An automated method for finding the most distant quasars

Lena Lenz, Daniel J. Mortlock, Boris Leistedt, Rhys Barnett, Paul C. Hewett
2024-08-23
Abstract: Upcoming surveys such as Euclid, the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST) and the Nancy Grace Roman Telescope (Roman) will detect hundreds of high-redshift (z > 7) quasars, but distinguishing them from the billions of other sources in these catalogues represents a significant data analysis challenge. We address this problem by extending existing selection methods by using both i) Bayesian model comparison on measured fluxes and ii) a likelihood-based goodness-of-fit test on images, which are then combined using an Fbeta statistic. The result is an automated, reproducible and objective high-redshift quasar selection pipeline. We test this on both simulations and real data from the cross-matched Sloan Digital Sky Survey (SDSS) and UKIRT Infrared Deep Sky Survey (UKIDSS) catalogues. On this cross-matched dataset we achieve an AUC score of up to 0.795 and an F3 score of up to 0.79, sufficient to be applied to the Euclid, LSST and Roman data when available.
Instrumentation and Methods for Astrophysics, Astrophysics of Galaxies
What problem does this paper attempt to address?
This paper addresses how to automatically identify the most distant quasars (high-redshift quasars, \(z\gtrsim7\)) in upcoming wide-area sky surveys such as Euclid, LSST and Roman. Specifically, it aims to develop an automated method that reliably distinguishes these extremely rare objects from the billions of other sources in the survey catalogues.

### Main problems:

1. **Huge amount of data**: Future sky surveys will produce catalogues containing billions of sources. Efficiently picking out the small number of high-redshift quasars among them is a major challenge.
2. **Rarity of quasars**: High-redshift quasars are very rare. For example, each square degree of sky contains only about \(2.5\times 10^{-2}\) quasars with redshift greater than 7 (\(J = 23\)). Simple traditional methods such as color cuts therefore struggle to identify these targets effectively.
3. **Contamination**: The selection is also complicated by other astronomical sources (such as M/L/T/Y dwarfs and early-type galaxies) and by non-astronomical artifacts. These contaminants far outnumber the target quasars, which makes identification harder.

### Solutions:

To address these challenges, the paper proposes an automated selection method that combines Bayesian model comparison with an image-based goodness-of-fit test. The steps are as follows:

1. **Bayesian model comparison**: Apply Bayesian model comparison to the measured fluxes and calculate the probability that each source is a quasar. This step uses multi-band photometric data and takes into account the color characteristics of the different source populations.
2. **Image goodness-of-fit testing**: Carry out a pixel-level, likelihood-based analysis of the image data for each candidate and evaluate how well it fits the quasar model. This step further excludes sources whose images are inconsistent with a quasar.
3. **\(F_{\beta}\) statistic**: Combine the results of the two methods and use the \(F_{\beta}\) statistic to define the final selection threshold. The \(F_{\beta}\) statistic balances precision against recall, so that the scientific return is maximized under limited follow-up observing resources.

### Testing and verification:

To verify the method, the authors tested it on simulated data and on the real cross-matched SDSS-UKIDSS dataset. On the cross-matched dataset the method achieved an AUC score of up to 0.795 and an \(F_{3}\) score of up to 0.79, which the authors consider sufficient for it to be applied to the Euclid, LSST and Roman data when available.

### Formula explanation:

- **\(F_{\beta}\) statistic**: \[F_{\beta}=\frac{(1 + \beta^{2})\cdot\text{precision}\cdot\text{recall}}{\beta^{2}\cdot\text{precision}+\text{recall}}\]
  - When \(\beta = 1\), \(F_{1}\) is the harmonic mean of precision and recall.
  - When \(\beta>1\), more emphasis is placed on recall.

In this way, the paper provides a systematic and automated high-redshift quasar selection pipeline intended for use with future wide-area sky surveys.
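As an illustration of the \(F_{\beta}\) definition given in this summary, the minimal Python sketch below computes the statistic and scans for the score threshold that maximizes it. It is not the authors' pipeline: the function names `f_beta` and `best_threshold`, the per-source `scores`, and the toy labels are all hypothetical, and the sketch assumes each source has already been reduced to a single score where higher means more quasar-like.

```python
def f_beta(precision, recall, beta=3.0):
    """F_beta from precision and recall; returns 0.0 if the denominator is zero."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom


def best_threshold(scores, labels, beta=3.0):
    """Return (threshold, F_beta) for the threshold that maximizes F_beta.

    scores: hypothetical per-source scores (higher = more quasar-like)
    labels: True for known quasars, False for contaminants
    A source is selected when its score is >= the threshold.
    """
    n_true = sum(labels)
    best = (None, 0.0)
    # Only thresholds equal to an observed score can change the selection.
    for t in sorted(set(scores)):
        selected = [lab for s, lab in zip(scores, labels) if s >= t]
        tp = sum(selected)  # true positives among the selected sources
        precision = tp / len(selected) if selected else 0.0
        recall = tp / n_true if n_true else 0.0
        f = f_beta(precision, recall, beta)
        if f > best[1]:
            best = (t, f)
    return best


# Toy data (made up for illustration only): two quasars among five sources.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [True, False, True, False, False]
threshold, score = best_threshold(scores, labels, beta=3.0)
print(threshold, round(score, 3))
```

With \(\beta = 3\), the scan on this toy data prefers the lower threshold that recovers both quasars at the cost of one contaminant, which reflects the extra weight placed on recall when \(\beta>1\).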