Abstract: Many large raptor species are currently rare, and most of them are endangered. Details of their distribution, abundance, and survival are therefore the most important indicators for planning conservation and restoration measures and for assessing the impacts of anthropogenic transformation of the environment and/or climate change on the populations of these species. The abundance and spatial distribution of the birds under study are determined during field surveys. As a result, we obtain the distribution density in individuals, pairs, or nests per unit area (for example, pairs/100 km²), or the distance between nearest or all neighbors, expressed numerically (for example, 1–5 km, on average 3.5±1.1 km) and/or graphically (ranging from simple lines connecting observation points to Delaunay triangulation and a network of polygons built from observation points). Further, to generate an estimate of abundance, one must understand the area over which these data can be extrapolated. This is challenging for many researchers: an incorrect assessment of the area of the species’ habitat distorts the estimated abundance and negates the census effort. How can one correctly determine the area over which census data can be extrapolated? The answer can be found by modeling in a GIS environment using geographic layers of environmental and spatial information, or, in current terminology, by species distribution modeling (SDM). In SDM (also known as habitat or species range modeling), environmental data (climatic and spatial variables such as temperature, humidity, wind load, topography, land cover, and soils, which serve as predictor or independent variables) are extracted for georeferenced points of a species’ presence (the dependent variable), and the species distribution is predicted using computer algorithms and mathematical methods.
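The two field-survey summaries mentioned above (density per unit area and nearest-neighbor distance) can be illustrated with a minimal Python sketch. It is not taken from the authors' software; the function names and the nest coordinates are hypothetical, and the coordinates are assumed to be in a projected system with units of km.

```python
import math
from statistics import mean, stdev


def nearest_neighbor_distances(points):
    """For each point, return the distance to its nearest neighbor (input units)."""
    distances = []
    for i, (x1, y1) in enumerate(points):
        nearest = min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points)
            if j != i
        )
        distances.append(nearest)
    return distances


def density_per_100_km2(n_pairs, surveyed_area_km2):
    """Convert a count on a surveyed plot into pairs per 100 km2."""
    return 100.0 * n_pairs / surveyed_area_km2


# Hypothetical nest coordinates in km (projected coordinate system assumed).
nests = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (8.0, 4.0)]
nnd = nearest_neighbor_distances(nests)

print(f"nearest-neighbor distance: {mean(nnd):.2f} ± {stdev(nnd):.2f} km")
print(f"density: {density_per_100_km2(len(nests), 200.0):.1f} pairs/100 km2")
```

The brute-force search is quadratic in the number of points, which is usually acceptable for the small samples typical of rare raptors.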
SDM is carried out in six stages: (1) conceptualization, (2) data preparation (presence and absence points or background points), (3) method selection, (4) model fitting, (5) model evaluation, and (6) construction of a habitat or range map.
1. Conceptualization. At this stage, we formulate the main goal of the study and decide on the design of the modeling process based on our knowledge of the species and the study area. Selecting the data on the species and the environment is an important point at this initial stage. We decide whether to use only our own data or to add other available data; the latter will require some adjustments to the sampling design. Next, we need to test the basic assumptions underlying SDM: whether the species is in equilibrium with the available environmental variables, whether the data are biased in any way (sampling bias, spatial autocorrelation, etc.), whether the environment has changed since the data were collected, and so on. The selection of adequate environmental and spatial variables, the modeling algorithm, and the model complexity should be based on the study goals and on the hypothesis regarding the relationship between the species and the environment in the study area.
2. Data preparation. At this stage, we collect and process factual data about the species (both presence and absence points) and the environment. Particular attention should be paid to any inconsistencies in the spatial and temporal scale of dependent and independent variables, i.e. cases where there is a large spatial or temporal mismatch between species and environmental data, or among the environmental data themselves (spatial and climate variables). Special attention should also be paid to the quality of georeferencing of presence points and to the quality of species identification, which, as a rule, suffers greatly when data are collected by amateurs. In these cases, we need to decide whether to adjust the data or discard them.
All SDM algorithms require species absence information. If such information is not available, it is replaced by background points or “pseudo-absence” data, which naturally reduces the quality of the modeling, especially over large areas. If the modeling uses all the data collected and no further field testing of the model is planned, it should be decided in advance how the species data will be split for model training and model testing.
3. Method selection. At this stage, we select one modeling method, or several to be combined into ensemble models. While early modeling relied on simple factor or cluster analyses integrated into desktop GIS, today the choice of algorithms has expanded significantly.
Regression methods:
– Generalized linear model (GLM) (Nelder, Wedderburn, 1972);
– Generalized additive model (GAM) (Hastie, Tibshirani, 1990).
Machine learning methods:
– Maximum entropy method implemented in the MaxEnt program (Soberón, Peterson, 2005; Phillips et al., 2006; Phillips, Dudik, 2008);
– Random Forest (RF), an ensemble learning method for classification and regression that constructs multiple decision trees during training (Breiman, 2001);
– Boosted Regression Trees (BRT);
– Convolutional Neural Networks (CNN) (LeCun et al., 1989);
– Genetic Algorithm for Rule Set Production (GARP) (Stockwell, 1999; Stockwell, Peters, 1999);
– Support Vector Machines (SVM), also known as support-vector networks (Cortes, Vapnik, 1995; Vapnik et al., 1997);
– XGBoost (eXtreme Gradient Boosting, XGB) (Chen, Guestrin, 2016).
MaxEnt and Random Forest are integrated into ArcGIS, supported in R, and available online for Google Earth Engine (GEE) users. In recent years, GEE has become increasingly popular as a resource for SDM (Crego et al., 2022).
4. Fitting the model. This stage is key in SDM.
Having obtained preliminary modeling results, we assess multicollinearity among the variables and decide how to deal with it, determine how many variables can be included in the model without overfitting, assess spatial or temporal autocorrelation and decide how to deal with it, and determine the settings of the model or models and choose whether the final result is taken from the best model or from an average of several. At the same stage, we check the plausibility of the fitted relationships between the species’ presence points and the environmental variables by comparing coefficients and visually inspecting the response curves.
5. Model evaluation. At this stage, we evaluate the predictive performance of the final model using a set of validation or test data: AUC (ROC) (Fielding, Bell, 1997; Fawcett, 2006; Hosmer, Lemeshow, 2013), TSS (Liu et al., 2005; Allouche et al., 2006), R² and Kappa (Brownlee, 2016; Zhang et al., 2021). Cross-validation with spatial blocks is commonly used for this purpose (Roberts et al., 2017; Valavi et al., 2019; Crego et al., 2022). We also select thresholds to binarize predicted probabilities based on cross-validated predictions.
6. Constructing a map of habitats or range. This is the final stage of SDM, during which we convert the predictive model into a raster and obtain a classified image giving, for each pixel, the percentage probability of the species occurring in the study area. We calculate the probability threshold of species presence above which pixels are included in the final range map, and the size of the resulting habitat area. The expediency of using a buffer depends on the scale of the resulting raster; the smaller the scale, the lower the relevance of the buffer.
Buffer size is usually determined by the mean nearest neighbor distance (MND) and, depending on the goals and objectives of the modeling, is set to half, exactly, or twice the MND. One must always critically evaluate the underlying assumptions of SDM and be aware of the potential limitations associated with a variety of factors: the detectability of the species, uneven sampling, limitations in the selection of environmental variables, gaps in our knowledge of the species’ biology that hinder identifying patterns in its biotopic and territorial preferences, etc. SDM assumes that the species is in equilibrium with its environment, that both the species’ presence points and the environmental data are known and carefully selected, and that all the major factors determining the species’ range limits have been included. It should be understood that these assumptions do not always hold. Species, especially predators, respond dynamically to changes in the environment, so they exhibit spatial and temporal dynamics that need to be properly taken into account in the modeling. Important factors that determine a species’ response to changes in its habitat are its physiology, demography, ability to disperse, degree of tolerance to urbanization, degree of adaptation to changes in environmental factors, and interspecific interactions. All these factors act continuously over time, including here and now, and ignoring them can significantly distort modeling results. Therefore, the ideal option for SDM is to check the results in the field and adjust them. Unfortunately, most ornithologists have difficulty using R and desktop GIS, which prevents them from processing the results of their field research in accordance with modern standards.
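To make the evaluation and map-construction stages described above more concrete, the following Python sketch selects the threshold that maximizes TSS (sensitivity + specificity − 1) on hypothetical test predictions and then sums the area of pixels at or above that threshold. The scores, pixel values, and pixel size are invented for illustration, and maximizing TSS is only one of several possible threshold rules; this is not the procedure implemented in the authors' software.

```python
def tss_at_threshold(presence_scores, absence_scores, threshold):
    """True skill statistic: sensitivity + specificity - 1 at a given threshold."""
    sensitivity = sum(s >= threshold for s in presence_scores) / len(presence_scores)
    specificity = sum(s < threshold for s in absence_scores) / len(absence_scores)
    return sensitivity + specificity - 1.0


def best_threshold(presence_scores, absence_scores):
    """Return the observed score that maximizes TSS (first one wins on ties)."""
    candidates = sorted(set(presence_scores) | set(absence_scores))
    return max(
        candidates,
        key=lambda t: tss_at_threshold(presence_scores, absence_scores, t),
    )


def habitat_area_km2(pixel_probabilities, threshold, pixel_area_km2):
    """Area of all pixels whose predicted probability reaches the threshold."""
    return sum(p >= threshold for p in pixel_probabilities) * pixel_area_km2


# Hypothetical cross-validated predictions at test presence and absence points.
presence = [0.9, 0.8, 0.7, 0.6]
absence = [0.65, 0.4, 0.3, 0.2, 0.1]

threshold = best_threshold(presence, absence)
tss = tss_at_threshold(presence, absence, threshold)

# Hypothetical flattened prediction raster with 0.25 km2 pixels.
raster = [0.95, 0.7, 0.62, 0.55, 0.3, 0.1]
area = habitat_area_km2(raster, threshold, pixel_area_km2=0.25)

print(f"threshold={threshold}, TSS={tss:.2f}, habitat area={area} km2")
```

In practice the scores would come from spatial-block cross-validation rather than from a single hand-made split.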
To make modeling easier to apply in practice when working with rare species, we have created a software product that allows bird specialists who have minimal knowledge of GIS and programming languages, but a certain understanding of SDM algorithms and abundance assessment, to solve problems related to modeling the distribution, abundance, and survival of rare species. The product is designed for processing various geodata containing species observations, obtaining data from GEE rasters, classifying biotopes, and estimating population size, survival rates, etc. Its main interface is a web interface that allows the user to select the process of interest, enter the necessary data, and receive a link to an archive containing the processing results. Geodata (points, polygons, etc.) can be supplied as csv, shp, or geojson files, or entered manually on a map. For algorithms that need data from GEE rasters, a selection field offers a list of available Earth remote sensing (ERS) products: NASADEM (NASA JPL, 2020), MOD13A1.061 Terra Vegetation Indices 16-Day Global 500m (Didan, 2021), Geomorpho90m (Amatulli et al., 2020), Global Habitat Heterogeneity (Tuanmu, Jetz, 2015), Global Wind Atlas (Badger et al., 2021), WorldClim (Fick, Hijmans, 2017), ERA5-Land Monthly Aggregated – ECMWF Climate Reanalysis (Muñoz Sabater, 2019), ESA WorldCover 10m v100 (Zanaga et al., 2021), Dynamic World V1 (Brown et al., 2022), unclassified satellite data such as atmospherically corrected Landsat 8 Collection 2 surface reflectance (SR) (blue, green, red, near-infrared, and shortwave infrared 1 bands at 30 m spatial resolution) and ALOS-2 PALSAR L-band dual-polarization (HH and HV) SAR data, and NDVI and EVI values calculated from Landsat 8 images in GEE (using the normalizedDifference function).
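For readers unfamiliar with the two vegetation indices named above, here is a minimal plain-Python sketch of their standard formulas. It assumes the band values have already been converted to surface reflectance on a 0–1 scale; the function names are illustrative and do not reproduce the GEE API. Returning None for a zero denominator is a choice made for this sketch only.

```python
def normalized_difference(band_a, band_b):
    """(a - b) / (a + b); returns None where the denominator is zero."""
    total = band_a + band_b
    if total == 0:
        return None
    return (band_a - band_b) / total


def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return normalized_difference(nir, red)


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the standard coefficients (2.5, 6, 7.5, 1)."""
    denominator = nir + 6.0 * red - 7.5 * blue + 1.0
    if denominator == 0:
        return None
    return 2.5 * (nir - red) / denominator


# Hypothetical reflectance values for a single vegetated pixel.
nir, red, blue = 0.5, 0.1, 0.05
print(f"NDVI={ndvi(nir, red):.3f}, EVI={evi(nir, red, blue):.3f}")
```

Note that EVI needs the blue band and fixed coefficients, so it cannot be expressed as a plain two-band normalized difference.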
To run algorithms that use various third-party libraries, data are entered as csv files in the formats required by the corresponding libraries. At the current stage, the product includes the following modules:
1) Obtaining data from GEE rasters for given points (the result is a table of values sampled at the points from the rasters of the selected GEE collection);
2) Obtaining a classified raster for a given area and a set of presence and absence points of a species (training points) using the RF and MaxEnt classifiers based on GEE (given an area of interest, a set of training points, and selected remote sensing products from GEE, both classifiers produce a classified raster of the area of interest using the appropriate GEE raster methods; the selected models can be cross-validated and their predictive performance evaluated);
3) Three different methods to estimate population size: 3.1) Generation of random points in a regular network, a heuristic algorithm that, based on the species’ presence points and the surveyed areas, generates random points simulating the species’ distribution over the whole area of interest; 3.2) Distance, a method based on the Distance Sampling model (Thomas et al., 2010; Buckland et al., 2015; Miller et al., 2019) that takes an input file with the necessary variables for points and areas and outputs detailed statistics; 3.3) Simple site surveys, using a weighted average of the species’ distribution density (Karyakin, 2004) with an asymmetric confidence interval (Ravkin, Chelintsev, 1990);
4) Estimation of nest survival based on the RMark library (Laake, 2013). The survival module processes nest survival data using the nest survival model of the RMark library, which can account for various variables derived from remote sensing data and infers the importance of these variables for nest survival.
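The idea behind the simple site-survey approach (module 3.3) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the density is an area-weighted mean (total pairs divided by total surveyed area), and the interval is a percentile bootstrap over plots, substituted here for the Ravkin and Chelintsev formula, which is not reproduced. The plot counts, plot areas, and habitat area are hypothetical.

```python
import random


def weighted_density(plots):
    """Area-weighted mean density (pairs per km2) from (pairs, area_km2) plots."""
    total_pairs = sum(pairs for pairs, _ in plots)
    total_area = sum(area for _, area in plots)
    return total_pairs / total_area


def bootstrap_interval(plots, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for the weighted density (resampling plots)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = sorted(
        weighted_density([rng.choice(plots) for _ in plots])
        for _ in range(n_resamples)
    )
    lower = estimates[int(n_resamples * alpha / 2)]
    upper = estimates[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical survey plots: (pairs counted, surveyed area in km2).
plots = [(3, 120.0), (5, 150.0), (1, 90.0), (4, 140.0)]
habitat_area_km2 = 2500.0  # hypothetical habitat area taken from an SDM map

density = weighted_density(plots)
low, high = bootstrap_interval(plots)

print(f"density: {100 * density:.2f} pairs/100 km2")
print(
    f"abundance: {density * habitat_area_km2:.0f} pairs "
    f"({low * habitat_area_km2:.0f}-{high * habitat_area_km2:.0f})"
)
```

A percentile interval is generally not symmetric around the estimate, which is why it is used here as a stand-in for an asymmetric confidence interval.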
The software product is hosted on the servers of organizations recognized as undesirable in Russia, access to which is blocked by Roskomnadzor. The authors are considering options, including creating a clone on a Russian internet resource. This work is carried out with financial support from the Critical Ecosystem Partnership Fund (CEPF) within the framework of the project “Endangered Raptors Conservation on the Indo-Palaearctic Flyway”.
