Abstract: Patterns of species distribution have long been one of the important topics of ecological study (Brown and Lomolino 1998). In this brief communication, we introduce a new program, GeoSVM, that uses a support vector machine (SVM) to predict species’ potential distributions. (GeoSVM is now available at http://www.unm.edu/~wyzuo/GEO.htm.) We also report our evaluation of GeoSVM’s performance. Using data for 30 species of Rhododendron in China as a case study, we compared GeoSVM with the Genetic Algorithm for Rule-Set Prediction (GARP), one of the most popular models for predicting species’ potential distributions. We found that GeoSVM is more accurate and efficient than GARP. Furthermore, GeoSVM can handle more environmental information, which significantly improves the prediction accuracy.

Patterns of species distribution can potentially answer many fundamental questions in ecology: where the original habitats of a species are, how species are distributed on earth, how species achieve their distribution patterns, how the distribution patterns of different species relate to one another and how to set policy to conserve endangered species. The development of computer technology and machine learning methods enables the use of environmental factors to simulate species’ potential distributions. Various statistical models have been explored in previous work for predicting species distributions, e.g. generalized linear models, generalized additive models, logistic regression, neural networks, decision trees, principal components analysis (PCA), Mahalanobis distance, the maximum entropy method, genetic algorithms and regression tree analysis (see the survey in Zuo et al. 2007). These statistical models have been commonly used in a wide range of other applications. However, when they are applied to the prediction of potential species distributions, a common problem arises: high dimensionality combined with small sample size.
This problem is caused by the nature of the task: the prediction of potential species distributions generally depends on specimen data. These data are accumulated through fieldwork, which is expensive and difficult and therefore limits the quantity of data available. There are >400 species of Rhododendron in China, but only 161 of them have >20 location samples (the lower limit of sample size for GARP). On the other hand, >100 environmental factors can potentially affect species distribution, including meteorological factors such as annual, monthly, maximum and minimum values of temperature, precipitation and relative humidity, as well as geographical factors such as altitude, slope, soil type and vegetation type. Most statistical methods rely on the large-sample assumption that ‘the number of samples is much larger than the number of parameters’. This assumption clearly does not hold for species distribution data. In this situation, such models usually perform well on training samples but badly on new testing data, a phenomenon called ‘overtraining’. Some dimension-reducing methods, such as PCA, can mitigate this problem, but only to some extent.

SVM is a model for classification and regression based on statistical learning theory, created by Vapnik (1995) at AT&T Bell Labs. It is based on the structural risk minimization principle, an improvement over the traditional empirical risk minimization principle. Because of its outstanding empirical performance, SVM has been well accepted by many scientific communities (Gunn 1998).

We implemented a system for predicting potential species distributions, called GeoSVM, based on SVM. The detailed system architecture of GeoSVM is described in Zuo et al. (2007). First, GeoSVM randomly generates five times as many negative sample points as positive ones. GeoSVM assumes that the species does not exist at negative sample points.
Each negative sample is given a weight of 1/5 and each positive sample a weight of 1. Environmental features are extracted from the environmental digital map based on the training samples’ locations. These environmental
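
The negative-sampling and weighting step described above can be illustrated with a minimal Python sketch. It is not GeoSVM's actual code: the function names, the rectangular bounding box, the uniform sampling and the coordinates are all assumptions made for illustration, and the sketch does not exclude negatives that happen to fall near a presence record. It shows only the 5:1 ratio and the 1 versus 1/5 weights, under which the total weight of the negative samples equals that of the positive samples (5n × 1/5 = n).

```python
import random


def sample_negatives(positives, bounds, ratio=5, seed=0):
    """Draw ratio * len(positives) random points inside a bounding box.

    bounds is (min_lon, max_lon, min_lat, max_lat). Uniform sampling over a
    rectangle is an assumption of this sketch, not a documented GeoSVM detail.
    """
    rng = random.Random(seed)
    min_lon, max_lon, min_lat, max_lat = bounds
    n_neg = ratio * len(positives)
    return [
        (rng.uniform(min_lon, max_lon), rng.uniform(min_lat, max_lat))
        for _ in range(n_neg)
    ]


def build_training_set(positives, negatives, ratio=5):
    """Return (points, labels, weights).

    Positive samples get label +1 and weight 1; negative samples get
    label -1 and weight 1/ratio.
    """
    points = list(positives) + list(negatives)
    labels = [1] * len(positives) + [-1] * len(negatives)
    weights = [1.0] * len(positives) + [1.0 / ratio] * len(negatives)
    return points, labels, weights


# Hypothetical presence records as (longitude, latitude); not real specimen data.
positives = [(100.2, 26.9), (99.7, 27.8), (101.5, 25.1), (98.9, 28.4)]
bounds = (73.0, 135.0, 18.0, 54.0)  # rough illustrative bounding box

negatives = sample_negatives(positives, bounds)
points, labels, weights = build_training_set(positives, negatives)

# Total weight per class: the two sums should agree up to rounding error.
pos_weight = sum(w for w, y in zip(weights, labels) if y == 1)
neg_weight = sum(w for w, y in zip(weights, labels) if y == -1)
print(len(positives), len(negatives))
print(pos_weight, round(neg_weight, 9))
```

Per-sample weights of this kind can be passed to an SVM implementation that supports them, for example through the `sample_weight` argument of scikit-learn's `SVC.fit`. Whether GeoSVM applies its weights in this way is not stated in the text.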
