LGEO2185: Designing a Geoproject

Author

Kristof Van Oost, Antoine Stevens & Valentin Charlier

Today’s program

  • Basic Project Assignment: discussion
  • Advanced geoprocessing: Geoprojects, structure, implementation
  • Open Science

Today’s Learning Objectives

  • Developing a geo-spatial model

Additional resources

1. Introduction

Today, we are interested in the factors that control the spatial distribution of a given species (habitat and suitability maps, a.k.a. species distribution models), and in predicting how these will evolve under projected climate change. Look at the example below from Kraemer et al. (2015). Dots represent the observed occurrences of mosquitoes in Europe, and the colour scale represents the suitability (ranging from 1, highly suitable, to 0, not suitable) of the climatic conditions for the given species:

[Illustration: observed occurrences and climatic suitability, from Kraemer et al. (2015)]

The above maps are the result of a statistical model where the species presence/absence is modelled as:

\[ \text{Species Presence/Absence} \sim \text{Pred}_1 + \text{Pred}_2 + \dots + \text{Pred}_n \]

Such a model can easily be constructed from a database with observations and predictor variables. Let’s see how we could do this by exploring the following data.frame. You can download the dataset (RDS) manually from here and copy it into your data/raw folder (or load it directly from R as done here):

pacman::p_load(tidyverse, here, corrplot, modEvA)
occurrence_df <- readRDS(url("https://uclouvain-geog.github.io/LGEO2185/Data/occurrence_data.Rds","rb"))
#glimpse(occurrence_df)

Now explore occurrence_df in the explorer. The variables wc2.1_10m_bio_* are derived from the WorldClim bioclimatic variables; more info here. You could further explore this dataset as follows:

# Exploratory analysis
summary(occurrence_df)
      pres     wc2.1_10m_bio_1  wc2.1_10m_bio_2  wc2.1_10m_bio_3
 Min.   :0.0   Min.   : 9.697   Min.   : 8.791   Min.   :43.38  
 1st Qu.:0.0   1st Qu.:15.402   1st Qu.:15.059   1st Qu.:59.02  
 Median :0.5   Median :16.228   Median :15.593   Median :68.30  
 Mean   :0.5   Mean   :17.964   Mean   :15.446   Mean   :63.72  
 3rd Qu.:1.0   3rd Qu.:20.387   3rd Qu.:16.393   3rd Qu.:68.30  
 Max.   :1.0   Max.   :28.201   Max.   :19.625   Max.   :78.58  
 wc2.1_10m_bio_4  wc2.1_10m_bio_5 wc2.1_10m_bio_6  wc2.1_10m_bio_7
 Min.   : 74.44   Min.   :19.18   Min.   :-4.774   Min.   :13.32  
 1st Qu.:174.52   1st Qu.:26.60   1st Qu.: 3.771   1st Qu.:22.83  
 Median :179.77   Median :28.13   Median : 3.771   Median :22.83  
 Mean   :287.15   Mean   :29.96   Mean   : 5.360   Mean   :24.60  
 3rd Qu.:372.98   3rd Qu.:33.66   3rd Qu.: 6.185   3rd Qu.:27.05  
 Max.   :798.73   Max.   :39.87   Max.   :20.015   Max.   :37.67  
 wc2.1_10m_bio_8  wc2.1_10m_bio_9  wc2.1_10m_bio_10 wc2.1_10m_bio_11
 Min.   : 7.099   Min.   : 7.106   Min.   :11.99    Min.   : 4.77   
 1st Qu.:16.409   1st Qu.:13.130   1st Qu.:17.48    1st Qu.:13.06   
 Median :17.919   Median :13.946   Median :19.02    Median :13.13   
 Mean   :20.020   Mean   :15.985   Mean   :21.29    Mean   :14.26   
 3rd Qu.:24.269   3rd Qu.:18.132   3rd Qu.:25.42    3rd Qu.:14.89   
 Max.   :31.271   Max.   :29.048   Max.   :31.27    Max.   :27.02   
 wc2.1_10m_bio_12 wc2.1_10m_bio_13 wc2.1_10m_bio_14 wc2.1_10m_bio_15
 Min.   :  95.0   Min.   : 16.0    Min.   : 0.000   Min.   : 44.81  
 1st Qu.: 598.0   1st Qu.:134.0    1st Qu.: 5.000   1st Qu.: 84.40  
 Median : 790.0   Median :163.0    Median : 7.000   Median : 94.89  
 Mean   : 784.5   Mean   :163.7    Mean   : 9.265   Mean   : 90.26  
 3rd Qu.: 807.5   3rd Qu.:183.0    3rd Qu.: 8.000   3rd Qu.: 94.89  
 Max.   :3171.0   Max.   :621.0    Max.   :84.000   Max.   :132.59  
 wc2.1_10m_bio_16 wc2.1_10m_bio_17 wc2.1_10m_bio_18 wc2.1_10m_bio_19
 Min.   :  42.0   Min.   :  0.00   Min.   : 23.0    Min.   :  4.0   
 1st Qu.: 334.2   1st Qu.: 23.00   1st Qu.:192.5    1st Qu.: 29.0   
 Median : 457.0   Median : 28.00   Median :233.0    Median : 29.0   
 Mean   : 441.4   Mean   : 35.68   Mean   :261.7    Mean   : 54.8   
 3rd Qu.: 482.5   3rd Qu.: 30.00   3rd Qu.:264.0    3rd Qu.: 52.0   
 Max.   :1634.0   Max.   :298.00   Max.   :908.0    Max.   :477.0   
correlation <- cor(occurrence_df, use = "complete.obs") # there are some NAs
corrplot::corrplot(correlation)

# Model
myMod <- glm(pres ~ ., data = occurrence_df, family = "binomial")
summary(myMod)

Call:
glm(formula = pres ~ ., family = "binomial", data = occurrence_df)

Coefficients:
                   Estimate Std. Error    z value Pr(>|z|)    
(Intercept)       8.444e+15  1.960e+08   43079725   <2e-16 ***
wc2.1_10m_bio_1   4.465e+14  2.575e+07   17339519   <2e-16 ***
wc2.1_10m_bio_2   5.506e+14  1.483e+07   37139611   <2e-16 ***
wc2.1_10m_bio_3  -1.027e+14  2.884e+06  -35626035   <2e-16 ***
wc2.1_10m_bio_4  -2.189e+14  1.304e+06 -167874437   <2e-16 ***
wc2.1_10m_bio_5  -1.494e+20  5.718e+12  -26128375   <2e-16 ***
wc2.1_10m_bio_6   1.494e+20  5.718e+12   26128245   <2e-16 ***
wc2.1_10m_bio_7   1.494e+20  5.718e+12   26128291   <2e-16 ***
wc2.1_10m_bio_8  -6.085e+13  3.033e+06  -20066054   <2e-16 ***
wc2.1_10m_bio_9  -1.063e+14  3.336e+06  -31849589   <2e-16 ***
wc2.1_10m_bio_10  8.554e+15  5.342e+07  160139279   <2e-16 ***
wc2.1_10m_bio_11 -8.251e+15  6.270e+07 -131600469   <2e-16 ***
wc2.1_10m_bio_12 -5.690e+12  9.138e+04  -62266261   <2e-16 ***
wc2.1_10m_bio_13 -7.067e+13  3.531e+05 -200156841   <2e-16 ***
wc2.1_10m_bio_14 -1.317e+14  2.747e+06  -47939478   <2e-16 ***
wc2.1_10m_bio_15 -1.497e+13  5.389e+05  -27773210   <2e-16 ***
wc2.1_10m_bio_16  3.045e+13  1.803e+05  168893340   <2e-16 ***
wc2.1_10m_bio_17  7.677e+13  1.088e+06   70574912   <2e-16 ***
wc2.1_10m_bio_18  4.376e+11  6.719e+04    6512111   <2e-16 ***
wc2.1_10m_bio_19 -7.408e+12  2.105e+05  -35196123   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  701.46  on 505  degrees of freedom
Residual deviance: 1441.75  on 486  degrees of freedom
AIC: 1481.7

Number of Fisher Scoring iterations: 23
modEvA::Dsquared(myMod, adjust = TRUE) # proportion of deviance explained by a GLM
[1] 0.7853886
# Check the importance of the variables: the dropterm function removes the
# variables one by one and tests their importance in the model selection procedure.
library(MASS)
# dropterm(myMod, test = "Chisq")
# Now modify your model, or remove highly correlated predictors and then run a
# stepwise AIC-based selection
library(caret)
# remove highly correlated numeric predictors (cutoff = 0.8)
num <- sapply(occurrence_df, is.numeric)
num["pres"] <- FALSE
drop <- findCorrelation(cor(occurrence_df[, num], use = "pairwise.complete.obs"),
                        cutoff = 0.8, names = TRUE)

df_red <- occurrence_df[, !(names(occurrence_df) %in% drop)]
# GLM + AIC stepwise
myMod_selection <- step(
  glm(pres ~ ., data = df_red, family = binomial()),
  direction = "both",
  trace = 0
)

summary(myMod_selection)

Call:
glm(formula = pres ~ wc2.1_10m_bio_2 + wc2.1_10m_bio_3 + wc2.1_10m_bio_5 + 
    wc2.1_10m_bio_15 + wc2.1_10m_bio_18, family = binomial(), 
    data = df_red)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -14.183563  10.980151  -1.292  0.19645    
wc2.1_10m_bio_2    0.805971   0.286009   2.818  0.00483 ** 
wc2.1_10m_bio_3    0.483176   0.101225   4.773 1.81e-06 ***
wc2.1_10m_bio_5   -1.207718   0.189578  -6.371 1.88e-10 ***
wc2.1_10m_bio_15   0.127579   0.026768   4.766 1.88e-06 ***
wc2.1_10m_bio_18  -0.036733   0.007235  -5.077 3.84e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 701.46  on 505  degrees of freedom
Residual deviance: 155.04  on 500  degrees of freedom
AIC: 167.04

Number of Fisher Scoring iterations: 9
modEvA::Dsquared(myMod_selection, adjust = TRUE) # proportion of deviance explained by a GLM
[1] 0.7789729

If you have spatial data describing the predictors, you could apply the model and create spatial maps. That is what we will do below!

2. Objective

The objective of our project is to map the habitat distribution of a given species, in a given area, at a given time: under current and future climatic conditions. The output should be a change analysis from current to future conditions (2050, 2070, etc.)… Something like this:

Cool! Now let’s break this down and work step by step
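Once a current and a future suitability raster exist, the change analysis itself boils down to raster algebra. A minimal sketch with synthetic terra rasters (stand-ins for the predict() outputs we build later in this session):

```r
# Change analysis sketch: the "change" layer is simply the difference between a
# future and a current suitability raster. Synthetic rasters stand in for the
# predict() outputs produced later on.
library(terra)

set.seed(42)
current <- rast(nrows = 10, ncols = 10, vals = runif(100)) # suitability now
future  <- rast(nrows = 10, ncols = 10, vals = runif(100)) # suitability, e.g. 2050

change <- future - current # positive = projected gain, negative = projected loss
plot(change, main = "Suitability change (future - current)")
```

With suitabilities bounded in [0, 1], the change layer is bounded in [-1, 1], which makes a diverging colour scale a natural choice when mapping it.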

3. Setting up a workflow

3.1 Summary

A geoproject’s workflow can be summarized as follows:

  • Identify a clear goal

  • Decompose the framework into small steps:

    • the conceptual aspect
    • the practical aspect : the data I need and the methodology
    • the coding aspect.

As always, consider reproducibility and clear coding ;-) We can conceptualize the project as follows:

[Concept diagram]

3.2 Define data needs

So what kind of data do we need?

+   Species occurrences (GBIF)
+   Species absences: randomly generated
+   Climatic variables (Worldclim) + Vegetation
+   A given species
+   An Area of Interest

Before jumping into the workflow, take your time to understand your data. Here we will work with WorldClim (see sessions 1 to 3) and GBIF. Note that we can extract occurrences (sightings) of a species using the sp_occurrence function of the geodata package. This links to the Global Biodiversity Information Facility (GBIF); see here for more info. Make sure you understand the different options of the sp_occurrence function. Create a model which is able to produce a species suitability map as output (in class, this is developed below). Modify your code in order to get predictions for multiple species and/or future scenarios (assignment).
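One sp_occurrence() option worth knowing before downloading anything: with download = FALSE the function should return only the number of matching GBIF records (behaviour inherited from dismo::gbif; confirm with ?sp_occurrence). This lets you check whether a species/region combination has enough data to be worth modelling:

```r
library(geodata)

# Query the number of GBIF records without downloading them
# (download = FALSE returns a record count; verify with ?sp_occurrence).
n <- sp_occurrence(genus = "ambystoma", species = "mexicanum",
                   download = FALSE)
if (n < 40) {
    warning("Few occurrences: any suitability model will be fragile.")
}
n
```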

3.3 Visualize the workflow

flowchart TD
  A[["<b>INPUT</b>
    1. Set ROI (SpatVector)
    2. Set Species: name, genus (string)
    "]]
    
  B[["<b>MAIN STEPS</b>
    1. Get predictors (SpatRaster)
    2. Get occurrences data (Data.Frame)
    3. Extract predictor values at occurrence and absence locations (Data.Frame)
    4. Fit a model (glm object)
    5. Spatial predictions using fitted model (SpatRaster)
    "]]

  B1[["<b>Get Predictors</b><br>1. Download worldclim & cmip6 (SpatRaster) <br>2. Stack <br>3. Crop & mask <br>"]]
  B2[["<b>Get occurrence</b><br>1. Download GBIF (Data.Frame) <br>2. Add random absences <br>3. Create a dependent variable (0/1) <br>"]]

  C[["<b>OUTPUT</b>
        1. Suitability maps
        2. Plots"]]

  A --> PROCESSING
  PROCESSING --> C
  subgraph PROCESSING
   B --> |STEP 1 Details| B1 
   B --> |STEP 2 Details| B2 
    end
 
  style A text-align:left
  style B text-align:left
  style C text-align:left  
  style B1 text-align:left
  style B2 text-align:left

Note

In species distribution modeling (SDM), we typically need both presence and absence data to train a predictive model. However, for many species, true absence data is not available (since most datasets only record where a species has been found, not where it does not occur). This is where pseudo-absence selection comes in: we generate artificial “absence” points in places where the species might not exist.

4. Coding first steps

At first, let’s try to write the code for a specific example, i.e. without any loops and/or function calls. We try to model a specific species, the axolotl (Ambystoma mexicanum), which can be found in Mexico.

4.0 Environment

# Load libraries and set parameters
pacman::p_load(geodata, terra, leaflet, mapview)
setwd("path/to/my/folder")
options(geodata_default_path = file.path("data", "raw"))

4.1 Load Predictors

From the start, think about the parameters of the function (e.g., country name); this will make it easy to turn your code into a function later. Let’s start with downloading predictor variables for a given AOI. Note that we define just a resolution, model and scenario for now. We download a long list of potential climatic predictors from the WorldClim database, and we use the bio variables. See here for more info.

# Create AOI
region <- "MEX"
region <- gadm(country = region, level = 0)

## STEP 1 - create RasterStack with predictors
res <- 10
model <- "ACCESS-CM2"
ssp <- "126"

predictors <- worldclim_global(var = "bio", res = res, path = "data/raw/") # current climate; model/ssp will be used later for the cmip6 futures (https://worldclim.org/data/bioclim.html)
predictors <- predictors %>%
    crop(region) %>%
    mask(region)
plot(predictors)

4.2 Load occurrences

Now, let’s download occurrence for a given species

## STEP 2 - Load sightings data and verification
genus <- "ambystoma"
species <- "mexicanum"
# Download observations
sightings <- sp_occurrence(genus = genus, species = species, ext = region, sp = TRUE, download = TRUE, geo = TRUE)
sightings <- vect(sightings)
crs(sightings) <- crs(region)
plot(sightings)
plot(region, add = TRUE)

4.3 Prepare occurrences for modelling

## STEP 3 - Create presence/absence database
set.seed(1975)
absence <- terra::spatSample(predictors, size = length(sightings), method = "random", na.rm = TRUE)
absence$pres <- 0
presence <- terra::extract(predictors, sightings, df = TRUE)
presence$pres <- 1
occurrence_df <- dplyr::bind_rows(absence, presence) %>% dplyr::select(-ID)
#glimpse(occurrence_df)

4.4 Logistic Model

## STEP 4 - Fit a model
myMod <- glm(pres ~ ., data = occurrence_df, family = "binomial")
summary(myMod)

Call:
glm(formula = pres ~ ., family = "binomial", data = occurrence_df)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -9.224e+01  8.745e+01  -1.055 0.291511    
wc2.1_10m_bio_1  -4.252e+00  1.095e+01  -0.388 0.697734    
wc2.1_10m_bio_2  -1.752e+01  5.942e+00  -2.948 0.003194 ** 
wc2.1_10m_bio_3   4.519e+00  1.519e+00   2.975 0.002931 ** 
wc2.1_10m_bio_4  -2.020e+00  5.998e-01  -3.368 0.000757 ***
wc2.1_10m_bio_5   1.162e+06  1.190e+06   0.976 0.328921    
wc2.1_10m_bio_6  -1.162e+06  1.190e+06  -0.976 0.328925    
wc2.1_10m_bio_7  -1.162e+06  1.190e+06  -0.976 0.328928    
wc2.1_10m_bio_8   9.413e+00  5.544e+00   1.698 0.089529 .  
wc2.1_10m_bio_9  -3.640e+00  1.252e+00  -2.908 0.003636 ** 
wc2.1_10m_bio_10  5.880e+01  1.938e+01   3.035 0.002408 ** 
wc2.1_10m_bio_11 -7.737e+01  2.554e+01  -3.029 0.002450 ** 
wc2.1_10m_bio_12 -4.326e-01  1.138e-01  -3.802 0.000144 ***
wc2.1_10m_bio_13 -1.301e+00  3.964e-01  -3.282 0.001030 ** 
wc2.1_10m_bio_14  1.885e+00  1.011e+00   1.864 0.062382 .  
wc2.1_10m_bio_15 -1.343e+00  3.826e-01  -3.511 0.000446 ***
wc2.1_10m_bio_16  1.184e+00  3.298e-01   3.592 0.000329 ***
wc2.1_10m_bio_17 -8.868e-01  4.138e-01  -2.143 0.032110 *  
wc2.1_10m_bio_18 -1.400e-01  6.957e-02  -2.013 0.044154 *  
wc2.1_10m_bio_19  3.394e-01  1.371e-01   2.475 0.013313 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 665.421  on 479  degrees of freedom
Residual deviance:  54.914  on 460  degrees of freedom
AIC: 94.914

Number of Fisher Scoring iterations: 15

4.5 Predict spatially

# STEP 5 - Creating suitability layers
suitability <- predict(predictors, myMod, type = "response") ### Rastermap

4.6 Create a map

And finally, visualize the map

# STEP 6 - Visualize
# you can blank the regions below a threshold
suitability <- ifel(suitability < 0.01, NA, suitability)
p <- mapview(suitability, col.regions = "red", na.color = "transparent", layer.name = "Suitability", legend = FALSE) +
    mapview(sightings, cex = 2, popup = FALSE, layer.name = "Observations")
p

Note that the pattern shown by the suitability map does not perfectly match the observations. There are certainly many ways to improve this! Look e.g. here for a good reference on Species Distribution Modelling.

5. Generalizing the code

Now that it works for one example, we can create a function which is basically a copy-paste of the above, but where we expose the parameters as function arguments. Note the use of a list as output: it contains the suitability in SpatRaster format, the model performance, as well as maps. A list is very useful for combining different types of data. Note that naming the list items helps to reference them more easily afterwards (see below).
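The idea of a named list holding mixed output types can be seen in isolation (toy values, not the real function output):

```r
# A named list can bundle objects of different types into one return value;
# items are then retrieved by name with $ or [[ ]].
results <- list(
    Dsq   = 0.92,                            # a single performance metric
    coefs = c(intercept = -1.2, bio1 = 0.4), # a named numeric vector
    notes = "toy example, not real model output"
)

results$Dsq          # 0.92
results[["notes"]]   # "toy example, not real model output"
names(results)       # "Dsq" "coefs" "notes"
```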

#### Wrap this in a function
my_suitability_fct <- function(region, genus, species, res) {
    ### Description: Function to model the climatic suitability of an area for a species.
    ### Inputs:
    ## region: region of interest (SpatVector)
    ## genus and species of interest (see https://www.gbif.org/)
    ## res (numeric): resolution: 2.5, 5 or 10
    ### Output: a list with a leaflet suitability map, a SpatRaster suitability map, and a metric for model performance (Dsquared)
    require(terra)
    require(tidyverse)
    require(leaflet)
    require(geodata)
    require(modEvA)

    ## STEP 1 - create RasterStack with predictors
    predictors <- worldclim_global(
        var = "bio",
        res = res,
        path = "data/raw/"
    ) # https://worldclim.org/data/bioclim.html
    predictors <- predictors %>%
        crop(region) %>%
        mask(region)

    ## STEP 2 - Load sightings data and verification
    sightings <- sp_occurrence(
        genus = genus, species = species,
        ext = region, sp = TRUE, download = TRUE, geo = TRUE
    )
    sightings <- vect(sightings)
    crs(sightings) <- crs(region)

    ## STEP 3 - Create presence/absence database
    set.seed(1975)
    absence <- terra::spatSample(predictors,
        size = length(sightings), method = "random", na.rm = TRUE
    )
    absence$pres <- 0
    presence <- terra::extract(predictors, sightings, df = TRUE)
    presence$pres <- 1
    occurrence_df <- dplyr::bind_rows(absence, presence) %>%
        dplyr::select(-ID)

    ## STEP 4 - Fit a model
    myMod <- glm(pres ~ ., data = occurrence_df, family = "binomial")
    # proportion of deviance explained by a GLM
    Dsq <- modEvA::Dsquared(myMod, adjust = TRUE)
    # STEP 5 - Creating suitability layers
    suitability <- predict(predictors, myMod, type = "response") ### Rastermap

    # STEP 6 - Visualize
    suitability <- ifel(suitability < 0.01, NA, suitability)

    # we use leaflet here
    pal <- colorNumeric(c("#FFFFCC", "#41B6C4", "#0C2C84"),
        domain = c(0, 1),
        na.color = "transparent"
    )
    map <- leaflet::leaflet() %>%
        leaflet::addTiles() %>%
        leaflet::addRasterImage(suitability, colors = pal, opacity = 0.8) %>%
        leaflet::addLegend(
            pal = pal, values = terra::values(suitability),
            title = "Suitability"
        )
    results <- list(
        "suitability_leaflet" = map,
        "suitability_raster" = suitability,
        "Dsq" = Dsq
    )
    return(results)
}

We can now test this function:

# test function
ROI <- gadm(country = "MEX", level = 0, path = "data/raw/")
output <- my_suitability_fct(
    region = ROI,
    genus = "ambystoma", species = "mexicanum", res = 10
)

Explore what is contained in the variable; note that you can use $ to extract items from the list:

str(output, 1)
List of 3
 $ suitability_leaflet:List of 8
  ..- attr(*, "class")= chr [1:2] "leaflet" "htmlwidget"
  ..- attr(*, "package")= chr "leaflet"
 $ suitability_raster :S4 class 'SpatRaster' [package "terra"]
 $ Dsq                : num 0.917
output$suitability_leaflet
Note

The above function is far from perfect. Further developments could be to (see also next session):

  1. Improve modelling by:
    • Adding more predictors: Consider using other predictors linked to habitat suitability, like altitude, human activity, etc.
    • Enhancing predictor selection: The function currently uses all bioclimatic variables from worldclim_global(), but feature selection could improve model accuracy by reducing multi-collinearity and improving interpretability (e.g., perform a variance inflation factor (VIF) analysis).
    • Using more advanced modeling techniques: Instead of a simple glm, consider more flexible machine learning models like random forests, MaxEnt, or boosted regression trees to better capture non-linear relationships.
    • Estimating and reporting model performance: Perform cross-validation and other validation techniques.
    • Incorporating spatial autocorrelation: The function does not account for spatial dependency in species occurrences, which can lead to biased estimates.
    • Improving absence data selection: The current random sampling for pseudo-absences may not be optimal¹. Instead, consider using environmental filtering or a buffer-based approach to create more meaningful absence points:
      • Target-group background sampling: Select pseudo-absences from locations where the same survey effort has been conducted (e.g., using background species occurrence records).
      • Environmental filtering: Instead of choosing pseudo-absences randomly, exclude areas with very different environmental conditions from presence locations.
      • Buffer-based selection: Select pseudo-absences outside a certain radius around presence points to reduce bias.
    • Performing data quality checks: E.g., check that the function returns enough occurrences for robust modelling, such as:

if (length(sightings) < 40) {
    stop("Not enough occurrences (i.e. < 40) in the selected region
        to model the suitability map.")
}

We could also check the quality of the occurrences retrieved. See e.g. here or here.

  2. Improve code robustness and maintenance by:
    • Improving function modularity: The function currently performs multiple tasks at different levels of abstraction. Breaking the function into smaller helper functions for AOI creation, predictor extraction, occurrence processing, model fitting, and visualization would make it easier to debug and maintain.
    • Adding error handling:
      • Check that the predictors’ coordinate reference system (CRS) aligns with the species data, and that the extraction of raster values at occurrence locations returns something
      • Use tryCatch()² to handle potential failures when loading external data (e.g., failed GBIF queries, missing climate data):
predictors <- tryCatch(
    {
        worldclim_global(var = "bio", res = res, path = "data/raw/")
    },
    error = function(e) {
        stop("Failed to retrieve WorldClim data. Exiting.")
    }
)
    • Checking input types and values: The function assumes valid inputs but does not validate them. Add assertions such as:
      • region should be a valid ISO country code or a spatial object.
      • genus and species should be character strings.
      • res should be a numeric value (2.5, 5, or 10), with a warning if an unsupported value is provided.

E.g.:

assertthat::assert_that(inherits(region, "SpatVector"), msg = "Region must be a 'SpatVector' object.")
assertthat::assert_that(res %in% c(2.5, 5, 10), msg = "Resolution must be one of: 2.5, 5, or 10.")
    • Improving code documentation & output interpretability:
      • Explain what the suitability scores represent (e.g., probability of presence, model confidence, etc.).
      • Improve documentation using e.g. roxygen2
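As an illustration of the VIF analysis suggested above, here is a self-contained sketch on synthetic data (it assumes the car package is available; on the real data you would call car::vif(myMod) after fitting):

```r
# Variance inflation factors flag multi-collinear predictors; a common rule of
# thumb treats VIF > 10 as problematic. Synthetic data make the point clearly:
library(car)

set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1) # x2 nearly duplicates x1
y  <- rbinom(200, 1, plogis(x1))

m <- glm(y ~ x1 + x2, family = binomial)
vif(m) # both values are very large, flagging the x1/x2 collinearity
```

In practice you would drop (or combine) one of each highly collinear pair before refitting, much like the findCorrelation() step shown in section 1.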
Note

You can create such improved code using GenAI. The following code was generated using this simple prompt: “i) remove the leaflet output map from the function. ii) Improve code robustness and maintenance by: - Breaking the function into smaller helper functions for AOI creation, predictor extraction, occurrence processing, model fitting. - Add error handling using tryCatch() Check predictors’ coordinate reference system (CRS) align with the species data and extraction of the raster information at occurrence location return something (failed GBIF queries, missing climate data). The variable res can only be 10, 5, 2.5, and 0.5 (minutes of a degree). remove ssp. Thanks!”

#' Species climatic suitability modelling (GLM)
#'
#' Fits a binomial GLM using WorldClim bioclim predictors and GBIF occurrences,
#' then predicts climatic suitability across a region of interest.
#'
#' The function downloads climate predictors, retrieves species occurrences,
#' builds a presence–pseudoabsence dataset, fits a GLM, and returns a suitability raster.
#'
#' @param region SpatVector. Area of interest polygon (terra::SpatVector).
#' Must have a valid CRS.
#'
#' @param genus Character. Genus name (GBIF taxonomy).
#'
#' @param species Character. Species name.
#'
#' @param res Numeric. Resolution of WorldClim predictors in minutes of degree.
#' Must be one of: `10`, `5`, `2.5`, `0.5`.
#'
#' @param climate_path Character. Local directory where WorldClim data will be stored.
#' Default: `"data/raw/"`.
#'
#' @param seed Numeric. Random seed for pseudo-absence generation.
#' Default: `1975`.
#'
#' @param pa_ratio Numeric. Ratio of pseudo-absences to presences.
#' Default: `1`.
#'
#' @param suitability_min Numeric. Minimum probability threshold below which
#' values are set to NA in output raster.
#' Default: `0.01`.
#'
#' @return A list with:
#' \describe{
#'   \item{suitability_raster}{SpatRaster. Predicted suitability map.}
#'   \item{Dsq}{Numeric. Adjusted deviance explained (Dsquared).}
#'   \item{model}{glm object. Fitted GLM model.}
#'   \item{occurrence_df}{Data frame used to fit the model.}
#' }
#'
#' @details
#' Workflow:
#' 1. Download WorldClim bioclim predictors.
#' 2. Crop/mask to region of interest.
#' 3. Download GBIF occurrences.
#' 4. Generate pseudo-absences.
#' 5. Fit binomial GLM.
#' 6. Predict suitability.
#'
#' Error handling ensures:
#' - CRS consistency
#' - valid extraction at occurrence locations
#' - successful data downloads
#'
#' @examples
#' \dontrun{
#' library(terra)
#'
#' region <- vect(system.file("ex/lux.shp", package="terra"))
#'
#' out <- my_suitability_fct(
#'   region = region,
#'   genus = "Quercus",
#'   species = "robur",
#'   res = 10
#' )
#'
#' plot(out$suitability_raster)
#' out$Dsq
#' }
#'
#' @export
my_suitability_fct <- function(region,
                               genus,
                               species,
                               res,
                               climate_path = "data/raw/",
                               seed = 1975,
                               pa_ratio = 1,
                               suitability_min = 0.01) {

  # ---- checks ----
  if (!inherits(region, "SpatVector"))
    stop("`region` must be a terra::SpatVector")

  if (is.na(terra::crs(region)) || terra::crs(region) == "")
    stop("`region` must have a CRS")

  allowed_res <- c(10, 5, 2.5, 0.5)
  if (!res %in% allowed_res)
    stop("`res` must be one of: 10, 5, 2.5, 0.5")

  # ---- packages ----
  req <- c("terra", "geodata", "modEvA", "dplyr")
  miss <- req[!vapply(req, requireNamespace, logical(1), quietly = TRUE)]
  if (length(miss))
    stop("Missing packages: ", paste(miss, collapse = ", "))

  # ---- helpers ----
  load_predictors <- function() {
    tryCatch({
      geodata::worldclim_global(
        var = "bio",
        res = res,
        path = climate_path
      )
    }, error = function(e) {
      stop("Climate download failed: ", e$message)
    })
  }

  prep_predictors <- function(pred) {
    if (!terra::same.crs(pred, region))
      region_proj <- terra::project(region, terra::crs(pred))
    else
      region_proj <- region

    pred |>
      terra::crop(region_proj) |>
      terra::mask(region_proj)
  }

  load_occurrences <- function() {
    tryCatch({
      s <- geodata::sp_occurrence(
        genus = genus,
        species = species,
        ext = region,
        sp = TRUE,
        download = TRUE,
        geo = TRUE
      )
      s <- terra::vect(s)
      terra::crs(s) <- "EPSG:4326" # GBIF data is typically in WGS84

      if (nrow(s) == 0)
        stop("No occurrences found in region.")

      s
    }, error = function(e) {
      stop("GBIF query failed: ", e$message)
    })
  }

  build_pa <- function(pred, sightings) {

    if (!terra::same.crs(pred, sightings))
      sightings <- terra::project(sightings, terra::crs(pred))

    pres <- terra::extract(pred, sightings, df = TRUE)
    pres$pres <- 1

    pres <- pres[stats::complete.cases(pres), ]
    if (nrow(pres) == 0)
      stop("Extraction returned only NA values.")

    n_abs <- max(1, round(nrow(pres) * pa_ratio))

    set.seed(seed)
    abs <- terra::spatSample(pred,
                             size = n_abs,
                             method = "random",
                             na.rm = TRUE)

    abs <- as.data.frame(abs)
    abs$pres <- 0

    dplyr::bind_rows(abs, pres) |>
      dplyr::select(-dplyr::any_of("ID"))
  }

  # ---- pipeline ----
  predictors <- load_predictors()
  predictors <- prep_predictors(predictors)

  sightings <- load_occurrences()

  occurrence_df <- build_pa(predictors, sightings)

  model <- stats::glm(pres ~ ., data = occurrence_df, family = stats::binomial())

  Dsq <- modEvA::Dsquared(model, adjust = TRUE)

  suitability <- terra::predict(predictors, model, type = "response")
  suitability <- terra::ifel(suitability < suitability_min, NA, suitability)

  list(
    suitability_raster = suitability,
    Dsq = Dsq,
    model = model,
    occurrence_df = occurrence_df
  )
}
# test function
ROI <- gadm(country = "MEX", level = 0, path = "data/raw")
output <- my_suitability_fct(
    region = ROI,
    genus = "ambystoma", species = "mexicanum", res = 10
)

Assignment

Building on what we have done here together, the challenge is now to further develop this model and to make it broader by also predicting future suitability maps using the CMIP6 climate projections. Make sure that your model can predict current suitability and suitability for 2030, 2050 and 2070, for example using a loop over the periods: years <- c("1970-2000", "2021-2040", "2041-2060", "2061-2080")
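A possible skeleton for that loop is sketched below. This is a starting point, not a full solution: it assumes myMod, region and predictors exist as in section 4, the geodata::cmip6_world() arguments should be checked against ?cmip6_world, and the CMIP6 layers carry different names than the WorldClim ones, so they must be renamed before predict():

```r
# Sketch of the future-suitability loop. Assumes myMod, region and predictors
# exist as in section 4; argument names follow geodata::cmip6_world().
library(geodata)
library(terra)

future_times <- c("2021-2040", "2041-2060", "2061-2080")
suit_future <- list()

for (tt in future_times) {
    pred_f <- cmip6_world(model = "ACCESS-CM2", ssp = "126", time = tt,
                          var = "bioc", res = 10, path = "data/raw/")
    # layer names must match those the model was fitted with
    names(pred_f) <- names(predictors)
    pred_f <- mask(crop(pred_f, region), region)
    suit_future[[tt]] <- predict(pred_f, myMod, type = "response")
}
```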

Also make sure your model can handle different types of climate models and scenarios.

The output should be a list containing leaflet maps for the current situation, leaflet maps showing the change for each interval relative to the current conditions, and the suitability maps for each time interval in SpatRaster format.

Have fun!!

References

Kraemer, Moritz UG, Marianne E Sinka, Kirsten A Duda, Adrian QN Mylne, Freya M Shearer, Christopher M Barker, Chester G Moore, et al. 2015. “The Global Distribution of the Arbovirus Vectors Aedes Aegypti and Ae. Albopictus.” Elife 4: e08347. https://elifesciences.org/articles/08347.

Footnotes

  1. E.g., if pseudo-absences are placed in highly unsuitable environments (e.g., deserts for a rainforest species), the model may overestimate how distinct presence locations are↩︎

  2. Look alternatively at purrr::possibly()↩︎