LGEO2185: Efficient Spatial Data I/O

Author

Kristof Van Oost, Antoine Stevens & Valentin Charlier

Learning Objectives

  • Understand the limitations of traditional spatial data formats (e.g., shapefiles).

  • Discuss modern spatial data formats, specifically:

    • GeoParquet: A columnar, efficient format for vector data.

    • Cloud Optimized GeoTIFFs (COGs): Optimized for cloud-based storage and partial reads of large raster datasets.

  • Access and query cloud-based data stores:

    • STAC (SpatioTemporal Asset Catalog): A standard for indexing and accessing spatiotemporal datasets.

1. Introduction

Spatial data is growing rapidly in both volume and complexity. Traditional file formats—such as the ESRI shapefile for vectors and the GeoTIFF for rasters—have served geospatial professionals well for decades, but they present limitations in terms of efficiency, scalability, and interoperability. Modern file formats have emerged that address these issues by being more compact, faster to query, and better suited for big data and cloud-based environments.

Note

For our examples, we will use a sample dataset from Open Buildings for Africa and data from the Microsoft Planetary Computer.

2. Vector

2.1 Geopackage

The Geopackage (.gpkg) format is a modern, open standard for storing geospatial data that leverages the power of an SQLite database. Unlike traditional shapefiles—which rely on multiple files to store geometry, spatial reference information, and attribute data—a geopackage consolidates all of this information into a single, portable file.

This capability, combined with its compact design, ensures high interoperability across various software platforms like R, QGIS, and ArcGIS, while offering improved performance and storage efficiency.
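Because a geopackage is literally an SQLite database, you can inspect one with generic database tools. Below is a minimal, self-contained sketch (it writes a throwaway layer to a temporary file first; the layer name `demo` is purely illustrative):

```r
pacman::p_load("sf", "DBI", "RSQLite")

# Write a tiny polygon layer to a temporary geopackage
tmp <- tempfile(fileext = ".gpkg")
poly <- sf::st_sf(
    id = 1,
    geometry = sf::st_sfc(
        sf::st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0)))),
        crs = 4326
    )
)
sf::st_write(poly, tmp, layer = "demo", quiet = TRUE)

# A geopackage is just an SQLite database: open it with DBI and list its
# tables -- our layer sits next to the gpkg_* metadata tables
con <- DBI::dbConnect(RSQLite::SQLite(), tmp)
tbls <- DBI::dbListTables(con)
print(tbls)
DBI::dbDisconnect(con)
```

At a higher level, sf::st_layers(tmp) reports the same layer information without touching the database API.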

Below is an example of writing a shapefile and a geopackage using the sf package in R. We also compare file sizes and performance:

pacman::p_load("readr", "sf", "tictoc", "tidyverse", "geodata", "terra")

# Load sample data from a CSV and convert to an sf object
# Data from Open Buildings for Africa
# https://sites.research.google/gr/open-buildings/
# Use the n_max argument if you do not want this to take too long, e.g.
x <- readr::read_csv("https://storage.googleapis.com/open-buildings-data/v3/polygons_s2_level_4_gzip/183_buildings.csv.gz", n_max = 10e6)
x <- sf::st_as_sf(x, wkt = "geometry", crs = 4326)
glimpse(x)
Rows: 10,000,000
Columns: 6
$ latitude       <dbl> -0.19348984, -0.04053038, -0.79517830, -1.09686864, -2.…
$ longitude      <dbl> 35.56646, 37.63605, 37.46274, 37.20256, 37.53664, 36.65…
$ area_in_meters <dbl> 13.6828, 39.3581, 21.6735, 9.9701, 17.2219, 24.4886, 33…
$ confidence     <dbl> 0.7385, 0.7824, 0.6809, 0.7498, 0.6932, 0.7223, 0.8489,…
$ geometry       <POLYGON [°]> POLYGON ((35.56649 -0.19349..., POLYGON ((37.63…
$ full_plus_code <chr> "6GFQRH48+JH4H", "6GFVXJ5P+QCMP", "6GFV6F37+W3HW", "6GC…
# Write to ESRI Shapefile (multiple files created)
tic("Writing Shapefile")
# Let's write only a portion to disk
sf::st_write(x[1:10000, ], dsn = file.path("data", "raw", "buildings_kenya.shp"), append = FALSE)
Deleting layer `buildings_kenya' using driver `ESRI Shapefile'
Writing layer `buildings_kenya' to data source 
  `data/raw/buildings_kenya.shp' using driver `ESRI Shapefile'
Writing 10000 features with 5 fields and geometry type Polygon.
# Note that column names will be abbreviated if longer than 10(!) characters
toc()
Writing Shapefile: 0.315 sec elapsed
# Write to Geopackage (single file)
tic("Writing Geopackage")
sf::st_write(x[1:10000, ], dsn = file.path("data", "raw", "buildings_kenya.gpkg"), layer = "buildings_poly", append = FALSE)
Deleting layer `buildings_poly' using driver `GPKG'
Writing layer `buildings_poly' to data source 
  `data/raw/buildings_kenya.gpkg' using driver `GPKG'
Writing 10000 features with 5 fields and geometry type Polygon.
toc()
Writing Geopackage: 0.179 sec elapsed
# Compare file sizes. Note that file.size() on the .shp reports the geometry
# file only; a complete shapefile also includes .dbf, .shx and .prj sidecar files
cat("Shapefile size:", file.size(file.path("data", "raw", "buildings_kenya.shp")) / 1024^2, "Mb\n")
Shapefile size: 1.343575 Mb
cat("Geopackage size:", file.size(file.path("data", "raw", "buildings_kenya.gpkg")) / 1024^2, "Mb\n")
Geopackage size: 26.16406 Mb

Often, a geopackage will be smaller than a shapefile due to its efficient, single-file design and optimized storage of attribute data. However, as discussed in this Reddit thread, the outcome can vary depending on the dataset's characteristics; our example above, where the geopackage turns out larger, is a case in point.

In practice, if your dataset contains many individual features with complex geometries (i.e., a high vertex count) and relatively little attribute data, the overhead of storing binary geometries (in Well-Known Binary format) within the geopackage can result in a larger file compared to the shapefile version. Conversely, if your features have simpler geometries and a richer set of attribute data, geopackages tend to be more compact.

Furthermore, in contrast to shapefiles, which are limited to one feature type per file, a geopackage can store multiple layers within the same file. This means you can organize different datasets (such as points, lines, and polygons) as separate layers in one geopackage, simplifying data management and distribution. Here is how it works:

# Cast to point geometry: drop the polygon geometry first, then build points
# from the longitude/latitude columns
xp <- x[1:10000, c("longitude", "latitude")] |>
    sf::st_drop_geometry() |>
    sf::st_as_sf(coords = c("longitude", "latitude"), crs = 4326)

# Now write the centroids, note the argument 'append = TRUE' to add a new layer
sf::st_write(xp, dsn = file.path("data", "raw", "buildings_kenya.gpkg"), layer = "buildings_centroids", append = TRUE)
Updating layer `buildings_centroids' to data source `data/raw/buildings_kenya.gpkg' using driver `GPKG'
Updating existing layer buildings_centroids
Writing 10000 features with 0 fields and geometry type Point.

Partial reads

When your data is stored in a geopackage, you can use SQL queries or the wkt_filter argument of st_read() to retrieve only a subset of features or columns, without loading everything into memory.

# Partial read from a geopackage using an SQL query
buildings_partial <- sf::st_read(
    dsn = file.path("data", "raw", "buildings_kenya.gpkg"),
    query = "SELECT * FROM buildings_poly WHERE area_in_meters <= 50 AND longitude >= 37"
)
Reading query `SELECT * FROM buildings_poly WHERE area_in_meters <= 50 AND longitude >= 37'
from data source `/Users/Antoine_Stevens/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Projects/LGEO/Efficient Spatial IO/data/raw/buildings_kenya.gpkg' 
  using driver `GPKG'
Simple feature collection with 2788 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 37.00007 ymin: -3.974375 xmax: 39.97876 ymax: -3.869191e-05
Geodetic CRS:  WGS 84
# Same using wkt_filter argument
bbox <- st_bbox(c(xmin = 36.0, ymin = -2.5, xmax = 37.0, ymax = -1.5), crs = st_crs(4326)) |>
    st_as_sfc() |> # Cast to sf
    st_as_text() # Convert to wkt

buildings_bbox <- st_read(
    dsn = file.path("data", "raw", "buildings_kenya.gpkg"),
    layer = "buildings_poly",
    wkt_filter = bbox
)
Reading layer `buildings_poly' from data source 
  `/Users/Antoine_Stevens/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Projects/LGEO/Efficient Spatial IO/data/raw/buildings_kenya.gpkg' 
  using driver `GPKG'
Simple feature collection with 69 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 36.05732 ymin: -2.49275 xmax: 36.98253 ymax: -1.503822
Geodetic CRS:  WGS 84

2.2 Geoparquet

GeoParquet leverages the Apache Parquet format for storing tabular data in a columnar, binary form. Key benefits of Parquet include:

  • Compactness: Parquet files are typically smaller than other data formats, thanks to efficient encoding and built-in compression.
  • Rich schema information: Parquet files store data types (e.g., date, numeric, categorical) alongside the data.
  • Column-oriented Storage: This arrangement enables efficient reading of only the necessary columns, which is especially beneficial for large datasets.
  • Chunked storage: Data can be partitioned into chunks (or partitions), making it easier to work with subsets of data without reading the entire dataset into memory.
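The column-selection benefit is easy to demonstrate on a plain data frame before we move to spatial data; here is a small sketch using the built-in mtcars dataset:

```r
pacman::p_load("arrow")

# Write a plain data frame to Parquet
tmp <- tempfile(fileext = ".parquet")
arrow::write_parquet(mtcars, tmp)

# Read back only two columns: thanks to columnar storage, the
# remaining nine columns are never read from disk
mpg_cyl <- arrow::read_parquet(tmp, col_select = c("mpg", "cyl"))
ncol(mpg_cyl) # 2
```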

To work with GeoParquet you will need the arrow and sfarrow packages. We can write a parquet file to disk as follows; let's also measure performance and disk space while doing so:

pacman::p_load("arrow", "sfarrow")
# Write to geoparquet (single file)
tic("Writing geoparquet")
sfarrow::write_sf_dataset(x, file.path("data", "raw", "buildings_kenya"), format = "parquet")
toc()
Writing geoparquet: 43.795 sec elapsed
cat("Geoparquet size:", file.size(file.path("data", "raw", "buildings_kenya", "part-0.parquet")) / 1024^2, "Mb\n")
Geoparquet size: 1141.453 Mb

That was fast, considering we wrote the full 10 million features this time rather than the 10,000 used in the shapefile and geopackage examples.

Partitioning

Parquet files can be partitioned by one or more columns, which means the data is split into subdirectories based on the values of those columns. This partitioning scheme is extremely beneficial when querying large datasets because it allows processing engines (like Apache Arrow, Spark, or even dplyr in R) to quickly “prune” or skip entire subsets of data that are irrelevant to a specific query.

There are no strict rules for partitioning datasets—the optimal approach depends on your data structure, access patterns, and the systems processing the data. Experimentation is often necessary to determine the best partitioning strategy. As a general guideline, Apache Arrow recommends:

  • Avoiding files smaller than 20MB or larger than 2GB
  • Keeping the total number of files under 10,000
  • Partitioning by variables that are frequently used in filtering, as this allows Arrow to efficiently read only the necessary files

To create partitions, we can use the dplyr interface. We could partition our building dataset by 1° geographic cells (using the floor of longitude and latitude) and by a condition on building size (e.g., buildings larger than a given threshold).

# Divide the data into coordinates buckets and save to disk using Parquet
# Create partitions using group_by
x |>
    group_by(
        longitude = floor(longitude),
        latitude = floor(latitude),
        small_buildings = area_in_meters <= 50
    ) |>
    write_sf_dataset(file.path("data", "raw", "buildings_kenya_partitioned"),
        format = "parquet"
    )

Let’s take a look at what we just produced:

p <- file.path("data", "raw", "buildings_kenya_partitioned")
tibble(
    files = list.files(p, recursive = TRUE),
    size_MB = file.size(file.path(p, files)) / 1024^2
)
# A tibble: 56 × 2
   files                                                         size_MB
   <chr>                                                           <dbl>
 1 longitude=34/latitude=-1/small_buildings=false/part-0.parquet  57.6  
 2 longitude=34/latitude=-1/small_buildings=true/part-0.parquet   74.9  
 3 longitude=34/latitude=-2/small_buildings=false/part-0.parquet   5.96 
 4 longitude=34/latitude=-2/small_buildings=true/part-0.parquet   17.7  
 5 longitude=34/latitude=-3/small_buildings=false/part-0.parquet   0.103
 6 longitude=34/latitude=-3/small_buildings=true/part-0.parquet    0.133
 7 longitude=34/latitude=-4/small_buildings=false/part-0.parquet   0.244
 8 longitude=34/latitude=-4/small_buildings=true/part-0.parquet    0.550
 9 longitude=34/latitude=-5/small_buildings=false/part-0.parquet   0.325
10 longitude=34/latitude=-5/small_buildings=true/part-0.parquet    0.976
# ℹ 46 more rows

This creates a nested set of files. The file names follow a “self-describing” convention from the Apache Hive project: Hive-style partitioning names folders with a key=value pattern, so, as you might guess, the folder longitude=34/latitude=-1/small_buildings=false/ contains all the data meeting those three conditions.
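The key=value layout is easy to reproduce on a tiny non-spatial example; here is a sketch with the built-in mtcars data, partitioned by cylinder count:

```r
pacman::p_load("arrow", "dplyr")

# Grouping before write_dataset() turns each group into a Hive-style partition
d <- tempfile()
mtcars |>
    dplyr::group_by(cyl) |>
    arrow::write_dataset(d, format = "parquet")

# The folder names encode the partitioning variable and its values
list.files(d) # "cyl=4" "cyl=6" "cyl=8"
```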

Now, let’s see how to read this back. The first step is to open the dataset using arrow:

tic()
ds <- arrow::open_dataset(file.path("data", "raw", "buildings_kenya_partitioned"),
    partitioning = c("longitude", "latitude", "small_buildings")
) # must be specified when reading partitioned data
toc()
0.093 sec elapsed

Notice how fast it is. Let's look at what we got:

ds
FileSystemDataset with 56 Parquet files
7 columns
area_in_meters: double
confidence: double
geometry: binary
full_plus_code: string
longitude: int32
latitude: int32
small_buildings: string

See $metadata for additional Schema metadata
# What is the median building size in Kenya?
median(ds$area_in_meters)
NULL

But wait… where is my data? How do I access it? When you run open_dataset(), here's what happens under the hood:

  • Metadata scanning: Arrow scans the directory structure (including the partition folders) and reads metadata from each Parquet file. This metadata includes the schema, column types, and partitioning information.

  • Lazy dataset creation: Rather than loading all data into memory immediately, Arrow creates a lazy dataset object. This object holds references to all the underlying files and their structure. It doesn’t actually read the full contents of the files until you explicitly request it.

  • Accessing columns: When you type ds$area_in_meters, you're not immediately pulling all of that column's data into R. Instead, you're getting a lazy reference to the column. The actual data is only read from disk when you perform an operation that forces evaluation (such as using collect() to load data into a data frame).

In essence, open_dataset() sets up a blueprint of your data on disk. It allows you to perform efficient, on-demand queries by reading only the necessary parts of the dataset when needed. This design is particularly beneficial when working with very large datasets, as it minimizes memory usage and speeds up processing by avoiding unnecessary I/O operations.

For example, if you want to actually get the median of building area, you would do:

median_area_in_meters <- ds |>
    summarize(area_in_meters = median(area_in_meters)) |> # nothing is happening yet
    collect() # now do the calculation!
median_area_in_meters
# A tibble: 1 × 1
  area_in_meters
           <dbl>
1           31.9

Here, collect() forces Arrow to execute the query and retrieve the actual data from disk.

Similarly, when you split a large dataset into partitions, Arrow inspects the metadata and reads only the partitions relevant to your query. This selective access minimizes I/O operations and speeds up processing. Here is an example of the approach. Suppose you want to extract the footprint of all the largest buildings in Nairobi.

First, to know which 1° cells fall within Nairobi, get its bounding box.

options(geodata_default_path = file.path("data", "raw"))
nairobi_bb <- gadm(country = "KEN", level = 1) |>
    sf::st_as_sf() |>
    filter(NAME_1 == "Nairobi") |>
    st_bbox()

Now we can extract…

# take first the floor to align with 1° cells
nairobi_bb_f <- floor(nairobi_bb)
# Process
tic()
buildings_filtered <- ds |>
    filter(
        longitude %in% nairobi_bb_f[1]:nairobi_bb_f[3],
        latitude %in% nairobi_bb_f[2]:nairobi_bb_f[4],
        small_buildings == FALSE
    ) |>
    read_sf_dataset()
toc()
32.032 sec elapsed
glimpse(buildings_filtered)
Rows: 809,796
Columns: 7
$ area_in_meters  <dbl> 79.1494, 239.7302, 139.8830, 131.1544, 131.9859, 132.3…
$ confidence      <dbl> 0.8145, 0.8988, 0.8540, 0.7899, 0.8381, 0.7927, 0.9041…
$ full_plus_code  <chr> "6GCRPWJF+V7HC", "6GCRPWH9+3G6J", "6GCRQP94+MWRV", "6G…
$ longitude       <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36…
$ latitude        <int> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2…
$ small_buildings <chr> "false", "false", "false", "false", "false", "false", …
$ geometry        <POLYGON [°]> POLYGON ((36.92323 -1.26786..., POLYGON ((36.9…

Blazing fast!

Let's check that this worked by plotting the average building size per grid cell:

# Create a tmap of buildings
pacman::p_load(tmap)

# First, create a grid of 25 by 25 cells
grid <- sf::st_make_grid(buildings_filtered, n = c(25, 25), square = FALSE) |>
    sf::st_sf() |>
    mutate(quadrant = row_number())

# Compute mean per grid
buildings_filtered <- buildings_filtered |>
    st_join(grid) |>
    st_drop_geometry() |>
    group_by(quadrant) |>
    summarize(area_in_meters = mean(area_in_meters))

# Join means to grid
grid <- grid |>
    left_join(st_drop_geometry(buildings_filtered))
# grid$area_in_meters
tmap_mode("view")
tm_basemap("OpenStreetMap") +
    tm_shape(grid) +
    tm_polygons(
        fill = "area_in_meters",
        fill_alpha = 0.5,
        fill.scale = tm_scale_continuous(values = "viridis"),
        legend.show = FALSE
    )

3. Raster

In this section, we turn to raster data and demonstrate how to work with modern raster formats from R. The following examples illustrate how to access COGs and query STAC catalogs (e.g., on the Microsoft Planetary Computer).

3.1 Cloud Optimized GeoTIFFs (COGs)

Cloud Optimized GeoTIFFs (COGs) are a specialized form of the standard GeoTIFF format, designed specifically for efficient storage and retrieval of large raster datasets in cloud environments. Here’s what makes COGs stand out:

  • Internal tiling:
    COGs are organized into a grid of smaller tiles. This tiling allows clients to retrieve only the specific portions of the raster that are needed, rather than downloading the entire file.

  • Embedded overviews:
    They include lower-resolution copies (overviews) of the data within the same file. These overviews speed up visualization and analysis at various scales, enabling rapid zooming and panning.

  • HTTP Byte-Range support:
    COGs are optimized for cloud access because they support HTTP byte-range requests. This means that when a client requests a specific area, only the relevant parts of the file are transmitted, reducing bandwidth usage and latency.

  • Compatibility:
    Although built on the GeoTIFF standard, COGs follow specific conventions that make them particularly well-suited for modern cloud-based workflows. They are supported by a variety of GIS software and libraries, making them a popular choice for distributing large-scale geospatial data (e.g., satellite imagery, elevation models).

By leveraging these features, COGs provide a scalable and efficient solution for managing and accessing massive raster datasets, especially in web mapping applications and cloud-based processing pipelines.
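In R, these partial reads come essentially for free via GDAL's /vsicurl/ virtual file system. Here is a hedged sketch (the URL is hypothetical; substitute any publicly accessible COG):

```r
pacman::p_load("terra")

# Hypothetical URL standing in for any public Cloud Optimized GeoTIFF
url <- "https://example.com/elevation_cog.tif"

# The /vsicurl/ prefix tells GDAL to use HTTP byte-range requests, so
# opening the raster fetches only headers and overviews, not the whole file
r <- terra::rast(paste0("/vsicurl/", url))

# Cropping to a window then requests only the intersecting internal tiles
win <- terra::ext(36.6, 37.1, -1.5, -1.1)
r_sub <- terra::crop(r, win)
```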

3.2 STAC (SpatioTemporal Asset Catalog)

The SpatioTemporal Asset Catalog (STAC) is a catalog system designed to simplify the discovery, indexing, and access of geospatial assets, particularly in cloud-based environments. STAC provides a standardized way to organize metadata about spatial and temporal datasets, making it easier for users and applications to search for and retrieve data efficiently, including COGs. Here are some key features of STAC:

  • Standardized metadata:
    STAC uses a JSON-based format to describe geospatial assets. This standardized metadata includes information such as the geographic bounding box, timestamp, and details about the data source. This uniformity enables interoperability between different datasets and platforms.

  • Hierarchical organization:
    STAC catalogs are organized hierarchically into catalogs, collections, and items:

    • Catalog: A top-level grouping that organizes multiple collections.

    • Collection: A grouping of related items (e.g., a series of satellite images from the same mission).

    • Item: An individual asset (such as a single satellite image or raster file) with its metadata.

  • RESTful API:
    Many STAC implementations provide a RESTful API, which allows users to perform powerful searches and filters based on spatial extent, time range, and other asset properties before downloading the data. This approach minimizes the amount of data that needs to be transferred.

  • Cloud-Optimized Workflows:
    STAC is particularly useful in cloud-based geospatial workflows. By enabling precise querying of datasets, STAC reduces bandwidth and processing overhead, as only the relevant assets are retrieved and processed.

Example: Querying a STAC catalog on the Microsoft Planetary Computer

Concretely, the process of getting data through the COG and STAC specifications works as follows:

  1. A user or application queries a STAC catalog for datasets that meet specific criteria (e.g., location, date, sensor type).
  2. The catalog returns items with detailed metadata and asset links.
  3. If an asset is a COG, the user can then directly retrieve the data efficiently, benefiting from the COG’s optimized structure.

If we wanted to obtain Sentinel data for this area using the traditional workflow, we would have to download many gigabytes to our computer, even if we were only interested in a single site. R has a dedicated STAC client called rstac.

Let's see how this can work using the Microsoft Planetary Computer, a cloud-based platform that aggregates, curates, and provides access to vast amounts of environmental and geospatial data. It is designed to support research and decision-making related to global challenges such as climate change, land use, and biodiversity. Other repositories also use the STAC specification.

pacman::p_load("rstac")

# Define the STAC search endpoint for the Microsoft Planetary Computer
stac_endpoint <- "https://planetarycomputer.microsoft.com/api/stac/v1"

# Connect to the catalog
s <- stac(stac_endpoint)
s
###rstac_query
- url: https://planetarycomputer.microsoft.com/api/stac/v1/
- params:
- field(s): version, base_url, endpoint, params, verb, encode
# Get fields, eg collection names
s |>
    collections() |>
    get_request()
###Collections
- collections (134 item(s)):
  - daymet-annual-pr
  - daymet-daily-hi
  - 3dep-seamless
  - 3dep-lidar-dsm
  - fia
  - gridmet
  - daymet-annual-na
  - daymet-monthly-na
  - daymet-annual-hi
  - daymet-monthly-hi
  - ... with 124 more collection(s).
- field(s): collections, links, numberMatched, numberReturned
Note

The actual data assets are in private Azure Blob Storage containers. As detailed in Using tokens for data access, accessing data from the Planetary Computer typically requires signing the item. rstac has built-in support for signing.

# Search collection
aoi <- nairobi_bb # define area of interest

t0 <- "2023-01-01" # start date
t1 <- "2024-12-31" # end date
toi <- paste(t0, t1, sep = "/") # time of interest
items <- s |>
    stac_search(
        collections = "sentinel-2-l2a",
        bbox = c(aoi["xmin"], aoi["ymin"], aoi["xmax"], aoi["ymax"]),
        datetime = toi
    ) |>
    post_request() |>
    items_sign(sign_fn = sign_planetary_computer())

# number of matches
paste("Number of matches: ", items_length(items))
[1] "Number of matches:  250"

…and we can add some extra filters too. Suppose we want to get all tiles with no more than 10% cloud cover. We would do:

items <- s |>
    stac_search(
        collections = "sentinel-2-l2a",
        bbox = c(aoi["xmin"], aoi["ymin"], aoi["xmax"], aoi["ymax"]),
        datetime = toi
    ) |>
    ext_filter(`eo:cloud_cover` < 10) |>
    post_request() |>
    items_sign(sign_fn = sign_planetary_computer())

# number of matches
paste("Number of matches: ", items_length(items))
[1] "Number of matches:  75"
# Check a specific asset
items$features[[10]]$assets$B05
$href
[1] "https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/37/M/BU/2024/09/09/S2A_MSIL2A_20240909T073611_N0511_R092_T37MBU_20240909T122046.SAFE/GRANULE/L2A_T37MBU_A048137_20240909T075950/IMG_DATA/R20m/T37MBU_20240909T073611_B05_20m.tif?st=2026-03-10T13%3A11%3A20Z&se=2026-03-11T13%3A56%3A20Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-03-11T08%3A48%3A01Z&ske=2026-03-18T08%3A48%3A01Z&sks=b&skv=2025-07-05&sig=XbJxnVRVNIvuW6jrAvFRkDSIM%2BgtTitCXUwV7Kdg3Ps%3D"

$`proj:bbox`
[1]  199980 9790240  309780 9900040

$`proj:shape`
[1] 5490 5490

$`proj:transform`
[1]      20       0  199980       0     -20 9900040

$gsd
[1] 20

$type
[1] "image/tiff; application=geotiff; profile=cloud-optimized"

$roles
[1] "data"

$title
[1] "Band 5 - Vegetation red edge 1 - 20m"

$`eo:bands`
$`eo:bands`[[1]]
$`eo:bands`[[1]]$name
[1] "B05"

$`eo:bands`[[1]]$common_name
[1] "rededge"

$`eo:bands`[[1]]$description
[1] "Band 5 - Vegetation red edge 1"

$`eo:bands`[[1]]$center_wavelength
[1] 0.704

$`eo:bands`[[1]]$full_width_half_max
[1] 0.019

We can get a sense of where they are and what they look like quite easily, without needing to download them!

# Where are they?
sf_items <- items_as_sf(items)
plot_map <- function(x) {
    tmap_mode("view")
    tm_basemap("OpenStreetMap") +
        tm_shape(x) +
        tm_borders()
}

plot_map(sf_items)
# What do they look like?
urls <- items |>
    assets_url(asset_names = "rendered_preview")
preview_plot(urls[1])

Now that we have collected the metadata for our STAC Items—including the links to geospatial data, known as assets—it's time to retrieve the actual data. To download all the assets from a STAC Item Collection, we can use the assets_download() function. Let's download a single band (B01) only:

download_items <- items |>
    assets_select(asset_names = "B01") |> # Only 1 band
    assets_download(output_dir = file.path("data","raw"), overwrite = TRUE)
download_items
###Items
- features (75 item(s)):
  - S2A_MSIL2A_20241228T074331_R092_T37MBU_20241228T110550
  - S2B_MSIL2A_20241213T074229_R092_T37MBU_20241213T094330
  - S2A_MSIL2A_20241208T074321_R092_T37MBU_20241208T111348
  - S2A_MSIL2A_20241208T074321_R092_T36MZD_20241208T111348
  - S2A_MSIL2A_20241029T074031_R092_T37MBU_20241029T111159
  - S2A_MSIL2A_20241029T074031_R092_T36MZD_20241029T111159
  - S2B_MSIL2A_20241024T073909_R092_T36MZD_20241024T095754
  - S2B_MSIL2A_20241014T073759_R092_T37MBU_20241014T102258
  - S2B_MSIL2A_20241014T073759_R092_T36MZD_20241014T102258
  - S2A_MSIL2A_20240909T073611_R092_T37MBU_20240909T122046
  - ... with 65 more feature(s).
- assets: B01
- item's fields: 
assets, bbox, collection, geometry, id, links, properties, stac_extensions, stac_version, type

Let's read this back into R. These files are each saved into their own folder inside output_dir. We can use list.files() to get the paths to each of them, then terra::rast() and terra::sprc() to collect them into a raster collection (which terra::mosaic() could subsequently combine into a single raster):

s2 <- list.files(file.path("data","raw","sentinel2-l2"),
    pattern = ".tif",
    recursive = TRUE,
    full.names = TRUE
) |>
    lapply(terra::rast) |>
    terra::sprc()
# Let's see what's inside, e.g.
lapply(s2[36:37], function(x) x)
[[1]]
class       : SpatRaster 
size        : 1830, 1830, 1  (nrow, ncol, nlyr)
resolution  : 60, 60  (x, y)
extent      : 799980, 909780, 9790240, 9900040  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 36S (EPSG:32736) 
source      : T36MZD_20240313T073711_B01_60m.tif 
name        : T36MZD_20240313T073711_B01_60m 

[[2]]
class       : SpatRaster 
size        : 1830, 1830, 1  (nrow, ncol, nlyr)
resolution  : 60, 60  (x, y)
extent      : 799980, 909780, 9790240, 9900040  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 36S (EPSG:32736) 
source      : T36MZD_20240407T073609_B01_60m.tif 
name        : T36MZD_20240407T073609_B01_60m 

The files can represent different bands and have different extents, so ideally you would want to combine them intelligently. Moreover, this download approach retrieves entire assets; we cannot fetch only specific parts of an asset, which means we can end up downloading much more data than we actually need.

gdalcubes simplifies a lot of these operations. It offers a set of functions for creating and processing four-dimensional regular raster data cubes from irregular image collections. This approach abstracts away complexities such as varying map projections, overlapping spatial extents, and differences in the spatial resolution of spectral bands.

Using gdalcubes, we can transform a STAC response into a gdalcubes image collection using the stac_image_collection() function. This function accepts a STAC feature list as input and can optionally filter the data based on metadata attributes and specific spectral bands.

pacman::p_load("gdalcubes")

s2_collection <- stac_image_collection(items$features)
s2_collection
Image collection object, referencing 75 images with 18 bands
Images:
                                                    name     left       top
1 S2A_MSIL2A_20241228T074331_R092_T37MBU_20241228T110550 36.30333 -0.903362
2 S2B_MSIL2A_20241213T074229_R092_T37MBU_20241213T094330 36.30333 -0.903362
3 S2A_MSIL2A_20241208T074321_R092_T37MBU_20241208T111348 36.30333 -0.903362
4 S2A_MSIL2A_20241208T074321_R092_T36MZD_20241208T111348 35.75081 -0.902491
5 S2A_MSIL2A_20241029T074031_R092_T37MBU_20241029T111159 36.30333 -0.903362
6 S2A_MSIL2A_20241029T074031_R092_T36MZD_20241029T111159 35.74817 -0.902491
     bottom    right            datetime        srs
1 -1.896907 37.29059 2024-12-28T07:43:31 EPSG:32737
2 -1.896907 37.29059 2024-12-13T07:42:29 EPSG:32737
3 -1.896907 37.29059 2024-12-08T07:43:21 EPSG:32737
4 -1.895543 36.68204 2024-12-08T07:43:21 EPSG:32736
5 -1.896907 37.29059 2024-10-29T07:40:31 EPSG:32737
6 -1.895548 36.68204 2024-10-29T07:40:31 EPSG:32736
[ omitted 69 images ] 

Bands:
         name offset scale unit nodata image_count
1         AOT      0     1                      75
2         B01      0     1                      75
3         B02      0     1                      75
4         B03      0     1                      75
5         B04      0     1                      75
6         B05      0     1                      75
7         B06      0     1                      75
8         B07      0     1                      75
9         B08      0     1                      75
10        B09      0     1                      75
11        B11      0     1                      75
12        B12      0     1                      75
13        B8A      0     1                      75
14        SCL      0     1                      75
15        WVP      0     1                      75
16 visual:B02      0     1                      75
17 visual:B03      0     1                      75
18 visual:B04      0     1                      75

We can now use this object as if it were a normal raster file. For example, let's create a 100 m x 100 m resolution composite RGB image from our data cube, zooming in on the city of Nairobi only.

gdalcubes_options(parallel = 8)

# Let's retrieve only the city of Nairobi (not the region)
aoi_utm <- st_as_sfc(st_bbox(c(xmin = 36.65, ymin = -1.48, xmax = 37.05, ymax = -1.10), crs = 4326)) |>
    st_transform("EPSG:32737") |>
    st_bbox()

v <- cube_view(
    srs = "EPSG:32737",
    dx = 100, dy = 100,
    dt = "P1Y", # Time aggregation, here 1y (ISO 8601 duration format)
    aggregation = "median", # defining how to deal with pixels containing data from multiple images
    resampling = "average",
    extent = list(
        t0 = t0,
        t1 = t1,
        left = aoi_utm["xmin"],
        right = aoi_utm["xmax"],
        bottom = aoi_utm["ymin"],
        top = aoi_utm["ymax"]
    )
)

r <- raster_cube(s2_collection, v) |>
    select_bands(c("B02", "B03", "B04")) |>
    reduce_time(c("mean(B02)", "mean(B03)", "mean(B04)")) |> # take the mean over selected years
    write_ncdf(file.path("data", "processed", "s2.nc"), overwrite = TRUE)

# Read back and plot
r <- rast(file.path("data", "processed", "s2.nc"))
plotRGB(r, r = 3, g = 2, b = 1, stretch = "lin", zlim = c(0, 4000))

Note
  • The first three steps (raster_cube(), select_bands(), reduce_time()) mostly build a “processing graph” (lazy evaluation). They don’t necessarily compute all pixels immediately.
  • The real compute kicks in when you materialize the result. write_ncdf() forces evaluation: it reads source imagery, reprojects, resamples, aggregates over time, and writes NetCDF.

There are a few reasons why this code is very efficient:

  1. Lazy evaluation + streaming: gdalcubes pulls just the data needed chunk-by-chunk, computes, and writes out.
  2. Spatial and temporal subsetting: cube_view() restricts work to the Nairobi bounding box in the target projection, and to the t0..t1 time window, which prevents unnecessary I/O and processing outside the AOI.
  3. Chunked parallelism: the cube is split into independent blocks that are processed in parallel (gdalcubes_options(parallel = 8)).
  4. Processing is done in compiled code: reprojection, resampling, and aggregation are executed in optimized C++/GDAL routines rather than R-level iteration.

4. Further Readings

Assignment

  • Query a Microsoft Planetary Computer STAC catalog for a satellite dataset of your choice.

  • Create a monthly time-series for a given region.

  • Apply an unsupervised k-means algorithm to classify the image.

  • Visualize the output and interpret the results.

Session info

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Brussels
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gdalcubes_0.7.3 rstac_1.0.1     tmap_4.2        sfarrow_0.4.1  
 [5] arrow_23.0.1.1  geodata_0.6-6   terra_1.9-1     lubridate_1.9.5
 [9] forcats_1.0.1   stringr_1.6.0   dplyr_1.2.0     purrr_1.2.1    
[13] tidyr_1.3.2     tibble_3.3.1    ggplot2_4.0.2   tidyverse_2.0.0
[17] tictoc_1.2.1    sf_1.1-0        readr_2.2.0    

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        farver_2.1.2            S7_0.2.1               
 [4] fastmap_1.2.0           leaflegend_1.2.1        leaflet_2.2.3          
 [7] pacman_0.5.1            XML_3.99-0.22           digest_0.6.39          
[10] timechange_0.4.0        lifecycle_1.0.5         magrittr_2.0.4         
[13] compiler_4.5.2          rlang_1.1.7             tools_4.5.2            
[16] utf8_1.2.6              yaml_2.3.12             data.table_1.18.2.1    
[19] knitr_1.51              htmlwidgets_1.6.4       bit_4.6.0              
[22] sp_2.2-1                classInt_0.4-11         curl_7.0.0             
[25] RColorBrewer_1.1-3      abind_1.4-8             KernSmooth_2.23-26     
[28] withr_3.0.2             leafsync_0.1.0          grid_4.5.2             
[31] cols4all_0.10           e1071_1.7-17            leafem_0.2.5           
[34] colorspace_2.1-2        spacesXYZ_1.6-0         scales_1.4.0           
[37] cli_3.6.5               rmarkdown_2.30          crayon_1.5.3           
[40] generics_0.1.4          otel_0.2.0              httr_1.4.8             
[43] tzdb_0.5.0              tmaptools_3.3           ncdf4_1.24             
[46] DBI_1.3.0               proxy_0.4-29            stars_0.7-1            
[49] assertthat_0.2.1        parallel_4.5.2          s2_1.1.9               
[52] base64enc_0.1-6         vctrs_0.7.1             jsonlite_2.0.0         
[55] hms_1.1.4               bit64_4.6.0-1           jpeg_0.1-11            
[58] crosstalk_1.2.2         jquerylib_0.1.4         maptiles_0.11.0        
[61] units_1.0-0             lwgeom_0.2-15           glue_1.8.0             
[64] leaflet.providers_2.0.0 codetools_0.2-20        stringi_1.8.7          
[67] gtable_0.3.6            raster_3.6-32           logger_0.4.1           
[70] pillar_1.11.1           rappdirs_0.3.4          htmltools_0.5.9        
[73] R6_2.6.1                wk_0.9.5                vroom_1.7.0            
[76] evaluate_1.0.5          lattice_0.22-7          png_0.1-8              
[79] class_7.3-23            Rcpp_1.1.1              xfun_0.56              
[82] pkgconfig_2.0.3