appelmar / gdalcubes_cpp Goto Github PK

Earth observation data cubes from GDAL image collections

License: MIT License

CMake 1.16% Dockerfile 0.08% C++ 98.51% C 0.25%

eo data cube gdal

gdalcubes_cpp's Introduction

gdalcubes - Earth observation data cubes from GDAL image collections

AS OF 2023-04-04, THIS REPOSITORY IS NO LONGER MAINTAINED. FROM NOW, THE C++ SOURCES OF GDALCUBES WILL BE FURTHER DEVELOPED AS PART OF THE R PACKAGE REPOSITORY.

gdalcubes is a library to represent collections of Earth Observation (EO) images as on demand data cubes (or multidimensional arrays). Users define data cubes by spatiotemporal extent, resolution, and spatial reference system and let gdalcubes read only relevant parts of the data and simultaneously apply reprojection, resampling, and cropping (using gdalwarp). Data cubes may be simply exported as NetCDF files or directly streamed chunk-wise into external software such as R or Python. The library furthermore implements simple operations to reduce data cubes over time, to apply pixel-wise arithmetic expressions, and to filter by space, time, and bands.

gdalcubes is not a database, i.e., it does not need to store additional copies of the imagery but instead simply links to and indexes existing files / GDAL datasets. Using GDAL virtual file systems, it can directly access data in cloud storage and run computations in distributed environments with gdalcubes_server and Docker.

The library is written in C++ and includes a basic command line interface and an R package. A python package is planned for the future. gdalcubes is licensed under the MIT license.

Core features:

Create image collections that link to and index existing imagery from local files or cloud storage
Read multitemporal, multispectral image collections as on demand data cubes with desired spatiotemporal resolution, extent, and map projection
Abstract from complexities in the data like different map projections for adjacent images and different resolutions for different bands
Stream chunks of data cubes to external programs (e.g. R, python)
Scale computations on data cubes in distributed environments with gdalcubes_server and Docker (yet experimental)

Installation

Installation from sources

gdalcubes uses CMake and can be compiled with a typical CMake workflow as listed below.

git clone https://github.com/appelmar/gdalcubes && cd gdalcubes
mkdir -p build 
cd build 
cmake -DCMAKE_BUILD_TYPE=Release ../ -DCMAKE_INSTALL_PREFIX=/usr
make 
sudo make install

You might need to install a few libraries before compiling gdalcubes successfully. On Ubuntu apt install libgdal-dev libnetcdf-dev libcurl4-openssl-dev libsqlite3-dev will install all libraries needed to compile the core gdalcubes library. If you want to compile the command line interface, you will furthermore need apt install libboost-program-options-dev libboost-system-dev and running gdalcubes as a server additionally requires apt install libcpprest-dev.

Docker image

This repository includes a Docker image which you can use either to run the gdalcubes command line interface interactively or to run gdalcubes_server as a service for distributed processing. The commands below demonstrate how to build the image and run a container. Notice that the image builds GDAL from sources, which might take up to 30 minutes.

git clone https://github.com/appelmar/gdalcubes && cd gdalcubes 
docker build -t appelmar/gdalcubes .
docker run -d -p 11111:1111 appelmar/gdalcubes # runs gdalcubes_server as a deamon 
docker run appelmar/gdalcubes /bin/bash # get a command line where you can run gdalcubes

R package

The gdalcubes R package is hosted on https://github.com/appelmar/gdalcubes_R. It includes a Dockerfile that runs RStudio Server with the gdalcubes R package.

Documentation

More detailed information can be found at the documentation page under https://gdalcubes.github.io/.

Warning

The library is still in an early development version. Major changes are possible to make gdalcubes more user-friendly, more stable, faster, and more robust. The documentation is also preliminary and not yet complete.

Credits

gdalcubes uses the following open source libraries. Detailed licensing and copyright information can be found at https://gdalcubes.github.io/source/introduction/credits.html and in LICENSE_THIRDPARTY.

GDAL: A translator library for raster and vector geospatial data formats

json11

SQLite: A self-contained, high-reliability, embedded, full-featured, public-domain, SQL database engine

CURL: Command line tool and library for transferring data with URLs

TinyExpr: A very small recursive descent parser and evaluation engine for math expressions

netCDF: The Unidata network Common Data Form C library

tiny-process-library: A small platform independent library making it simple to create and stop new processes in C++

Catch2: A modern, C++-native, header-only, test framework for unit-tests, TDD and BDD

Date: A date and time library based on the C++11/14/17 header

cpprestsdk

Boost.Filesystem

Boost.Program_options

gdalcubes_cpp's People

Contributors

Stargazers

Watchers

Forkers

edzer lbusett noerw dondealban tiledb-inc chirag19

gdalcubes_cpp's Issues

Problem with axis order

Dear author
Your project is just greath
Using new versions of proj, from GDAL, the first axis in geographic CRSs is latitude and the second axis es longitude.
I think the solution is compile including an easy change:
srs.SetAxisMappingStrategy(OAMS_TRADITIONAL_GIS_ORDER)
in every OGRSpatialReference object

Best
David

Image metadata filters in collection format or image_collection::add()

It is sometimes useful to ignore images based their metadata (e.g. cloud coverage). This could either be added in the collection_format json description or image_collection::add() method.

var and sd reducers cause segmentation fault

gdalcubes_worker binary and Docker image for scalability

To scale analysis in shared-nothing environments, a Docker image which runs a gdalcubes_worker binary that listens to incoming tasks would be useful.

Remove Aggregation and Resampling from Data Cube View

Currently, aggregation and resampling methods are part of the data cube view. This has the following disadvantages:

Methods cannot be specified per band.
Reusing data cube views for different datasets might be limited if both datasets require different methods.

It would be much cleaner to define aggregation and resampling methods as parameters to the image_collection_cube class. The data cube view definition should be completely independent from particular datasets.

Windows compilation

There are currently a few issues with compiling gdalcubes on Windows with MinGW:

The object file apply_pixel.obj leads to "too many sections" errors, which can only be solved using -Wa,-mbig-obj on x64 builds
Compiling the currently required Boost libraries is time consuming, openly available binaries use MSVC mostly
Linking the shared DLL did not work (the static .a does) and was canceled after 2 days (ld was still busy)#

Solutions to these problems might refer to #10 by reducing external dependencies.

Builds with MSVC have not been tried yet, because the R toolchain for package compilation uses
MinGW.

stac_image_collection crashing RStudio

I've been going through the Geocomputation with R textbook and have run into an issue with gdalcubes crashing in the example script given here..

library(rstac)
library(gdalcubes)

s = stac("https://earth-search.aws.element84.com/v0")
items = s |>
  stac_search(collections = "sentinel-s2-l2a-cogs",
              bbox = c(7.1, 51.8, 7.2, 52.8), 
              datetime = "2020-01-01/2020-12-31") |>
  post_request() |> items_fetch()

collection = stac_image_collection(items$features, 
                                   property_filter = function(x) {x[["eo:cloud_cover"]] < 10})

RStudio crashes upon running stac_image_collection(). The RGui terminal also freezes. Not sure if this is overwhelming the RAM on my machine (32 GB) or what.

Info (I actually have windows 11, not sure why it says Windows 10).

> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

loaded via a namespace (and not attached):
 [1] pkgload_1.2.4      viridisLite_0.4.0 
 [3] brio_1.1.3         sp_1.5-0          
 [5] stats4_4.2.0       link2GI_0.4-7     
 [7] remotes_2.4.2      sessioninfo_1.2.2 
 [9] pillar_1.8.0       lattice_0.20-45   
[11] glue_1.6.2         digest_0.6.29     
[13] RColorBrewer_1.1-3 colorspace_2.0-3  
[15] htmltools_0.5.2    XML_3.99-0.10     
[17] pkgconfig_2.0.3    devtools_2.4.3    
[19] raster_3.5-21      stars_0.5-5       
[21] purrr_0.3.4        webshot_0.5.3     
[23] scales_1.2.0       processx_3.7.0    
[25] terra_1.5-34       satellite_1.0.4   
[27] tibble_3.1.7       proxy_0.4-27      
[29] generics_0.1.2     usethis_2.1.6     
[31] ellipsis_0.3.2     cachem_1.0.6      
[33] withr_2.5.0        cli_3.3.0         
[35] mapview_2.11.0     magrittr_2.0.3    
[37] crayon_1.5.1       memoise_2.0.1     
[39] ps_1.7.1           fs_1.5.2          
[41] fansi_1.0.3        xml2_1.3.3        
[43] lwgeom_0.2-8       class_7.3-20      
[45] pkgbuild_1.3.1     tools_4.2.0       
[47] prettyunits_1.1.1  lifecycle_1.0.1   
[49] stringr_1.4.0      munsell_0.5.0     
[51] callr_3.7.0        compiler_4.2.0    
[53] e1071_1.7-11       rlang_1.0.4       
[55] classInt_0.4-7     units_0.8-0       
[57] grid_4.2.0         tmaptools_3.1-1   
[59] dichromat_2.0-0.1  rstudioapi_0.13   
[61] htmlwidgets_1.5.4  crosstalk_1.2.0   
[63] leafem_0.2.0       base64enc_0.1-3   
[65] testthat_3.1.4     codetools_0.2-18  
[67] abind_1.4-5        DBI_1.1.3         
[69] roxygen2_7.2.0     R6_2.5.1          
[71] knitr_1.39         dplyr_1.0.9       
[73] fastmap_1.1.0      utf8_1.2.2        
[75] rprojroot_2.0.3    KernSmooth_2.23-20
[77] desc_1.4.1         stringi_1.7.8     
[79] parallel_4.2.0     Rcpp_1.0.8.3      
[81] vctrs_0.4.1        sf_1.0-7          
[83] png_0.1-7          leaflet_2.1.1     
[85] tidyselect_1.1.2   xfun_0.31

GDAL3 and PROJ6 issues

Running gdalcubes with GDAL3 and proj6 currently shows some problems:

The extent of images in collections has swapped x and y axes (due to axis order in EPSG:4326)
Reading image collections with gdalwarp tends to be significantly slower (seems to be related to proj, maybe just some sqlite mutex waits, needs more testing).

This issue aims at collecting further related problems and developing ideas how to solve these.

Environment variables with boost::process

It seems that setting environment variables for child processes using boost::process::env["xxx"] produces buffer overflows (according to ASAN). We still use it because the found alternative with boost::this_process::environment().set() will cause debug build failures (pthread llinking) although it works fin in release mode.

create_image_collection is not ending

Hello,
I am using gdalcube in order to built a data cube from .tif files downloaded with google earth engine (or rgee). These files (count: 27) contain all the same bands (count: 10) and have all the same extension. All files have a size of 13-14 MB each.
I provide the file paths, the string with respective dates and the names of the bands and the function (create_image_collection) is starting. The problem is that even for two files (I tried it also with a subset) the process is already taking longer than 30 mins and is not ending.
What could be the problem here?

Session Info: 
> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8    LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                    LC_TIME=German_Germany.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gdalcubes_0.6.3           googleCloudStorageR_0.7.0 rgee_1.1.5                reticulate_1.26           sf_1.0-9                  RPostgres_1.4.4          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9         here_1.0.1         lattice_0.20-45    png_0.1-8          class_7.3-20       ps_1.7.2           assertthat_0.2.1   rprojroot_2.0.3    digest_0.6.31      utf8_1.2.2        
[11] R6_2.5.1           evaluate_0.20      e1071_1.7-12       httr_1.4.4         pillar_1.8.1       rlang_1.0.6        curl_4.3.3         rstudioapi_0.14    googleAuthR_2.0.0  blob_1.2.3        
[21] Matrix_1.5-1       rmarkdown_2.20     htmlwidgets_1.6.1  bit_4.0.5          proxy_0.4-27       compiler_4.2.2     xfun_0.36          pkgconfig_2.0.3    askpass_1.1        htmltools_0.5.4   
[31] openssl_2.0.5      tidyselect_1.2.0   tibble_3.1.8       fansi_1.0.3        crayon_1.5.2       dplyr_1.0.10       rappdirs_0.3.3     grid_4.2.2         jsonlite_1.8.4     lifecycle_1.0.3   
[41] DBI_1.1.3          magrittr_2.0.3     units_0.8-1        ncdf4_1.21         KernSmooth_2.23-20 zip_2.2.2          cli_3.4.1          cachem_1.0.6       fs_1.5.2           leaflet_2.1.1     
[51] ellipsis_0.3.2     generics_0.1.3     vctrs_0.5.1        tools_4.2.2        bit64_4.0.5        glue_1.6.2         hms_1.1.2          crosstalk_1.2.0    processx_3.8.0     fastmap_1.1.0     
[61] yaml_2.3.6         gargle_1.2.1       classInt_0.4-8     memoise_2.0.1      knitr_1.41       ```

build fails on dev

The build fails on dev (and master) because of

example.cpp:36:10: fatal error: test_multiprocess.h: No such file or directory
 #include "test_multiprocess.h"

Commenting out the include fixes the issue but the file seems redundant as part of the build for most users / devs.

 * THIS FILE IS JUST FOR TESTING WITH SOME LOCAL FILES INCLUDING ABSOLUTE PATHS.
 * IT WILL NOT RUN ON YOUR MACHINE:

broken link in README.md for documentation

The document link in the README seems wrong to me.

Currently, it is

# Documentation
More detailed information can be found at the documentation page under https://gdalcubes.github.io/docs.

I think it should be:

# Documentation
More detailed information can be found at the documentation page under https://gdalcubes.github.io/.

ie https://gdalcubes.github.io/.

Feature requests

This issue collections ideas for important features to be added. Highlighted items should be considered for the next minor release.

General Features:

Import data cubes from NetCDF file(s)
Automatic optimization of data cube graphs (by reordering)
cancel function in for long running operations
NetCDF export parameters chunking and compression
Masking on images before applying aggregation in image_collection_cube::read_chunk()
Chunk caching with JSON serialization as key

Data cube operations:

Image Collection:

store per-band and per-image metadata in image collections (optional)
implement per-image metadata filters (e.g. cloud coverage) (optional)
Support input images without GeoTransform but GCPs (e.g. Sentinel 1)

CLI

gdalcubes translate tool to apply gdal_translate on all GDAL dataset of a given image collection
gdalcubes addo tool to compute GDAL overviews with gdaladdo on all GDAL dataset of a given image collection

Library improvements

This issue is to collect ideas how to improve the library in its general design:

Ideas:

Data cube view is a parameter only to image_collection_cube, cube instances (except image_collection_cube) should not store its st reference but provide a function that derives the reference from input cube and maybe other parameters.
Make data cube objects copyable by copying all parameters (including input cubes)
Create collection format JSON description presets
For some operations that must read adjacent chunks, a chunk cache with key(hash(graph), chunk_id) could improve the overall performance by reducing reads from disk

gdalcubes update_collection tool

Support updating an image collection by rescanning a directory with collection_format from metadata table.

Reduce external dependencies

Consider rewriting parts to reduce the number of external dependencies:

use GDAL's VSI and CPL implementation for filesystem access instead of Boost::filesystem
check for header-only CLI tools instead of boost::program_options
rewrite streaming with https://github.com/eidheim/tiny-process-library
use https://github.com/HowardHinnant/date (which becomes part of std for C++20?) instead of boost::datetime
Include SQLite amalgamation (see https://www.sqlite.org/amalgamation.html) in source tree (is this really useful?)

Possibility to pipe raster cubes?

We have a use case where we would like to take a land cover map, say at 250 meter resolution, isolate one category, and resample it at 1 km resolution to identify the proportion of 250 m pixels of that category within each 1km pixel. In other words, we would need to do something like

prop12 <- raster_cube(col, cube_view(dx = 250, dy = 250, ....),...) |> 
apply_pixel(function(v) {v[1]==12}) |> 
raster_cube(cube_view(dx = 1000, dy = 1000, resampling='average',....),...)

This is not possible, but is there another way of doing this without bringing the 250 m raster in memory ?

Image collection for MOD13Q1

Hi, Marius and Edzer,
Congratulation for the gdalcubes package!! Great work!! I am starting to work on integrating "gdalcubes" with the SITS (Satellite Image Time Series) package, available on https://github.com/e-sensing/sits. One of the data sets we use a lot is MOD13Q1, which provides 250 m resolution imagery (see for example, https://doi.org/10.1016/j.isprsjprs.2018.08.007). Would it be asking too much from you to include the MOD13Q1 format in your image collection format?

Sentinel 2 unzipped SAFE not read

I'm trying to read into an image collection a list of directories of (unzipped) Sentinel 2 data. I get an empty collection. However the same procedure run on zipped files works.
My workflow begins with the R package sen2r to download the tiles, and this package automatically unzips the the downloaded file into a *.SAFE directory. So I need gdalcubes to read in the SAFE dir.

Here's what happens:

S2_SAFE_list = list.files(path = S2_download_dir, pattern = ".*SAFE$",  full.names = TRUE)
S2_SAFE_list
 [1] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2A_MSIL2A_20190908T081601_N0213_R121_T36RXV_20190908T110603.SAFE"
 [2] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2A_MSIL2A_20190918T081601_N0213_R121_T36RXV_20190918T131954.SAFE"
 [3] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2A_MSIL2A_20190928T081721_N0213_R121_T36RXV_20190928T110727.SAFE"
 [4] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2A_MSIL2A_20191008T081831_N0213_R121_T36RXV_20191008T110004.SAFE"
 [5] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2A_MSIL2A_20191018T081941_N0213_R121_T36RXV_20191018T113720.SAFE"
 [6] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2A_MSIL2A_20191028T082041_N0213_R121_T36RXV_20191028T095551.SAFE"
 [7] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2A_MSIL2A_20191107T082141_N0213_R121_T36RXV_20191107T093242.SAFE"
 [8] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2A_MSIL2A_20191117T082221_N0213_R121_T36RXV_20191117T114817.SAFE"
 [9] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2A_MSIL2A_20191127T082301_N0213_R121_T36RXV_20191127T093536.SAFE"
[10] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2B_MSIL2A_20190903T081609_N0213_R121_T36RXV_20190903T120314.SAFE"
[11] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2B_MSIL2A_20190913T081609_N0213_R121_T36RXV_20190913T125058.SAFE"
[12] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2B_MSIL2A_20190923T081639_N0213_R121_T36RXV_20190923T124312.SAFE"
[13] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2B_MSIL2A_20191013T081859_N0213_R121_T36RXV_20191013T124529.SAFE"
[14] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2B_MSIL2A_20191023T082009_N0213_R121_T36RXV_20191023T131005.SAFE"
[15] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2B_MSIL2A_20191102T082009_N0213_R121_T36RXV_20191102T111125.SAFE"
[16] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2B_MSIL2A_20191112T082109_N0213_R121_T36RXV_20191112T113452.SAFE"
[17] "/media/micha/MY_DATA/Data/DWH/Sentinel2/S2B_MSIL2A_20191122T082149_N0213_R121_T36RXV_20191122T113933.SAFE"

S2_collection = create_image_collection(files = S2_SAFE_list, format='Sentinel2_L2A', unroll_archives=TRUE, out_file=file.path(GIS_dir, "S2_collection.db"))
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|

S2_collection
A GDAL image collection object, referencing 0 images with 0  bands
Images:
[1] name     left     top      bottom   right    datetime srs     
<0 rows> (or 0-length row.names)

Bands:
[1] name        offset      scale       unit        nodata      image_count
<0 rows> (or 0-length row.names)

Again, if I zip the SAFE directories, and make a list of the zipfiles, then create_image_collection() succeeds.
Do I need to make any changes to the Sentinel_2A.json ?

Thanks, Micha

Add no data values to collection format

Some imagery formats do not report their NoData values in the corresponding GDAL field. Similar to #1, it makes sense to add an optional manual no data value per band in the image collection format description.

Plot RGB don't work

Hi,

I have donwloading this tile : "SENTINEL2B_20210110-105806-431_L2A_T31TEM_D.zip" on theia muscate.
i create a db with this file :
tile <- "SENTINEL2B_20210110-105806-431_L2A_T31TEM_D"
img_mensuel_col <- gdalcubes::create_image_collection(paste0("/media/obstetar/Seagate Expansion Drive/shiny_cnes/data/", tile, ".zip"),
"Sentinel2_L2A_THEIA",
"/home/obstetar/Bureau/image_collection.db",
unroll_archives = TRUE
)

vuem <- gdalcubes::cube_view(
srs = "EPSG:2154",
extent = list(
left = 797972.4,
right = 833590.6,
top = 6621103.5,
bottom = 6562586.2,
t0 = strftime(as.Date(str_sub(tile, 12, 19), format = "%Y%m%d"), format = "%Y-%m-%d"),
t1 = strftime(as.Date(str_sub(tile, 12, 19), format = "%Y%m%d"), format = "%Y-%m-%d")
),
dx = 100,
dt = "P1D",
resampling = "bilinear",
aggregation = "median",
keep.asp = TRUE
)

s_rgb_preview <- gdalcubes::raster_cube(img_mensuel_col, vuem) %>%
gdalcubes::select_bands(c("B2", "B3", "B4")) %>%
plot(rgb = 3:1)

All the plot is gray no color ??? Or with tile SENTINEL2B_20210226-104809-432_L2A_T31TEM_D.zip no problem
@+

Add date / time functions to `apply_pixel` and `filter_pixel`

Functions year(), month(), dayofyear(), dayofmonth(), and others related to time would be a nice addition for pixel-wise expressions, e.g. to be added as symbols in tinyexpr.

Default view for CLI with parameters nx, ny, gdalcubes create_view tool

Instead of providing the data view as JSON file, a default view for an image collection with full extent and nx, ny arguments could be implemented. Similarly, a tool gdalcubes create_view with nx,ny, dx,dy, extent, and projetion parameters or web tile arguments would be useful.

parallelism in create_image_collection + cube operations

I'm working with a use case that involves a lot of looping over the pattern of first creating an image collection, basically looks like:

lapply(many_collections, \(collection_i) {
   gdalcubes::create_image_collection(collection_i) |>
   raster_cube(view) |> 
   select_bands(bands) |> 
   extract_geom(sites) |>
   write_csv(somewhere)

gdalcubes nicely breaks this up with two 'progress' bars, one during the create_image_collection(), and the other during the calculation on the raster cube.

That second part of the pipeline parallelizes rather well under the hood with gdalcubes_options(parallel=TRUE). It doesn't quite saturate my bandwidth or CPU, but at it usually achieves a few 100 MB/s and uses all available cores (though at pretty low utilization).

However, resource use when preparing the collection is far lower -- it always uses only 1 core and achieves only a few MB/s transfers. As a result, the create_image_collection() step takes almost as much time as the rest of the processing, which is making it difficult to achieve good use of computational resources. I have tried parallelizing my create_image_collection() step, (e.g. with mcapply or makePSOCKCluster, but I guess because of the use of the C++ pointers, this doesn't actually work -- the code runs well and uses more cores and achieve higher bandwidth, but the resulting image collections are invalid).

I've found I can parallelize the whole loop above, (while also using gdalcubes_options(parallel=TRUE)). This creates higher RAM use (not surprisingly) but does allow me to get better compute resource use: with 10 parallel processes during the raster_cube() part I can get 10 GB/s (running on AWS c6i.32xlarge instance) and hit 100% CPU utilization for all 128 cores. However, at this scale, most of the time is spent on create_image_collection(), where I'm back down to a paltry 5 - 6 MB/s transfers.

Maybe there is no way around the realities here, but it would be wonderful if you had any advice on improving computational use in this workflow. Maybe my case is uniquely bad because I'm pulling from grib files instead of COG files, though they are already on the AWS platform as well. (The many iterations arise because these are NOAA GEFS forecasts -- so the forecast produced on any given day involves 31 ensemble members, and I'd like to get all ensemble members in a forecast for all forecasts in a year. But each ensemble member in each 35-day forecast needs it's own "cube", since a probablistic forecast really has 2 extra dimensions beyond the standard geocube -- we have forecast horizon and uncertainty, in addition to band, datetime, and xy location). Any advice you have would be much appreciated. Reproducible code I'm using is here: https://github.com/eco4cast/gefs4cast/blob/main/inst/cog-examples/efi-snapshots.R, which is just a thin wrapper around gdalcubes basically implementing the pseudo-code pipeline I outlined above.

Thanks for any ticks or pointers!

Support for GDAL subdatasets in collection creation

Currently, e.g. for MODIS HDF files, users need to take care about listing relevant subdatasets. Similar to how e.g. ZIP archives are automatically scanned, opened datasets could be checked for having subdatasets and, if available, these can be used as input to create the image collection.

Demo Docker image with sample data

To demonstrate what gdalcubes can do and how we can use it, a demo Docker image would be extremely useful.

Credits link in the README leads to 404

The first link in the Credits section in the README gives a 404 error.

I am not sure, where this should link to, but I thought I would let you know ;-)

Maintain date format in cube-view attributes when using monthly aggregation

Hi!
We produced a set of pipeable functions out of the gdalcubes image acquisition workflow. In this we generate the STAC query from the cube-view object.
Recently we used monthly aggregation for the first time and discovered that t0 and t1 of the cube-view get shortened to only include year and month. This causes the successive STAC search to crash as it requires the format YYYY-mm-dd (see error message).

Therefore our question: Is this shortening necessary? Or could the time period be kept as it is specified by the user like e.g. when using “P30D”?

Thanks!

#reprex

ext = list(4403205, 2961338, 4403379, 2961503)
names(ext) = c("xmin", "ymin", "xmax", "ymax")

## output list
cube_opts = list(
  "dx" = 10
  , "dy" = 10
  , "dt" = "P1M"
  , "aggregation" = "first"
  # , "resampling" = resampling
  , "period" = c("2021-01-01", "2021-08-15")
)

## define cube view
cv_args = list(
  srs = "EPSG:3035"
  , dx = cube_opts$dx
  , dy = cube_opts$dy
  , dt = cube_opts$dt
  , aggregation = cube_opts$aggregation
  # , resampling = cube_opts$resampling
  , extent = list(
    t0 = as.character(cube_opts$period[1])
    , t1 = as.character(cube_opts$period[2])
    , left = ext[["xmin"]]
    , right = ext[["xmax"]]
    , top = ext[["ymax"]]
    , bottom = ext[["ymin"]]
  )
)

cv = do.call(gdalcubes::cube_view, args = cv_args)

str(cv)
#> List of 4
#>  $ space      :List of 9
#>   ..$ right : num 4403382
#>   ..$ left  : num 4403202
#>   ..$ top   : num 2961506
#>   ..$ bottom: num 2961336
#>   ..$ nx    : num 18
#>   ..$ ny    : num 17
#>   ..$ srs   : chr "EPSG:3035"
#>   ..$ dx    : num 10
#>   ..$ dy    : num 10
#>  $ time       :List of 4
#>   ..$ t0: chr "2021-01"
#>   ..$ t1: chr "2021-08"
#>   ..$ dt: chr "P1M"
#>   ..$ nt: num 8
#>  $ aggregation: chr "first"
#>  $ resampling : chr "near"
#>  - attr(*, "class")= chr [1:2] "cube_view" "list"
## create stac query from cube view

bbox = sf::st_bbox(
  c(
    xmin = cv$space$left
    , xmax = cv$space$right
    , ymax = cv$space$top
    , ymin = cv$space$bottom
  )
  , crs = sf::st_crs(cv$space$srs)
)
bbox = sf::st_as_sfc(bbox)
bbox = sf::st_transform(bbox, crs = 4326)
bbox = sf::st_bbox(bbox, crs = 4326)

stacURL = "https://earth-search.aws.element84.com/v0"
collections = "sentinel-s2-l2a-cogs"

query = rstac::stac(base_url = stacURL)
query = rstac::stac_search(
  q = query,
  collections = collections
  , bbox = as.numeric(bbox)
  , datetime = paste0(cv$time$t0, "/", cv$time$t1)
  , limit = 500)
#> Error: The date time provided not follow the RFC 3339 format,please check the RFC 3339 rules.
# items = rstac::post_request(q = query)

^{Created on 2022-03-29 by the reprex package (v2.0.1)}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Ubuntu 20.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2022-03-29
#>  pandoc   2.11.4 @ /usr/lib/rstudio/bin/pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [2] CRAN (R 4.1.0)
#>  class         7.3-20  2022-01-13 [4] CRAN (R 4.1.2)
#>  classInt      0.4-3   2020-04-07 [2] CRAN (R 4.1.0)
#>  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.1.2)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr         1.0.7   2021-06-18 [2] CRAN (R 4.1.0)
#>  e1071         1.7-9   2021-09-16 [1] CRAN (R 4.1.2)
#>  ellipsis      0.3.2   2021-04-29 [2] CRAN (R 4.1.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.1.2)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.1)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  gdalcubes     0.6.1   2022-03-23 [1] CRAN (R 4.1.3)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.3)
#>  highr         0.9     2021-04-16 [2] CRAN (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
#>  httr          1.4.2   2020-07-20 [2] CRAN (R 4.1.0)
#>  jsonlite      1.8.0   2022-02-22 [1] CRAN (R 4.1.3)
#>  KernSmooth    2.23-20 2021-05-03 [4] CRAN (R 4.0.5)
#>  knitr         1.38    2022-03-25 [1] CRAN (R 4.1.3)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
#>  magrittr      2.0.2   2022-01-26 [1] CRAN (R 4.1.2)
#>  ncdf4         1.19    2021-12-15 [1] CRAN (R 4.1.2)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig     2.0.3   2019-09-22 [2] CRAN (R 4.1.0)
#>  proxy         0.4-26  2021-06-07 [2] CRAN (R 4.1.0)
#>  purrr         0.3.4   2020-04-17 [2] CRAN (R 4.1.0)
#>  R6            2.5.1   2021-08-19 [2] CRAN (R 4.1.1)
#>  Rcpp          1.0.8.3 2022-03-17 [1] CRAN (R 4.1.3)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.2)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.1.3)
#>  rmarkdown     2.13    2022-03-10 [1] CRAN (R 4.1.3)
#>  rstac         0.9.1-5 2021-10-31 [1] CRAN (R 4.1.1)
#>  rstudioapi    0.13    2020-11-12 [2] CRAN (R 4.1.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  sf            1.0-7   2022-03-07 [1] CRAN (R 4.1.3)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr       1.4.0   2019-02-10 [2] CRAN (R 4.1.0)
#>  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
#>  tidyselect    1.1.1   2021-04-30 [2] CRAN (R 4.1.0)
#>  units         0.8-0   2022-02-05 [1] CRAN (R 4.1.2)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs         0.3.8   2021-04-29 [2] CRAN (R 4.1.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun          0.30    2022-03-02 [1] CRAN (R 4.1.3)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
#> 
#>  [1] /home/carola/R/x86_64-pc-linux-gnu-library/4.1
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────