
chopin's Issues

grid_advanced mode merges grids too aggressively

  • mode = "grid_advanced" utilizes minimum spanning tree to merge adjacent grids with intersecting grids less than a threshold.
  • In a highly clustered point sets, this approach results in broad outskirts and a few internal unmerged grids. Points in the outskirt could exceed the number of intersecting points with any grids, which is not intended.
  • Simple go-around is that the merged grids are split by a few horizontal/vertical line
    • Could we design this in a sophisticated way while keeping a decent performance?
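A minimal sketch of the workaround idea using plain sf, where the merged grid is a placeholder polygon rather than actual par_make_gridset output:

library(sf)
# placeholder for one overly large merged grid polygon
merged <- st_as_sfc("POLYGON ((0 0, 10 0, 10 4, 4 4, 4 10, 0 10, 0 0))")
# cut it with a coarse 2 x 2 set of horizontal/vertical lines
cutter <- st_make_grid(merged, n = c(2, 2))
pieces <- st_intersection(merged, cutter)
length(pieces)  # the merged grid is now several smaller pieces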

error running README examples

srtm <- terra::unwrap(readRDS("../../tests/testdata/nc_srtm15_otm.rds")) fails for me; I think this code is intended to work only when run from a checkout of the development version of the package.

We could use a system.file() call, as you did with the example sf data, to make that work (although test files are not installed by default when using the testthat package for tests; you might have to include this example data in the package itself).
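A sketch of that suggestion, assuming the .rds file is shipped under inst/extdata of chopin (the exact subdirectory is an assumption):

# load the example raster from the installed package instead of a relative path into tests/
srtm_path <- system.file("extdata", "nc_srtm15_otm.rds", package = "chopin")
srtm <- terra::unwrap(readRDS(srtm_path))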

Using a README.Rmd to knit a README file would be a good way to include these in the CI process.

sparklyr and spatial extension test

  • sparklyr: a connector between spark and dplyr
    • Supports spark installation function spark_install(), no pain in spark configuration
    • Paired use with distributed databases (e.g., Hive on HDFS) will make a distributed computing workflow smoother and more efficient, especially when handling large (e.g., tens of gigabytes or more) geospatial datasets
    • Tried standalone use on my local machine; I will try it on the HPC after getting access to it
    • To orchestrate Spark and HDFS, the "gateway" module should be installed in "all nodes" in HPC; I am still figuring out what exactly we need to configure to use Spark and HDFS on NIEHS HPC. (cf. a figure about spark configuration in HPC)
    • Will learn settings in detail (i.e., caching for datasets in frequent use, worker specification, etc.)
    • How to use Spark in Apptainer: mount ddn as an external drive in the container, then assign memory and CPU resources to it?
  • sedona: a spark extension for spatial data analysis (previously geospark)
    • A pair of Java applications (in *.jar extension) that support geospatial capabilities in Spark engine
    • apache.sedona offers functions to use sedona with sparklyr
      • The user base seems not as solid as that of the Python equivalent (19K total downloads in R vs. 14M in Python)... we may need to use the Spark module in Python and then find a way to connect it with the main R workflow.
    • Direct input is supported for several geospatial data formats (Shapefile, geoparquet, and geojson)
    • Currently there is an issue with converting a Spark table that has a geometry column (usually a list column of WKT strings) back to an sf column, as sparklyr strongly assumes that every column in a table is an atomic vector (see the roundtrip sketch after this list).
  • More ideas
    • GPU capability with the RAPIDS extension: no R API exists; we might need to build one from scratch.
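On the WKT-to-sf conversion issue above, a minimal roundtrip sketch with a toy table and a local Spark master (not Sedona-specific):

library(sparklyr)
library(dplyr)
library(sf)

sc <- spark_connect(master = "local")

# a toy table whose geometry is stored as WKT strings
pts <- data.frame(id = 1:3, wkt = c("POINT (0 0)", "POINT (1 1)", "POINT (2 2)"))
pts_spark <- copy_to(sc, pts, overwrite = TRUE)

# collect the table back to R, then rebuild the sf geometry column from WKT
pts_sf <- pts_spark |>
  collect() |>
  mutate(geometry = st_as_sfc(wkt, crs = 4326)) |>
  st_as_sf()

spark_disconnect(sc)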

Align input-output class in `par_make_gridset(..., mode="grid_advanced")`

Error description

Unlike the other mode settings in par_make_gridset, par_make_gridset(..., mode = "grid_advanced") returns a list of SpatVector objects.

Reproducible code

library(sf)
library(terra)
library(tigris)
library(chopin)
library(spatstat.random)
set.seed(2024)
# Read the nc example gpkg file from sf package
nc <- st_read(system.file("gpkg/nc.gpkg", package = "sf"))
nc <- st_transform(nc, 5070)

# Sample clustered points using st_sample (input is sf)
sampled_points <- st_sample(nc, type = "Thomas", mu = 3e-9, scale = 1000, kappa = 10)

# grid merge
grid_merge <- chopin::par_make_gridset(
  sampled_points,
  mode = "grid_advanced",
  nx = 24L, ny = 12L,
  grid_min_features = 30L,
  padding = 2e4
)
plot(sampled_points, pch = 19, cex = 0.2)
plot(grid_merge$padded, add = TRUE, border = "red", lwd = 2)

sapply(grid_merge, class)
# elements are SpatVector
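As an interim workaround (a sketch only), the SpatVector elements can be converted back to sf so that the output class matches the sf input:

grid_merge_sf <- lapply(grid_merge, sf::st_as_sf)
vapply(grid_merge_sf, function(x) class(x)[1], character(1))
# all elements are now sf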

Expected behavior

Input and output are supposed to be the same class.

sessionInfo() results

N/A: does not depend on local systems.

Estimate computational demands for short- and long-term tasks

The ultimate objective of this project is to make scalable GIS computation easier for adequately versed R/GIS users (e.g., master's students in epidemiology or geography). The short-term goal, however, is to serve the {SET} group's NRT-AP project by computing the required covariates nationwide (i.e., the mainland United States). This issue is for estimating the scale of covariate computation for the NRT-AP project, as well as the computational scales that potential users would need to work with.

NRT-AP

  • Mainland US: ~8 million sq km; split the mainland into standard rectangular regions

    • 1 km spatial resolution: 8 million points
    • 500 m spatial resolution: 32 million points
    • 2 km spatial resolution: 2 million points
    • Splitting the mainland into rectangular regions would be a viable option since the target points are generated regularly. Rectangular regions also reduce the size of the base-data subsets (e.g., raster data) needed for covariates
  • Checklist for scaling up the computation

    • Memory footprint
    • Multicore or multisession parallelization
    • Storage read speed and memory bandwidth + inter-node network bandwidth
    • Total data amount to distribute to each node
    • Configuration of preprocessing and main processing assets

Brainstorming possible use cases

  • Irregular individual residential locations in highly dense urban areas
  • Temporally regular sample points along trajectories
  • Sample points with spatially varying density

Function ideas

  • Takes hardware configuration, data dimensions and complexity (e.g., shape complexity in vector data), and data size as inputs, then determines the optimal splits to reduce running time (a toy sketch follows this list)
  • Merge adjacent computational grids with a small number of target points (grid_merge; implemented)
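A toy sketch of the first idea; the heuristic and all constants are placeholders, not chopin's actual logic:

# estimate how many splits are needed so each chunk fits in a worker's memory,
# given a rough per-point footprint; all numbers are illustrative
estimate_splits <- function(n_points, bytes_per_point = 2e4,
                            mem_per_worker_gb = 4, n_workers = 8) {
  total_gb <- n_points * bytes_per_point / 1e9
  splits_for_memory <- ceiling(total_gb / mem_per_worker_gb)
  max(splits_for_memory, n_workers)  # at least one chunk per worker
}
estimate_splits(8e6)  # e.g., the 1-km NRT-AP point grid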

Performance benchmark

Benchmark design: one case, two functions

Case: Crop type summary

  • Hardware setting
    • HPC via SLURM (cores per task: 1 vs. 50-150, depending on memory footprint)
  • Data
    • USDA Cropland data (spatial resolution: 30 meters, 2020)
  • Test 1: distribute across grids, crop type summary (a single-node sketch appears after this list)
    • Generate points: 8+ million points in the mainland US
    • Buffer: 10 kilometers
  • Test 2: distribute through hierarchy
    • Polygons: 238+K block groups in the US (2020)
    • Hierarchy: county
  • Test 3: nearest road join to point
    • 8+ million points -> 1.7 million
    • Level-2 roads from the Census Bureau / North American Roads from the Department of Transportation
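A single-node sketch of Test 1 (per-class fractions within 10-km point buffers); the raster and points below are synthetic placeholders, not the USDA cropland data or the national point set:

library(terra)
library(sf)

# synthetic 5-class raster standing in for the cropland layer
crop <- rast(nrows = 100, ncols = 100, xmin = 0, xmax = 1e5, ymin = 0, ymax = 1e5,
             vals = sample(1:5, 1e4, replace = TRUE))
# synthetic points standing in for the national point set
pts <- st_as_sf(data.frame(x = runif(20, 0, 1e5), y = runif(20, 0, 1e5)),
                coords = c("x", "y"))

bufs <- st_buffer(pts, dist = 10000)  # 10-km buffers
crop_frac <- exactextractr::exact_extract(crop, bufs, "frac")  # per-class fractions
head(crop_frac)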

Add use cases

  • Deliverables: markdown (package vignettes)
  • The easiest and fastest approach is to convert existing tests into markdown vignettes.
  • Aim: add at least two vignettes around Nov. 30th, 2023
    • Understanding distribute_process_* functions and their concepts/applicable situations
    • Extract values at buffer/polygons
    • SEDC calculation and optimal bandwidth search -- is it relevant to this package?

Test-coverage workflow integration

  • The functions of this package will be tested, and coverage will be assessed, with the testthat package
  • TODO
    • Prepare minimal datasets to test the functions (Insang)
    • Design effective strategies to test the scalability of the functions (Insang)
    • Add YAML file(s) to activate test-coverage and R CMD build workflows (Kyle; a usethis shortcut is sketched after this list)
    • Write valid tests for all functions
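A hedged shortcut for the workflow YAML item, assuming usethis fits the development workflow; these calls copy standard GitHub Actions templates into .github/workflows/:

usethis::use_github_action("check-standard")  # R CMD check across operating systems
usethis::use_github_action("test-coverage")   # covr coverage report upload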

Zarr performance test

Background and objective

  • Zarr is a modern raster data format optimized for cloud storage and distributed data provision
  • The gist of Zarr format is to split a multidimensional raster into single "chunked" layers in a directory structure.
  • A question is: would this format result in shorter processing time than conventional approaches (e.g., NetCDF (climate data), ERDAS Imagine (NLCD), GeoTIFF (various elevation data))?

First test

  • There is seemingly no way to export raster data in Zarr format from an R interface (but see the gdal_utils() sketch after this list)
  • Available approaches include the gdalmdimtranslate command-line tool or xarray in Python
  • I tried converting the 2021 CONUS NLCD data (27 GB, compressed) to Zarr using both tools
    • gdalmdimtranslate resulted in an error of "Cannot guess driver"
    • xarray.Dataset.to_zarr() crashed with a "cannot allocate 63.4 GB in memory" error (my laptop has 32 GB of memory, 24 GB available)
  • One day of MERRA-2 data was converted to Zarr (data size increased from 1.1 GB to 1.3 GB); a performance test is underway
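A hedged R-side alternative to the command-line tool: recent sf versions expose gdalmdimtranslate through gdal_utils(). This requires a GDAL build with the Zarr driver, and the MERRA-2 file name below is only illustrative:

sf::gdal_utils(
  util = "mdimtranslate",
  source = "merra2_20230101.nc4",
  destination = "merra2_20230101.zarr",
  options = c("-of", "Zarr")
)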

distribute_process_multirasters output formatting

distribute_process_multirasters works as designed, but its output makes little sense because it does not record which file each summary value was calculated from. The internal processing should add file names alongside the id and raster layer names.
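A sketch of the intended output shape: loop over raster files and carry the file name into the result. The helper and its inputs (raster_paths, polys with an id column) are hypothetical, not the current internals:

summarize_multirasters <- function(raster_paths, polys, fun = "mean") {
  res <- lapply(raster_paths, function(p) {
    r <- terra::rast(p)
    out <- exactextractr::exact_extract(r, polys, fun)
    # keep the source file so each summary can be traced back to its raster
    data.frame(id = polys$id, value = out, file = basename(p))
  })
  do.call(rbind, res)
}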

Meanwhile, Codecov behaves differently for distribute_process_multirasters than for other functions: main lines of the function are marked as untested even though the test suite exercises them.

Existing R packages for spatial analysis for spatial epidemiology

  1. SpatialEpi (CRAN Link)

    • Statistical tests (e.g., spatial cluster detection with the Besag-Newell and Kulldorff methods)
    • Assumes input datasets that are already prepared for analysis
    • Inputs are points: polygons should be converted to centroids for analysis
  2. SpatialEpiApp (Moraga 2017)

    • An R Shiny app leveraging multiple external packages, including INLA and SaTScan
    • Functions are mostly for statistical analysis, not for geospatial data handling to obtain variables from geospatial datasets
    • Seemingly not maintained by the author
  3. aegis (Application for Epidemiological Geographic Information System) (Cho et al. 2020)

    • An R Shiny app supporting cohort definition, temporal exploration, disease mapping, clustering, and interactive visualization of health outcomes/modeling results
    • Modeling module is based on R-INLA
    • Data standardization functions that make Korean National Health Insurance System cohort data compliant with the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM)
    • Uses Database of Global Administrative Areas (GADM) for potential users outside Korea
  • More to come

add climate applications

As we move towards writing the manuscript, let's consider our target audience and the interest in climate and health modeling. The README should reflect common applications in climate and health.

Poster feedback

  • Split input locations/areas into uniformly sized subsets
  • Bypass the limits on the number of processes
  • Give users guidance on the potential (in)efficiency of parallelization given their input data

Support for other classes for spatial data + lower level access

  • To make the package future-proof, the package should be flexible to accommodate new spatial data I/O and processing packages that will be developed in the future.
  • For good performance, lower-level access for data I/O and primitive operations will be helpful.

CRAN Submission

  • Track CRAN submission system recovery
  • Hard check on CRAN requirements
  • Check built package file size (up to 100MB, each folder exceeding 1MB is noted)
  • Submit to CRAN

update README installation

Error description

@sigmafelix I noticed the README installation instructions include an out-of-date reference to github/spatiotemporal-exposure-and-toxicology.
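A sketch of the corrected instruction, assuming the repository now lives under the NIEHS organization (as referenced elsewhere in these issues):

# install the development version from the current repository location
remotes::install_github("NIEHS/chopin")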

`extract_at` integration

extract_at_buffer and extract_at_poly differ only slightly: extract_at_buffer uses a radius argument to generate circular buffers. For brevity, it is reasonable to expose a single function, extract_at, for both point and polygon overlays on rasters.
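A hypothetical sketch of the unified interface; the real extract_at signature in chopin may differ:

# buffer point inputs when `radius` is supplied, otherwise overlay polygons as-is
extract_at_sketch <- function(x, surf, id, radius = NULL, fun = "mean") {
  if (!is.null(radius)) {
    x <- sf::st_buffer(x, dist = radius)
  }
  data.frame(id = x[[id]],
             value = exactextractr::exact_extract(surf, x, fun))
}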

`par_grid` improvement

  • integrate par_make_gridset into par_grid (custom pre-generated grid inputs optional)
  • quadtree for mode = "density" in par_make_gridset

Determine `sf`/`terra` class depending on future::plan() value

plan(multicore) is available only on *nix systems (it relies on forked evaluation). With plans other than multicore, terra objects are not exportable to parallel workers because they hold external pointers. par_* functions should use the plan value either to convert terra inputs or to warn when they are passed in (a minimal detection sketch follows the checklist below).

  • Using a combination of future::supportsMulticore() and future::plan(), edit par_* functions to display a message on non-exportable objects
  • Add a vignette to encourage users to pass file path strings rather than sf/terra objects directly
    • Following question: what if users want to use a database with multiple tables?
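A minimal detection sketch; the helper name is illustrative, not chopin's actual API:

check_terra_exportable <- function(x) {
  forked <- future::supportsMulticore() && inherits(future::plan(), "multicore")
  if (inherits(x, c("SpatRaster", "SpatVector")) && !forked) {
    warning("terra objects cannot be exported to non-forked workers; ",
            "pass a file path or convert the input to sf instead.")
  }
  invisible(x)
}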

Renaming functions

Some functions, distribute_* in particular, trigger lint errors because their names exceed the 30-character limit. As rOpenSci requires a submission to pass lint checks, abbreviating these function names is necessary. Naming conventions under consideration:

  1. {object type}_{function name} as seen in sf functions (st_*) or stringi functions (stri_*)
  2. Abbreviated not to overlap other generic function names as seen in terra (e.g., vect, rast, etc.)
  3. {function group}_{function name}_{specific use case}: currently used, but will get shorter group and function names

Perhaps distribute_* is too verbose, so I am considering par to represent parallelization (following the par* naming of the parallel package's functions).

distribute_process_grid to pargrid or par_grid or parGrid

  • Change function names
  • Edit vignettes and examples in functions
  • Pull to v0.3.0

PAT requirements for installation from GitHub in Mac

remotes::install_github("NIEHS/chopin")
# Using GitHub PAT from the git credential store.
# Error: Failed to install 'chopin' from GitHub:
#  HTTP error 403.
#  Resource protected by organization SAML enforcement. You must grant your Personal Access token access to this organization.
#
#  Rate limit remaining: 4995/5000
#  Rate limit reset at: 2024-04-24 18:41:23 UTC

It looks like only people whose personal access token has been authorized for organization access can install the package via remotes or pak. I have only seen this message on a Mac, so I will check other devices later.
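For reference, a sketch of the usual remedy (authorize the PAT for the organization's SAML SSO on github.com, then refresh the stored credential); the exact steps may vary:

# after authorizing the token for the organization in the GitHub web UI,
# update the credential store and retry the installation
gitcreds::gitcreds_set()
remotes::install_github("NIEHS/chopin")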

sessionInfo
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

Roll out a package for internal use

  • Expedite writing functions for elementary covariate computation
  • Unit-test the developed functions, troubleshoot, and revise
  • Documentation with roxygen2
  • [-] Integrate the package into SLURM job submission workflow
    • [-] Automate installation (i.e., package into *.tar.gz file and insert an installation script)

Review DeGAUSS containerized platforms

TODO

  • Check if it runs -- it runs smoothly on Apptainer
  • Check its scalability: review the code and run test scripts with 10K+ points
  • Adopt design ideas
  • Include container build-run scripts into Scalable_GIS package

Points of consideration

  • DeGAUSS is a specialized platform for calculating geomarkers over relatively local areas (i.e., accessibility is evaluated at affiliated medical centers); our target is the mainland US, so scalability matters more for us than for the existing platform
  • They host their files on Amazon S3; each container includes minimal base system components and scripts. Our system needs to map fixed (or parameterized) locations on ddn for reproducibility
  • Basic design
    • A text file to read user information for SLURM submission: e-mail, job alias, output and error log file names
    • A container with postscripts that install required packages we often use (i.e., exactextractr, terra, etc.)
    • A fixed script with default arguments, so that users have no trouble with entering incorrect/invalid arguments into calculation functions

Understanding SLURM clusters and R job submission test

Objective

  • The NIEHS HPC manages computational demands with SLURM. I will explore SLURM in general and prepare internal materials for submitting jobs to the NIEHS HPC from R.
  • Learning the existing package rslurm (a minimal example appears after this list)
  • Designing an efficient workflow to distribute spatiotemporal (covariate) computation tasks across the assigned computational assets using SLURM
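A minimal rslurm example, assuming SLURM is reachable from the R session; the function and parameter grid are illustrative only:

library(rslurm)

# placeholder covariate computation at one location
toy_covariate <- function(lon, lat) sqrt(lon^2 + lat^2)

pars <- data.frame(lon = runif(100, -120, -70), lat = runif(100, 25, 50))
sjob <- slurm_apply(toy_covariate, pars, jobname = "chopin_test",
                    nodes = 2, cpus_per_node = 4)
res <- get_slurm_out(sjob, outtype = "table")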

List of tasks and timeline

  • As per #3 , an array of essential functions will be prepared (around 09/05)
  • If possible, try setting up SLURM locally and practice submitting jobs to the management system
  • Prepare short hands-on examples for team members
  • Actual test on the NIEHS HPC and organize a practice session

0.8.0 development roadmap: addressing ROpenSci review

Main functions

  • Two parallel backends: future and mirai
  • Rename par_group_grid
  • extract_at as the only function for extracting raster values at generic polygons
  • Recheck examples of wrapper functions
  • Toy examples or make the current examples smaller
  • par_cut_coords: add sf::st_zm(drop = TRUE)
  • extract_at_poly: st_crs(polys) == st_crs(surf) check -> extract_at preprocesses CRS. Still having issues with a missing CRS.
  • CRS handling for all input classes (i.e., character inputs)
  • Elegant way of argument injection in par_grid, par_hierarchy, and par_multirasters, i.e., avoiding positional arguments
    • Use generic argument names x and y, in line with sf/terra conventions
  • ➡️ Seamless tar_target integration: return a WKT/WKB list to branch out (a sketch follows this list)
    • Test branching WKT vs sf objects ... the other argument in extract_at or parallelizable piece of chopin and others' functions?
    • [ ] Hierarchy case: truly nested vs. different hierarchies across branches -- spin off a hierarchy definition function from par_hierarchy that takes a simple hierarchy object like the output of par_make_gridset
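A hedged sketch of the WKT branching idea with {targets}; the file name, worker function, and CRS are illustrative, not chopin's actual API (contents of a hypothetical _targets.R):

library(targets)

# hypothetical per-branch worker: rebuild geometry from WKT and buffer it
buffer_one <- function(wkt) {
  sf::st_buffer(sf::st_as_sfc(wkt, crs = 5070), dist = 1e4)
}

list(
  tar_target(pts_wkt,
             sf::st_as_text(sf::st_geometry(sf::st_read("points.gpkg")))),
  tar_target(buffered,
             buffer_one(pts_wkt),
             pattern = map(pts_wkt))  # one branch per WKT string
)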

Auxiliary/internal functions

  • Do not export aux functions
  • dep_switch: add potential use cases, narrow the scope
    • documentation: supported object types in sf and terra
  • Verbosity control for debugging and being explicit about intermediate spatial data manipulation steps (i.e., reprojection, extent setting, etc.)

Documentation

  • Clear distinction between summarize_aw (vector-vector) and extract_at_buffer_kernel (vector-raster)
  • Move 2D geometries caveats to the later part of README
  • Multiple points of entry: add quick-start code snippets to the README, introductory vignettes, and package documentation
  • Remove acronym from DESCRIPTION and consider more generalized package title
  • Flowchart
    • Split into clearly separate plots
    • Add explicit elaboration of the plots in README
  • Reorganize vignettes
    • Benchmark cases into a separate vignette
    • Extend HPC vignettes
  • Change default future::plan to future::multisession
    • Internally convert terra input to sf if multisession is detected (#82)
  • Precomputing: fix broken figure paths
  • Clarify inexhaustive grid in par_group_balanced documentation
  • Add justification to all default argument values
  • Fix URLs (to internal and external functions as well as internet links)
  • Add CITATION file
  • Remove "To run this example"
  • Clarification of the concept of SEDC in summarize_sedc

Programming practices

  • Standard S3 practices for switching functions (a sketch follows this list)
  • Add expected values in expect_message
  • Reduce the number of expect_no_error
  • Avoid single long tests
  • Legibility of package loading code: use library()
  • Community guidelines (i.e., CONTRIBUTING.md)
  • Codemetar synchronization (cf. link)
  • Standard messaging functions instead of cat or print
  • Use inherits() rather than grepl() on class() outputs
  • Add tests for different operating systems
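A sketch of two of the practices above (S3 dispatch for switching functions, inherits() instead of grepl() on class()); the generic and method names are illustrative:

chopin_area <- function(x, ...) UseMethod("chopin_area")
chopin_area.sf <- function(x, ...) sf::st_area(x)
chopin_area.SpatVector <- function(x, ...) terra::expanse(x)
chopin_area.default <- function(x, ...) {
  stop("unsupported class: ", paste(class(x), collapse = ", "))
}

v <- terra::vect(system.file("ex/lux.shp", package = "terra"))
inherits(v, "SpatVector")  # robust, unlike grepl("SpatVector", class(v)[1])
chopin_area(v)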

Exploration

  • wk dependency for lightweight branching
    • Performance comparison

Deploy website

@Spatiotemporal-Exposures-and-Toxicology Could you activate github.io webpage for this repository? Perhaps we could change the repository name when the webpage is deployed. Thank you!

Quick development and performance comparison

Objective

  • To develop wrapper functions that organize low-level functions to serve common uses for geospatial exposure modeling

List of quick development/comparison tasks

  • Calculating the distance to the nearest feature (a benchmark sketch follows this list)
    • Processing speed comparison: sf::st_nearest_feature() vs. nngeo::st_nn() vs. terra::nearest()
    • points
    • lines
  • Vector-Raster overlay to calculate summary
    • Point-based
    • Point buffers
    • Polygon-based
    • (Line?)
  • Area-weighted variable recalculation
    • sf::st_interpolate_aw(); is there a terra equivalent? (needs to be developed)
    • administrative areas with nonexhaustive and misaligned boundaries
  • CPU parallelization
    • parallel::parLapply() vs foreach::foreach() vs future.apply::future_apply()
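A minimal benchmark sketch for the nearest-feature comparison; the random planar points, sizes, and use of bench::mark are illustrative:

library(sf)
library(terra)

set.seed(1)
make_pts <- function(n) {
  st_as_sf(data.frame(x = runif(n), y = runif(n)), coords = c("x", "y"))
}
a <- make_pts(1000)
b <- make_pts(1000)

bench::mark(
  sf    = st_nearest_feature(a, b),
  terra = terra::nearest(vect(a), vect(b)),
  check = FALSE  # the two backends return different structures
)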

Name change

Candidates include:

  1. BACH: Batch Analysis for Climate and Health data
  2. CHOPIN: Computation for Climate and Health research On Parallelized INfrastructure

Rasterization option in `extract_at`

The current implementation of extract_at only accepts vector inputs. We will consider raster weights in the _kernel subfunction to accommodate irregular polygon inputs and shorten processing time.

  • Practice raster weights in exactextractr::exact_extract() (a sketch follows this list)
  • Add a helper function to convert vectors into weight rasters
  • Add tests
  • Merge to main
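A sketch of raster weights in exact_extract(); the value and weight rasters below are synthetic placeholders:

library(terra)
library(sf)

vals  <- rast(nrows = 50, ncols = 50, xmin = 0, xmax = 1e4, ymin = 0, ymax = 1e4,
              vals = runif(2500))
wghts <- rast(vals, vals = runif(2500))  # e.g., a kernel-derived weight surface
poly  <- st_as_sf(st_as_sfc("POLYGON ((1000 1000, 9000 1000, 9000 9000, 1000 1000))"))

# weighted mean of `vals` within the polygon, weighted by `wghts`
exactextractr::exact_extract(vals, poly, "weighted_mean", weights = wghts)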

Maximum number of threads per R installation

The default build of R allows up to 128 connections, which caps the number of concurrent parallel workers at about 125. This is a problem when running many tasks across different nodes on the HPC. Running R sessions in containers needs to be tested to confirm whether doing so is unaffected by the connection limit of the local R installation on the HPC.
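A quick way to check the ceiling from R, assuming the parallelly package is available:

parallelly::availableConnections()  # total connections the build allows (typically 128)
parallelly::freeConnections()       # how many workers can still be launched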

  • Test running Apptainer image with sufficiently large number of threads
  • Add documentation for bypassing local R thread number limitations

Self-fix function for distance calculation parallelization

Distance calculation parallelization over spatial extents smaller than the entire dataset's extent may produce erroneous values when some grids/sub-regions contain no target features, or when edge cases sit near the boundary between adjacent grids/sub-regions. Gradually expanding grids can be used to fix such edge cases. One challenge is to design a function that determines whether the current result is shorter or longer than the true shortest distance to the nearest feature that would have been found using the full dataset.

Problem statement

Given a grid $G_i$, a point or line target feature set $V$, and a point origin feature set $U$, we want to find
$\text{if } \sup d((U_k \cap G_i), (V_l \cap G_i)) < \sup d((U_k \cap G_i), V)$, or
$\text{if } \inf d((U_k \cap G_i), (V_l \cap G_i)) > \sup d((U_k \cap G_i), V) \quad \forall k, l$
The $\inf$ problem is the relevant one, since we are calculating the shortest distance to the target feature set.

Hypothesis

  • A distance is considered suspicious/sub-optimal when it is longer than the distance from the point to the grid boundary (a sketch of this check follows).
  • Hypothesis implementation
    • Gradual increment of the search window
    • Check the influence on performance
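A hedged sketch of the boundary check; it assumes sf objects in a projected CRS, and the object names (origins, targets_in_grid, grid_poly) are illustrative:

flag_suspicious <- function(origins, targets_in_grid, grid_poly) {
  idx    <- sf::st_nearest_feature(origins, targets_in_grid)
  d_near <- sf::st_distance(origins, targets_in_grid[idx, ], by_element = TRUE)
  d_edge <- sf::st_distance(origins, sf::st_cast(grid_poly, "MULTILINESTRING"))[, 1]
  # flag points whose within-grid nearest distance exceeds their distance to the
  # grid boundary: a nearer feature could exist just outside the grid
  as.numeric(d_near) > as.numeric(d_edge)
}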

0.8.0 roadmap: Addressing ROpenSci Review


Long-term development targets

  • Bridging Python packages (dask-geopandas, geopolars, cuSpatial) into the R workflow
  • Optimal splitting of computational regions regarding shape complexity and input data resolutions
  • Streamlined integration of sparklyr and sf-derived functions

--- more ideas to comments ---
