
chopin's Issues

grid_advanced mode merges grids too aggressively

  • mode = "grid_advanced" utilizes minimum spanning tree to merge adjacent grids with intersecting grids less than a threshold.
  • In a highly clustered point sets, this approach results in broad outskirts and a few internal unmerged grids. Points in the outskirt could exceed the number of intersecting points with any grids, which is not intended.
  • Simple go-around is that the merged grids are split by a few horizontal/vertical line
    • Could we design this in a sophisticated way while keeping a decent performance?
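A minimal sketch of the workaround idea using plain sf, where the merged grid is a placeholder polygon rather than actual par_make_gridset output:

library(sf)
# placeholder for one overly large merged grid polygon
merged <- st_as_sfc("POLYGON ((0 0, 10 0, 10 4, 4 4, 4 10, 0 10, 0 0))")
# cut it with a coarse 2 x 2 set of horizontal/vertical lines
cutter <- st_make_grid(merged, n = c(2, 2))
pieces <- st_intersection(merged, cutter)
length(pieces)  # the merged grid is now several smaller pieces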

error running README examples

srtm <- terra::unwrap(readRDS("../../tests/testdata/nc_srtm15_otm.rds")) fails for me; I think this code is intended to work only when run from a checkout of the development version of the package.

We could use a system.file() call, as you did with the example sf data, to make that work (although test files are not installed by default when using the testthat package for tests; you might have to include this example data in the package itself).
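A sketch of that suggestion, assuming the .rds file is shipped under inst/extdata of chopin (the exact subdirectory is an assumption):

# load the example raster from the installed package instead of a relative path into tests/
srtm_path <- system.file("extdata", "nc_srtm15_otm.rds", package = "chopin")
srtm <- terra::unwrap(readRDS(srtm_path))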

Using a README.Rmd to knit a README file would be a good way to include these in the CI process.

sparklyr and spatial extension test

  • sparklyr: a connector between spark and dplyr
    • Supports spark installation function spark_install(), no pain in spark configuration
    • Paired use with distributed databases (e.g., Hive on HDFS) will make a distributed computing workflow smoother and more efficient, especially when handling large (e.g., tens of gigabytes or more) geospatial datasets
    • Tried standalone use on my local machine; I will try it on the HPC after getting access to it
    • To orchestrate Spark and HDFS, the "gateway" module should be installed in "all nodes" in HPC; I am still figuring out what exactly we need to configure to use Spark and HDFS on NIEHS HPC. (cf. a figure about spark configuration in HPC)
    • Will learn settings in detail (i.e., caching for datasets in frequent use, worker specification, etc.)
    • How to use Spark in Apptainer: mount ddn as an external drive in the container, then assign memory and CPU resources to it?
  • sedona: a spark extension for spatial data analysis (previously geospark)
    • A pair of Java applications (in *.jar extension) that support geospatial capabilities in Spark engine
    • apache.sedona offers functions to use sedona with sparklyr
      • The user base seems not as solid as that of the Python equivalent (19K total downloads in R vs. 14M in Python)... we may need to use the Spark module in Python and then find a way to connect it with the main R workflow.
    • Direct input is supported for several geospatial data formats (Shapefile, geoparquet, and geojson)
    • Currently there is an issue with converting a Spark table that has a geometry column (usually a list column of WKT strings) back to an sf column, as sparklyr strongly assumes that every column in a table is an atomic vector (see the roundtrip sketch after this list).
  • More ideas
    • GPU capability with the RAPIDS extension: no R API exists; we might need to build one from scratch.
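On the WKT-to-sf conversion issue above, a minimal roundtrip sketch with a toy table and a local Spark master (not Sedona-specific):

library(sparklyr)
library(dplyr)
library(sf)

sc <- spark_connect(master = "local")

# a toy table whose geometry is stored as WKT strings
pts <- data.frame(id = 1:3, wkt = c("POINT (0 0)", "POINT (1 1)", "POINT (2 2)"))
pts_spark <- copy_to(sc, pts, overwrite = TRUE)

# collect the table back to R, then rebuild the sf geometry column from WKT
pts_sf <- pts_spark |>
  collect() |>
  mutate(geometry = st_as_sfc(wkt, crs = 4326)) |>
  st_as_sf()

spark_disconnect(sc)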

Align input-output class in `par_make_gridset(..., mode="grid_advanced")`

Error description

Unlike the other mode settings in par_make_gridset, par_make_gridset(..., mode = "grid_advanced") returns a list of SpatVector objects.

Reproducible code

library(sf)
library(terra)
library(tigris)
library(chopin)
library(spatstat.random)
set.seed(2024)
# Read the nc example gpkg file from sf package
nc <- st_read(system.file("gpkg/nc.gpkg", package = "sf"))
nc <- st_transform(nc, 5070)

# Sample clustered points using st_sample (input is sf)
sampled_points <- st_sample(nc, type = "Thomas", mu = 3e-9, scale = 1000, kappa = 10)

# grid merge
grid_merge <- chopin::par_make_gridset(
  sampled_points,
  mode = "grid_advanced",
  nx = 24L, ny = 12L,
  grid_min_features = 30L,
  padding = 2e4
)
plot(sampled_points, pch = 19, cex = 0.2)
plot(grid_merge$padded, add = TRUE, border = "red", lwd = 2)

sapply(grid_merge, class)
# elements are SpatVector
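As an interim workaround (a sketch only), the SpatVector elements can be converted back to sf so that the output class matches the sf input:

grid_merge_sf <- lapply(grid_merge, sf::st_as_sf)
vapply(grid_merge_sf, function(x) class(x)[1], character(1))
# all elements are now sf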

Expected behavior

Input and output are supposed to be the same class.

sessionInfo() results

N/A: does not depend on local systems.

Estimate computational demands for short- and long-term tasks

The ultimate objective of this project is to make scalable GIS computation easier for adequately versed R/GIS users (e.g., master's students in epidemiology or geography). The short-term goal, however, is to serve the {SET} group's NRT-AP project by computing the required covariates nationwide (i.e., the mainland United States). This issue is for estimating the scale of covariate computation for the NRT-AP project, as well as the computational scales that potential users would need to work with.

NRT-AP

  • Mainland US: ~8 million sq km; split the mainland into standard rectangular regions

    • 1 km spatial resolution: 8 million points
    • 500 m spatial resolution: 32 million points
    • 2 km spatial resolution: 2 million points
    • Splitting the mainland into rectangular regions would be a viable option since the target points are generated regularly. Rectangular regions also reduce the size of the base-data subsets (e.g., raster data) needed for covariates
  • Checklist for scaling up the computation

    • Memory footprint
    • Multicore or multisession parallelization
    • Storage read speed and memory bandwidth + inter-node network bandwidth
    • Total data amount to distribute to each node
    • Configuration of preprocessing and main processing assets

Brainstorming possible use cases

  • Irregular individual residential locations in highly dense urban areas
  • Temporally regular sample points along trajectories
  • Sample points with spatially varying density

Function ideas

  • Takes hardware configuration, data dimensions and complexity (e.g., shape complexity in vector data), and data size as inputs, then determines the optimal splits to reduce running time (a toy sketch follows this list)
  • Merge adjacent computational grids with a small number of target points (grid_merge; implemented)
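A toy sketch of the first idea; the heuristic and all constants are placeholders, not chopin's actual logic:

# estimate how many splits are needed so each chunk fits in a worker's memory,
# given a rough per-point footprint; all numbers are illustrative
estimate_splits <- function(n_points, bytes_per_point = 2e4,
                            mem_per_worker_gb = 4, n_workers = 8) {
  total_gb <- n_points * bytes_per_point / 1e9
  splits_for_memory <- ceiling(total_gb / mem_per_worker_gb)
  max(splits_for_memory, n_workers)  # at least one chunk per worker
}
estimate_splits(8e6)  # e.g., the 1-km NRT-AP point grid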

Performance benchmark

Benchmark design: one case, two functions

Case: Crop type summary

  • Hardware setting
    • HPC via SLURM (cores per task: 1 vs. 50-150, depending on memory footprint)
  • Data
    • USDA Cropland data (spatial resolution: 30 meters, 2020)
  • Test 1: distribute across grids, crop type summary (a single-node sketch appears after this list)
    • Generate points: 8+ million points in the mainland US
    • Buffer: 10 kilometers
  • Test 2: distribute through hierarchy
    • Polygons: 238+K block groups in the US (2020)
    • Hierarchy: county
  • Test 3: nearest road join to point
    • 8+ million points -> 1.7 million
    • Level-2 roads from the Census Bureau / North American Roads from the Department of Transportation
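A single-node sketch of Test 1 (per-class fractions within 10-km point buffers); the raster and points below are synthetic placeholders, not the USDA cropland data or the national point set:

library(terra)
library(sf)

# synthetic 5-class raster standing in for the cropland layer
crop <- rast(nrows = 100, ncols = 100, xmin = 0, xmax = 1e5, ymin = 0, ymax = 1e5,
             vals = sample(1:5, 1e4, replace = TRUE))
# synthetic points standing in for the national point set
pts <- st_as_sf(data.frame(x = runif(20, 0, 1e5), y = runif(20, 0, 1e5)),
                coords = c("x", "y"))

bufs <- st_buffer(pts, dist = 10000)  # 10-km buffers
crop_frac <- exactextractr::exact_extract(crop, bufs, "frac")  # per-class fractions
head(crop_frac)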

Add use cases

  • Deliverables: markdown (package vignettes)
  • The easiest and fastest approach is to convert existing tests into markdown vignettes.
  • Aim: add at least two vignettes around Nov. 30th, 2023
    • Understanding distribute_process_* functions and their concepts/applicable situations
    • Extract values at buffer/polygons
    • SEDC calculation and optimal bandwidth search -- is it relevant to this package?

Test-coverage workflow integration

  • The functions of this package will be tested, and coverage will be assessed, with the testthat package
  • TODO
    • Prepare minimal datasets to test the functions (Insang)
    • Design effective strategies to test the scalability of the functions (Insang)
    • Add YAML file(s) to activate test-coverage and R CMD build workflows (Kyle; a usethis shortcut is sketched after this list)
    • Write valid tests for all functions
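A hedged shortcut for the workflow YAML item, assuming usethis fits the development workflow; these calls copy standard GitHub Actions templates into .github/workflows/:

usethis::use_github_action("check-standard")  # R CMD check across operating systems
usethis::use_github_action("test-coverage")   # covr coverage report upload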

Zarr performance test

Background and objective

  • Zarr is a modern raster data format optimized for cloud storage and distributed data provision
  • The gist of Zarr format is to split a multidimensional raster into single "chunked" layers in a directory structure.
  • A question is: would this format result in shorter processing time than conventional approaches (e.g., NetCDF (climate data), ERDAS Imagine (NLCD), GeoTIFF (various elevation data))?

First test

  • There is seemingly no way to export raster data in Zarr format from an R interface (but see the gdal_utils() sketch after this list)
  • Available approaches include the gdalmdimtranslate command-line tool or xarray in Python
  • I tried converting the 2021 CONUS NLCD data (27 GB, compressed) to Zarr using both tools
    • gdalmdimtranslate resulted in an error of "Cannot guess driver"
    • xarray.Dataset.to_zarr() crashed with a "cannot allocate 63.4 GB in memory" error (my laptop has 32 GB of memory, 24 GB available)
  • One day of MERRA-2 data was converted to Zarr (data size increased from 1.1 GB to 1.3 GB); a performance test is underway
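A hedged R-side alternative to the command-line tool: recent sf versions expose gdalmdimtranslate through gdal_utils(). This requires a GDAL build with the Zarr driver, and the MERRA-2 file name below is only illustrative:

sf::gdal_utils(
  util = "mdimtranslate",
  source = "merra2_20230101.nc4",
  destination = "merra2_20230101.zarr",
  options = c("-of", "Zarr")
)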

distribute_process_multirasters output formatting

distribute_process_multirasters works as designed, but its output makes little sense because it does not record which file each summary value was calculated from. The internal processing should add file names alongside the id and raster layer names.
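A sketch of the intended output shape: loop over raster files and carry the file name into the result. The helper and its inputs (raster_paths, polys with an id column) are hypothetical, not the current internals:

summarize_multirasters <- function(raster_paths, polys, fun = "mean") {
  res <- lapply(raster_paths, function(p) {
    r <- terra::rast(p)
    out <- exactextractr::exact_extract(r, polys, fun)
    # keep the source file so each summary can be traced back to its raster
    data.frame(id = polys$id, value = out, file = basename(p))
  })
  do.call(rbind, res)
}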

Meanwhile, Codecov behaves differently for distribute_process_multirasters than for other functions: main lines of the function are marked as untested even though the test suite exercises them.

Existing R packages for spatial analysis for spatial epidemiology

  1. SpatialEpi (CRAN Link)

    • Statistical tests (e.g., spatial cluster detection with the Besag-Newell and Kulldorff methods)
    • Assumes input datasets that are already prepared for analysis
    • Inputs are points: polygons should be converted to centroids for analysis
  2. SpatialEpiApp (Moraga 2017)

    • An R Shiny app leveraging multiple external packages, including INLA and SaTScan
    • Functions are mostly for statistical analysis, not for geospatial data handling to obtain variables from geospatial datasets
    • Seemingly not maintained by the author
  3. aegis (Application for Epidemiological Geographic Information System) (Cho et al. 2020)

    • An R Shiny app supporting cohort definition, temporal exploration, disease mapping, clustering, and interactive visualization of health outcomes/modeling results
    • Modeling module is based on R-INLA
    • Data standardization functions that make Korean National Health Insurance System cohort data compliant with the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM)
    • Uses Database of Global Administrative Areas (GADM) for potential users outside Korea
  • More to come

add climate applications

As we move towards writing the manuscript, let's consider our target audience and the interest in climate and health modeling. The README should reflect common applications in climate and health.

Poster feedback

  • Split input locations/areas into uniformly sized subsets
  • Bypass the limits on the number of processes
  • Give users guidance on the potential (in)efficiency of parallelization given their input data

Support for other classes for spatial data + lower level access

  • To make the package future-proof, the package should be flexible to accommodate new spatial data I/O and processing packages that will be developed in the future.
  • For good performance, lower-level access for data I/O and primitive operations will be helpful.

CRAN Submission

  • Track CRAN submission system recovery
  • Hard check on CRAN requirements
  • Check built package file size (up to 100MB, each folder exceeding 1MB is noted)
  • Submit to CRAN

update README installation

Error description

@sigmafelix I noticed the README installation instructions include an out-of-date reference to github/spatiotemporal-exposure-and-toxicology.
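A sketch of the corrected instruction, assuming the repository now lives under the NIEHS organization (as referenced elsewhere in these issues):

# install the development version from the current repository location
remotes::install_github("NIEHS/chopin")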

`extract_at` integration

extract_at_buffer and extract_at_poly differ only slightly: extract_at_buffer uses a radius argument to generate circular buffers. For brevity, it is reasonable to expose a single function, extract_at, for both point and polygon overlays on rasters.
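A hypothetical sketch of the unified interface; the real extract_at signature in chopin may differ:

# buffer point inputs when `radius` is supplied, otherwise overlay polygons as-is
extract_at_sketch <- function(x, surf, id, radius = NULL, fun = "mean") {
  if (!is.null(radius)) {
    x <- sf::st_buffer(x, dist = radius)
  }
  data.frame(id = x[[id]],
             value = exactextractr::exact_extract(surf, x, fun))
}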

`par_grid` improvement

  • integrate par_make_gridset into par_grid (custom pre-generated grid inputs optional)
  • quadtree for mode = "density" in par_make_gridset

Determine `sf`/`terra` class depending on future::plan() value

plan(multicore) is available only on *nix systems (it relies on forked evaluation). With plans other than multicore, terra objects are not exportable to parallel workers because they hold external pointers. par_* functions should use the plan value either to convert terra inputs or to warn when they are passed in (a minimal detection sketch follows the checklist below).

  • Using a combination of future::supportsMulticore() and future::plan(), edit par_* functions to display a message on non-exportable objects
  • Add a vignette to encourage users to pass file path strings rather than sf/terra objects directly
    • Following question: what if users want to use a database with multiple tables?
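A minimal detection sketch; the helper name is illustrative, not chopin's actual API:

check_terra_exportable <- function(x) {
  forked <- future::supportsMulticore() && inherits(future::plan(), "multicore")
  if (inherits(x, c("SpatRaster", "SpatVector")) && !forked) {
    warning("terra objects cannot be exported to non-forked workers; ",
            "pass a file path or convert the input to sf instead.")
  }
  invisible(x)
}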

Renaming functions

Some functions, distribute_* in particular, trigger lint errors because their names exceed the 30-character limit. As rOpenSci requires a submission to pass lint checks, abbreviating these function names is necessary. Naming conventions under consideration:

  1. {object type}_{function name} as seen in sf functions (st_*) or stringi functions (stri_*)
  2. Abbreviated not to overlap other generic function names as seen in terra (e.g., vect, rast, etc.)
  3. {function group}_{function name}_{specific use case}: currently used, but will get shorter group and function names

Perhaps distribute_* is too verbose, so I am considering par to represent parallelization (following the par* naming of the parallel package's functions).

distribute_process_grid to pargrid or par_grid or parGrid

  • Change function names
  • Edit vignettes and examples in functions
  • Pull to v0.3.0

PAT requirements for installation from GitHub in Mac

remotes::install_github("NIEHS/chopin")
# Using GitHub PAT from the git credential store.
# Error: Failed to install 'chopin' from GitHub:
#  HTTP error 403.
#  Resource protected by organization SAML enforcement. You must grant your Personal Access token access to this organization.
#
#  Rate limit remaining: 4995/5000
#  Rate limit reset at: 2024-04-24 18:41:23 UTC

It looks like only people whose personal access token has been authorized for organization access can install the package via remotes or pak. I have only seen this message on a Mac, so I will check other devices later.
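For reference, a sketch of the usual remedy (authorize the PAT for the organization's SAML SSO on github.com, then refresh the stored credential); the exact steps may vary:

# after authorizing the token for the organization in the GitHub web UI,
# update the credential store and retry the installation
gitcreds::gitcreds_set()
remotes::install_github("NIEHS/chopin")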

sessionInfo
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

Roll out a package for internal use

  • Expedite writing functions for elementary covariate computation
  • Unit-test the developed functions, troubleshoot, and revise
  • Documentation with roxygen2
  • [-] Integrate the package into SLURM job submission workflow
    • [-] Automate installation (i.e., package into *.tar.gz file and insert an installation script)

Review DeGAUSS containerized platforms

TODO

  • Check if it runs -- it runs smoothly on Apptainer
  • Check its scalability: review the code and run test scripts with 10K+ points
  • Adopt design ideas
  • Include container build-run scripts into Scalable_GIS package

Points of consideration

  • DeGAUSS is a specialized platform for calculating geomarkers over relatively local areas (i.e., accessibility is evaluated at affiliated medical centers); our target is the mainland US, so scalability matters more for us than for the existing platform
  • They host their files on Amazon S3; each container includes minimal base system components and scripts. Our system needs to map fixed (or parameterized) locations on ddn for reproducibility
  • Basic design
    • A text file to read user information for SLURM submission: e-mail, job alias, output and error log file names
    • A container with postscripts that install required packages we often use (i.e., exactextractr, terra, etc.)
    • A fixed script with default arguments, so that users have no trouble with entering incorrect/invalid arguments into calculation functions

Understanding SLURM clusters and R job submission test

Objective

  • The NIEHS HPC manages computational demands with SLURM. I will explore SLURM in general and prepare internal materials for submitting jobs to the NIEHS HPC from R.
  • Learning the existing package rslurm (a minimal example appears after this list)
  • Designing an efficient workflow to distribute spatiotemporal (covariate) computation tasks across the assigned computational assets using SLURM
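A minimal rslurm example, assuming SLURM is reachable from the R session; the function and parameter grid are illustrative only:

library(rslurm)

# placeholder covariate computation at one location
toy_covariate <- function(lon, lat) sqrt(lon^2 + lat^2)

pars <- data.frame(lon = runif(100, -120, -70), lat = runif(100, 25, 50))
sjob <- slurm_apply(toy_covariate, pars, jobname = "chopin_test",
                    nodes = 2, cpus_per_node = 4)
res <- get_slurm_out(sjob, outtype = "table")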

List of tasks and timeline

  • As per #3 , an array of essential functions will be prepared (around 09/05)
  • If possible, try setting up SLURM locally and practice submitting jobs to the management system
  • Prepare short hands-on examples for team members
  • Actual test on the NIEHS HPC and organize a practice session

0.8.0 development roadmap: addressing ROpenSci review

Main functions

  • Two parallel backends: future and mirai
  • Rename par_group_grid
  • extract_at as the only function for extracting raster values at generic polygons
  • Recheck examples of wrapper functions
  • Toy examples or make the current examples smaller
  • par_cut_coords: add sf::st_zm(drop = TRUE)
  • extract_at_poly: st_crs(polys) == st_crs(surf) check -> extract_at preprocesses CRS. Still having issues with a missing CRS.
  • CRS handling for all input classes (i.e., character inputs)
  • Elegant way of argument injection in par_grid, par_hierarchy, and par_multirasters, i.e., avoiding positional arguments
    • Use generic argument names x and y, in line with sf/terra conventions
  • ➡️ Seamless tar_target integration: return a WKT/WKB list to branch out (a sketch follows this list)
    • Test branching WKT vs sf objects ... the other argument in extract_at or parallelizable piece of chopin and others' functions?
    • [ ] Hierarchy case: truly nested vs. different hierarchies across branches -- spin off a hierarchy definition function from par_hierarchy that takes a simple hierarchy object like the output of par_make_gridset
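A hedged sketch of the WKT branching idea with {targets}; the file name, worker function, and CRS are illustrative, not chopin's actual API (contents of a hypothetical _targets.R):

library(targets)

# hypothetical per-branch worker: rebuild geometry from WKT and buffer it
buffer_one <- function(wkt) {
  sf::st_buffer(sf::st_as_sfc(wkt, crs = 5070), dist = 1e4)
}

list(
  tar_target(pts_wkt,
             sf::st_as_text(sf::st_geometry(sf::st_read("points.gpkg")))),
  tar_target(buffered,
             buffer_one(pts_wkt),
             pattern = map(pts_wkt))  # one branch per WKT string
)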

Auxiliary/internal functions

  • Do not export aux functions
  • dep_switch: add potential use cases, narrow the scope
    • documentation: supported object types in sf and terra
  • Verbosity control for debugging and being explicit about intermediate spatial data manipulation steps (i.e., reprojection, extent setting, etc.)

Documentation

  • Clear distinction between summarize_aw (vector-vector) and extract_at_buffer_kernel (vector-raster)
  • Move 2D geometries caveats to the later part of README
  • Multiple points of entry: add quick-start code snippets to the README, introductory vignettes, and package documentation
  • Remove acronym from DESCRIPTION and consider more generalized package title
  • Flowchart
    • Split into clearly separate plots
    • Add explicit elaboration of the plots in README
  • Reorganize vignettes
    • Benchmark cases into a separate vignette
    • Extend HPC vignettes
  • Change default future::plan to future::multisession
    • Internally convert terra input to sf if multisession is detected (#82)
  • Precomputing: fix broken figure paths
  • Clarify inexhaustive grid in par_group_balanced documentation
  • Add justification to all default argument values
  • Fix URLs (to internal and external functions as well as internet links)
  • Add CITATION file
  • Remove "To run this example"
  • Clarification of the concept of SEDC in summarize_sedc

Programming practices

  • Standard S3 practices for switching functions (a sketch follows this list)
  • Add expected values in expect_message
  • Reduce the number of expect_no_error
  • Avoid single long tests
  • Legibility of package loading code: use library()
  • Community guidelines (i.e., CONTRIBUTING.md)
  • Codemetar synchronization (cf. link)
  • Standard messaging functions instead of cat or print
  • Use inherits() rather than grepl() on class() outputs
  • Add tests for different operating systems
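A sketch of two of the practices above (S3 dispatch for switching functions, inherits() instead of grepl() on class()); the generic and method names are illustrative:

chopin_area <- function(x, ...) UseMethod("chopin_area")
chopin_area.sf <- function(x, ...) sf::st_area(x)
chopin_area.SpatVector <- function(x, ...) terra::expanse(x)
chopin_area.default <- function(x, ...) {
  stop("unsupported class: ", paste(class(x), collapse = ", "))
}

v <- terra::vect(system.file("ex/lux.shp", package = "terra"))
inherits(v, "SpatVector")  # robust, unlike grepl("SpatVector", class(v)[1])
chopin_area(v)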

Exploration

  • wk dependency for lightweight branching
    • Performance comparison

Deploy website

@Spatiotemporal-Exposures-and-Toxicology Could you activate github.io webpage for this repository? Perhaps we could change the repository name when the webpage is deployed. Thank you!

Quick development and performance comparison

Objective

  • To develop wrapper functions that organize low-level functions to serve common uses for geospatial exposure modeling

List of quick development/comparison tasks

  • Calculating the distance to the nearest feature (a benchmark sketch follows this list)
    • Processing speed comparison: sf::st_nearest_feature() vs. nngeo::st_nn() vs. terra::nearest()
    • points
    • lines
  • Vector-Raster overlay to calculate summary
    • Point-based
    • Point buffers
    • Polygon-based
    • (Line?)
  • Area-weighted variable recalculation
    • sf::st_interpolate_aw(); is there a terra equivalent? (needs to be developed)
    • administrative areas with nonexhaustive and misaligned boundaries
  • CPU parallelization
    • parallel::parLapply() vs foreach::foreach() vs future.apply::future_apply()
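A minimal benchmark sketch for the nearest-feature comparison; the random planar points, sizes, and use of bench::mark are illustrative:

library(sf)
library(terra)

set.seed(1)
make_pts <- function(n) {
  st_as_sf(data.frame(x = runif(n), y = runif(n)), coords = c("x", "y"))
}
a <- make_pts(1000)
b <- make_pts(1000)

bench::mark(
  sf    = st_nearest_feature(a, b),
  terra = terra::nearest(vect(a), vect(b)),
  check = FALSE  # the two backends return different structures
)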

Name change

Candidates include:

  1. BACH: Batch Analysis for Climate and Health data
  2. CHOPIN: Computation for Climate and Health research On Parallelized INfrastructure

Rasterization option in `extract_at`

The current implementation of extract_at only accepts vector inputs. We will consider raster weights in the _kernel subfunction to accommodate irregular polygon inputs and shorten processing time.

  • Practice raster weights in exactextractr::exact_extract() (a sketch follows this list)
  • Add a helper function to convert vectors into weight rasters
  • Add tests
  • Merge to main
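A sketch of raster weights in exact_extract(); the value and weight rasters below are synthetic placeholders:

library(terra)
library(sf)

vals  <- rast(nrows = 50, ncols = 50, xmin = 0, xmax = 1e4, ymin = 0, ymax = 1e4,
              vals = runif(2500))
wghts <- rast(vals, vals = runif(2500))  # e.g., a kernel-derived weight surface
poly  <- st_as_sf(st_as_sfc("POLYGON ((1000 1000, 9000 1000, 9000 9000, 1000 1000))"))

# weighted mean of `vals` within the polygon, weighted by `wghts`
exactextractr::exact_extract(vals, poly, "weighted_mean", weights = wghts)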

Maximum number of threads per R installation

The default build of R allows up to 128 connections, which caps the number of concurrent parallel workers at about 125. This is a problem when running many tasks across different nodes on the HPC. Running R sessions in containers needs to be tested to confirm whether doing so is unaffected by the connection limit of the local R installation on the HPC.
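A quick way to check the ceiling from R, assuming the parallelly package is available:

parallelly::availableConnections()  # total connections the build allows (typically 128)
parallelly::freeConnections()       # how many workers can still be launched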

  • Test running Apptainer image with sufficiently large number of threads
  • Add documentation for bypassing local R thread number limitations

Self-fix function for distance calculation parallelization

Distance calculation parallelization over spatial extents smaller than the entire dataset's extent may produce erroneous values when some grids/sub-regions contain no target features, or when edge cases sit near the boundary between adjacent grids/sub-regions. Gradually expanding grids can be used to fix such edge cases. One challenge is to design a function that determines whether the current result is shorter or longer than the true shortest distance to the nearest feature that would have been found using the full dataset.

Problem statement

Given a grid $G_i$, a point or line target feature set $V$, and a point origin feature set $U$, we want to find
$\text{if } \sup d((U_k \cap G_i), (V_l \cap G_i)) < \sup d((U_k \cap G_i), V)$, or
$\text{if } \inf d((U_k \cap G_i), (V_l \cap G_i)) > \sup d((U_k \cap G_i), V) \quad \forall k, l$
The $\inf$ problem is the relevant one, since we are calculating the shortest distance to the target feature set.

Hypothesis

  • A distance is considered suspicious/sub-optimal when it is longer than the distance from the point to the grid boundary (a sketch of this check follows).
  • Hypothesis implementation
    • Gradual increment of the search window
    • Check the influence on performance
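A hedged sketch of the boundary check; it assumes sf objects in a projected CRS, and the object names (origins, targets_in_grid, grid_poly) are illustrative:

flag_suspicious <- function(origins, targets_in_grid, grid_poly) {
  idx    <- sf::st_nearest_feature(origins, targets_in_grid)
  d_near <- sf::st_distance(origins, targets_in_grid[idx, ], by_element = TRUE)
  d_edge <- sf::st_distance(origins, sf::st_cast(grid_poly, "MULTILINESTRING"))[, 1]
  # flag points whose within-grid nearest distance exceeds their distance to the
  # grid boundary: a nearer feature could exist just outside the grid
  as.numeric(d_near) > as.numeric(d_edge)
}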

0.8.0 roadmap: Addressing ROpenSci Review


Long-term development targets

  • Bridging Python packages (dask-geopandas, geopolars, cuSpatial) into the R workflow
  • Optimal splitting of computational regions regarding shape complexity and input data resolutions
  • Streamlined integration of sparklyr and sf-derived functions

--- more ideas to comments ---
