niehs / chopin
Scalable GIS methods for environmental and climate data analysis
Home Page: https://niehs.github.io/chopin/
License: Other
mode = "grid_advanced" utilizes a minimum spanning tree to merge adjacent grids that contain fewer intersecting features than a threshold.

srtm <- terra::unwrap(readRDS("../../tests/testdata/nc_srtm15_otm.rds"))
fails for me because I think the code is intended to work only if run from the development version of the package. We could use a system.file() call like you did with the example sf data to get that to work (although tests are not installed by default if using the testthat package for tests; you might have to include this example in the package itself).
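As a quick illustration of the system.file() suggestion, the call resolves a file from an installed package rather than from the source tree. The example below uses base R's own DESCRIPTION file, since chopin's installed test-data layout is not guaranteed; the same pattern applies to example data shipped in a package's inst/ directory.

```r
# system.file() returns the installed path of a file shipped with a
# package ("" if the file does not exist). Demonstrated with a file
# every R installation has.
path <- system.file("DESCRIPTION", package = "base")
file.exists(path)  # TRUE
```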
Using a README.Rmd to knit a README file would be a good way to include these in the CI process.
- sparklyr: a connector between Spark and dplyr. With spark_install(), there is no pain in Spark configuration.
- sedona: a Spark extension for spatial data analysis (previously geospark). It ships Java archives (*.jar extension) that support geospatial capabilities in the Spark engine, and apache.sedona offers functions to use sedona with sparklyr. A caveat is converting list columns (with geometry in WKT strings) back to an sf column, as sparklyr strongly assumes that every column in a table is a vector.
- rapids extension: no R API exists. Might need to make one from scratch.

Unlike other mode settings in par_make_gridset, par_make_gridset(..., mode = "grid_advanced") returns a list of SpatVector.
library(sf)
library(terra)
library(tigris)
library(chopin)
library(spatstat.random)
set.seed(2024)
# Read the nc example gpkg file from sf package
nc <- st_read(system.file("gpkg/nc.gpkg", package = "sf"))
nc <- st_transform(nc, 5070)
# Sample clustered points using st_sample (input is sf)
sampled_points <- st_sample(nc, type = "Thomas", mu = 3e-9, scale = 1000, kappa = 10)
# grid merge
grid_merge <- chopin::par_make_gridset(
  sampled_points,
  mode = "grid_advanced",
  nx = 24L, ny = 12L,
  grid_min_features = 30L,
  padding = 2e4
)
plot(sampled_points$geom, pch = 19, cex = 0.2)
plot(grid_merge$padded$geometry, add = TRUE, border = "red", lwd = 2)
class(grid_merge)
# output is SpatVector
Input and output are supposed to be the same class.

sessionInfo() results
N/A: does not depend on local systems.
The ultimate objective of this project is to make scalable GIS computation easier for adequately versed R/GIS users (e.g., master's students in epidemiology or geography). However, the short-term goal is to serve the {SET} group's NRT-AP project to compute required covariates nationwide (i.e., the mainland United States). This issue estimates the scale of covariate computation for the NRT-AP project as well as the computational scales that potential users would need to work with.
Mainland US: ~8 million sq km; split the mainland into standard rectangular regions
Checklist for scaling up the computation:

- grid_merge (implemented)
- distribute_process_* functions and their concepts/applicable situations
- Tests with the testthat package
- gdalmdimtranslate command line tool or xarray in Python
  - gdalmdimtranslate resulted in an error of "Cannot guess driver"
  - xarray.Dataset.to_zarr() crashed with an error of "cannot allocate 63.4 GB in memory" (my laptop has 32 GB (24 GB available) of memory)
- distribute_process_multirasters works as designed, but the output makes little sense since it does not contain the file information from which the summary values are calculated. The internal processing should add file names in addition to the id and raster layer names.
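A minimal sketch of the proposed fix; the helper name and the summary columns are illustrative only, not chopin's actual internals:

```r
# Hypothetical helper: attach the source file name to each summary row
# so multi-raster outputs remain traceable to their input files.
add_source_column <- function(summary_df, raster_path) {
  summary_df$source_file <- basename(raster_path)
  summary_df
}

# Illustrative summary table with assumed column names
summarized <- data.frame(id = 1:2, mean = c(101.5, 87.2))
add_source_column(summarized, "/data/rasters/srtm_tile_01.tif")
```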
Meanwhile, Codecov shows a different behavior in distribute_process_multirasters compared to other tests, where no main lines of the function were marked untested despite being tested in the test suite.
- SpatialEpi (CRAN Link)
- SpatialEpiApp (Moraga 2017)
- aegis (Application for Epidemiological Geographic Information System) (Cho et al. 2020)
As we move towards writing the manuscript, let's consider our target audience and the interest in climate and health modeling. The README should reflect common applications in climate and health.
@sigmafelix I noticed the README installation instructions include the out-of-date reference to github/spatiotemporal-exposure-and-toxicology
extract_at_buffer and extract_at_poly differ only slightly: extract_at_buffer takes a radius argument to generate a circular buffer. For brevity, it is reasonable to use a single function, extract_at, for both point and polygon overlays on rasters.
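A sketch of how the unification could dispatch on its inputs; all names and return values below are assumed for illustration, not chopin's actual code:

```r
# Hypothetical unified entry point: a non-NULL `radius` implies point
# inputs that are buffered before extraction; otherwise polygons are
# overlaid directly.
extract_at_unified <- function(vector_input, raster_input, radius = NULL) {
  if (!is.null(radius)) {
    # points: build circular buffers of `radius`, then extract
    "buffer-extract"
  } else {
    # polygons: extract directly
    "polygon-extract"
  }
}
```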
- par_make_gridset into par_grid (custom pre-generated grid inputs optional)
- quadtree for mode = "density" in par_make_gridset
plan(multicore) is available on *nix systems (due to the support of fork-based evaluation). With plans other than multicore, terra objects are not exportable to parallel workers. par_* functions should utilize the plan value to convert such objects or to give a warning message when terra objects are passed in.

- Using future::supportsMulticore() and future::plan(), edit par_* functions to display a message on nonexportable objects
- Support sf/terra objects directly
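A base-R sketch of the warning logic; the helper name and its interface are assumptions for illustration (the real par_* functions would consult future::plan() directly):

```r
# Hypothetical check: terra's SpatVector/SpatRaster hold external
# pointers that only survive fork-based ('multicore') parallelism, so
# warn when such an object meets any other plan.
check_exportable <- function(x, plan_class) {
  nonfork <- !("multicore" %in% plan_class)
  if (nonfork && inherits(x, c("SpatVector", "SpatRaster"))) {
    warning("terra object with a non-multicore plan; wrap it with ",
            "terra::wrap() before exporting to workers.")
    return(invisible(FALSE))
  }
  invisible(TRUE)
}
```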
Some functions, distribute_* in particular, give lint errors for names longer than 30 characters. As rOpenSci requires a submission to pass lint checks, abbreviating these function names is necessary. Naming conventions to consider are:

- {object type}_{function name}, as seen in sf functions (st_*) or stringi functions (stri_*)
- Short names as in terra (e.g., vect, rast, etc.)
- {function group}_{function name}_{specific use case}: currently used, but will get shorter group and function names

Perhaps distribute_* is too verbose, so I am thinking of using par to represent parallelization (as in the parallel package's par* naming).
Rename distribute_process_grid to pargrid, par_grid, or parGrid.
remotes::install_github("NIEHS/chopin")
# Using GitHub PAT from the git credential store.
# Error: Failed to install 'chopin' from GitHub:
# HTTP error 403.
# Resource protected by organization SAML enforcement. You must grant your Personal Access token access to this organization.
#
# Rate limit remaining: 4995/5000
# Rate limit reset at: 2024-04-24 18:41:23 UTC
It looks like only people whose personal access token has been granted organization access can install the package via remotes or pak. I found this message only on Mac, so I will check it with other devices later.
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
- roxygen2
- ... *.tar.gz file and insert an installation script)
- ... exactextractr, terra, etc.)
- rslurm
- future and mirai
- par_group_grid
- extract_at as the only function for extracting raster values at generic polygons
- par_cut_coords: add sf::st_zm(drop = TRUE)
- extract_at_poly: st_crs(polys) == st_crs(surf) -> extract_at preprocesses CRS. Still having issues with nonexistent CRS.
- par_grid, par_hierarchy, and par_multirasters: avoid positional arguments x and y in favor of sf/terra conventions
- tar_target integration: return WKT/WKB list to branch out sf objects ... the other argument in extract_at or parallelizable pieces of chopin and others' functions?
- par_hierarchy variant that takes a simple hierarchy object like the output of par_make_gridset
- dep_switch: add potential use cases, narrow the scope
- sf and terra
- summarize_aw (vector-vector) and extract_at_buffer_kernel (vector-raster)
- future::plan to future::multisession when multisession is detected (#82)
- par_group_balanced documentation
- CITATION file
- "To run this example" ... summarize_sedc
- expect_message / expect_no_error
- library
- CONTRIBUTING.md
- cat or print
- inherits() rather than grepl() on class() outputs
- wk dependency for lightweight branching
@Spatiotemporal-Exposures-and-Toxicology Could you activate github.io webpage for this repository? Perhaps we could change the repository name when the webpage is deployed. Thank you!
Candidates include:

- Nearest feature: sf::st_nearest_feature(), nngeo::st_nn(), or terra::nearest()
- Areal interpolation: sf::st_interpolate_aw(); is there a terra equivalent? (needs to be developed)
- Parallel backends: parallel::parLapply() vs foreach::foreach() vs future.apply::future_apply()

The current implementation of extract_at only accepts vector inputs. We will consider raster weights in the _kernel subfunction to accommodate irregular polygon inputs and shorter processing times (e.g., exactextractr::exact_extract()).
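To illustrate what kernel weighting buys here, a minimal base-R sketch of a Gaussian-kernel weighted mean over extracted cell values; the bandwidth and function name are illustrative, not the planned API:

```r
# Hypothetical kernel weighting: cell values are averaged with weights
# that decay with distance from the point of interest.
kernel_weighted_mean <- function(values, dists, bandwidth) {
  w <- exp(-(dists^2) / (2 * bandwidth^2))
  sum(values * w) / sum(w)
}

# At equal distances the result reduces to a plain mean
kernel_weighted_mean(c(1, 3), dists = c(0, 0), bandwidth = 1)  # 2
```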
The default build setting of R allows up to 125 concurrent worker processes (the connection table is fixed at 128, with three connections reserved). This poses a problem for running multiple tasks across different nodes in HPC. Running R sessions in containers needs to be tested to see whether this practice sidesteps the maximum thread limit of the local R installation on HPC.
Distance calculation parallelization with smaller spatial extents than the entire dataset's extent may result in erroneous values if some grids/sub-regions contain no target features or edge cases are present near the boundary of adjacent grids/sub-regions. Gradually expanding grids can be used to fix such edge cases. One challenge is to design a function which determines whether the current calculation is shorter or longer than the actual shortest distance to the nearest feature that would have been found with the full dataset.

- Given a grid, a distance is considered suspicious/sub-optimal when it is longer than the distance from the point to the grid boundary.
- Hypothesis implementation: gradual increment in the search window
- Check the influence on performance
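The suspicious-distance rule can be sketched with plain coordinates; sf/terra objects are omitted and all names are illustrative, not the planned implementation:

```r
# Distance from a point to the nearest edge of its rectangular grid cell
dist_to_boundary <- function(px, py, xmin, ymin, xmax, ymax) {
  min(px - xmin, xmax - px, py - ymin, ymax - py)
}

# A within-grid nearest-neighbor distance is suspicious when it exceeds
# the distance to the grid boundary: a nearer feature may sit in an
# adjacent grid, so the search window should expand.
is_suspicious <- function(nn_dist, px, py, xmin, ymin, xmax, ymax) {
  nn_dist > dist_to_boundary(px, py, xmin, ymin, xmax, ymax)
}

is_suspicious(2,   9, 5, 0, 0, 10, 10)  # TRUE: boundary is only 1 unit away
is_suspicious(0.5, 9, 5, 0, 0, 10, 10)  # FALSE
```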
- ... (dask-geopandas, geopolars, cuSpatial) to R workflows
- sparklyr and sf-derived functions

--- more ideas to comments ---