trias-project / occ-cube-alien

🗺 Occurrence cubes for non-native taxa in Belgium and Europe

License: MIT License

gbif r occurrences rstats oscibio invasive-species

occ-cube-alien's Issues

Save output files as csv instead of tsv

See trias-project/occ-cube#3

Automate updating of occ-cubes using GitHub Actions

@damianooldoni as you know, I'm working on the alien species portal. A large part of its data is the alien species occurrence cube and its derivatives.
I'm currently looking into the dataflows¹ I need to create a useful and interesting webpage. The latest update of the cube dates from March 2022, which means an update is long overdue. Would it be an option to create a GitHub Action to do the updating and thereby increase the update frequency²?

¹preprocessing -> upload to UAT S3 bucket -> test in UAT -> publish to PRD. I'm mainly looking into automating the preprocessing part, as well as the feasibility of automating the upload to UAT.

²Up for debate, but more than a year is not usable.

Occurrence processing for Europe (CUBE EU)

Preprocessing

1. Download occurrences (1?? million)
   - Rough bounding box (bigger than EEA ref grid)
   - Some quality filters
   - taxonKey (50)
2. Load 1km EEA ref grid for Europe
3. Randomly assign each occurrence to a grid cell, within its coordinateUncertainty circle.

Regarding 2: the EEA website only offers the 10km and 100km grids. If we want to stitch the country-level 1km grids together, we should do this in a separate repository, so that the functionality can be used by others.

Regarding 3: since we take a rough bounding box, some occurrences (e.g. at sea) won't be assigned a grid cell and won't be included in the aggregated data.
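
The random assignment and grid lookup described above can be sketched in base R. This is a minimal illustration, not the repository's implementation: it assumes coordinates are already projected to ETRS89-LAEA (EPSG:3035) in metres, and it assumes a 1km cell code can be derived from the lower-left corner of the cell. The function names `random_point_in_circle()` and `eea_cell_code_1km()` are hypothetical.

```r
# Draw one point uniformly at random inside a circle of radius `radius_m`
# centred on (x, y). Coordinates are assumed to be in metres (EPSG:3035).
random_point_in_circle <- function(x, y, radius_m) {
  n <- length(x)
  r <- radius_m * sqrt(runif(n))  # sqrt() keeps the density uniform over the area
  theta <- runif(n, 0, 2 * pi)
  data.frame(x = x + r * cos(theta), y = y + r * sin(theta))
}

# Build a 1km cell code from the lower-left corner of the cell (assumption).
eea_cell_code_1km <- function(x, y) {
  paste0("1kmE", floor(x / 1000), "N", floor(y / 1000))
}

set.seed(678)
occs <- data.frame(
  x = c(3903500, 3903500),
  y = c(3136500, 3136500),
  coordinateUncertaintyInMeters = c(0, 2000)
)
pts <- random_point_in_circle(occs$x, occs$y, occs$coordinateUncertaintyInMeters)
occs$eea_cell_code <- eea_cell_code_1km(pts$x, pts$y)
```

With an uncertainty of 0 the point does not move, so the first occurrence stays in cell 1kmE3903N3136. The second may land in any cell within 2 km of its original position.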

Aggregate

Aggregate by:
- species
- kingdom
- year
- grid cell
Summarize by:
- occupied: TRUE
- min_uncertainty: min(coordinateUncertaintyInMeters)
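
As a rough sketch of this aggregation, the following base R example groups a toy occurrence table and keeps the smallest coordinate uncertainty per group. The input data frame and its values are illustrative only, and the real pipeline may aggregate inside the database rather than in memory.

```r
# Toy occurrences with an assigned grid cell (illustrative values only)
occs <- data.frame(
  speciesKey = c(2443002, 2443002, 2427091),
  kingdom = c("Animalia", "Animalia", "Animalia"),
  year = c(2020, 2020, 2018),
  eea_cell_code = c("1kmE3903N3136", "1kmE3903N3136", "1kmE3802N3133"),
  coordinateUncertaintyInMeters = c(250, 30, 1000)
)

# One row per species, kingdom, year and grid cell, with the minimum uncertainty
cube <- aggregate(
  coordinateUncertaintyInMeters ~ speciesKey + kingdom + year + eea_cell_code,
  data = occs,
  FUN = min
)
names(cube)[names(cube) == "coordinateUncertaintyInMeters"] <- "min_uncertainty"

# Every row in the cube represents an occupied cell
cube$occupied <- TRUE
```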

Simplify workflow using Arrow C++ library in R

As this issue has been solved (inbo/inborutils#42), we can avoid the two-table step. This will considerably reduce the time needed to generate the cubes.

This will affect the database creation script (create_db), both for BE and EU.

Repo structure

├── README.md
├── LICENCE
├── .gitignore
│
├── data
│   │
│   ├── raw
│   │   ├── modelling_species.csv
│   │   └── (do not include GBIF data in repo)
│   │
│   ├── interim (.gitignore)
│   │   └── occ with assigned coordinates + grid => S3?
│   │
│   └── processed
│       ├── cube_europe.csv
│       ├── cube_belgium.csv
│       └── cube_belgium_baseline.csv
│
└── src
    │
    ├── belgium
    │   ├── download.Rmd: define filters, trigger download
    │   ├── create_db.Rmd: create sqlite, fill with data, filter on issues
    │   ├── assign_grid.Rmd: assign coordinates, assign grid (chunk based)
    │   └── aggregate.Rmd: (filter on taxa), agg for alien, agg for baseline
    │
    └── europe
        ├── download.Rmd
        ├── create_db.Rmd
        ├── assign_grid.Rmd
        └── aggregate.Rmd

Where can I find the be_classes_cube.csv?

This file was deleted from ./data/processed/be_classes_cube.csv
Where can I find it now?
I need it for the exoten-portal.

Add kingdom to taxonomic compendium

Add column kingdom to https://github.com/trias-project/occ-cube-alien/blob/master/data/processed/be_alientaxa_info.csv and https://github.com/trias-project/occ-cube-alien/blob/master/data/processed/eu_modellingtaxa_info.csv

Remove speciesKey from cube_europe

cube_europe seems to have a column speciesKey which is empty for all rows. Should we remove it?

set seed

@damianooldoni, I have a suggestion: if you would like to obtain exactly the same random allocation of occurrences each time you run the occurrence cube script, you should set the seed, e.g. set.seed(678). For example, I would set the seed in the first line of code of the chunk {r assign_pts_in_circle_occs_eu} in 2_assign_grid.Rmd. This is useful if you want to control the variation among different cubes and to have an exactly reproducible cube.
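
A small base R illustration of the effect: resetting the seed before the random draw gives identical values on every run. The helper `draw_offsets()` is hypothetical and only stands in for the random step of the chunk mentioned above.

```r
# Hypothetical stand-in for the random assignment step
draw_offsets <- function(n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # fix the random number generator state
  runif(n)
}

first <- draw_offsets(3, seed = 678)
second <- draw_offsets(3, seed = 678)
identical(first, second)  # TRUE: same seed, same draws
```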

eea grid code

Some 'eea_cell_code' values contain negative numbers, e.g. 1kmE3370N-2322, 1kmE-412N6010, 1kmE-410N6010
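
A quick way to list the affected codes is a pattern match on the minus sign after E or N. This is only a diagnostic sketch on the example codes from this issue plus one regular code.

```r
codes <- c("1kmE3370N-2322", "1kmE-412N6010", "1kmE3903N3136")

# TRUE where the easting or northing part of the code is negative
has_negative <- grepl("E-|N-", codes)
codes[has_negative]
```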

Update the modelling species list

This list should be reviewed in light of the latest advances in selecting species for risk assessment or any other use of the European cube:
https://github.com/trias-project/occ-cube-alien/blob/master/references/modelling_species.tsv

error rendering create_db for Belgium

I could run the download step.
It fails to run the 2nd step here: https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L156-L169

Error message:

Quitting from lines 157-169 (2_create_db.Rmd) 
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : 
  RS_sqlite_import: /home/mvarewyck/git/occ-cube-alien/data/raw/0089458-210914110416597_occurrence.txt line 3319620 expected 250 columns of data but found 207
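
One way to locate rows like this before importing is to count the tab-separated fields on every line with `count.fields()` from base R. The sketch below runs on a small temporary file; for the real download the path would point to the occurrence.txt file, and the scan may be slow on a file of that size.

```r
# Small tab-separated file with one short row, standing in for occurrence.txt
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5"), path)

# Count fields per line; quotes and comment characters are treated as plain text
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "",
                         blank.lines.skip = FALSE)

expected <- n_fields[1]                    # number of columns in the header
bad_lines <- which(n_fields != expected)   # line numbers with a different count
bad_lines
```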

Add reference file for taxa included in cube

Since the cube will only contain taxonKeys I would also produce a reference file, so it is clear what taxa are included under those keys. We have 3 groups:

species + their infraspecific taxa + synonyms
infraspecific taxa + synonyms
synonyms

The reference file would be named nameofcube_taxa.csv and be of the format:

| taxonKey | scientificname | taxonRank | taxonomicStatus | includes |
| --- | --- | --- | --- | --- |
| 100 | Reynoutria japonica | species | accepted | 100: Reynoutria japonica \| 101: Fallopia japonica \| 102: Reynoutria japonica var. japonica |
| 110 | Pastinaca sativa subsp. sativa | subspecies | accepted | 111: Anethum pastinaca |
| 120 | xxx | species | synonym | |

Its information could be added to the cube with a simple left_join() by the user.
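
A minimal sketch of that join with dplyr, using the column names proposed above. The cube rows are made up for illustration and the taxon keys 100 and 110 are the placeholder keys from the example table.

```r
library(dplyr)

# Toy cube keyed on taxonKey (illustrative values only)
cube <- tibble(
  year = c(2018, 2019),
  eea_cell_code = c("1kmE3903N3136", "1kmE3802N3133"),
  taxonKey = c(100, 110),
  n = c(3, 1)
)

# Reference file as proposed: one row per taxonKey in the cube
taxa <- tibble(
  taxonKey = c(100, 110, 120),
  scientificname = c("Reynoutria japonica", "Pastinaca sativa subsp. sativa", "xxx"),
  taxonRank = c("species", "subspecies", "species"),
  taxonomicStatus = c("accepted", "accepted", "synonym")
)

# Keep every cube row and add the taxon information where the key matches
cube_named <- left_join(cube, taxa, by = "taxonKey")
```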

Add PRA species to modelling taxa for European occurrence cube

This issue starts from @amyjsdavis's comment in trias-project/indicators#84 (comment), which I copy below:

@damianooldoni : These are a different set of species (the plant species on the Species selection for PRA-ing list on google drive). They are different from the modellingtaxa list, and thus there is not an European cube. I realized that in the future, there will likely not be a cube for every species that are to be evaluated, so my modelling flow has the option to use the Cube as an input if already existing, or to process data directly downloaded from GBIF. As you may recall, I have to download global data for each species anyway for the models.

To keep data processing as linear and FAIR as possible, we decided to make a European occurrence cube for ALL species we need for modelling, where modelling should be understood in a broad sense: any species for which @amyjsdavis and @DiederikStrubbe need to run an SDM. There is a list of species for modelling in this repo (references/modelling_species.tsv), which should be updated whenever new species are found worth considering for risk maps.

I invite all of you to keep it up to date so that I can run a new version of the cube for them. Thanks.

Celastrus: GBIF code changed or inaccurate?

@damianooldoni, I accidentally noticed that the GBIF code of Celastrus orbiculatus (8104460) has been deleted as of March 2021. I assume the new GBIF code is 3169169, and that this change of codes should also be applied to the occurrence cube.

Discard occurrences with unverified identificationVerificationStatus

Based on issue trias-project/indicators#84:

Add column identificationVerificationStatus to the cols_to_use (https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L183-L193)
Discard occurrences with identificationVerificationStatus = "unverified". This is very similar to what we already do for occurrenceStatus (https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L234-L241)

Apply these changes in both pipelines (BE and modelling taxa at EU level).
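
The filter itself could look like the base R sketch below. It runs on an in-memory data frame for illustration, whereas the pipeline applies its filters in the database. Keeping rows with a missing status and matching case-insensitively are both assumptions, not decisions taken in this issue.

```r
# Toy occurrences (illustrative values only)
occs <- data.frame(
  gbifID = 1:4,
  identificationVerificationStatus = c("verified", "verified", "unverified", NA)
)

# Assumption: rows without a status are kept; only "unverified" is discarded
status <- tolower(occs$identificationVerificationStatus)
keep <- is.na(status) | status != "unverified"
occs_kept <- occs[keep, ]
```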

baseline incomplete for Trachemys?

@damianooldoni does the baseline only account for the native species in a grid cell?

If not, the baseline would be incomplete. When using the TrIAS GAM workflow you get some grid cells with Trachemys scripta where the baseline for the same cell contains no reptiles. Example: grid cell 1kmE3903N3136, year 2020, speciesKey 2443002.

Baseline datacube at classis level

While discussing occurrence indicators (trias-project/indicators#49), @timadriaens suggested calculating the baseline at class level instead of phylum level, as

this is a widely used procedure to make distribution models of alien species. [Tim Adriaens]

Occurrence processing for Belgium (CUBE BE)

Preprocessing

1. Download occurrences (20 million)
   - country=BE
   - Some quality filters (see trias-project/indicators#41 (comment))
   - No taxon filter (GBIF does not support long lists of taxa: gbif/portal-feedback#1768)
2. Randomly assign coordinates to each occurrence, within its coordinateUncertainty circle
3. Calculate EEA 1km ref grid cell (51.726)

Baseline data

Aggregate by:
- kingdom
- year
- grid cell
Summarize by:
- occ_count: count(occurrences)

Years/grid cells without occurrences are not included.

Alien data

1. Filter occurrences on checklist taxa (2.500):
   - For SPECIES => query on speciesKey (will include species synonyms, subspecies/varieties and their synonyms)
   - (For SUBSPECIES and VARIETY => query on acceptedKey (will include synonyms))
   - (For SYNONYM => query on taxonKey)
2. Aggregate by:
   - species (or taxon)
   - kingdom
   - year
   - grid cell
3. Summarize by:
   - occ_count: count(occurrences)

Note for step 1: as an initial step, we could work with SPECIES only
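
For the SPECIES-only case, the filter and count can be sketched in base R as follows. The occurrence table, the key 9999999 and the vector `checklist_species_keys` are illustrative placeholders, not data from the checklist.

```r
# Toy occurrences with an assigned grid cell (illustrative values only)
occs <- data.frame(
  speciesKey = c(2427091, 2427091, 2443002, 9999999),
  kingdom = c("Animalia", "Animalia", "Animalia", "Plantae"),
  year = c(2018, 2018, 2020, 2018),
  eea_cell_code = c("1kmE3802N3133", "1kmE3802N3133", "1kmE3903N3136", "1kmE3802N3133")
)

# Hypothetical checklist of alien species keys
checklist_species_keys <- c(2427091, 2443002)

# Keep checklist species, then count occurrences per species, kingdom, year and cell
alien <- occs[occs$speciesKey %in% checklist_species_keys, ]
alien$occ_count <- 1L
cube <- aggregate(
  occ_count ~ speciesKey + kingdom + year + eea_cell_code,
  data = alien,
  FUN = sum
)
```

Groups without occurrences never appear in the result, which matches the note that years and grid cells without occurrences are not included.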

Aggregate data at species level for broader use

There is a strong need to produce national datacubes without filtering out non-alien taxa, as we currently do (fourth pipeline).

This need has been expressed by LifeWatch ERIC IJI as a starting point for a use case, and also by @qgroom. Quentin wrote:

It might be useful to add a flag so that it is an option in the code. I can see many people finding it useful for all species.

Add GBIF key to name of output datacube

Adding the GBIF key to the name of the datacube would be very useful for version control and documentation purposes.
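
A possible naming helper, assuming the key meant here is the GBIF download key (such as the one visible in the raw file name 0089458-210914110416597_occurrence.txt). The function `cube_file_name()` and the naming pattern are hypothetical.

```r
# Build an output file name that embeds the GBIF download key
cube_file_name <- function(prefix, gbif_key) {
  sprintf("%s_%s.csv", prefix, gbif_key)
}

cube_file_name("be_alientaxa_cube", "0089458-210914110416597")
# "be_alientaxa_cube_0089458-210914110416597.csv"
```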

Additional checks for vague date ranges required?

Early records are less likely to be resolved to single years.
For example, the first exemplar row here
https://zenodo.org/record/3635510#.Xj1LLWj7SHt
1700 | 1kmE3802N3133 | 2287615 | 1 | 301
apparently derives from the GBIF record here https://www.gbif.org/occurrence/477065724
but this seems to misrepresent the original
https://mczbase.mcz.harvard.edu/guid/MCZ:Mala:152567
which gives a collecting date of 1700-2009 (i.e. presumably unknown or not digitised?)

Should the automated aggregation process include some sort of flag for early records that are unlikely, in reality, to be resolved to a single year?
What checks could be done?
For example, it's not clear to me why the GBIF record linked above has a date but also the claim of "no verbatim date data". Is this contradictory?
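
One possible check, sketched under a strong assumption: that the date is available as an ISO 8601 string which holds either a single date or an interval written as start/end. Whether a given GBIF download exposes the range this way is not confirmed here. The helper `flag_multi_year()` is hypothetical.

```r
# TRUE where the date string is an interval that spans more than one calendar year
flag_multi_year <- function(event_date) {
  parts <- strsplit(event_date, "/", fixed = TRUE)
  year_of <- function(x) as.integer(substr(x, 1, 4))
  start_year <- vapply(parts, function(p) year_of(p[1]), integer(1))
  end_year <- vapply(parts, function(p) year_of(p[max(1, length(p))]), integer(1))
  end_year > start_year
}

flags <- flag_multi_year(c("1700-01-01/2009-12-31", "2018-05-04", "2018-05-04/2018-05-06"))
flags
```

Records flagged this way could be excluded from the cube or aggregated separately, instead of being assigned to the first year of the range.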

Occurrences from new zenodo cube all over the place

Observations of Lithobates catesbeianus for the year 2018:

Current zenodo cube (https://zenodo.org/records/10058400):

Old zenodo cube (https://zenodo.org/records/5819028):

In be_alientaxa_cube.csv from Zenodo, the number of occupied grid cells for Lithobates catesbeianus seems to be inflated compared to the previous version.

library(readr)
library(dplyr)

# Read be_alientaxa_cube.csv from a given Zenodo record
read_cube <- function(record_id) {
  read_csv(
    file = paste0(
      "https://zenodo.org/records/", record_id,
      "/files/be_alientaxa_cube.csv?download=1"
    ),
    col_types = cols(
      year = col_double(),
      eea_cell_code = col_character(),
      taxonKey = col_double(),
      n = col_double(),
      min_coord_uncertainty = col_double()
    ),
    na = ""
  )
}

# Lithobates catesbeianus (taxonKey 2427091) in the current and the old cube
test_new <- read_cube("10058400") %>% filter(taxonKey == 2427091)
test_old <- read_cube("5819028") %>% filter(taxonKey == 2427091)

PS not certain this issue belongs here!
