trias-project / occ-cube-alien
Occurrence cubes for non-native taxa in Belgium and Europe
License: MIT License
@damianooldoni as you know, I'm working on the alien species portal. A huge part of the data is the alien species occurrence cube and its derivatives.
I'm currently looking into the dataflows [1] I need to create a useful and interesting webpage. The latest update of the cube dates from March 2022, which means an update is long overdue. I was wondering whether it is an option to create a GitHub Action to do the updating and thereby increase the update frequency [2].
[1] preprocessing -> upload to UAT S3 bucket -> test in UAT -> publish to PRD. I'm mainly looking into automating the preprocessing part, as well as into the feasibility of automating the upload to UAT.
[2] Up for debate, but more than a year is not usable.
taxonKey (50)
grid cell, within its coordinateUncertainty circle.
Regarding 2: on the EEA website, only the 10 km and 100 km grids are available. If we want to stitch the country 1 km grids together, we should do this in a separate repository, so that the functionality can be used by others.
Regarding 3: since we take a rough bounding box, some occurrences (e.g. at sea) won't be assigned a grid cell and will not be included in the aggregated data.
Aggregate by:
species
kingdom
year
grid cell
Summarize by:
occupied: TRUE
min_uncertainty: min(coordinateUncertaintyInMeters)
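The aggregation described above could be sketched in base R as follows. This is only an illustration on made-up occurrences: the data frame `occs` and the column name `grid_cell` are assumptions, not names taken from the pipeline.

```r
# Toy occurrences (made up). Only coordinateUncertaintyInMeters is a GBIF term;
# the other column names are assumptions for this sketch.
occs <- data.frame(
  species = c("A", "A", "A", "B"),
  kingdom = c("Plantae", "Plantae", "Plantae", "Animalia"),
  year = c(2010, 2010, 2011, 2010),
  grid_cell = c("1kmE3802N3133", "1kmE3802N3133", "1kmE3802N3133", "1kmE3903N3136"),
  coordinateUncertaintyInMeters = c(1000, 250, 30, 500),
  stringsAsFactors = FALSE
)

# Aggregate by species, kingdom, year and grid cell,
# keeping the minimum coordinate uncertainty per group
cube <- aggregate(
  coordinateUncertaintyInMeters ~ species + kingdom + year + grid_cell,
  data = occs,
  FUN = min
)
names(cube)[names(cube) == "coordinateUncertaintyInMeters"] <- "min_uncertainty"

# Every combination present in the result is occupied by definition
cube$occupied <- TRUE
print(cube)
```

Combinations without occurrences simply do not appear in `cube`, so `occupied` is always TRUE in this sketch.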
As this issue has been solved (inbo/inborutils#42), we can avoid the two-table step. This will save a considerable portion of the time needed to generate the cubes.
This will affect the create_db script, both for BE and EU.
├── README.md
├── LICENCE
├── .gitignore
│
├── data
│   │
│   ├── raw
│   │   ├── modelling_species.csv
│   │   └── (do not include GBIF data in repo)
│   │
│   ├── interim (.gitignore)
│   │   └── occ with assigned coordinates + grid => S3?
│   │
│   └── processed
│       ├── cube_europe.csv
│       ├── cube_belgium.csv
│       └── cube_belgium_baseline.csv
│
└── src
    │
    ├── belgium
    │   ├── download.Rmd: define filters, trigger download
    │   ├── create_db.Rmd: create sqlite, fill with data, filter on issues
    │   ├── assign_grid.Rmd: assign coordinates, assign grid (chunk based)
    │   └── aggregate.Rmd: (filter on taxa), agg for alien, agg for baseline
    │
    └── europe
        ├── download.Rmd
        ├── create_db.Rmd
        ├── assign_grid.Rmd
        └── aggregate.Rmd
The file ./data/processed/be_classes_cube.csv was deleted.
Where can I find it now?
I need it for the exoten-portal.
cube_europe seems to have a column speciesKey which is empty for all rows. Should we remove it?
@damianooldoni, I have a suggestion: if you would like to obtain exactly the same random allocation of occurrences each time you run the occurrence cube script, you should set the seed (e.g. set.seed(678)). For example, I would set the seed in the first line of code of the chunk {r assign_pts_in_circle_occs_eu} in 2_assign_grid.Rmd. This is useful if you want to control the variation among different cubes and to have an exactly reproducible cube.
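To illustrate the effect, here is a minimal sketch in base R. The helper `random_point_in_circle()` and the coordinates are hypothetical and are not the code used in 2_assign_grid.Rmd; the sketch only shows that resetting the seed before the random step gives identical results.

```r
# Hypothetical helper: draw one random point inside each uncertainty circle
random_point_in_circle <- function(x, y, radius) {
  angle <- runif(length(x), min = 0, max = 2 * pi)
  # Taking the square root spreads the points uniformly over the circle's area
  r <- radius * sqrt(runif(length(x)))
  data.frame(x = x + r * cos(angle), y = y + r * sin(angle))
}

# Made-up centres (metres) and coordinate uncertainties
x0 <- c(3802500, 3903500)
y0 <- c(3133500, 3136500)
radius <- c(1000, 250)

set.seed(678)
first_run <- random_point_in_circle(x0, y0, radius)

set.seed(678)
second_run <- random_point_in_circle(x0, y0, radius)

# Same seed, same random allocation
print(identical(first_run, second_run))
```

Without the two `set.seed()` calls, the two runs would almost certainly differ, which is the variation among cubes mentioned above.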
Some 'eea_cell_code' values contain negative coordinates, e.g. 1kmE3370N-2322, 1kmE-412N6010, 1kmE-410N6010.
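A quick way to flag such codes could look like the sketch below. It assumes that a minus sign directly after the "E" or "N" marker is what identifies a negative easting or northing; the first code in the vector is added here as a valid example for contrast.

```r
eea_cell_code <- c(
  "1kmE3802N3133",   # valid code, for contrast
  "1kmE3370N-2322",
  "1kmE-412N6010",
  "1kmE-410N6010"
)

# TRUE where the easting or northing part starts with a minus sign
has_negative <- grepl("[EN]-", eea_cell_code)

# Cell codes that need a closer look
print(eea_cell_code[has_negative])
```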
This list should be reviewed in light of the latest progress in selecting species for risk assessment or any other use of the European cube:
https://github.com/trias-project/occ-cube-alien/blob/master/references/modelling_species.tsv
I could run the download step.
It fails to run the 2nd step here: https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L156-L169
Error message:
Quitting from lines 157-169 (2_create_db.Rmd)
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) :
RS_sqlite_import: /home/mvarewyck/git/occ-cube-alien/data/raw/0089458-210914110416597_occurrence.txt line 3319620 expected 250 columns of data but found 207
Since the cube will only contain taxonKeys, I would also produce a reference file, so it is clear what taxa are included under those keys. We have 3 groups:
The reference file would be named nameofcube_taxa.csv and be of the format:
taxonKey | scientificname | taxonRank | taxonomicStatus | includes |
---|---|---|---|---|
100 | Reynoutria japonica | species | accepted | 100: Reynoutria japonica \| 101: Fallopia japonica \| 102: Reynoutria japonica var. japonica |
110 | Pastinaca sativa subsp. sativa | subspecies | accepted | 111: Anethum pastinaca |
120 | xxx | species | synonym | |
Its information could be added to the cube with a simple left_join() by the user.
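As a rough sketch of that join, assuming the cube and the reference file are already loaded as data frames (the values below are made up, and only a few columns of the proposed reference file are shown):

```r
library(dplyr)

# Toy cube (made-up values)
cube <- data.frame(
  year = c(2015, 2016, 2016),
  eea_cell_code = c("1kmE3802N3133", "1kmE3802N3133", "1kmE3903N3136"),
  taxonKey = c(100, 100, 110),
  n = c(2, 1, 5)
)

# Toy reference file following the proposed format
taxa <- data.frame(
  taxonKey = c(100, 110),
  scientificname = c("Reynoutria japonica", "Pastinaca sativa subsp. sativa"),
  taxonRank = c("species", "subspecies"),
  taxonomicStatus = c("accepted", "accepted")
)

# Keep every cube row and add the taxonomic information by taxonKey
cube_with_taxa <- left_join(cube, taxa, by = "taxonKey")
print(cube_with_taxa)
```

Because the reference file has one row per taxonKey, the join does not change the number of rows in the cube.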
This issue starts from @amyjsdavis's comment in trias-project/indicators#84 (comment), which I copy here below:
@damianooldoni : These are a different set of species (the plant species on the Species selection for PRA-ing list on google drive). They are different from the modellingtaxa list, and thus there is not a European cube. I realized that in the future, there will likely not be a cube for every species that is to be evaluated, so my modelling flow has the option to use the Cube as an input if already existing, or to process data directly downloaded from GBIF. As you may recall, I have to download global data for each species anyway for the models.
In order to make data processing as linear and FAIR as possible, we decided to make a European occurrence cube for ALL species which we need for modelling, where modelling should be understood in a broad sense, that is, any species for which @amyjsdavis and @DiederikStrubbe need to run SDM. There is a list of species for modelling in this repo (references/modelling_species.tsv) which should be updated whenever new species are found worth considering for risk maps.
I invite all of you to keep it up to date so that I can run a new version of the cube for them. Thanks.
@damianooldoni, I accidentally noticed that the GBIF code of Celastrus orbiculatus (8104460) is deleted as of March 2021. I assume the new GBIF code is 3169169. I assume this change of codes also needs to be applied to the occurrence cube.
Based on issue trias-project/indicators#84:
Add identificationVerificationStatus to the cols_to_use (https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L183-L193)
Exclude occurrences with identificationVerificationStatus = "unverified". This is very similar to what we already do for occurrenceStatus (https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L234-L241)
Apply these changes in both pipelines (BE and modelling taxa at EU level).
@damianooldoni does the baseline only account for the native species in a grid cell?
If so, the baseline would be incomplete. When using the TrIAS GAM workflow you get some grid cells with Trachemys scripta, while the same cell is lacking reptiles. Example: grid cell 1kmE3903N3136, year 2020, speciesKey 2443002.
While discussing occurrence indicators (trias-project/indicators#49), @timadriaens suggested calculating the baseline based on phylum instead of phylum, as
this is a widely used procedure to make distribution models of alien species. [Tim Adriaens]
country=BE
coordinateUncertainty circle
grid cell (51.726)
kingdom
year
grid cell
occ_count: count(occurrences)
Years/grid cells without occurrences are not included.
SPECIES => query on speciesKey (will include species synonyms, subspecies/varieties and their synonyms)
SUBSPECIES and VARIETY => query on acceptedKey (will include synonyms)
SYNONYM => query on taxonKey
species (or taxon)
kingdom
year
grid cell
occ_count: count(occurrences)
Note for step 1: as an initial step, we could work with SPECIES only.
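The mapping from taxon group to query field listed above could be written down as a small lookup. The function `query_field()` is a hypothetical helper for illustration only and is not part of the pipeline.

```r
# Hypothetical helper: which GBIF key field to query on for a given taxon group
query_field <- function(group) {
  switch(
    group,
    SPECIES = "speciesKey",
    SUBSPECIES = ,              # falls through to VARIETY
    VARIETY = "acceptedKey",
    SYNONYM = "taxonKey",
    stop("Unsupported group: ", group)
  )
}

groups <- c("SPECIES", "SUBSPECIES", "VARIETY", "SYNONYM")
fields <- vapply(groups, query_field, character(1))
print(fields)
```

Starting with SPECIES only, as the note suggests, would mean using just the first branch of this lookup.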
There is a huge need to produce national datacubes without filtering out non-alien taxa, as we currently do (fourth pipeline).
This need has been expressed by LifeWatch ERIC IJI as a starting point for a use case and also by @qgroom. Quentin wrote:
It might be useful to add a flag so that it is an option in the code. I can see many people finding it useful for all species.
Adding the GBIF key to the name of the datacube would be very useful for version control and documentation purposes.
Early records are less likely to be resolved to single years.
For example, the first exemplar row here
https://zenodo.org/record/3635510#.Xj1LLWj7SHt
1700 | 1kmE3802N3133 | 2287615 | 1 | 301
apparently derives from the GBIF record here https://www.gbif.org/occurrence/477065724
but this seems to misrepresent the original
https://mczbase.mcz.harvard.edu/guid/MCZ:Mala:152567
which gives a collecting date of 1700-2009 (i.e. presumably unknown or not digitised?)
Observations of Lithobates catesbeianus for the year 2018:
vs
When I look at be_alientaxa_cube.csv from Zenodo, the number of infected grid cells for Lithobates catesbeianus seems to be inflated compared to the previous version.
library(readr)
library(dplyr)

# Column specification shared by both versions of the cube
cube_cols <- cols(
  year = col_double(),
  eea_cell_code = col_character(),
  taxonKey = col_double(),
  n = col_double(),
  min_coord_uncertainty = col_double()
)

# Cube published in Zenodo record 10058400
df_10058400 <- read_csv(
  file = "https://zenodo.org/records/10058400/files/be_alientaxa_cube.csv?download=1",
  col_types = cube_cols,
  na = ""
)
# Keep only Lithobates catesbeianus (taxonKey 2427091)
test_10058400 <- df_10058400 %>% filter(taxonKey == 2427091)

# Cube published in Zenodo record 5819028
df_5819028 <- read_csv(
  file = "https://zenodo.org/records/5819028/files/be_alientaxa_cube.csv?download=1",
  col_types = cube_cols,
  na = ""
)
# Keep only Lithobates catesbeianus (taxonKey 2427091)
test_5819028 <- df_5819028 %>% filter(taxonKey == 2427091)
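To put a number on the difference, one could count the distinct grid cells per version for a given taxon and year. The sketch below uses small made-up data frames in place of the two downloaded cubes, so the counts are illustrative only; `count_cells()` is a hypothetical helper.

```r
# Hypothetical helper: number of distinct grid cells for one taxon in one year
count_cells <- function(cube, key, yr) {
  rows <- cube$taxonKey == key & cube$year == yr
  length(unique(cube$eea_cell_code[rows]))
}

# Made-up stand-ins for two versions of the cube
old_cube <- data.frame(
  year = c(2018, 2018, 2017),
  eea_cell_code = c("1kmE3802N3133", "1kmE3803N3133", "1kmE3802N3133"),
  taxonKey = 2427091
)
new_cube <- data.frame(
  year = c(2018, 2018, 2018, 2018),
  eea_cell_code = c("1kmE3802N3133", "1kmE3803N3133", "1kmE3804N3133", "1kmE3805N3133"),
  taxonKey = 2427091
)

cells_old <- count_cells(old_cube, key = 2427091, yr = 2018)
cells_new <- count_cells(new_cube, key = 2427091, yr = 2018)
print(c(old = cells_old, new = cells_new))
```

Applied to the two data frames read from Zenodo, the same helper would give the cell counts behind the 2018 comparison.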
PS not certain this issue belongs here!