occ-cube-alien's People

Contributors

amyjsdavis, damianooldoni, peterdesmet, qgroom, timadriaens

Forkers

timadriaens

occ-cube-alien's Issues

Automate updating of occ-cubes using github actions

@damianooldoni, as you know I'm working on the alien species portal. A huge part of its data is the alien species occurrence cube and its derivatives.
I'm currently looking into the dataflows [1] I need to create a useful and interesting webpage. The latest update of the cube dates from March 2022, which means an update is long overdue. I was wondering whether we could create a GitHub Action to do the updating and thereby increase the update frequency [2].

[1] preprocessing -> upload to UAT S3 bucket -> test in UAT -> publish to PRD. I'm mainly looking into automating the preprocessing part, as well as the feasibility of automating the upload to UAT.

[2] Up for debate, but an update interval of more than a year is not usable.

Occurrence processing for Europe (CUBE EU)

Preprocessing

  1. Download occurrences (1?? million)
    • Rough bounding box (bigger than EEA ref grid)
    • Some quality filters
    • taxonKey (50)
  2. Load 1km EEA ref grid for Europe
  3. Randomly assign each occurrence to grid cell, within its coordinateUncertainty circle.

Regarding 2: on the EEA website, only the 10km and 100km grids are available. If we want to stitch the 1km country grids together, we should do this in a separate repository, so that the functionality can be used by others.

Regarding 3: since we take a rough bounding box, some occurrences (e.g. at sea) won't be assigned a grid cell and won't be included in the aggregated data.
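
Step 3 can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the helper names assign_random_point() and eea_cell_code_1km() are hypothetical, coordinates are assumed to be already projected to ETRS89-LAEA (EPSG:3035) in metres, and the cell code is assumed to be built from the lower-left corner of the cell in kilometres. It does not check whether a cell actually exists in the reference grid.

```r
# Hypothetical helper: draw one uniformly distributed point inside each
# occurrence's uncertainty circle. x, y and uncertainty are in metres
# (assumed to be already projected to EPSG:3035).
assign_random_point <- function(x, y, uncertainty) {
  n <- length(x)
  r <- uncertainty * sqrt(runif(n))  # sqrt gives uniform density over the disc
  theta <- runif(n, min = 0, max = 2 * pi)
  data.frame(x = x + r * cos(theta), y = y + r * sin(theta))
}

# Hypothetical helper: build a 1 km cell code from the lower-left corner
# (in km) of the cell containing each point (assumed naming convention).
eea_cell_code_1km <- function(x, y) {
  paste0("1kmE", floor(x / 1000), "N", floor(y / 1000))
}

# Toy occurrences (made-up coordinates)
occs <- data.frame(
  x = c(3903500, 3370200),
  y = c(3136500, 2322800),
  uncertainty = c(30, 1000)
)

set.seed(678)
pts <- assign_random_point(occs$x, occs$y, occs$uncertainty)
occs$eea_cell_code <- eea_cell_code_1km(pts$x, pts$y)
```

The first toy occurrence always ends up in the same cell, because its 30 m circle lies entirely inside it. The second one has a 1000 m circle that overlaps several cells, so its cell depends on the random draw.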

Aggregate

  1. Aggregate by:

    • species
    • kingdom
    • year
    • grid cell
  2. Summarize by:

    • occupied: TRUE
    • min_uncertainty: min(coordinateUncertaintyInMeters)

Repo structure

├── README.md
├── LICENCE
├── .gitignore
│
├── data
│   │
│   ├── raw
│   │   ├── modelling_species.csv
│   │   └── (do not include GBIF data in repo)
│   │
│   ├── interim (.gitignore)
│   │   └── occ with assigned coordinates + grid => S3?
│   │
│   └── processed
│       ├── cube_europe.csv
│       ├── cube_belgium.csv
│       └── cube_belgium_baseline.csv
│
└── src
    │
    ├── belgium
    │   ├── download.Rmd: define filters, trigger download
    │   ├── create_db.Rmd: create sqlite, fill with data, filter on issues
    │   ├── assign_grid.Rmd: assign coordinates, assign grid (chunk based)
    │   └── aggregate.Rmd: (filter on taxa), agg for alien, agg for baseline
    │
    └── europe
        ├── download.Rmd
        ├── create_db.Rmd
        ├── assign_grid.Rmd
        └── aggregate.Rmd


set seed

@damianooldoni, I have a suggestion: if you would like to obtain exactly the same random allocation of occurrences each time you run the occurrence cube script, you should set the seed (e.g. set.seed(678)). For example, I would set the seed in the first line of code of the {r assign_pts_in_circle_occs_eu} chunk in 2_assign_grid.Rmd. This is useful if you want to control the variation among different cubes and to have an exactly reproducible cube.
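
A small illustration of the effect, assuming a hypothetical draw_offsets() function that stands in for the random assignment step:

```r
# Hypothetical stand-in for the random part of the grid assignment
draw_offsets <- function(n) runif(n, min = -100, max = 100)

set.seed(678)
first_run <- draw_offsets(5)

set.seed(678)                   # same seed, so the same draws again
second_run <- draw_offsets(5)

unseeded_run <- draw_offsets(5) # continues the random stream, so it differs

identical(first_run, second_run)
```

Resetting the seed immediately before the random step is what makes the two runs match. Any random draw that happens between set.seed() and that step shifts the stream and changes the result.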

eea grid code

Some 'eea_cell_code' values contain negative numbers, e.g. 1kmE3370N-2322, 1kmE-412N6010, 1kmE-410N6010
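
One way to find such codes in a cube is to parse the easting and northing out of each code. This is a rough sketch that assumes all codes follow the 1kmE<easting>N<northing> pattern seen in the examples above.

```r
codes <- c("1kmE3903N3136", "1kmE3370N-2322", "1kmE-412N6010", "1kmE-410N6010")

# Capture the (possibly negative) easting and northing parts of each code
parts <- regmatches(codes, regexec("^1kmE(-?[0-9]+)N(-?[0-9]+)$", codes))
easting <- as.integer(vapply(parts, `[`, character(1), 2))
northing <- as.integer(vapply(parts, `[`, character(1), 3))

# Flag codes where either component is negative
negative <- easting < 0 | northing < 0
codes[negative]
```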

Add reference file for taxa included in cube

Since the cube will only contain taxonKeys I would also produce a reference file, so it is clear what taxa are included under those keys. We have 3 groups:

  • species + their infraspecific taxa + synonyms
  • infraspecific taxa + synonyms
  • synonyms

The reference file would be named nameofcube_taxa.csv and be of the format:

taxonKey  scientificname                  taxonRank   taxonomicStatus  includes
100       Reynoutria japonica             species     accepted         100: Reynoutria japonica | 101: Fallopia japonica | 102: Reynoutria japonica var. japonica
110       Pastinaca sativa subsp. sativa  subspecies  accepted         111: Anethum pastinaca
120       xxx                             species     synonym

Its information could be added to the cube with a simple left_join() by the user.
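
As an illustration, here is the same join written with base R's merge(..., all.x = TRUE), which behaves like dplyr's left_join() for this purpose. The cube rows are made up, and the taxa rows are a subset of the example table above.

```r
# Toy cube: only taxonKeys, no names (made-up rows)
cube <- data.frame(
  year = c(2018, 2019, 2020),
  eea_cell_code = c("1kmE3903N3136", "1kmE3903N3136", "1kmE3904N3136"),
  taxonKey = c(100, 110, 100),
  n = c(3, 1, 2)
)

# Reference file, as it would be read from nameofcube_taxa.csv
taxa <- data.frame(
  taxonKey = c(100, 110, 120),
  scientificname = c("Reynoutria japonica",
                     "Pastinaca sativa subsp. sativa",
                     "xxx"),
  taxonRank = c("species", "subspecies", "species"),
  taxonomicStatus = c("accepted", "accepted", "synonym")
)

# Left join: keep every cube row, add the taxon information where available.
# Note that merge() sorts the result by the join key.
cube_named <- merge(cube, taxa, by = "taxonKey", all.x = TRUE)
```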

Add PRA species to modelling taxa for European occurrence cube

This issue starts from @amyjsdavis's comment in trias-project/indicators#84 (comment), which I have copied below:

@damianooldoni : These are a different set of species (the plant species on the Species selection for PRA-ing list on google drive). They are different from the modellingtaxa list, and thus there is no European cube for them. I realized that in the future there will likely not be a cube for every species that is to be evaluated, so my modelling flow has the option to use the cube as an input if it already exists, or to process data directly downloaded from GBIF. As you may recall, I have to download global data for each species anyway for the models.

In order to make data processing as linear and FAIR as possible, we decided to make a European occurrence cube for ALL species which we need for modelling, where modelling should be understood in a broader sense, that is, any species for which @amyjsdavis and @DiederikStrubbe need to run SDM. There is a list of species for modelling in this repo (references/modelling _species.tsv), which should be updated whenever new species are found worth considering for risk maps.

I invite all of you to keep it up to date so that I can run a new version of the cube for them. Thanks.

Discard occurrences with unverified identificationVerificationStatus

Based on issue trias-project/indicators#84:

  1. Add column identificationVerificationStatus to the cols_to_use (https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L183-L193)
  2. Discard occurrences with identificationVerificationStatus = "unverified". This is very similar to what we already do for occurrenceStatus (https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L234-L241)

Apply these changes in both pipelines (BE and modelling taxa at EU level).
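
The filter in step 2 could look roughly like this on an in-memory data frame. This is a sketch only: the actual pipeline filters inside the sqlite database, the toy rows are made up, and keeping occurrences with a missing status is an assumption.

```r
# Toy occurrences (made-up rows)
occs <- data.frame(
  gbifID = 1:4,
  identificationVerificationStatus = c("verified", "verified", "unverified", NA),
  stringsAsFactors = FALSE
)

# Flag only the occurrences explicitly marked as unverified.
# Assumption: occurrences without a status (NA) are kept.
is_unverified <- !is.na(occs$identificationVerificationStatus) &
  occs$identificationVerificationStatus == "unverified"

occs_kept <- occs[!is_unverified, ]
```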

baseline incomplete for Trachemys?

@damianooldoni, does the baseline only account for the native species in a grid cell?

If not, the baseline would be incomplete. When using the TrIAS GAM workflow you get some grid cells with Trachemys scripta where the baseline for the same cell contains no reptiles. Example: grid cell 1kmE3903N3136, year 2020, speciesKey 2443002.

Occurrence processing for Belgium (CUBE BE)

Preprocessing

  1. Download occurrences (20 million)
  2. Randomly assign coordinates to each occurrence, within its coordinateUncertainty circle
  3. Calculate EEA 1km ref grid cell (51.726)

Baseline data

  1. Aggregate by:
    • kingdom
    • year
    • grid cell
  2. Summarize by:
    • occ_count: count(occurrences)

Years/grid cells without occurrences are not included.

Alien data

  1. Filter occurrences on checklist taxa (2.500):
    • For SPECIES => query on speciesKey (will include species synonyms, subspecies/varieties and their synonyms)
    • (For SUBSPECIES and VARIETY => query on acceptedKey (will include synonyms))
    • (For SYNONYM => query on taxonKey)
  2. Aggregate by:
    • species (or taxon)
    • kingdom
    • year
    • grid cell
  3. Summarize by:
    • occ_count: count(occurrences)

Note for step 1: as initial step, we could only work with SPECIES
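
The baseline and alien aggregations can be sketched side by side in base R. The toy data, the column names and the checklist_species_keys vector are assumptions for illustration. Only the SPECIES case (query on speciesKey) is shown.

```r
# Toy occurrences with grid cells already assigned (made-up values)
occs <- data.frame(
  speciesKey = c(2427091, 2427091, 2443002, 9999999),
  kingdom = c("Animalia", "Animalia", "Animalia", "Plantae"),
  year = c(2018, 2018, 2020, 2020),
  eea_cell_code = c("1kmE3903N3136", "1kmE3903N3136",
                    "1kmE3903N3136", "1kmE3903N3136")
)
occs$occ_count <- 1L  # each row counts as one occurrence

# Baseline: all occurrences, counted per kingdom, year and grid cell
baseline_cube <- aggregate(
  occ_count ~ kingdom + year + eea_cell_code,
  data = occs,
  FUN = sum
)

# Alien data: keep checklist species only, then count per species as well
checklist_species_keys <- c(2427091, 2443002)  # hypothetical checklist
alien <- occs[occs$speciesKey %in% checklist_species_keys, ]
alien_cube <- aggregate(
  occ_count ~ speciesKey + kingdom + year + eea_cell_code,
  data = alien,
  FUN = sum
)
```

Because only groups that occur in the data produce a row, years and grid cells without occurrences are absent from both results, as stated above.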

Aggregate data at species level for broader use

There is a huge need to produce national data cubes without filtering out non-alien taxa, as we currently do (fourth pipeline).

This need has been expressed by LifeWatch ERIC IJI as starting point for a use case and also by @qgroom. Quentin wrote:

It might be useful to add a flag so that it is an option in the code. I can see many people finding it useful for all species.
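
Such a flag could be a simple function argument. The sketch below uses a hypothetical build_cube() function with a hypothetical alien_only argument. It is not the pipeline's real interface, and it does not handle the case where the filter leaves no rows.

```r
# Hypothetical cube builder with a switch for the alien-taxa filter
build_cube <- function(occs, checklist_keys = NULL, alien_only = TRUE) {
  if (alien_only) {
    # Current behaviour: keep only taxa on the alien checklist
    occs <- occs[occs$speciesKey %in% checklist_keys, ]
  }
  occs$n <- 1L
  aggregate(n ~ speciesKey + year + eea_cell_code, data = occs, FUN = sum)
}

# Toy occurrences (made-up values)
occs <- data.frame(
  speciesKey = c(1, 1, 2),
  year = c(2018, 2018, 2018),
  eea_cell_code = c("1kmE3903N3136", "1kmE3903N3136", "1kmE3903N3136")
)

alien_cube <- build_cube(occs, checklist_keys = 1)      # alien taxa only
all_species_cube <- build_cube(occs, alien_only = FALSE) # every species
```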

Additional checks for vague date ranges required?

Early records are less likely to be resolved to single years.
For example, the first exemplar row here
https://zenodo.org/record/3635510#.Xj1LLWj7SHt
1700 | 1kmE3802N3133 | 2287615 | 1 | 301
apparently derives from the GBIF record here https://www.gbif.org/occurrence/477065724
but this seems to misrepresent the original
https://mczbase.mcz.harvard.edu/guid/MCZ:Mala:152567
which gives a collecting date of 1700-2009 (i.e. presumably unknown or not digitised?)

  • Should the automated aggregation process include some sort of flag for early records that are unlikely to, in reality, be resolved to a single year?
  • What checks could be done?
  • For example, it's not clear to me why the GBIF record linked above has a date but also the claim of "no verbatim date data"; is this contradictory?
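
One possible check is sketched below. It assumes the date is available as an ISO 8601 string in which a range is written as start/end. How ranges actually appear in a given GBIF download is an assumption here, and year_span() is a hypothetical helper.

```r
# Hypothetical helper: number of years spanned by each date or date range.
# Assumes ISO 8601 strings, with ranges written as "start/end".
year_span <- function(event_date) {
  bounds <- strsplit(event_date, "/", fixed = TRUE)
  vapply(bounds, function(b) {
    years <- as.integer(substr(b, 1, 4))
    if (anyNA(years)) return(NA_integer_)
    max(years) - min(years)
  }, integer(1))
}

# Made-up examples: a single day, a multi-century range,
# a range within one year, and a missing date
event_date <- c("2018-05-12", "1700-01-01/2009-12-31", "1999-03/1999-04", NA)

span <- year_span(event_date)

# Flag records that cannot be resolved to a single year
vague <- !is.na(span) & span > 0
```

A record like the 1700-2009 example above would be flagged, so it could be excluded or reported instead of being counted under its first year.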

Occurrences from new zenodo cube all over the place

Observations of Lithobates catesbeianus for the year 2018:

[image] vs [image]

When I look at be_alientaxa_cube.csv from Zenodo, the number of infected grid cells for Lithobates catesbeianus seems to be inflated compared to the previous version.

library(readr)
library(dplyr)

# Column types shared by both versions of the cube
cube_cols <- cols(
  year = col_double(),
  eea_cell_code = col_character(),
  taxonKey = col_double(),
  n = col_double(),
  min_coord_uncertainty = col_double()
)

# New version of the cube (Zenodo record 10058400)
df_new <- read_csv(
  file = "https://zenodo.org/records/10058400/files/be_alientaxa_cube.csv?download=1",
  col_types = cube_cols,
  na = ""
)
test_new <- df_new %>% filter(taxonKey == 2427091)

# Previous version of the cube (Zenodo record 5819028)
df_old <- read_csv(
  file = "https://zenodo.org/records/5819028/files/be_alientaxa_cube.csv?download=1",
  col_types = cube_cols,
  na = ""
)
test_old <- df_old %>% filter(taxonKey == 2427091)

PS not certain this issue belongs here!
