
natagora-occurrences's Introduction

Observations.be - Species occurrence datasets published by Natagora

Rationale

This repository contains the functionality to standardize datasets of observations.be to Darwin Core Occurrence datasets that can be harvested by GBIF. It was originally developed for the TrIAS project.

Workflow

observations.be database → Darwin Core SQL view → Direct connection with the IPT or CSV upload

Datasets

| Title (and GitHub directory) | IPT | GBIF |
| --- | --- | --- |
| Observations.be - Non-native species occurrences in Wallonia, Belgium | natagora-alien-occurrences | https://doi.org/10.15468/p58ip1 |
| Observations.be - Orthoptera occurrences in Wallonia, Belgium | natagora-orthoptera-occurrences | https://doi.org/10.15468/r763pb |

Repo structure

The structure for each dataset in datasets is based on Cookiecutter Data Science and the Checklist recipe. Files and directories indicated with GENERATED should not be edited manually.

├── sql                      : Darwin Core SQL queries
│
└── specs                    : Whip specifications for validation

references contains controlled vocabularies for:

These are shared with the waarnemingen.be datasets.

Validating with whip

Published data can be validated with whip:

  1. Download the published DwC Archive from the IPT
  2. Unzip the data in the directory data (git ignored), so data are available at data/data_file.txt
  3. In terminal, start jupyter notebook from the repository root
  4. Open notebooks/whip.ipynb
  5. In the notebook, set the correct paths at the top of the file
  6. Run the notebook
  7. Update dataset or specifications until they align
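
Step 2 above can be scripted. The following is a minimal sketch using only the Python standard library; the function name and the archive path are hypothetical, and only the `data` target directory comes from the steps above.

```python
import zipfile
from pathlib import Path


def unzip_archive(archive_path, target_dir="data"):
    """Extract a downloaded Darwin Core Archive into target_dir.

    Returns the sorted list of file names contained in the archive.
    """
    target = Path(target_dir)
    # Create the (git ignored) data directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())
```

Calling `unzip_archive("dwca.zip")` (a hypothetical file name) from the repository root would place the archive contents in `data`.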

Contributors

List of contributors

License

MIT License

natagora-occurrences's People

Contributors

lienreyserhove, louisnatagora, peterdesmet

natagora-occurrences's Issues

Suggested changes for occurrenceRemarks

I inspected the content of the current field occurrenceRemarks, a compilation of the information contained in typeActid, typeKid and typeMid. Some suggestions for change:

| occurrenceRemarks | decision | source | assign to new Darwin Core term |
| --- | --- | --- | --- |
| recently hatched young | remove | typeActID | lifeStage leave in occurrenceRemarks |
| unknown | remove | typeActID | ? |
| indigenous | remove | typeActID | establishmentMeans |
| present | remove | typeActID | occurrenceStatus |
| occupied nest with eggs | remove | typeActID | lifeStage leave in occurrenceRemarks |
| adventive | remove | typeActID | establishmentMeans (= introduced), leave in occurrenceRemarks, but write as accidentally introduced |
| micr_examined_mat_present | remove | typeActID | ? identificationRemarks (= microscopic examination) |
| catch_by_cat | remove | typeActID | leave in occurrenceRemarks, but write as catch by cat |
| washed_ashore | remove | typeActID | leave in occurrenceRemarks, but write as washed ashore |
| with broodpatch | remove | typeActID | reproductiveStatus leave in occurrenceRemarks, but write as with brood patch |
| microscopic_examined | remove | typeActID | samplingProtocol identificationRemarks (= microscopic examination) |
| seen while diving | remove | typeMid | samplingProtocol |

Suggested changes for samplingProtocol

I inspected the content of the current field samplingProtocol, originally mapped from typeActid. Some suggestions for change:

| samplingProtocol | decision | source | new Darwin Core term |
| --- | --- | --- | --- |
| planted | remove | typeActID | occurrenceRemarks |
| unknown | remove | typeActID | ? |
| escaped | remove | typeActID | occurrenceRemarks |
| indigenous | remove | typeActID | establishmentMeans |
| sown | remove | typeActID | occurrenceRemarks |
| present | remove | typeActID | occurrenceStatus |
| adventive | remove | typeActID | establishmentMeans: accidentally introduced in occurrenceRemarks |
| micr_examined_mat_present | remove | typeActID | ?: microscopic examination in identificationRemarks |
| catch_by_cat | remove | typeActID | occurrenceRemarks (= catch by cat) |
| washed_ashore | remove | typeActID | occurrenceRemarks (= washed ashore) |

Feedback on first_test_herpetology_one_user

Thanks for this first data input. I browsed through the file, here are my remarks:

  1. Presence of non-DwC terms:

The following terms are not Darwin Core terms:

  • typekid
  • typeactid
  • typemid
  • act_occrem
  • kleed_occrem
  • met_occrem
  • act_samplingprotocol
  • met_samplingprotocol

typekid, typeactid and typemid are the respective IDs from the database tables type_kleed, type_activiteit and type_determination_method. These IDs have been translated to a vocabulary for the mapping of the Darwin Core fields shown in the table below. The content of a Darwin Core term can be a compilation of IDs from type_kleed and/or type_activiteit and/or type_determination_method, which is why act_occrem, kleed_occrem and met_occrem (for occurrenceRemarks) and act_samplingprotocol and met_samplingprotocol (for samplingProtocol) have been created.

id identificationRemarks lifeStage occurrenceRemarks samplingProtocol reproductiveCondition behavior
typekid x x x
typeactid x x x
typemid x x x
  2. accessRights: the current link https://www.natagora.be/usage-des-donnees does not work (we already discussed this)

  3. There should be no dataGeneralizations in the datasets: all locations should be point coordinates and not generalized to a 5x5km UTM grid

  4. decimalLatitude and decimalLongitude are not in the correct format. Values like 507098773 and 567654132843018 are currently present; these should be something like 50.7098 and 5.6765
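
A quick way to catch values like these is a range check. The sketch below assumes the coordinates are already parsed as numbers; the function name is hypothetical. It only uses the fact that decimal latitudes lie between -90 and 90 and decimal longitudes between -180 and 180.

```python
def is_valid_decimal_coordinate(latitude, longitude):
    """Return True if both values fall within the legal decimal degree ranges."""
    return -90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0


# The expected format passes, the values currently in the file do not.
print(is_valid_decimal_coordinate(50.7098, 5.6765))              # True
print(is_valid_decimal_coordinate(507098773, 567654132843018))   # False
```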

Remove generalized records

I know we discussed this issue before (and to be honest, I can't remember why we decided to keep them in), but in hindsight, I believe we should remove the generalized records (those with coordinates generalized to a 4x4km IFBL grid). This is because:

  1. The contract specifies that all records should be published as point records.
  2. It only concerns 116 records

I also wonder why georeferenceRemarks is set to coordinates are centroid of used grid square for 2745 records. I suppose this is for reasons other than secrecy?

Use correct link in references

references should have a link like https://observations.be/observation/193304784/. Currently the values are https://observations.be/waarneming/view/193304784.
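
If the fix is easier to apply after export than in the SQL view, the rewrite could look like the sketch below. The function name and the regular expression are assumptions; only the two URL shapes come from the example above.

```python
import re

# Matches the old link style, e.g. https://observations.be/waarneming/view/193304784
OLD_PATTERN = re.compile(r"^https://observations\.be/waarneming/view/(\d+)/?$")


def fix_reference(url):
    """Rewrite an old-style observation link; leave any other value untouched."""
    match = OLD_PATTERN.match(url)
    if match is None:
        return url
    return f"https://observations.be/observation/{match.group(1)}/"
```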

Review of full Natagora dataset

I reviewed the whole Natagora dataset, below are my remarks. Could you please tick off the boxes when the changes are integrated and send me a new export afterwards?

  • type
  • language: should be en (now e)
  • license
  • rightsHolder
  • accessRights
  • references
  • datasetID: to be completed after publication on GBIF
  • institutionCode
  • datasetName: remove animal from title, dataset is applicable to animals, plants and fungi
  • basisOfRecord
  • informationWithheld
  • dataGeneralizations: see #11
  • occurrenceID
  • individualCount
  • sex
  • lifeStage
  • behavior: collected should be NA, see #8
  • occurrenceRemarks: collected should be NA, see #8
  • samplingProtocol: collected should be specimen collected, see #8
  • dynamicProperties
  • reproductiveCondition
  • identificationRemarks
  • eventDate
  • continent
  • countryCode: should be BE (now B)
  • stateProvince
  • municipality
  • decimalLatitude
  • decimalLongitude
  • geodeticDatum
  • coordinateUncertaintyInMeters: for 2745 records the uncertainty is 2828, correct? see #11
  • georeferenceRemarks: coordinates are centroid of used grid square (2745), correct? see #11
  • identificationRemarks
  • taxonID
  • scientificName
  • kingdom
  • taxonRank
  • nomenclaturalCode: Should be ICZN for Animalia and ICN for plants and fungi
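
The nomenclaturalCode rule in the last item can be expressed as a lookup on kingdom. This is only a sketch: the kingdom spellings `Plantae` and `Fungi` are assumed (the remark above only says plants and fungi), and the function name is hypothetical.

```python
# ICZN for animals, ICN for plants and fungi, as requested in the review.
NOMENCLATURAL_CODES = {
    "Animalia": "ICZN",
    "Plantae": "ICN",  # assumed spelling of the kingdom value
    "Fungi": "ICN",    # assumed spelling of the kingdom value
}


def nomenclatural_code(kingdom):
    """Return the nomenclatural code for a kingdom, or None if it is not covered."""
    return NOMENCLATURAL_CODES.get(kingdom)
```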

Suggested changes for behavior

I inspected the content of the current field behavior, originally mapped from typeActid. Some suggestions for change:

| behavior | decision | source | new Darwin Core term |
| --- | --- | --- | --- |
| planted | remove | typeActID | occurrenceRemarks |
| escaped | remove | typeActID | occurrenceRemarks |
| indigenous | remove | typeActID | establishmentMeans |
| sown | remove | typeActID | occurrenceRemarks |
| present | remove | typeActID | occurrenceStatus |

Selection criteria for occurrences

A selection should be made based on the following criteria:

  1. Filter on the status of non-native species:

| category | status | details |
| --- | --- | --- |
| 1b | Incidental / Vagrant / Migrant | Regular sightings in this country |
| 2a | Naturalized | Introduced by man, now autonomously reproducing |
| 2b | Naturalizing | Introduced by man, autonomous populations for 10-100 years. |
| 2c | Exotic | Introduced by man, no autonomous populations for more than 10 years. |
| 2d | Incidental import | Introduced by man, no autonomous population. |

  2. Filter on Walloon observations:

Two important questions:

  1. Do all observations have a regional status provided for each of the provinces?
  2. Do some observations only have national status information?

If the answer to the first question is yes, then all data can be obtained by filtering on observations from the Walloon provinces: Brabant wallon, Hainaut, Liège, Luxembourg, Namur.

  3. Filter on observations for which the observer did not object to the record being published

This raw selection can be uploaded in the branch upload-SQL-dump where you can create a new folder raw.
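
The three criteria can be combined into a single predicate. The sketch below is illustrative only: the record keys `status_category`, `province` and `observer_objects` are hypothetical and do not correspond to known column names in the observations.be database.

```python
# Status categories and provinces taken from the selection criteria above.
STATUS_CATEGORIES = {"1b", "2a", "2b", "2c", "2d"}
WALLOON_PROVINCES = {"brabant wallon", "hainaut", "liège", "luxembourg", "namur"}


def is_selected(record):
    """Return True if a record meets all three selection criteria."""
    return (
        record["status_category"] in STATUS_CATEGORIES
        and record["province"].lower() in WALLOON_PROVINCES
        and not record["observer_objects"]  # observer did not object to publication
    )
```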

Requested changes for behavior, occurrenceRemarks and samplingProtocol

I went through the mapping of the following Darwin Core terms: behavior, lifeStage, occurrenceRemarks, reproductiveCondition and samplingProtocol. For the following terms, some adaptations are required to fit the natuurpunt vocabularies:

behavior

  • collected is not integrated in the vocabulary for behavior and should be NA instead (thus obsbe_act = COLLECTED has no value for behavior)

occurrenceRemarks

  • collected is not integrated in the vocabulary for occurrenceRemarks and should be NA instead (thus obsbe_act = COLLECTED has no value for occurrenceRemarks)
  • Use | as a separator for multiple values, rather than ; (note: there's a space before and after the separator)
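
As a small illustration of the separator convention, the sketch below joins several remarks with a space on either side of the pipe and skips empty values. The function name is hypothetical.

```python
def join_remarks(values):
    """Join non-empty remarks with ' | ' (space, pipe, space)."""
    return " | ".join(value for value in values if value)
```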

samplingProtocol

  • Only one value is allowed for samplingProtocol. Currently, the field often contains a combination of casual observation and another sampling protocol. Instead, casual observation should be the default value, used only when no alternative is available. This is why I suggest mapping obsbe_act and obsbe_method directly to samplingProtocol, rather than joining the content of the intermediary columns act_samplingProtocol and met_samplingProtocol. The mapping should look like this (use a case_when statement):

| obsbe_act | obsbe_method | samplingprotocol |
| --- | --- | --- |
| CAMERATRAP | (...) | camera trap |
| CATCH | (...) | catch |
| CATCH_ELECTRIC | (...) | catch by electrofishing |
| CATCH_POLE | (...) | catch by fishing rod |
| CATCH_NET | (...) | catch by net |
| COLLECTED | (...) | specimen collected |
| WITH_DETECTORHUNTING | (...) | observation with bat detector |
| FLASHLIGHT_NIGHT_OBSERVATION | (...) | observation with flashlight |
| IN_PELLET | (...) | pellet examination |
| COLLECTED | (...) | specimen collected |
| (...) | BATDETECTOR | observation with bat detector |
| (...) | CAMERATRAP | camera trap |
| (...) | CAUGHT | catch |
| (...) | CAUGHT_ELECTRIC | catch by electrofishing |
| (...) | CAUGHT_BY_HAND | catch by hand |
| (...) | CAUGHT_BY_HAND_AND_COLLECTED | catch by hand and collected |
| (...) | CAUGHT_NET | catch by net |
| (...) | CAUGHT_POLE | catch by pole |
| (...) | BEATING_SCREEN | catch by screen |
| (...) | COLOURTRAP | colour trap |
| (...) | HEARD | heard |
| (...) | LIGHTTRAP | light trap |
| (...) | IN_PELLET | pellet examination |
| (...) | SEEN | seen |
| (...) | SEEN_AND_HEARD | seen and heard |
| (...) | INDOORS | seen indoors |
| (...) | SOUNDTRAPPED | sound trap |
| (...) | SPOTLIGHT_NIGHT_OBSERVATION | spotlight |
| (...) | TRACK_BED | track bed |
| (...) | (...) | casual observation |
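
The same first-match logic can be sketched outside SQL or R. The Python version below assumes, as the row order of the table suggests, that obsbe_act is checked before obsbe_method; that precedence and the function name are assumptions, while the vocabulary values come from the table.

```python
ACT_TO_PROTOCOL = {
    "CAMERATRAP": "camera trap",
    "CATCH": "catch",
    "CATCH_ELECTRIC": "catch by electrofishing",
    "CATCH_POLE": "catch by fishing rod",
    "CATCH_NET": "catch by net",
    "COLLECTED": "specimen collected",
    "WITH_DETECTORHUNTING": "observation with bat detector",
    "FLASHLIGHT_NIGHT_OBSERVATION": "observation with flashlight",
    "IN_PELLET": "pellet examination",
}

METHOD_TO_PROTOCOL = {
    "BATDETECTOR": "observation with bat detector",
    "CAMERATRAP": "camera trap",
    "CAUGHT": "catch",
    "CAUGHT_ELECTRIC": "catch by electrofishing",
    "CAUGHT_BY_HAND": "catch by hand",
    "CAUGHT_BY_HAND_AND_COLLECTED": "catch by hand and collected",
    "CAUGHT_NET": "catch by net",
    "CAUGHT_POLE": "catch by pole",
    "BEATING_SCREEN": "catch by screen",
    "COLOURTRAP": "colour trap",
    "HEARD": "heard",
    "LIGHTTRAP": "light trap",
    "IN_PELLET": "pellet examination",
    "SEEN": "seen",
    "SEEN_AND_HEARD": "seen and heard",
    "INDOORS": "seen indoors",
    "SOUNDTRAPPED": "sound trap",
    "SPOTLIGHT_NIGHT_OBSERVATION": "spotlight",
    "TRACK_BED": "track bed",
}

DEFAULT_PROTOCOL = "casual observation"


def sampling_protocol(obsbe_act, obsbe_method):
    """Return exactly one samplingProtocol value for an observation."""
    # Assumed precedence: activity first, then method, then the default.
    if obsbe_act in ACT_TO_PROTOCOL:
        return ACT_TO_PROTOCOL[obsbe_act]
    if obsbe_method in METHOD_TO_PROTOCOL:
        return METHOD_TO_PROTOCOL[obsbe_method]
    return DEFAULT_PROTOCOL
```

Because the function returns on the first match, the result never combines casual observation with another protocol.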

Remap some values for behavior, occurrenceRemarks and samplingProtocol

The following values in the Natagora alien species dataset are incorrectly mapped. @LouisNatagora can you correct this:

| value | behavior | occurrenceRemarks | samplingProtocol |
| --- | --- | --- | --- |
| dead | (empty) | found dead | (default) |
| destroyed_nest | (empty) | found as destroyed nest | (default) |
| drowning_victim | (empty) | drowning victim | (default) |
| eating | feeding | (empty) | (default) |
| laying_egg | laying egg | (empty) | (default) |
| prey_dead | (empty) | found dead | (default) |
| tagged | (empty) | tagged | (default) |
| tracks | (empty) | found as tracks | (default) |
| unknown | (empty) | (empty) | (default) |

Note that (default) is the default mapping to casual observation
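
For reference, the requested remapping can be written as a lookup. This is a sketch only: the function name is hypothetical and `None` stands for (empty).

```python
# value -> (behavior, occurrenceRemarks); None stands for (empty).
REMAP = {
    "dead": (None, "found dead"),
    "destroyed_nest": (None, "found as destroyed nest"),
    "drowning_victim": (None, "drowning victim"),
    "eating": ("feeding", None),
    "laying_egg": ("laying egg", None),
    "prey_dead": (None, "found dead"),
    "tagged": (None, "tagged"),
    "tracks": (None, "found as tracks"),
    "unknown": (None, None),
}


def remap(value):
    """Return the three Darwin Core fields for one of the values listed above."""
    behavior, occurrence_remarks = REMAP[value]
    return {
        "behavior": behavior,
        "occurrenceRemarks": occurrence_remarks,
        # (default) maps to casual observation for every listed value.
        "samplingProtocol": "casual observation",
    }
```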

Complete README

@LienReyserhove as per other datasets, can you complete the README for this dataset? I'm especially missing links to the dataset on IPT and GBIF.

I would also rename the src directory to sql

reproductiveCondition

This field (for which I could not find documentation on the web) currently only contains information from kleed: 'queen', 'worker', 'winged gyne' and 'unwinged gyne'.

I suggest we also add the following information to this field, which is currently sent to other fields:
1/ currently sent to behavior: 'territorial behavior', 'copulating', 'laying egg', 'transporting feed or faeces', 'courtship/mating', 'nest building', 'distraction display'
2/ currently sent to occurrenceRemarks: 'adult in territory', 'near nest', 'colony in trees', 'colony', 'found as nest', 'occupied nest', 'occupied nest with eggs', 'occupied nest with young', 'with broodpatch', 'pair in territory', 'probably nesting place', 'recently hatched young', 'recently used nest', and also 'found as substrate with miner damage', 'found as gall', 'found as egg mass', 'found as cocoon' and 'abandoned nest'

Do you find this useful or inappropriate?

Why is there high coordinateUncertaintyInMeters

2 questions:

  1. Some (2%) of the records have "coordinates are generalized to a 4x4km IFBL grid". What is the reason for this?
  2. For those records, the coordinateUncertaintyInMeters is 2828, but I'm also seeing higher values (e.g. 9363): where are these coming from?
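
One possible reading of the value 2828, offered here as an assumption rather than a confirmed explanation, is that it is the distance from the centroid of a 4x4km grid square to one of its corners (half the diagonal). The arithmetic can be checked as follows; it does not account for the higher values such as 9363.

```python
import math


def centroid_to_corner_distance(cell_size_m):
    """Half the diagonal of a square grid cell, in meters."""
    return cell_size_m * math.sqrt(2) / 2


# A 4x4km cell: 4000 * sqrt(2) / 2 is about 2828.43 m.
print(round(centroid_to_corner_distance(4000)))  # 2828
```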
