inbo / etn

R package to access data from the European Tracking Network

Home Page: https://inbo.github.io/etn/

License: MIT License

R 100.00%
r fish animal-movement animal-tracking biologging data-access oscibio lifewatch r-package rstats

etn's Introduction

etn

Etn provides functionality to access data from the European Tracking Network (ETN) database hosted by the Flanders Marine Institute (VLIZ) as part of the Flemish contribution to LifeWatch. ETN data is subject to the ETN data policy and can be:

  • restricted: under moratorium and only accessible to logged-in data owners/collaborators
  • unrestricted: publicly accessible without login and routinely published to international biodiversity facilities

The ETN infrastructure currently requires the package to be run within the LifeWatch.be RStudio server, which is password protected. A login can be requested at http://www.lifewatch.be/etn/contact.

Installation

You can install the development version of etn from GitHub with:

# install.packages("devtools")
devtools::install_github("inbo/etn")

Meta

  • We welcome contributions including bug reports.
  • License: MIT
  • Get citation information for etn in R doing citation("etn").
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

etn's People

Contributors

damianooldoni, jreubens, peterdesmet, pieterjanverhelst, pietrh, stijnvanhoey

etn's Issues

get_ functions output: tibble or just data.frame?

The get_ functions, such as get_animals() or get_projects(), return a classic data.frame. However, more and more R developers now use tibbles.
To see the differences between a tibble and a regular data.frame, read this tutorial from tidyverse (yes, tibble comes from the tidyverse world!) or this chapter from R for Data Science.
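
One practical difference is how single-column subsetting behaves. The sketch below uses a made-up two-row data frame (the column names are illustrative, not the actual get_animals() output) and only runs the tibble part if the tibble package is installed:

```r
# Toy data; column names are illustrative, not the real get_animals() output.
animals_df <- data.frame(
  animal_id = c(1, 2),
  scientific_name = c("Salmo salar", "Anguilla anguilla"),
  stringsAsFactors = FALSE
)

# A data.frame silently drops to a vector when one column is selected.
df_column <- animals_df[, "scientific_name"]
is.data.frame(df_column)  # FALSE

# A tibble keeps its shape for the same operation.
if (requireNamespace("tibble", quietly = TRUE)) {
  animals_tbl <- tibble::as_tibble(animals_df)
  tbl_column <- animals_tbl[, "scientific_name"]
  is.data.frame(tbl_column)  # TRUE
}
```

Code that relies on the drop-to-vector behaviour would need a review if the get_ functions switched to tibbles.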

Problems installing etn package

People are running into problems installing the etn package, very likely because of outdated packages that need updating. I just encountered this problem as well. This is what I got:

> library(etn)
Error: package or namespace load failed for ‘etn’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
 namespace ‘rlang’ 0.4.2 is already loaded, but >= 0.4.3 is required

Note: you may get a different message, depending on the state of your packages.

Below is how I solved it: by reinstalling etn and updating all packages it depends on, directly or indirectly.

Please follow this procedure:

  1. Restart R (Session -> Restart R, or just Ctrl+Shift+F10).
  2. Type devtools::install_github("inbo/etn") in the console.
  3. You should get this message, followed by a list of packages:

     These packages have more recent versions available.
     Which would you like to update?

     At the prompt "Enter one or more numbers separated by spaces, or an empty line to cancel", type 1, which stands for "All".
  4. Restart R again (Session -> Restart R, or just Ctrl+Shift+F10).
  5. Load the package by typing library(etn) in the console. I no longer got any error.

Please, let me know if this helped.
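
To see in advance whether a dependency is too old, something like the following sketch could be used. has_min_version() is a hypothetical helper, not part of etn; it reads the installed version from disk, so it says nothing about an older version that is already loaded in the current session (hence the restarts in the procedure above).

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or newer.
# It reads the installed version from disk and does not load the package.
has_min_version <- function(pkg, min_version) {
  if (!nzchar(system.file(package = pkg))) {
    return(FALSE)  # package is not installed at all
  }
  utils::packageVersion(pkg) >= min_version
}

# The version required in the error message above.
has_min_version("rlang", "0.4.3")
```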

Extending etn package to retrieve cpod data

This issue discusses the enhancement proposed by @DebusschereE: extending the etn package to retrieve cpod data, which are stored in the ETN database as well: http://www.lifewatch.be/etn/cpoddeployments.

Yes, this enhancement would certainly be nice. We already have some changes planned for the package in response to changes at database level. I will do my best to extend the package to cpod data as well before the "ETN training school" planned on 24 - 25 March 2020.

Conflict in functions due to new column called "project_type" in view/database

While starting to solve issue #24, I found that many functions of the etn package crash or do not return the desired result. After debugging, @stijnvanhoey and I are both fairly sure that the problem is the introduction of a new column called project_type. The name is unfortunate because it is also the name of the input parameter for filtering on animal or network projects. And dplyr doesn't like this at all unfortunately 😢

I see two options:

  1. change of the column name in views/database
  2. change the name of the parameter internally in all affected functions

To prevent this kind of bug in the future, could @bwydoogh and @jreubens run the unit tests in RStudio before any update of views/database?
It is just one line of code:

devtools::test()

If no errors are returned, then the change is welcome!
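
The name clash can be reproduced without a database. The sketch below uses base R's subset() as a stand-in, since it looks up names in the data frame columns first, much like dplyr's filter() does; the toy data and both helper functions are made up for illustration.

```r
# Toy projects table; values are illustrative only.
projects <- data.frame(
  projectcode = c("2015_dijle", "thornton", "homarus"),
  project_type = c("animal", "network", "animal"),
  stringsAsFactors = FALSE
)

# Clash: inside subset(), both sides of `==` resolve to the column,
# so the condition is always TRUE and nothing is filtered out.
filter_clash <- function(data, project_type) {
  subset(data, project_type == project_type)
}

# Workaround: copy the argument to a name that is not a column.
filter_fixed <- function(data, project_type) {
  wanted_type <- project_type
  data[data$project_type == wanted_type, , drop = FALSE]
}

nrow(filter_clash(projects, "animal"))  # 3: the argument is ignored
nrow(filter_fixed(projects, "animal"))  # 2
```

This corresponds to option 2 above. In dplyr code, the .env pronoun (as in .env$project_type) is another way to refer to the function argument explicitly.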

External database access: considerations and discussion

I've recently had a conversation with my colleague @peterdesmet about the possibility, for the European Tracking Network, to give access to a PostgreSQL database (hosted at VLIZ) over the Internet.

A few details about how we would like to implement this:

A small group of users will have access to this database in read-only mode. We also plan to make use of the row security policies feature of PostgreSQL so each user has only access to the rows that concern him/her. Since we don’t have much experience with this feature and its limitations, we plan to implement a small prototype to make sure the correct access restrictions can be easily implemented.

About exposing PostgreSQL over the Internet:

I agree this route opens one more “door” to the network infrastructure than a more traditional approach (such as a data visualization web app that loads its data from a database that is not itself accessible from the outside network), but we think it would work well for this project and that the risks can be greatly mitigated:

  • Direct database access would give great flexibility (manual SQL queries) at a very limited cost for this small group of experienced users.
  • The access would be read-only, with only access to a subset of rows.
  • If you prefer to have this database separated from your other PostgreSQL databases, we could run it in a separated/more isolated environment (such as a dedicated machine on the DMZ)
  • We’d pay attention to the usual security measures: always keep PostgreSQL updated (in case of security issues in the PostgreSQL server itself), use SSL, strong passwords, … We can also automate (script, tests, …) the data updates, so the user permissions are automatically configured, and checked periodically.

Maybe we could also reduce the attack surface further by forcing that traffic through a tunnel such as OpenVPN. That would require more configuration on both sides (infrastructure + users), so we’d need to assess the feasibility first.

Don’t hesitate to give us feedback on this. In short: we’d like to take this route but understand the concerns and would like to work together to reach a good solution for everyone involved.

my_projects

  • owner_organisation/group should be added to the available output fields. Currently you cannot tell which group a project belongs to. Ideally a PI should be added as well, but this is not yet available in the database (it should be added soon).

Cannot edit .Renviron on rstudio.lifewatch.be

@fwaumans, I notice my .Renviron file on rstudio.lifewatch.be contains:

username="..."
password="..."

I would like to update this to:

userid="..."
pwd="..."

Which is the RStudio default (https://db.rstudio.com/best-practices/managing-credentials/#use-environment-variables) and used by this vignette:

my_con <- connect_to_etn(Sys.getenv("userid"),
Sys.getenv("pwd"))

When I try to do so, I get:

this source file is read-only so changes cannot be saved

Can this be resolved?

Don't export certain functions

project filtering as function input

@PieterjanVerhelst and @jreubens
when requesting data and adding network or animal projects to the input, which type of input do you expect to use: name or projectcode?

Let me explain with an example, using the get_tags function:

get_tags(connection)  # all tags
get_tags(connection, animal_project = NULL)  # all tags
get_tags(connection, animal_project = "phd_reubens")  # tags of animal project phd_reubens
get_tags(connection, animal_project = c("phd_reubens", "homarus"))  # tags of animal projects phd_reubens and homarus

In the example, I use the projectcode. Is this the appropriate naming or will people use the name?

An example of the info about animal projects:

    id                   name        projectcode imis_dataset_id
1  616              rangetest          rangetest              NA
2   16        PhD Jan Reubens        phd_reubens            5846
3  599                Homarus            homarus              NA
4  632 Ocean Tracking Network                OTN              NA
5   22             2015 Dijle         2015_dijle            5872
6   21             2014 Demer         2014_demer            5871
7  621     2016 PhD Vergeynst 2016_phd_vergeynst            5875
8   18      2013 Albertkanaal  2013_albertkanaal            5868
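
Whichever input is chosen, the functions could fail early when a value is not a known project code. A minimal sketch, assuming a hypothetical check_project_codes() helper and a hand-copied subset of the table above:

```r
# Subset of the animal projects listed above.
animal_projects <- data.frame(
  name = c("PhD Jan Reubens", "Homarus", "2015 Dijle"),
  projectcode = c("phd_reubens", "homarus", "2015_dijle"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stop with a clear message on unknown project codes.
check_project_codes <- function(codes, valid_codes) {
  unknown <- setdiff(codes, valid_codes)
  if (length(unknown) > 0) {
    stop(
      "Unknown project code(s): ", paste(unknown, collapse = ", "),
      ". Valid codes: ", paste(valid_codes, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(codes)
}

# Valid codes pass silently.
check_project_codes(c("phd_reubens", "homarus"), animal_projects$projectcode)

# Passing a project name instead of a code fails early.
result <- try(
  check_project_codes("PhD Jan Reubens", animal_projects$projectcode),
  silent = TRUE
)
inherits(result, "try-error")  # TRUE
```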

Authentication failed while trying establishing connection

I am busy developing code with @stijnvanhoey.
While testing the function connect_to_etn(user, password), I get this error message:

Error: nanodbc/nanodbc.cpp:950: 28P01: [unixODBC]FATAL: password authentication failed for user "damianoo"

I am using the same credentials as for the RStudio server. Any idea?
Thanks!

Uniqueness of scientific names not guaranteed

When I query the DISTINCT names from the animals view, I get both Salmo salar and salmo salar

> scientific_names(connection)
 [1] "Rutilus rutilus"      "Alosa fallax"         "Platichthys flesus"  
 [4] "Built-in"             "Anguilla anguilla"    "Petromyzon marinus"  
 [7] "Sentinel"             "Squalius cephalus"    "Cyprinus carpio"     
[10] "salmo salar"          "Sync tag"             "Gadus morhua"        
[13] "Silurus glanis"       "Lampetra fluviatilis" "Homarus gammarus"    
[16] "Salmo salar" 

@bwydoogh is this something to fix at database level? Or should I make sure to do all comparisons in lower case?
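
If the fix were made on the package side, one option is to normalise the case before comparing. normalise_scientific_name() below is a hypothetical helper, shown on a few of the names from the output above:

```r
# Hypothetical helper: capitalise the first letter, lower-case the rest.
normalise_scientific_name <- function(x) {
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}

names_raw <- c("salmo salar", "Gadus morhua", "Salmo salar")
unique(normalise_scientific_name(names_raw))
# "Salmo salar" "Gadus morhua"
```

Non-species values such as "Sync tag" or "Built-in" pass through unchanged, so they would still need separate handling.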

Update pkg to work with new views

@jreubens, @IPauwels, @PieterjanVerhelst, @fwaumans and I completely reviewed the 5 views offered by ETN (tags, animals, receivers, deployments, detections) to make them consistent and to expose the same information to users via the ETN application and the etn package.

Once @fwaumans is finished with implementing the views, we should update all functions to make use of the new views. This might imply renaming some parameters.

Size limit data upload

I was trying to download the dataset of the animal network 2013_albertkanaal and apparently we cannot download files larger than 100 Mb. Is it possible to allow download of larger files?
Here is the error I got: Error: cannot allocate vector of size 100.0 Mb

Project names and project codes

It is sometimes not very clear if you need to use a project name or project code in the function. I think for us it is often quite obvious, but for external/new users in the future, this may require some extra documentation.

How to select acoustic vs c-pod?

  1. Are those two separate types of data that should never be shown together: e.g. never showing receivers for both c-pod and acoustic at the same time?
  2. Should the visible fields in receivers, deployments and detections differ between acoustic and c-pod, or should it be the same, i.e. different views or shared views?
  3. Depending on 2, we need a way to select c-pod vs acoustic. For now we can only make that selection in the (shared) view of receivers.

How to return animal-tag relationship?

Currently:

  • get_tags() doesn't return unique tags, as it's internally linked to animals to allow filtering on animal project. Since a tag (A69-1303-20695) can be associated with multiple animals (673 and 674), duplicates are returned
  • get_animals() doesn't return unique animals, as the view returns animals + their tag ID. Since an animal (2369) can be associated with multiple tags (A69-9006-971 and A69-9006-972), duplicates are returned.

I would:

  • get_tags(): remove the option to select on animal project, so the function (just like the view and the table on the website) returns unique tags
  • get_animals(): return unique animals by default, which is similar to how the website displays a single row per animal
  • get_animals(): add an option to include tag_fk (resulting in multiple rows per animal), so one can join with information from get_tags()
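
The proposal can be sketched on toy data. The IDs below come from the examples above; the column names and the get_animals_sketch() function are assumptions for illustration, not the package's actual interface.

```r
# Toy version of a view with one row per animal-tag combination.
animal_tags <- data.frame(
  animal_id = c(2369, 2369, 673, 674),
  tag_id = c("A69-9006-971", "A69-9006-972", "A69-1303-20695", "A69-1303-20695"),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper around the proposed behaviour.
get_animals_sketch <- function(data, include_tags = FALSE) {
  if (include_tags) {
    return(data)  # one row per animal-tag combination
  }
  unique(data["animal_id"])  # one row per animal
}

nrow(get_animals_sketch(animal_tags))                       # 3
nrow(get_animals_sketch(animal_tags, include_tags = TRUE))  # 4
length(unique(animal_tags$tag_id))                          # 3 unique tags
```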

get_detections

I think the get_detections() functionality can be improved by adding filters on:

  • Receiver
  • species

connect_to_etn

When using the function (both with the Sys.getenv and by typing my credentials in the function), I get:
Error: nanodbc/nanodbc.cpp:950: 28P01: [unixODBC]FATAL: password authentication failed for user "[email protected]"

I double checked if I am using the right username and password (and tried with another account as well, same issue). Is this a problem of the code or should I change my access rights to ETN itself?

Many receivers don't have network_project_code in new view

In the background, get_receivers() looks up all network codes and keeps only receivers with a network code. As a result, it currently returns only 12 entries.

receivers_all <- DBI::dbGetQuery(connection, "SELECT * FROM vliz.receivers_view2")

Returns 1640 entries, many of them without network_project_code. Why is this information missing?

`get_tags` to `get_transmitters`

@jreubens and @PieterjanVerhelst, in #9 we defined the function get_tags. However, while working on the code, I think it would be more consistent to call it get_transmitters, as this is the naming used in the detections view as well to define the tag identifiers. Moreover, it better complements the get_receivers function.

I propose to provide the function get_transmitters instead of get_tags, ok?

Consistency project_code and project_name

While writing a vignette about the usage of project code (instead of project name) (see #44), I found an inconsistency in the names of the columns related to project codes and project names.

Here is what I mean.

In functions like get_detections() we get columns named like this:

"animal_project_name", "animal_project_code", "network_project_name", "network_project_code"

which is fine.

When running get_projects(), however, we get:

get_projects(my_con) %>% colnames()
 [1] "id"              "name"            "projectcode"     "type"           
 [5] "startdate"       "enddate"         "moratorium"      "imis_dataset_id"
 [9] "latitude"        "longitude"

As you see, columns projectcode and name are returned. I would expect project_code and project_name. The same holds true for get_receivers() which returns a data.frame with a column named projectcode while all other columns containing the word code contain an underscore: *_code like ar_release_code and ar_disable_code.
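
Until the views are aligned, the columns could be renamed on the package side. A minimal base R sketch, where the mapping and the standardise_names() helper are hypothetical:

```r
# Hypothetical mapping from current to proposed column names.
rename_map <- c(projectcode = "project_code", name = "project_name")

# Rename only the columns that appear in the mapping.
standardise_names <- function(data, map) {
  hits <- names(data) %in% names(map)
  names(data)[hits] <- unname(map[names(data)[hits]])
  data
}

projects <- data.frame(id = 616, name = "rangetest", projectcode = "rangetest")
names(standardise_names(projects, rename_map))
# "id" "project_name" "project_code"
```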

get_receivers

Currently the get_receivers function links receivers to a network.
Some remarks on this:

  • The output gives an overview of all deployments that occur or have occurred, so receivers can be present multiple times. I don't think this is the desired output: in my opinion it should list unique receivers.
  • Currently you can only link to the network_project. This should be extended to animal_project as well.
  • Next to projects, you should also be able to view and filter on owner_organisation. The output gives a number; it should be a name. To be able to filter on it using 'get_receivers(my_con, network_project = "INBO")', I guess this info needs to be added to the 'my_projects' output first.

basic set of functions for ETN data access

@jreubens @PieterjanVerhelst as we discussed:

As a replacement for the current getLWData function, we aim for a minimal set of functions for data access. Each of these functions targets a specific set of data tables, comparable to the online ETN data portal.

Apart from a make_connection function, the first set of functions we will focus on is listed below. For each function, we define the required filters as function inputs, which will be embedded in the query on the existing views:

get_projects()

get_deployments(
    network_project,
    active)               # open deployments, default ALL

get_animals(
    network_project,   # i.e. projectlist
    animal_project,    # i.e. tagprojectlist
    species)

get_tags(
    animal_project)    # i.e. tagprojectlist

get_receivers(
    network_project)   # i.e. projectlist

get_detections(
    animal_project,
    network_project,
    start_date,
    end_date,
    station,
    get_tags,
    ...?)

By default, we return all the available columns in the database views/tables; but maybe this will be too much. @jreubens you could maybe define a subset of columns for each output that are really out of scope for any of the ETN users?

get_animals() suspect behavior if only network_project is given as input

I am wondering whether it is correct that

> animals <- get_animals(connection, network_project = "thornton")
> animals2 <- get_animals(connection, network_project = "leopold")
> animals3 <- get_animals(connection, network_project = c("leopold", "thornton"))

are equal. In fact, no error is raised by executing the code below:

> testthat::expect_identical(animals, animals2)
> testthat::expect_identical(animals2, animals3)

Moreover, the network_project value given as input is not one of the projectcode values in the output df:

> which(animals %>% distinct(projectcode) %in% valid_network_projects)
integer(0)
> which(animals2 %>% distinct(projectcode) %in% valid_network_projects)
integer(0)

I would prefer an empty data.frame as output, or a warning. Or am I missing something about the structure of the database?
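
One caveat about the check above: distinct() returns a data frame, and as far as I can tell %in% then compares each column as a whole rather than value by value, so integer(0) can come back even when matching codes are present. A sketch of a more direct check on a plain vector, using made-up values:

```r
# Toy stand-ins; the values are made up for illustration.
animals <- data.frame(
  projectcode = c("thornton", "2015_dijle", "thornton"),
  stringsAsFactors = FALSE
)
valid_network_projects <- c("thornton", "leopold")

# Compare the distinct values as a plain vector, not as a data frame.
codes <- unique(animals$projectcode)
which(codes %in% valid_network_projects)  # 1
```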

species subsetting as function input

@PieterjanVerhelst and @jreubens, when subsetting on species name, I assume you'll provide a number of scientific names, as these seem to be filled in, whereas common_name is not:

> scientific_name
 [1] "Gadus morhua"         "Sentinel"             "Homarus gammarus"    
 [4] "Alosa fallax"         "Built-in"             "Anguilla anguilla"   
 [7] "Lampetra fluviatilis" "Salmo salar"          "Squalius cephalus"   
[10] "Silurus glanis"       "Petromyzon marinus"   "Rutilus rutilus"     
[13] "Platichthys flesus"   "Cyprinus carpio"      "Sync tag"            
[16] "salmo salar"         
> common_name
[1] NA             "Atlantic cod" "Fint"      

ok to filter on the scientific names?

Create COST course material as vignette

@damianooldoni will organize a short session on using the etn package. As course material we propose to create a vignette, which quickly explains a typical workflow.

Workflow suggested by @jreubens:

  • Which projects do I need exactly? It is not always clear how projects are spelled in ETN, so I always start by requesting an overview of my projects.
  • Then I build a query for the data I want: which projects do I need exactly (by default all network projects and a selection of animal projects), which time period is covered, ...
  • Then I retrieve the detections (and optionally filter on a certain species, since you don't include someone else's detections in your analysis, and you don't always want sync tags in there either).
  • Before starting the real analysis, a few plots for data exploration:
    • Timeline: which fish was seen on which day
    • Total number of fish seen
    • Overview map with all deployments

Additional comments by @PieterjanVerhelst:

Some more useful data exploration plots:

  • Which species were detected + number per species
  • Time range over which a fish was detected
  • Number of stations at which a fish was detected
  • Number of detections per fish (often also per subarea/network, e.g. Zeeschelde, Westerschelde, BPNS)
  • Number of detections per station

Additional comments by @IPauwels:

The proportion of time an animal was detected relative to the maximum time it could have been detected (duration of the network deployment or battery life of the transmitter) might also be worth including.
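
Several of the exploration summaries listed above can be sketched in base R. The detections below are toy data with assumed column names; the real get_detections() output may look different.

```r
# Toy detections; column names and values are assumptions for illustration.
detections <- data.frame(
  animal_id = c(1, 1, 2, 2, 2, 3),
  scientific_name = c(rep("Anguilla anguilla", 2), rep("Silurus glanis", 3),
                      "Anguilla anguilla"),
  station = c("s1", "s2", "s1", "s1", "s3", "s2"),
  date = as.Date(c("2015-05-01", "2015-05-03", "2015-04-22",
                   "2015-04-25", "2015-06-01", "2015-05-10")),
  stringsAsFactors = FALSE
)

# Number of detections per species and per station.
detections_per_species <- table(detections$scientific_name)
detections_per_station <- table(detections$station)

# Number of individuals per species.
individuals <- tapply(detections$animal_id, detections$scientific_name,
                      function(ids) length(unique(ids)))

# Number of stations at which each animal was detected.
stations_per_animal <- tapply(detections$station, detections$animal_id,
                              function(s) length(unique(s)))

# Days between first and last detection of each animal.
days_covered <- sapply(split(detections$date, detections$animal_id),
                       function(d) as.numeric(diff(range(d))))
```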

Data for unit-tests

We are adding unit tests to the etn package.
In particular, we would like to test the functions we are developing based on #9.
To do so, we need data which are:

  1. publicly available,
  2. stable (= always present and never changing)

Are there such data available?

`get_tag` handling of the animal_project choice

When I check the tags data coupled with the projectcode to enable searches using a set of animal project codes:

  "SELECT tags.*, animals.projectcode
      FROM vliz.tags
        LEFT JOIN vliz.animal_tag_release ON (animal_tag_release.tag_fk = tags.id_pk)
        LEFT JOIN vliz.animals_view animals ON (animals.id_pk = animal_tag_release.animal_fk)
  "

(@bwydoogh, I think I am doing this correctly?)

I do get (on my account) 1403 NA values for the projectcode. @PieterjanVerhelst or @jreubens when users request the data and the default is all projects (get_tags(connection) is by default all projects), should this result in all projects and ALL not-matched tags or only those tags that can be linked to an animal project (according to the query defined here)?

Error when loading package

When I want to load the package using
"devtools::install_github("inbo/etn")
library(etn)"
I get the following error
"Error: package ‘digest’ was installed by an R version with different internals; it needs to be reinstalled for use with this R version"

I'm not sure what to do.
@damianooldoni or @peterdesmet can you help me out?

Create vignette about the handling of username/pwd

We should tell users not to store their credentials as plain text, or support them properly so they don't have to. Some options are provided here, but it would be good to document a proper one for users as part of the package:

  1. Fixing this at server/DSN level would be nice (users just don't have to care): users are logged in, so it would be good if the connection were automatically linked to their account, so that connect_to_etn() would just work without credentials. @bwydoogh, any thoughts or suggestions (or maybe this or a comparable option is actually supported)?

  2. If a database-level solution is not possible, how do we handle this? Different options are described; any preferences? In short, the options are:

  • config package, using a simple config file in the user's account
  • environment variables that can be queried
  • R general options with options()
  • rstudioapi::askForPassword("Database user")
  • the operating system's credential store with the keyring package
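
As an illustration of the environment-variable option, here is a minimal sketch. get_credentials() is a hypothetical helper, and the variable names userid and pwd are an assumption that should match whatever is stored in .Renviron.

```r
# Hypothetical helper: read credentials from environment variables and fail
# with a clear message when one of them is missing.
get_credentials <- function(user_var = "userid", password_var = "pwd") {
  user <- Sys.getenv(user_var, unset = NA)
  password <- Sys.getenv(password_var, unset = NA)
  if (is.na(user) || is.na(password)) {
    stop("Set ", user_var, " and ", password_var, " in .Renviron first.",
         call. = FALSE)
  }
  list(user = user, password = password)
}

# Demo with throwaway values; real values belong in .Renviron, not in scripts.
Sys.setenv(userid = "demo_user", pwd = "demo_password")
credentials <- get_credentials()
credentials$user
```

The result could then be passed on as connect_to_etn(credentials$user, credentials$password), so no password appears in the script itself.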

controlled vocabulary for `receiver_status`

As we aim to provide receiver status as an input argument of the function get_deployments, we have to know the possible values (or have access to query them).

The options I can currently find are:

           receiver_status
1                Available
2                     Lost
3                   Broken
4                   Active
5 Returned to manufacturer

I do not think a normal user has access rights to the original table with the receiver_status options, but hard-coding the controlled vocabulary in the package shouldn't be a problem, I assume, as it won't change regularly?
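
A hard-coded vocabulary could be checked with base R's match.arg(). The sketch below copies the values listed above; check_receiver_status() is a hypothetical helper, not an existing etn function.

```r
# Controlled vocabulary copied from the values listed above.
receiver_status_values <- c(
  "Available", "Lost", "Broken", "Active", "Returned to manufacturer"
)

# Hypothetical helper: accept only values from the vocabulary.
check_receiver_status <- function(receiver_status) {
  match.arg(receiver_status, choices = receiver_status_values)
}

check_receiver_status("Active")  # "Active"
invalid <- try(check_receiver_status("active"), silent = TRUE)
inherits(invalid, "try-error")   # TRUE: matching is case-sensitive
```

Note that match.arg() also accepts unambiguous abbreviations such as "Act", which may or may not be desirable here.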

get_deployments

When you filter for 'Active' deployments for a specific project, you currently get a list of all deployments that have occurred in this project for the (currently) active receivers. This also includes closed deployments.
What we want is a list of the open deployments (i.e. no end date), so we probably need to include an extra criterion in this function.
@PieterjanVerhelst correct me if I'm wrong
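
The extra criterion could be as simple as keeping deployments without an end date. A sketch on toy data, where the column names are assumptions rather than the actual view:

```r
# Toy deployments; column names and values are made up for illustration.
deployments <- data.frame(
  receiver = c("receiver_a", "receiver_a", "receiver_b"),
  deploy_date = as.Date(c("2014-04-01", "2016-05-01", "2015-03-15")),
  recover_date = as.Date(c("2015-04-01", NA, "2016-03-15")),
  stringsAsFactors = FALSE
)

# An open deployment has no end (recovery) date yet.
open_deployments <- deployments[is.na(deployments$recover_date), , drop = FALSE]
nrow(open_deployments)  # 1
```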

transmitter type not standardised

When requesting the transmitter information, I see the following unique types:

      type
1 internal
2 acoustic
3 built-in
4 Acoustic

I would clean up the use of acoustic versus Acoustic on the database or import application side.

Coordinates

When I recently downloaded data via the package, I noticed that some coordinates were wrong or missing (i.e. 'NA'). I therefore think the retrieved coordinates come from the columns Recover_lat and Recover_long. However, the coordinates that should accompany the data are those in Deploy_lat and Deploy_long, which are mandatory fields.

my_animals

Some bugs in the get_animals() functionality:

  • When I run the code my_animals <- get_animals(my_con, animal_project = "2012_leopoldkanaal") I get the animals for both the 2012_leopoldkanaal and PhD van der Knaap projects. This also holds true when other animal projects are selected; apparently PhD van der Knaap is always included in the output.

  • The function does not return information about length_type3 and length_type4 nor the actual weight of the fish.

Incorrect time on machine running RStudio server

I was checking the package with the standard devtools::check() command when I got this error, which stops the check at an early stage:

> checking for future file timestamps ... ERROR
  This system is set to the wrong time: please correct
     system: 2019-09-20 09:36 (UTC)
    correct: 2019-09-20 09:41 (UTC)

It seems the machine time is incorrect. Could somebody with admin rights on the machine running RStudio Server check this asap? Thanks!

Add information to potential etn users about setup/login info/...

As the package only works within the RStudio server (or within the VLIZ network), we have to provide this information to potential users in the README (which is the landing page of the documentation).

@jreubens, could you add that kind of information (here, or by directly editing the README)? Briefly: the concept (with links to other pages), how users can get a login, what the data can be used for, some other useful links, ...

get_animals don't filter on animal_project

Bug found by @peterdesmet, thanks. I can confirm that get_animals returns too many animals because it doesn't filter properly.

Example:

animals <- get_animals(connection = con, animal_project = "2011_rivierprik")

animals should contain 39 rows, but it does not.

The problem is that the query also includes the network_project parameter, and if no network is specified we take all network projects. This should be avoided. Removing the network_project parameter from the function, together with the code related to it, seems to solve the issue.
I am tackling this bug right away in branch no-ntwk_prjct-in-get_animals and will merge it to master asap.

Handle ghost detections (at DB & pkg)

Here's a quick overview of the data that will be included in the dataset we will publish. @PieterjanVerhelst @IPauwels can you have a look if this makes sense? Let me know if you need more info.

animal_project_name  animals.scientific_name  detections  individuals  stations  start_date  end_date
2011 Rivierprik      Lampetra fluviatilis         114605           29        29  2011-12-14  2012-07-03
2012 Leopoldkanaal   Anguilla anguilla           2215829           92        60  2012-07-04  2017-03-12
2014 Demer           Petromyzon marinus               42            1         1  2015-05-06  2015-05-12
2014 Demer           Rutilus rutilus               11030            2         9  2014-04-19  2014-06-28
2014 Demer           Silurus glanis                86023            9        46  2014-04-25  2018-01-31
2014 Demer           Squalius cephalus            139013            2        10  2014-04-30  2015-02-12
2015 Dijle           Anguilla anguilla             41798            1         7  2015-05-01  2015-10-15
2015 Dijle           Cyprinus carpio                4944            2         9  2015-04-23  2015-11-06
2015 Dijle           Platichthys flesus           101488            8        28  2015-04-29  2016-04-08
2015 Dijle           Rutilus rutilus                7870            4         9  2015-04-23  2015-09-14
2015 Dijle           Silurus glanis                78331           11        25  2015-04-22  2017-09-16
