inbo / etn

R package to access data from the European Tracking Network

Home Page: https://inbo.github.io/etn/

License: MIT License

R 100.00%

r fish animal-movement animal-tracking biologging data-access oscibio lifewatch r-package rstats

etn's Introduction

etn

Etn provides functionality to access data from the European Tracking Network (ETN) database hosted by the Flanders Marine Institute (VLIZ) as part of the Flemish contribution to LifeWatch. ETN data is subject to the ETN data policy and can be:

restricted: under moratorium and only accessible to logged-in data owners/collaborators
unrestricted: publicly accessible without login and routinely published to international biodiversity facilities

The ETN infrastructure currently requires the package to be run within the LifeWatch.be RStudio server, which is password protected. A login can be requested at http://www.lifewatch.be/etn/contact.

Installation

You can install the development version of etn from GitHub with:

# install.packages("devtools")
devtools::install_github("inbo/etn")

etn's Issues

get_ functions output: tibble or just data.frame?

The get_ functions, such as get_animals() or get_projects(), return a classic data.frame. However, more and more R developers now use tibble data.frames.
To see the differences between a tibble and a normal data.frame, read this tutorial from tidyverse (yes, tibble comes from the tidyverse world!) or this chapter from R for Data Science.
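
As a quick illustration, here is a minimal base R sketch of two data.frame behaviours that tibbles change: partial name matching with `$`, and dropping to a vector when a single column is selected with `[`. The toy data and column names are assumptions for illustration only, not actual etn output.

```r
# Toy data.frame; the values and column names are illustrative only.
projects <- data.frame(
  name = c("PhD Jan Reubens", "Homarus"),
  projectcode = c("phd_reubens", "homarus"),
  stringsAsFactors = FALSE
)

# 1. `$` on a data.frame does partial matching: `proj` silently resolves to
#    the `projectcode` column. A tibble returns NULL with a warning instead.
partial <- projects$proj

# 2. Selecting a single column with `[` drops to a plain character vector.
#    A tibble keeps returning a one-column tibble.
single <- projects[, "name"]

is.data.frame(single)
```

Both behaviours are convenient interactively but can hide typos in package code, which is one of the arguments usually given for returning tibbles.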

Problems installing etn package

People are having problems installing the etn package, very likely due to old packages that need to be updated. I just ran into this problem as well. This is what I got:

> library(etn)
Error: package or namespace load failed for ‘etn’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
 namespace ‘rlang’ 0.4.2 is already loaded, but >= 0.4.3 is required

Note: you could get a different message, depending on the state of your packages.

Below I describe how I solved it by reinstalling the package and updating all packages that etn depends on, directly or indirectly.

So, please, follow this procedure:

1. Restart R (Session -> Restart R, or just Ctrl+Shift+F10).
2. Type devtools::install_github("inbo/etn") in the console.
3. You should get this message:

These packages have more recent versions available.
Which would you like to update?

followed by a list of packages. At the prompt "Enter one or more numbers separated by spaces, or an empty line to cancel", type 1, which stands for "All".
4. Restart R again (Session -> Restart R, or just Ctrl+Shift+F10).
5. Load the library by typing library(etn) in the console. I no longer got any error.

Please, let me know if this helped.
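
Before reinstalling, it can help to check which dependency is too old. The sketch below uses a hypothetical helper, needs_update(), which is not part of etn; the rlang version is the one quoted in the error message above.

```r
# Hypothetical helper (not part of etn): TRUE when `pkg` is not installed or
# is older than `min_version`. It does not load the package's namespace.
needs_update <- function(pkg, min_version) {
  if (!nzchar(system.file(package = pkg))) {
    return(TRUE)
  }
  utils::packageVersion(pkg) < min_version
}

# rlang >= 0.4.3 is the requirement quoted in the error message.
if (needs_update("rlang", "0.4.3")) {
  message("rlang is missing or too old; update it before loading etn")
}
```

Run this in a fresh session (after restarting R), so that no old version of the package is already loaded.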

Extending etn package to retrieve cpod data

This issue discusses the enhancement proposed by @DebusschereE: extending the etn package to retrieve cpod data, which are stored in the etn database as well: http://www.lifewatch.be/etn/cpoddeployments.

Yes, this enhancement would certainly be nice. We already have some changes planned for the package in response to changes at database level. I will do my best to extend the package to cpod data as well before the "ETN training school" planned on 24 - 25 March 2020.

Conflict in functions due to new column called "project_type" in view/database

When I started working on issue #24, I found that many functions of the etn package crash or do not return the desired result. After debugging, @stijnvanhoey and I are both very sure that the problem is the introduction of a new column called project_type. This name is unfortunate because it is also the name of the input parameter used to filter on animal or network projects, and dplyr doesn't like this at all 😢

I see two options:

change of the column name in views/database
change the name of parameter internally in all corrupted functions
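
To make the name clash concrete, here is a minimal sketch of the general problem. It uses base R's subset() rather than the actual dplyr code in etn (which is not shown here), but the principle is the same: inside the filter expression, a column takes precedence over a function argument with the same name. The toy data and both function names are assumptions for illustration.

```r
# Toy data; the project codes and types are illustrative only.
projects <- data.frame(
  projectcode = c("phd_reubens", "homarus", "demer_network"),
  project_type = c("animal", "animal", "network"),
  stringsAsFactors = FALSE
)

# The argument has the same name as the column, so inside subset() the
# condition compares the column with itself and every row is kept.
filter_masked <- function(df, project_type) {
  subset(df, project_type == project_type)
}

# Renaming the argument internally (the second option above) removes the
# ambiguity.
filter_renamed <- function(df, type) {
  df[df$project_type == type, , drop = FALSE]
}

nrow(filter_masked(projects, "network"))   # every row is returned
nrow(filter_renamed(projects, "network"))  # only the network project
```

Note that the masked version does not raise an error; it silently returns too many rows, which is why this kind of bug is easy to miss without unit tests.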

To prevent this kind of bug in the future, could @bwydoogh and @jreubens run the unit tests in RStudio before any update of the views/database?
It is just one line of code:

devtools::test()

If no errors are returned, then the change is welcome!

External database access: considerations and discussion

I've recently had a conversation with my colleague @peterdesmet about the possibility, for the European Tracking Network, to give access to a PostgreSQL database (hosted at VLIZ) over the Internet.

A few details about how we would like to implement this:

A small group of users will have access to this database in read-only mode. We also plan to make use of the row security policies feature of PostgreSQL so each user has only access to the rows that concern him/her. Since we don’t have much experience with this feature and its limitations, we plan to implement a small prototype to make sure the correct access restrictions can be easily implemented.

About exposing PostgreSQL over the Internet:

I agree this route opens one more “door” to the network infrastructure than a more traditional approach (such as a data visualization web app that loads its data from a database that is not itself accessible from the outside network), but we think it would work well for this project and the risks can be greatly mitigated:

Direct database access would give great flexibility (manual SQL queries) at a very limited cost for this small group of experienced users.
The access would be read-only, with only access to a subset of rows.
If you prefer to have this database separated from your other PostgreSQL databases, we could run it in a separate, more isolated environment (such as a dedicated machine in the DMZ).
We’d pay attention to the usual security measures: always keep PostgreSQL updated (in case of security issues in the PostgreSQL server itself), use SSL, strong passwords, … We can also automate (script, tests, …) the data updates, so the user permissions are automatically configured, and checked periodically.

Maybe we could also reduce the attack surface further by forcing that traffic through a tunnel such as OpenVPN. That would require more configuration on both sides (infrastructure + users), so we’d need to assess the feasibility first.

Don’t hesitate to give us feedback over this. In short: we’d like to take this route but understand the concerns and would like to work together to reach a good solution for everyone involved.

Refactor input checks to common functionalities

Cf. the check_null_or_value input check: this should be extended to all functions that have network_project or animal_project as an input argument.
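
A shared input check could look roughly like the sketch below. The signature of the real check_null_or_value is not shown in this issue, so check_null_or_value_sketch() and its arguments are assumptions for illustration.

```r
# Hypothetical stand-in for a shared input check: NULL means "no filter",
# anything else must be one of the allowed values.
check_null_or_value_sketch <- function(arg, options, arg_name) {
  if (is.null(arg)) {
    return(invisible(TRUE))
  }
  invalid <- setdiff(arg, options)
  if (length(invalid) > 0) {
    stop(
      sprintf(
        "Invalid value(s) for %s argument: %s",
        arg_name, paste(invalid, collapse = ", ")
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Illustrative list of valid codes; in the package this would be queried.
valid_animal_projects <- c("phd_reubens", "homarus")

check_null_or_value_sketch(NULL, valid_animal_projects, "animal_project")
check_null_or_value_sketch("homarus", valid_animal_projects, "animal_project")
```

Each get_ function could then call the same helper for network_project and animal_project instead of repeating the validation logic.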

my_projects

owner_organisation/group should be added to the available output fields. Currently you don't know to which group the project belongs. Ideally a PI should be added as well, but this is not yet available in the database (it should be added soon).

Exclude "sync tag" when returning detections

See list in #21. I would propose to exclude "sync tag" by default when returning occurrences, but maybe allow a flag to include those. @jreubens @PieterjanVerhelst are there other non animal detections in the database?
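
A sketch of how such a flag could behave, on toy data. The function name drop_sync_tags(), the argument include_sync_tags and the column layout are assumptions for illustration; only the value "Sync tag" comes from the database listing.

```r
# Toy detections; transmitter labels are placeholders.
detections <- data.frame(
  transmitter = c("tag_a", "tag_b", "tag_c"),
  scientific_name = c("Anguilla anguilla", "Sync tag", "Gadus morhua"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exclude sync tags unless explicitly requested.
drop_sync_tags <- function(df, include_sync_tags = FALSE) {
  if (include_sync_tags) {
    return(df)
  }
  # %in% (rather than !=) keeps rows where scientific_name is NA
  df[!(df$scientific_name %in% "Sync tag"), , drop = FALSE]
}

nrow(drop_sync_tags(detections))
nrow(drop_sync_tags(detections, include_sync_tags = TRUE))
```

If other non-animal values turn out to exist, the single string could be replaced by a vector of values to exclude.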

Cannot edit .Renviron on rstudio.lifewatch.be

@fwaumans, I notice my .Renviron file on rstudio.lifewatch.be contains:

username="..."
password="..."

I would like to update this to:

userid="..."
pwd="..."

Which is the RStudio default (https://db.rstudio.com/best-practices/managing-credentials/#use-environment-variables) and used by this vignette:

etn/vignettes/access-etn-data.Rmd

Lines 40 to 41 in 03d6010

    
my_con <- connect_to_etn(Sys.getenv("userid"),
                         Sys.getenv("pwd"))
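
For reference, a minimal sketch of reading such credentials with an early, explicit failure when they are missing. get_credentials() is a hypothetical helper, not part of etn; only the variable names userid and pwd come from the vignette.

```r
# Hypothetical helper: read credentials from environment variables and stop
# with a clear message when either one is unset or empty.
get_credentials <- function(user_var = "userid", pwd_var = "pwd") {
  user <- Sys.getenv(user_var)
  pwd <- Sys.getenv(pwd_var)
  if (!nzchar(user) || !nzchar(pwd)) {
    stop(
      "Set ", user_var, " and ", pwd_var, " in .Renviron, then restart R",
      call. = FALSE
    )
  }
  list(user = user, pwd = pwd)
}

# Usage, assuming both variables are defined in .Renviron:
# creds <- get_credentials()
# my_con <- connect_to_etn(creds$user, creds$pwd)
```

Sys.getenv() returns an empty string for an unset variable, so without such a check a wrong variable name only shows up later as an authentication failure.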

When I try to do so, I get:

this source file is read-only so changes cannot be saved

Can this be resolved?

Don't export certain functions

I noticed that for some helper functions/variables, pages are built on pkgdown where they probably should just stay internal:

@damianooldoni they probably need some Roxygen remark?

project filtering as function input

@PieterjanVerhelst and @jreubens
when requesting data and adding network or animal projects to the input, which type of input do you expect to use: name or projectcode?

Let me explain with an example, using the get_tags function:

get_tags(connection)  # all tags
get_tags(connection, animal_project = NULL)  # all tags
get_tags(connection, animal_project = "phd_reubens")  # tags of animal project phd_reubens
get_tags(connection, animal_project = c("phd_reubens", "homarus"))  # tags of animal projects phd_reubens and homarus

In the example, I use the projectcode. Is this the appropriate naming or will people use the name?

An example of the info about animal projects:

    id                   name        projectcode imis_dataset_id
1  616              rangetest          rangetest              NA
2   16        PhD Jan Reubens        phd_reubens            5846
3  599                Homarus            homarus              NA
4  632 Ocean Tracking Network                OTN              NA
5   22             2015 Dijle         2015_dijle            5872
6   21             2014 Demer         2014_demer            5871
7  621     2016 PhD Vergeynst 2016_phd_vergeynst            5875
8   18      2013 Albertkanaal  2013_albertkanaal            5868

Authentication failed while trying establishing connection

I am developing code with @stijnvanhoey.
While testing the function connect_to_etn(user, password), I get this error message:

Error: nanodbc/nanodbc.cpp:950: 28P01: [unixODBC]FATAL: password authentication failed for user "damianoo"

I am using the same credentials as the ones for the RStudio server. Any idea?
Thanks!

Create projects view

Create a view for projects, with specific names and order of fields.

View is defined in the spreadsheet: https://docs.google.com/spreadsheets/d/1XVmoxrzxoBGqC7AYBjTgzUAer0w04D6iiCJW2TW41SU/edit?usp=sharing

Uniqueness of scientific names not guaranteed

When I query the DISTINCT names from the animals view, I get both Salmo salar and salmo salar

> scientific_names(connection)
 [1] "Rutilus rutilus"      "Alosa fallax"         "Platichthys flesus"  
 [4] "Built-in"             "Anguilla anguilla"    "Petromyzon marinus"  
 [7] "Sentinel"             "Squalius cephalus"    "Cyprinus carpio"     
[10] "salmo salar"          "Sync tag"             "Gadus morhua"        
[13] "Silurus glanis"       "Lampetra fluviatilis" "Homarus gammarus"    
[16] "Salmo salar"

@bwydoogh something to fix on dbase_level? Or should I make sure to do all evaluations on lower case?
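
If this is handled on the package side, one option is to normalise the capitalisation before comparing. The helper below is a sketch under the assumption that names follow the usual binomial pattern (capitalised genus, lower-case epithet); normalise_scientific_name() is hypothetical and not part of etn.

```r
# Hypothetical helper: trim whitespace, capitalise the first letter and
# lower-case the rest, so "salmo salar" and "Salmo salar" become identical.
normalise_scientific_name <- function(x) {
  x <- trimws(x)
  paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
}

names_raw <- c("Salmo salar", "salmo salar", "Gadus morhua")
unique(normalise_scientific_name(names_raw))
```

Fixing the values in the database remains the cleaner solution, since a package-side workaround has to be repeated wherever names are compared.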

Update pkg to work with new views

@jreubens @IPauwels @PieterjanVerhelst @fwaumans and I completely reviewed the 5 views offered by ETN (tags, animals, receivers, deployments, detections) to make them consistent and to present users with the same information via the ETN application and the etn package.

Once @fwaumans is finished with implementing the views, we should update all functions to make use of the new views. This might imply renaming some parameters.

Size limit data upload

I was trying to download the dataset of the animal network 2013_albertkanaal and apparently we cannot download files larger than 100 Mb. Is it possible to allow download of larger files?
Here is the error I got: Error: cannot allocate vector of size 100.0 Mb

Project names and project codes

It is sometimes not very clear if you need to use a project name or project code in the function. I think for us it is often quite obvious, but for external/new users in the future, this may require some extra documentation.

Rename get_transmitters() to get_tags()

The word tags is used in the views and application, I would rename the function.

How to select acoustic vs c-pod?

1. Are those two separate types of data that should never be shown together, e.g. never showing receivers for both c-pod and acoustic at the same time?
2. Should the visible fields in receivers, deployments and detections differ between acoustic and c-pod, or should they be the same, i.e. different views or shared views?
3. Depending on 2, we need a way to select c-pod vs acoustic. For now we can only make that selection in the (shared) view of receivers.

How to return animal-tag relationship?

Currently:

get_tags() doesn't return unique tags, as it's internally linked to animals to allow filtering on animal project. Since a tag (A69-1303-20695) can be associated with multiple animals (673 and 674), duplicates are returned
get_animals() doesn't return unique animals, as the view returns animals + their tag ID. Since an animal (2369) can be associated with multiple tags (A69-9006-971 and A69-9006-972), duplicates are returned.

I would:

get_tags(): remove the option to select on animal project, so the function (just like the view and the table on the website) returns unique tags
get_animals(): return unique animals by default, which is similar to how the website displays a single row per animal:

get_animals(): add an option to include tag_fk (resulting in multiple rows per animal), so one can join with information from get_tags()
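
The duplication can be reproduced on toy data built from the IDs mentioned above. The column names animal_id and tag_id are assumptions for illustration, not the names used in the views.

```r
# Animals and their tag associations, using the IDs quoted in this issue.
animals <- data.frame(animal_id = c(673, 674, 2369))
animal_tag <- data.frame(
  animal_id = c(673, 674, 2369, 2369),
  tag_id = c("A69-1303-20695", "A69-1303-20695",
             "A69-9006-971", "A69-9006-972"),
  stringsAsFactors = FALSE
)

# Joining gives one row per animal-tag pair, not per animal or per tag.
joined <- merge(animals, animal_tag, by = "animal_id")

# Selecting only the relevant key and de-duplicating restores uniqueness.
unique_animals <- unique(joined["animal_id"])
unique_tags <- unique(joined["tag_id"])

nrow(joined)
nrow(unique_animals)
nrow(unique_tags)
```

This is why returning unique rows by default, with the relationship column as an opt-in, keeps both functions predictable.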

get_detections

I think the get_detections() functionality can be improved by adding filters on:

Receiver
species

connect_to_etn

When using the function (both with the Sys.getenv and by typing my credentials in the function), I get:
Error: nanodbc/nanodbc.cpp:950: 28P01: [unixODBC]FATAL: password authentication failed for user "[email protected]"

I double checked if I am using the right username and password (and tried with another account as well, same issue). Is this a problem of the code or should I change my access rights to ETN itself?

Many receivers don't have network_project_code in new view

get_receivers() will in the background look for all network codes and filter receivers on those with a network code. As a result, that currently only returns 12 entries.

receivers_all <- DBI::dbGetQuery(connection, "SELECT * FROM vliz.receivers_view2")

Returns 1640 entries, many of them without network_project_code. Why is this information missing?

`get_tags` to `get_transmitters`

@jreubens and @PieterjanVerhelst, in #9, we defined the function get_tags. However, working on the code I do think it would be more consistent to call it get_transmitters, as this is the naming used in the detections view as well to define the tag identifiers. Moreover, it complements better the get_receivers function.

I propose to provide the function get_transmitters instead of get_tags, ok?

tag_owner_organisation requested in get_transmitters

The data retrieved by the function get_transmitters() do not include the column tag_owner_organisation, which is present in the ETN view for tags.
I would like to have this information included. Is it possible to integrate this column?

Parameters for get_transmitters()

It would be nice to have the filter on species here as well.
Filtering on sensor type, tag type and owner organisation would be nice as well.
@PieterjanVerhelst other filters needed?

Consistency project_code and project_name

While writing a vignette about the usage of project code (instead of project name) (see #44), I found an inconsistency in the names of the columns related to project codes and project names.

Here is what I mean.

In functions like get_detections() we get columns named like this:

"animal_project_name", "animal_project_code", "network_project_name", "network_project_code"

which is fine.

While running functions get_projects() we get:

get_projects(my_con) %>% colnames()
 [1] "id"              "name"            "projectcode"     "type"           
 [5] "startdate"       "enddate"         "moratorium"      "imis_dataset_id"
 [9] "latitude"        "longitude"

As you see, columns projectcode and name are returned. I would expect project_code and project_name. The same holds true for get_receivers() which returns a data.frame with a column named projectcode while all other columns containing the word code contain an underscore: *_code like ar_release_code and ar_disable_code.
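
Until the views are harmonised, the columns could be renamed on the package side. The sketch below is only an illustration: standardise_project_columns() is a hypothetical helper, and the toy row reuses values from the project listing shown in another issue.

```r
# Hypothetical helper: rename projectcode/name to the *_code/*_name pattern
# and leave all other columns untouched.
standardise_project_columns <- function(df) {
  lookup <- c(projectcode = "project_code", name = "project_name")
  hits <- names(df) %in% names(lookup)
  names(df)[hits] <- unname(lookup[names(df)[hits]])
  df
}

projects <- data.frame(
  id = 16,
  name = "PhD Jan Reubens",
  projectcode = "phd_reubens",
  stringsAsFactors = FALSE
)

names(standardise_project_columns(projects))
```

Renaming in the views themselves would be preferable, so that the ETN application and the package stay in sync.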

get_receivers

currently the get_receivers function links receivers to a network.
Some remarks on this:

The output gives an overview of all deployments that occur/have occurred, so receivers can be present multiple times. I don't think this is the wanted output; in my opinion it should be unique receivers.
Currently you can only link to the network_project; this should be extended to animal_project as well.
Next to projects, you should also be able to view and filter on owner_organisation. The output gives a number; it should be a name. To be able to filter on it using 'get_receivers(my_con, network_project = "INBO")', I guess this info needs to be added to the 'my_projects' output first...

basic set of functions for ETN data access

@jreubens @PieterjanVerhelst as we discussed:

As a replacement of the current getLWData function, we aim to go to a minimal set of functions for the data access. Each of these functions target a specific set of data-tables, comparable to the online ETN data portal.

Apart from a make_connection function, we will first focus on the set of functions below. For each function, we define the required filters as function inputs that will be embedded in the query on the existing views:

get_projects()

get_deployments(
    network_project,
    active)            # open deployments, default ALL

get_animals(
    network_project,   # i.e. projectlist
    animal_project,    # i.e. tagprojectlist
    species)

get_tags(
    animal_project)    # i.e. tagprojectlist

get_receivers(
    network_project)   # i.e. projectlist

get_detections(
    animal_project,
    network_project,
    start_date,
    end_date,
    station,
    get_tags,
    ...?)

By default, we return all the available columns in the database views/tables; but maybe this will be too much. @jreubens you could maybe define a subset of columns for each output that are really out of scope for any of the ETN users?

get_animals() suspect behavior if only network_project is given as input

I am wondering whether it is correct that

> animals <- get_animals(connection, network_project = "thornton")
> animals2 <- get_animals(connection, network_project = "leopold")
> animals3 <- get_animals(connection, network_project = c("leopold", "thornton"))

are equal. Indeed, the code below raises no error:

> testthat::expect_identical(animals, animals2)
> testthat::expect_identical(animals2, animals3)

Moreover, the network_project value given as input is not one of the projectcode values in the output data.frame:

> which(animals %>% distinct(projectcode) %in% valid_network_projects)
integer(0)
> which(animals2 %>% distinct(projectcode) %in% valid_network_projects)
integer(0)

I would prefer an empty data.frame as output, or a warning. Or am I missing something about the structure of the database?

Why was principal investigator dropped from projects

That field contained email before. Is it because of GDPR? @fwaumans

species subsetting as function input

@PieterjanVerhelst and @jreubens when subsetting on species name, I assume you'll provide a number of scientific names, as these seem to be filled in whereas common_name is not:

> scientific_name
 [1] "Gadus morhua"         "Sentinel"             "Homarus gammarus"    
 [4] "Alosa fallax"         "Built-in"             "Anguilla anguilla"   
 [7] "Lampetra fluviatilis" "Salmo salar"          "Squalius cephalus"   
[10] "Silurus glanis"       "Petromyzon marinus"   "Rutilus rutilus"     
[13] "Platichthys flesus"   "Cyprinus carpio"      "Sync tag"            
[16] "salmo salar"         
> common_name
[1] NA             "Atlantic cod" "Fint"

ok to filter on the scientific names?

Support id as parameter in get_animals()

I would like get_animals() to have an additional parameter id, to get metadata on specific animals I look for.

Create COST course material as vignette

@damianooldoni will organize a short session on using the etn package. As course material we propose to create a vignette, which quickly explains a typical workflow.

Workflow suggested by @jreubens:

Which projects exactly do I need? It is not always clear how projects are spelled in ETN, so I always start by requesting an overview of my projects.

Then I build a query for the data I want: which projects exactly do I need (default: all network projects and a selection of animal projects), which time period is covered…

Then I retrieve the detections (and optionally filter on a particular species, because you do not include someone else's detections in your analysis, and you do not always want sync tags in there either).

Before starting the real analysis, a few more plots to do some data exploration:

Timeline: which fish was seen on which day

Total number of fish seen

Overview map with all deployments

Additional comments by @PieterjanVerhelst:

A few more useful data exploration plots:

Which species were detected + number per species

Time range over which a fish was detected

Number of stations at which a fish was detected

Number of detections per fish (often also per subarea/network, e.g. Zeeschelde, Westerschelde, BPNS)

Number of detections per station

Additional comments by @IPauwels:

The proportion of time an animal was detected relative to the maximum time it could have been detected (duration of the network deployment or battery life of the transmitter) may also be worth adding.

Data for unit-tests

We are adding unit tests to the etn package.
In particular, we would like to test the functions we are developing based on #9.
To do so, we need data that are:

publicly available,
stable (= always present and never changing)

Are there such data available?

Create unit-tests to check column names in db views

Based on #59 (comment), we should add some basic unit-tests to identify immediately changes in any column name of the database views.
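
A sketch of what such a check could look like in plain R. The expected column names are the ones returned by get_projects() as listed in another issue; check_view_columns() and mock_view() are hypothetical helpers, and the mock stands in for a real query result so the example runs without a database connection.

```r
# Expected columns of the projects view, as reported by get_projects().
expected_project_cols <- c(
  "id", "name", "projectcode", "type", "startdate", "enddate",
  "moratorium", "imis_dataset_id", "latitude", "longitude"
)

# Hypothetical helper: report columns that disappeared or were added.
check_view_columns <- function(df, expected) {
  list(
    missing = setdiff(expected, names(df)),
    unexpected = setdiff(names(df), expected)
  )
}

# Stand-in for a query result: a zero-row data.frame with given columns.
mock_view <- function(cols) {
  df <- data.frame(matrix(nrow = 0, ncol = length(cols)))
  names(df) <- cols
  df
}

# Simulate a view that gained a new column.
changed <- mock_view(c(expected_project_cols, "project_type"))
check_view_columns(changed, expected_project_cols)
```

In the package test suite, the two elements of the result could each be compared against an empty character vector with a testthat expectation.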

`get_tag` handling of the animal_project choice

When I check the tags data coupled with the projectcode to enable searches using a set of animal project codes:

  "SELECT tags.*, animals.projectcode
      FROM vliz.tags
        LEFT JOIN vliz.animal_tag_release ON (animal_tag_release.tag_fk = tags.id_pk)
        LEFT JOIN vliz.animals_view animals ON (animals.id_pk = animal_tag_release.animal_fk)
  "

(@bwydoogh, I think I am doing this correctly?)

I get (on my account) 1403 NA values for the projectcode. @PieterjanVerhelst or @jreubens, when users request the data with the default (get_tags(connection) returns all projects by default), should this result in all projects and ALL unmatched tags, or only those tags that can be linked to an animal project (according to the query defined here)?

Error when loading package

When I want to load the package using

devtools::install_github("inbo/etn")
library(etn)

I get the following error:

Error: package ‘digest’ was installed by an R version with different internals; it needs to be reinstalled for use with this R version

I'm not sure what to do.
@damianooldoni or @peterdesmet can you help me out?

Create vignette about the handling of username/pwd

We should inform users not to store their credentials as such, or support them properly so they don't have to... Some options are provided here, but it would be good to document a proper one for users as part of the package:

Fixing this at the server/DSN level would be nice (users just wouldn't have to care): users are logged in, so it would be good if the connection was automatically linked to their account, so that connect_to_etn() would just work without credentials. @bwydoogh any thoughts or suggestions (or maybe this or a comparable option is actually supported)?
If the database level is not possible, how should we handle this? Different options are described; any preferences? In short, the options are:

config package using a simple config file in people's user account
Environment variables that can be asked
R general options with options()
rstudioapi::askForPassword("Database user"),
operating systems credential store with keyring package

controlled vocabulary for `receiver_status`

As we aim to provide receiver status as an input argument of the function get_deployments, we have to know the possible inputs (or have access to query these options).

The options I can currently find are:

           receiver_status
1                Available
2                     Lost
3                   Broken
4                   Active
5 Returned to manufacturer

I do not think a normal user has access rights to the original table with the receiver_status options, but having the controlled vocabulary hard-coded in the package shouldn't be too bad, I assume, as it won't change regularly?
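
A hard-coded vocabulary could be checked with base R's match.arg(), as in the sketch below. The five values are the ones listed above; check_receiver_status() is a hypothetical helper, not part of etn.

```r
# Controlled vocabulary as currently found in the database.
receiver_status_vocabulary <- c(
  "Available", "Lost", "Broken", "Active", "Returned to manufacturer"
)

# Hypothetical helper: NULL means "no filter"; otherwise a single value is
# validated against the vocabulary. match.arg() is case-sensitive but
# accepts unambiguous partial matches (e.g. "Act" for "Active").
check_receiver_status <- function(receiver_status = NULL) {
  if (is.null(receiver_status)) {
    return(NULL)
  }
  match.arg(receiver_status, receiver_status_vocabulary)
}

check_receiver_status("Active")
```

An invalid value such as "active" stops with an error that lists the allowed options, which doubles as documentation for the user.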

Extend the `animals_view` (and others) on database level with the `projectcode` column

As decided in #12, the projectcode is used to let the user subselect specific projects. It would therefore be more convenient to have the projectcode column added to the views for which this selection is relevant, where possible:

get_deployments

When you filter for 'Active' deployments for a specific project, you currently get a list of all deployments that have occurred in this project for the (currently) active receivers. This also includes closed deployments.
What we want is a list of the 'open' deployments (i.e. no end date), so we probably need to include an extra criterion in this function.
@PieterjanVerhelst correct me if I'm wrong
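
The extra criterion amounts to keeping only rows without an end date. A sketch on toy data; the receiver labels and the column names deploy_date and recover_date are assumptions for illustration, not the names used in the deployments view.

```r
# Toy deployments: rec_1 was deployed twice, the second time still ongoing.
deployments <- data.frame(
  receiver = c("rec_1", "rec_1", "rec_2"),
  deploy_date = as.Date(c("2018-01-01", "2019-01-01", "2018-06-01")),
  recover_date = as.Date(c("2018-12-31", NA, "2019-02-01")),
  stringsAsFactors = FALSE
)

# Open deployments are those without an end date.
open_deployments <- deployments[is.na(deployments$recover_date), , drop = FALSE]
open_deployments
```

Filtering on the receiver status alone would return both rec_1 rows, which is the behaviour described above.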

transmitter type not standardised

When requesting the transmitter information, I see the following unique types:

      type
1 internal
2 acoustic
3 built-in
4 Acoustic

I would clean up the usage of acoustic versus Acoustic on the database or import application side.

trias is loaded in testthat.R

I guess it should be etn

etn/tests/testthat.R

Lines 1 to 3 in b64c434

    
library(trias)
test_check("etn")

Coordinates

When I recently downloaded data via the package, I noticed some coordinates were wrong or not filled in (i.e. 'NA'). I therefore think the retrieved coordinates came from the columns Recover_lat and Recover_long. However, the coordinates that should come with the data are those from the columns Deploy_lat and Deploy_long, which are mandatory fields.

my_animals

Some bugs in the get_animals() functionality:

When I run the code my_animals <- get_animals(my_con, animal_project = "2012_leopoldkanaal") I get the animals for both the 2012_leopoldkanaal and PhD van der Knaap projects. This also holds true when other animal projects are selected; apparently PhD van der Knaap is always included in the output.
The function does not return information about length_type3 and length_type4 nor the actual weight of the fish.

Incorrect time on machine running RStudio server

I was checking the package by running the standard devtools::check() command when I got this error, which stops the check at an early stage:

> checking for future file timestamps ... ERROR
  This system is set to the wrong time: please correct
     system: 2019-09-20 09:36 (UTC)
    correct: 2019-09-20 09:41 (UTC)

It seems that the machine time is not correct. Could somebody with admin rights on the machine running the RStudio server check this asap? Thanks!

Add information to potential etn users about setup/login info/...

As the package only works within the RStudio server (or within the VLIZ network), we have to provide this information to potential users in the readme (which is the landing page of the documentation).

@jreubens could you add that kind of information (here, or directly edit the readme)? Briefly: the concept (and links to other pages), how they can get a login, what the data can be used for, some other useful links,...

get_animals doesn't filter on animal_project

Bug found by @peterdesmet. Thanks for that. I confirmed that get_animals() returns too many animals because it doesn't filter properly.

Example:

animals <- get_animals(connection = con, animal_project = "2011_rivierprik")

animals should contain 39 rows, but it does not.

The problem is that the query also includes the parameter network_project, and if no network is specified then we take all network projects. This should be avoided. Removing the network_project parameter from the function, together with the code related to it, seems to solve the issue.
I am tackling this bug right away in branch no-ntwk_prjct-in-get_animals. It will go to master asap.

Handle ghost detections (at DB & pkg)

Here's a quick overview of the data that will be included in the dataset we will publish. @PieterjanVerhelst @IPauwels can you have a look if this makes sense? Let me know if you need more info.

animal_project_name	animals.scientific_name	detections	individuals	stations	start_date	end_date
2011 Rivierprik	Lampetra fluviatilis	114605	29	29	2011-12-14	2012-07-03
2012 Leopoldkanaal	Anguilla anguilla	2215829	92	60	2012-07-04	2017-03-12
2014 Demer	Petromyzon marinus	42	1	1	2015-05-06	2015-05-12
2014 Demer	Rutilus rutilus	11030	2	9	2014-04-19	2014-06-28
2014 Demer	Silurus glanis	86023	9	46	2014-04-25	2018-01-31
2014 Demer	Squalius cephalus	139013	2	10	2014-04-30	2015-02-12
2015 Dijle	Anguilla anguilla	41798	1	7	2015-05-01	2015-10-15
2015 Dijle	Cyprinus carpio	4944	2	9	2015-04-23	2015-11-06
2015 Dijle	Platichthys flesus	101488	8	28	2015-04-29	2016-04-08
2015 Dijle	Rutilus rutilus	7870	4	9	2015-04-23	2015-09-14
2015 Dijle	Silurus glanis	78331	11	25	2015-04-22	2017-09-16
