hrdag / verdata

A tool for the use and analysis of the data on the armed conflict in Colombia resulting from the joint JEP-CEV-HRDAG project.

License: GNU General Public License v2.0
Hi @thegargiulian and @pamadoa, I've updated the README, please change anything as you see fit.
@thegargiulian I added a draft of the estimates_exist function, but feel free to change it completely if you need to.
Under the installation instructions, it might be better to replace

```r
install.packages("devtools")
```

with

```r
if (!require("devtools")) install.packages("devtools")
```

to only install devtools if it's not already installed. This could reduce unnecessary re-installations.
This issue is with reference to the JOSS review here openjournals/joss-reviews#5844
Hi @thegargiulian, while trying to figure out the issue with mse, I thought it would help make the package a little more user friendly to provide a list of the stratifications used for the different estimations in the methodological report. For example, the estimates by year for desaparicion used a stratification by year and is_forced_dis. I don't think this is obvious to everybody: someone might just filter by is_conflict and is_forced_dis and then stratify by year, which would mean they wouldn't find the estimates, mse would take too long to run, and they would end up with different results. Or in secuestro, the stratification for the estimates by year was by year and sex.
Some people may want to use their own stratification, but for someone who wants to save time or just replicate the results, providing this would be helpful.
What do you think? And, if you agree, where do you think we should put it?
Thanks!
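For concreteness, one lightweight way to ship this would be a small exported lookup table mapping each violation type to the stratification used for the published by-year estimates. This is only a sketch: the table name is hypothetical, and only the two stratifications mentioned above are filled in; the rest would need to be confirmed against the methodological report.

```r
# hypothetical lookup table of the stratifications used for the by-year
# estimates; only the two cases mentioned above are filled in
estimation_strata <- data.frame(
  violation   = c("desaparicion", "secuestro"),
  strata_vars = c("year, is_forced_dis", "year, sex"),
  stringsAsFactors = FALSE
)

# a user replicating the published secuestro estimates could then check:
subset(estimation_strata, violation == "secuestro")$strata_vars
#> [1] "year, sex"
```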
The README tells users that the data dictionary can be found in inst/docs; however, R users who are not package developers are not familiar with these files. The data dictionary should be made easy to access and interact with, preferably from within R.
Consider exporting the data dictionary as a data.frame (see R Packages, Ch. 7).
At minimum, it would be helpful to demonstrate to users how to locate or load the file itself. For example:
```r
verdata_dict_fp <- system.file(
  "docs/diccionario-variables-replicas.xlsx",
  package = "verdata"
)
readxl::read_excel(verdata_dict_fp)
#> # A tibble: 55 × 4
#>    `Nombre variable` Tipo     `Detalle variable`           `Categorías variable`
#>    <chr>             <chr>    <chr>                        <chr>
#>  1 replica           caracter Número de réplica            "R1 - R100"
#>  2 match_group_id    caracter Identificador único          <NA>
#>  3 in_1              numérico Variable dicotómica que tom… "1 \r\n0"
#>  4 in_2              numérico Variable dicotómica que tom… "1 \r\n0"
#>  5 in_3              numérico Variable dicotómica que tom… "1 \r\n0"
#>  6 in_4              numérico Variable dicotómica que tom… "1 \r\n0"
#>  7 in_5              numérico Variable dicotómica que tom… "1 \r\n0"
#>  8 in_6              numérico Variable dicotómica que tom… "1 \r\n0"
#>  9 in_7              numérico Variable dicotómica que tom… "1 \r\n0"
#> 10 in_8              numérico Variable dicotómica que tom… "1 \r\n0"
#> # ℹ 45 more rows
```
Perhaps even add an example that copies the dictionary out of the installed package, e.g. file.copy(verdata_dict_fp, ".").
Hi @pamadoa!
I saw that there are a couple of places in summary_observed.R, combine_replicates.R, and test-summary-observed.R where you use &&.
Some of the latest versions of R don't like the use of && and || on vectors (explained in this Stack Overflow post). Do you think you could rewrite the affected lines of code to remove the && symbols, likely replacing them with a single & and probably also making use of all(...)?
Thank you!
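For reference, the difference can be sketched in a few lines; in recent R versions (4.3+), && on a vector of length greater than one is an error rather than a warning:

```r
x <- c(TRUE, TRUE, FALSE)

# elementwise AND: fine on vectors, returns a vector
x & TRUE
#> [1]  TRUE  TRUE FALSE

# scalar AND: errors in R >= 4.3 with
# "'length = 3' in coercion to 'logical(1)'"
# x && TRUE

# the usual rewrite when a single TRUE/FALSE is needed
all(x & TRUE)
#> [1] FALSE
```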
For any examples that can be run, replace \dontrun with \donttest. \dontrun should only be used in instances where the example really can't be run by the user; \donttest can handle the examples that take more than 5 seconds to run.
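A minimal sketch of what the roxygen block for a slow-but-runnable example might look like (the call inside is illustrative, not verdata's actual example):

```r
#' @examples
#' \donttest{
#' # slow (> 5 s) but runnable by the user, so \donttest rather than \dontrun
#' estimates <- mse(stratum_data, "desaparicion")
#' }
```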
Provide a better title for combine_estimates(); right now it is the name of the function, which makes for odd reading in the help page. I suspect this can be changed to "Combine Estimates". A silly but important change :)
Hi @thegargiulian, I have been working on figuring out why it takes so long to run mse for strata that should already have estimates. I worried about the changes we made to the data for DANE, or the code we were using in the examples, so I created a temporary branch in the CO-examples repository for all the testing. This code summarizes what I have tried.
First, I downloaded the data directly from DANE and the estimates from the Commission site and used them for these tests.
Then, I used the same code we are using in the examples for the stratification and tested the lookup_estimates function to see if it could find the estimates. It can. However, whenever I run the mse function it takes far too long, even though I know the estimates exist. So I created a short_mse function, which is basically mse, except that if it doesn't find the estimates it doesn't run lcmcr and just returns a tibble containing NA. When I run this, it takes one or two minutes and returns the results of the estimation, not the NA tibble; so again, the estimates are there, and lookup_estimates can find them.
I can't figure out why mse takes so long. Is it possible that it is trying to run lcmcr even though the estimates exist? I checked the if_else in mse and it seems fine to me, so I don't know what the issue is.
There is no example associated with the estimates_exist() function.
Note that these tests use data from extdata, which is not included in the GitHub repository. As such, the tests are not reproducible.
Our beta tester noted that some of the parameter combinations can be confusing and might result in undesired results (like NaNs). Is there a way to simplify these arguments?
In the README, could you please provide information on how to contribute to this package and how to report issues, bugs, errors, and other problems with the package?
This issue is with reference to the JOSS review here openjournals/joss-reviews#5844
Additional function reference titles that are currently just the function name and should instead be in title case:
title | function |
---|---|
proportions_imputed | proportions_imputed |
proportions_observed | proportions_observed |
run_lcmcr | run_lcmcr |
Responding to an issue raised in the beta test, confirm_files (and read_replicates) should just take a vector from the user containing the numbers of the replicates they'd like to read. It's fine to note that we're imposing structure on the file naming convention to achieve this.
An auxiliary function that, given stratum data, returns the stratum identifier and a column that is TRUE if the stratum has already been estimated and FALSE if it has not.
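A minimal sketch of such a helper, under the assumption that pre-computed estimates are stored as one file per stratum (the file layout and naming here are invented for illustration; in verdata the check would go through lookup_estimates()):

```r
# sketch: report, for each stratum identifier, whether a pre-computed
# estimate file already exists in estimates_dir (naming is illustrative)
estimates_exist <- function(stratum_ids, estimates_dir) {
  files <- file.path(estimates_dir, paste0(stratum_ids, ".json"))
  data.frame(
    stratum_id      = stratum_ids,
    estimates_exist = file.exists(files),
    stringsAsFactors = FALSE
  )
}

# example with a temporary directory holding one pre-computed stratum:
# the first stratum is flagged TRUE, the second FALSE
d <- tempdir()
file.create(file.path(d, "2002-secuestro.json"))
estimates_exist(c("2002-secuestro", "2003-secuestro"), d)
```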
verdata requires the LCMCR package as a dependency. Installing LCMCR requires the GNU Scientific Library; you may need to install this library on your computer separately before installing verdata.
This tells the user to install LCMCR and the GNU Scientific Library. This is not necessary, as LCMCR is available on CRAN and is correctly listed as a package Import. The suggestion may confuse new users.
Could you set up continuous integration, for example with GitHub Actions, to check that the tests and infrastructure are working correctly? The package {usethis} provides quick functionalities to get started. See here.
This issue is with reference to the JOSS review here openjournals/joss-reviews#5844
If replicates aren't in the folder where they should be, the function should alert the user that a replicate was not found.
From points raised in the beta test:
Currently confirm_files prints a message indicating the user has the right files, or an error if a file is incorrect. It would be more helpful for the function to output a data frame with two columns: file_path, indicating the path to the file being checked, and result, indicating whether the file has been verified.
It would also be helpful for the function to crash and list all of the broken files rather than stopping at the first broken file it encounters.
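A sketch of the suggested behaviour (not verdata's current API; the verification step here is just file.exists, where the real function would also check file contents):

```r
# check every file, collect results into a data frame, and only then
# warn, listing all files that failed rather than stopping at the first
confirm_files_sketch <- function(file_paths) {
  results <- data.frame(
    file_path = file_paths,
    result    = file.exists(file_paths),
    stringsAsFactors = FALSE
  )
  if (any(!results$result)) {
    warning("files failed verification: ",
            paste(results$file_path[!results$result], collapse = ", "))
  }
  results
}

paths <- c(tempfile(), system.file("DESCRIPTION", package = "base"))
res <- confirm_files_sketch(paths)  # warns about the missing file
res
```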
This package requires data to be downloaded from a government website. Unfortunately, it is not clear or straightforward how to download the data.
It would be very helpful if there were a vignette or short guide on how to download this data. The link provided in the README is the dataset description, not the path to the downloads themselves.
Alternatively, it would be quite nice if the package provided a utility to download the data itself. For example, I was able to download the secuestro data like so:
```r
# create a temporary directory
tmp <- tempdir()

# create path to save to
ver_path <- file.path(tmp, "verdata.zip")

# download the file
download.file(
  "https://microdatos.dane.gov.co/index.php/catalog/795/download/22642",
  ver_path
)

# unzip it
unzip(ver_path, exdir = file.path(tmp, "verdata"))

# read it
verdata::read_replicates(
  file.path(tmp, "verdata", "Secuestro.csv"),
  "secuestro",
  1
)
```
Ideally, we would be able to produce nicely formatted documentation in both Spanish and English for all aspects of the code and have it organized in an easy-to-access way for all users.
I've asked around the R-Ladies Slack for best practices for multilingual documentation, and it was suggested that I take a look at this proof of concept: https://github.com/eliocamp/translated
See suggestion in #20
Currently test-estimate_mse.R, test-lookup_estimates.R, and test-read_replicates.R are giving testthat warnings.
The examples provided here in a separate repo (https://github.com/HRDAG/verdata-examples) should be properly linked to this repository. One way to do that would be through package vignettes [See the vignettes section of the R Packages book]. To take this even further, it might be better to set up a website for this repository where the vignettes can be accessed as Articles. See the website chapter of the R Packages book.
I see there is a related issue here HRDAG/verdata-examples#2 to convert the examples to a human-readable format. I will leave this to the authors' discretion.
This issue is with reference to the JOSS review here openjournals/joss-reviews#5844
You are setting options(warn = -1) in your functions. This is not allowed; please use suppressWarnings() instead if really needed.
-> R/estimate_mse.R; tests/testthat/test-combine_estimates.R
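The difference is scope: options(warn = -1) silences warnings for the whole session until reset, while suppressWarnings() silences only the expression it wraps. A minimal illustration:

```r
noisy <- function() {
  warning("about to return NA")
  as.numeric("not a number")  # also warns: NAs introduced by coercion
}

# suppression is limited to this one call; global warning state is untouched
x <- suppressWarnings(noisy())
is.na(x)
#> [1] TRUE
```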
Many functions refer to the same or similar inputs or arguments, and there should be consistency in naming across all function files. In my original review of the documentation I noticed the use of both violation and violacion as argument names; there may be other instances where argument names (and the associated documentation) can be harmonized.
Many functions take a violation type as an input, but most do not do any sort of checking to ensure that the violation type matches one of the four acceptable options.
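A standard way to add this check is match.arg(), which errors with a message listing the acceptable options. The four violation types below are the ones covered by the project; the wrapper name and argument are illustrative:

```r
# illustrative input check; the real verdata functions would call this
# (or match.arg directly) at the top of each function taking a violation
check_violation <- function(violation = c("homicidio", "secuestro",
                                          "reclutamiento", "desaparicion")) {
  match.arg(violation)
}

check_violation("secuestro")
#> [1] "secuestro"

# check_violation("typo") errors:
# 'arg' should be one of "homicidio", "secuestro", "reclutamiento", "desaparicion"
```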
Add license file to .Rbuildignore
In my opinion, the second sentence onwards might be better placed in the README as it seems to get lost in the package DESCRIPTION file.
This issue is in relation to the JOSS review here openjournals/joss-reviews#5844
Not all functions provide explicit checks for the correctness of inputs, or helpful error messages when inputs take incorrect forms. There should be explicit input checking for each function in verdata, with associated testthat tests.