statistikat / statcuber

R interface for the STATcube REST API and data.statistik.gv.at

Home Page: https://statistikat.github.io/STATcubeR/

License: GNU General Public License v2.0

R 93.89% CSS 2.26% JavaScript 3.84%
r api database open-data ogd sdmx

statcuber's People

Contributors: alexkowa, bernhard-da, gregordecillia

Forkers: matmo

statcuber's Issues

Time variables with `{prefix}-YYYY-Q`

The cube sc_table_saved("str:table:defaulttable_delufapi004") (external info page) uses time codes of the form {prefix}-YYYY-Q which are currently not parsed correctly into a date format because the parser expects {prefix}-YYYYQ.

x <- sc_table_saved("str:table:defaulttable_delufapi004")
x$field()
# STATcubeR metadata: 6 x 3
  code          label           parsed         
  <chr>         <chr>           <chr>          
1 APIQ10-2020-1 1. quarter 2020 1. quarter 2020
2 APIQ10-2020-2 2. quarter 2020 2. quarter 2020
3 APIQ10-2020-3 3. quarter 2020 3. quarter 2020
4 APIQ10-2020-4 4. quarter 2020 4. quarter 2020
5 APIQ10-2020-5 annual 2020     annual 2020    
6 APIQ10-2021-1 1. quarter 2021 1. quarter 2021

With a proper update of the parser, the parsed column should be of type <Date>. We should assume that the client skips the "annual" values by providing certain recodes in the JSON. The parser should therefore be able to work with strings of the form

c("APIQ10-2020-1", "APIQ10-2020-2", "APIQ10-2020-3", "APIQ10-2020-4", "APIQ10-2021-1")
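
A possible parser sketch (the helper name is hypothetical, and it assumes the "annual" codes were already recoded away as described above):

```r
# Hypothetical sketch: turn codes of the form {prefix}-YYYY-Q into the first
# day of the quarter, assuming the "annual" level (Q == 5) was recoded away.
parse_quarter_codes <- function(codes) {
  year    <- as.integer(sub(".*-(\\d{4})-\\d$", "\\1", codes))
  quarter <- as.integer(sub(".*-(\\d)$", "\\1", codes))
  as.Date(sprintf("%d-%02d-01", year, (quarter - 1L) * 3L + 1L))
}
parse_quarter_codes(c("APIQ10-2020-1", "APIQ10-2020-4"))
#> [1] "2020-01-01" "2020-10-01"
```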

Time type: Week

As mentioned in #11 (more precisely: here), it would be useful to add a new "Week" type for time variables. The dataset od_table("OGD_gest_kalwo_alter_GEST_KALWOCHE_5J_100") uses codes of the form {prefix}-YYYYWW for the time variable, where WW is the calendar week. This is very similar to the {prefix}-YYYYMM notation for months, which is why the following table fails to parse the time variable correctly.

x <- od_table("OGD_gest_kalwo_alter_GEST_KALWOCHE_5J_100")
x$raw$extras$metadata_modified
#> [1] "2021-07-22T09:02:59"
x$meta$fields[, c(1, 3, 4, 5)]
#>   code         label_en                           nitems type        
#> 1 C-KALWOCHE-0 Calendar week                        1121 Time (month)
#> 2 C-B00-0      Province (NUTS 2 unit) of deceased      9 Category    
#> 3 C-ALTER5-0   5 years age group of deceased          20 Category    
#> 4 C-C11-0      Gender of deceased                      2 Category   
x$field("Calendar week")[11:14, c(1, 3, 4)]
#>   code        label_en                                                  parsed    
#> 2 KALW-200011 11. Calendar week 2000 (week from 13.3.2000 to 19.3.2000) 2000-11-01
#> 3 KALW-200012 12. Calendar week 2000 (week from 20.3.2000 to 26.3.2000) 2000-12-01
#> 4 KALW-200013 13. Calendar week 2000 (week from 27.3.2000 to 2.4.2000)  NA        
#> 5 KALW-200014 14. Calendar week 2000 (week from 3.4.2000 to 9.4.2000)   NA        

Implementing this will require modifications in sc_field_parse_time() and also in sc_fields_type().
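
One way the type detection could tell weeks from months is sketched below (the function name and logic are assumptions, not the package's actual implementation): month suffixes never exceed 12, while week suffixes go up to 53.

```r
# Sketch: codes end in YYYYMM (months) or YYYYWW (weeks); a two-digit
# suffix above 12 can only be a calendar week.
guess_time_type <- function(codes) {
  suffix <- as.integer(substr(codes, nchar(codes) - 1L, nchar(codes)))
  if (max(suffix, na.rm = TRUE) > 12L) "Time (week)" else "Time (month)"
}
guess_time_type(c("KALW-200011", "KALW-200052"))
#> [1] "Time (week)"
```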

Allow table rendering

Add an option to render the tables in a non-tidy format that is closer to what the STATcube GUI shows. Make sure everything looks right even if several variables are used as columns. This requires a rendering engine that supports cell merging. Options:

Example table with several variables in rows and columns:

[screenshot: example table with several variables in rows and columns]

naming things

  • Make sure all function arguments use a consistent naming scheme.
  • Always use URIs when IDs are used as function arguments.
  • Namespace functions for specific endpoints by their endpoint (sc_table_saved_list())
  • Decide whether certain logic should only be accessible as an R6 method rather than a regular function

Allow recoding of `sc_data` objects

Add functionality that allows modifications of labels and similar metadata. Define a new R6 class and put an instance into x$recode.

changes to public methods

x <- od_table()

# set labels for fields
x$recode$label(code_field, new_label, language)
# set labels for measures
x$recode$label(code_measure, new_label, language)
# set labels of levels
x$recode$level(code_field, code_level, new_label, language)

# set total codes similar to x$total_codes() but for programmatic usage
x$recode$total_code(code_field, code_level)
# codes_levels is a permutation of x$field(code_field)$code
x$recode$order(code_field, codes_levels)
# define which levels are included in $tabulate(). 
x$recode$visible(code_field, code_level, visible = TRUE)

# undo all recodes
x$recode$reset()

Implementation

All modifications should directly overwrite x$meta and x$field(i). The functionality should be bilingual, i.e. it should be possible to define German and English labels.

# in initialize()
private$recoder <- recoder_class$new(self, private)
# in active bindings
recode = function(value) {
  private$recoder
}

It would probably be useful to add extra columns visible and order in x$field(i) to store this part of the "recode state". We could also just store them in private$p_fields[[i]] and omit them in the active field:

active = list(field = function(i) {
  fld <- private$p_fields[[i]]
  fld[, setdiff(names(fld), c("order", "visible"))]
})

pluralization

We could also add "pluralized" versions that implement the recodes such as

x$recode$labels(code_measures, new_labels, language)
x$recode$levels(code_field, code_levels, new_labels, language)
# ...

cache invalidation

Add support for caching of /schema responses via the ETag header. More generally, document and export the caching behavior of the API responses.
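
A conditional-request sketch with {httr} (the function name and the cache shape are assumptions, not the package's actual API):

```r
# Sketch of ETag-based validation: resend the stored ETag and keep the
# cached body if the server answers 304 Not Modified.
get_schema_cached <- function(url, cache = new.env()) {
  config <- if (!is.null(cache$etag))
    httr::add_headers(`If-None-Match` = cache$etag)
  resp <- httr::GET(url, config)
  if (httr::status_code(resp) == 304L)
    return(cache$body)                      # cache entry still valid
  cache$etag <- httr::headers(resp)$etag    # remember the new ETag
  cache$body <- httr::content(resp)
  cache$body
}
```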

Color palettes in print outputs

Issue

Currently, the colors used in the print methods of STATcubeR only really work with dark editor themes. This is why there are setup-scripts like these to make the pkgdown-docs look nice despite having a light theme.

STATcubeR/R/zzz.R

Lines 15 to 17 in 4537d3e

options(cli.theme = list(
".field" = list("color" = "#0d0d73"),
".code" = list("color" = "blue"),

STATcubeR.schema_colors = list(
"FOLDER" = "#4400cc", "DATABASE" = "#186868", "TABLE" = "#624918",
"GROUP" = "#4400cc", "FIELD" = "cyan", "VALUESET" = "cadetblue",

Challenge

Since a substantial share of R users use light editor themes, make sure that a freshly installed version of STATcubeR works with both light and dark editors. Additionally, keep the current color palettes as a "dark theme" and add some way to switch between the default theme and the dark theme. Simplify the pkgdown setup by just using the new default theme.

Implementation

In order to make the theming system powerful enough to include all current "theme adaptations" for pkgdown, it is necessary to provide

  • color palettes for schema types
  • color palettes for annotations (#39)
  • override some {cli} options. (possibly a bad idea, TBD)

There is already some prototyping which uses theme-definitions in inst/themes/{theme}.json with the following structure.

{
  "description": "default theme for STATcubeR",
  "schema": {"FOLDER": "#4400cc", "DATABASE": "#186868", "TABLE": "#624918", "...": "..."},
  "annotations": ["#4400cc", "#186868", "#624918", "..."],
  "cli": {".field": {"color": "#0d0d73"}, "...": "..."}
}
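
Loading such a theme file could be as simple as the following sketch (sc_theme_load() is a hypothetical name):

```r
# Read inst/themes/{theme}.json from the installed package.
sc_theme_load <- function(theme = "default") {
  path <- system.file("themes", paste0(theme, ".json"), package = "STATcubeR")
  jsonlite::fromJSON(path)
}
```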

Defaults

It would be possible to autodetect whether a light or dark mode is appropriate via rstudioapi::getThemeInfo(). But this would only be applicable for RStudio users. It is probably better to provide a neutral theme which works in dark and light editors as the default, and make optimized themes for dark and light mode opt-in.

Unit-Testing with `{httptest}`

There is already a first attempt to include unit tests for the STATcube API using {httptest} in #40. The basic idea is to have a way to test the parsers and print methods for sc_table() and friends when submitting the package to CRAN.

One important question here is which cubes/databases should be used in the tests. One recommendation is the "Gemeindedaten (Demo)" database. However, in order to maximize code coverage, some databases with annotations and missing values would be required. Unfortunately, the "Gemeindedaten (Demo)" database only provides missings/annotations of the kind "X: cross tabulation not allowed". Another useful thing would be to have different types of time variables (half year, month, week, quarter, year).

Candidate databases

  • Foreign Trade includes the annotations "T: Total Suppression" and "G: Disclosure control".
  • LFS includes the annotations "S: sampling error" and "N: value does not make sense", which are special annotations with underlying values.
  • This tourism database contains several types of time variables with a hierarchy.
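
The basic test setup could look like this sketch ("gemeindedaten.json" is a hypothetical example file, not necessarily one that ships with the package):

```r
# Minimal {httptest} pattern: fixtures are recorded once with
# capture_requests() and replayed offline inside with_mock_api(),
# so the tests run on CRAN without an API key.
library(httptest)
with_mock_api({
  test_that("sc_table() parses the demo cube", {
    x <- sc_table(sc_example("gemeindedaten.json"))
    expect_s3_class(as.data.frame(x), "data.frame")
  })
})
```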

use available resources in examples

  • Make sure all code used in the documentation (examples and articles) can be run with any API key.
  • Also check whether the JSON requests in inst/json_examples are restricted to certain user groups and replace them accordingly.
  • The function sc_example() should be extended so a list of all available examples can be displayed. Either do this with a function like sc_examples_list() or display an error message listing the available examples if sc_example() is called with an invalid argument.

Handle annotations other than "X"

Currently, as.data.frame() inserts NA values whenever the annotation "X" is applied to a cell value.

annotations <- get_annotations(x, i)
if (recode_na)
values[annotations == "X"] <- NA

Figure out if this makes sense for other annotations and handle those cases in as.data.frame() accordingly.
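
One possible generalization, sketched here with an assumed new argument (na_annotations is not an existing parameter):

```r
# Sketch: generalize the hard-coded "X" to a configurable set of
# annotations that should be treated as missing values.
recode_annotated <- function(values, annotations, na_annotations = "X") {
  values[annotations %in% na_annotations] <- NA
  values
}
recode_annotated(c(1, 2, 3), c("", "X", "T"), na_annotations = c("X", "T"))
#> [1]  1 NA NA
```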

Use development version of pkgdown

Since the pkgdown website of STATcubeR uses bootstrap 5, there are currently some issues related to r-lib/pkgdown#2207

  • The TOC in the sidebar is not rendered
  • search functionality is broken
  • "copy to clipboard" links in code chunks are not available

This should be resolved if the website is rebuilt with the development version of pkgdown

Did anybody try to request the data via Python?

So far, all I am able to do is produce "JSONDecodeError"s: "Expecting value: line 1 column 1 (char 0)"

I tried something like this:

import requests

api_url = "https://statcubeapi.statistik.at/statistik.at/ext/statcube/rest/v1e/table"

api_key = "<quitesomekeyhere>"

headers = {'APIKey': api_key, "Content-Type": "application/json"}

query = {
  "database" : "str:database:debevstprog",
  "measures" : [ "str:statfn:debevstprog:F-BEVSTPROG:F-S25V1:SUM", "str:statfn:debevstprog:F-BEVSTPROG:F-S25V2:SUM", "str:statfn:debevstprog:F-BEVSTPROG:F-S25V3:SUM", "str:statfn:debevstprog:F-BEVSTPROG:F-S25V4:SUM", "str:statfn:debevstprog:F-BEVSTPROG:F-S25V5:SUM", "str:statfn:debevstprog:F-BEVSTPROG:F-S25V6:SUM", "str:statfn:debevstprog:F-BEVSTPROG:F-S25V7:SUM", "str:statfn:debevstprog:F-BEVSTPROG:F-S25V8:SUM", "str:statfn:debevstprog:F-BEVSTPROG:F-S25V9:SUM", "str:statfn:debevstprog:F-BEVSTPROG:F-S25V10:SUM" ],
  "recodes" : {
    "str:field:debevstprog:F-BEVSTPROG:C-C11-0" : {
      "map" : [ [ "str:value:debevstprog:F-BEVSTPROG:C-C11-0:C-C11-0:C11-1" ], [ "str:value:debevstprog:F-BEVSTPROG:C-C11-0:C-C11-0:C11-2" ] ],
      "total" : False
    },
    "str:field:debevstprog:F-BEVSTPROG:C-A10-0" : {
      "map" : [ [ "str:value:debevstprog:F-BEVSTPROG:C-A10-0:C-A10-0:A10-2000" ], [ "str:value:debevstprog:F-BEVSTPROG:C-A10-0:C-A10-0:A10-2010" ], [ "str:value:debevstprog:F-BEVSTPROG:C-A10-0:C-A10-0:A10-2020" ], [ "str:value:debevstprog:F-BEVSTPROG:C-A10-0:C-A10-0:A10-2030" ] ],
      "total" : False
    }
  },
  "dimensions" : [ [ "str:field:debevstprog:F-BEVSTPROG:C-A10-0" ], [ "str:field:debevstprog:F-BEVSTPROG:C-C11-0" ] ]
}

# Note: the /table endpoint expects a POST request with a JSON body.
# A GET with query parameters returns a non-JSON error page, which is
# what produces "JSONDecodeError: Expecting value: line 1 column 1 (char 0)".
response = requests.post(api_url, headers=headers, json=query)
response_data = response.json()
print(response_data)

Also had to replace the "false" in the JSON query with "False", as Python spells its boolean literals True/False (they are serialized back to true/false when the dict is encoded as JSON). I am not very familiar with REST APIs; can anybody offer some insight into how this might work?

Avoid duplicate codes because of recodes

The following json file is not handled correctly by sc_table()

{
  "database" : "str:database:deenenea",
  "measures" : [ "str:statfn:deenenea:F-DATA:F-EBIL:SUM" ],
  "recodes" : {
    "str:field:deenenea:F-DATA:C-VERWEND0-0" : {
      "map" : [ 
        [ 
          "str:value:deenenea:F-DATA:C-VERWEND0-0:C-VERWEND0-0:VERWEND0-1", 
          "str:value:deenenea:F-DATA:C-VERWEND0-0:C-VERWEND0-0:VERWEND0-2" 
        ], 
        [ "str:value:deenenea:F-DATA:C-VERWEND0-0:C-VERWEND0-0:VERWEND0-1" ]
      ]
    }
  },
  "dimensions" : [ [ "str:field:deenenea:F-DATA:C-VERWEND0-0" ] ]
}

It results in duplicate codes for the field C-VERWEND0-0, which causes all kinds of issues with $tabulate() because of implicit assumptions.

sc_table('test.json')$field("C-VERWEND0-0")
#> # STATcubeR metadata: 3 x 7
#>   code       label                   parsed                 
#>   <chr>      <chr>                   <chr>                  
#> 1 VERWEND0-1 Space and water heating Space and water heating
#> 2 VERWEND0-1 Space and water heating Space and water heating
#> 3 SC_TOTAL   Total                   Total                  
#> # … with 4 more columns: 'label_de', 'label_en', 'visible', 'order'

The reason is that the map field in the JSON contains several URIs and only the first URI is used to generate the code column in $field(). It should be ensured that unique codes are generated in this case, possibly by concatenating the codes of the individual URIs. A fixed version might create a field definition like this

sc_table('test.json')$field("C-VERWEND0-0")
#> # STATcubeR metadata: 3 x 7
#>   code                  label                   parsed                 
#>   <chr>                 <chr>                   <chr>                  
#> 1 VERWEND0-1;VERWEND0-2 Space and water heating Space and water heating
#> 2 VERWEND0-1            Space and water heating Space and water heating
#> 3 SC_TOTAL              Total                   Total                  
#> # … with 4 more columns: 'label_de', 'label_en', 'visible', 'order'

Time variables should be converted to type category in this case, with a warning. Labels could also be concatenated. However, this would lead to very long labels, which might not be ideal.
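
A sketch of the proposed code generation (the helper name is hypothetical):

```r
# Build a unique code for a recode group by concatenating the last
# component of each URI in the group.
code_from_map <- function(uris) {
  last <- vapply(uris, function(u) {
    parts <- strsplit(u, ":", fixed = TRUE)[[1]]
    parts[length(parts)]
  }, character(1))
  paste(last, collapse = ";")
}
code_from_map(c(
  "str:value:deenenea:F-DATA:C-VERWEND0-0:C-VERWEND0-0:VERWEND0-1",
  "str:value:deenenea:F-DATA:C-VERWEND0-0:C-VERWEND0-0:VERWEND0-2"
))
#> [1] "VERWEND0-1;VERWEND0-2"
```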

Add sc_table_custom()

Implement a function that takes ids for a database, measures and fields and sends a json request. Here is a snippet on how to do that manually at the moment

# pick a dataset
db_id <- "detouextregsai"
db_schema <- sc_schema_db(db_id)
db_uid <- paste0("str:database:", db_id)

# browse the schema to obtain resource ids
id_arrivals <- db_schema$Facts$Arrivals$Arrivals$id
id_time <- db_schema$`Mandatory fields`$`Season/Tourism Month`$`Season/Tourism Month`$id

# get the response
json_list <- list(database = db_uid, measures = list(id_arrivals), dimensions = list(list(id_time)))
response <- httr::POST(
  url = paste0(STATcubeR:::base_url, "/table"),
  body = jsonlite::toJSON(json_list, auto_unbox = TRUE),
  encode = "raw",
  config = httr::add_headers(APIKey = sc_key())
)

# convert to class sc_table
my_table <- STATcubeR:::sc_table_class$new(response)

Transform this snippet into a function

sc_table_custom(database_id, measures, fields)
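
A direct wrapper of the snippet above could look like this sketch; the argument handling (e.g. prepending "str:database:") is an assumption, and error handling is omitted.

```r
# Sketch: build the JSON body from ids and send it to /table.
sc_table_custom <- function(database_id, measures, fields) {
  json_list <- list(
    database   = paste0("str:database:", database_id),
    measures   = as.list(measures),
    dimensions = lapply(fields, list)
  )
  response <- httr::POST(
    url = paste0(STATcubeR:::base_url, "/table"),
    body = jsonlite::toJSON(json_list, auto_unbox = TRUE),
    encode = "raw",
    config = httr::add_headers(APIKey = sc_key())
  )
  STATcubeR:::sc_table_class$new(response)
}
```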

Allow users to switch between labels and codes

The STATcube API contains codes and labels for all variables. Currently, as.data.frame() always uses labels for column names and field entries. Make this behavior optional so users can also work with codes.

The conversion between codes and labels is pretty straightforward when sc_meta() or sc_meta_field() is used because those functions can be used as "translators".

json_path <- sc_example("bev_seit_1982.json")
my_response <- sc_get_response(json_path)

sc_meta(my_response)
## $database
##                                 label         code
## 1 Bevölkerung zu Jahresbeginn ab 1982 debevstandjb
## 
## $measures
##      label     code fun precision
## 1 Fallzahl F-ISIS-1 SUM         0
## 
## $fields
##         label        code nitems
## 1        Jahr     C-A10-0     40
## 2  Bundesland    C-BB00-0     11
## 3 Geburtsland C-GEBLAND-0      3

sc_meta_field(my_response, 2)
##                        label code       type
## 1          Burgenland <AT11>    1 RecodeItem
## 2             Kärnten <AT21>    2 RecodeItem
## 3    Niederösterreich <AT12>    3 RecodeItem
## 4      Oberösterreich <AT31>    4 RecodeItem
## 5            Salzburg <AT32>    5 RecodeItem
## 6          Steiermark <AT22>    6 RecodeItem
## 7               Tirol <AT33>    7 RecodeItem
## 8          Vorarlberg <AT34>    8 RecodeItem
## 9                Wien <AT13>    9 RecodeItem
## 10 Nicht klassifizierbar <0>    0 RecodeItem
## 11                  Zusammen           Total
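
A translator built on sc_meta_field() could be sketched as follows (the helper name and signature are hypothetical):

```r
# Sketch: translate field-level codes into labels via the metadata table
# returned by sc_meta_field().
codes_to_labels <- function(response, field, codes) {
  meta <- sc_meta_field(response, field)
  meta$label[match(codes, meta$code)]
}
```

The inverse direction (labels to codes) works the same way with the columns swapped.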

Error when downloading specific dataset using od_table()

Describe the bug
od_table() throws an error when trying to download dataset OGD__steuer_lst_ab_2008_4_LST_4 , whereas all other available datasets discovered with od_list() worked.

To Reproduce

od_table("OGD__steuer_lst_ab_2008_4_LST_4")
Error in `$<-.data.frame`(`*tmp*`, "parsed", value = NA_character_) : 
  replacement has 1 row, data has 0

Expected behavior
I expected to get the corresponding R6-class object, e.g.

od_table("OGD__steuer_lst_ab_2008_2_LST_2")
Wage Tax Statistics from 2008: Extent of Employment, Sex and Economic Activities

Dataset: OGD__steuer_lst_ab_2008_2_LST_2 (data.statistik.gv.at)
Measures: Number of entities (= persons) subject to wage tax, Gross total income (EUR), Other income according to §67 par. 1-2 (EUR), Entity count: Other
  income according to §67 par. 1-2, Other income according to §67 par. 3-8 with fixed tax rate (EUR), Entity count: Other income according to §67 par. 3-8 with
  fixed tax rate, NTSONST (nach Tarif versteuerte sonstige Bezüge) (EUR), Z_NTSONST (Fallzahl NTSONST), LFBEZ (laufende Bezüge inkl. KZ220) (EUR), Z_LFBEZ
  (Fallzahl LFBEZ), … (48 more)
Fields: Year <15>, Gender <2> <2>, Duration of income <2> <2>, ÖNACE 2008 Abteilungen (2-Steller) <89> [teilw. ABO] (Ebene +1) <22>

Request: [2024-02-20 14:09:45.180562]
STATcubeR: 0.5.0 (@4537d3e)

Environment

> sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.0 (2023-04-21)
 os       Debian GNU/Linux 10 (buster)
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  de_AT.UTF-8
 ctype    de_AT.UTF-8
 tz       Europe/Vienna
 date     2024-02-20
 rstudio  2023.12.1+402 Ocean Storm (server)
 pandoc   NA
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version  date (UTC) lib source
 arrow       * 13.0.0.1 2023-09-22 [1] CRAN (R 4.3.0)
 assertthat    0.2.1    2019-03-21 [2] CRAN (R 4.0.3)
 bit           4.0.5    2022-11-15 [1] CRAN (R 4.3.0)
 bit64         4.0.5    2020-08-30 [2] CRAN (R 4.0.3)
 cli           3.6.2    2023-12-11 [1] CRAN (R 4.3.0)
 colorspace    2.1-0    2023-01-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2    2022-09-29 [1] CRAN (R 4.3.0)
 curl          5.2.0    2023-12-08 [1] CRAN (R 4.3.0)
 data.table  * 1.15.0   2024-01-30 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.4    2023-11-17 [1] CRAN (R 4.3.0)
 fansi         1.0.6    2023-12-08 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0    2023-01-29 [1] CRAN (R 4.3.0)
 fs            1.6.3    2023-07-20 [1] CRAN (R 4.3.0)
 generics      0.1.3    2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.4    2023-10-12 [1] CRAN (R 4.3.0)
 glue          1.7.0    2024-01-09 [1] CRAN (R 4.3.0)
 gtable        0.3.4    2023-08-21 [1] CRAN (R 4.3.0)
 here          1.0.1    2020-12-13 [2] CRAN (R 4.2.1)
 hms           1.1.3    2023-03-21 [1] CRAN (R 4.3.0)
 httr          1.4.7    2023-08-15 [1] CRAN (R 4.3.0)
 jsonlite      1.8.8    2023-12-04 [1] CRAN (R 4.3.0)
 lifecycle     1.0.4    2023-11-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.3    2023-09-27 [1] CRAN (R 4.3.0)
 magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0    2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0    2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3    2019-09-22 [2] CRAN (R 4.0.3)
 pkgload       1.3.4    2024-01-16 [1] CRAN (R 4.3.0)
 purrr       * 1.0.2    2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1    2021-08-19 [1] CRAN (R 4.3.0)
 readr       * 2.1.5    2024-01-10 [1] CRAN (R 4.3.0)
 remotes       2.4.2.1  2023-07-18 [1] CRAN (R 4.3.0)
 rlang         1.1.3    2024-01-10 [1] CRAN (R 4.3.0)
 rprojroot     2.0.4    2023-11-05 [1] CRAN (R 4.3.0)
 rstudioapi    0.15.0   2023-07-07 [1] CRAN (R 4.3.0)
 rvest       * 1.0.3    2022-08-19 [2] CRAN (R 4.2.1)
 scales        1.3.0    2023-11-28 [1] CRAN (R 4.3.0)
 selectr       0.4-2    2019-11-20 [2] CRAN (R 4.2.1)
 sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.3.0)
 STATcubeR   * 0.5.0    2023-06-12 [1] Github (statistikat/STATcubeR@4537d3e)
 stringi       1.8.3    2023-12-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.1    2023-11-14 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1    2023-03-20 [1] CRAN (R 4.3.0)
 tidyjson      0.3.2    2023-01-07 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.1    2024-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0    2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.3.0    2024-01-18 [1] CRAN (R 4.3.0)
 tzdb          0.4.0    2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.4    2023-10-22 [1] CRAN (R 4.3.0)
 vctrs         0.6.5    2023-12-01 [1] CRAN (R 4.3.0)
 withr         3.0.0    2024-01-16 [1] CRAN (R 4.3.0)
 xml2          1.3.6    2023-12-04 [1] CRAN (R 4.3.0)

 [1] /home/zenz/R/x86_64-pc-linux-gnu-library/4.3
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

build pkgdown via travis

  • Add a travis project and push the rendered pkgdown resources into a gh-pages branch as in surveysd
  • Update the links in the README to point to this new location

special handling of national accounts cubes

National accounts cubes such as sc_example("foreign_trade") do not provide values for total codes.

"database" : "str:database:denatec06",

Therefore, they should be aggregated directly in $tabulate() because otherwise the result would be a table filled with NAs in all measure columns.

sc_example("foreign_trade") %>%
  sc_table() %$%
  tabulate("Reference year")
# A STATcubeR tibble: 11 x 5
   `Reference year` `Import, number… `Import, value … `Export, number… `Export, value …
 * <date>                      <dbl>            <dbl>            <dbl>            <dbl>
 1 2008-01-01                     NA               NA               NA               NA
 2 2009-01-01                     NA               NA               NA               NA
 3 2010-01-01                     NA               NA               NA               NA
 4 2011-01-01                     NA               NA               NA               NA
 5 2012-01-01                     NA               NA               NA               NA
 6 2013-01-01                     NA               NA               NA               NA
 7 2014-01-01                     NA               NA               NA               NA
 8 2015-01-01                     NA               NA               NA               NA
 9 2016-01-01                     NA               NA               NA               NA
10 2017-01-01                     NA               NA               NA               NA
11 2018-01-01                     NA               NA               NA               NA

In one of our internal projects, we currently use the condition

"T" %in% table$annotation_legend$annotation

to determine whether a direct aggregation via rowsum() should be applied.
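
A possible aggregation step, assuming df is the tidy output of as.data.frame() and the column names are known (all names here are illustrative):

```r
# Sketch: when the cube ships no totals, aggregate the measure columns
# over the grouping field directly instead of relying on total codes.
aggregate_measures <- function(df, by, measures) {
  stats::aggregate(df[measures], by = df[by], FUN = sum, na.rm = TRUE)
}
```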

Update URLs for the upcoming API release

  • Make sure the new base url is used by default
  • Update links to the API reference documentation to always use version 9.12
  • Update links in sc_browse() and sc_browse_preferences()
  • Update the (deep-)links in the $edit() method to
    • use the external server by default
    • use one of our internal servers in case the response was not generated by the external server
  • Update links to the STATcube GUI in the docs

Add support for SDMX

It would be very useful if {STATcubeR} could support "SDMX archives" which are generated from STATcube. SDMX archives consist of a metadata component called the "structure definition" and a data part which contains the actual cell values. In order to support this, we would need to add parsers for the XML-based data format.

The generated archives are more or less compatible with the CRAN package rsdmx: https://cran.r-project.org/package=rsdmx, which could be used as a starting point to develop parsers.

Possible usage: a parser function sdmx_table() which generates an object of class sc_data (the parent class for OGD and STATcube API datasets)

x <- STATcubeR::sdmx_table("path/to/sdmx_archive.zip")
class(x)
#> [1] "sdmx_table" "sc_data"    "R6"

There are several advantages of the sdmx format compared to the API

  • The structure definition contains information about hierarchical classifications, which are not available via the API
  • The download option "sdmx archive" is available even if STATcube is used as a "guest user"
  • SDMX is used by other SuperSTAR products such as SuperCROSS

The last point is probably the most compelling one, since a direct interface to SuperCROSS would be very helpful for the internal workflows of Statistics Austria.

Make pkgdown articles available as vignettes

It might be a nice addition to make the current pkgdown articles (or some of them) also available as vignettes so that they can be used offline. Currently, the articles use some customizations (in vignettes/R/) that might make this not 100% straightforward.

Presumably, the tooltips should be disabled in the offline version because tippy.js is currently loaded via a CDN. Dependencies on {fansi} should also be avoided and possibly replaced with cli::ansi_html().

Document and export caching of API responses

For quite some time there has been a hidden feature that allows caching of API responses from the STATcube REST API. This is very useful for our internal web application and we will have to decide how we deal with this in the upcoming CRAN release.

Leaving it as a hidden feature might create a bad impression during reviews. Removing it would make it necessary to implement the caching logic elsewhere, which might be tricky. Therefore, it is probably best to document and export the behavior. Documentation is already available in ?sc_cache.

One problem with the current implementation is that the hashes are created via serialize() and therefore they are not reusable in different R versions. It would be very handy to use something like digest::digest() but adding another dependency package just for the hashes seems unnecessary. Maybe tools::md5sum() could be used in a different way to get a satisfying result.
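
One way tools::md5sum() could be used is sketched below; the helper name is hypothetical, and text encoding would need care in practice.

```r
# Sketch: hash the request body as a text file with base tools::md5sum()
# instead of hashing a serialize()d object, so the hash does not depend
# on the R version's serialization format.
sc_hash <- function(body_json) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(body_json, tmp)
  unname(tools::md5sum(tmp))
}
```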

TODOs

  • review reference documentation
  • think about creation of hashes
  • Add @export to make the caching available without environment variables
  • Remove @internal to include the man pages in the index page of the documentation
  • Maybe add something similar to od_cache_summary() which provides an overview about the cache contents. This would probably require some kind of cache_index.csv so we don't need to parse the cache entries.

Expose API parameter "accept-language"

Allow users to switch to English responses (as opposed to German, which is the server default) with a new parameter language in sc_get_response() and sc_saved_table().

Hopefully, this will only affect variable descriptions so the parsers (as.data.frame(), sc_meta(), ...) should not need any updates due to those changes.
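
The header construction could be as simple as this sketch (the helper name is hypothetical):

```r
# Sketch: forward a language argument as the Accept-Language header.
sc_headers <- function(language = c("de", "en")) {
  httr::add_headers(
    APIKey = sc_key(),
    `Accept-Language` = match.arg(language)
  )
}
```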

Add support for the editing server of data.statistik.at

Similar to #25 but for od_table(). Currently, the caches use something like ~/.cache/STATcubeR/open_data/{id}.csv which basically mimics the file format from the servers. We will need a second cache directory or disable caching for the editing server.

  • add server parameter to od_table() to switch between the external server and the editing server
  • support caching for both servers with separate caching directories
  • documentation
  • parameter checking

Error when using od_catalogue()

Describe the bug
Function od_catalogue() throws an error; the examples in the documentation don't work.

To Reproduce

catalogue <- od_catalogue()
Error in strsplit(., "?id=") : non-character argument

Expected behavior
Expected a data.frame containing metadata on the various datasets.

Environment

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.0 (2023-04-21)
 os       Debian GNU/Linux 10 (buster)
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  de_AT.UTF-8
 ctype    de_AT.UTF-8
 tz       Europe/Vienna
 date     2024-02-22
 rstudio  2023.12.1+402 Ocean Storm (server)
 pandoc   NA
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.0)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 data.table  * 1.15.0  2024-01-30 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.0)
 evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.0)
 fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.0)
 glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.0)
 gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.0)
 here          1.0.1   2020-12-13 [2] CRAN (R 4.2.1)
 highr         0.10    2022-12-22 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 knitr         1.45    2023-10-30 [1] CRAN (R 4.3.0)
 lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [2] CRAN (R 4.0.3)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.3.0)
 rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.0)
 rprojroot     2.0.4   2023-11-05 [1] CRAN (R 4.3.0)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
 scales        1.3.0   2023-11-28 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 STATcubeR   * 0.5.0   2023-06-12 [1] Github (statistikat/STATcubeR@4537d3e)
 stringi       1.8.3   2023-12-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.1   2024-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.3.0   2024-01-18 [1] CRAN (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.0)
 vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.0)
 withr         3.0.0   2024-01-16 [1] CRAN (R 4.3.0)
 xfun          0.41    2023-11-01 [1] CRAN (R 4.3.0)

 [1] /home/zenz/R/x86_64-pc-linux-gnu-library/4.3
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Automatically add total_codes for OGD Data

The classification files of OGD datasets contain an optional column "FK" (foreign key) which can point to the parent element of a classification element. This allows hierarchical classifications to be defined. Currently, the FK column is ignored by STATcubeR, but it could be used to automatically detect "total codes". Example:

Code    Name              FK
WEST    Label for West    TOTAL
EAST    Label for East    TOTAL
TOTAL   Label for Total

Here, there is a single classification element (code: TOTAL) and all other elements point to it via FK. In cases like this, it is reasonable to regard TOTAL as the total code for this classification.
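To make the idea concrete, a minimal base-R sketch of such a detection might look as follows. The function name detect_total_code() and the data frame layout are illustrative assumptions, not part of STATcubeR.

```r
# Hypothetical sketch: given a classification as a data frame with
# columns Code, Name and an optional FK column, return a candidate
# total code. A code qualifies if it has no parent itself and every
# other code points to it via FK.
detect_total_code <- function(classification) {
  fk <- classification$FK
  if (is.null(fk))
    return(NA_character_)
  roots <- classification$Code[is.na(fk) | fk == ""]
  if (length(roots) != 1)
    return(NA_character_)  # no unique root -> ambiguous hierarchy
  children <- classification$Code != roots
  if (all(fk[children] == roots))
    roots
  else
    NA_character_
}

cls <- data.frame(
  Code = c("WEST", "EAST", "TOTAL"),
  Name = c("Label for West", "Label for East", "Label for Total"),
  FK   = c("TOTAL", "TOTAL", ""),
  stringsAsFactors = FALSE
)
detect_total_code(cls)
```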

Add support for multiple REST API servers

There are now multiple internal STATcube API servers running inside our firewalls. Extend the package in order to

  • manage several keys in sc_key()
  • automatically detect the correct server based on naming conventions
  • adapt the generation of hashes so they vary depending on the server
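A minimal sketch of how the first two points could fit together. The function names, the option key and the "int" prefix convention are all invented for illustration and differ from the current single-server sc_key() helpers; for the third point, the server label would simply become part of the hash input.

```r
# store one API key per server (option name is an assumption)
sc_key_store <- function(server, key) {
  keys <- getOption("STATcubeR.keys", list())
  keys[[server]] <- key
  options(STATcubeR.keys = keys)
  invisible(key)
}

sc_key_fetch <- function(server = "ext") {
  getOption("STATcubeR.keys", list())[[server]]
}

# guess the server from the database id (naming convention assumed)
sc_server_detect <- function(db_id) {
  if (grepl("^str:database:int", db_id)) "int" else "ext"
}

sc_key_store("int", "secret")
sc_key_fetch("int")
sc_server_detect("str:database:detouextregsai")
```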

re-implement annotations

In early versions, STATcubeR used to include annotations in the output of as.data.frame.sc_table(). This was dropped when support for OGD datasets was introduced in #11. Back then, the annotations were included using separate columns.

It is planned to re-implement this feature in a slightly different manner using {tibble} and {vctrs} by providing a custom vector class that acts as an "annotated numeric". The result of printing those values should look something like this:

[screenshot: annotation codes shown in place of annotated zero values]

Annotations should either replace the values while printing or use color coding to reference a specific annotation:

[screenshot: color-coded annotated values]

The "annotation legend" (which color corresponds to which annotation) can then be included in the footer of the tibble. Some technical details:

  • In order to keep things backwards compatible, the default behavior of sc_tabulate() and as.data.frame.sc_table() should be to return simple tibbles that only include columns of type numeric and factor. Adding annotations should be "opt-in".
  • Annotated cell values containing a zero can usually be interpreted as not available. Therefore, it makes sense to show the annotation code instead of the zero value (first screenshot). For annotated non-zero values, the values will be color-coded based on the annotation (second screenshot).
  • The "annotated numeric" class used to represent the columns will have an as.numeric() method which drops the annotations and returns a canonical double-type vector.
  • Aggregating annotations will not be pursued. If sc_tabulate() is called in a way where aggregation via rowsums() is necessary and annotations is set to TRUE, an error will be thrown.
  • Color-coding values with multiple annotations will not be pursued. Instead, one of the annotations will be selected for the color.
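A base-R sketch of such a class follows. The actual plan is to build on {vctrs} and {tibble}; this only mimics the printing and as.numeric() behaviour, and all names are assumptions.

```r
# "annotated numeric": a double vector carrying one annotation code
# per cell as an attribute
new_sc_annotated <- function(value, annotation) {
  stopifnot(length(value) == length(annotation))
  structure(value, annotation = annotation, class = "sc_annotated")
}

# while printing, annotated zeros are replaced by their annotation code
format.sc_annotated <- function(x, ...) {
  value <- as.vector(unclass(x))
  ann <- attr(x, "annotation")
  out <- format(value, ...)
  hide <- value == 0 & ann != ""
  out[hide] <- ann[hide]
  out
}

print.sc_annotated <- function(x, ...) {
  print(format(x))
  invisible(x)
}

# dropping the annotations returns a canonical double vector
as.double.sc_annotated <- function(x, ...) as.vector(unclass(x))

x <- new_sc_annotated(c(1200, 0, 37.5), c("", "X", "W"))
format(x)
as.double(x)
```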

filters in sc_table_custom()

There have now been several requests to support filtering in sc_table_custom(). Currently, the only way to do this is to generate the request.json by hand.

Example
library(STATcubeR)

schema <- sc_schema_db("detouextregsai")
region <- schema$`Other Classifications`$
  `Tourism commune [ABO]`$`Regionale Gliederung (Ebene +1)`

request <- list(
  database = schema$id,
  dimensions = list(I(region$id)),
  recodes = setNames(
    list(list(
      map = list(
        I(region$Bregenzerwald$id),
        I(region$`Vorarlberg Rest`$id),
        I(region$`Bodensee-Vorarlberg`$id)
      )
    )),
    region$id
  )
)

jsonlite::write_json(request, "request.json", pretty = TRUE, auto_unbox = TRUE)
cat(readLines("request.json"), sep = "\n")
x <- sc_table("request.json", add_totals = FALSE)
x$tabulate()

It might be sensible to extend the functionality of sc_table_custom() to support filters (or possibly other recodes) via additional parameters. The syntax might look like this:

library(STATcubeR)

schema <- sc_schema_db("detouextregsai")
region <- schema$`Other Classifications`$
  `Tourism commune [ABO]`$`Regionale Gliederung (Ebene +1)`

sc_table_custom(
  schema,
  region,
  sc_recode(region, c(region$Bregenzerwald, 
      region$`Vorarlberg Rest`, region$`Bodensee-Vorarlberg`)) 
)
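Under that proposal, sc_recode() would only need to build the corresponding entry of the "recodes" object from the hand-written request above. A minimal sketch: sc_recode() does not exist yet, and the field/element objects below are hand-built stand-ins that merely expose an id like the schema objects do.

```r
# Hypothetical sc_recode(): turn a field and a list of classification
# elements into the "recodes" entry of the request body
sc_recode <- function(field, map) {
  ids <- vapply(map, function(x) x$id, character(1))
  # I() keeps each id as a JSON array of length one when the request
  # is serialized with jsonlite::write_json(..., auto_unbox = TRUE)
  setNames(list(list(map = lapply(ids, I))), field$id)
}

# illustration with stand-ins for schema objects (ids are made up)
field <- list(id = "C-ABO-2")
elements <- list(list(id = "C-ABO-2:01"), list(id = "C-ABO-2:02"))
sc_recode(field, elements)
```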
